Pandas Joining

Learning Objectives

After this lesson, you will be able to:

- Concatenate objects with pd.concat() (and the legacy DataFrame.append(), which was deprecated in pandas 1.4 and removed in 2.0).
- Combine objects with DataFrame.join() and pd.merge().
- Combine time series objects with pd.merge_ordered().

Traditionally, this functionality is performed in a relational database, queried with SQL.
With Pandas, you'll be able to perform the same operations in Python! Pandas is built on NumPy, a numerical array library, which helps keep things speedy.

We will start this lesson directly in the Jupyter Notebook, pandas-join.ipynb, to walk through the what, why, and how all at once.

Here we have slides reviewing the key concepts.

Joining is the process of combining one dataframe with another by matching rows on a shared key or index.
Traditionally, this would be done with SQL.
- SQL is the query language of relational databases, which are designed and optimized to store data across many related tables.

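Joining is important because it lets us store data in several small, related tables and recombine them on demand:
- It allows us to reduce the size of a database, since shared values are stored once instead of being repeated in every row.
- It can increase the speed at which data is queried and returned, since each table stays small and focused.
- It allows us to reduce the redundancy of the data stored in the database.

Joining is fundamental to proper data architecture, and we'll get to do it in Pandas!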

pd.merge(df1, df2, how='left', left_index=True, right_index=True, suffixes=('_df1', '_df2'))

index	letter_df1	number_df1	letter_df2	number_df2
0	a	1	e	5.0
1	b	2	f	6.0
2	c	3	NaN	NaN
3	d	4	NaN	NaN
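
The call above can be reproduced with a minimal sketch like the one below. The contents of df1 and df2 are assumptions, chosen only so that the result matches the table shown; the notebook may build them differently. Note that number_df2 comes back as floats because the unmatched rows are filled with NaN.

```python
import pandas as pd

# Hypothetical inputs, chosen to reproduce the table above.
df1 = pd.DataFrame({"letter": ["a", "b", "c", "d"], "number": [1, 2, 3, 4]})
df2 = pd.DataFrame({"letter": ["e", "f"], "number": [5, 6]})

# Left join on the row index: every row of df1 is kept, and rows of df2
# are attached where the index values match. Overlapping column names
# get the suffixes '_df1' and '_df2'.
merged = pd.merge(
    df1,
    df2,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_df1", "_df2"),
)

print(merged)
```

Index values 2 and 3 exist only in df1, so their df2 columns are NaN.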

SQL uses JOIN. Pandas has two semi-equivalent functions:
- DataFrame.join - joins dataframes on their indices by default (its on argument can match a column of the calling dataframe against the other dataframe's index)
- pd.merge - joins dataframes on any columns or indices you want
Since pd.merge is more powerful and generalizes better, we'll focus on pd.merge.
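
As a rough illustration of the difference, the sketch below performs the same left join both ways. The customers and orders tables and their column names are made up for this example and are not from the notebook.

```python
import pandas as pd

# Hypothetical tables for illustration only.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]}
)
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 1, 3], "total": [20.0, 35.5, 12.0]}
)

# pd.merge can match on a shared column directly.
by_column = pd.merge(orders, customers, on="customer_id", how="left")

# DataFrame.join matches against the other frame's index, so the key
# column of `customers` has to be moved into its index first.
by_index = orders.join(customers.set_index("customer_id"), on="customer_id")

print(by_column)
print(by_index)
```

Both calls attach the customer name to each order; the merge version simply needs less setup.
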
SQL uses UNION (or UNION ALL, which keeps duplicate rows, as Pandas does). Pandas, again, has two semi-equivalent functions:
- DataFrame.append - stacks dataframes on top of each other (deprecated in pandas 1.4 and removed in 2.0)
- pd.concat - stacks dataframes on top of or next to each other
Since pd.concat is more powerful and generalizes better, we'll focus on pd.concat.
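
A minimal sketch of both stacking directions, using two small made-up dataframes (the names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical inputs for illustration only.
df1 = pd.DataFrame({"letter": ["a", "b"], "number": [1, 2]})
df2 = pd.DataFrame({"letter": ["c", "d"], "number": [3, 4]})

# axis=0 stacks rows on top of each other; ignore_index=True renumbers
# the rows 0..n-1 instead of repeating each frame's original index.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places the frames next to each other, aligning rows on the index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

With axis=1 the column names are repeated in the result, so renaming columns beforehand (or using pd.merge with suffixes) is usually the cleaner choice when the names overlap.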