{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "

\n", " \n", "

\n", "
\n", " \n", "

Joining Table with Pandas

\n", "
\n", "
\n", "\n", "Pandas provides support for combining `Series`, `DataFrame` and even `xarray` (3 dimensional `DataFrame`s, formerly known in pandas v0.20.0 as `Panel`s) objects with various kinds of set logic for the indicies and relational algebra functionality in the case of join / merge-type operations. More simply stated, this allows you to combine `DataFrame`s.\n", "\n", "\n", "
\n", " Overview\n", " \n", "
\n", "\n", "\n", "
\n", " Table of Contents\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Import Pandas

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Concatenate and Append

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Concatenate

\n", "\n", "Concatenate sticks dataframes together, either on top of each other, or next to each other.\n", "\n", "```python\n", "Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)\n", "Docstring:\n", "Concatenate pandas objects along a particular axis with optional set logic\n", "along the other axes.\n", "```\n", "\n", "First, let's create two dataframes, `df1` and `df2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's stick the dataframes on top of each other using `concat`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's stick the dataframes next to each other using `concat`. Use of the `axis` kwarg will help us here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
\n", "

Append

\n", "\n", "Append is very similar to `concat`, except it limits itself to a specific case of `concat`, where `axis=0` (stack on top of each other) and `join=outer` (how to handle the axis of the second dataframe). For almost all cases, `concat` has all the functionality of `append` (and more) and can replace `append` entirely.\n", "\n", "```python\n", "Signature: pd.DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=None)\n", "Docstring:\n", "Append rows of `other` to the end of this frame, returning a new\n", "object. Columns not in this frame are added as new columns.\n", "```\n", "\n", "Also note that `append` is a DataFrame and Series method, and not a pandas library function like `concat` is." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Joining

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join

\n", "\n", "`join` allows us to compare two dataframes, and combine them by using a matching column known as a `key`. Normally, during joins, this key is explicitly stated (we'll get to this with `merge` in our next example). With `join`, the `key` joining the table is always the `index` of the first table with (by default) the index of the second table. \n", "\n", "```python\n", "Signature: pd.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n", "Docstring:\n", "Join columns with other DataFrame either on index or on a key\n", "column. Efficiently Join multiple DataFrame objects by index at once by\n", "passing a list.\n", "```\n", "\n", "First, let's create two dataframes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets `join` these two dataframes. Note that we will `key`, or 'line up', the two dataframes based on their `indicies`.\n", "\n", "Note that, when joining dataframes with any common column names, we will need to supply a `lsuffix` or `rsuffix` kwarg. This is appended to the end of the column name of the returned, joined dataframe to differentiate and identify the source column. Here, we'll use `_df1` to identify that the column shown came from the `df1` dataframe, and `_df2` as a suffix to identify its origin as the `df2` dataframe. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how we have joined the two dataframes on their indicies, which creates a null for rows of index 2 and 3 in `df2`. This is expected and correct.\n", "\n", "Also note that the default join behavior of `join` is `left`. We can change this with the `how` kwarg.\n", "\n", "For reference, here are the common types of joins. Join types won't be covered in this lesson.\n", "

\n", "\n", "

\n", "\n", "The type of join we performed above is shown in the upper-left most figure in the above chart." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Merge

\n", "\n", "Similar to `join` is `merge`. The difference between the two is the keying behavior. `merge` has a richer API (more functionality) and allows one to join on columns in the source dataframe other than the index. Because `merge` can effectively do everything that `join` can do, and more - it is recommended to always use `merge` unless code brevity is the top concern. \n", "\n", "```python\n", "Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n", "Docstring:\n", "Merge DataFrame objects by performing a database-style join operation by\n", "columns or indexes.\n", "```\n", "\n", "Note that `merge` is both a DataFrame method as well as a pandas function. Below, we'll be using the pandas function, `pd.merge()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we've achieved the same exact output as we did with `join`, but it took a little more explicit work. Let's run through the arguments for clarity:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Merge on Non-Index Columns

\n", "\n", "This brings us to our next point: merging on columns that are not the index columns. This is very, very common in SQL joins and this technique can be used to join just about any DataFrame.\n", "\n", "First, let's create some more realistic data - stocks!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "openprice = pd.DataFrame({'Symbol1': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n", "wkhigh = pd.DataFrame({'Symbol2': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n", "stockname = pd.DataFrame({'Symbol3': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's join the openprice and wkhigh dataframes together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how our `Symbol` column isn't in the same order in each dataframe. This is intentional, and note that the dataframe on the left, `openprice` dictates the order of the dataframe on the right, `wkhigh`. Also note that the shared key between the two dataframes is exempt from having a suffix applied to it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now it's your turn!

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note that we're using the DataFrame .merge() method here for brevity\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Exercise - Adventure Works

\n", "

\n", "\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Table Joins on Live Data

\n", "\n", "Here are the data dictionaries we'll be using for the following exercise:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = pd.read_csv('../data/Production.Product.csv', sep='\\t')\n", "ps = pd.read_csv('../data/Production.ProductSubcategory.csv', sep='\\t')\n", "soh = pd.read_csv('../data/Sales.SalesOrderHeader.csv', sep='\\t', nrows=1000)\n", "sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t', nrows=1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join Product Tables

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join Sales Order Header and Sales Order Detail Tables

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join Sales Order Header, Sales Order Detail, and Product Tables

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }