You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

580 lines
20 KiB

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <span>\n",
" <p align=\"left\">\n",
" <img align=\"left\" style=\"padding-right: 5px\" valign=\"center\" src=\"https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png\" width=\"28px\">\n",
" </p>\n",
" </span>\n",
" <span>\n",
" <h1>Joining Table with Pandas</h1>\n",
" </span>\n",
"</div>\n",
"\n",
"Pandas provides support for combining `Series`, `DataFrame` and even `xarray` (3 dimensional `DataFrame`s, formerly known in pandas v0.20.0 as `Panel`s) objects with various kinds of set logic for the indicies and relational algebra functionality in the case of join / merge-type operations. More simply stated, this allows you to combine `DataFrame`s.\n",
"\n",
"<!-- Overview -->\n",
"<details>\n",
" <summary>Overview</summary>\n",
" <ul>\n",
" <li><b>In this session, we'll cover:</b></li>\n",
" <br>\n",
" <ul>\n",
" <li>Concatenating objects with <code>.append()</code> and <code>.concat()</code></li>\n",
" <li>Combining objects with <code>.join()</code> and <code>.merge()</code></li>\n",
" <li>Combining timeseries objects with <code>.merge_ordered()</code></li>\n",
" <li>Traditionally, this functionality is performed in a relational database, such as <a href=\"https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join\">SQL</a>. With pandas, you'll be able to perform the same operations - in python! The backend is <code>numpy</code>, a powerful linear algebra library which helps keep things speedy</li>\n",
" </ul>\n",
" <br>\n",
" <li><b>Why Join?</b></li>\n",
" <br>\n",
" <ul>\n",
" <li>You might be asking yourself - why keep data separated in different files? <i>Why not just keep it all in one file?</i></li>\n",
" <li>The answer stems from a thing called <a href=\"https://support.microsoft.com/en-us/help/283878/description-of-the-database-normalization-basics\">database normalization</a>. When a database is <i>normalized</i>, it is structured in such a way that redundancy of data is minimized. This allows a database to be faster, smaller, and more flexible when it comes time to change the data inside of it</li>\n",
" <li>The manifestation of this <i>normalization</i> is data that is represented within multiple <a href=\"https://en.wikipedia.org/wiki/Table_(database)\">tables</a> (which are effectively dataframes), related to each other by <a href=\"https://www.studytonight.com/dbms/database-key.php\">keys</a>, or columns in one table that equal a column in another table, allowing them to be joined. In this case, our tables are the <code>.csv</code> files we'll be importing</li>\n",
" </ul>\n",
" </ul>\n",
"</details>\n",
"\n",
"<!-- TOC -->\n",
"<details>\n",
" <summary>Table of Contents</summary>\n",
" <ul>\n",
" <li><a href=\"#import\">Import</a></li>\n",
" <li><a href=\"#conapp\">Concatenate and Append</a></li>\n",
" <ul>\n",
" <li><a href=\"#concatenate\">Concatenate</a></li>\n",
" <li><a href=\"#append\">Append</a></li>\n",
" </ul>\n",
" <li><a href=\"#joining\">Joining</a></li>\n",
" <ul>\n",
" <li><a href=\"join\">Join</a></li>\n",
" <li><a href=\"#merge\">Merge</a></li>\n",
" <ul>\n",
" <li><a href=\"#merge_keycols\">Merge on Non-Index Columns</a></li>\n",
" <li><a href=\"#yourturn\">Now it's Your Turn!</a></li>\n",
" </ul>\n",
" </ul>\n",
" <li><a href=\"#exercise\">Exercise - AdventureWorks</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_exercise\">Table Joins on Live Data</a></li>\n",
" <ul>\n",
" <li><a href=\"#ex_pp\">Join Product Tables</a></li>\n",
" <li><a href=\"#ex_soh_sod\">Join Sales Order Header and Sales Order Detail Tables</a></li>\n",
" <li><a href=\"#ex_soh_sod_pt\">Join Sales Order Header, Sales Order Detail, and Product Tables</a></li>\n",
" </ul>\n",
" </ul>\n",
" </ul>\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"import\"></div>\n",
"<h2>Import Pandas</h2>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"conapp\"></div>\n",
"<h2>Concatenate and Append</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"concatenate\"></div>\n",
"<h3>Concatenate</h3>\n",
"\n",
"Concatenate sticks dataframes together, either on top of each other, or next to each other.\n",
"\n",
"```python\n",
"Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)\n",
"Docstring:\n",
"Concatenate pandas objects along a particular axis with optional set logic\n",
"along the other axes.\n",
"```\n",
"\n",
"First, let's create two dataframes, `df1` and `df2`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's stick the dataframes on top of each other using `concat`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's stick the dataframes <b>next</b> to each other using `concat`. Use of the `axis` kwarg will help us here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <div id=\"append\"></div>\n",
"<h3>Append</h3>\n",
"\n",
"Append is very similar to `concat`, except it limits itself to a specific case of `concat`, where `axis=0` (stack on top of each other) and `join=outer` (how to handle the axis of the second dataframe). For almost all cases, `concat` has all the functionality of `append` (and more) and can replace `append` entirely.\n",
"\n",
"```python\n",
"Signature: pd.DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=None)\n",
"Docstring:\n",
"Append rows of `other` to the end of this frame, returning a new\n",
"object. Columns not in this frame are added as new columns.\n",
"```\n",
"\n",
"Also note that `append` is a DataFrame and Series method, and not a pandas library function like `concat` is."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"joining\"></div>\n",
"<h2>Joining</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"join\"></div>\n",
"<h3>Join</h3>\n",
"\n",
"`join` allows us to compare two dataframes, and combine them by using a matching column known as a `key`. Normally, during joins, this key is explicitly stated (we'll get to this with `merge` in our next example). With `join`, the `key` joining the table is always the `index` of the first table with (by default) the index of the second table. \n",
"\n",
"```python\n",
"Signature: pd.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n",
"Docstring:\n",
"Join columns with other DataFrame either on index or on a key\n",
"column. Efficiently Join multiple DataFrame objects by index at once by\n",
"passing a list.\n",
"```\n",
"\n",
"First, let's create two dataframes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n",
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, lets `join` these two dataframes. Note that we will `key`, or 'line up', the two dataframes based on their `indicies`.\n",
"\n",
"Note that, when joining dataframes with any common column names, we will need to supply a `lsuffix` or `rsuffix` kwarg. This is appended to the end of the column name of the returned, joined dataframe to differentiate and identify the source column. Here, we'll use `_df1` to identify that the column shown came from the `df1` dataframe, and `_df2` as a suffix to identify its origin as the `df2` dataframe. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how we have joined the two dataframes on their indicies, which creates a null for rows of index 2 and 3 in `df2`. This is expected and correct.\n",
"\n",
"Also note that the default join behavior of `join` is `left`. We can change this with the `how` kwarg.\n",
"\n",
"For reference, here are the common types of joins. Join types won't be covered in this lesson.\n",
"<p align=\"center\">\n",
"<img width=\"500px\" src=\"https://i.stack.imgur.com/udQpD.jpg\">\n",
"</p>\n",
"\n",
"The type of join we performed above is shown in the upper-left most figure in the above chart."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"merge\"></div>\n",
"<h3>Merge</h3>\n",
"\n",
"Similar to `join` is `merge`. The difference between the two is the <i>keying behavior</i>. `merge` has a richer API (more functionality) and allows one to join on columns in the source dataframe <i>other than the index</i>. Because `merge` can effectively do everything that `join` can do, and more - it is recommended to always use `merge` unless code brevity is the top concern. \n",
"\n",
"```python\n",
"Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n",
"Docstring:\n",
"Merge DataFrame objects by performing a database-style join operation by\n",
"columns or indexes.\n",
"```\n",
"\n",
"Note that `merge` is <i>both</i> a DataFrame method as well as a pandas function. Below, we'll be using the pandas function, `pd.merge()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we've achieved the same exact output as we did with `join`, but it took a little more explicit work. Let's run through the arguments for clarity:\n",
"\n",
"<ul>\n",
" <li><code>df1</code>: this is the first dataframe, and considered to be on the 'left' of <code>df2</code></li>\n",
" <li><code>df2</code>: this is the second dataframe, considered to be on the right of <code>df1</code></li>\n",
" <li><code>how='left'</code>: this states the type of join; see the above SQL join table</li>\n",
" <li><code>left_index=True</code>: this uses the index of <code>df1</code> as the join key for the left table</li>\n",
" <li><code>right_index=True</code>: this uses the index of <code>df2</code> as the join key for the right table</li>\n",
" <li><code>suffixes</code>: this places <code>_df1</code> after column names which came from <code>df1</code></li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"merge_keycols\"></div>\n",
"<h4>Merge on Non-Index Columns</h4>\n",
"\n",
"This brings us to our next point: merging on columns that are not the index columns. This is very, very common in SQL joins and this technique can be used to join just about any DataFrame.\n",
"\n",
"First, let's create some more realistic data - stocks!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"openprice = pd.DataFrame({'Symbol1': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n",
"wkhigh = pd.DataFrame({'Symbol2': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n",
"stockname = pd.DataFrame({'Symbol3': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's join the <code>openprice</code> and <code>wkhigh</code> dataframes together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how our `Symbol` column isn't in the same order in each dataframe. This is intentional, and note that the dataframe on the left, `openprice` dictates the order of the dataframe on the right, `wkhigh`. Also note that the shared key between the two dataframes is exempt from having a <code>suffix</code> applied to it. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"yourturn\"></div>\n",
"<h4>Now it's your turn!</h4>\n",
"\n",
"<ul>\n",
" <li><code>merge</code> the <code>openprice</code> and <code>stockname</code> dataframes and inspect the result</li>\n",
" <li><code>merge</code> all three dataframes together and inspect the result</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note that we're using the DataFrame .merge() method here for brevity\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise\"></div>\n",
"<h2>Exercise - Adventure Works</h2>\n",
"<p align=\"right\">\n",
"<img src=\"http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800\">\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_exercise\"></div>\n",
"<h3>Table Joins on Live Data</h3>\n",
"\n",
"Here are the data dictionaries we'll be using for the following exercise:\n",
"\n",
"<ul>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html\">Production.Product</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.ProductSubCategory.html\">Production.ProductSubcategory</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderHeader.html\">Sales.SalesOrderHeader</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html\">Sales.SalesOrderDetail</a></li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = pd.read_csv('../data/Production.Product.csv', sep='\\t')\n",
"ps = pd.read_csv('../data/Production.ProductSubcategory.csv', sep='\\t')\n",
"soh = pd.read_csv('../data/Sales.SalesOrderHeader.csv', sep='\\t', nrows=1000)\n",
"sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t', nrows=1000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_pp\"></div>\n",
"<h4>Join Product Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Using the <code>Production.Product.ProductID</code> and <code>Production.ProductSubcategory.ProductID</code> keys, join the <code>Production.Product</code> and <code>Production.ProductSubcategory</code> tables</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_soh_sod\"></div>\n",
"<h4>Join Sales Order Header and Sales Order Detail Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Join the <code>Sales.SalesOrderHeader</code> and <code>Sales.SalesOrderDetail</code> tables</li>\n",
" <li>Don't forget to use your data dictionaries!</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_soh_sod_pt\"></div>\n",
"<h4>Join Sales Order Header, Sales Order Detail, and Product Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Join the <code>Sales.SalesOrderHeader</code>, <code>Sales.SalesOrderDetail</code>, and <code>Production.Product</code> tables</li>\n",
" <li>Don't forget to use your data dictionaries!</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}