{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <span>\n",
" <p align=\"left\">\n",
" <img align=\"left\" style=\"padding-right: 5px\" valign=\"center\" src=\"https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png\" width=\"28px\">\n",
" </p>\n",
" </span>\n",
" <span>\n",
" <h1>Joining Table with Pandas</h1>\n",
" </span>\n",
"</div>\n",
"\n",
"Pandas provides support for combining `Series`, `DataFrame` and even `xarray` (3 dimensional `DataFrame`s, formerly known in pandas v0.20.0 as `Panel`s) objects with various kinds of set logic for the indicies and relational algebra functionality in the case of join / merge-type operations. More simply stated, this allows you to combine `DataFrame`s.\n",
"\n",
"<!-- Overview -->\n",
"<details>\n",
" <summary>Overview</summary>\n",
" <ul>\n",
" <li><b>In this session, we'll cover:</b></li>\n",
" <br>\n",
" <ul>\n",
" <li>Concatenating objects with <code>.append()</code> and <code>.concat()</code></li>\n",
" <li>Combining objects with <code>.join()</code> and <code>.merge()</code></li>\n",
" <li>Combining timeseries objects with <code>.merge_ordered()</code></li>\n",
" <li>Traditionally, this functionality is performed in a relational database, such as <a href=\"https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join\">SQL</a>. With pandas, you'll be able to perform the same operations - in python! The backend is <code>numpy</code>, a powerful linear algebra library which helps keep things speedy</li>\n",
" </ul>\n",
" <br>\n",
" <li><b>Why Join?</b></li>\n",
" <br>\n",
" <ul>\n",
" <li>You might be asking yourself - why keep data separated in different files? <i>Why not just keep it all in one file?</i></li>\n",
" <li>The answer stems from a thing called <a href=\"https://support.microsoft.com/en-us/help/283878/description-of-the-database-normalization-basics\">database normalization</a>. When a database is <i>normalized</i>, it is structured in such a way that redundancy of data is minimized. This allows a database to be faster, smaller, and more flexible when it comes time to change the data inside of it</li>\n",
" <li>The manifestation of this <i>normalization</i> is data that is represented within multiple <a href=\"https://en.wikipedia.org/wiki/Table_(database)\">tables</a> (which are effectively dataframes), related to each other by <a href=\"https://www.studytonight.com/dbms/database-key.php\">keys</a>, or columns in one table that equal a column in another table, allowing them to be joined. In this case, our tables are the <code>.csv</code> files we'll be importing</li>\n",
" </ul>\n",
" </ul>\n",
"</details>\n",
"\n",
"<!-- TOC -->\n",
"<details>\n",
" <summary>Table of Contents</summary>\n",
" <ul>\n",
" <li><a href=\"#import\">Import</a></li>\n",
" <li><a href=\"#conapp\">Concatenate and Append</a></li>\n",
" <ul>\n",
" <li><a href=\"#concatenate\">Concatenate</a></li>\n",
" <li><a href=\"#append\">Append</a></li>\n",
" </ul>\n",
" <li><a href=\"#joining\">Joining</a></li>\n",
" <ul>\n",
" <li><a href=\"join\">Join</a></li>\n",
" <li><a href=\"#merge\">Merge</a></li>\n",
" <ul>\n",
" <li><a href=\"#merge_keycols\">Merge on Non-Index Columns</a></li>\n",
" <li><a href=\"#yourturn\">Now it's Your Turn!</a></li>\n",
" </ul>\n",
" </ul>\n",
" <li><a href=\"#exercise\">Exercise - AdventureWorks</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_exercise\">Table Joins on Live Data</a></li>\n",
" <ul>\n",
" <li><a href=\"#ex_pp\">Join Product Tables</a></li>\n",
" <li><a href=\"#ex_soh_sod\">Join Sales Order Header and Sales Order Detail Tables</a></li>\n",
" <li><a href=\"#ex_soh_sod_pt\">Join Sales Order Header, Sales Order Detail, and Product Tables</a></li>\n",
" </ul>\n",
" </ul>\n",
" </ul>\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"import\"></div>\n",
"<h2>Import Pandas</h2>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pandas v0.24.2\n",
"Numpy v1.16.3\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"conapp\"></div>\n",
"<h2>Concatenate and Append</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"concatenate\"></div>\n",
"<h3>Concatenate</h3>\n",
"\n",
"Concatenate sticks dataframes together, either on top of each other, or next to each other.\n",
"\n",
"```python\n",
"Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)\n",
"Docstring:\n",
"Concatenate pandas objects along a particular axis with optional set logic\n",
"along the other axes.\n",
"```\n",
"\n",
"First, let's create two dataframes, `df1` and `df2`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 a 1\n",
"1 b 2"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# KEEP\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 c 3\n",
"1 d 4"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# KEEP\n",
"df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's stick the dataframes on top of each other using `concat`. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 a 1\n",
"1 b 2\n",
"0 c 3\n",
"1 d 4"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's stick the dataframes <b>next</b> to each other using `concat`. Use of the `axis` kwarg will help us here."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number letter number\n",
"0 a 1 c 3\n",
"1 b 2 d 4"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df2], axis=1)"
]
},
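{
"cell_type": "markdown",
"metadata": {},
"source": [
"The stacked result above keeps each frame's original index, so the labels `0` and `1` appear twice. As a small sketch (the frames are rebuilt here so the snippet runs on its own, and `stacked` and `labeled` are illustrative names), `ignore_index=True` renumbers the rows, while `keys` tags each row with the frame it came from:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n",
"df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"\n",
"# Discard the original row labels and renumber the rows from 0\n",
"stacked = pd.concat([df1, df2], ignore_index=True)\n",
"\n",
"# Keep the original labels, and add an outer index level naming the source frame\n",
"labeled = pd.concat([df1, df2], keys=['df1', 'df2'])\n",
"```\n",
"\n",
"With `keys`, the rows of one source can be pulled back out with `labeled.loc['df2']`."
]
},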
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"append\"></div>\n",
"<h3>Append</h3>\n",
"\n",
"Append is very similar to `concat`, except it limits itself to a specific case of `concat`, where `axis=0` (stack on top of each other) and `join=outer` (how to handle the axis of the second dataframe). For almost all cases, `concat` has all the functionality of `append` (and more) and can replace `append` entirely.\n",
"\n",
"```python\n",
"Signature: pd.DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=None)\n",
"Docstring:\n",
"Append rows of `other` to the end of this frame, returning a new\n",
"object. Columns not in this frame are added as new columns.\n",
"```\n",
"\n",
"Also note that `append` is a DataFrame and Series method, and not a pandas library function like `concat` is."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 a 1\n",
"1 b 2\n",
"0 c 3\n",
"1 d 4"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.append(df2)"
]
},
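{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `append` is a special case of `concat`, the call above can also be written with `pd.concat`. A minimal sketch, rebuilding the two frames so that it runs standalone (`appended` is an illustrative name):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n",
"df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"\n",
"# Rows of df2 stacked under df1, with the defaults that append implies spelled out\n",
"appended = pd.concat([df1, df2], axis=0, join='outer')\n",
"```\n",
"\n",
"This form does not depend on the `append` method, so it should also run on pandas versions where that method is no longer available."
]
},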
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"joining\"></div>\n",
"<h2>Joining</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"join\"></div>\n",
"<h3>Join</h3>\n",
"\n",
"`join` allows us to compare two dataframes, and combine them by using a matching column known as a `key`. Normally, during joins, this key is explicitly stated (we'll get to this with `merge` in our next example). With `join`, the `key` joining the table is always the `index` of the first table with (by default) the index of the second table. \n",
"\n",
"```python\n",
"Signature: pd.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n",
"Docstring:\n",
"Join columns with other DataFrame either on index or on a key\n",
"column. Efficiently Join multiple DataFrame objects by index at once by\n",
"passing a list.\n",
"```\n",
"\n",
"First, let's create two dataframes."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 a 1\n",
"1 b 2\n",
"2 c 3\n",
"3 d 4"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# KEEP\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter</th>\n",
" <th>number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>e</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>f</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter number\n",
"0 e 5\n",
"1 f 6"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# KEEP\n",
"df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n",
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, lets `join` these two dataframes. Note that we will `key`, or 'line up', the two dataframes based on their `indicies`.\n",
"\n",
"Note that, when joining dataframes with any common column names, we will need to supply a `lsuffix` or `rsuffix` kwarg. This is appended to the end of the column name of the returned, joined dataframe to differentiate and identify the source column. Here, we'll use `_df1` to identify that the column shown came from the `df1` dataframe, and `_df2` as a suffix to identify its origin as the `df2` dataframe. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter_df1</th>\n",
" <th>number_df1</th>\n",
" <th>letter_df2</th>\n",
" <th>number_df2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" <td>e</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter_df1 number_df1 letter_df2 number_df2\n",
"0 a 1 e 5.0\n",
"1 b 2 f 6.0\n",
"2 c 3 NaN NaN\n",
"3 d 4 NaN NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.join(df2, lsuffix='_df1', rsuffix='_df2')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how we have joined the two dataframes on their indicies, which creates a null for rows of index 2 and 3 in `df2`. This is expected and correct.\n",
"\n",
"Also note that the default join behavior of `join` is `left`. We can change this with the `how` kwarg.\n",
"\n",
"For reference, here are the common types of joins. Join types won't be covered in this lesson.\n",
"<p align=\"center\">\n",
"<img width=\"500px\" src=\"https://i.stack.imgur.com/udQpD.jpg\">\n",
"</p>\n",
"\n",
"The type of join we performed above is shown in the upper-left most figure in the above chart."
]
},
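{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `how` kwarg mentioned above accepts `'left'`, `'right'`, `'outer'`, and `'inner'`. As a short sketch of two of these (the frames are rebuilt so the snippet is self-contained, and `inner` and `right` are illustrative names):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n",
"\n",
"# Inner join: keep only the index labels present in both frames (0 and 1)\n",
"inner = df1.join(df2, how='inner', lsuffix='_df1', rsuffix='_df2')\n",
"\n",
"# Right join: keep every index label of df2, whether or not df1 has it\n",
"right = df1.join(df2, how='right', lsuffix='_df1', rsuffix='_df2')\n",
"```\n",
"\n",
"Here both results have two rows and no nulls, because every index label in `df2` also exists in `df1`."
]
},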
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"merge\"></div>\n",
"<h3>Merge</h3>\n",
"\n",
"Similar to `join` is `merge`. The difference between the two is the <i>keying behavior</i>. `merge` has a richer API (more functionality) and allows one to join on columns in the source dataframe <i>other than the index</i>. Because `merge` can effectively do everything that `join` can do, and more - it is recommended to always use `merge` unless code brevity is the top concern. \n",
"\n",
"```python\n",
"Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n",
"Docstring:\n",
"Merge DataFrame objects by performing a database-style join operation by\n",
"columns or indexes.\n",
"```\n",
"\n",
"Note that `merge` is <i>both</i> a DataFrame method as well as a pandas function. Below, we'll be using the pandas function, `pd.merge()`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>letter_df1</th>\n",
" <th>number_df1</th>\n",
" <th>letter_df2</th>\n",
" <th>number_df2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" <td>e</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" letter_df1 number_df1 letter_df2 number_df2\n",
"0 a 1 e 5.0\n",
"1 b 2 f 6.0\n",
"2 c 3 NaN NaN\n",
"3 d 4 NaN NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1, df2, how='left', left_index=True, right_index=True, suffixes=('_df1', '_df2'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we've achieved the same exact output as we did with `join`, but it took a little more explicit work. Let's run through the arguments for clarity:\n",
"\n",
"<ul>\n",
" <li><code>df1</code>: this is the first dataframe, and considered to be on the 'left' of <code>df2</code></li>\n",
" <li><code>df2</code>: this is the second dataframe, considered to be on the right of <code>df1</code></li>\n",
" <li><code>how='left'</code>: this states the type of join; see the above SQL join table</li>\n",
" <li><code>left_index=True</code>: this uses the index of <code>df1</code> as the join key for the left table</li>\n",
" <li><code>right_index=True</code>: this uses the index of <code>df2</code> as the join key for the right table</li>\n",
" <li><code>suffixes</code>: this places <code>_df1</code> after column names which came from <code>df1</code></li>\n",
"</ul>"
]
},
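{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more argument from the signature can help when checking a join: `indicator=True` adds a `_merge` column recording whether each row was found in both frames or only one. A minimal sketch on the same data (rebuilt here so it runs on its own; `merged` and `matched` are illustrative names):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
"df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n",
"\n",
"# Same left join on the indexes as above, plus the _merge bookkeeping column\n",
"merged = pd.merge(df1, df2, how='left', left_index=True, right_index=True,\n",
"                  suffixes=('_df1', '_df2'), indicator=True)\n",
"\n",
"# Index 0 and 1 exist in both frames; index 2 and 3 exist only in df1\n",
"matched = merged['_merge'].tolist()\n",
"```\n",
"\n",
"Filtering on `_merge == 'left_only'` is a quick way to list the rows that found no partner."
]
},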
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"merge_keycols\"></div>\n",
"<h4>Merge on Non-Index Columns</h4>\n",
"\n",
"This brings us to our next point: merging on columns that are not the index columns. This is very, very common in SQL joins and this technique can be used to join just about any DataFrame.\n",
"\n",
"First, let's create some more realistic data - stocks!"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# KEEP\n",
"openprice = pd.DataFrame({'Symbol': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n",
"wkhigh = pd.DataFrame({'Symbol': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n",
"stockname = pd.DataFrame({'Symbol': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's join the <code>openprice</code> and <code>wkhigh</code> dataframes together."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Symbol</th>\n",
" <th>OpenPrice</th>\n",
" <th>52wkHigh</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AAPL</td>\n",
" <td>217.51</td>\n",
" <td>233.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>DHR</td>\n",
" <td>96.54</td>\n",
" <td>110.11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>DAL</td>\n",
" <td>51.45</td>\n",
" <td>60.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AMZN</td>\n",
" <td>1703.34</td>\n",
" <td>2050.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Symbol OpenPrice 52wkHigh\n",
"0 AAPL 217.51 233.47\n",
"1 DHR 96.54 110.11\n",
"2 DAL 51.45 60.79\n",
"3 AMZN 1703.34 2050.49"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(openprice, wkhigh, how='left', left_on='Symbol', right_on='Symbol')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how our `Symbol` column isn't in the same order in each dataframe. This is intentional, and note that the dataframe on the left, `openprice` dictates the order of the dataframe on the right, `wkhigh`. Also note that the shared key between the two dataframes is exempt from having a <code>suffix</code> applied to it. "
]
},
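{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the key column has the same name in both dataframes, `on` can stand in for the `left_on` / `right_on` pair. The sketch below also passes `validate`, which is optional and shown only as an illustration: it asks pandas to raise an error if `Symbol` turns out to be duplicated in either frame (`prices` is an illustrative name).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"openprice = pd.DataFrame({'Symbol': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n",
"wkhigh = pd.DataFrame({'Symbol': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n",
"\n",
"# on='Symbol' is shorthand for left_on='Symbol', right_on='Symbol'\n",
"# validate='one_to_one' checks that each Symbol appears at most once per frame\n",
"prices = pd.merge(openprice, wkhigh, how='left', on='Symbol', validate='one_to_one')\n",
"```\n",
"\n",
"The result has the same rows and columns as the `left_on` / `right_on` version above."
]
},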
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"yourturn\"></div>\n",
"<h4>Now it's your turn!</h4>\n",
"\n",
"<ul>\n",
" <li><code>merge</code> the <code>openprice</code> and <code>stockname</code> dataframes and inspect the result</li>\n",
" <li><code>merge</code> all three dataframes together and inspect the result</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Symbol</th>\n",
" <th>OpenPrice</th>\n",
" <th>Name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AAPL</td>\n",
" <td>217.51</td>\n",
" <td>Apple</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>DHR</td>\n",
" <td>96.54</td>\n",
" <td>Danaher</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>DAL</td>\n",
" <td>51.45</td>\n",
" <td>Delta Airlines</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AMZN</td>\n",
" <td>1703.34</td>\n",
" <td>Amazon</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Symbol OpenPrice Name\n",
"0 AAPL 217.51 Apple\n",
"1 DHR 96.54 Danaher\n",
"2 DAL 51.45 Delta Airlines\n",
"3 AMZN 1703.34 Amazon"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Symbol</th>\n",
" <th>OpenPrice</th>\n",
" <th>Name</th>\n",
" <th>52wkHigh</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AAPL</td>\n",
" <td>217.51</td>\n",
" <td>Apple</td>\n",
" <td>233.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>DHR</td>\n",
" <td>96.54</td>\n",
" <td>Danaher</td>\n",
" <td>110.11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>DAL</td>\n",
" <td>51.45</td>\n",
" <td>Delta Airlines</td>\n",
" <td>60.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AMZN</td>\n",
" <td>1703.34</td>\n",
" <td>Amazon</td>\n",
" <td>2050.49</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Symbol OpenPrice Name 52wkHigh\n",
"0 AAPL 217.51 Apple 233.47\n",
"1 DHR 96.54 Danaher 110.11\n",
"2 DAL 51.45 Delta Airlines 60.79\n",
"3 AMZN 1703.34 Amazon 2050.49"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note that we're using the DataFrame .merge() method here for brevity\n",
"pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol') \\\n",
" .merge(wkhigh, how='left', left_on='Symbol', right_on='Symbol')"
]
},
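{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chained call above works well for three dataframes. For a longer list that shares one key, one possible alternative (a sketch, not the only way) is to fold the list with `functools.reduce` from the standard library; `frames` and `combined` are illustrative names.\n",
"\n",
"```python\n",
"from functools import reduce\n",
"\n",
"import pandas as pd\n",
"\n",
"openprice = pd.DataFrame({'Symbol': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n",
"wkhigh = pd.DataFrame({'Symbol': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n",
"stockname = pd.DataFrame({'Symbol': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})\n",
"\n",
"frames = [openprice, stockname, wkhigh]\n",
"\n",
"# Left-merge each frame onto the running result, always keyed on Symbol\n",
"combined = reduce(lambda left, right: pd.merge(left, right, how='left', on='Symbol'), frames)\n",
"```\n",
"\n",
"The first dataframe in the list plays the role of the left table throughout, so it decides which rows are kept."
]
},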
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise\"></div>\n",
"<h2>Exercise - Adventure Works</h2>\n",
"<p align=\"right\">\n",
"<img src=\"http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800\">\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_exercise\"></div>\n",
"<h3>Table Joins on Live Data</h3>\n",
"\n",
"Here are the data dictionaries we'll be using for the following exercise:\n",
"\n",
"<ul>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html\">Production.Product</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.ProductSubCategory.html\">Production.ProductSubcategory</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderHeader.html\">Sales.SalesOrderHeader</a></li>\n",
" <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html\">Sales.SalesOrderDetail</a></li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"p = pd.read_csv('../data/Production.Product.csv', sep='\\t')\n",
"ps = pd.read_csv('../data/Production.ProductSubcategory.csv', sep='\\t')\n",
"soh = pd.read_csv('../data/Sales.SalesOrderHeader.csv', sep='\\t', nrows=1000)\n",
"sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t', nrows=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_pp\"></div>\n",
"<h4>Join Product Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Using the <code>Production.Product.ProductID</code> and <code>Production.ProductSubcategory.ProductID</code> keys, join the <code>Production.Product</code> and <code>Production.ProductSubcategory</code> tables</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ProductID</th>\n",
" <th>Name_p</th>\n",
" <th>ProductNumber</th>\n",
" <th>MakeFlag</th>\n",
" <th>FinishedGoodsFlag</th>\n",
" <th>Color</th>\n",
" <th>SafetyStockLevel</th>\n",
" <th>ReorderPoint</th>\n",
" <th>StandardCost</th>\n",
" <th>ListPrice</th>\n",
" <th>...</th>\n",
" <th>ProductModelID</th>\n",
" <th>SellStartDate</th>\n",
" <th>SellEndDate</th>\n",
" <th>DiscontinuedDate</th>\n",
" <th>rowguid_p</th>\n",
" <th>ModifiedDate_p</th>\n",
" <th>ProductCategoryID</th>\n",
" <th>Name</th>\n",
" <th>rowguid</th>\n",
" <th>ModifiedDate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Adjustable Race</td>\n",
" <td>AR-5381</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>1000</td>\n",
" <td>750</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{694215B7-08F7-4C0D-ACB1-D734BA44C0C8}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Bearing Ball</td>\n",
" <td>BA-8327</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>1000</td>\n",
" <td>750</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{58AE3C20-4F3A-4749-A7D4-D568806CC537}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>BB Ball Bearing</td>\n",
" <td>BE-2349</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>800</td>\n",
" <td>600</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 29 columns</p>\n",
"</div>"
],
"text/plain": [
" ProductID Name_p ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"0 1 Adjustable Race AR-5381 0 0 \n",
"1 2 Bearing Ball BA-8327 0 0 \n",
"2 3 BB Ball Bearing BE-2349 1 0 \n",
"\n",
" Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n",
"0 NaN 1000 750 0.0 0.0 ... \n",
"1 NaN 1000 750 0.0 0.0 ... \n",
"2 NaN 800 600 0.0 0.0 ... \n",
"\n",
" ProductModelID SellStartDate SellEndDate DiscontinuedDate \\\n",
"0 NaN 2008-04-30 00:00:00 NaN NaN \n",
"1 NaN 2008-04-30 00:00:00 NaN NaN \n",
"2 NaN 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid_p ModifiedDate_p \\\n",
"0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n",
"1 {58AE3C20-4F3A-4749-A7D4-D568806CC537} 2014-02-08 10:01:36.827000000 \n",
"2 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} 2014-02-08 10:01:36.827000000 \n",
"\n",
" ProductCategoryID Name rowguid ModifiedDate \n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"\n",
"[3 rows x 29 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(p, ps, how='left', on='ProductSubcategoryID', suffixes=('_p', '')).head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_soh_sod\"></div>\n",
"<h4>Join Sales Order Header and Sales Order Detail Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Join the <code>Sales.SalesOrderHeader</code> and <code>Sales.SalesOrderDetail</code> tables</li>\n",
" <li>Don't forget to use your data dictionaries!</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SalesOrderID</th>\n",
" <th>RevisionNumber</th>\n",
" <th>OrderDate</th>\n",
" <th>DueDate</th>\n",
" <th>ShipDate</th>\n",
" <th>Status</th>\n",
" <th>OnlineOrderFlag</th>\n",
" <th>SalesOrderNumber</th>\n",
" <th>PurchaseOrderNumber</th>\n",
" <th>AccountNumber</th>\n",
" <th>...</th>\n",
" <th>SalesOrderDetailID</th>\n",
" <th>CarrierTrackingNumber</th>\n",
" <th>OrderQty</th>\n",
" <th>ProductID</th>\n",
" <th>SpecialOfferID</th>\n",
" <th>UnitPrice</th>\n",
" <th>UnitPriceDiscount</th>\n",
" <th>LineTotal</th>\n",
" <th>rowguid_y</th>\n",
" <th>ModifiedDate_y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>4911-403C-98</td>\n",
" <td>1.0</td>\n",
" <td>776.0</td>\n",
" <td>1.0</td>\n",
" <td>2024.994</td>\n",
" <td>0.0</td>\n",
" <td>2024.994</td>\n",
" <td>{B207C96D-D9E6-402B-8470-2CC176C42283}</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>4911-403C-98</td>\n",
" <td>3.0</td>\n",
" <td>777.0</td>\n",
" <td>1.0</td>\n",
" <td>2024.994</td>\n",
" <td>0.0</td>\n",
" <td>6074.982</td>\n",
" <td>{7ABB600D-1E77-41BE-9FE5-B9142CFC08FA}</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>3.0</td>\n",
" <td>4911-403C-98</td>\n",
" <td>1.0</td>\n",
" <td>778.0</td>\n",
" <td>1.0</td>\n",
" <td>2024.994</td>\n",
" <td>0.0</td>\n",
" <td>2024.994</td>\n",
" <td>{475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F}</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 36 columns</p>\n",
"</div>"
],
"text/plain": [
" SalesOrderID RevisionNumber OrderDate DueDate \\\n",
"0 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"1 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"2 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"\n",
" ShipDate Status OnlineOrderFlag SalesOrderNumber \\\n",
"0 2011-06-07 00:00:00 5 0 SO43659 \n",
"1 2011-06-07 00:00:00 5 0 SO43659 \n",
"2 2011-06-07 00:00:00 5 0 SO43659 \n",
"\n",
" PurchaseOrderNumber AccountNumber ... SalesOrderDetailID \\\n",
"0 PO522145787 10-4020-000676 ... 1.0 \n",
"1 PO522145787 10-4020-000676 ... 2.0 \n",
"2 PO522145787 10-4020-000676 ... 3.0 \n",
"\n",
" CarrierTrackingNumber OrderQty ProductID SpecialOfferID UnitPrice \\\n",
"0 4911-403C-98 1.0 776.0 1.0 2024.994 \n",
"1 4911-403C-98 3.0 777.0 1.0 2024.994 \n",
"2 4911-403C-98 1.0 778.0 1.0 2024.994 \n",
"\n",
" UnitPriceDiscount LineTotal rowguid_y \\\n",
"0 0.0 2024.994 {B207C96D-D9E6-402B-8470-2CC176C42283} \n",
"1 0.0 6074.982 {7ABB600D-1E77-41BE-9FE5-B9142CFC08FA} \n",
"2 0.0 2024.994 {475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F} \n",
"\n",
" ModifiedDate_y \n",
"0 2011-05-31 00:00:00 \n",
"1 2011-05-31 00:00:00 \n",
"2 2011-05-31 00:00:00 \n",
"\n",
"[3 rows x 36 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here, we add the optional concept of validation. This is a one-to-many merge,\n",
"# since each sales order (header) can contain multiple products (detail). Note that the\n",
"# header table is on the left of the detail table, so validate='one_to_many' checks\n",
"# that SalesOrderID is unique in the header table and raises a MergeError otherwise.\n",
"pd.merge(soh, sod, how='left', on='SalesOrderID', validate='one_to_many').head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"ex_soh_sod_pt\"></div>\n",
"<h4>Join Sales Order Header, Sales Order Detail, and Product Tables</h4>\n",
"\n",
"<ul>\n",
" <li>Join the <code>Sales.SalesOrderHeader</code>, <code>Sales.SalesOrderDetail</code>, and <code>Production.Product</code> tables</li>\n",
" <li>Don't forget to use your data dictionaries!</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SalesOrderID</th>\n",
" <th>RevisionNumber</th>\n",
" <th>OrderDate</th>\n",
" <th>DueDate</th>\n",
" <th>ShipDate</th>\n",
" <th>Status</th>\n",
" <th>OnlineOrderFlag</th>\n",
" <th>SalesOrderNumber</th>\n",
" <th>PurchaseOrderNumber</th>\n",
" <th>AccountNumber</th>\n",
" <th>...</th>\n",
" <th>ProductLine</th>\n",
" <th>Class</th>\n",
" <th>Style</th>\n",
" <th>ProductSubcategoryID</th>\n",
" <th>ProductModelID</th>\n",
" <th>SellStartDate</th>\n",
" <th>SellEndDate</th>\n",
" <th>DiscontinuedDate</th>\n",
" <th>rowguid</th>\n",
" <th>ModifiedDate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>M</td>\n",
" <td>H</td>\n",
" <td>U</td>\n",
" <td>1.0</td>\n",
" <td>19.0</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2012-05-29 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>{02935111-A546-4C6D-941F-BE12D42C158E}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>M</td>\n",
" <td>H</td>\n",
" <td>U</td>\n",
" <td>1.0</td>\n",
" <td>19.0</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2012-05-29 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>{7920BC3B-8FD4-4610-93D2-E693A66B6474}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>43659</td>\n",
" <td>8</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2011-06-12 00:00:00</td>\n",
" <td>2011-06-07 00:00:00</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>SO43659</td>\n",
" <td>PO522145787</td>\n",
" <td>10-4020-000676</td>\n",
" <td>...</td>\n",
" <td>M</td>\n",
" <td>H</td>\n",
" <td>U</td>\n",
" <td>1.0</td>\n",
" <td>19.0</td>\n",
" <td>2011-05-31 00:00:00</td>\n",
" <td>2012-05-29 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>{1B486300-7E64-4C5D-A9BA-A8368E20C5A0}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 60 columns</p>\n",
"</div>"
],
"text/plain": [
" SalesOrderID RevisionNumber OrderDate DueDate \\\n",
"0 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"1 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"2 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n",
"\n",
" ShipDate Status OnlineOrderFlag SalesOrderNumber \\\n",
"0 2011-06-07 00:00:00 5 0 SO43659 \n",
"1 2011-06-07 00:00:00 5 0 SO43659 \n",
"2 2011-06-07 00:00:00 5 0 SO43659 \n",
"\n",
" PurchaseOrderNumber AccountNumber ... ProductLine Class Style \\\n",
"0 PO522145787 10-4020-000676 ... M H U \n",
"1 PO522145787 10-4020-000676 ... M H U \n",
"2 PO522145787 10-4020-000676 ... M H U \n",
"\n",
" ProductSubcategoryID ProductModelID SellStartDate \\\n",
"0 1.0 19.0 2011-05-31 00:00:00 \n",
"1 1.0 19.0 2011-05-31 00:00:00 \n",
"2 1.0 19.0 2011-05-31 00:00:00 \n",
"\n",
" SellEndDate DiscontinuedDate \\\n",
"0 2012-05-29 00:00:00 NaN \n",
"1 2012-05-29 00:00:00 NaN \n",
"2 2012-05-29 00:00:00 NaN \n",
"\n",
" rowguid ModifiedDate \n",
"0 {02935111-A546-4C6D-941F-BE12D42C158E} 2014-02-08 10:01:36.827000000 \n",
"1 {7920BC3B-8FD4-4610-93D2-E693A66B6474} 2014-02-08 10:01:36.827000000 \n",
"2 {1B486300-7E64-4C5D-A9BA-A8368E20C5A0} 2014-02-08 10:01:36.827000000 \n",
"\n",
"[3 rows x 60 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here, we again take the DataFrame produced by the first merge and chain the DataFrame\n",
"# .merge() METHOD to join in the Product table. Note that joining the Product table is a\n",
"# many-to-one merge: multiple sales order lines may reference the same product (e.g.,\n",
"# we have sold a mountain bike, model xyz, more than once).\n",
"pd.merge(soh, sod, how='left', on='SalesOrderID') \\\n",
"    .merge(p, how='left', on='ProductID').head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}