{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "

\n", " \n", "

\n", "
\n", " \n", "

Joining Tables with Pandas

\n", "
\n", "
\n", "\n", "Pandas provides support for combining `Series` and `DataFrame` objects (and, for higher-dimensional data, the external `xarray` library, which replaced the `Panel` type deprecated in pandas v0.20.0) with various kinds of set logic for the indices and relational algebra functionality in the case of join / merge-type operations. More simply stated, this allows you to combine `DataFrame`s.\n", "\n", "\n", "
\n", " Overview\n", " \n", "
\n", "\n", "\n", "
\n", " Table of Contents\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Import Pandas

" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas v0.24.2\n", "Numpy v1.16.3\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Concatenate and Append

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Concatenate

\n", "\n", "Concatenate sticks dataframes together, either on top of each other, or next to each other.\n", "\n", "```python\n", "Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)\n", "Docstring:\n", "Concatenate pandas objects along a particular axis with optional set logic\n", "along the other axes.\n", "```\n", "\n", "First, let's create two dataframes, `df1` and `df2`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP\n", "df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP\n", "df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's stick the dataframes on top of each other using `concat`. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's stick the dataframes next to each other using `concat`. Use of the `axis` kwarg will help us here." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumberletternumber
0a1c3
1b2d4
\n", "
" ], "text/plain": [ " letter number letter number\n", "0 a 1 c 3\n", "1 b 2 d 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Append

\n", "\n", "Append is very similar to `concat`, except that it is limited to one specific case of `concat`: `axis=0` (stack on top of each other) and `join='outer'` (keep the union of columns from both dataframes). In almost all cases, `concat` has all the functionality of `append` (and more) and can replace `append` entirely.\n", "\n", "```python\n", "Signature: pd.DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=None)\n", "Docstring:\n", "Append rows of `other` to the end of this frame, returning a new\n", "object. Columns not in this frame are added as new columns.\n", "```\n", "\n", "Also note that `append` is a DataFrame and Series method, and not a pandas library function like `concat` is." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
0c3
1d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "0 c 3\n", "1 d 4" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.append(df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Joining

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join

\n", "\n", "`join` allows us to compare two dataframes and combine them using a matching column known as a `key`. Normally, during joins, this key is explicitly stated (we'll get to this with `merge` in our next example). With `join`, the key for the calling dataframe is (by default) its `index`, which is matched against the index of the second dataframe. \n", "\n", "```python\n", "Signature: pd.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n", "Docstring:\n", "Join columns with other DataFrame either on index or on a key\n", "column. Efficiently Join multiple DataFrame objects by index at once by\n", "passing a list.\n", "```\n", "\n", "First, let's create two dataframes." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0a1
1b2
2c3
3d4
\n", "
" ], "text/plain": [ " letter number\n", "0 a 1\n", "1 b 2\n", "2 c 3\n", "3 d 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP\n", "df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumber
0e5
1f6
\n", "
" ], "text/plain": [ " letter number\n", "0 e 5\n", "1 f 6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# KEEP\n", "df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, lets `join` these two dataframes. Note that we will `key`, or 'line up', the two dataframes based on their `indicies`.\n", "\n", "Note that, when joining dataframes with any common column names, we will need to supply a `lsuffix` or `rsuffix` kwarg. This is appended to the end of the column name of the returned, joined dataframe to differentiate and identify the source column. Here, we'll use `_df1` to identify that the column shown came from the `df1` dataframe, and `_df2` as a suffix to identify its origin as the `df2` dataframe. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letter_df1number_df1letter_df2number_df2
0a1e5.0
1b2f6.0
2c3NaNNaN
3d4NaNNaN
\n", "
" ], "text/plain": [ " letter_df1 number_df1 letter_df2 number_df2\n", "0 a 1 e 5.0\n", "1 b 2 f 6.0\n", "2 c 3 NaN NaN\n", "3 d 4 NaN NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.join(df2, lsuffix='_df1', rsuffix='_df2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how we have joined the two dataframes on their indicies, which creates a null for rows of index 2 and 3 in `df2`. This is expected and correct.\n", "\n", "Also note that the default join behavior of `join` is `left`. We can change this with the `how` kwarg.\n", "\n", "For reference, here are the common types of joins. Join types won't be covered in this lesson.\n", "

\n", "\n", "

\n", "\n", "The type of join we performed above is shown in the upper-left figure of the chart above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
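To make the `how` kwarg concrete before moving on, here is a small sketch using the same two dataframes: an `inner` join keeps only the indices present in both (0 and 1), while an `outer` join keeps every index.\n", "\n", "```python\n", "# Keep only rows whose index appears in both dataframes\n", "df1.join(df2, how='inner', lsuffix='_df1', rsuffix='_df2')\n", "\n", "# Keep every index from either dataframe\n", "df1.join(df2, how='outer', lsuffix='_df1', rsuffix='_df2')\n", "```\n", "\n", "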
\n", "

Merge

\n", "\n", "Similar to `join` is `merge`. The difference between the two is the keying behavior: `merge` has a richer API (more functionality) and allows one to join on columns in the source dataframes other than the index. Because `merge` can effectively do everything that `join` can do, and more, it is recommended to use `merge` unless code brevity is the top concern. Also note in the signature below that, unlike `join`, the default join type for `merge` is `how='inner'`. \n", "\n", "```python\n", "Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n", "Docstring:\n", "Merge DataFrame objects by performing a database-style join operation by\n", "columns or indexes.\n", "```\n", "\n", "Note that `merge` is both a DataFrame method and a pandas function. Below, we'll be using the pandas function, `pd.merge()`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letter_df1number_df1letter_df2number_df2
0a1e5.0
1b2f6.0
2c3NaNNaN
3d4NaNNaN
\n", "
" ], "text/plain": [ " letter_df1 number_df1 letter_df2 number_df2\n", "0 a 1 e 5.0\n", "1 b 2 f 6.0\n", "2 c 3 NaN NaN\n", "3 d 4 NaN NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='left', left_index=True, right_index=True, suffixes=('_df1', '_df2'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we've achieved the same exact output as we did with `join`, but it took a little more explicit work. Let's run through the arguments for clarity:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Merge on Non-Index Columns

\n", "\n", "This brings us to our next point: merging on columns that are not the index columns. This is very, very common in SQL joins and this technique can be used to join just about any DataFrame.\n", "\n", "First, let's create some more realistic data - stocks!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# KEEP\n", "openprice = pd.DataFrame({'Symbol': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n", "wkhigh = pd.DataFrame({'Symbol': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n", "stockname = pd.DataFrame({'Symbol': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's join the openprice and wkhigh dataframes together." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SymbolOpenPrice52wkHigh
0AAPL217.51233.47
1DHR96.54110.11
2DAL51.4560.79
3AMZN1703.342050.49
\n", "
" ], "text/plain": [ " Symbol OpenPrice 52wkHigh\n", "0 AAPL 217.51 233.47\n", "1 DHR 96.54 110.11\n", "2 DAL 51.45 60.79\n", "3 AMZN 1703.34 2050.49" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(openprice, wkhigh, how='left', left_on='Symbol', right_on='Symbol')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how our `Symbol` column isn't in the same order in each dataframe. This is intentional, and note that the dataframe on the left, `openprice` dictates the order of the dataframe on the right, `wkhigh`. Also note that the shared key between the two dataframes is exempt from having a suffix applied to it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now it's your turn!

\n", "\n", "Using `pd.merge`, join the `openprice` and `stockname` dataframes on the `Symbol` column. Then chain a second merge to add the `52wkHigh` column from `wkhigh`, combining all three dataframes. Possible solutions are shown in the next two cells." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SymbolOpenPriceName
0AAPL217.51Apple
1DHR96.54Danaher
2DAL51.45Delta Airlines
3AMZN1703.34Amazon
\n", "
" ], "text/plain": [ " Symbol OpenPrice Name\n", "0 AAPL 217.51 Apple\n", "1 DHR 96.54 Danaher\n", "2 DAL 51.45 Delta Airlines\n", "3 AMZN 1703.34 Amazon" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SymbolOpenPriceName52wkHigh
0AAPL217.51Apple233.47
1DHR96.54Danaher110.11
2DAL51.45Delta Airlines60.79
3AMZN1703.34Amazon2050.49
\n", "
" ], "text/plain": [ " Symbol OpenPrice Name 52wkHigh\n", "0 AAPL 217.51 Apple 233.47\n", "1 DHR 96.54 Danaher 110.11\n", "2 DAL 51.45 Delta Airlines 60.79\n", "3 AMZN 1703.34 Amazon 2050.49" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that we're using the DataFrame .merge() method here for brevity\n", "pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol') \\\n", " .merge(wkhigh, how='left', left_on='Symbol', right_on='Symbol')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Exercise - Adventure Works

\n", "

\n", "\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Table Joins on Live Data

\n", "\n", "Here are the data dictionaries we'll be using for the following exercise:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "p = pd.read_csv('../data/Production.Product.csv', sep='\\t')\n", "ps = pd.read_csv('../data/Production.ProductSubcategory.csv', sep='\\t')\n", "soh = pd.read_csv('../data/Sales.SalesOrderHeader.csv', sep='\\t', nrows=1000)\n", "sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t', nrows=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
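Before joining, it can help to glance at the size of each table we just loaded; a quick sketch:\n", "\n", "```python\n", "# Row and column counts for each table\n", "for name, frame in [('p', p), ('ps', ps), ('soh', soh), ('sod', sod)]:\n", "    print(name, frame.shape)\n", "```\n", "\n", "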
\n", "

Join Product Tables

\n", "\n", "Join the product table (`p`) to the product subcategory table (`ps`) on `ProductSubcategoryID`, keeping every product. A possible solution is shown below." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ProductIDName_pProductNumberMakeFlagFinishedGoodsFlagColorSafetyStockLevelReorderPointStandardCostListPrice...ProductModelIDSellStartDateSellEndDateDiscontinuedDaterowguid_pModifiedDate_pProductCategoryIDNamerowguidModifiedDate
01Adjustable RaceAR-538100NaN10007500.00.0...NaN2008-04-30 00:00:00NaNNaN{694215B7-08F7-4C0D-ACB1-D734BA44C0C8}2014-02-08 10:01:36.827000000NaNNaNNaNNaN
12Bearing BallBA-832700NaN10007500.00.0...NaN2008-04-30 00:00:00NaNNaN{58AE3C20-4F3A-4749-A7D4-D568806CC537}2014-02-08 10:01:36.827000000NaNNaNNaNNaN
23BB Ball BearingBE-234910NaN8006000.00.0...NaN2008-04-30 00:00:00NaNNaN{9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}2014-02-08 10:01:36.827000000NaNNaNNaNNaN
\n", "

3 rows × 29 columns

\n", "
" ], "text/plain": [ " ProductID Name_p ProductNumber MakeFlag FinishedGoodsFlag \\\n", "0 1 Adjustable Race AR-5381 0 0 \n", "1 2 Bearing Ball BA-8327 0 0 \n", "2 3 BB Ball Bearing BE-2349 1 0 \n", "\n", " Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n", "0 NaN 1000 750 0.0 0.0 ... \n", "1 NaN 1000 750 0.0 0.0 ... \n", "2 NaN 800 600 0.0 0.0 ... \n", "\n", " ProductModelID SellStartDate SellEndDate DiscontinuedDate \\\n", "0 NaN 2008-04-30 00:00:00 NaN NaN \n", "1 NaN 2008-04-30 00:00:00 NaN NaN \n", "2 NaN 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid_p ModifiedDate_p \\\n", "0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n", "1 {58AE3C20-4F3A-4749-A7D4-D568806CC537} 2014-02-08 10:01:36.827000000 \n", "2 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} 2014-02-08 10:01:36.827000000 \n", "\n", " ProductCategoryID Name rowguid ModifiedDate \n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "\n", "[3 rows x 29 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(p, ps, how='left', left_on='ProductSubcategoryID', right_on='ProductSubcategoryID', suffixes=('_p', '')).head(3)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join Sales Order Header and Sales Order Detail Tables

\n", "\n", "Join the sales order header table (`soh`) to the sales order detail table (`sod`) on `SalesOrderID`, keeping every header row. A possible solution is shown below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesOrderIDRevisionNumberOrderDateDueDateShipDateStatusOnlineOrderFlagSalesOrderNumberPurchaseOrderNumberAccountNumber...SalesOrderDetailIDCarrierTrackingNumberOrderQtyProductIDSpecialOfferIDUnitPriceUnitPriceDiscountLineTotalrowguid_yModifiedDate_y
04365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...1.04911-403C-981.0776.01.02024.9940.02024.994{B207C96D-D9E6-402B-8470-2CC176C42283}2011-05-31 00:00:00
14365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...2.04911-403C-983.0777.01.02024.9940.06074.982{7ABB600D-1E77-41BE-9FE5-B9142CFC08FA}2011-05-31 00:00:00
24365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...3.04911-403C-981.0778.01.02024.9940.02024.994{475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F}2011-05-31 00:00:00
\n", "

3 rows × 36 columns

\n", "
" ], "text/plain": [ " SalesOrderID RevisionNumber OrderDate DueDate \\\n", "0 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "1 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "2 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "\n", " ShipDate Status OnlineOrderFlag SalesOrderNumber \\\n", "0 2011-06-07 00:00:00 5 0 SO43659 \n", "1 2011-06-07 00:00:00 5 0 SO43659 \n", "2 2011-06-07 00:00:00 5 0 SO43659 \n", "\n", " PurchaseOrderNumber AccountNumber ... SalesOrderDetailID \\\n", "0 PO522145787 10-4020-000676 ... 1.0 \n", "1 PO522145787 10-4020-000676 ... 2.0 \n", "2 PO522145787 10-4020-000676 ... 3.0 \n", "\n", " CarrierTrackingNumber OrderQty ProductID SpecialOfferID UnitPrice \\\n", "0 4911-403C-98 1.0 776.0 1.0 2024.994 \n", "1 4911-403C-98 3.0 777.0 1.0 2024.994 \n", "2 4911-403C-98 1.0 778.0 1.0 2024.994 \n", "\n", " UnitPriceDiscount LineTotal rowguid_y \\\n", "0 0.0 2024.994 {B207C96D-D9E6-402B-8470-2CC176C42283} \n", "1 0.0 6074.982 {7ABB600D-1E77-41BE-9FE5-B9142CFC08FA} \n", "2 0.0 2024.994 {475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F} \n", "\n", " ModifiedDate_y \n", "0 2011-05-31 00:00:00 \n", "1 2011-05-31 00:00:00 \n", "2 2011-05-31 00:00:00 \n", "\n", "[3 rows x 36 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Here, we're adding in an optional concept of validation. This is a one-to-many merge,\n", "# since we can have multiple products (detail) in each sales order (header). Note the\n", "# header table is to the left of the detail table.\n", "pd.merge(soh, sod, how='left', left_on='SalesOrderID', right_on='SalesOrderID').head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Join Sales Order Header, Sales Order Detail, and Product Tables

\n", "\n", "Building on the previous join, add the product table (`p`) by joining on `ProductID`, so that each order line carries its product details. A possible solution is shown below." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesOrderIDRevisionNumberOrderDateDueDateShipDateStatusOnlineOrderFlagSalesOrderNumberPurchaseOrderNumberAccountNumber...ProductLineClassStyleProductSubcategoryIDProductModelIDSellStartDateSellEndDateDiscontinuedDaterowguidModifiedDate
04365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...MHU1.019.02011-05-31 00:00:002012-05-29 00:00:00NaN{02935111-A546-4C6D-941F-BE12D42C158E}2014-02-08 10:01:36.827000000
14365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...MHU1.019.02011-05-31 00:00:002012-05-29 00:00:00NaN{7920BC3B-8FD4-4610-93D2-E693A66B6474}2014-02-08 10:01:36.827000000
24365982011-05-31 00:00:002011-06-12 00:00:002011-06-07 00:00:0050SO43659PO52214578710-4020-000676...MHU1.019.02011-05-31 00:00:002012-05-29 00:00:00NaN{1B486300-7E64-4C5D-A9BA-A8368E20C5A0}2014-02-08 10:01:36.827000000
\n", "

3 rows × 60 columns

\n", "
" ], "text/plain": [ " SalesOrderID RevisionNumber OrderDate DueDate \\\n", "0 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "1 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "2 43659 8 2011-05-31 00:00:00 2011-06-12 00:00:00 \n", "\n", " ShipDate Status OnlineOrderFlag SalesOrderNumber \\\n", "0 2011-06-07 00:00:00 5 0 SO43659 \n", "1 2011-06-07 00:00:00 5 0 SO43659 \n", "2 2011-06-07 00:00:00 5 0 SO43659 \n", "\n", " PurchaseOrderNumber AccountNumber ... ProductLine Class Style \\\n", "0 PO522145787 10-4020-000676 ... M H U \n", "1 PO522145787 10-4020-000676 ... M H U \n", "2 PO522145787 10-4020-000676 ... M H U \n", "\n", " ProductSubcategoryID ProductModelID SellStartDate \\\n", "0 1.0 19.0 2011-05-31 00:00:00 \n", "1 1.0 19.0 2011-05-31 00:00:00 \n", "2 1.0 19.0 2011-05-31 00:00:00 \n", "\n", " SellEndDate DiscontinuedDate \\\n", "0 2012-05-29 00:00:00 NaN \n", "1 2012-05-29 00:00:00 NaN \n", "2 2012-05-29 00:00:00 NaN \n", "\n", " rowguid ModifiedDate \n", "0 {02935111-A546-4C6D-941F-BE12D42C158E} 2014-02-08 10:01:36.827000000 \n", "1 {7920BC3B-8FD4-4610-93D2-E693A66B6474} 2014-02-08 10:01:36.827000000 \n", "2 {1B486300-7E64-4C5D-A9BA-A8368E20C5A0} 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 60 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Here, again we are using the resultant dataframe of the first merge and applying the .merge\n", "# dataframe METHOD to join in the Product table. Note that the product table is a many to 1\n", "# merge - there are multiple sales orders that all may reference the same product (i.e)\n", "# we have sold a mountain bike, model xyz, more than once.\n", "pd.merge(soh, sod, how='left', left_on='SalesOrderID', right_on='SalesOrderID') \\\n", " .merge(p, how='left', left_on='ProductID', right_on='ProductID').head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }