PYTHR/unit-6-pandas/09-pandas-join/code/pandas-join-answers.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div>\n",
    "    <span>\n",
    "    <p align=\"left\">\n",
    "    <img align=\"left\" style=\"padding-right: 5px\" valign=\"center\" src=\"https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png\" width=\"28px\">\n",
    "    </p>\n",
    "    </span>\n",
    "    <span>\n",
    "        <h1>Joining Table with Pandas</h1>\n",
    "    </span>\n",
    "</div>\n",
    "\n",
    "Pandas provides support for combining `Series`, `DataFrame` and even `xarray` (3 dimensional `DataFrame`s, formerly known in pandas v0.20.0 as `Panel`s) objects with various kinds of set logic for the indicies and relational algebra functionality in the case of join / merge-type operations. More simply stated, this allows you to combine `DataFrame`s.\n",
    "\n",
    "<!-- Overview -->\n",
    "<details>\n",
    "    <summary>Overview</summary>\n",
    "    <ul>\n",
    "        <li><b>In this session, we'll cover:</b></li>\n",
    "        <br>\n",
    "        <ul>\n",
    "            <li>Concatenating objects with <code>.append()</code> and <code>.concat()</code></li>\n",
    "            <li>Combining objects with <code>.join()</code> and <code>.merge()</code></li>\n",
    "            <li>Combining timeseries objects with <code>.merge_ordered()</code></li>\n",
    "            <li>Traditionally, this functionality is performed in a relational database, such as <a href=\"https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join\">SQL</a>. With pandas, you'll be able to perform the same operations - in python! The backend is <code>numpy</code>, a powerful linear algebra library which helps keep things speedy</li>\n",
    "        </ul>\n",
    "        <br>\n",
    "        <li><b>Why Join?</b></li>\n",
    "        <br>\n",
    "        <ul>\n",
    "            <li>You might be asking yourself - why keep data separated in different files? <i>Why not just keep it all in one file?</i></li>\n",
    "            <li>The answer stems from a thing called <a href=\"https://support.microsoft.com/en-us/help/283878/description-of-the-database-normalization-basics\">database normalization</a>. When a database is <i>normalized</i>, it is structured in such a way that redundancy of data is minimized. This allows a database to be faster, smaller, and more flexible when it comes time to change the data inside of it</li>\n",
    "            <li>The manifestation of this <i>normalization</i> is data that is represented within multiple <a href=\"https://en.wikipedia.org/wiki/Table_(database)\">tables</a> (which are effectively dataframes), related to each other by <a href=\"https://www.studytonight.com/dbms/database-key.php\">keys</a>, or columns in one table that equal a column in another table, allowing them to be joined. In this case, our tables are the <code>.csv</code> files we'll be importing</li>\n",
    "        </ul>\n",
    "    </ul>\n",
    "</details>\n",
    "\n",
    "<!-- TOC -->\n",
    "<details>\n",
    "    <summary>Table of Contents</summary>\n",
    "    <ul>\n",
    "        <li><a href=\"#import\">Import</a></li>\n",
    "        <li><a href=\"#conapp\">Concatenate and Append</a></li>\n",
    "        <ul>\n",
    "            <li><a href=\"#concatenate\">Concatenate</a></li>\n",
    "            <li><a href=\"#append\">Append</a></li>\n",
    "        </ul>\n",
    "        <li><a href=\"#joining\">Joining</a></li>\n",
    "        <ul>\n",
    "            <li><a href=\"join\">Join</a></li>\n",
    "            <li><a href=\"#merge\">Merge</a></li>\n",
    "            <ul>\n",
    "                <li><a href=\"#merge_keycols\">Merge on Non-Index Columns</a></li>\n",
    "                <li><a href=\"#yourturn\">Now it's Your Turn!</a></li>\n",
    "            </ul>\n",
    "        </ul>\n",
    "        <li><a href=\"#exercise\">Exercise - AdventureWorks</a></li>\n",
    "        <ul>\n",
    "            <li><a href=\"#p_exercise\">Table Joins on Live Data</a></li>\n",
    "            <ul>\n",
    "                <li><a href=\"#ex_pp\">Join Product Tables</a></li>\n",
    "                <li><a href=\"#ex_soh_sod\">Join Sales Order Header and Sales Order Detail Tables</a></li>\n",
    "                <li><a href=\"#ex_soh_sod_pt\">Join Sales Order Header, Sales Order Detail, and Product Tables</a></li>\n",
    "            </ul>\n",
    "        </ul>\n",
    "    </ul>\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"import\"></div>\n",
    "<h2>Import Pandas</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pandas v0.24.2\n",
      "Numpy v1.16.3\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"conapp\"></div>\n",
    "<h2>Concatenate and Append</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"concatenate\"></div>\n",
    "<h3>Concatenate</h3>\n",
    "\n",
    "Concatenate sticks dataframes together, either on top of each other, or next to each other.\n",
    "\n",
    "```python\n",
    "Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)\n",
    "Docstring:\n",
    "Concatenate pandas objects along a particular axis with optional set logic\n",
    "along the other axes.\n",
    "```\n",
    "\n",
    "First, let's create two dataframes, `df1` and `df2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      a       1\n",
       "1      b       2"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# KEEP\n",
    "df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      c       3\n",
       "1      d       4"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# KEEP\n",
    "df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's stick the dataframes on top of each other using `concat`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      a       1\n",
       "1      b       2\n",
       "0      c       3\n",
       "1      d       4"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([df1, df2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's stick the dataframes <b>next</b> to each other using `concat`. Use of the `axis` kwarg will help us here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number letter  number\n",
       "0      a       1      c       3\n",
       "1      b       2      d       4"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([df1, df2], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"append\"></div>\n",
    "<h3>Append</h3>\n",
    "\n",
    "Append is very similar to `concat`, except it limits itself to a specific case of `concat`, where `axis=0` (stack on top of each other) and `join=outer` (how to handle the axis of the second dataframe). For almost all cases, `concat` has all the functionality of `append` (and more) and can replace `append` entirely.\n",
    "\n",
    "```python\n",
    "Signature: pd.DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=None)\n",
    "Docstring:\n",
    "Append rows of `other` to the end of this frame, returning a new\n",
    "object. Columns not in this frame are added as new columns.\n",
    "```\n",
    "\n",
    "Also note that `append` is a DataFrame and Series method, and not a pandas library function like `concat` is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      a       1\n",
       "1      b       2\n",
       "0      c       3\n",
       "1      d       4"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.append(df2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"joining\"></div>\n",
    "<h2>Joining</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"join\"></div>\n",
    "<h3>Join</h3>\n",
    "\n",
    "`join` allows us to compare two dataframes, and combine them by using a matching column known as a `key`. Normally, during joins, this key is explicitly stated (we'll get to this with `merge` in our next example). With `join`, the `key` joining the table is always the `index` of the first table with (by default) the index of the second table. \n",
    "\n",
    "```python\n",
    "Signature: pd.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n",
    "Docstring:\n",
    "Join columns with other DataFrame either on index or on a key\n",
    "column. Efficiently Join multiple DataFrame objects by index at once by\n",
    "passing a list.\n",
    "```\n",
    "\n",
    "First, let's create two dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      a       1\n",
       "1      b       2\n",
       "2      c       3\n",
       "3      d       4"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# KEEP\n",
    "df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4]], columns=['letter', 'number'])\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>e</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>f</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter  number\n",
       "0      e       5\n",
       "1      f       6"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# KEEP\n",
    "df2 = pd.DataFrame([['e', 5], ['f', 6]], columns=['letter', 'number'])\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, lets `join` these two dataframes. Note that we will `key`, or 'line up', the two dataframes based on their `indicies`.\n",
    "\n",
    "Note that, when joining dataframes with any common column names, we will need to supply a `lsuffix` or `rsuffix` kwarg. This is appended to the end of the column name of the returned, joined dataframe to differentiate and identify the source column. Here, we'll use `_df1` to identify that the column shown came from the `df1` dataframe, and `_df2` as a suffix to identify its origin as the `df2` dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter_df1</th>\n",
       "      <th>number_df1</th>\n",
       "      <th>letter_df2</th>\n",
       "      <th>number_df2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "      <td>e</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter_df1  number_df1 letter_df2  number_df2\n",
       "0          a           1          e         5.0\n",
       "1          b           2          f         6.0\n",
       "2          c           3        NaN         NaN\n",
       "3          d           4        NaN         NaN"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1.join(df2, lsuffix='_df1', rsuffix='_df2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how we have joined the two dataframes on their indicies, which creates a null for rows of index 2 and 3 in `df2`. This is expected and correct.\n",
    "\n",
    "Also note that the default join behavior of `join` is `left`. We can change this with the `how` kwarg.\n",
    "\n",
    "For reference, here are the common types of joins. Join types won't be covered in this lesson.\n",
    "<p align=\"center\">\n",
    "<img width=\"500px\" src=\"https://i.stack.imgur.com/udQpD.jpg\">\n",
    "</p>\n",
    "\n",
    "The type of join we performed above is shown in the upper-left most figure in the above chart."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"merge\"></div>\n",
    "<h3>Merge</h3>\n",
    "\n",
    "Similar to `join` is `merge`. The difference between the two is the <i>keying behavior</i>. `merge` has a richer API (more functionality) and allows one to join on columns in the source dataframe <i>other than the index</i>. Because `merge` can effectively do everything that `join` can do, and more - it is recommended to always use `merge` unless code brevity is the top concern. \n",
    "\n",
    "```python\n",
    "Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n",
    "Docstring:\n",
    "Merge DataFrame objects by performing a database-style join operation by\n",
    "columns or indexes.\n",
    "```\n",
    "\n",
    "Note that `merge` is <i>both</i> a DataFrame method as well as a pandas function. Below, we'll be using the pandas function, `pd.merge()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>letter_df1</th>\n",
       "      <th>number_df1</th>\n",
       "      <th>letter_df2</th>\n",
       "      <th>number_df2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>1</td>\n",
       "      <td>e</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>2</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  letter_df1  number_df1 letter_df2  number_df2\n",
       "0          a           1          e         5.0\n",
       "1          b           2          f         6.0\n",
       "2          c           3        NaN         NaN\n",
       "3          d           4        NaN         NaN"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(df1, df2, how='left', left_index=True, right_index=True, suffixes=('_df1', '_df2'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we've achieved the same exact output as we did with `join`, but it took a little more explicit work. Let's run through the arguments for clarity:\n",
    "\n",
    "<ul>\n",
    "    <li><code>df1</code>: this is the first dataframe, and considered to be on the 'left' of <code>df2</code></li>\n",
    "    <li><code>df2</code>: this is the second dataframe, considered to be on the right of <code>df1</code></li>\n",
    "    <li><code>how='left'</code>: this states the type of join; see the above SQL join table</li>\n",
    "    <li><code>left_index=True</code>: this uses the index of <code>df1</code> as the join key for the left table</li>\n",
    "    <li><code>right_index=True</code>: this uses the index of <code>df2</code> as the join key for the right table</li>\n",
    "    <li><code>suffixes</code>: this places <code>_df1</code> after column names which came from <code>df1</code></li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"merge_keycols\"></div>\n",
    "<h4>Merge on Non-Index Columns</h4>\n",
    "\n",
    "This brings us to our next point: merging on columns that are not the index columns. This is very, very common in SQL joins and this technique can be used to join just about any DataFrame.\n",
    "\n",
    "First, let's create some more realistic data - stocks!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# KEEP\n",
    "openprice = pd.DataFrame({'Symbol': ['AAPL', 'DHR', 'DAL', 'AMZN'], 'OpenPrice': [217.51, 96.54, 51.45, 1703.34]})\n",
    "wkhigh = pd.DataFrame({'Symbol': ['DAL', 'AMZN', 'AAPL', 'DHR'], '52wkHigh': [60.79, 2050.49, 233.47, 110.11]})\n",
    "stockname = pd.DataFrame({'Symbol': ['AMZN', 'DHR', 'DAL', 'AAPL'], 'Name': ['Amazon', 'Danaher', 'Delta Airlines', 'Apple']})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's join the <code>openprice</code> and <code>wkhigh</code> dataframes together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Symbol</th>\n",
       "      <th>OpenPrice</th>\n",
       "      <th>52wkHigh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAPL</td>\n",
       "      <td>217.51</td>\n",
       "      <td>233.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DHR</td>\n",
       "      <td>96.54</td>\n",
       "      <td>110.11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DAL</td>\n",
       "      <td>51.45</td>\n",
       "      <td>60.79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AMZN</td>\n",
       "      <td>1703.34</td>\n",
       "      <td>2050.49</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Symbol  OpenPrice  52wkHigh\n",
       "0   AAPL     217.51    233.47\n",
       "1    DHR      96.54    110.11\n",
       "2    DAL      51.45     60.79\n",
       "3   AMZN    1703.34   2050.49"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(openprice, wkhigh, how='left', left_on='Symbol', right_on='Symbol')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how our `Symbol` column isn't in the same order in each dataframe. This is intentional, and note that the dataframe on the left, `openprice` dictates the order of the dataframe on the right, `wkhigh`. Also note that the shared key between the two dataframes is exempt from having a <code>suffix</code> applied to it. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"yourturn\"></div>\n",
    "<h4>Now it's your turn!</h4>\n",
    "\n",
    "<ul>\n",
    "    <li><code>merge</code> the <code>openprice</code> and <code>stockname</code> dataframes and inspect the result</li>\n",
    "    <li><code>merge</code> all three dataframes together and inspect the result</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Symbol</th>\n",
       "      <th>OpenPrice</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAPL</td>\n",
       "      <td>217.51</td>\n",
       "      <td>Apple</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DHR</td>\n",
       "      <td>96.54</td>\n",
       "      <td>Danaher</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DAL</td>\n",
       "      <td>51.45</td>\n",
       "      <td>Delta Airlines</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AMZN</td>\n",
       "      <td>1703.34</td>\n",
       "      <td>Amazon</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Symbol  OpenPrice            Name\n",
       "0   AAPL     217.51           Apple\n",
       "1    DHR      96.54         Danaher\n",
       "2    DAL      51.45  Delta Airlines\n",
       "3   AMZN    1703.34          Amazon"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Symbol</th>\n",
       "      <th>OpenPrice</th>\n",
       "      <th>Name</th>\n",
       "      <th>52wkHigh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAPL</td>\n",
       "      <td>217.51</td>\n",
       "      <td>Apple</td>\n",
       "      <td>233.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DHR</td>\n",
       "      <td>96.54</td>\n",
       "      <td>Danaher</td>\n",
       "      <td>110.11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DAL</td>\n",
       "      <td>51.45</td>\n",
       "      <td>Delta Airlines</td>\n",
       "      <td>60.79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AMZN</td>\n",
       "      <td>1703.34</td>\n",
       "      <td>Amazon</td>\n",
       "      <td>2050.49</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Symbol  OpenPrice            Name  52wkHigh\n",
       "0   AAPL     217.51           Apple    233.47\n",
       "1    DHR      96.54         Danaher    110.11\n",
       "2    DAL      51.45  Delta Airlines     60.79\n",
       "3   AMZN    1703.34          Amazon   2050.49"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Note that we're using the DataFrame .merge() method here for brevity\n",
    "pd.merge(openprice, stockname, how='left', left_on='Symbol', right_on='Symbol') \\\n",
    "    .merge(wkhigh, how='left', left_on='Symbol', right_on='Symbol')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"exercise\"></div>\n",
    "<h2>Exercise - Adventure Works</h2>\n",
    "<p align=\"right\">\n",
    "<img src=\"http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800\">\n",
    "</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"p_exercise\"></div>\n",
    "<h3>Table Joins on Live Data</h3>\n",
    "\n",
    "Here are the data dictionaries we'll be using for the following exercise:\n",
    "\n",
    "<ul>\n",
    "    <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html\">Production.Product</a></li>\n",
    "    <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Production.ProductSubCategory.html\">Production.ProductSubcategory</a></li>\n",
    "    <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderHeader.html\">Sales.SalesOrderHeader</a></li>\n",
    "    <li><a href=\"https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html\">Sales.SalesOrderDetail</a></li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = pd.read_csv('../data/Production.Product.csv', sep='\\t')\n",
    "ps = pd.read_csv('../data/Production.ProductSubcategory.csv', sep='\\t')\n",
    "soh = pd.read_csv('../data/Sales.SalesOrderHeader.csv', sep='\\t', nrows=1000)\n",
    "sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t', nrows=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"ex_pp\"></div>\n",
    "<h4>Join Product Tables</h4>\n",
    "\n",
    "<ul>\n",
    "    <li>Using the <code>Production.Product.ProductID</code> and <code>Production.ProductSubcategory.ProductID</code> keys, join the <code>Production.Product</code> and <code>Production.ProductSubcategory</code> tables</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ProductID</th>\n",
       "      <th>Name_p</th>\n",
       "      <th>ProductNumber</th>\n",
       "      <th>MakeFlag</th>\n",
       "      <th>FinishedGoodsFlag</th>\n",
       "      <th>Color</th>\n",
       "      <th>SafetyStockLevel</th>\n",
       "      <th>ReorderPoint</th>\n",
       "      <th>StandardCost</th>\n",
       "      <th>ListPrice</th>\n",
       "      <th>...</th>\n",
       "      <th>ProductModelID</th>\n",
       "      <th>SellStartDate</th>\n",
       "      <th>SellEndDate</th>\n",
       "      <th>DiscontinuedDate</th>\n",
       "      <th>rowguid_p</th>\n",
       "      <th>ModifiedDate_p</th>\n",
       "      <th>ProductCategoryID</th>\n",
       "      <th>Name</th>\n",
       "      <th>rowguid</th>\n",
       "      <th>ModifiedDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Adjustable Race</td>\n",
       "      <td>AR-5381</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1000</td>\n",
       "      <td>750</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-04-30 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{694215B7-08F7-4C0D-ACB1-D734BA44C0C8}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Bearing Ball</td>\n",
       "      <td>BA-8327</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1000</td>\n",
       "      <td>750</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-04-30 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{58AE3C20-4F3A-4749-A7D4-D568806CC537}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>BB Ball Bearing</td>\n",
       "      <td>BE-2349</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>800</td>\n",
       "      <td>600</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008-04-30 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   ProductID           Name_p ProductNumber  MakeFlag  FinishedGoodsFlag  \\\n",
       "0          1  Adjustable Race       AR-5381         0                  0   \n",
       "1          2     Bearing Ball       BA-8327         0                  0   \n",
       "2          3  BB Ball Bearing       BE-2349         1                  0   \n",
       "\n",
       "  Color  SafetyStockLevel  ReorderPoint  StandardCost  ListPrice  ...  \\\n",
       "0   NaN              1000           750           0.0        0.0  ...   \n",
       "1   NaN              1000           750           0.0        0.0  ...   \n",
       "2   NaN               800           600           0.0        0.0  ...   \n",
       "\n",
       "  ProductModelID        SellStartDate SellEndDate  DiscontinuedDate  \\\n",
       "0            NaN  2008-04-30 00:00:00         NaN               NaN   \n",
       "1            NaN  2008-04-30 00:00:00         NaN               NaN   \n",
       "2            NaN  2008-04-30 00:00:00         NaN               NaN   \n",
       "\n",
       "                                rowguid_p                 ModifiedDate_p  \\\n",
       "0  {694215B7-08F7-4C0D-ACB1-D734BA44C0C8}  2014-02-08 10:01:36.827000000   \n",
       "1  {58AE3C20-4F3A-4749-A7D4-D568806CC537}  2014-02-08 10:01:36.827000000   \n",
       "2  {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}  2014-02-08 10:01:36.827000000   \n",
       "\n",
       "  ProductCategoryID Name  rowguid  ModifiedDate  \n",
       "0               NaN  NaN      NaN           NaN  \n",
       "1               NaN  NaN      NaN           NaN  \n",
       "2               NaN  NaN      NaN           NaN  \n",
       "\n",
       "[3 rows x 29 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(p, ps, how='left', left_on='ProductSubcategoryID', right_on='ProductSubcategoryID', suffixes=('_p', '')).head(3)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"ex_soh_sod\"></div>\n",
    "<h4>Join Sales Order Header and Sales Order Detail Tables</h4>\n",
    "\n",
    "<ul>\n",
    "    <li>Join the <code>Sales.SalesOrderHeader</code> and <code>Sales.SalesOrderDetail</code> tables</li>\n",
    "    <li>Don't forget to use your data dictionaries!</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SalesOrderID</th>\n",
       "      <th>RevisionNumber</th>\n",
       "      <th>OrderDate</th>\n",
       "      <th>DueDate</th>\n",
       "      <th>ShipDate</th>\n",
       "      <th>Status</th>\n",
       "      <th>OnlineOrderFlag</th>\n",
       "      <th>SalesOrderNumber</th>\n",
       "      <th>PurchaseOrderNumber</th>\n",
       "      <th>AccountNumber</th>\n",
       "      <th>...</th>\n",
       "      <th>SalesOrderDetailID</th>\n",
       "      <th>CarrierTrackingNumber</th>\n",
       "      <th>OrderQty</th>\n",
       "      <th>ProductID</th>\n",
       "      <th>SpecialOfferID</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>UnitPriceDiscount</th>\n",
       "      <th>LineTotal</th>\n",
       "      <th>rowguid_y</th>\n",
       "      <th>ModifiedDate_y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4911-403C-98</td>\n",
       "      <td>1.0</td>\n",
       "      <td>776.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2024.994</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2024.994</td>\n",
       "      <td>{B207C96D-D9E6-402B-8470-2CC176C42283}</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4911-403C-98</td>\n",
       "      <td>3.0</td>\n",
       "      <td>777.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2024.994</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6074.982</td>\n",
       "      <td>{7ABB600D-1E77-41BE-9FE5-B9142CFC08FA}</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4911-403C-98</td>\n",
       "      <td>1.0</td>\n",
       "      <td>778.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2024.994</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2024.994</td>\n",
       "      <td>{475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F}</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 36 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   SalesOrderID  RevisionNumber            OrderDate              DueDate  \\\n",
       "0         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "1         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "2         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "\n",
       "              ShipDate  Status  OnlineOrderFlag SalesOrderNumber  \\\n",
       "0  2011-06-07 00:00:00       5                0          SO43659   \n",
       "1  2011-06-07 00:00:00       5                0          SO43659   \n",
       "2  2011-06-07 00:00:00       5                0          SO43659   \n",
       "\n",
       "  PurchaseOrderNumber   AccountNumber  ...  SalesOrderDetailID  \\\n",
       "0         PO522145787  10-4020-000676  ...                 1.0   \n",
       "1         PO522145787  10-4020-000676  ...                 2.0   \n",
       "2         PO522145787  10-4020-000676  ...                 3.0   \n",
       "\n",
       "   CarrierTrackingNumber  OrderQty  ProductID  SpecialOfferID  UnitPrice  \\\n",
       "0           4911-403C-98       1.0      776.0             1.0   2024.994   \n",
       "1           4911-403C-98       3.0      777.0             1.0   2024.994   \n",
       "2           4911-403C-98       1.0      778.0             1.0   2024.994   \n",
       "\n",
       "   UnitPriceDiscount LineTotal                               rowguid_y  \\\n",
       "0                0.0  2024.994  {B207C96D-D9E6-402B-8470-2CC176C42283}   \n",
       "1                0.0  6074.982  {7ABB600D-1E77-41BE-9FE5-B9142CFC08FA}   \n",
       "2                0.0  2024.994  {475CF8C6-49F6-486E-B0AD-AFC6A50CDD2F}   \n",
       "\n",
       "        ModifiedDate_y  \n",
       "0  2011-05-31 00:00:00  \n",
       "1  2011-05-31 00:00:00  \n",
       "2  2011-05-31 00:00:00  \n",
       "\n",
       "[3 rows x 36 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here, we're adding in an optional concept of validation. This is a one-to-many merge,\n",
    "# since we can have multiple products (detail) in each sales order (header). Note the\n",
    "# header table is to the left of the detail table.\n",
    "pd.merge(soh, sod, how='left', left_on='SalesOrderID', right_on='SalesOrderID').head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div id=\"ex_soh_sod_pt\"></div>\n",
    "<h4>Join Sales Order Header, Sales Order Detail, and Product Tables</h4>\n",
    "\n",
    "<ul>\n",
    "    <li>Join the <code>Sales.SalesOrderHeader</code>, <code>Sales.SalesOrderDetail</code>, and <code>Production.Product</code> tables</li>\n",
    "    <li>Don't forget to use your data dictionaries!</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SalesOrderID</th>\n",
       "      <th>RevisionNumber</th>\n",
       "      <th>OrderDate</th>\n",
       "      <th>DueDate</th>\n",
       "      <th>ShipDate</th>\n",
       "      <th>Status</th>\n",
       "      <th>OnlineOrderFlag</th>\n",
       "      <th>SalesOrderNumber</th>\n",
       "      <th>PurchaseOrderNumber</th>\n",
       "      <th>AccountNumber</th>\n",
       "      <th>...</th>\n",
       "      <th>ProductLine</th>\n",
       "      <th>Class</th>\n",
       "      <th>Style</th>\n",
       "      <th>ProductSubcategoryID</th>\n",
       "      <th>ProductModelID</th>\n",
       "      <th>SellStartDate</th>\n",
       "      <th>SellEndDate</th>\n",
       "      <th>DiscontinuedDate</th>\n",
       "      <th>rowguid</th>\n",
       "      <th>ModifiedDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>M</td>\n",
       "      <td>H</td>\n",
       "      <td>U</td>\n",
       "      <td>1.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2012-05-29 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{02935111-A546-4C6D-941F-BE12D42C158E}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>M</td>\n",
       "      <td>H</td>\n",
       "      <td>U</td>\n",
       "      <td>1.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2012-05-29 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{7920BC3B-8FD4-4610-93D2-E693A66B6474}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>43659</td>\n",
       "      <td>8</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2011-06-12 00:00:00</td>\n",
       "      <td>2011-06-07 00:00:00</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>SO43659</td>\n",
       "      <td>PO522145787</td>\n",
       "      <td>10-4020-000676</td>\n",
       "      <td>...</td>\n",
       "      <td>M</td>\n",
       "      <td>H</td>\n",
       "      <td>U</td>\n",
       "      <td>1.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>2011-05-31 00:00:00</td>\n",
       "      <td>2012-05-29 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{1B486300-7E64-4C5D-A9BA-A8368E20C5A0}</td>\n",
       "      <td>2014-02-08 10:01:36.827000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 60 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   SalesOrderID  RevisionNumber            OrderDate              DueDate  \\\n",
       "0         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "1         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "2         43659               8  2011-05-31 00:00:00  2011-06-12 00:00:00   \n",
       "\n",
       "              ShipDate  Status  OnlineOrderFlag SalesOrderNumber  \\\n",
       "0  2011-06-07 00:00:00       5                0          SO43659   \n",
       "1  2011-06-07 00:00:00       5                0          SO43659   \n",
       "2  2011-06-07 00:00:00       5                0          SO43659   \n",
       "\n",
       "  PurchaseOrderNumber   AccountNumber  ...  ProductLine  Class  Style  \\\n",
       "0         PO522145787  10-4020-000676  ...           M      H      U    \n",
       "1         PO522145787  10-4020-000676  ...           M      H      U    \n",
       "2         PO522145787  10-4020-000676  ...           M      H      U    \n",
       "\n",
       "   ProductSubcategoryID  ProductModelID        SellStartDate  \\\n",
       "0                   1.0            19.0  2011-05-31 00:00:00   \n",
       "1                   1.0            19.0  2011-05-31 00:00:00   \n",
       "2                   1.0            19.0  2011-05-31 00:00:00   \n",
       "\n",
       "           SellEndDate DiscontinuedDate  \\\n",
       "0  2012-05-29 00:00:00              NaN   \n",
       "1  2012-05-29 00:00:00              NaN   \n",
       "2  2012-05-29 00:00:00              NaN   \n",
       "\n",
       "                                  rowguid                   ModifiedDate  \n",
       "0  {02935111-A546-4C6D-941F-BE12D42C158E}  2014-02-08 10:01:36.827000000  \n",
       "1  {7920BC3B-8FD4-4610-93D2-E693A66B6474}  2014-02-08 10:01:36.827000000  \n",
       "2  {1B486300-7E64-4C5D-A9BA-A8368E20C5A0}  2014-02-08 10:01:36.827000000  \n",
       "\n",
       "[3 rows x 60 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here, again we are using the resultant dataframe of the first merge and applying the .merge\n",
    "# dataframe METHOD to join in the Product table. Note that the product table is a many to 1\n",
    "# merge - there are multiple sales orders that all may reference the same product (i.e)\n",
    "# we have sold a mountain bike, model xyz, more than once.\n",
    "pd.merge(soh, sod, how='left', left_on='SalesOrderID', right_on='SalesOrderID') \\\n",
    "    .merge(p, how='left', left_on='ProductID', right_on='ProductID').head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}