{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div>\n",
|
|
" <span>\n",
|
|
" <p align=\"left\">\n",
|
|
" <img align=\"left\" style=\"padding-right: 5px\" valign=\"center\" src=\"https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png\" width=\"28px\">\n",
|
|
" </p>\n",
|
|
" </span>\n",
|
|
" <span>\n",
|
|
" <h1>Consumer Sales Lab</h1>\n",
|
|
" </span>\n",
|
|
"</div>\n",
|
|
"\n",
|
|
"<font color='red'><strong>Important!!</font></strong>: This lab is fairly challenging and may take longer than 60m to complete. Because of this, we've included a <a href='#shortcut'>shortcut cell</a> that _skips the EDA section_ of this lab and lets you get right to the analysis section. Work with your instructor and use your best judgement to use your time wisely to focus on the areas you'd like to practice.\n",
|
|
"\n",
|
|
"This lab can be conducted in one of two ways:\n",
|
|
"\n",
|
|
"- <a href=\"#eda\">EDA</a> _and_ <a href=\"#analysis\">Analysis</a>\n",
|
|
"- <a href=\"#analysis\">Analysis</a> only\n",
|
|
"\n",
|
|
"The <a href=\"#eda\">EDA</a> section covers the following topics:\n",
|
|
"\n",
|
|
"- <a href='#import'>Importing</a> data from csvs\n",
|
|
"- <a href=\"#nulls\">Handling nulls</a>\n",
|
|
"- <a href='#dtypes'>Casting different Dtypes</a>\n",
|
|
"- <a href=\"#join\">Complex joining</a> of star-schema tables\n",
|
|
"\n",
|
|
"The <a href=\"#analysis\">Analysis</a> section covers the following topics:\n",
|
|
"\n",
|
|
"- <a href='#fe'>Feature engineering</a>\n",
|
|
"- <a href=\"#visualization\">Visualization and Reporting</a>\n",
|
|
"\n",
|
|
"<details>\n",
|
|
" <summary>Table of Contents</summary>\n",
|
|
" <ul>\n",
|
|
" <li><a href=\"#eda\">EDA</a></li>\n",
|
|
" <ul>\n",
|
|
" <li><a href='#import'>Import</a></li>\n",
|
|
" <li><a href=\"#nulls\">Nulls</a></li>\n",
|
|
" <li><a href='#dtypes'>Dtypes</a></li>\n",
|
|
" <li><a href=\"#join\">Join</a></li>\n",
|
|
" </ul>\n",
|
|
" <li><a href=\"#analysis\">Analysis</a></li>\n",
|
|
" <ul>\n",
|
|
" <li><a href='#fe'>Feature Engineering</a></li>\n",
|
|
" <li><a href=\"#visualization\">Visualization and Reporting</a></li>\n",
|
|
" <ul>\n",
|
|
" <li><a href='#1a'>1.A</a></li>\n",
|
|
" <li><a href=\"#1b\">1.B</a></li>\n",
|
|
" <li><a href='#1c'>1.C</a></li>\n",
|
|
" <li><a href=\"#2a\">2.A</a></li>\n",
|
|
" <li><a href=\"#3a\">3.A</a></li>\n",
|
|
" </ul>\n",
|
|
" </ul>\n",
|
|
" </ul>\n",
|
|
"</details>\n",
|
|
"<details>\n",
|
|
" <summary>Background</summary>\n",
|
|
" <ul>\n",
|
|
" <li>Originally adapted from <a href=\"https://sense-demo.qlik.com/sense/app/372cbc85-f7fb-4db6-a620-9a5367845dce\">qlik</a>, we'll be performing EDA on a consumer data set.</li>\n",
|
|
" </ul>\n",
|
|
"</details>\n",
|
|
"<details id='prompts'>\n",
|
|
" <summary>Prompts</summary>\n",
|
|
" <br>\n",
|
|
" Your boss, Joanna, has requested a report on the following:\n",
|
|
" <ol>\n",
|
|
" <li>Product Sales</li>\n",
|
|
" <ol>\n",
|
|
" <li>Gross margin analysis by product group.</li>\n",
|
|
" <li>Sales by product group, top 10 product groups only.</li>\n",
|
|
" <li>Sales, by year/month, year over year</li>\n",
|
|
" </ol>\n",
|
|
" <li>Sales Reps</li>\n",
|
|
" <ol>\n",
|
|
" <li>Sum of Sales and sales quantity, by rep, by customer</li>\n",
|
|
" </ol>\n",
|
|
" <li>Supply Chain</li>\n",
|
|
" <ol>\n",
|
|
" <li>Inventory vs Lead Time for all products</li>\n",
|
|
" </ol>\n",
|
|
" </ol>\n",
|
|
"</details>\n",
|
|
"<details id='dictionary'>\n",
|
|
" <summary>Data Dictionary</summary>\n",
|
|
" <br>\n",
|
|
" <!-- table created with https://www.tablesgenerator.com/html_tables please see ../assets/dictionary.tgn file -->\n",
|
|
" <table>\n",
|
|
" <tr>\n",
|
|
" <th>Table</th>\n",
|
|
" <th>Field</th>\n",
|
|
" <th>Description</th>\n",
|
|
" <th>PK</th>\n",
|
|
" <th>FK</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Item master.xls</td>\n",
|
|
" <td>Item Number</td>\n",
|
|
" <td>Foreign key to Sales.Item Number field. Unique identifier for item</td>\n",
|
|
" <td>Y</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Item master.xls</td>\n",
|
|
" <td>Product Group</td>\n",
|
|
" <td>Group for the product, i.e. Frozen Foods, Deli, etc</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Item master.xls</td>\n",
|
|
" <td>Product Line</td>\n",
|
|
" <td>Product line, i.e. Food, Drink, etc</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Item master.xls</td>\n",
|
|
" <td>Product Sub Group</td>\n",
|
|
" <td>Detail field for the Product Group field, i.e. Produce -> Fresh Vegetables</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Item master.xls</td>\n",
|
|
" <td>Product Type</td>\n",
|
|
" <td>Type of product and additional detail at the sub group level, i.e. 'Breakfast Foods'</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Manager</td>\n",
|
|
" <td>Name of manager</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr> \n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Manager Number</td>\n",
|
|
" <td>ID of manager</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Path</td>\n",
|
|
" <td>Order through which sales passes through reps, separated by hyphens. Correlates with Sales Rep ID key.</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Sales Rep Name</td>\n",
|
|
" <td>Primary sales rep name</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Sales Rep Name 1</td>\n",
|
|
" <td>Secondary sales rep name (nullable)</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Sales Rep Name 2</td>\n",
|
|
" <td>Tertiary sales rep name (nullable)</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Sales Rep Name 3</td>\n",
|
|
" <td>Quaterinary sales rep name (nullable)</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales rep.csv</td>\n",
|
|
" <td>Sales Rep ID</td>\n",
|
|
" <td>Foreign key to Sales. UID for path.</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Customers.xlsx</td>\n",
|
|
" <td>Customer</td>\n",
|
|
" <td>Name of customer</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Customers.xlsx</td>\n",
|
|
" <td>Customer Number</td>\n",
|
|
" <td>Unique identifier for customer name, keys to Sales.Customer Number</td>\n",
|
|
" <td>Y</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Customers.xlsx</td>\n",
|
|
" <td>City Code</td>\n",
|
|
" <td>City ID, foreign key for City.City Code</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>City</td>\n",
|
|
" <td>Name of city</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>City Code</td>\n",
|
|
" <td>ID of city name</td>\n",
|
|
" <td>Y</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>Region</td>\n",
|
|
" <td>Sales region (i.e. USA, Nordic, etc)</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>Latitude</td>\n",
|
|
" <td>Latitude of city</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>Longitude</td>\n",
|
|
" <td>Longitude of city</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Cities.xlsx</td>\n",
|
|
" <td>Desc</td>\n",
|
|
" <td>String description of city, including city, state (if applicable), and country</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>%KEY</td>\n",
|
|
" <td>Primary key of table</td>\n",
|
|
" <td>Y</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Cost</td>\n",
|
|
" <td>Total cost of sale for transaction [USD]</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Customer Number</td>\n",
|
|
" <td>Customer number, keys to Customer.Customer Number</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Date</td>\n",
|
|
" <td>Date of sale</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>GrossSales</td>\n",
|
|
" <td>Gross sale for invoice [USD]</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Invoice Date</td>\n",
|
|
" <td>Date of invoice</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Item Desc</td>\n",
|
|
" <td>Description of invoiced item</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Item Number</td>\n",
|
|
" <td>ID of invoiced item (product) - not a primary key. Keys to Item.Item Number</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Margin</td>\n",
|
|
" <td>Percent gross margin of line item sale</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Order Number</td>\n",
|
|
" <td>ID of the order placed</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Promised Delivery Date</td>\n",
|
|
" <td>Agreed date of delivery</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Sales</td>\n",
|
|
" <td>Gross sale for invoice [USD], less cost of sale</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Sales Qty</td>\n",
|
|
" <td>Qty of invoiced item sold (see Item Number, Item Desc)</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Sales.xlsx</td>\n",
|
|
" <td>Sales Rep Number</td>\n",
|
|
" <td>Sales rep ID credited with sale</td>\n",
|
|
" <td>N</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
"</table>\n",
|
|
"</details>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='eda'></div>\n",
|
|
"<h2>EDA</h2>\n",
|
|
"\n",
|
|
"Before we create our charts/reports for Joanna, we'll need to sanity check our input data. We'll get to the analysis (feature engineering) and visualization and reporting in just a bit."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='import'></div>\n",
|
|
"<h3>Import Data</h3>\n",
|
|
"Read in the data. Check the raw file to make sure you understand quote characters, delimiters, and encoding. You will need to use the encoding flag here since we are dealing with international character sets."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Conduct any library imports here\n",
|
|
"import pandas as pd\n",
|
|
"from matplotlib.ticker import FormatStrFormatter\n",
|
|
"import matplotlib\n",
|
|
"matplotlib.use('nbagg')\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"%matplotlib inline"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Read in your sales, cities, customers, item_master, and sales_rep csvs here.\n",
|
|
"sales = pd.read_csv('../data/sales.csv')\n",
|
|
"cities = pd.read_csv('../data/cities.csv', encoding = \"ISO-8859-1\")\n",
|
|
"customers = pd.read_csv('../data/customers.csv', encoding = \"ISO-8859-1\")\n",
|
|
"item_master = pd.read_csv('../data/item_master.csv')\n",
|
|
"sales_rep = pd.read_csv('../data/sales_rep.csv')"
|
|
]
|
|
},
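A minimal sketch of why the encoding matters, using a made-up two-row file held in memory rather than the lab's actual CSVs. The city rows and the names `raw_bytes` and `cities_demo` are assumptions for illustration only. Bytes written as ISO-8859-1 generally cannot be decoded as UTF-8 once they contain accented characters, which is the error you would likely hit without the `encoding` argument.

```python
import io

import pandas as pd

# Hypothetical file contents; "Malmö" contains a non-ASCII character.
raw_bytes = "City,City Code\nMalmö,1\nYokohama,95\n".encode("ISO-8859-1")

# Reading Latin-1 bytes as UTF-8 fails on the byte that encodes "ö".
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the matching encoding recovers the original text.
cities_demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(utf8_failed, cities_demo["City"].tolist())
```

If a file's encoding is unknown, opening it in a text editor that reports the encoding is usually quicker than guessing.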
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>%KEY</th>\n",
|
|
" <th>Cost</th>\n",
|
|
" <th>Customer Number</th>\n",
|
|
" <th>Date</th>\n",
|
|
" <th>GrossSales</th>\n",
|
|
" <th>Invoice Date</th>\n",
|
|
" <th>Invoice Number</th>\n",
|
|
" <th>Item Desc</th>\n",
|
|
" <th>Item Number</th>\n",
|
|
" <th>Margin</th>\n",
|
|
" <th>Order Number</th>\n",
|
|
" <th>Promised Delivery Date</th>\n",
|
|
" <th>Sales</th>\n",
|
|
" <th>Sales Qty</th>\n",
|
|
" <th>Sales Rep Number</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>3428</td>\n",
|
|
" <td>-513.15</td>\n",
|
|
" <td>10012226</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-573.3835</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>318960</td>\n",
|
|
" <td>Cutting Edge Sliced Ham</td>\n",
|
|
" <td>10696</td>\n",
|
|
" <td>-37.29</td>\n",
|
|
" <td>115785</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-550.44</td>\n",
|
|
" <td>-1.0</td>\n",
|
|
" <td>180</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>3429</td>\n",
|
|
" <td>-105.93</td>\n",
|
|
" <td>10012226</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-204.6638</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>318960</td>\n",
|
|
" <td>Washington Cranberry Juice</td>\n",
|
|
" <td>10009</td>\n",
|
|
" <td>-90.54</td>\n",
|
|
" <td>115785</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-196.47</td>\n",
|
|
" <td>-2.0</td>\n",
|
|
" <td>180</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3430</td>\n",
|
|
" <td>-88.07</td>\n",
|
|
" <td>10012226</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-165.8016</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>318960</td>\n",
|
|
" <td>Moms Sliced Ham</td>\n",
|
|
" <td>10385</td>\n",
|
|
" <td>-71.10</td>\n",
|
|
" <td>115785</td>\n",
|
|
" <td>1/12/2012</td>\n",
|
|
" <td>-159.17</td>\n",
|
|
" <td>-3.0</td>\n",
|
|
" <td>180</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" %KEY Cost Customer Number Date GrossSales Invoice Date \\\n",
|
|
"0 3428 -513.15 10012226 1/12/2012 -573.3835 1/12/2012 \n",
|
|
"1 3429 -105.93 10012226 1/12/2012 -204.6638 1/12/2012 \n",
|
|
"2 3430 -88.07 10012226 1/12/2012 -165.8016 1/12/2012 \n",
|
|
"\n",
|
|
" Invoice Number Item Desc Item Number Margin \\\n",
|
|
"0 318960 Cutting Edge Sliced Ham 10696 -37.29 \n",
|
|
"1 318960 Washington Cranberry Juice 10009 -90.54 \n",
|
|
"2 318960 Moms Sliced Ham 10385 -71.10 \n",
|
|
"\n",
|
|
" Order Number Promised Delivery Date Sales Sales Qty Sales Rep Number \n",
|
|
"0 115785 1/12/2012 -550.44 -1.0 180 \n",
|
|
"1 115785 1/12/2012 -196.47 -2.0 180 \n",
|
|
"2 115785 1/12/2012 -159.17 -3.0 180 "
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"sales.head(3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='nulls'></div>\n",
|
|
"<h3>Nulls</h3>\n",
|
|
"Check for nulls and missing values in all imported tables. If you are filling missing values, state your reasoning for dropping and/or imputing data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Sales nulls: 0\n",
|
|
"Customer nulls: 0\n",
|
|
"Item Master nulls: 0\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# No nulls in Sales\n",
|
|
"print(f'Sales nulls: {sales.isnull().sum().sum()}')\n",
|
|
"print(f'Customer nulls: {customers.isnull().sum().sum()}')\n",
|
|
"print(f'Item Master nulls: {item_master.isnull().sum().sum()}')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Manager</th>\n",
|
|
" <th>Manager Number</th>\n",
|
|
" <th>Path</th>\n",
|
|
" <th>Sales Rep Name</th>\n",
|
|
" <th>Sales Rep Name1</th>\n",
|
|
" <th>Sales Rep Name2</th>\n",
|
|
" <th>Sales Rep Name3</th>\n",
|
|
" <th>Sales Rep ID</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>104</td>\n",
|
|
" <td>Amanda Honda-Amalia Craig</td>\n",
|
|
" <td>Amalia Craig</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Amalia Craig</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>103</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>104</td>\n",
|
|
" <td>Amanda Honda-Cart Lynch</td>\n",
|
|
" <td>Cart Lynch</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Cart Lynch</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>112</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>104</td>\n",
|
|
" <td>Amanda Honda-Molly McKenzie</td>\n",
|
|
" <td>Molly McKenzie</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Molly McKenzie</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>159</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>104</td>\n",
|
|
" <td>Amanda Honda-Sheila Hein</td>\n",
|
|
" <td>Sheila Hein</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Sheila Hein</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>176</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>109</td>\n",
|
|
" <td>Brenda Gibson-Dennis Johnson</td>\n",
|
|
" <td>Dennis Johnson</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>Dennis Johnson</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>121</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>5</th>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>109</td>\n",
|
|
" <td>Brenda Gibson-Ken Roberts</td>\n",
|
|
" <td>Ken Roberts</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>Ken Roberts</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>145</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>6</th>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>109</td>\n",
|
|
" <td>Brenda Gibson-Robert Kim</td>\n",
|
|
" <td>Robert Kim</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>Robert Kim</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>163</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>7</th>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>109</td>\n",
|
|
" <td>Brenda Gibson-William Fisher</td>\n",
|
|
" <td>William Fisher</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>William Fisher</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>185</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>21</th>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>134</td>\n",
|
|
" <td>John Greg-David Laychak</td>\n",
|
|
" <td>David Laychak</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>David Laychak</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>118</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>22</th>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>134</td>\n",
|
|
" <td>John Greg-Kathy Clinton</td>\n",
|
|
" <td>Kathy Clinton</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>Kathy Clinton</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>144</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>23</th>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>134</td>\n",
|
|
" <td>John Greg-Sandra Barone</td>\n",
|
|
" <td>Sandra Barone</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>Sandra Barone</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>170</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>24</th>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>134</td>\n",
|
|
" <td>John Greg-Viginia Mountain</td>\n",
|
|
" <td>Viginia Mountain</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>Viginia Mountain</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>184</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>41</th>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>169</td>\n",
|
|
" <td>Samantha Allen-Brad Taylor</td>\n",
|
|
" <td>Brad Taylor</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Brad Taylor</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>108</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>42</th>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>169</td>\n",
|
|
" <td>Samantha Allen-Karl Anderson</td>\n",
|
|
" <td>Karl Anderson</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Karl Anderson</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>143</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>43</th>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>169</td>\n",
|
|
" <td>Samantha Allen-Odessa Morris</td>\n",
|
|
" <td>Odessa Morris</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Odessa Morris</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>160</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>44</th>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>169</td>\n",
|
|
" <td>Samantha Allen-Stephanie Reagan</td>\n",
|
|
" <td>Stephanie Reagan</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Stephanie Reagan</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>179</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>52</th>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>181</td>\n",
|
|
" <td>Stewart Wind-Carolyn Halmon</td>\n",
|
|
" <td>Carolyn Halmon</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>Carolyn Halmon</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>111</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>53</th>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>181</td>\n",
|
|
" <td>Stewart Wind-John Davis</td>\n",
|
|
" <td>John Davis</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>John Davis</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>132</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>54</th>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>181</td>\n",
|
|
" <td>Stewart Wind-Micheal Williams</td>\n",
|
|
" <td>Micheal Williams</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>Micheal Williams</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>157</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>55</th>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>181</td>\n",
|
|
" <td>Stewart Wind-Ronald Golinski</td>\n",
|
|
" <td>Ronald Golinski</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>Ronald Golinski</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>166</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>59</th>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>104</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>Amanda Honda</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>104</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>60</th>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>109</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>Brenda Gibson</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>109</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>61</th>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>134</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>John Greg</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>134</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>62</th>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>169</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>Samantha Allen</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>169</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>63</th>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>181</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>Stewart Wind</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>181</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Manager Manager Number Path \\\n",
|
|
"0 Amanda Honda 104 Amanda Honda-Amalia Craig \n",
|
|
"1 Amanda Honda 104 Amanda Honda-Cart Lynch \n",
|
|
"2 Amanda Honda 104 Amanda Honda-Molly McKenzie \n",
|
|
"3 Amanda Honda 104 Amanda Honda-Sheila Hein \n",
|
|
"4 Brenda Gibson 109 Brenda Gibson-Dennis Johnson \n",
|
|
"5 Brenda Gibson 109 Brenda Gibson-Ken Roberts \n",
|
|
"6 Brenda Gibson 109 Brenda Gibson-Robert Kim \n",
|
|
"7 Brenda Gibson 109 Brenda Gibson-William Fisher \n",
|
|
"21 John Greg 134 John Greg-David Laychak \n",
|
|
"22 John Greg 134 John Greg-Kathy Clinton \n",
|
|
"23 John Greg 134 John Greg-Sandra Barone \n",
|
|
"24 John Greg 134 John Greg-Viginia Mountain \n",
|
|
"41 Samantha Allen 169 Samantha Allen-Brad Taylor \n",
|
|
"42 Samantha Allen 169 Samantha Allen-Karl Anderson \n",
|
|
"43 Samantha Allen 169 Samantha Allen-Odessa Morris \n",
|
|
"44 Samantha Allen 169 Samantha Allen-Stephanie Reagan \n",
|
|
"52 Stewart Wind 181 Stewart Wind-Carolyn Halmon \n",
|
|
"53 Stewart Wind 181 Stewart Wind-John Davis \n",
|
|
"54 Stewart Wind 181 Stewart Wind-Micheal Williams \n",
|
|
"55 Stewart Wind 181 Stewart Wind-Ronald Golinski \n",
|
|
"59 Amanda Honda 104 Amanda Honda \n",
|
|
"60 Brenda Gibson 109 Brenda Gibson \n",
|
|
"61 John Greg 134 John Greg \n",
|
|
"62 Samantha Allen 169 Samantha Allen \n",
|
|
"63 Stewart Wind 181 Stewart Wind \n",
|
|
"\n",
|
|
" Sales Rep Name Sales Rep Name1 Sales Rep Name2 Sales Rep Name3 \\\n",
|
|
"0 Amalia Craig Amanda Honda Amalia Craig NaN \n",
|
|
"1 Cart Lynch Amanda Honda Cart Lynch NaN \n",
|
|
"2 Molly McKenzie Amanda Honda Molly McKenzie NaN \n",
|
|
"3 Sheila Hein Amanda Honda Sheila Hein NaN \n",
|
|
"4 Dennis Johnson Brenda Gibson Dennis Johnson NaN \n",
|
|
"5 Ken Roberts Brenda Gibson Ken Roberts NaN \n",
|
|
"6 Robert Kim Brenda Gibson Robert Kim NaN \n",
|
|
"7 William Fisher Brenda Gibson William Fisher NaN \n",
|
|
"21 David Laychak John Greg David Laychak NaN \n",
|
|
"22 Kathy Clinton John Greg Kathy Clinton NaN \n",
|
|
"23 Sandra Barone John Greg Sandra Barone NaN \n",
|
|
"24 Viginia Mountain John Greg Viginia Mountain NaN \n",
|
|
"41 Brad Taylor Samantha Allen Brad Taylor NaN \n",
|
|
"42 Karl Anderson Samantha Allen Karl Anderson NaN \n",
|
|
"43 Odessa Morris Samantha Allen Odessa Morris NaN \n",
|
|
"44 Stephanie Reagan Samantha Allen Stephanie Reagan NaN \n",
|
|
"52 Carolyn Halmon Stewart Wind Carolyn Halmon NaN \n",
|
|
"53 John Davis Stewart Wind John Davis NaN \n",
|
|
"54 Micheal Williams Stewart Wind Micheal Williams NaN \n",
|
|
"55 Ronald Golinski Stewart Wind Ronald Golinski NaN \n",
|
|
"59 Amanda Honda Amanda Honda NaN NaN \n",
|
|
"60 Brenda Gibson Brenda Gibson NaN NaN \n",
|
|
"61 John Greg John Greg NaN NaN \n",
|
|
"62 Samantha Allen Samantha Allen NaN NaN \n",
|
|
"63 Stewart Wind Stewart Wind NaN NaN \n",
|
|
"\n",
|
|
" Sales Rep ID \n",
|
|
"0 103 \n",
|
|
"1 112 \n",
|
|
"2 159 \n",
|
|
"3 176 \n",
|
|
"4 121 \n",
|
|
"5 145 \n",
|
|
"6 163 \n",
|
|
"7 185 \n",
|
|
"21 118 \n",
|
|
"22 144 \n",
|
|
"23 170 \n",
|
|
"24 184 \n",
|
|
"41 108 \n",
|
|
"42 143 \n",
|
|
"43 160 \n",
|
|
"44 179 \n",
|
|
"52 111 \n",
|
|
"53 132 \n",
|
|
"54 157 \n",
|
|
"55 166 \n",
|
|
"59 104 \n",
|
|
"60 109 \n",
|
|
"61 134 \n",
|
|
"62 169 \n",
|
|
"63 181 "
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Sales Rep looks like we just don't have a 2nd or 3rd sales rep for that territory/path.\n",
|
|
"sales_rep[ sales_rep['Sales Rep Name2'].isnull() | sales_rep['Sales Rep Name3'].isnull() ]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>City</th>\n",
|
|
" <th>City Code</th>\n",
|
|
" <th>Region</th>\n",
|
|
" <th>Latitude</th>\n",
|
|
" <th>Longitude</th>\n",
|
|
" <th>Desc</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>94</th>\n",
|
|
" <td>Yokohama</td>\n",
|
|
" <td>95</td>\n",
|
|
" <td>Japan</td>\n",
|
|
" <td>35.455592</td>\n",
|
|
" <td>139.572196</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" City City Code Region Latitude Longitude Desc\n",
|
|
"94 Yokohama 95 Japan 35.455592 139.572196 NaN"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Don't think we'll be using the Desc field, so we'll leave this as-is\n",
|
|
"cities[cities['Desc'].isnull()]"
|
|
]
|
|
},
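The null checks above can also be collected into a single report. The following is a sketch only: the `tables` dictionary and its two miniature dataframes are hypothetical stand-ins for the lab's tables, and `null_report` is a name introduced here. It lists each table/column pair that contains at least one null, which makes it easier to see where nulls live rather than only how many there are.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables standing in for the lab's dataframes.
tables = {
    "cities": pd.DataFrame(
        {"City": ["Yokohama", "Oslo"], "Desc": [np.nan, "Oslo, Norway"]}
    ),
    "customers": pd.DataFrame({"Customer": ["A", "B"], "City Code": [95, 12]}),
}

# Stack the per-column null counts of every table into one Series,
# indexed by (table name, column name), and keep only non-zero counts.
null_report = (
    pd.concat({name: df.isnull().sum() for name, df in tables.items()})
    .rename("nulls")
    .loc[lambda s: s > 0]
)
print(null_report)
```

With the real tables, the same pattern would take the five imported dataframes as the dictionary values.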
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='dtypes'></div>\n",
|
|
"<h3>Dtypes</h3>\n",
|
|
"Review all imported tables and convert the data types if necessary, according to the rules in the following table:\n",
|
|
"<br><br>\n",
|
|
"<table>\n",
|
|
" <tr>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Dtype</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Primary or Foreign Keys</td>\n",
|
|
" <td>int64 or int32</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Currency</td>\n",
|
|
" <td>float64</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Text fields</td>\n",
|
|
" <td>object (string)</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Discrete, non-negative values</td>\n",
|
|
" <td>int64 or int32</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <td>Dates</td>\n",
|
|
" <td>datetime64[ns] (Timestamp object)</td>\n",
|
|
" </tr>\n",
|
|
"</table>\n",
|
|
" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"%KEY int64\n",
|
|
"Cost float64\n",
|
|
"Customer Number int64\n",
|
|
"Date datetime64[ns]\n",
|
|
"GrossSales float64\n",
|
|
"Invoice Date datetime64[ns]\n",
|
|
"Invoice Number int64\n",
|
|
"Item Desc object\n",
|
|
"Item Number int64\n",
|
|
"Margin float64\n",
|
|
"Order Number int64\n",
|
|
"Promised Delivery Date datetime64[ns]\n",
|
|
"Sales float64\n",
|
|
"Sales Qty int64\n",
|
|
"Sales Rep Number int64\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Dtypes look good as-is with the exception of the datetime fields in the sales df.\n",
|
|
"# This might take a minute to run since we have quite a few rows.\n",
|
|
"sales['Promised Delivery Date'] = pd.to_datetime(sales['Promised Delivery Date'])\n",
|
|
"sales['Invoice Date'] = pd.to_datetime(sales['Invoice Date'])\n",
|
|
"sales['Promised Delivery Date'] = pd.to_datetime(sales['Promised Delivery Date'])\n",
|
|
"sales['Date'] = pd.to_datetime(sales['Date'])\n",
|
|
"# Let's get the qty as an int since we can't have fractional qtys\n",
|
|
"sales['Sales Qty'] = sales['Sales Qty'].astype('int64')\n",
|
|
"\n",
|
|
"sales.dtypes"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='join'></div>\n",
|
|
"<h3>Join</h3>\n",
|
|
"Join all your tables together and store the joined result into a dataframe named <code>cs</code>. You'll need this for the <a href=\"#visualization\">reporting and visualization</a> section below. \n",
|
|
"\n",
|
|
"Use the <a href=\"#dictionary\">data dictionary</a> for guidance."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cs = pd.merge(sales, item_master, how='left', left_on='Item Number', right_on='Item Number') \\\n",
|
|
" .merge(sales_rep, how='left', left_on='Sales Rep Number', right_on='Sales Rep ID') \\\n",
|
|
" .merge(customers, how='left', left_on='Customer Number', right_on='Customer Number') \\\n",
|
|
" .merge(cities, how='inner', left_on='City Code', right_on='City Code')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='Analysis'></div>\n",
|
|
"<h2>Analysis</h2>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='fe'></div>\n",
|
|
"<h3>Feature Engineering</h3>\n",
|
|
"<ul>\n",
|
|
" <li>Create a new column, <code>GrossMargin</code>, which is the <code>GrossSales</code> minus the <code>Cost</code>, all divided by <code>GrossSales</code>. Store this value as a float (percentage).</li>\n",
|
|
" <li>Create a new column, <code>ShipDiff</code>, which is the difference between the <code>Promised Delivery Date</code> and the <code>Invoice Date</code>, in <code>seconds</code>.</li>\n",
|
|
" <li>Drop the <code>%KEY</code>, <code>Sales Rep Number</code>, <code>Manager Number</code>, <code>Path</code>, <code>Sales Rep ID</code>, and <code>Desc</code>.</li>\n",
|
|
"</ul>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cs['GrossMargin'] = (cs['GrossSales'] - cs['Cost']) / cs['GrossSales']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cs['ShipDiff'] = (cs['Promised Delivery Date'] - cs['Invoice Date']).dt.seconds"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cs.drop(['%KEY', 'Sales Rep Number', 'Manager Number', 'Path', 'Sales Rep ID', 'Desc'], axis=1, inplace=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='visualization'></div>\n",
|
|
"<h2>Reporting and Visualization</h2>\n",
|
|
"Create charts or reports according to the <a href=\"#prompts\">prompts</a>.\n",
|
|
"\n",
|
|
"<b>Use your best judgement to create visualizations or reports to best answer the questions. As yourself questions such as:</b>\n",
|
|
"<ul>\n",
|
|
" <li>Is the data I'm using categorical or continuous?</li>\n",
|
|
" <li>Am I looking at timeseries data?</li>\n",
|
|
" <li>Am I representing a part-of-a-whole relationship?</li>\n",
|
|
" <li>Do I have many data points? If so, could I report or chart a subset of that data?</li>\n",
|
|
"</ul>\n",
|
|
"\n",
|
|
"<b>There's no right or wrong answer to these questions. As you solve them, focus on this progression:</b>\n",
|
|
"<ul>\n",
|
|
" <li>First, create a pandas report with a dataframe or list of values that attempts to answer the prompt.</li>\n",
|
|
" <li>If you get that done, try charting it out using a pandas charting method, like <code>.plot()</code></li>\n",
|
|
" <li>If you get that done, look at how you might make your chart <i>more information dense</i></li>\n",
|
|
" <ul>\n",
|
|
" <li>Increase chart ink area</li>\n",
|
|
" <li>Reduce visual clutter</li>\n",
|
|
" <li>Use color to convey meaning</li>\n",
|
|
" <li>Increase information density with shape, size, color</li>\n",
|
|
" <li>Use callouts to highlight anomalies in your data or points of interest</li>\n",
|
|
" </ul>\n",
|
|
" <li>If you get that done, look at graduating from the <a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html\">pandas plotting methods</a> to <a href=\"https://matplotlib.org/\">matplotlib</a>, which is the backend for pandas plotting. This will allow you more control over your plots but the learning curve is fairly steep. Hang in there!</li>\n",
|
|
"</ul>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<img src=\"https://cdn.wikimg.net/en/strategywiki/images/7/70/SMB3-warpzone.png\" height=\"150\" width=\"150\">\n",
|
|
"\n",
|
|
"<div id='shortcut'></div>\n",
|
|
"<font color='red'><strong>Shortcut cell</font></strong>: if you'd like to bypass the EDA section, please run the cell below to import a pre-cleaned dataset into variable `cs` for charting purposes:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Shortcut cell\n",
|
|
"cs = pd.read_csv('../data/pre_cleaned_data.csv', \n",
|
|
" infer_datetime_format=True, \n",
|
|
" parse_dates=['Date', 'Invoice Date', 'Promised Delivery Date']\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='1a'></div>\n",
|
|
"<h3>1.A</h3>\n",
|
|
"Gross Margin by product group."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {
|
|
"needs_background": "light"
|
|
},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"cs.groupby(['Product Group'])['GrossMargin'].mean().sort_values().plot(kind='barh')\n",
|
|
"plt.title('Gross Margin by Product Group')\n",
|
|
"plt.xlabel('Gross Margin [%]');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='1b'></div>\n",
|
|
"<h3>1.B</h3>\n",
|
|
"Sales by product group, top 10 product groups only."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {
|
|
"needs_background": "light"
|
|
},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"cs.groupby(['Product Type'])['GrossSales'].sum()\\\n",
|
|
" .apply(lambda x: x/1000000)\\\n",
|
|
" .sort_values(ascending=False)\\\n",
|
|
" .head(10)\\\n",
|
|
" .plot(kind='bar');\n",
|
|
"plt.title('Sales [MM USD] by Product Type')\n",
|
|
"plt.ylabel('Sales [MM USD]');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='1c'></div>\n",
|
|
"<h3>1.C</h3>\n",
|
|
"Sales, by year/month, year over year"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {
|
|
"needs_background": "light"
|
|
},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"(cs.groupby([cs['Date'].dt.year, cs['Date'].dt.month])['GrossSales'].sum().unstack().T/1000000).plot()\n",
|
|
"plt.ylabel('Sales [Million USD]')\n",
|
|
"plt.xlabel('Month')\n",
|
|
"plt.title('Monthly Sales, YOY');"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='2a'></div>\n",
|
|
"<h3>2.A</h3>\n",
|
|
"Sum of Sales and sales quantity, by rep, by customer. Top 10 customer gross sales only. Formatted as a data frame, not a chart."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>GrossSales</th>\n",
|
|
" <th>Sales Qty</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sales Rep Name</th>\n",
|
|
" <th>Customer</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>Stewart Wind</th>\n",
|
|
" <th>PageWave</th>\n",
|
|
" <td>5.867753e+06</td>\n",
|
|
" <td>232848</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">Judy Thurman</th>\n",
|
|
" <th>Paracel</th>\n",
|
|
" <td>5.665680e+06</td>\n",
|
|
" <td>28360</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Deak-Perera Group.</th>\n",
|
|
" <td>5.326473e+06</td>\n",
|
|
" <td>27784</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Stewart Wind</th>\n",
|
|
" <th>Talarian</th>\n",
|
|
" <td>4.442953e+06</td>\n",
|
|
" <td>177012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">Lee Chin</th>\n",
|
|
" <th>Userland</th>\n",
|
|
" <td>3.747440e+06</td>\n",
|
|
" <td>111200</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Target</th>\n",
|
|
" <td>3.410170e+06</td>\n",
|
|
" <td>101601</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Stewart Wind</th>\n",
|
|
" <th>Acer</th>\n",
|
|
" <td>2.816532e+06</td>\n",
|
|
" <td>109296</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Judy Thurman</th>\n",
|
|
" <th>Tandy Corporation</th>\n",
|
|
" <td>2.551884e+06</td>\n",
|
|
" <td>11872</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Lee Chin</th>\n",
|
|
" <th>Boston and Albany Railroad Company</th>\n",
|
|
" <td>2.075920e+06</td>\n",
|
|
" <td>61600</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Cheryle Sincock</th>\n",
|
|
" <th>Matradi</th>\n",
|
|
" <td>1.730093e+06</td>\n",
|
|
" <td>100072</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" GrossSales Sales Qty\n",
|
|
"Sales Rep Name Customer \n",
|
|
"Stewart Wind PageWave 5.867753e+06 232848\n",
|
|
"Judy Thurman Paracel 5.665680e+06 28360\n",
|
|
" Deak-Perera Group. 5.326473e+06 27784\n",
|
|
"Stewart Wind Talarian 4.442953e+06 177012\n",
|
|
"Lee Chin Userland 3.747440e+06 111200\n",
|
|
" Target 3.410170e+06 101601\n",
|
|
"Stewart Wind Acer 2.816532e+06 109296\n",
|
|
"Judy Thurman Tandy Corporation 2.551884e+06 11872\n",
|
|
"Lee Chin Boston and Albany Railroad Company 2.075920e+06 61600\n",
|
|
"Cheryle Sincock Matradi 1.730093e+06 100072"
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"cs.groupby(['Sales Rep Name', 'Customer'])[['GrossSales', 'Sales Qty']].sum()\\\n",
|
|
" .sort_values('GrossSales', ascending=False)\\\n",
|
|
" .head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div id='3a'></div>\n",
|
|
"<h3>3.A</h3>\n",
|
|
"Scatter plot of mean Gross Margin vs Gross Sales, by Product Sub Group"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def gsgm(s):\n",
|
|
" '''\n",
|
|
" Scales gross sales and margin for plotting purposes\n",
|
|
" '''\n",
|
|
" if s.name == 'GrossSales':\n",
|
|
" return(s)\n",
|
|
" elif s.name == 'GrossMargin':\n",
|
|
" return(s*100)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {
|
|
"needs_background": "light"
|
|
},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"cs.groupby(['Product Sub Group'])[['GrossMargin', 'GrossSales']]\\\n",
|
|
" .mean()\\\n",
|
|
" .apply(lambda x: gsgm(x))\\\n",
|
|
" .plot\\\n",
|
|
" .scatter('GrossSales', 'GrossMargin');\n",
|
|
"plt.title('Mean Gross Margin vs Gross Sales');"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|
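The Feature Engineering section asks for `GrossMargin` as a fraction of gross sales and `ShipDiff` as a duration in seconds. The sketch below works through both on a made-up two-row frame named `demo`, a hypothetical stand-in for `cs` whose values are illustrative and not taken from the lab data. It also contrasts `.dt.total_seconds()`, which gives the full duration, with `.dt.seconds`, which holds only the part of the duration left over after whole days are removed.

```python
import pandas as pd

# Hypothetical stand-in for the joined `cs` dataframe; all values are made up.
demo = pd.DataFrame({
    'GrossSales': [200.0, 50.0],
    'Cost': [150.0, 20.0],
    'Invoice Date': pd.to_datetime(['2015-01-01 00:00', '2015-01-10 00:00']),
    'Promised Delivery Date': pd.to_datetime(['2015-01-03 00:00', '2015-01-10 06:00']),
})

# Gross margin as a fraction of gross sales (0.25 means 25%).
demo['GrossMargin'] = (demo['GrossSales'] - demo['Cost']) / demo['GrossSales']

# Subtracting two datetime columns yields a timedelta column.
diff = demo['Promised Delivery Date'] - demo['Invoice Date']

# total_seconds() converts the whole duration, days included, to seconds.
demo['ShipDiff'] = diff.dt.total_seconds()

# .dt.seconds is only the seconds component (0 to 86399) after whole days
# are stripped, so an exact two-day gap shows up as 0 here.
seconds_component = diff.dt.seconds

print(demo[['GrossMargin', 'ShipDiff']])
print(seconds_component)
```

If the lab's date columns carry no time of day, every difference is a whole number of days, so the seconds component alone would be zero for every row.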