{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas for Exploratory Data Analysis II \n",
"\n",
"Pandas is a very useful Python library for data manipulation and exploration. We have so much more to see!\n",
"\n",
"In this lesson, we'll continue exploring Pandas for EDA. Specifically, we will:\n",
"\n",
"- Identify and handle missing values with Pandas.\n",
"- Implement groupby statements for specific segmented analysis.\n",
"- Use apply functions to clean data with Pandas.\n",
"\n",
"We'll implicitly review many functions from our first Pandas lesson along the way!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remember the AdventureWorks Cycles Dataset?\n",
"\n",
"Here's the Production.Product table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which describes the fields (columns) in the table (the .csv file we will import below):\n",
"\n",
"- **ProductID** - Primary key for Product records.\n",
"- **Name** - Name of the product.\n",
"- **ProductNumber** - Unique product identification number.\n",
"- **MakeFlag** - 0 = Product is purchased, 1 = Product is manufactured in-house.\n",
"- **FinishedGoodsFlag** - 0 = Product is not a salable item. 1 = Product is salable.\n",
"- **Color** - Product color.\n",
"- **SafetyStockLevel** - Minimum inventory quantity.\n",
"- **ReorderPoint** - Inventory level that triggers a purchase order or work order.\n",
"- **StandardCost** - Standard cost of the product.\n",
"- **ListPrice** - Selling price.\n",
"- **Size** - Product size.\n",
"- **SizeUnitMeasureCode** - Unit of measure for the Size column.\n",
"- **WeightUnitMeasureCode** - Unit of measure for the Weight column.\n",
"- **Weight** - Product weight.\n",
"- **DaysToManufacture** - Number of days required to manufacture the product.\n",
"- **ProductLine** - R = Road, M = Mountain, T = Touring, S = Standard\n",
"- **Class** - H = High, M = Medium, L = Low\n",
"- **Style** - W = Womens, M = Mens, U = Universal\n",
"- **ProductSubcategoryID** - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.\n",
"- **ProductModelID** - Product is a member of this product model. Foreign key to ProductModel.ProductModelID.\n",
"- **SellStartDate** - Date the product was available for sale.\n",
"- **SellEndDate** - Date the product was no longer available for sale.\n",
"- **DiscontinuedDate** - Date the product was discontinued.\n",
"- **rowguid** - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.\n",
"- **ModifiedDate** - Date and time the record was last updated.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Pandas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np # used for linear algebra and random sampling\n",
"# used for plotting charts within the notebook (instead of a separate window)\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in the dataset\n",
"\n",
"We are using the `pd.read_csv()` function, with `sep='\\t'` to specify tab-delimited columns."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"prod = pd.read_csv('../data/Production.Product.csv', sep='\\t')"
]
},
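{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what `sep='\\t'` does, here is a minimal sketch that parses a small in-memory string instead of the real file. The names `raw` and `toy` and the values in them are made up for this example; they are not rows from `Production.Product.csv`.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical tab-delimited text standing in for a file on disk\n",
"raw = 'ProductID\\tName\\tColor\\n1\\tWidget\\tRed\\n2\\tGadget\\t\\n'\n",
"\n",
"# sep='\\t' splits each line on tab characters; the empty last field is read as NaN\n",
"toy = pd.read_csv(io.StringIO(raw), sep='\\t')\n",
"\n",
"print(toy.shape)\n",
"print(toy['Color'].isnull().sum())\n",
"```\n",
"\n",
"Note that the empty `Color` field in the second row comes back as `NaN`, which is exactly the kind of missing value we look for later in this lesson."
]
},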
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" ProductID Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"0 1 Adjustable Race AR-5381 0 0 \n",
"1 2 Bearing Ball BA-8327 0 0 \n",
"2 3 BB Ball Bearing BE-2349 1 0 \n",
"\n",
" Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n",
"0 NaN 1000 750 0.0 0.0 ... \n",
"1 NaN 1000 750 0.0 0.0 ... \n",
"2 NaN 800 600 0.0 0.0 ... \n",
"\n",
" ProductLine Class Style ProductSubcategoryID ProductModelID \\\n",
"0 NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"\n",
" SellStartDate SellEndDate DiscontinuedDate \\\n",
"0 2008-04-30 00:00:00 NaN NaN \n",
"1 2008-04-30 00:00:00 NaN NaN \n",
"2 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid ModifiedDate \n",
"0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n",
"1 {58AE3C20-4F3A-4749-A7D4-D568806CC537} 2014-02-08 10:01:36.827000000 \n",
"2 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} 2014-02-08 10:01:36.827000000 \n",
"\n",
"[3 rows x 25 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's check out the first 3 rows again, for old times' sake\n",
"prod.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(504, 25)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and the number of rows x cols\n",
"prod.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set our index (like last time)\n",
"\n",
"Let's move our `ProductID` column into the index, since it's the PK (primary key) of our table and using the PK as the index is a common convention."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# replace auto-generated index with the ProductID column\n",
"prod.set_index('ProductID', inplace=True)"
]
},
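{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why a primary-key index is convenient, here is a small sketch on a made-up frame (the name `toy` and its values are hypothetical, not taken from our dataset). It also shows reassignment as an alternative to `inplace=True`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy table; the IDs and names are invented for illustration\n",
"toy = pd.DataFrame({'ProductID': [10, 20, 30],\n",
"                    'Name': ['A', 'B', 'C']})\n",
"\n",
"# Reassigning the result is an alternative to passing inplace=True\n",
"toy = toy.set_index('ProductID')\n",
"\n",
"# .loc now looks rows up by ProductID label rather than by position\n",
"print(toy.loc[20, 'Name'])\n",
"```\n",
"\n",
"After `set_index`, `ProductID` is no longer a regular column, which is why later outputs report one fewer column than the raw file."
]
},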
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling missing data\n",
"\n",
"Recall that missing data is a systemic, challenging problem for data scientists. Imagine conducting a poll, but some of the data gets lost, or you run out of budget and can't complete it! 😮\n",
"\n",
"\"Handling missing data\" itself is a broad topic. We'll focus on three components:\n",
"\n",
"- Using Pandas to identify that we have missing data\n",
"- Strategies to fill in missing data (known in the business as `imputing`)\n",
"- Filling in missing data with Pandas\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Identifying missing data\n",
"\n",
"Before *handling* missing data, we must first identify that we have any!\n",
"\n",
"We have a few ways to explore missing data, and they are reminiscent of our Boolean filters."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag Color \\\n",
"ProductID \n",
"1 True True True True False \n",
"2 True True True True False \n",
"3 True True True True False \n",
"\n",
" SafetyStockLevel ReorderPoint StandardCost ListPrice Size \\\n",
"ProductID \n",
"1 True True True True False \n",
"2 True True True True False \n",
"3 True True True True False \n",
"\n",
" ... ProductLine Class Style ProductSubcategoryID \\\n",
"ProductID ... \n",
"1 ... False False False False \n",
"2 ... False False False False \n",
"3 ... False False False False \n",
"\n",
" ProductModelID SellStartDate SellEndDate DiscontinuedDate \\\n",
"ProductID \n",
"1 False True False False \n",
"2 False True False False \n",
"3 False True False False \n",
"\n",
" rowguid ModifiedDate \n",
"ProductID \n",
"1 True True \n",
"2 True True \n",
"3 True True \n",
"\n",
"[3 rows x 24 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# True when data isn't missing\n",
"prod.notnull().head(3)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag Color \\\n",
"ProductID \n",
"1 False False False False True \n",
"2 False False False False True \n",
"3 False False False False True \n",
"\n",
" SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n",
"ProductID ... \n",
"1 False False False False True ... \n",
"2 False False False False True ... \n",
"3 False False False False True ... \n",
"\n",
" ProductLine Class Style ProductSubcategoryID ProductModelID \\\n",
"ProductID \n",
"1 True True True True True \n",
"2 True True True True True \n",
"3 True True True True True \n",
"\n",
" SellStartDate SellEndDate DiscontinuedDate rowguid ModifiedDate \n",
"ProductID \n",
"1 False True True False False \n",
"2 False True True False False \n",
"3 False True True False False \n",
"\n",
"[3 rows x 24 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# True when data is missing\n",
"prod.isnull().head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we may want to see null values in aggregate. We can use `sum()` to total the `True` values down each column."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name 0\n",
"ProductNumber 0\n",
"MakeFlag 0\n",
"FinishedGoodsFlag 0\n",
"Color 248\n",
"SafetyStockLevel 0\n",
"ReorderPoint 0\n",
"StandardCost 0\n",
"ListPrice 0\n",
"Size 293\n",
"SizeUnitMeasureCode 328\n",
"WeightUnitMeasureCode 299\n",
"Weight 299\n",
"DaysToManufacture 0\n",
"ProductLine 226\n",
"Class 257\n",
"Style 293\n",
"ProductSubcategoryID 209\n",
"ProductModelID 209\n",
"SellStartDate 0\n",
"SellEndDate 406\n",
"DiscontinuedDate 504\n",
"rowguid 0\n",
"ModifiedDate 0\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# here is a quick and dirty way to do it\n",
"prod.isnull().sum()"
]
},
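{
"cell_type": "markdown",
"metadata": {},
"source": [
"Raw counts are easier to judge next to the number of rows. A minimal sketch, assuming a made-up frame named `toy`: taking `mean()` of the Boolean mask instead of `sum()` gives the share of nulls in each column.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with some missing entries\n",
"toy = pd.DataFrame({'Color': ['Red', np.nan, np.nan, 'Blue'],\n",
"                    'Size': [np.nan, 'M', 'L', 'S']})\n",
"\n",
"# isnull() gives Booleans; mean() counts True as 1, so this is the fraction of nulls per column\n",
"null_share = toy.isnull().mean()\n",
"\n",
"print(null_share)\n",
"```\n",
"\n",
"Multiplying `null_share` by 100 gives percentages, if those read better in a report."
]
},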
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look! We've found missing values!\n",
"\n",
"How could this missing data be problematic for our analysis?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understanding missing data\n",
"\n",
"Finding missing data is the easy part! Determining what to do next is more complicated.\n",
"\n",
"Typically, we are most interested in knowing **why** we are missing data. Once we know what 'type of missingness' we have (the source of missing data), we can proceed effectively."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first quantify how much data we are missing. Here is another implementation of `prod.isnull().sum()`, only wrapped with a `DataFrame` and some labels to make it a little more user-friendly:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Count of Nulls\n",
"Column \n",
"DiscontinuedDate 504\n",
"SellEndDate 406\n",
"SizeUnitMeasureCode 328\n",
"Weight 299\n",
"WeightUnitMeasureCode 299\n",
"Size 293\n",
"Style 293\n",
"Class 257\n",
"Color 248\n",
"ProductLine 226"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# or we can make things pretty as follows\n",
"null_df = pd.DataFrame(prod.isnull().sum(), columns=['Count of Nulls'])\n",
"null_df.index.name = 'Column'\n",
"null_df.sort_values(['Count of Nulls'], ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filling in missing data\n",
"\n",
"How we fill in data depends largely on why it is missing (types of missingness) and what sampling we have available to us.\n",
"\n",
"We may:\n",
"\n",
"- Delete missing data altogether\n",
"- Fill in missing data with:\n",
" - The average of the column\n",
" - The median of the column\n",
" - A predicted amount based on other factors\n",
"- Collect more data:\n",
" - Resample the population\n",
" - Follow up with the authority providing data that is missing\n"
]
},
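{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the mean and median options from the list above. It assumes a made-up numeric Series named `weights` (not the real `Weight` column), chosen to include one large outlier.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical weights with two gaps and one large outlier\n",
"weights = pd.Series([1.0, 2.0, np.nan, 3.0, np.nan, 100.0])\n",
"\n",
"# mean() and median() skip NaN by default, so they use the four known values\n",
"filled_mean = weights.fillna(weights.mean())\n",
"filled_median = weights.fillna(weights.median())\n",
"\n",
"print(weights.mean(), weights.median())\n",
"```\n",
"\n",
"The outlier pulls the mean up to 26.5 while the median stays at 2.5, which is one reason the median is often preferred for skewed columns."
]
},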
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our case, let's focus on handling missing values in `Color`. Let's get a count of the unique values in that column. We will need to use the `dropna=False` kwarg, otherwise the `pd.Series.value_counts()` method will not count `NaN` (null) values."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaN 248\n",
"Black 93\n",
"Silver 43\n",
"Red 38\n",
"Yellow 36\n",
"Blue 26\n",
"Multi 8\n",
"Silver/Black 7\n",
"White 4\n",
"Grey 1\n",
"Name: Color, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod['Color'].value_counts(dropna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahoy! We have 248 nulls!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 1: Drop the missing values."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ProductID\n",
"317 Black\n",
"318 Black\n",
"319 Black\n",
"320 Silver\n",
"321 Silver\n",
" ... \n",
"992 Black\n",
"993 Black\n",
"997 Black\n",
"998 Black\n",
"999 Black\n",
"Name: Color, Length: 256, dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drops the null entries from the Color Series; this does not happen *in place*, so prod itself is unchanged\n",
"prod['Color'].dropna(inplace=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Important!** `pd.DataFrame.dropna()` and `pd.Series.dropna()` are very versatile! Let's look at the docs (Series is similar):\n",
"\n",
"```python\n",
"Signature: pd.DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)\n",
"Docstring:\n",
"Remove missing values.\n",
"\n",
"See the :ref:`User Guide <missing_data>` for more on which values are\n",
"considered missing, and how to work with missing data.\n",
"\n",
"Parameters\n",
"----------\n",
"axis : {0 or 'index', 1 or 'columns'}, default 0\n",
" Determine if rows or columns which contain missing values are\n",
" removed.\n",
"\n",
" * 0, or 'index' : Drop rows which contain missing values.\n",
" * 1, or 'columns' : Drop columns which contain missing value.\n",
"\n",
" .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple\n",
" axes.\n",
"how : {'any', 'all'}, default 'any'\n",
" Determine if row or column is removed from DataFrame, when we have\n",
" at least one NA or all NA.\n",
"\n",
" * 'any' : If any NA values are present, drop that row or column.\n",
" * 'all' : If all values are NA, drop that row or column.\n",
"thresh : int, optional\n",
" Require that many non-NA values.\n",
"subset : array-like, optional\n",
" Labels along other axis to consider, e.g. if you are dropping rows\n",
" these would be a list of columns to include.\n",
"inplace : bool, default False\n",
" If True, do operation inplace and return None.\n",
"```\n",
"\n",
"**how**: Controls whether a row is removed if _any_ of its columns have a null, or only if _all_ of its columns have a null.\n",
"\n",
"**subset**: We can pass a list here, like `['Color', 'Size', 'Weight']`, and only nulls in those columns will be considered. This is very useful!\n",
"\n",
"**inplace**: Set this to `True` to mutate (change) the source dataframe. The default is `False`, which returns a modified _copy_ and leaves the source dataframe unchanged."
]
},
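{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these parameters concrete, here is a small sketch on a made-up four-row frame. The name `toy` and its values are hypothetical; each row has a different number of nulls so the options give different results.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame: rows 1 to 4 have 0, 1, 2 and 3 nulls respectively\n",
"toy = pd.DataFrame({'Color': ['Red', np.nan, 'Blue', np.nan],\n",
"                    'Size': ['M', 'L', np.nan, np.nan],\n",
"                    'Weight': [1.5, 2.0, np.nan, np.nan]},\n",
"                   index=[1, 2, 3, 4])\n",
"\n",
"any_null = toy.dropna(how='any')          # drop rows with at least one null\n",
"all_null = toy.dropna(how='all')          # drop rows where every value is null\n",
"two_plus = toy.dropna(thresh=2)           # keep rows with at least 2 non-null values\n",
"by_color = toy.dropna(subset=['Color'])   # only consider nulls in Color\n",
"\n",
"for name, result in [('any', any_null), ('all', all_null),\n",
"                     ('thresh=2', two_plus), ('subset', by_color)]:\n",
"    print(name, result.index.tolist())\n",
"```\n",
"\n",
"None of these calls pass `inplace=True`, so `toy` still has all four rows afterwards."
]
},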
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To accomplish the same thing, but implement it on our entire dataframe, we can do the following:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag Color \\\n",
"ProductID \n",
"317 LL Crankarm CA-5965 0 0 Black \n",
"318 ML Crankarm CA-6738 0 0 Black \n",
"319 HL Crankarm CA-7457 0 0 Black \n",
"\n",
" SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n",
"ProductID ... \n",
"317 500 375 0.0 0.0 NaN ... \n",
"318 500 375 0.0 0.0 NaN ... \n",
"319 500 375 0.0 0.0 NaN ... \n",
"\n",
" ProductLine Class Style ProductSubcategoryID ProductModelID \\\n",
"ProductID \n",
"317 NaN L NaN NaN NaN \n",
"318 NaN M NaN NaN NaN \n",
"319 NaN NaN NaN NaN NaN \n",
"\n",
" SellStartDate SellEndDate DiscontinuedDate \\\n",
"ProductID \n",
"317 2008-04-30 00:00:00 NaN NaN \n",
"318 2008-04-30 00:00:00 NaN NaN \n",
"319 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid \\\n",
"ProductID \n",
"317 {3C9D10B7-A6B2-4774-9963-C19DCEE72FEA} \n",
"318 {EABB9A92-FA07-4EAB-8955-F0517B4A4CA7} \n",
"319 {7D3FD384-4F29-484B-86FA-4206E276FE58} \n",
"\n",
" ModifiedDate \n",
"ProductID \n",
"317 2014-02-08 10:01:36.827000000 \n",
"318 2014-02-08 10:01:36.827000000 \n",
"319 2014-02-08 10:01:36.827000000 \n",
"\n",
"[3 rows x 24 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drops all nulls from the Color column, but returns the entire dataframe instead of just the Color column\n",
"prod.dropna(subset=['Color']).head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Option 2: Fill in missing values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Traditionally, we fill missing data with a median, average, or mode (most frequently occurring). For `Color`, let's replace the nulls with the string value `NoColor`.\n",
"\n",
"Let's first look at the way we'd do it with a single column, using the `pd.Series.fillna()` method:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ProductID\n",
"1 NoColor\n",
"2 NoColor\n",
"3 NoColor\n",
"4 NoColor\n",
"316 NoColor\n",
"317 Black\n",
"318 Black\n",
"319 Black\n",
"320 Silver\n",
"321 Silver\n",
"Name: Color, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod['Color'].fillna(value='NoColor').head(10)"
]
},
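{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a sketch of the mode option mentioned earlier, which is the usual choice when a categorical column should be filled with a real category rather than a placeholder. The Series `colors` is made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical categorical column with two missing entries\n",
"colors = pd.Series(['Black', 'Black', np.nan, 'Red', np.nan])\n",
"\n",
"# mode() ignores NaN and returns a Series (ties are possible), so take the first entry\n",
"most_common = colors.mode().iloc[0]\n",
"filled = colors.fillna(most_common)\n",
"\n",
"print(filled.tolist())\n",
"```\n",
"\n",
"Whether the mode or a placeholder such as `NoColor` is more appropriate depends on why the values are missing."
]
},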
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how we'd do it to the whole dataframe, using the `pd.DataFrame.fillna()` method. Notice the similar API between the methods with the `value` kwarg. Good congruent design, pandas development team! The full dataframe is returned, and not just a column."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"ProductID \n",
"1 Adjustable Race AR-5381 0 0 \n",
"2 Bearing Ball BA-8327 0 0 \n",
"3 BB Ball Bearing BE-2349 1 0 \n",
"4 Headset Ball Bearings BE-2908 0 0 \n",
"316 Blade BL-2036 1 0 \n",
"317 LL Crankarm CA-5965 0 0 \n",
"318 ML Crankarm CA-6738 0 0 \n",
"319 HL Crankarm CA-7457 0 0 \n",
"320 Chainring Bolts CB-2903 0 0 \n",
"321 Chainring Nut CN-6137 0 0 \n",
"\n",
" Color SafetyStockLevel ReorderPoint StandardCost ListPrice \\\n",
"ProductID \n",
"1 NoColor 1000 750 0.0 0.0 \n",
"2 NoColor 1000 750 0.0 0.0 \n",
"3 NoColor 800 600 0.0 0.0 \n",
"4 NoColor 800 600 0.0 0.0 \n",
"316 NoColor 800 600 0.0 0.0 \n",
"317 Black 500 375 0.0 0.0 \n",
"318 Black 500 375 0.0 0.0 \n",
"319 Black 500 375 0.0 0.0 \n",
"320 Silver 1000 750 0.0 0.0 \n",
"321 Silver 1000 750 0.0 0.0 \n",
"\n",
" Size ... ProductLine Class Style ProductSubcategoryID \\\n",
"ProductID ... \n",
"1 NaN ... NaN NaN NaN NaN \n",
"2 NaN ... NaN NaN NaN NaN \n",
"3 NaN ... NaN NaN NaN NaN \n",
"4 NaN ... NaN NaN NaN NaN \n",
"316 NaN ... NaN NaN NaN NaN \n",
"317 NaN ... NaN L NaN NaN \n",
"318 NaN ... NaN M NaN NaN \n",
"319 NaN ... NaN NaN NaN NaN \n",
"320 NaN ... NaN NaN NaN NaN \n",
"321 NaN ... NaN NaN NaN NaN \n",
"\n",
" ProductModelID SellStartDate SellEndDate DiscontinuedDate \\\n",
"ProductID \n",
"1 NaN 2008-04-30 00:00:00 NaN NaN \n",
"2 NaN 2008-04-30 00:00:00 NaN NaN \n",
"3 NaN 2008-04-30 00:00:00 NaN NaN \n",
"4 NaN 2008-04-30 00:00:00 NaN NaN \n",
"316 NaN 2008-04-30 00:00:00 NaN NaN \n",
"317 NaN 2008-04-30 00:00:00 NaN NaN \n",
"318 NaN 2008-04-30 00:00:00 NaN NaN \n",
"319 NaN 2008-04-30 00:00:00 NaN NaN \n",
"320 NaN 2008-04-30 00:00:00 NaN NaN \n",
"321 NaN 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid \\\n",
"ProductID \n",
"1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n",
"2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n",
"3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n",
"4 {ECFED6CB-51FF-49B5-B06C-7D8AC834DB8B} \n",
"316 {E73E9750-603B-4131-89F5-3DD15ED5FF80} \n",
"317 {3C9D10B7-A6B2-4774-9963-C19DCEE72FEA} \n",
"318 {EABB9A92-FA07-4EAB-8955-F0517B4A4CA7} \n",
"319 {7D3FD384-4F29-484B-86FA-4206E276FE58} \n",
"320 {7BE38E48-B7D6-4486-888E-F53C26735101} \n",
"321 {3314B1D7-EF69-4431-B6DD-DC75268BD5DF} \n",
"\n",
" ModifiedDate \n",
"ProductID \n",
"1 2014-02-08 10:01:36.827000000 \n",
"2 2014-02-08 10:01:36.827000000 \n",
"3 2014-02-08 10:01:36.827000000 \n",
"4 2014-02-08 10:01:36.827000000 \n",
"316 2014-02-08 10:01:36.827000000 \n",
"317 2014-02-08 10:01:36.827000000 \n",
"318 2014-02-08 10:01:36.827000000 \n",
"319 2014-02-08 10:01:36.827000000 \n",
"320 2014-02-08 10:01:36.827000000 \n",
"321 2014-02-08 10:01:36.827000000 \n",
"\n",
"[10 rows x 24 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.fillna(value={'Color': 'NoColor'}).head(10)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But wait! There's more! We can reference any other data or formulas we want with the imputation (the value we fill the nulls with). This is very handy if you want to impute with the average or median of that column... or even another column altogether! Here is an example where we fill the nulls of `Color` with the average value from the `ListPrice` column. This has no practical value in this application, but immense value in other applications."
   ]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"ProductID \n",
"1 Adjustable Race AR-5381 0 0 \n",
"2 Bearing Ball BA-8327 0 0 \n",
"3 BB Ball Bearing BE-2349 1 0 \n",
"\n",
" Color SafetyStockLevel ReorderPoint StandardCost ListPrice \\\n",
"ProductID \n",
"1 438.66625 1000 750 0.0 0.0 \n",
"2 438.66625 1000 750 0.0 0.0 \n",
"3 438.66625 800 600 0.0 0.0 \n",
"\n",
" Size ... ProductLine Class Style ProductSubcategoryID \\\n",
"ProductID ... \n",
"1 NaN ... NaN NaN NaN NaN \n",
"2 NaN ... NaN NaN NaN NaN \n",
"3 NaN ... NaN NaN NaN NaN \n",
"\n",
" ProductModelID SellStartDate SellEndDate DiscontinuedDate \\\n",
"ProductID \n",
"1 NaN 2008-04-30 00:00:00 NaN NaN \n",
"2 NaN 2008-04-30 00:00:00 NaN NaN \n",
"3 NaN 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid \\\n",
"ProductID \n",
"1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n",
"2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n",
"3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n",
"\n",
" ModifiedDate \n",
"ProductID \n",
"1 2014-02-08 10:01:36.827000000 \n",
"2 2014-02-08 10:01:36.827000000 \n",
"3 2014-02-08 10:01:36.827000000 \n",
"\n",
"[3 rows x 24 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.fillna(value={'Color': prod['ListPrice'].mean() }).head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They're gone! Important points:\n",
"\n",
"- Don't forget to use the `inplace=True` kwarg to mutate the source dataframe (i.e. 'save changes'). \n",
"- It is helpful to not use `inplace=True` initially to ensure your code/logic is correct, prior to making permanent changes."
]
},
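  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the same pattern on a numeric column, the toy frame below fills missing `Weight` values with that column's own median. The values are made up for illustration and are not taken from the AdventureWorks file. The result is assigned to a new name, which keeps the change without needing `inplace=True`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "toy = pd.DataFrame({'Weight': [2.0, np.nan, 4.0, np.nan, 9.0]})\n",
    "\n",
    "# median() skips nulls by default, so this is the median of [2.0, 4.0, 9.0]\n",
    "median_weight = toy['Weight'].median()\n",
    "\n",
    "# fillna returns a new DataFrame; the original toy frame is left untouched\n",
    "filled = toy.fillna(value={'Weight': median_weight})\n",
    "\n",
    "print(filled['Weight'].tolist())\n",
    "print(toy['Weight'].isnull().sum())\n",
    "```\n",
    "\n",
    "Because `toy` itself is unchanged, it is easy to compare before and after while you are still checking your logic."
   ]
  },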
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Groupby Statements\n",
    "\n",
    "In Pandas, groupby statements are similar to pivot tables in that they allow us to segment our data into subsets and summarize each one.\n",
    "\n",
    "For example, if we want to know the average list price and the number of products per color, a groupby statement makes this task much more straightforward.\n"
   ]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To understand how a groupby statement works, think about it like this:\n",
    "\n",
    "- **Split:** Separate our DataFrame by a specific attribute, for example, group by `Color`\n",
    "- **Apply:** Compute some _aggregated_ metric on each group, such as the `sum`, `count`, or `max`.\n",
    "- **Combine:** Put the per-group results back together into a single object, with one row per group."
   ]
},
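  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The splitting and combining can be sketched by hand on a small, made-up frame (the column names mirror ours, but the values are hypothetical). Looping over a groupby object yields one `(key, sub-DataFrame)` pair per group, and the one-line groupby call gives the same answer.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Color': ['Red', 'Blue', 'Red', 'Blue', 'Red'],\n",
    "    'ListPrice': [10.0, 20.0, 30.0, 40.0, 50.0],\n",
    "})\n",
    "\n",
    "# split: one sub-DataFrame per Color, then aggregate each piece ourselves\n",
    "by_hand = {color: group['ListPrice'].max() for color, group in toy.groupby('Color')}\n",
    "\n",
    "# the same split and combine in a single call\n",
    "with_groupby = toy.groupby('Color')['ListPrice'].max()\n",
    "\n",
    "print(by_hand)\n",
    "print(with_groupby.to_dict())\n",
    "```"
   ]
  },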
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it out!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's group by `Color`, and get a count of products for each color."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"Color \n",
"Black 93 93 93 93 \n",
"Blue 26 26 26 26 \n",
"Grey 1 1 1 1 \n",
"Multi 8 8 8 8 \n",
"Red 38 38 38 38 \n",
"Silver 43 43 43 43 \n",
"Silver/Black 7 7 7 7 \n",
"White 4 4 4 4 \n",
"Yellow 36 36 36 36 \n",
"\n",
" SafetyStockLevel ReorderPoint StandardCost ListPrice Size \\\n",
"Color \n",
"Black 93 93 93 93 71 \n",
"Blue 26 26 26 26 25 \n",
"Grey 1 1 1 1 0 \n",
"Multi 8 8 8 8 7 \n",
"Red 38 38 38 38 37 \n",
"Silver 43 43 43 43 31 \n",
"Silver/Black 7 7 7 7 0 \n",
"White 4 4 4 4 4 \n",
"Yellow 36 36 36 36 36 \n",
"\n",
" SizeUnitMeasureCode ... ProductLine Class Style \\\n",
"Color ... \n",
"Black 55 ... 86 72 71 \n",
"Blue 22 ... 26 22 25 \n",
"Grey 0 ... 1 0 0 \n",
"Multi 0 ... 8 0 8 \n",
"Red 37 ... 38 37 37 \n",
"Silver 30 ... 31 30 30 \n",
"Silver/Black 0 ... 7 6 0 \n",
"White 0 ... 4 0 4 \n",
"Yellow 32 ... 36 32 36 \n",
"\n",
" ProductSubcategoryID ProductModelID SellStartDate \\\n",
"Color \n",
"Black 89 89 93 \n",
"Blue 26 26 26 \n",
"Grey 1 1 1 \n",
"Multi 8 8 8 \n",
"Red 38 38 38 \n",
"Silver 36 36 43 \n",
"Silver/Black 7 7 7 \n",
"White 4 4 4 \n",
"Yellow 36 36 36 \n",
"\n",
" SellEndDate DiscontinuedDate rowguid ModifiedDate \n",
"Color \n",
"Black 44 0 93 93 \n",
"Blue 0 0 26 26 \n",
"Grey 1 0 1 1 \n",
"Multi 3 0 8 8 \n",
"Red 30 0 38 38 \n",
"Silver 6 0 43 43 \n",
"Silver/Black 0 0 7 7 \n",
"White 2 0 4 4 \n",
"Yellow 0 0 36 36 \n",
"\n",
"[9 rows x 23 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# group by Color, giving the number of products of each color\n",
"prod.groupby(['Color']).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do we notice about this output? Are all columns the same? Why or why not?\n",
"\n",
"We can see that the `.count()` method excludes nulls, and there is no way to change this with the current implementation:\n",
"```python\n",
"Signature: .count()\n",
"Docstring: Compute count of group, excluding missing values \n",
"File: ~/miniconda3/envs/ga/lib/python3.7/site-packages/pandas/core/groupby/groupby.py\n",
"Type: method\n",
"```\n",
"\n",
"As a best practice, you should either:\n",
"- fill in nulls prior to your .count(), or\n",
"- use the PK (primary key) of the table, which is guaranteed non-null"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
" Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"Color \n",
"Black 93 93 93 93 \n",
"Blue 26 26 26 26 \n",
"Grey 1 1 1 1 \n",
"Multi 8 8 8 8 \n",
"Red 38 38 38 38 \n",
"Silver 43 43 43 43 \n",
"Silver/Black 7 7 7 7 \n",
"White 4 4 4 4 \n",
"Yellow 36 36 36 36 \n",
"x 248 248 248 248 \n",
"\n",
" SafetyStockLevel ReorderPoint StandardCost ListPrice Size \\\n",
"Color \n",
"Black 93 93 93 93 93 \n",
"Blue 26 26 26 26 26 \n",
"Grey 1 1 1 1 1 \n",
"Multi 8 8 8 8 8 \n",
"Red 38 38 38 38 38 \n",
"Silver 43 43 43 43 43 \n",
"Silver/Black 7 7 7 7 7 \n",
"White 4 4 4 4 4 \n",
"Yellow 36 36 36 36 36 \n",
"x 248 248 248 248 248 \n",
"\n",
" SizeUnitMeasureCode ... ProductLine Class Style \\\n",
"Color ... \n",
"Black 93 ... 93 93 93 \n",
"Blue 26 ... 26 26 26 \n",
"Grey 1 ... 1 1 1 \n",
"Multi 8 ... 8 8 8 \n",
"Red 38 ... 38 38 38 \n",
"Silver 43 ... 43 43 43 \n",
"Silver/Black 7 ... 7 7 7 \n",
"White 4 ... 4 4 4 \n",
"Yellow 36 ... 36 36 36 \n",
"x 248 ... 248 248 248 \n",
"\n",
" ProductSubcategoryID ProductModelID SellStartDate \\\n",
"Color \n",
"Black 93 93 93 \n",
"Blue 26 26 26 \n",
"Grey 1 1 1 \n",
"Multi 8 8 8 \n",
"Red 38 38 38 \n",
"Silver 43 43 43 \n",
"Silver/Black 7 7 7 \n",
"White 4 4 4 \n",
"Yellow 36 36 36 \n",
"x 248 248 248 \n",
"\n",
" SellEndDate DiscontinuedDate rowguid ModifiedDate \n",
"Color \n",
"Black 93 93 93 93 \n",
"Blue 26 26 26 26 \n",
"Grey 1 1 1 1 \n",
"Multi 8 8 8 8 \n",
"Red 38 38 38 38 \n",
"Silver 43 43 43 43 \n",
"Silver/Black 7 7 7 7 \n",
"White 4 4 4 4 \n",
"Yellow 36 36 36 36 \n",
"x 248 248 248 248 \n",
"\n",
"[10 rows x 23 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# here we can use 'x' as a dummy placeholder for nulls, simply to get consistent counts for all columns\n",
"prod.fillna(value='x').groupby(['Color']).count()"
]
},
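  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If all we need is the number of rows per group, `.size()` is another option worth knowing: it counts rows whether or not they contain nulls. A minimal sketch on made-up data (not the AdventureWorks file):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Color': ['Red', 'Red', 'Blue'],\n",
    "    'Size': ['M', np.nan, np.nan],\n",
    "})\n",
    "\n",
    "grouped = toy.groupby('Color')\n",
    "\n",
    "# count() skips nulls, so it reports the non-null Size values per group\n",
    "non_null_sizes = grouped['Size'].count()\n",
    "\n",
    "# size() counts the rows in each group, nulls included\n",
    "rows_per_group = grouped.size()\n",
    "\n",
    "print(non_null_sizes.to_dict())\n",
    "print(rows_per_group.to_dict())\n",
    "```\n",
    "\n",
    "Note that rows whose `Color` itself is null are still left out of both results, because groupby drops null keys by default."
   ]
  },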
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's find out the most expensive price for an item, by `Color`:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Color\n",
"Red 3578.27\n",
"Silver 3399.99\n",
"Black 3374.99\n",
"Blue 2384.07\n",
"Yellow 2384.07\n",
"Grey 125.00\n",
"Multi 89.99\n",
"Silver/Black 80.99\n",
"White 9.50\n",
"Name: ListPrice, dtype: float64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.groupby('Color')['ListPrice'].max().sort_values(ascending=False)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also do multi-level groupbys. The result is indexed by a `MultiIndex`, with one level per grouping column. Here, we nest the following fields in a group by and count `Name` (with nulls filled!), effectively giving us the number of products for every unique Class/Style combination:\n",
    "\n",
    "- Class - H = High, M = Medium, L = Low\n",
    "- Style - W = Womens, M = Mens, U = Universal"
   ]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
" Name\n",
"Class Style \n",
"H U 64\n",
"L U 68\n",
"M U 22\n",
" W 22"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.fillna(value={'Name': 'x'}).groupby(by=['Class', 'Style']).count()[['Name']]"
]
},
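  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A multi-level result can be awkward to work with at first. The sketch below, on made-up rows, shows two common moves: looking up one combination with a tuple, and flattening the index back into ordinary columns with `.reset_index()`. The column name `n_products` is just an illustrative choice.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'Class': ['H', 'H', 'M', 'M', 'M'],\n",
    "    'Style': ['U', 'U', 'U', 'W', 'W'],\n",
    "    'Name': ['a', 'b', 'c', 'd', 'e'],\n",
    "})\n",
    "\n",
    "counts = toy.groupby(['Class', 'Style'])['Name'].count()\n",
    "\n",
    "# each row label is a (Class, Style) tuple\n",
    "medium_womens = counts.loc[('M', 'W')]\n",
    "\n",
    "# reset_index turns the index levels back into regular columns\n",
    "flat = counts.reset_index(name='n_products')\n",
    "\n",
    "print(medium_womens)\n",
    "print(flat)\n",
    "```"
   ]
  },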
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use the `.agg()` method with multiple arguments, to simulate a `.describe()` method like we used before:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
    {
     "data": {
"text/plain": [
" count mean min max\n",
"Color \n",
"Black 93 725.121075 0.00 3374.99\n",
"Blue 26 923.679231 34.99 2384.07\n",
"Grey 1 125.000000 125.00 125.00\n",
"Multi 8 59.865000 8.99 89.99\n",
"Red 38 1401.950000 34.99 3578.27\n",
"Silver 43 850.305349 0.00 3399.99\n",
"Silver/Black 7 64.018571 40.49 80.99\n",
"White 4 9.245000 8.99 9.50\n",
"Yellow 36 959.091389 53.99 2384.07"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.groupby('Color')['ListPrice'].agg(['count', 'mean', 'min', 'max'])"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apply functions for column operations\n",
    "\n",
    "Apply functions allow us to perform a complex operation across an entire column in a single call."
   ]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, let's say we want to change our colors from a word to just a single letter. How would we do that?\n",
    "\n",
    "The first step is writing a function, with the argument being the value we would receive from each cell in the column. This function will transform the input and return the result. This result can then be assigned back to the source dataframe (if desired)."
   ]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'Black', 'Silver', 'Red', 'White', 'Blue', 'Multi', 'Yellow',\n",
" 'Grey', 'Silver/Black'], dtype=object)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod['Color'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"def color_to_letter(color):\n",
" # this maps the original color to the color we want it to be\n",
" color_dict = {\n",
" 'Black': 'B',\n",
" 'Silver': 'S',\n",
" 'Red': 'R',\n",
" 'White': 'W',\n",
" 'Blue': 'B',\n",
" 'Multi': 'M',\n",
" 'Yellow': 'Y',\n",
" 'Grey': 'G',\n",
" 'Silver/Black': 'V'\n",
" }\n",
" try:\n",
" return color_dict[color]\n",
" # this catches nulls, or any other color we haven't\n",
" # defined in our color_dict, and fills it with 'N'\n",
" except KeyError:\n",
" return 'N'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can _apply_ this function to our `pd.Series` object, returning the result (which we can use to overwrite the source, if we choose)."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ProductID\n",
"1 N\n",
"2 N\n",
"3 N\n",
"4 N\n",
"316 N\n",
"317 B\n",
"318 B\n",
"319 B\n",
"320 S\n",
"321 S\n",
"Name: Color, dtype: object"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod['Color'].apply(color_to_letter).head(10)"
]
},
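  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the function is nothing more than a dictionary lookup, `pd.Series.map()` is a possible alternative: it accepts the dictionary directly and leaves nulls and unknown keys as `NaN`, which we can then fill. A minimal sketch, using a shortened dictionary and a made-up Series (`'Purple'` stands in for a color we have not defined):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "colors = pd.Series(['Black', 'Silver', np.nan, 'Purple'])\n",
    "\n",
    "# shortened version of the lookup used above\n",
    "color_dict = {'Black': 'B', 'Silver': 'S'}\n",
    "\n",
    "# map() returns NaN for nulls and for keys missing from the dict;\n",
    "# fillna then supplies the same 'N' fallback\n",
    "letters = colors.map(color_dict).fillna('N')\n",
    "\n",
    "print(letters.tolist())\n",
    "```"
   ]
  },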
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pd.DataFrame.apply` implementation is similar, however it effectively 'scrolls through' the columns and passes each one sequentially to your function:\n",
"\n",
"```python\n",
"Objects passed to the function are Series objects whose index is\n",
"either the DataFrame's index (``axis=0``) or the DataFrame's columns\n",
"(``axis=1``).\n",
"```\n",
"\n",
"It should only be used when you wish to apply the same function to all columns (or rows) of your `pd.DataFrame` object."
]
},
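  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `axis` argument concrete, here is a small sketch on made-up numbers. With the default `axis=0` the function receives one column at a time; with `axis=1` it receives one row at a time.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'StandardCost': [1.0, 2.0, 3.0],\n",
    "    'ListPrice': [2.0, 4.0, 9.0],\n",
    "})\n",
    "\n",
    "# axis=0 (default): each column is passed to the function as a Series\n",
    "col_ranges = toy.apply(lambda col: col.max() - col.min())\n",
    "\n",
    "# axis=1: each row is passed to the function as a Series\n",
    "margins = toy.apply(lambda row: row['ListPrice'] - row['StandardCost'], axis=1)\n",
    "\n",
    "print(col_ranges.to_dict())\n",
    "print(margins.tolist())\n",
    "```"
   ]
  },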
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use `pd.Series.apply()` with a **lambda expression**. This is an anonymous (unnamed) function and is commonly used for simple functions within the `.apply()` method. Let's use it to add $100 to our `ListPrice` column. Hey, baby needs new shoes!"
   ]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ProductID\n",
"990 539.99\n",
"991 539.99\n",
"992 539.99\n",
"993 539.99\n",
"994 53.99\n",
"995 101.24\n",
"996 121.49\n",
"997 539.99\n",
"998 539.99\n",
"999 539.99\n",
"Name: ListPrice, dtype: float64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# without apply\n",
"prod['ListPrice'].tail(10)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ProductID\n",
"990 639.99\n",
"991 639.99\n",
"992 639.99\n",
"993 639.99\n",
"994 153.99\n",
"995 201.24\n",
"996 221.49\n",
"997 639.99\n",
"998 639.99\n",
"999 639.99\n",
"Name: ListPrice, dtype: float64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and now with 100 more dollars!\n",
"prod['ListPrice'].apply(lambda x: x + 100).tail(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Boom! Maybe financing that new boat wasn't such a bad idea after all!"
]
},
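  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simple arithmetic like this, the same result is also available without `.apply()`, because pandas applies operators such as `+` to every element of a Series. A short sketch with made-up prices:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, for illustration only\n",
    "prices = pd.Series([539.99, 53.99, 101.24])\n",
    "\n",
    "# element by element, through a Python function\n",
    "with_apply = prices.apply(lambda x: x + 100)\n",
    "\n",
    "# the same arithmetic written directly on the Series\n",
    "vectorized = prices + 100\n",
    "\n",
    "same_result = bool((with_apply == vectorized).all())\n",
    "print(same_result)\n",
    "```\n",
    "\n",
    "`.apply()` earns its keep when the logic does not fit a built-in operator or method, as with `color_to_letter` above."
   ]
  },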
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Your turn:** Identify one other column where we may want to write a new apply function, or use the one we just created for the purposes of cleaning up our dataset."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# identify a column to mutate (change)\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# write a function to mutate that column (or columns) note: if using a lambda function, you can leave this blank\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# apply that function across the whole column\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrap up\n",
    "\n",
    "We've covered even more useful information! Here are the key takeaways:\n",
    "\n",
    "- **Missing data** comes in many shapes and sizes. Before deciding how to handle it, we first confirm that it exists. We then assess how the missingness affects our dataset, and decide how to fill in the values.\n",
    "\n",
    "```python\n",
    "# pro tip for identifying missing data\n",
    "df.isnull().sum()\n",
    "```\n",
    "\n",
    "- **Groupby** statements are particularly useful for a subsection-of-interest analysis. Specifically, zooming in on one condition, and determining relevant statistics.\n",
    "\n",
    "```python\n",
    "# group by one column, then summarize another\n",
    "df.groupby('column')['value_column'].agg(['count', 'mean', 'max', 'min'])\n",
    "```\n",
    "\n",
    "- **Apply functions** help us clean values across an entire DataFrame column. They are *like* a for loop for cleaning, but more concise and usually faster than looping over rows by hand. They follow a common pattern:\n",
    "1. Write a function that works on a single value\n",
    "2. Test that function on a single value\n",
    "3. Apply that function to a whole column\n",
    "\n",
    "(The most confusing part of apply functions is that we write them with *a single value* in mind, and then apply them to many single values at once.)"
   ]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}