{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas for Exploratory Data Analysis I \n", "by [@josephofiowa](https://twitter.com/josephofiowa)\n", "\n", "Pandas is the most prominent Python library for exploratory data analysis (EDA). The functions Pandas supports are integral to understanding, formatting, and preparing our data. Formally, we use Pandas to investigate, wrangle, munge, and clean our data. Pandas is the Swiss Army Knife of data manipulation!\n", "\n", "\n", "We'll have two coding-heavy sessions on Pandas. In this one, we'll use Pandas to:\n", " - Read in a dataset\n", " - Investigate a dataset's integrity\n", " - Filter, sort, and manipulate a DataFrame's series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About the Dataset: Adventureworks Cycles\n", "\n", "\n", "\n", "For today's Pandas exercises, we will be using a dataset developed by Microsoft for training purposes in SQL Server, known as the [Adventureworks Cycles 2014 OLTP Database](https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks). It is based on a fictitious company called Adventure Works Cycles (AWC), a multinational manufacturer and seller of bicycles and accessories. The company is based in Bothell, Washington, USA and has regional sales offices in several countries. We will be looking at a single table from this database, the Production.Product table, which outlines some of the products this company sells. 
\n", "\n", "A full data dictionary can be viewed [here](https://www.sqldatadictionary.com/AdventureWorks2014/).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a closer look at the Production.Product table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which is a description of the fields (columns) in the table (the .csv file we will import below):\n", "- **ProductID** - Primary key for Product records.\n", "- **Name** - Name of the product.\n", "- **ProductNumber** - Unique product identification number.\n", "- **MakeFlag** - 0 = Product is purchased, 1 = Product is manufactured in-house.\n", "- **FinishedGoodsFlag** - 0 = Product is not a salable item. 1 = Product is salable.\n", "- **Color** - Product color.\n", "- **SafetyStockLevel** - Minimum inventory quantity.\n", "- **ReorderPoint** - Inventory level that triggers a purchase order or work order.\n", "- **StandardCost** - Standard cost of the product.\n", "- **ListPrice** - Selling price.\n", "- **Size** - Product size.\n", "- **SizeUnitMeasureCode** - Unit of measure for the Size column.\n", "- **WeightUnitMeasureCode** - Unit of measure for the Weight column.\n", "- **Weight** - Product weight.\n", "- **DaysToManufacture** - Number of days required to manufacture the product.\n", "- **ProductLine** - R = Road, M = Mountain, T = Touring, S = Standard\n", "- **Class** - H = High, M = Medium, L = Low\n", "- **Style** - W = Womens, M = Mens, U = Universal\n", "- **ProductSubcategoryID** - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.\n", "- **ProductModelID** - Product is a member of this product model. 
Foreign key to ProductModel.ProductModelID.\n", "- **SellStartDate** - Date the product was available for sale.\n", "- **SellEndDate** - Date the product was no longer available for sale.\n", "- **DiscontinuedDate** - Date the product was discontinued.\n", "- **rowguid** - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.\n", "- **ModifiedDate** - Date and time the record was last updated.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Pandas\n", "\n", "To [import a library](https://docs.python.org/3/reference/import.html), we write `import` and the library name. For Pandas, it is common to alias the library as `pd` so that when we reference a function from the Pandas library, we only write `pd` to reference the aliased [namespace](https://docs.python.org/3/tutorial/classes.html#python-scopes-and-namespaces) -- not `pandas`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am using pandas Version: 1.2.1.\n", "It is installed at: ['C:\\\\Users\\\\Andrew\\\\Anaconda3\\\\lib\\\\site-packages\\\\pandas']\n" ] } ], "source": [ "# we can see details about the imported package by referencing its module-level dunder attributes:\n", "print(f'I am using {pd.__name__} \\\n", "Version: {pd.__version__}.\\n\\\n", "It is installed at: {pd.__path__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading in Data\n", "\n", "Pandas dramatically simplifies the process of reading in data. 
When we say \"reading in data,\" we mean loading a file into our machine's memory.\n", "\n", "When you have a CSV, for example, and you double-click to open it in Microsoft Excel, the open file is \"read into memory.\" You can now manipulate the CSV.\n", "\n", "When we read data into memory in Python, we are creating an object. We will soon explore this object. _(And, as an aside, when we have a file that is greater than the size of our computer's memory, this is approaching \"Big Data.\")_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we are working with a CSV, we will use the [read CSV](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function, `pd.read_csv`.
A [delimiter](https://en.wikipedia.org/wiki/Delimiter-separated_values) is a character that separates fields (columns) in the imported file. Just because a file says `.csv` does not necessarily mean that a comma is used as the delimiter. In this case, we have a tab character as the delimiter for our columns, so we will be using `sep='\\t'` to tell pandas to 'cut' the columns every time it sees a [tab character in the file](http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "prod = pd.read_csv('../data/Production.Product.csv', sep='\\t')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ProductID Name ProductNumber MakeFlag FinishedGoodsFlag \\\n", "0 1 Adjustable Race AR-5381 0 0 \n", "\n", " Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n", "0 NaN 1000 750 0.0 0.0 ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "0 NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "0 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid ModifiedDate \n", "0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n", "\n", "[1 rows x 25 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Documentation Pause*\n", "\n", "How did we know how to use `pd.read_csv`? Let's take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Note the first argument required (`filepath`).\n", "> Take a moment to dissect other arguments and options when reading in data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have just created a data structure called a `DataFrame`. See?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(prod)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting our DataFrame: The basics\n", "\n", "We'll now perform basic operations on the DataFrame, denoted with comments." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ProductID Name ProductNumber MakeFlag FinishedGoodsFlag \\\n", "0 1 Adjustable Race AR-5381 0 0 \n", "1 2 Bearing Ball BA-8327 0 0 \n", "2 3 BB Ball Bearing BE-2349 1 0 \n", "\n", " Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n", "0 NaN 1000 750 0.0 0.0 ... \n", "1 NaN 1000 750 0.0 0.0 ... \n", "2 NaN 800 600 0.0 0.0 ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "0 NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "0 2008-04-30 00:00:00 NaN NaN \n", "1 2008-04-30 00:00:00 NaN NaN \n", "2 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid ModifiedDate \n", "0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n", "1 {58AE3C20-4F3A-4749-A7D4-D568806CC537} 2014-02-08 10:01:36.827000000 \n", "2 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 25 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# print the first and last 3 rows\n", "prod.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `.head()` is a method (denoted by parantheses), so it takes arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Class Question:** \n", "- What do you think changes if we pass a different number `head()` argument?\n", "- How would we print the last 5 rows?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(504, 25)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# identify the shape (rows by columns)\n", "prod.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have 504 rows, and 25 columns. 
This is a tuple, so we can extract the parts using indexing:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "504" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# print the number of rows\n", "prod.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the index\n", "An [index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html) is an immutable, ordered, sliceable sequence of labels; it is the basic object that stores the axis labels for all pandas objects. Think of it as a 'row address' for your data frame (table). It is often useful to explicitly set the index of your dataframe, because the index commonly serves as a [primary key](https://en.wikipedia.org/wiki/Primary_key) that can be used to [join](https://www.w3schools.com/sql/sql_join.asp) your dataframe to other dataframes.\n", "\n", "The index can hold different types (int, string, datetime). When importing data, pandas automatically assigns a number to each row, starting at zero and counting up. You can overwrite this default index, which is what we are going to do." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=504, step=1)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# displaying the index as it sits (auto-generated upon import)\n", "prod.index" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "# also note that our auto-generated index has no name\n", "print(prod.index.name)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ProductID Name\n", "0 1 Adjustable Race\n", "1 2 Bearing Ball\n", "2 3 BB Ball Bearing" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Here we are looking at three columns;\n", "# the one on the left is the index (automatically generated upon import by pandas)\n", "# 'ProductID' is our PK (primary key) from our imported table. 'Name' is a data column.\n", "# Notice that the generated index starts at zero and our PK starts at 1.\n", "pd.DataFrame(prod.head(3)[['ProductID', 'Name']])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name\n", "ProductID \n", "1 Adjustable Race\n", "2 Bearing Ball\n", "3 BB Ball Bearing" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setting the index overwrites the automatically generated index\n", "# with our PK, which resided in the 'ProductID' column.\n", "prod.set_index('ProductID', inplace=True)\n", "prod.head(3)[['Name']]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Int64Index([ 1, 2, 3, 4, 316, 317, 318, 319, 320, 321,\n", " ...\n", " 990, 991, 992, 993, 994, 995, 996, 997, 998, 999],\n", " dtype='int64', name='ProductID', length=504)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note how our index property has changed as a result\n", "prod.index" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ProductID\n" ] } ], "source": [ "# And our index has also inherited the name of our 'ProductID' column\n", "print(prod.index.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Column headers and datatypes" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag', 'Color',\n", " 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size',\n", " 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight',\n", " 'DaysToManufacture', 'ProductLine', 'Class', 'Style',\n", " 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate',\n", " 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate'],\n", " dtype='object')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# print the columns\n", "prod.columns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ 
"
" ], "text/plain": [ " DataType\n", "Name object\n", "ProductNumber object\n", "MakeFlag int64\n", "FinishedGoodsFlag int64\n", "Color object\n", "SafetyStockLevel int64\n", "ReorderPoint int64\n", "StandardCost float64\n", "ListPrice float64\n", "Size object\n", "SizeUnitMeasureCode object\n", "WeightUnitMeasureCode object\n", "Weight float64\n", "DaysToManufacture int64\n", "ProductLine object\n", "Class object\n", "Style object\n", "ProductSubcategoryID float64\n", "ProductModelID float64\n", "SellStartDate object\n", "SellEndDate object\n", "DiscontinuedDate float64\n", "rowguid object\n", "ModifiedDate object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the datatypes of the columns\n", "# note that these were automatically inferred by pandas upon import!\n", "pd.DataFrame(prod.dtypes, columns=['DataType'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Class Question:** Why do datatypes matter? What operations could we perform on some datatypes that we could not on others? Note the importance of this in checking dataset integrity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting a Column\n", "\n", "We can select columns in two ways. Either we treat the column as an attribute of the DataFrame or we index the DataFrame for a specific element (in this case, the element is a column name)." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "If I use SINGLE brackets, pandas returns a <class 'pandas.core.series.Series'> \n", "If I use DOUBLE brackets, pandas returns a <class 'pandas.core.frame.DataFrame'>\n" ] } ], "source": [ "print('If I use SINGLE brackets, pandas returns a', \n", " type(prod['Name']),\n", " '\\nIf I use DOUBLE brackets, pandas returns a',\n", " type(prod[['Name']]))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ProductID\n", "1 Adjustable Race\n", "2 Bearing Ball\n", "3 BB Ball Bearing\n", "Name: Name, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the Name column as a Series object\n", "prod['Name'].head(3)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name\n", "ProductID \n", "1 Adjustable Race\n", "2 Bearing Ball\n", "3 BB Ball Bearing" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select the Name column, DataFrame object\n", "prod[['Name']].head(3)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name ProductNumber\n", "ProductID \n", "1 Adjustable Race AR-5381\n", "2 Bearing Ball BA-8327\n", "3 BB Ball Bearing BE-2349" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# selecting > 1 column (must use double brackets!)\n", "prod[['Name', 'ProductNumber']].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Class Question:** What if we wanted to select a column that has a space in it? Which method from the above two would we use? Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## loc and iloc\n", "\n", "`loc` and `iloc` are ways to select multiple rows and columns _at the same time_. \n", "- `loc` uses label-based selection (the index values and column names)\n", "- `iloc` used position-based selection (the position of the row/column within the df)\n", "- For each of them, you specify the rows you want first, followed by the columns. The row values are required, the columns are not." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name ProductNumber MakeFlag FinishedGoodsFlag Color\n", "ProductID \n", "1 Adjustable Race AR-5381 0 0 NaN\n", "2 Bearing Ball BA-8327 0 0 NaN\n", "3 BB Ball Bearing BE-2349 1 0 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use loc to select columns bewteen 'Name' and `Color, and rows with index value 1-3\n", "# Note that both endpoints are included for the row and column ranges; this is slightly different than the Python range() function or list slicing\n", "prod.loc[1:3, 'Name':'Color']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rowguid \\\n", "ProductID \n", "1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n", "2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n", "3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n", "4 {ECFED6CB-51FF-49B5-B06C-7D8AC834DB8B} \n", "316 {E73E9750-603B-4131-89F5-3DD15ED5FF80} \n", "\n", " ModifiedDate \n", "ProductID \n", "1 2014-02-08 10:01:36.827000000 \n", "2 2014-02-08 10:01:36.827000000 \n", "3 2014-02-08 10:01:36.827000000 \n", "4 2014-02-08 10:01:36.827000000 \n", "316 2014-02-08 10:01:36.827000000 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use iloc to get the first 5 rows and the last two columns\n", "# The 0 at the start of the row range is optional\n", "# iloc *does* work like list slicing in that the endpoint of the range is excluded\n", "prod.iloc[0:5, -2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Renaming Columns\n", "\n", "Perhaps we want to rename our columns. There are a few options for doing this.\n", "\n", "Renaming **specific** columns by using a dictionary:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ProductName Number MakeFlag FinishedGoodsFlag Color \\\n", "ProductID \n", "1 Adjustable Race AR-5381 0 0 NaN \n", "2 Bearing Ball BA-8327 0 0 NaN \n", "3 BB Ball Bearing BE-2349 1 0 NaN \n", "\n", " SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n", "ProductID ... \n", "1 1000 750 0.0 0.0 NaN ... \n", "2 1000 750 0.0 0.0 NaN ... \n", "3 800 600 0.0 0.0 NaN ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "ProductID \n", "1 NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "ProductID \n", "1 2008-04-30 00:00:00 NaN NaN \n", "2 2008-04-30 00:00:00 NaN NaN \n", "3 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n", "2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n", "3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n", "\n", " ModifiedDate \n", "ProductID \n", "1 2014-02-08 10:01:36.827000000 \n", "2 2014-02-08 10:01:36.827000000 \n", "3 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename one or more columns with a dictionary. Note: inplace=False will return a new dataframe,\n", "# but leave the original dataframe untouched. Change this to True to modify the original dataframe.\n", "prod.rename(columns={'Name': 'ProductName', 'ProductNumber':'Number'}, inplace=False).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Renaming **ALL** columns with a new list of column names.\n", "\n", "Note that the `pd.DataFrame.columns` property can be cast to a `list` type. 
Originally, it's a `pd.core.indexes.base.Index` object:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "My columns look like:\n", " Index(['Name', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag', 'Color',\n", " 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size',\n", " 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight',\n", " 'DaysToManufacture', 'ProductLine', 'Class', 'Style',\n", " 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate',\n", " 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate'],\n", " dtype='object')\n", "\n", "And the type is:\n", " <class 'pandas.core.indexes.base.Index'>\n" ] } ], "source": [ "print('My columns look like:\\n', prod.columns)\n", "print('\\nAnd the type is:\\n', type(prod.columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explicitly convert these to a list object by using the built-in `list()` function:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Now my columns look like:\n", " ['Name', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag', 'Color', 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size', 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight', 'DaysToManufacture', 'ProductLine', 'Class', 'Style', 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate']\n", "\n", "And are of type:\n", " <class 'list'>\n" ] } ], "source": [ "print('Now my columns look like:\\n', list(prod.columns))\n", "print('\\nAnd are of type:\\n', type(list(prod.columns)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can place these columns into a variable, `cols`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Name', 'ProductNumber', 'MakeFlag', 
'FinishedGoodsFlag', 'Color', 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size', 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight', 'DaysToManufacture', 'ProductLine', 'Class', 'Style', 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate']\n" ] } ], "source": [ "# declare a list of strings - these strings will become the new column names\n", "cols = list(prod.columns)\n", "print(cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use list indexing to mutate the columns we want:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['NewName', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag', 'Color', 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size', 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight', 'DaysToManufacture', 'ProductLine', 'Class', 'Style', 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate']\n" ] } ], "source": [ "cols[0] = 'NewName'\n", "print(cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can set the `pd.DataFrame.columns` property (this is a settable class property), to the new `cols` list, overwriting the existing columns header names:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['NewName', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag', 'Color',\n", " 'SafetyStockLevel', 'ReorderPoint', 'StandardCost', 'ListPrice', 'Size',\n", " 'SizeUnitMeasureCode', 'WeightUnitMeasureCode', 'Weight',\n", " 'DaysToManufacture', 'ProductLine', 'Class', 'Style',\n", " 'ProductSubcategoryID', 'ProductModelID', 'SellStartDate',\n", " 'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate'],\n", " dtype='object')\n" ] } ], "source": [ "# 
Note that our first column name has changed from 'Name' to 'NewName'\n", "prod.columns = cols\n", "print(prod.columns)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " NewName\n", "ProductID \n", "1 Adjustable Race\n", "2 Bearing Ball\n", "3 BB Ball Bearing" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod.head(3)[['NewName']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Column Operations\n", "\n", "While this list is not comprehensive, these are a few key column-specific data checks.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Descriptive statistics:** the minimum, first quartile, median, third quartile, and maximum.\n", "\n", "(And more! `.describe()` also reports the count, mean, and standard deviation.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Five Number Summary (assumes numeric data):\n", "- **Min:** The smallest value in the column\n", "- **Max:** The largest value in the column\n", "- **Quartile:** Quartiles are the three cut points that split the sorted data into four equal parts\n", " - **First quartile:** The value below which the lowest 25 percent of the data falls (the 25th percentile)\n", " - **Median:** The middle value. (Line up all values from smallest to biggest - the median is the middle!) Also the 50th percentile\n", " - **Third quartile:** The value below which 75 percent of the data falls (the 75th percentile)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://www.mathsisfun.com/data/images/quartiles-a.svg)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " MakeFlag SafetyStockLevel StandardCost\n", "count 504.000000 504.000000 504.000000\n", "mean 0.474206 535.150794 258.602961\n", "std 0.499830 374.112954 461.632808\n", "min 0.000000 4.000000 0.000000\n", "25% 0.000000 100.000000 0.000000\n", "50% 0.000000 500.000000 23.372200\n", "75% 1.000000 1000.000000 317.075825\n", "max 1.000000 1000.000000 2171.294200" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# note - by default, describe() only summarizes numeric columns\n", "prod[['MakeFlag', 'SafetyStockLevel', 'StandardCost']].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Value Counts:** `pd.Series.value_counts()` counts the occurrences of each value within our series (nulls are excluded by default)." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Black 93\n", "Silver 43\n", "Red 38\n", "Yellow 36\n", "Blue 26\n", "Multi 8\n", "Silver/Black 7\n", "White 4\n", "Grey 1\n", "Name: Color, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the most popular product colors (aggregated by count, descending by default)\n", "prod['Color'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Unique values:** List the distinct values within a given series, or count how many there are." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, 'Black', 'Silver', 'Red', 'White', 'Blue', 'Multi', 'Yellow',\n", " 'Grey', 'Silver/Black'], dtype=object)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What are the unique colors for the products?\n", "prod['Color'].unique()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# HOW MANY distinct colors are there?\n", "prod['Color'].nunique()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can also count nulls as a distinct value with .nunique():\n", "prod['Color'].nunique(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering on a Single Condition\n", "\n", "Filtering and sorting are key processes that allow us to drill into the 'nitty gritty' and examine cross sections of our dataset.\n", "\n", "To filter, we use a process called **Boolean Filtering**, wherein we define a Boolean condition, and use that Boolean condition to filter our DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall: our given dataset has a column `Color`. Let's see if we can find all products that are `Black`. 
Let's take a look at the first 10 rows of the `Color` column to see how it looks as-is:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ProductID\n", "1 NaN\n", "2 NaN\n", "3 NaN\n", "4 NaN\n", "316 NaN\n", "317 Black\n", "318 Black\n", "319 Black\n", "320 Silver\n", "321 Silver\n", "Name: Color, dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod['Color'].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By comparing this column to `'Black'` with `==`, we get a Boolean mask, a series of `True`/`False` values:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ProductID\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "316 False\n", "317 True\n", "318 True\n", "319 True\n", "320 False\n", "321 False\n", "Name: Color, dtype: bool" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod['Color'].head(10) == 'Black'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can take that mask and apply it to our full dataframe. Wherever the mask is `True`, the row is returned; wherever it is `False`, the row is dropped. The result is a dataframe that only has rows where `Color` is `Black`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NewName ProductNumber MakeFlag FinishedGoodsFlag Color \\\n", "ProductID \n", "317 LL Crankarm CA-5965 0 0 Black \n", "318 ML Crankarm CA-6738 0 0 Black \n", "319 HL Crankarm CA-7457 0 0 Black \n", "\n", " SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n", "ProductID ... \n", "317 500 375 0.0 0.0 NaN ... \n", "318 500 375 0.0 0.0 NaN ... \n", "319 500 375 0.0 0.0 NaN ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "ProductID \n", "317 NaN L NaN NaN NaN \n", "318 NaN M NaN NaN NaN \n", "319 NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "ProductID \n", "317 2008-04-30 00:00:00 NaN NaN \n", "318 2008-04-30 00:00:00 NaN NaN \n", "319 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "317 {3C9D10B7-A6B2-4774-9963-C19DCEE72FEA} \n", "318 {EABB9A92-FA07-4EAB-8955-F0517B4A4CA7} \n", "319 {7D3FD384-4F29-484B-86FA-4206E276FE58} \n", "\n", " ModifiedDate \n", "ProductID \n", "317 2014-02-08 10:01:36.827000000 \n", "318 2014-02-08 10:01:36.827000000 \n", "319 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod[prod['Color'] == 'Black'].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate the **average ListPrice** for the **salable products**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Think: What are the component parts of this problem?" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NewName ProductNumber MakeFlag \\\n", "ProductID \n", "680 HL Road Frame - Black, 58 FR-R92B-58 1 \n", "706 HL Road Frame - Red, 58 FR-R92R-58 1 \n", "707 Sport-100 Helmet, Red HL-U509-R 0 \n", "\n", " FinishedGoodsFlag Color SafetyStockLevel ReorderPoint \\\n", "ProductID \n", "680 1 Black 500 375 \n", "706 1 Red 500 375 \n", "707 1 Red 4 3 \n", "\n", " StandardCost ListPrice Size ... ProductLine Class Style \\\n", "ProductID ... \n", "680 1059.3100 1431.50 58 ... R H U \n", "706 1059.3100 1431.50 58 ... R H U \n", "707 13.0863 34.99 NaN ... S NaN NaN \n", "\n", " ProductSubcategoryID ProductModelID SellStartDate \\\n", "ProductID \n", "680 14.0 6.0 2008-04-30 00:00:00 \n", "706 14.0 6.0 2008-04-30 00:00:00 \n", "707 31.0 33.0 2011-05-31 00:00:00 \n", "\n", " SellEndDate DiscontinuedDate \\\n", "ProductID \n", "680 NaN NaN \n", "706 NaN NaN \n", "707 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "680 {43DD68D6-14A4-461F-9069-55309D90EA7E} \n", "706 {9540FF17-2712-4C90-A3D1-8CE5568B2462} \n", "707 {2E1EF41A-C08A-4FF6-8ADA-BDE58B64A712} \n", "\n", " ModifiedDate \n", "ProductID \n", "680 2014-02-08 10:01:36.827000000 \n", "706 2014-02-08 10:01:36.827000000 \n", "707 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, we need to get salable items. Use your data dictionary from the beginning of this lesson.\n", "prod[prod['FinishedGoodsFlag'] == 1].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we need to find average list price of those above items. Let's just get the 'ListPrice' column for starters." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ProductID\n", "680 1431.50\n", "706 1431.50\n", "707 34.99\n", "Name: ListPrice, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod[prod['FinishedGoodsFlag'] == 1]['ListPrice'].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the average of that column, just take `.mean()`" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "744.595220338982" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod[prod['FinishedGoodsFlag'] == 1]['ListPrice'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a shortcut and just use `.describe()` here:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 295.000000\n", "mean 744.595220\n", "std 892.563172\n", "min 2.290000\n", "25% 66.745000\n", "50% 337.220000\n", "75% 1100.240000\n", "max 3578.270000\n", "Name: ListPrice, dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod[prod['FinishedGoodsFlag'] == 1]['ListPrice'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sneak peek**: Another handy trick is to use `.hist()` to get a distribution of a continuous variable - in this case, `ListPrice`. 
We'll cover this more in future lessons:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "prod[prod['FinishedGoodsFlag'] == 1]['ListPrice'].hist(grid=False, bins=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering on Multiple Conditions\n", "\n", "Here, we will filter on _multiple conditions_. Before, we filtered on rows where Color was Black. We also filtered where FinishedGoodsFlag was equal to 1. Let's see what happens when we filter on *both* simultaneously. \n", "\n", "The format for multiple conditions is:\n", "\n", "`df[ (df['col1'] == value1) & (df['col2'] == value2) ]`\n", "\n", "Or, more simply:\n", "\n", "`df[ (CONDITION 1) & (CONDITION 2) ]`\n", "\n", "Which eventually may evaluate to something like:\n", "\n", "`df[ True & False ]`\n", "\n", "...on a row-by-row basis. If the end result is `False`, the row is omitted.\n", "\n", "_Don't forget parentheses in your conditions!_ This is a common mistake." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NewName ProductNumber MakeFlag \\\n", "ProductID \n", "680 HL Road Frame - Black, 58 FR-R92B-58 1 \n", "708 Sport-100 Helmet, Black HL-U509 0 \n", "722 LL Road Frame - Black, 58 FR-R38B-58 1 \n", "\n", " FinishedGoodsFlag Color SafetyStockLevel ReorderPoint \\\n", "ProductID \n", "680 1 Black 500 375 \n", "708 1 Black 4 3 \n", "722 1 Black 500 375 \n", "\n", " StandardCost ListPrice Size ... ProductLine Class Style \\\n", "ProductID ... \n", "680 1059.3100 1431.50 58 ... R H U \n", "708 13.0863 34.99 NaN ... S NaN NaN \n", "722 204.6251 337.22 58 ... R L U \n", "\n", " ProductSubcategoryID ProductModelID SellStartDate \\\n", "ProductID \n", "680 14.0 6.0 2008-04-30 00:00:00 \n", "708 31.0 33.0 2011-05-31 00:00:00 \n", "722 14.0 9.0 2011-05-31 00:00:00 \n", "\n", " SellEndDate DiscontinuedDate \\\n", "ProductID \n", "680 NaN NaN \n", "708 NaN NaN \n", "722 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "680 {43DD68D6-14A4-461F-9069-55309D90EA7E} \n", "708 {A25A44FB-C2DE-4268-958F-110B8D7621E2} \n", "722 {2140F256-F705-4D67-975D-32DE03265838} \n", "\n", " ModifiedDate \n", "ProductID \n", "680 2014-02-08 10:01:36.827000000 \n", "708 2014-02-08 10:01:36.827000000 \n", "722 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at a table where Color is Black, and FinishedGoodsFlag is 1\n", "prod[ (prod['Color'] == 'Black') & (prod['FinishedGoodsFlag'] == 1) ].head(3)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " NewName ProductNumber MakeFlag FinishedGoodsFlag Color \\\n", "ProductID \n", "1 Adjustable Race AR-5381 0 0 NaN \n", "2 Bearing Ball BA-8327 0 0 NaN \n", "3 BB Ball Bearing BE-2349 1 0 NaN \n", "\n", " SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n", "ProductID ... \n", "1 1000 750 0.0 0.0 NaN ... \n", "2 1000 750 0.0 0.0 NaN ... \n", "3 800 600 0.0 0.0 NaN ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "ProductID \n", "1 NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "ProductID \n", "1 2008-04-30 00:00:00 NaN NaN \n", "2 2008-04-30 00:00:00 NaN NaN \n", "3 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n", "2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n", "3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n", "\n", " ModifiedDate \n", "ProductID \n", "1 2014-02-08 10:01:36.827000000 \n", "2 2014-02-08 10:01:36.827000000 \n", "3 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Here we have an example of a list price of greater than 50, \n", "# OR a product size that is not equal to 'XL'\n", "\n", "prod[ (prod['ListPrice'] > 50) | (prod['Size'] != 'XL') ].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting\n", "\n", "We can sort one column of our DataFrame as well." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " NewName ProductNumber MakeFlag FinishedGoodsFlag Color \\\n", "ProductID \n", "749 Road-150 Red, 62 BK-R93R-62 1 1 Red \n", "750 Road-150 Red, 44 BK-R93R-44 1 1 Red \n", "751 Road-150 Red, 48 BK-R93R-48 1 1 Red \n", "\n", " SafetyStockLevel ReorderPoint StandardCost ListPrice Size ... \\\n", "ProductID ... \n", "749 100 75 2171.2942 3578.27 62 ... \n", "750 100 75 2171.2942 3578.27 44 ... \n", "751 100 75 2171.2942 3578.27 48 ... \n", "\n", " ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "ProductID \n", "749 R H U 2.0 25.0 \n", "750 R H U 2.0 25.0 \n", "751 R H U 2.0 25.0 \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "ProductID \n", "749 2011-05-31 00:00:00 2012-05-29 00:00:00 NaN \n", "750 2011-05-31 00:00:00 2012-05-29 00:00:00 NaN \n", "751 2011-05-31 00:00:00 2012-05-29 00:00:00 NaN \n", "\n", " rowguid \\\n", "ProductID \n", "749 {BC621E1F-2553-4FDC-B22E-5E44A9003569} \n", "750 {C19E1136-5DA4-4B40-8758-54A85D7EA494} \n", "751 {D10B7CC1-455E-435B-A08F-EC5B1C5776E9} \n", "\n", " ModifiedDate \n", "ProductID \n", "749 2014-02-08 10:01:36.827000000 \n", "750 2014-02-08 10:01:36.827000000 \n", "751 2014-02-08 10:01:36.827000000 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's sort by standard cost, descending\n", "prod.sort_values(by='StandardCost', ascending=False).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This one is a little more advanced, but it demonstrates a few things:\n", "- Conversion of a `numpy.ndarray` object (return type of `pd.Series.unique()`) into a `pd.Series` object\n", "- `pd.Series.sort_values` called without `by=` (a Series has only one column of values, so `by=` does not need to be specified)\n", "- Alphabetical sort of a string field; `ascending=True` means A->Z\n", "- Inclusion of nulls: `NaN` values are kept and sorted to the end by default" ] }, { 
"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 Black\n", "5 Blue\n", "8 Grey\n", "6 Multi\n", "3 Red\n", "2 Silver\n", "9 Silver/Black\n", "4 White\n", "7 Yellow\n", "0 NaN\n", "dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(prod['Color'].unique()).sort_values(ascending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Independent Exercises\n", "\n", "Do your best to complete the following prompts. Don't hesitate to look at code we wrote together!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the first 4 rows of the whole DataFrame." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " NewName ProductNumber MakeFlag FinishedGoodsFlag \\\n", "ProductID \n", "1 Adjustable Race AR-5381 0 0 \n", "2 Bearing Ball BA-8327 0 0 \n", "3 BB Ball Bearing BE-2349 1 0 \n", "4 Headset Ball Bearings BE-2908 0 0 \n", "\n", " Color SafetyStockLevel ReorderPoint StandardCost ListPrice Size \\\n", "ProductID \n", "1 NaN 1000 750 0.0 0.0 NaN \n", "2 NaN 1000 750 0.0 0.0 NaN \n", "3 NaN 800 600 0.0 0.0 NaN \n", "4 NaN 800 600 0.0 0.0 NaN \n", "\n", " ... ProductLine Class Style ProductSubcategoryID ProductModelID \\\n", "ProductID ... \n", "1 ... NaN NaN NaN NaN NaN \n", "2 ... NaN NaN NaN NaN NaN \n", "3 ... NaN NaN NaN NaN NaN \n", "4 ... NaN NaN NaN NaN NaN \n", "\n", " SellStartDate SellEndDate DiscontinuedDate \\\n", "ProductID \n", "1 2008-04-30 00:00:00 NaN NaN \n", "2 2008-04-30 00:00:00 NaN NaN \n", "3 2008-04-30 00:00:00 NaN NaN \n", "4 2008-04-30 00:00:00 NaN NaN \n", "\n", " rowguid \\\n", "ProductID \n", "1 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} \n", "2 {58AE3C20-4F3A-4749-A7D4-D568806CC537} \n", "3 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} \n", "4 {ECFED6CB-51FF-49B5-B06C-7D8AC834DB8B} \n", "\n", " ModifiedDate \n", "ProductID \n", "1 2014-02-08 10:01:36.827000000 \n", "2 2014-02-08 10:01:36.827000000 \n", "3 2014-02-08 10:01:36.827000000 \n", "4 2014-02-08 10:01:36.827000000 \n", "\n", "[4 rows x 24 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many rows are in the DataFrame? Return the answer as an int." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "504" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many columns? Return the answer as an int." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod.shape[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different product lines are there?" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod['ProductLine'].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the values of these product lines?" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, 'R ', 'S ', 'M ', 'T '], dtype=object)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod['ProductLine'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the number of values for the product lines match the number you got using `.nunique()`? Why or why not?" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# your answer here\n", "# No: pd.Series.nunique() does not count nulls by default, while .unique() includes them (seen as nan in the np.ndarray)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take the output from your previous answer (using `.unique()`). Select the label corresponding to the `Road` product line using list indexing notation. How many characters are in this string?" 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R \n", "2\n" ] } ], "source": [ "# your answer here\n", "print(prod['ProductLine'].unique()[1])\n", "print(len(prod['ProductLine'].unique()[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you notice anything odd about this?" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "# your answer here\n", "# There is trailing whitespace! The horror!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many products are there for the `Road` product line? Don't forget what you just worked on above! Return your answer as an int." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod[prod['ProductLine'] == 'R '].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many products are there in the `Women's` `Mountain` category? Return your answer as an int. _Hint: Use the data dictionary above!_" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod[(prod['ProductLine'] == 'M ') & (prod['Style'] == 'W ')].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Challenge:** What are the top 3 products by _list price_ that are either in the `Women's` `Mountain` category, _OR_ `Silver` in `Color`? Return your answer as a DataFrame object, with the `ProductID` index, `NewName` relabeled as `Name`, and `ListPrice` columns. Perform the statement in one execution, and do not mutate the source DataFrame." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Name ListPrice\n", "ProductID \n", "774 Mountain-100 Silver, 48 3399.99\n", "771 Mountain-100 Silver, 38 3399.99\n", "772 Mountain-100 Silver, 42 3399.99" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your answer here\n", "prod[\n", " ((prod['ProductLine'] == 'M ') & (prod['Style'] == 'W ')) | \n", " (prod['Color'] == 'Silver')].sort_values(\n", " by='ListPrice', ascending=False).head(3)[['NewName', 'ListPrice']].rename(columns={'NewName': 'Name'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap\n", "\n", "We covered a lot of ground! It's ok if this takes a while to gel.\n", "\n", "```python\n", "\n", "# basic DataFrame operations\n", "df.head()\n", "df.tail()\n", "df.shape\n", "df.columns\n", "df.index\n", "\n", "# selecting columns\n", "df.column_name\n", "df['column_name']\n", "\n", "# renaming columns\n", "df.rename(columns={'old_name':'new_name'}, inplace=True)\n", "df.columns = ['new_column_a', 'new_column_b']\n", "\n", "# notable columns operations\n", "df.describe() # summary statistics (count, mean, std, min, quartiles, max)\n", "df['col1'].nunique() # number of unique values\n", "df['col1'].value_counts() # number of occurrences of each value in column\n", "\n", "# filtering\n", "df[ df['col1'] < 50 ] # keep rows where col1 is less than 50\n", "df[ (df['col1'] == value1) & (df['col2'] > value2) ] # keep rows where col1 is equal to value1 AND col2 is greater than value2\n", "\n", "# sorting\n", "df.sort_values(by='column_name', ascending = False) # sort biggest to smallest\n", "\n", "```\n", "\n", "\n", "It's common to refer back to your own code *all the time.* Don't hesitate to reference this guide! 
🐼\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }