Pandas I

A Note on Delivery

This unit’s lessons will take place in Jupyter notebooks.
- Slides will be an introduction to the lesson (no code, just overview)
- Then, we’ll open a notebook and start coding!

Learning Objectives

After this lesson, you will be able to:

Use Pandas to read in a dataset.
Investigate a dataset’s integrity.
Filter, sort, and manipulate DataFrame series.

Talking Points: This lesson introduces the Pandas library and the beginnings of Exploratory Data Analysis. The majority of the lesson should be spent going through code – whether that is via Jupyter Slides or a Jupyter Notebook demonstration.

To present this content, begin with intro-to-pandas-i.ipynb to introduce Pandas as a library and data integrity. Transition to the Jupyter Notebook to introduce reading in data, column manipulation, filtering and sorting; conclude with exercises.

Teaching Tips:

- There are Class Questions throughout the notebook. Spend as much or as little time on these as fits your class’s pace.
- There is an Independent Exercise at the end of this lesson. Time is unlikely to allow students to work on it entirely independently, so consider a guided code-along or pair programming. Answers are included.
- Pause after the learning objectives and level-set what students will get out of the lesson.

What is Pandas?

A group of adorable bears 🐼🐼🐼
A Python library for data manipulation.

So, Pandas the Library

The Swiss Army Knife of data manipulation!

Pandas:

Is the library for exploratory data analysis (EDA).
Formats, wrangles, cleans, and prepares our data.

Quick Backstory from 2009:

A humble open source project for panel data (hence “Pandas”) from Wes McKinney.
“Panel data” is an econometrics term for datasets that track the same subjects over multiple time periods.
Early versions of pandas also included a 3-dimensional Panel object, loosely comparable to an Excel workbook (a collection of sheets); it has since been removed from the library.
A DataFrame is the 2-dimensional structure (rows and columns).
A Series is the 1-dimensional structure (a single column).
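As a minimal sketch of the two core structures, the snippet below builds a small DataFrame and pulls one column out of it as a Series. The product names and prices are invented for illustration and are not taken from the lesson dataset.

```python
import pandas as pd

# A DataFrame is a 2-dimensional table with labeled rows and columns.
# These values are invented for illustration only.
products = pd.DataFrame(
    {
        "Name": ["Road Bike", "Helmet", "Water Bottle"],
        "ListPrice": [1200.00, 35.50, 5.00],
    }
)

# Selecting a single column returns a 1-dimensional Series.
prices = products["ListPrice"]

print(type(products).__name__)  # DataFrame
print(type(prices).__name__)    # Series
print(products.shape)           # (3, 2): 3 rows, 2 columns
print(prices.shape)             # (3,): 3 values
```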

Teaching Tips:

Explain what you mean by Swiss Army knife as not all students may understand that metaphor
Remind students of the meaning of exploratory data analysis (EDA)

Talking Points:

“Pandas is the most prominent Python library for exploratory data analysis (EDA). The functions Pandas supports are integral to understanding, formatting, and preparing our data. Formally, we use Pandas to investigate, wrangle, munge, and clean our data. Pandas is the Swiss Army Knife of data manipulation!”
“Pandas began as a humble open source project for Panel Data (hence ‘Pandas’) in 2009 by Wes McKinney. It has grown to be the most used Python-related tag on Stack Overflow.”
Pandas is one of the most useful data manipulation libraries. At the outset, its utilities replace many things we know how to do in Excel. However, with Pandas we also produce a script of reproducible steps, and an Excel worksheet is limited to 1,048,576 rows, whereas Pandas is limited mainly by available memory.

Exploratory Data Analysis (EDA)

The process of understanding our dataset and producing our first level of insights.

This includes:

Reading in data: “Import cat population.”
Checking data types: “Is the population count in integers?”
Renaming columns: “cat_breed is more helpful than Biological Family”
Joining together data: “Join the cat population data with the dog population data.”
Looking for missing data: “It doesn’t mention corgis.”
And more!

Today, we will focus on the most ‘mission critical’ elements of EDA.
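The EDA steps listed above can be sketched on a toy example. Everything here is an assumption made for illustration: the city names, the counts, and the `dogs` table are invented, and in practice the data would usually be loaded from a file rather than typed in.

```python
import pandas as pd

# Invented data standing in for files we would normally read in.
cats = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Denver"],
        "Biological Family": ["Siamese", "Maine Coon", "Sphynx"],
        "population": ["120", "85", None],  # counts stored as text, one missing
    }
)
dogs = pd.DataFrame({"city": ["Austin", "Boston"], "dog_population": [200, 150]})

# Check data types: the population column is not numeric yet.
print(cats.dtypes)

# Rename a column to something more helpful.
cats = cats.rename(columns={"Biological Family": "cat_breed"})

# Convert the text counts to numbers; the missing entry becomes NaN.
cats["population"] = pd.to_numeric(cats["population"])

# Join the two tables on their shared column, keeping every cat row.
pets = cats.merge(dogs, on="city", how="left")

# Look for missing data: count the empty cells in each column.
missing = pets.isna().sum()
print(missing)
```

A left join is used here so that no cat rows are dropped; cities without dog data simply show up as missing values, which the final count makes visible.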

Teaching Tips:

Point out from the bulleted list what is mission critical
Time permitting, ask students to share a similar example of a dataset

Talking Points:

“Exploratory Data Analysis (EDA) is the process of understanding our dataset, and producing our first level of insights. This includes reading in the data, understanding our data dictionary, checking data types, assessing descriptive statistics, renaming columns, joining together data, looking for missing data, and so much more. That sounds like a lot, but today, we will just focus on the most ‘mission critical’ elements of EDA.”
It’s common to get further into the data science workflow, only to realize that the data is unclean or that a feature could have been engineered earlier in the process.
Hypothesis-driven EDA is essential to productive EDA – otherwise we will ceaselessly torture our data for answers.

Quick Review

Exploratory Data Analysis (EDA) is the process of understanding our dataset and producing our first level of insights. What does this include?
Pandas is a prominent Python library used for exploratory data analysis.

What dataset are we exploring?

Adventure Works Cycles!
We will be using a dataset developed by Microsoft for training purposes in SQL Server, known as the Adventure Works Cycles 2014 OLTP database.
It is based on a fictitious company called Adventure Works Cycles (AWC), a multinational manufacturer and seller of bicycles and accessories.
The company is based in Bothell, Washington, USA and has regional sales offices in several countries.
We will be looking at a single table from this database, the Production.Product table, which outlines some of the products this company sells.

Teaching Tips:

Open this page in a new window.

Talking Points:

Let’s take a closer look at the data dictionary, or what is included:

ProductID - Primary key for Product records.
Name - Name of the product.
ProductNumber - Unique product identification number.
MakeFlag - 0 = Product is purchased, 1 = Product is manufactured in-house.
FinishedGoodsFlag - 0 = Product is not a salable item, 1 = Product is salable.
Color - Product color.
SafetyStockLevel - Minimum inventory quantity.
ReorderPoint - Inventory level that triggers a purchase order or work order.
StandardCost - Standard cost of the product.
ListPrice - Selling price.
Size - Product size.
SizeUnitMeasureCode - Unit of measure for the Size column.
WeightUnitMeasureCode - Unit of measure for the Weight column.
DaysToManufacture - Number of days required to manufacture the product.
ProductLine - R = Road, M = Mountain, T = Touring, S = Standard.
Class - H = High, M = Medium, L = Low.
Style - W = Womens, M = Mens, U = Universal.
ProductSubcategoryID - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.
ProductModelID - Product is a member of this product model. Foreign key to ProductModel.ProductModelID.
SellStartDate - Date the product was available for sale.
SellEndDate - Date the product was no longer available for sale.
DiscontinuedDate - Date the product was discontinued.
rowguid - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.
ModifiedDate - Date and time the record was last updated.
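Several columns in the data dictionary store single-letter codes. One possible way to make them readable, shown here as a sketch, is to translate them with `Series.map`. The code-to-label mapping comes from the ProductLine entry in the dictionary; the sample values are invented.

```python
import pandas as pd

# Code-to-label mapping taken from the ProductLine entry in the data dictionary.
product_line_labels = {"R": "Road", "M": "Mountain", "T": "Touring", "S": "Standard"}

# Invented sample of ProductLine codes, including one missing value.
product_line = pd.Series(["R", "M", None, "T"], name="ProductLine")

# Series.map looks up each code in the dictionary; codes that are missing
# (or not present in the dictionary) come back as missing values.
labels = product_line.map(product_line_labels)
print(labels)
```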

Discussion: What Could We Examine?

What are some potential insights you’d like to uncover given the data?
What if you are examining it from the standpoint of the business?
What if you are a potential distributor of their products?

Our Modified Adventure Works Dataset

The full dataset is actually a large, star-schema relational database.

We will work with a modified dataset.

Key changes:

Only a single table from this database
Contains information on products the company makes
- Such as the product names
- The product weights, measures
- And the product prices
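To preview what working with a single product table might look like, here is a hedged sketch of reading, filtering, and sorting. The column names follow the data dictionary, but the rows are invented and the in-memory `csv_text` stands in for the real file, whose name and layout are not given here.

```python
import io

import pandas as pd

# A few invented rows standing in for the real file. Column names follow the
# data dictionary; the values and the in-memory "file" are assumptions.
csv_text = """ProductID,Name,Color,ListPrice
1,Sample Road Frame,Red,1431.50
2,Sample Helmet,Black,34.99
3,Sample Bottle,,4.99
"""

# read_csv accepts a file path or any file-like object.
products = pd.read_csv(io.StringIO(csv_text))

print(products.head())   # first rows of the table
print(products.dtypes)   # data type of each column

# Filter: keep products with a list price above 30.
pricey = products[products["ListPrice"] > 30]

# Sort: most expensive product first.
by_price = products.sort_values("ListPrice", ascending=False)
print(by_price)
```

With a real file, the `io.StringIO(csv_text)` argument would be replaced by the path to that file.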

Data Integrity

The first thing we check! Ensuring our data can be trusted to produce meaningful insights.

Correctly formatted datatypes.

“Decimals are floats, not strings.”
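A small sketch of this check, using invented prices that arrived as text: inspect the column's type, then convert it so that arithmetic behaves as expected.

```python
import pandas as pd

# Invented prices that were read in as text rather than numbers.
prices = pd.DataFrame({"ListPrice": ["1200.00", "35.50", "5.00"]})

# Inspect the type first: the column is not a float column yet.
print(prices["ListPrice"].dtype)

# Convert the text to floats so that sums and averages are numeric.
prices["ListPrice"] = prices["ListPrice"].astype(float)
total = prices["ListPrice"].sum()
print(total)  # 1240.5
```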

Missing Data

e.g. “Why do we only have even days of the month?”
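Missing data can take two forms: empty cells inside rows we do have, and rows that never made it into the dataset at all. The sketch below checks for both, using invented daily readings in which only even days were recorded; the column names and dates are assumptions for illustration.

```python
import pandas as pd

# Invented daily readings in which only even days of the month were recorded.
readings = pd.DataFrame(
    {
        "date": pd.to_datetime(["2014-06-02", "2014-06-04", "2014-06-06"]),
        "units_sold": [10, None, 7],
    }
)

# Empty cells: count missing values in each column.
missing_per_column = readings.isna().sum()
print(missing_per_column)

# Absent rows: compare the dates we have against a full daily calendar.
expected = pd.date_range("2014-06-02", "2014-06-06", freq="D")
absent_dates = expected.difference(pd.DatetimeIndex(readings["date"]))
print(absent_dates)
```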

Clean Truth about Dirty Data

Assessing data integrity isn’t a one-time step.
Much like EDA itself, it’s an ongoing process!
We uncover additional potential problems and anomalies to remedy along the way.

Launch our notebook

We’ll work in the Notebook - We’re fledgling data scientists!

The .ipynb file you will open is called "intro-to-pandas-i.ipynb".

Open it up!

Jump down to Import.

Additional Resources

Pandas documentation
DataSchool 30-video series (by a former GA instructor!)