7.4 KiB

Raw Permalink Blame History Unescape Escape

Pandas I

A Note on Delivery

This unit's lessons will occur in jupyter notebooks
- Slides will be an introduction to the lesson (no code, just overview)
- Then, we'll open a notebook and start coding!

Learning Objectives

After this lesson, you will be able to:

Use Pandas to read in a dataset.
Investigate a dataset's integrity.
Filter, sort, and manipulate DataFrame series.

What is Pandas?

A group of adorable bears 🐼🐼🐼
A Python library for data manipulation.

So, Pandas the Library

The Swiss Army Knife of data manipulation!

Pandas:

Is the library for exploratory data analysis (EDA).
Formats, wrangles, cleans, and prepares our data.

Quick Backstory from 2009:

A humble open source project for Panel Data (hence "Pandas") from Wes McKinney.
A 'panel' is the name of the object (in pandas) holding an n-dimensional numpy array
Don't let the term fool you, a panel is effectively the same thing as an excel workbook (a collection of sheets)
A 2-dimensional panel is a Dataframe (rows and columns)
A 1-dimensional panel is a Series (column)

Exploratory Data Analysis (EDA)

The process of understanding our dataset and producing our first level of insights.

This includes:

Reading in data: "Import cat population."
Checking data types. "Is the population count in integers?"
Renaming columns: "cat_breed is more helpful than Biological Family"
Joining together data: "Join the cat population data with the cat population data."
Looking for missing data: "It doesn't mention corgis."
And more!

Today, we will focus on the most 'mission critical' elements of EDA.

Quick Review

Exploratory Data Analysis (EDA) is the process of understanding our dataset, and producing our first level of insights.What does this include?
Pandas is a prominent Python library used for exploratory data analysis

What dataset are we exploring?

Adventure Works Cycles!
We will be using a dataset developed by Microsoft for training purposes in SQL server, known the Adventureworks Cycles 2014OLTP Database.
It is based on a fictitious company called Adventure Works Cycles (AWC), a multinational manufacturer and seller of bicycles and accessories.
The company is based in Bothell, Washington, USA and has regional sales offices in several countries.
We will be looking at a single table from this database, the Production.Product table, which outlines some of the products this company sells.

Let's take a closer look at the data dictionary, or what is included:

ProductID - Primary key for Product records.
Name - Name of the product.
ProductNumber - Unique product identification number.
MakeFlag - 0 = Product is purchased, 1 = Product is manufactured in-house.
FinishedGoodsFlag - 0 = Product is not a salable item. 1 = Product is salable.
Color - Product color.
SafetyStockLevel - Minimum inventory quantity.
ReorderPoint - Inventory level that triggers a purchase order or work order.
StandardCost - Standard cost of the product.
ListPrice - Selling price.
Size - Product size.
SizeUnitMeasureCode - Unit of measure for the Size column.
WeightUnitMeasureCode - Unit of measure for the Weight column.
DaysToManufacture - Number of days required to manufacture the product.
ProductLine - R = Road, M = Mountain, T = Touring, S = Standard
Class - H = High, M = Medium, L = Low
Style - W = Womens, M = Mens, U = Universal
ProductSubcategoryID - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.
ProductModelID - Product is a member of this product model. Foreign key to ProductModel.ProductModelID.
SellStartDate - Date the product was available for sale.
SellEndDate - Date the product was no longer available for sale.
DiscontinuedDate - Date the product was discontinued.
rowguid - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.
ModifiedDate - Date and time the record was last updated.

Discussion: What Could We Examine?

What are some potential insights you'd like to uncover given the data?
What if you are examining it from the standpoint of a the business?
What if you are a potential distributor of their products?

Our Modified Adventure Works Dataset

The full dataset is actually a large, star-schema relational databse.

We will work with a modified dataset.

Key changes:

Only a single table from this database
Contains information on products the company makes
- Such as the product names
- The product weights, measures
- And the product prices

Data Integrity

The first thing we check! Assuring our data can be trusted to produce meaningful insights.

Correctly formatted datatypes.

"Decimals are floats, not strings."

Missing Data

i.e. "Why do we only have even days of the month?"

Clean Truth about Dirty Data

Assessing data integrity isn't a one-stop step.
Much like EDA itself, it's an ongoing process!
We uncover additional potential problems and anomalies to remedy along the way.

Additional Resources

Pandas documentation
DataSchool 30-video series (by a former GA instructor!)

7.4 KiB Raw Permalink Blame History Unescape Escape