<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pandas I</title>
<meta name="description" content="">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- For syntax highlighting -->
<link rel="stylesheet" href="../../../../lib/css/zenburn.css">
<link rel="stylesheet" href="../../../../lib/css/prism.css">
<link rel="stylesheet" href="../../../../css/reveal.css">
<link rel="stylesheet" href="../../../../css/theme/ga-title.css" id="theme">
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
<link rel="stylesheet" type="text/css" href="https://s3.amazonaws.com/python-ga/proxima-nova/fonts.css" />
</head>
<body class="language-javascript">
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<!--
---
title: Pandas I
type: lesson
duration: "1:00"
creator: [Joseph Nelson](https://twitter.com/josephofiowa)
---
-->
<section id="section" class="level2 separator">
<h2><img src="http://nagale.com/ga-python/images/GA_Cog_Medium_White_RGB.png" /></h2>
<h1>
Pandas I
</h1>
<!--
## Overview
This lesson introduces the Pandas library and the beginnings of Exploratory Data Analysis. The majority of the lesson should be spent going through code, whether that is via Jupyter Slides or a Jupyter Notebook demonstration.
## Learning Objectives
In this lesson, students will:
- Use Pandas to read in a dataset.
- Investigate a dataset's integrity.
- Filter, sort, and manipulate DataFrame series.
## Duration
60 minutes
## Suggested Agenda
| Time | Activity |
| --- | --- |
| 0:00 - 0:03 | Welcome |
| 0:03 - 0:15 | Slides |
| 0:15 - 0:17 | NOTE: Switch to Notebook |
| 0:17 - 0:25 | Basic Pandas |
| 0:25 - 0:35 | Columns |
| 0:35 - 0:44 | Filtering and Sorting |
| 0:44 - 0:58 | Independent Exercise |
| 0:58 - 1:00 | Summary |
## Materials and Preparation
- Send out the presentation link.
- Students will need the data sets and notebook. Consider having a zip file of all notebooks and data sets for the rest of the unit that you hand out at the beginning of this lesson. Alternatively, link them directly in GitHub - remember that they haven't learned GitHub, so you'll need to help them download the files.
- The presentation is also at the top of the Notebook, so students can later reference in one place. Jump down to `Importing Pandas`.
## Differentiation and Extensions
- If students are excelling in the first half, consider deeper discussions surrounding five number summaries, data integrity, off-the-cuff filters and sorts
- If students are struggling, work on the code more heavily than the **Class Questions** portions. Make the Independent Exercises be Collective Exercises (as a class)
## In Class: Materials
- Projector
- Internet connection
- Jupyter Notebooks
- Python3
-->
<hr />
</section>
<section id="learning-objectives" class="level2">
<h2>Learning Objectives</h2>
<p><em>After this lesson, you will be able to:</em></p>
<ul>
<li>Use Pandas to read in a dataset.</li>
<li>Investigate a dataset's integrity.</li>
<li>Filter, sort, and manipulate DataFrame series.</li>
</ul>
<aside class="notes">
<p><strong>Talking Points</strong>: This lesson introduces the Pandas library and the beginnings of Exploratory Data Analysis. The majority of the lesson should be spent going through code, whether that is via Jupyter Slides or a Jupyter Notebook demonstration.</p>
<p>To present this content, begin with <code>intro-to-pandas-i.ipynb</code> to introduce Pandas as a library and data integrity. Transition to the Jupyter Notebook to introduce reading in data, column manipulation, filtering and sorting; conclude with exercises.</p>
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>There are <strong>Class Questions</strong> scattered throughout the notebook. Spend as much or as little time on these as you see fit, depending on how your class is pacing.</li>
<li>There is an <strong>Independent Exercise</strong> at the end of this lesson. There may not be enough time for students to work on it entirely independently, so consider doing a guided code-along or paired programming. Answers are included.</li>
<li>Pause after the learning objectives and level-set on what students will get out of the lesson.</li>
</ul>
</aside>
<hr />
</section>
<section id="what-is-pandas" class="level2">
<h2>What is Pandas?</h2>
<ul>
<li>A group of adorable bears 🐼🐼🐼</li>
<li>A Python library for data manipulation.</li>
</ul>
<iframe src="https://giphy.com/embed/EatwJZRUIv41G" width="480" height="270" frameborder="0" class="giphy-embed" allowfullscreen>
</iframe>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Get them excited to learn this.</li>
<li>The iframe is just this gif:<img src="https://media.giphy.com/media/EatwJZRUIv41G/giphy.gif" /></li>
<li>Show your favorite Pandas gifs <a href="https://media.giphy.com/media/z6xE1olZ5YP4I/giphy.gif">Seriously</a></li>
<li>Describe exploratory data analysis as an <strong>ongoing</strong> process, and cite an example from your experience.</li>
</ul>
<p><strong>Talking Points</strong>:</p>
<ul>
<li>As we learned, Python libraries are collections of functions and methods that allow us to perform lots of actions without writing as much of our own code.</li>
<li>The pandas library is written specifically for data manipulation and analysis in Python</li>
</ul>
</aside>
<hr />
</section>
<section id="so-pandas-the-library" class="level2">
<h2>So, Pandas the Library</h2>
<p>The Swiss Army Knife of data manipulation!</p>
<p>Pandas:</p>
<ul>
<li>Is <em>the</em> library for exploratory data analysis (EDA).</li>
<li>Formats, wrangles, cleans, and prepares our data.</li>
</ul>
<p>Quick Backstory from 2009:</p>
<ul>
<li>A humble open source project for Panel Data (hence “Pandas”) from Wes McKinney.</li>
<li>A Panel was the name of the pandas object holding a 3-dimensional NumPy array (it has since been removed from the library).</li>
<li>Don't let the term fool you: a panel is effectively the same thing as an Excel workbook (a collection of sheets).</li>
<li>The 2-dimensional structure is a DataFrame (rows and columns).</li>
<li>The 1-dimensional structure is a Series (a single column).</li>
</ul>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Explain what you mean by Swiss Army knife as not all students may understand that metaphor</li>
<li>Remind students of the meaning of exploratory data analysis (EDA)</li>
</ul>
<p><strong>Talking Points</strong>:</p>
<ul>
<li>“Pandas is the most prominent Python library for exploratory data analysis (EDA). The functions Pandas supports are integral to understanding, formatting, and preparing our data. Formally, we use Pandas to investigate, wrangle, munge, and clean our data. Pandas is the Swiss Army Knife of data manipulation!”</li>
<li>“Pandas began as a humble open source project for Panel Data (hence “Pandas”) in 2009 by Wes McKinney. It has grown to be the most used Python-related tag on Stack Overflow.”</li>
<li>Pandas is one of the most useful data manipulation libraries. At the outset, its utilities replace many things we know how to do in Excel. However, with Pandas we also produce a script of reproducible steps, <strong>and</strong> an Excel worksheet is limited to 1,048,576 rows. Pandas has no fixed row limit; it is bounded by available memory.</li>
</ul>
</aside>
<hr />
</section>
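<section data-markdown class="level2">
<script type="text/template">
A minimal sketch of the two structures named on the previous slide, assuming pandas is installed. The product names and prices below are made-up illustration values, not rows from any real dataset.

```python
import pandas as pd

# Hypothetical values for illustration only.
prices = pd.Series([34.99, 1431.50, 8.99], name="ListPrice")

products = pd.DataFrame({
    "Name": ["Helmet", "Road Frame", "Water Bottle"],
    "ListPrice": [34.99, 1431.50, 8.99],
})

# A Series is one-dimensional; a DataFrame is two-dimensional.
print(prices.ndim)      # 1
print(products.ndim)    # 2
print(products.shape)   # (3, 2)

# Selecting a single column of a DataFrame gives back a Series.
column = products["ListPrice"]
print(type(column).__name__)  # Series
```

In other words, a DataFrame can be thought of as a collection of Series that share the same row index.
</script>
</section>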
<section id="exploratory-data-analysis-eda" class="level2">
<h2>Exploratory Data Analysis (EDA)</h2>
<p>The process of understanding our dataset and producing our first level of insights.</p>
<p>This includes:</p>
<ul>
<li>Reading in data: “Import cat population.”</li>
<li>Checking data types. “Is the population count in integers?”</li>
<li>Renaming columns: “<code>cat_breed</code> is more helpful than <code>Biological Family</code>.”</li>
<li>Joining together data: “Join the cat population data with the dog population data.”</li>
<li>Looking for missing data: “It doesn't mention corgis.”</li>
<li>And more!</li>
</ul>
<p>Today, we will focus on the most mission critical elements of EDA.</p>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Point out from the bulleted list what is mission critical</li>
<li>Time permitting, ask students to share a similar example of a dataset</li>
</ul>
<p><strong>Talking Points</strong>:</p>
<ul>
<li>“Exploratory Data Analysis (EDA) is the process of understanding our dataset, and producing our first level of insights. This includes reading in the data, understanding our data dictionary, checking data types, assessing descriptive statistics, renaming columns, joining together data, looking for missing data, and so much more. That sounds like a lot, but today, we will just focus on the most mission critical elements of EDA.”</li>
<li>It's common to get far into the data science workflow, only to realize that the data is unclean or that a feature could have been engineered earlier in the process.</li>
<li>Hypothesis-driven EDA is essential to productive EDA; otherwise, we will ceaselessly torture our data for answers.</li>
</ul>
</aside>
<hr />
</section>
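<section data-markdown class="level2">
<script type="text/template">
The EDA steps listed on the previous slide can be sketched roughly as follows. This is an illustration under stated assumptions: the CSV text, the `population` and `origin` columns, and the `origins` table are all invented for the example, and a real analysis would read from an actual file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
cats_csv = io.StringIO(
    "Biological Family,population\n"
    "Siamese,120\n"
    "Maine Coon,85\n"
    "Sphynx,\n"
)

# Reading in data.
cats = pd.read_csv(cats_csv)

# Checking data types: the blank cell makes population a float column.
print(cats.dtypes)

# Renaming columns.
cats = cats.rename(columns={"Biological Family": "cat_breed"})

# Looking for missing data: count of missing values per column.
missing = cats.isnull().sum()
print(missing)

# Joining together data with a second (also made-up) table.
origins = pd.DataFrame({
    "cat_breed": ["Siamese", "Maine Coon"],
    "origin": ["Thailand", "United States"],
})
joined = cats.merge(origins, on="cat_breed", how="left")
print(joined)
```

A left join keeps every row of `cats`, so a breed with no match in `origins` shows up with a missing `origin` rather than silently disappearing.
</script>
</section>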
<section id="quick-review" class="level2">
<h2>Quick Review</h2>
<ul>
<li>Exploratory Data Analysis (EDA) is the process of understanding our dataset, and producing our first level of insights. What does this include?</li>
<li>Pandas is a prominent Python library used for exploratory data analysis.</li>
</ul>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Pause to gather students' answers about what EDA includes.</li>
<li>Answer any clarification questions!</li>
</ul>
</aside>
<hr />
</section>
<section id="what-dataset-are-we-exploring" class="level2">
<h2>What dataset are we exploring?</h2>
<ul>
<li><p>Adventure Works Cycles!</p></li>
<li>We will be using a dataset developed by Microsoft for training purposes in SQL Server, known as the AdventureWorks Cycles 2014 OLTP database.</li>
<li>It is based on a fictitious company called Adventure Works Cycles (AWC), a multinational manufacturer and seller of bicycles and accessories.</li>
<li>The company is based in Bothell, Washington, USA and has regional sales offices in several countries.</li>
<li><p>We will be looking at a single table from this database, the Production.Product table, which outlines some of the products this company sells.</p></li>
</ul>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Open this page in a new window.</li>
</ul>
<p><strong>Talking Points</strong>:</p>
<p>Let's take a closer look at the data dictionary, or what is included:</p>
<ul>
<li>ProductID - Primary key for Product records.</li>
<li>Name - Name of the product.</li>
<li>ProductNumber - Unique product identification number.</li>
<li>MakeFlag - 0 = Product is purchased, 1 = Product is manufactured in-house.</li>
<li>FinishedGoodsFlag - 0 = Product is not a salable item, 1 = Product is salable.</li>
<li>Color - Product color.</li>
<li>SafetyStockLevel - Minimum inventory quantity.</li>
<li>ReorderPoint - Inventory level that triggers a purchase order or work order.</li>
<li>StandardCost - Standard cost of the product.</li>
<li>ListPrice - Selling price.</li>
<li>Size - Product size.</li>
<li>SizeUnitMeasureCode - Unit of measure for the Size column.</li>
<li>WeightUnitMeasureCode - Unit of measure for the Weight column.</li>
<li>DaysToManufacture - Number of days required to manufacture the product.</li>
<li>ProductLine - R = Road, M = Mountain, T = Touring, S = Standard.</li>
<li>Class - H = High, M = Medium, L = Low.</li>
<li>Style - W = Women's, M = Men's, U = Universal.</li>
<li>ProductSubcategoryID - Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID.</li>
<li>ProductModelID - Product is a member of this product model. Foreign key to ProductModel.ProductModelID.</li>
<li>SellStartDate - Date the product was available for sale.</li>
<li>SellEndDate - Date the product was no longer available for sale.</li>
<li>DiscontinuedDate - Date the product was discontinued.</li>
<li>rowguid - ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample.</li>
<li>ModifiedDate - Date and time the record was last updated.</li>
</ul>
</aside>
<hr />
</section>
<section id="discussion-what-could-we-examine" class="level2">
<h2>Discussion: What Could We Examine?</h2>
<ul>
<li><p>What are some potential insights you'd like to uncover given the data?</p></li>
<li><p>What if you are examining it from the standpoint of the business?</p></li>
<li><p>What if you are a potential distributor of their products?</p></li>
</ul>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Walk through these questions one by one. Encourage discussion - there's no wrong answer!</li>
<li>For the standpoint of the business and distributor, ask why?</li>
</ul>
</aside>
<hr />
</section>
<section id="our-modified-adventure-works-dataset" class="level2">
<h2>Our Modified Adventure Works Dataset</h2>
<p>The full dataset is actually a large, star-schema relational database.</p>
<p>We will work with a modified dataset.</p>
<p>Key changes:</p>
<ul>
<li>Only a single table from this database</li>
<li>Contains information on products the company makes
<ul>
<li>Such as the product names</li>
<li>The product weights, measures</li>
<li>And the product prices</li>
</ul></li>
</ul>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Make sure that students know this is modeled on a cycle company's live database, very similar to what you'd see in the real world.</li>
</ul>
<p><strong>Talking Points</strong>:</p>
<ul>
<li>We'll be working with this, and other tables in the future.</li>
<li>As we join more tables together, we'll uncover more information about this business.</li>
<li>Encourage students to think about their work experience and how this might apply to them.</li>
</ul>
</aside>
<hr />
</section>
<section id="data-integrity" class="level2">
<h2>Data Integrity</h2>
<p>The first thing we check! Assuring our data can be trusted to produce meaningful insights.</p>
<p>Correctly formatted datatypes.</p>
<ul>
<li>“Decimals are floats, not strings.”</li>
</ul>
<p>Missing Data</p>
<ul>
<li>e.g. “Why do we only have even days of the month?”</li>
</ul>
<aside class="notes">
<p><strong>Talking Points</strong>:</p>
<ul>
<li>Ask how you might keep data integrity top of mind to maintain a clean data set</li>
<li>Give examples here.</li>
</ul>
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Pause here to ask if students have any questions</li>
</ul>
</aside>
<hr />
</section>
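<section data-markdown class="level2">
<script type="text/template">
Both integrity checks from the previous slide can be sketched in a few lines. This is only an illustration: the `sales` table, its column names, and its values are hypothetical, chosen so that prices arrive as strings and only even days of the month are present.

```python
import pandas as pd

# Hypothetical records: prices arrived as strings, and only even days appear.
sales = pd.DataFrame({
    "date": ["2014-06-02", "2014-06-04", "2014-06-06"],
    "price": ["9.99", "12.50", "7.25"],
})

# Correctly formatted datatypes: decimals should be floats, not strings.
print(sales["price"].dtype)  # a string/object dtype before conversion
sales["price"] = sales["price"].astype(float)
sales["date"] = pd.to_datetime(sales["date"])

# Missing data: which calendar days in the covered range have no record?
expected_days = pd.date_range(sales["date"].min(), sales["date"].max(), freq="D")
missing_days = expected_days[~expected_days.isin(sales["date"])]
print(missing_days.day.tolist())  # [3, 5]
```

Comparing the dates we have against the dates we expect is what surfaces the gap; simply counting null cells would not, because the missing days have no rows at all.
</script>
</section>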
<section id="clean-truth-about-dirty-data" class="level2">
<h2>Clean Truth about Dirty Data</h2>
<ul>
<li><p>Assessing data integrity isn't a one-stop step.</p></li>
<li><p>Much like EDA itself, it's an ongoing process!</p></li>
<li><p>We uncover additional potential problems and anomalies to remedy along the way.</p></li>
</ul>
<aside class="notes">
<p><strong>Talking Points</strong>:</p>
<ul>
<li>Ask how you might keep data integrity top of mind to maintain a clean data set</li>
<li>Give examples here.</li>
</ul>
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Give examples here.</li>
</ul>
</aside>
<hr />
</section>
<section id="launch-our-notebook" class="level2">
<h2>Launch our notebook</h2>
<p>We'll work in the Notebook. We're fledgling data scientists!</p>
<p>The <code>.ipynb</code> file you will open is called &quot;<code>intro-to-pandas-i.ipynb</code>&quot;.</p>
<p>Open it up!</p>
<p>Jump down to <code>Importing Pandas</code>.</p>
<aside class="notes">
<p><strong>Teaching Tips</strong>:</p>
<ul>
<li>Have students assist one another, and walk around the room to ensure everyone gets to the notebook successfully.</li>
<li>Make sure all students can open and run their Notebooks. It's only the second time they've done so!</li>
<li>The presentation is also at the top of the Notebook, so students can later reference in one place. Jump down to <code>Importing Pandas</code>.</li>
</ul>
</aside>
<hr />
</section>
<section id="additional-resources" class="level2">
<h2>Additional Resources</h2>
<ul>
<li>Pandas <a href="https://pandas.pydata.org/pandas-docs/stable/">documentation</a></li>
<li>DataSchool <a href="http://www.dataschool.io/easier-data-analysis-with-pandas/">30-video series</a> (by a former GA instructor!)</li>
</ul>
</section>
</div>
<footer><span class='slide-number'></span></footer>
</div>
<script src="../../../../lib/js/head.min.js"></script>
<script src="../../../../js/reveal.js"></script>
<script>
var dependencies = [
{ src: '../../../../lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../../../../plugin/markdown/showdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../../../plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../../../plugin/prism/prism.js', async: true, callback: function() { /*hljs.initHighlightingOnLoad();*/ } },
{ src: '../../../../plugin/zoom-js/zoom.js', async: true, condition: function() { return !!document.body.classList; } }
];
if (Reveal.getQueryHash().instructor === 1) {
dependencies.push({ src: '../../../../plugin/notes/notes.js', async: true, condition: function() { return !!document.body.classList; } });
}
// Full list of configuration options available here:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: true,
progress: true,
history: true,
center: false,
slideNumber: true,
// available themes are in /css/theme
theme: Reveal.getQueryHash().theme || 'default',
// default/cube/page/concave/zoom/linear/fade/none
transition: Reveal.getQueryHash().transition || 'slide',
// Optional libraries used to extend on reveal.js
dependencies: dependencies
});
Reveal.addEventListener('ready', function() {
if (Reveal.getCurrentSlide().classList.contains('separator-subhead')) {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-subhead.css');
} else if (Reveal.getCurrentSlide().classList.contains('separator')) {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-title.css')
} else {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga.css');
}
});
Reveal.addEventListener('slidechanged', function(e) {
if (Reveal.getCurrentSlide().classList.contains('separator-subhead')) {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-subhead.css');
} else if (Reveal.getCurrentSlide().classList.contains('separator')) {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-title.css')
} else {
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga.css');
}
});
</script>
</body>
</html>