|
|
<!doctype html>
|
|
|
<html lang="en">
|
|
|
<head>
|
|
|
<meta charset="utf-8">
|
|
|
|
|
|
<title></title>
|
|
|
|
|
|
<meta name="description" content="">
|
|
|
|
|
|
|
|
|
<meta name="apple-mobile-web-app-capable" content="yes" />
|
|
|
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
|
|
|
|
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
|
|
<!-- For syntax highlighting -->
|
|
|
<link rel="stylesheet" href="../../../../lib/css/zenburn.css">
|
|
|
<link rel="stylesheet" href="../../../../lib/css/prism.css">
|
|
|
|
|
|
<link rel="stylesheet" href="../../../../css/reveal.css">
|
|
|
<link rel="stylesheet" href="../../../../css/theme/ga-title.css" id="theme">
|
|
|
|
|
|
<!--[if lt IE 9]>
|
|
|
<script src="lib/js/html5shiv.js"></script>
|
|
|
<![endif]-->
|
|
|
|
|
|
<link rel="stylesheet" type="text/css" href="https://s3.amazonaws.com/python-ga/proxima-nova/fonts.css" />
|
|
|
|
|
|
</head>
|
|
|
|
|
|
<body class="language-javascript">
|
|
|
|
|
|
<div class="reveal">
|
|
|
|
|
|
<!-- Any section element inside of this container is displayed as a slide -->
|
|
|
<div class="slides">
|
|
|
|
|
|
|
|
|
<!--
|
|
|
---
|
|
|
title: Pandas II
|
|
|
type: lesson
|
|
|
duration: "1:00"
|
|
|
creator: [Joseph Nelson](https://twitter.com/josephofiowa)
|
|
|
---
|
|
|
-->
|
|
|
<section id="section" class="level2 separator">
|
|
|
<h2><img src="https://s3.amazonaws.com/python-ga/images/GA_Cog_Medium_White_RGB.png" /></h2>
|
|
|
<h1>
|
|
|
Pandas II
|
|
|
</h1>
|
|
|
<!--
|
|
|
|
|
|
## Overview
|
|
|
This presentation is review - go through the lesson in the notebook first!
|
|
|
The goal of this lesson is to continue demonstrating valuable functions within the Pandas library for exploratory data analysis. This lesson covers handling missing data and null values, then continues to implementing `groupby` and apply functions.
|
|
|
|
|
|
|
|
|
|
|
|
## Important Notes or Prerequisites
|
|
|
|
|
|
- Review the Iowa liquor dataset [here](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy)
|
|
|
- Be aware that the dataset you are examining is a **subset** of that dataset: it is only May 2017 and May 2018. **New columns** have been created to delineate: `is_may_2017` and `is_may_2018`. These are demonstrated for the purposes of showing filters.
|
|
|
- There are **Class Questions** littered throughout the notebook. Use as much/little time on these as you see fit relative to how your class is pacing
|
|
|
- This lesson includes high level slides and a Notebook. To present this content, it is recommended you begin directly with the Jupyter Notebook. The student slides contain the wrap-up of the big ideas covered in the notebook.
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
## Learning Objectives
|
|
|
*After this lesson, you will be able to:*
|
|
|
|
|
|
- Identify and handle missing values with Pandas.
|
|
|
- Implement groupby statements for specific segmented analysis.
|
|
|
- Use apply functions to clean data with Pandas.
|
|
|
|
|
|
|
|
|
|
|
|
## Duration
|
|
|
45 minutes.
|
|
|
|
|
|
---
|
|
|
|
|
|
## Suggested Agenda
|
|
|
|
|
|
| Time | Activity | Purpose |
|
|
|
|-------------|----------|---------|
|
|
|
| 0:00 - 0:03 | Welcome |
|
|
|
| 0:03 - 0:08 | Dataset Recap |
|
|
|
| 0:08 - 0:11 | Read in the Data |
|
|
|
| 0:11 - 0:25 | Handling Missing Data |
|
|
|
| 0:25 - 0:32 | Groupby Statements |
|
|
|
| 0:32 - 0:40 | Apply Functions |
|
|
|
| 0:40 - 0:45 | Summary |
|
|
|
|
|
|
## Materials and Preparation
|
|
|
- Send out the link to the presentation slides, and help students download the Notebook.
|
|
|
|
|
|
## Differentiation and Extensions
|
|
|
|
|
|
- If students are excelling in the first half, consider deeper discussions surrounding types of missingness (missing at random, missing conditionally at random, not missing at random). Also, write a few more `groupby` statements.
|
|
|
- If students are struggling, hone the conceptual elements of each portion heavily - the **why** for identifying and handling missing data, groupby, and apply functions. Note that the order of these lessons is in order of importance, so even if the latter half is rushed, students will be covering the major keys!
|
|
|
|
|
|
|
|
|
-->
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="learning-objectives" class="level2">
|
|
|
<h2>Learning Objectives</h2>
|
|
|
<p><em>After this lesson, you will be able to:</em></p>
|
|
|
<ul>
|
|
|
<li>Identify and handle missing values with Pandas.</li>
|
|
|
<li>Implement groupby statements for specific segmented analysis.</li>
|
|
|
<li>Use apply functions to clean data with Pandas.</li>
|
|
|
</ul>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="to-the-notebook" class="level2">
|
|
|
<h2>To the notebook!</h2>
|
|
|
<p>We actually will commence this lesson directly in the Jupyter Notebook, <code>pandas-ii.ipynb</code>, to walkthrough the what, why, and how all at once.</p>
|
|
|
<p>Nonetheless, below, we have included slides reviewing the key concepts.</p>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Go through the lesson in the notebook first; come back to this presentation for a recap of concepts afterwards.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="how-do-we-handle-missing-data" class="level2">
|
|
|
<h2>How do we handle missing data?</h2>
|
|
|
<p>To handle missing data, we must:</p>
|
|
|
<ul>
|
|
|
<li>Identify we have missing data from our DataFrame</li>
|
|
|
<li>Determine, to the best of our ability, the cause of this missingness</li>
|
|
|
<li>Justify how we will handle the missing data (drop or fill in with a specific value?)</li>
|
|
|
</ul>
|
|
|
<p><strong>Pro tip:</strong> The faster you understand <em>why</em> some observations are missing, the faster and more accurately you can handle them.</p>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="key-pandas-functions-for-missing-data" class="level2">
|
|
|
<h2>Key Pandas functions for missing data:</h2>
|
|
|
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb1-1" data-line-number="1"></a>
|
|
|
<a class="sourceLine" id="cb1-2" data-line-number="2"><span class="co"># identify</span></a>
|
|
|
<a class="sourceLine" id="cb1-3" data-line-number="3">df.isnull().<span class="bu">sum</span>()</a>
|
|
|
<a class="sourceLine" id="cb1-4" data-line-number="4"></a>
|
|
|
<a class="sourceLine" id="cb1-5" data-line-number="5"><span class="co"># drop (if necessary)</span></a>
|
|
|
<a class="sourceLine" id="cb1-6" data-line-number="6">df.dropna(inplace <span class="op">=</span> <span class="va">True</span>) <span class="co">#careful!</span></a>
|
|
|
<a class="sourceLine" id="cb1-7" data-line-number="7"></a>
|
|
|
<a class="sourceLine" id="cb1-8" data-line-number="8"><span class="co"># fill in (if necessary) - replace value with desired means of filling</span></a>
|
|
|
<a class="sourceLine" id="cb1-9" data-line-number="9">df.fillna(value<span class="op">=</span>column.mean(), inplace<span class="op">=</span><span class="va">True</span>)</a></code></pre></div>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="how-do-we-use-groupby-statements" class="level2">
|
|
|
<h2>How do we use groupby statements?</h2>
|
|
|
<p><strong>Groupby</strong> allows us to conduct analysis on a specific subset.</p>
|
|
|
<p>Groupby follows a “split, apply, combine” methodology:</p>
|
|
|
</section>
|
|
|
<section id="section-1" class="level2">
|
|
|
<h2><img src="http://i.imgur.com/yjNkiwL.png" /></h2>
|
|
|
<p>Determine what attribute to groubpy in a cohort, and how to aggregate those values within that cohort.</p>
|
|
|
<blockquote>
|
|
|
<p>e.g. If we have 300 lemonade stands, do we want to know the average amount of lemonade sold across all stands, or identify which lemonade stand sold the most?</p>
|
|
|
</blockquote>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="key-pandas-functions-for-groupby" class="level2">
|
|
|
<h2>Key Pandas functions for groupby:</h2>
|
|
|
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb2-1" data-line-number="1"></a>
|
|
|
<a class="sourceLine" id="cb2-2" data-line-number="2"><span class="co"># replace column with the column of interest!</span></a>
|
|
|
<a class="sourceLine" id="cb2-3" data-line-number="3">df.groupby(<span class="st">'column'</span>).agg[<span class="st">'count'</span>, <span class="st">'mean'</span>, <span class="st">'max'</span>, <span class="st">'min'</span>]</a></code></pre></div>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="how-do-we-use-apply-functions" class="level2">
|
|
|
<h2>How do we use apply functions?</h2>
|
|
|
<ul>
|
|
|
<li><strong>Apply functions</strong> help us clean values across an entire DataFrame column. They are <em>like</em> a for loop for cleaning, but many times more efficient. They follow a common pattern:</li>
|
|
|
</ul>
|
|
|
<ol type="1">
|
|
|
<li>Write a function that works on a single value</li>
|
|
|
<li>Test that function on a single value</li>
|
|
|
<li>Apply that function to a whole column</li>
|
|
|
</ol>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="key-pandas-example-for-apply-functions" class="level2">
|
|
|
<h2>Key Pandas example for apply functions:</h2>
|
|
|
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb3-1" data-line-number="1"><span class="kw">def</span> dollars_to_float(value):</a>
|
|
|
<a class="sourceLine" id="cb3-2" data-line-number="2"></a>
|
|
|
<a class="sourceLine" id="cb3-3" data-line-number="3"> <span class="co"># try to convert the inputted value to a float</span></a>
|
|
|
<a class="sourceLine" id="cb3-4" data-line-number="4"> <span class="cf">try</span>:</a>
|
|
|
<a class="sourceLine" id="cb3-5" data-line-number="5"> <span class="cf">return</span> <span class="bu">float</span>(value.strip(<span class="st">'$'</span>))</a>
|
|
|
<a class="sourceLine" id="cb3-6" data-line-number="6"></a>
|
|
|
<a class="sourceLine" id="cb3-7" data-line-number="7"> <span class="co"># in the case of the value being a null value, we simply return a null</span></a>
|
|
|
<a class="sourceLine" id="cb3-8" data-line-number="8"> <span class="cf">except</span>:</a>
|
|
|
<a class="sourceLine" id="cb3-9" data-line-number="9"> <span class="cf">return</span> np.nan</a>
|
|
|
<a class="sourceLine" id="cb3-10" data-line-number="10"></a>
|
|
|
<a class="sourceLine" id="cb3-11" data-line-number="11"></a>
|
|
|
<a class="sourceLine" id="cb3-12" data-line-number="12">liq[<span class="st">'sale_clean'</span>] <span class="op">=</span> liq.sale.<span class="bu">apply</span>(dollars_to_float)</a></code></pre></div>
|
|
|
<aside class="notes">
|
|
|
<p><strong>Teaching Tips</strong>:</p>
|
|
|
<ul>
|
|
|
<li>Do a quick recap and check for understanding.</li>
|
|
|
</ul>
|
|
|
</aside>
|
|
|
<hr />
|
|
|
</section>
|
|
|
<section id="additional-resources" class="level2">
|
|
|
<h2>Additional Resources</h2>
|
|
|
<ul>
|
|
|
<li>Pandas <a href="https://pandas.pydata.org/pandas-docs/stable/">documentation</a></li>
|
|
|
<li>DataSchool <a href="http://www.dataschool.io/easier-data-analysis-with-pandas/">30-video series</a> (by a former GA instructor!)</li>
|
|
|
</ul>
|
|
|
</section>
|
|
|
|
|
|
</div>
|
|
|
<footer><span class='slide-number'></span></footer>
|
|
|
</div>
|
|
|
<script src="../../../../lib/js/head.min.js"></script>
|
|
|
<script src="../../../../js/reveal.js"></script>
|
|
|
|
|
|
<script>
|
|
|
|
|
|
var dependencies = [
|
|
|
{ src: '../../../../lib/js/classList.js', condition: function() { return !document.body.classList; } },
|
|
|
{ src: '../../../../plugin/markdown/showdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
|
{ src: '../../../../plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
|
{ src: '../../../../plugin/prism/prism.js', async: true, callback: function() { /*hljs.initHighlightingOnLoad();*/ } },
|
|
|
{ src: '../../../../plugin/zoom-js/zoom.js', async: true, condition: function() { return !!document.body.classList; } }
|
|
|
];
|
|
|
|
|
|
if (Reveal.getQueryHash().instructor === 1) {
|
|
|
dependencies.push({ src: '../../../../plugin/notes/notes.js', async: true, condition: function() { return !!document.body.classList; } });
|
|
|
}
|
|
|
// Full list of configuration options available here:
|
|
|
// https://github.com/hakimel/reveal.js#configuration
|
|
|
Reveal.initialize({
|
|
|
controls: true,
|
|
|
progress: true,
|
|
|
history: true,
|
|
|
center: false,
|
|
|
slideNumber: true,
|
|
|
|
|
|
// available themes are in /css/theme
|
|
|
theme: Reveal.getQueryHash().theme || 'default',
|
|
|
|
|
|
// default/cube/page/concave/zoom/linear/fade/none
|
|
|
transition: Reveal.getQueryHash().transition || 'slide',
|
|
|
|
|
|
// Optional libraries used to extend on reveal.js
|
|
|
dependencies: dependencies
|
|
|
});
|
|
|
|
|
|
if (Reveal.getQueryHash().instructor === 1) {
|
|
|
Reveal.configure(dependencies.push({ src: '../../../../plugin/notes/notes.js', async: true, condition: function() { return !!document.body.classList; } }));
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
Reveal.addEventListener('ready', function() {
|
|
|
if (Reveal.getCurrentSlide().classList.contains('separator-subhead')) {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-subhead.css');
|
|
|
} else if (Reveal.getCurrentSlide().classList.contains('separator')) {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-title.css')
|
|
|
} else {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga.css');
|
|
|
}
|
|
|
});
|
|
|
|
|
|
Reveal.addEventListener('slidechanged', function(e) {
|
|
|
if (Reveal.getCurrentSlide().classList.contains('separator-subhead')) {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-subhead.css');
|
|
|
} else if (Reveal.getCurrentSlide().classList.contains('separator')) {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga-title.css')
|
|
|
} else {
|
|
|
document.getElementById('theme').setAttribute('href', '../../../../css/theme/ga.css');
|
|
|
}
|
|
|
});
|
|
|
</script>
|
|
|
|
|
|
</body>
|
|
|
</html>
|