{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-zPDF33DONE" }, "source": [ "# Plotting with Pandas and Matplotlib\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "I4ag5MBwDONH" }, "source": [ "### Learning Objectives\n", "*After this lesson, you will be able to:*\n", "- Implement different types of plots on a given dataset.\n", "\n", "\n", "\n", "---------" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qDihbOCjDONK" }, "source": [ "## Recap\n", "\n", "In the last lesson, we learned about when to use the different types of plots. Can anyone give an example of when we would use a:\n", " * line plot?\n", " * bar plot?\n", " * histogram?\n", " * scatter plot?" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yjT2l0BuDONM" }, "source": [ "### Pandas and Matplotlib\n", "\n", "\n", "\n", "As we explore different types of plots, notice:\n", "\n", "1. Different types of plots are drawn very similarly -- they even tend to share parameter names.\n", "2. In Pandas, calling `plot()` on a `DataFrame` is different than calling it on a `Series`. Although the methods are both named `plot`, they may take different parameters." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "2HqUvXmYDONP" }, "source": [ "*Sometimes Pandas can be a little frustrating... 
perseverance is key!*\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SN6Z5XUNDONT" }, "source": [ "## Lesson Guide\n", "\n", "- [Line Plots](#line-plots)\n", "- [Bar Plots](#bar-plots)\n", "- [Histograms](#histograms)\n", "- [Scatter Plots](#scatter-plots)\n", "- [Using Seaborn](#using-seaborn)\n", "- [OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)](#matplotlib)\n", "- [OPTIONAL: Additional Topics](#additional-topics)\n", "\n", "- [Summary](#summary)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bqte2ZN4DONV" }, "source": [ "## Plotting with Pandas: How?\n", "\n", "`..plot()`\n", "\n", "**Note**: These are example plots on a fictitious dataset. We'll work with real ones in just a minute!\n", "\n", "`population['states'].value_counts().plot()` creates:\n", "\n", "![](https://exceljet.net/sites/default/files/styles/original_with_watermark/public/images/charttypes/line%20chart2.png?itok=lG1hqRu4)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "WNoBS5juDONW" }, "source": [ "## Plotting: Visualization Types\n", "\n", "Line charts are the default.\n", "\n", "`# line chart`\n", "\n", "`population['states'].value_counts().plot()`\n", "\n", "For other charts, pass `kind`. Bar charts and histograms need numeric values, and scatter plots need two columns, so they are called on the `DataFrame`:\n", "\n", "`population['states'].value_counts().plot(kind='bar')`\n", "\n", "`population['population'].plot(kind='hist', bins=3);`\n", "\n", "`population.plot(kind='scatter', x='states', y='population')`\n", "\n", "Let's try!" 
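, "\n", "
A minimal sketch of the `kind` argument on a made-up `population` table. The column names `state`, `age`, and `income` are assumptions for illustration and do not come from any lesson dataset. Note that `kind='scatter'` needs two columns, so it is called on the `DataFrame` rather than on a `Series`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; every name and value here is invented for illustration.
population = pd.DataFrame({
    'state': ['NY', 'CA', 'NY', 'TX', 'CA', 'NY'],
    'age': [23, 35, 41, 29, 52, 38],
    'income': [40, 62, 75, 48, 90, 66],
})

# Count how often each state appears; value_counts() sorts by count, largest first.
state_counts = population['state'].value_counts()

# One row of three subplots, one per plot kind.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bar plot: one bar per category.
state_counts.plot(kind='bar', ax=axes[0], title='bar')

# Histogram: distribution of one numeric column, split into 3 bins.
population['age'].plot(kind='hist', bins=3, ax=axes[1], title='hist')

# Scatter plot: two numeric columns, called on the DataFrame.
population.plot(kind='scatter', x='age', y='income', ax=axes[2], title='scatter')

fig.tight_layout()
```

Passing `ax=` puts each plot on its own subplot. Without it, consecutive `Series.plot()` calls in one cell typically draw onto the same axes.
"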
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "MxzMTLp3DONY" }, "source": [ "### Import packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "0CkXJSdxDONa" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# set the plots to display in the Jupyter notebook\n", "%matplotlib inline\n", "\n", "# change plotting colors per client request\n", "plt.style.use('fivethirtyeight')\n", "\n", "# Increase default figure and font sizes for easier viewing.\n", "plt.rcParams['figure.figsize'] = (8, 6)\n", "plt.rcParams['font.size'] = 14" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "h0ZGCLuEDONf" }, "source": [ "### Load in data sets for visualization\n", "\n", "- [Football Records](https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017): International football results from 1872 to 2018\n", "- [Avocado Prices](https://www.kaggle.com/neuromusic/avocado-prices): Historical data on avocado prices and sales volume in multiple US markets\n", "- [Chocolate Bar Ratings](https://www.kaggle.com/rtatman/chocolate-bar-ratings): Expert ratings of over 1,700 chocolate bars\n", "\n", "These have been included in `../data` of this repo for your convenience." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "foot = pd.read_csv('../data/international_football_results.csv')\n", "avo = pd.read_csv('../data/avocado.csv')\n", "choc = pd.read_csv('../data/chocolate_ratings.csv')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CrFBAr-yDONk" }, "source": [ "\n", "## Line plots: Show the trend of a numerical variable over time\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's focus on the football scores for starters." 
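, "\n", "
As a preview of the steps in this section, here is a sketch on a tiny made-up `matches` table with a hypothetical `date` column (the real file's column names may differ):

```python
import pandas as pd

# Hypothetical stand-in for the football data: one row per match.
matches = pd.DataFrame({
    'date': ['1872-11-30', '1873-03-08', '1873-11-15',
             '1874-03-07', '1874-10-03', '1874-12-05'],
})

# Convert the text dates to datetime64[ns], then pull out the year as an integer.
matches['year'] = pd.to_datetime(matches['date']).dt.year

# Count matches per year, then sort by the index (the year) instead of by the count.
games_per_year = matches['year'].value_counts().sort_index()

# A line plot is the default kind; the index (year) goes on the x-axis.
ax = games_per_year.plot()
ax.set_xlabel('year')
ax.set_ylabel('number of games')
```

Without `sort_index()`, the years would be ordered by how many games they had, and the line would zigzag back and forth in time.
"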
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract the year by converting the date to a `datetime64[ns]` object, and then using the `pd.Series.dt.year` property to return the year (as an `int`). We'll learn more about this in future lessons." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then get the number of games played every year by using `pd.Series.value_counts()`, and using the `sort_index()` method to ensure the years are sorted chronologically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this data, we can use the `pd.Series.plot()` method to graph **count of games** against **year of game**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "SQRNn8NbDONs", "outputId": "b042fdea-b151-4287-9525-0038cb9dcfbc" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YMx-ggN0DONx" }, "source": [ "### Knowledge Check \n", "\n", "Why does it make sense to use a line plot for this visualization? 
\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "C54zib1EDONy" }, "source": [ "### Another example\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "rGlS563hDONz", "outputId": "90f67495-1fe1-4a3f-930b-8281a43192fe" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "FQHr3oqpDON3" }, "source": [ "### Knowledge Check \n", "\n", "Why would it **NOT** make sense to use a line plot for this visualization?\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JaFxqHmuDON7" }, "source": [ "\n", "## Bar Plots: Show a numerical comparison across different categories\n", "---" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "aib7n5nMDON8" }, "source": [ "Count the number of games played in each country in the football dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "0NKT6cViDON8", "outputId": "20117947-0dda-4db3-8bd5-1510caa7a005" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view the same information, but in a bar chart instead. Note we are using `.head()` to return the top 5. 
Also note that `value_counts()` automatically sorts by the value (read the docs!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "-um6OqVnDOOA", "outputId": "4726c40f-b1a8-4bc4-ad4a-d24e6c63e1a0" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KdwuxIbXDOOE" }, "source": [ "\n", "## Histograms: Show the distribution of a numerical variable\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's change to the chocolate bar dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "lbRp1-zJDOOG" }, "source": [ "### How would you split the `Rating` values into 3 equally sized bins?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "3GA8RVSRDOOG", "outputId": "6fd9da5c-9695-45e2-893e-50a97dea8db0" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a histogram! The `bins=n` kwarg allows us to specify the number of bins ('buckets') of values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes it is helpful to increase this number if you think you might have an outlier or a zero-weighted set." 
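, "\n", "
To see how `bins` changes the picture, here is a small sketch on made-up ratings (the values below are invented, not taken from the chocolate dataset). The same numbers are drawn with 3 bins and with 10 bins:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented ratings: a cluster between 2.5 and 4.0 plus one very low and one very high value.
ratings = pd.Series(
    [1.0, 2.5, 2.75, 3.0, 3.0, 3.25, 3.5, 3.5, 3.75, 4.0, 5.0],
    name='Rating',
)

# Two subplots side by side so the bin counts can be compared directly.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Coarse view: 3 equal-width bins across the full range of the data.
ratings.plot(kind='hist', bins=3, ax=axes[0], title='bins=3')

# Finer view: 10 equal-width bins over the same range.
ratings.plot(kind='hist', bins=10, ax=axes[1], title='bins=10')

for ax in axes:
    ax.set_xlabel('Rating')

fig.tight_layout()
```

With more bins, the isolated low and high values stand apart from the main cluster instead of being absorbed into wide bars.
"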
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "SvtT0px2DOOQ", "outputId": "6f0840c3-e0f0-4c4f-f9e3-199921b567b5" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ok1NR7AgDOOU" }, "source": [ "### Knowledge check: \n", "What does the y-axis represent on a histogram? What about the x-axis? How would you explain a histogram to a non-technical person?" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "St7-JnIMDOOV" }, "source": [ "### Making histograms of an entire dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "9Z_Qlm-DDOOW", "outputId": "9167113b-c728-494c-852c-70feffa51be8" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JgMCXRAwDOOa" }, "source": [ "### Why doesn't it make plots of ALL the columns in the dataframe?\n", "\n", "Hint: what is different about the columns it plots vs. the ones it left out?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "BDjFPIqXDOOa", "outputId": "381b4346-18b3-4437-c1bc-52bb1d80d901" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the data types of all the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like it included `REF`, `Review Date`, and `Rating`. These have datatypes of `int64`, `int64`, and `float64` respectively. What do these all have in common, that is different from the other data types?\n", "

\n", "
\n", " Click for the answer!\n", " They're all **numeric!** The other columns are **categorical**, specifically string values.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can filter on these types using the `select_dtypes()` DataFrame method (which can be very handy!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LtmF6np5DOOe" }, "source": [ "### Challenge: create a histogram of the `Review Date` column, with 10 bins, and label both axes\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1HkkekvDDOOj" }, "source": [ "\n", "## Scatter plots: Show the relationship between two numerical variables\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scatter plots are very good at showing the **interaction between two numeric variables** (especially when they're continuous)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh snap! What did we just make?! It's a [price elasticity curve!](https://en.wikipedia.org/wiki/Price_elasticity_of_demand)\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cI3W9z6pDOO1" }, "source": [ "We can also use a thing called a **scatter matrix** or a **pairplot**, which is a grid of scatter plots. This allows you to quickly **view the interaction of N x M features**. You are generally looking for a trend between variables (a line or curve). Using machine learning, you can fit these curves to provide predictive power." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "0vTeuwo7DOO1", "outputId": "f4645002-dc3b-49f9-d751-42c43c9c58aa" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use a very handy parameter, `c`, which allows us to color the dots in a scatter plot. This is extremely helpful for **classification problems**, where you will often set the color to the class label." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's map the `type` field to the color of the dot in our price elasticity curve. To use the type field, we need to convert it from a string into a number. We can use `pd.Series.apply()` for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see we have two unique type labels, `conventional` and `organic`. Although that is the case for this dataset, let's create a function that will store the labels in a dictionary, incrementing the number by `1` for each new label. This way, if we receive an additional type label in the future, our code won't break. Always think about extensible code!" 
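, "\n", "
One way such a function could look, as a sketch. The name `build_mapping` and the short `types` series are assumptions for illustration; the lesson's own version may differ:

```python
import pandas as pd


def build_mapping(labels):
    # Assign 0, 1, 2, ... to each label in order of first appearance.
    mapping_dict = {}
    for label in labels:
        if label not in mapping_dict:
            # The next free number is simply how many labels we have stored so far.
            mapping_dict[label] = len(mapping_dict)
    return mapping_dict


# Hypothetical stand-in for the avocado `type` column.
types = pd.Series(['conventional', 'organic', 'conventional', 'organic'])

mapping_dict = build_mapping(types)

# Look up each label's number; the result can be passed as the `c` argument of a scatter plot.
type_codes = types.apply(lambda label: mapping_dict[label])
```

Because numbers are handed out as labels are encountered, a third label would get the next number without any change to the function.
"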
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use this `mapping_dict` dictionary to map the values using `.apply()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use this **binary class label** as our `c` parameter to gain some insight:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazing! It looks like the organic avocados (value of `1`) totally occupy the lower-volume, higher-price bracket. Those dang kids with their toast and unicycles driving up the price of my 'cados!\n", "\n", "Here, we can also see a 'more' continuous `c` parameter, `year`, which makes use of the gradient a little better. There are tons of gradients you can use; check them out [here](https://matplotlib.org/examples/color/colormaps_reference.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can save the plot to a file, using the `plt.savefig()` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Enp2scMwDOO8", "outputId": "9e479b9b-7a9a-42ce-8268-1788d23330e7" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Ron6_N0BDOO-" }, "source": [ "\n", "### Summary\n", "\n", "In this lesson, we showed examples of how to create a variety of plots using Pandas and Matplotlib. We also showed how to use each plot to effectively display data.\n", "\n", "Do not be concerned if you do not remember everything; this will come with practice! Although there are many plot styles, many similarities exist between how each plot is drawn. 
For example, they have most parameters in common, and the same Matplotlib functions are used to modify the plot area.\n", "\n", "We looked at:\n", "- Line plots\n", "- Bar plots\n", "- Histograms\n", "- Scatter plots" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nLX7Hs6zDOO-" }, "source": [ "### Additional Resources\n", "\n", "Always read the documentation!\n", "\n", "- [Pandas Plotting Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)\n", "\n", "- [Matplotlib Documentation](https://matplotlib.org/)\n", "\n", "- [Matplotlib sample plots](https://matplotlib.org/tutorials/introductory/sample_plots.html)" ] } ], "metadata": { "colab": { "name": "04-plotting-with-pandas.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }