{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature engineering in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading/Exploring the data\n", "\n", "Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Pandas\n", "\n", "Import the `pandas` library as `pd`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the `../data/iris.csv` dataset into an object named `iris`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different species are in this dataset?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are their names?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many samples are there per species?\n", "\n", "
Hint: Use the `.value_counts()` method
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "\n", "Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a similar column called `'petal_ratio'`: petal width / petal length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create 4 columns that correspond to `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, only in inches." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply\n", "\n", "Create a column called `'encoded_species'`:\n", "- 0 for setosa\n", "- 1 for versicolor\n", "- 2 for virginica\n", "\n", "\n", "
Hint 1\n", "Create a dictionary using the species as keys and the numbers 0-2 for values\n", "
\n", "\n", "
Hint 2\n", "Use the dictionary from Hint 1 with the `.apply()` method to create the new column\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## March Madness\n", "\n", "Let's change up the dataset to something different than flowers: March Madness!\n", "\n", "Read in the dataset `../data/ncaa-seeds.csv` to an object named `seeds`.\n", "\n", "This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:\n", "\n", "| team_seed | opponent_seed |\n", "|-----------|---------------|\n", "| 01N | 16N |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For team_seed, the 01 is their seed, and N is their division (North). This row is saying the 1st seed in the north division will play the 16th seed (same division).\n", "\n", "Using the `.apply()` method, create the following new columns:\n", "- `team_division`\n", "- `opponent_division`\n", "\n", "The first row of your result should look as follows:\n", "\n", "| team_seed | opponent_seed | team_division | opponent_division |\n", "|-----------|---------------|---------------|-------------------|\n", "| 01N | 16N | N | N |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.\n", "\n", "The first row of your result should look as follows:\n", "\n", "| team_seed | opponent_seed | team_division | opponent_division |\n", "|-----------|---------------|---------------|-------------------|\n", "| 1 | 16 | N | N |" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new column called seed_delta, which is the difference between the team's seed and their opponent's. \n", "\n", "The first row of your result should look as follows:\n", "\n", "| team_seed | opponent_seed | team_division | opponent_division | seed_delta |\n", "|-----------|---------------|---------------|-------------------|------------|\n", "| 1 | 16 | N | N | -15 |\n", "\n", "
\n", "
Did you get an error?\n", "`team_seed` and `opponent_seed` must be numeric columns before you can perform mathematical operations on them.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 1 }