{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature engineering in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading/Exploring the data\n",
"\n",
"Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Pandas\n",
"\n",
"Import the `pandas` library as `pd`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the `../data/iris.csv` dataset into an object named `iris`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many different species are in this dataset?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are their names?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many samples are there per species?\n",
"\n",
"Hint
Use the .value_counts() method "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Engineering\n",
"\n",
"Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a similar column called `'petal_ratio'`: petal width / petal length"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create 4 columns that correspond to `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, only in inches."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply\n",
"\n",
"Create a column called `'encoded_species'`:\n",
"- 0 for setosa\n",
"- 1 for versicolor\n",
"- 2 for virginica\n",
"\n",
"\n",
"Hint 1
\n",
"Create a dictionary using the species as keys and the numbers 0-2 for values\n",
" \n",
"\n",
"Hint 2
\n",
" Use the dictionary in hint 1 with the .apply() method to create the new column\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## March Madness\n",
"\n",
"Let's change up the dataset to something different than flowers: March Madness!\n",
"\n",
"Read in the dataset `../data/ncaa-seeds.csv` to an object named `seeds`.\n",
"\n",
"This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:\n",
"\n",
"| team_seed | opponent_seed |\n",
"|-----------|---------------|\n",
"| 01N | 16N |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For team_seed, the 01 is their seed, and N is their division (North). This row is saying the 1st seed in the north division will play the 16th seed (same division).\n",
"\n",
"Using the `.apply()` method, create the following new columns:\n",
"- `team_division`\n",
"- `opponent_division`\n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division |\n",
"|-----------|---------------|---------------|-------------------|\n",
"| 01N | 16N | N | N |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.\n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division |\n",
"|-----------|---------------|---------------|-------------------|\n",
"| 1 | 16 | N | N |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new column called seed_delta, which is the difference between the team's seed and their opponent's. \n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division | seed_delta |\n",
"|-----------|---------------|---------------|-------------------|------------|\n",
"| 1 | 16 | N | N | -15 |\n",
"\n",
"
\n",
"Did you get an error?
\n",
"team_seed and opponent_seed need to be numerical columns in order for you to perform mathematical operations on them.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}