{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature engineering in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading/Exploring the data\n",
"\n",
"Load the `iris.csv` file from this repo into a pandas DataFrame, then take a minute to familiarize yourself with the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Pandas\n",
"\n",
"Import the `pandas` library as `pd`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the `../data/iris.csv` dataset into an object named `iris`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"iris = pd.read_csv('../data/iris.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many different species are in this dataset?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris['species'].nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are their names?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['setosa', 'versicolor', 'virginica'], dtype=object)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris['species'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many samples are there per species?\n",
"\n",
"<details><summary>Hint</summary>Use the <a href=\"http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html\"><code>.value_counts()</code></a> method</details>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"versicolor 50\n",
"setosa 50\n",
"virginica 50\n",
"Name: species, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris['species'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Engineering\n",
"\n",
"Create a new column called `'sepal_ratio'`, equal to `sepal width (cm)` divided by `sepal length (cm)`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"iris['sepal_ratio'] = iris['sepal width (cm)'] / iris['sepal length (cm)']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a similar column called `'petal_ratio'`, equal to `petal width (cm)` divided by `petal length (cm)`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"iris['petal_ratio'] = iris['petal width (cm)'] / iris['petal length (cm)']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create four new columns that hold `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)` converted to inches (1 cm is about 0.393701 inches)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>species</th>\n",
" <th>sepal_ratio</th>\n",
" <th>petal_ratio</th>\n",
" <th>sepal length (inches)</th>\n",
" <th>petal length (inches)</th>\n",
" <th>sepal width (inches)</th>\n",
" <th>petal width (inches)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.686275</td>\n",
" <td>0.142857</td>\n",
" <td>2.007875</td>\n",
" <td>0.551181</td>\n",
" <td>1.377954</td>\n",
" <td>0.07874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.612245</td>\n",
" <td>0.142857</td>\n",
" <td>1.929135</td>\n",
" <td>0.551181</td>\n",
" <td>1.181103</td>\n",
" <td>0.07874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.680851</td>\n",
" <td>0.153846</td>\n",
" <td>1.850395</td>\n",
" <td>0.511811</td>\n",
" <td>1.259843</td>\n",
" <td>0.07874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.673913</td>\n",
" <td>0.133333</td>\n",
" <td>1.811025</td>\n",
" <td>0.590552</td>\n",
" <td>1.220473</td>\n",
" <td>0.07874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.720000</td>\n",
" <td>0.142857</td>\n",
" <td>1.968505</td>\n",
" <td>0.551181</td>\n",
" <td>1.417324</td>\n",
" <td>0.07874</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" species sepal_ratio petal_ratio sepal length (inches) \\\n",
"0 setosa 0.686275 0.142857 2.007875 \n",
"1 setosa 0.612245 0.142857 1.929135 \n",
"2 setosa 0.680851 0.153846 1.850395 \n",
"3 setosa 0.673913 0.133333 1.811025 \n",
"4 setosa 0.720000 0.142857 1.968505 \n",
"\n",
" petal length (inches) sepal width (inches) petal width (inches) \n",
"0 0.551181 1.377954 0.07874 \n",
"1 0.551181 1.181103 0.07874 \n",
"2 0.511811 1.259843 0.07874 \n",
"3 0.590552 1.220473 0.07874 \n",
"4 0.551181 1.417324 0.07874 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris['sepal length (inches)'] = iris['sepal length (cm)'] * 0.393701\n",
"iris['petal length (inches)'] = iris['petal length (cm)'] * 0.393701\n",
"iris['sepal width (inches)'] = iris['sepal width (cm)'] * 0.393701\n",
"iris['petal width (inches)'] = iris['petal width (cm)'] * 0.393701\n",
"iris.head()"
]
},
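{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four assignments above repeat the same conversion. As a minimal sketch, not the notebook's own solution, the cell below applies the same factor in a loop. It uses a small made-up frame so that it runs without the CSV; the names `toy` and `CM_TO_INCHES` are hypothetical and exist only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical two-row stand-in for the iris data, so this cell runs without the CSV.\n",
"toy = pd.DataFrame({\n",
"    'sepal length (cm)': [5.1, 4.9],\n",
"    'sepal width (cm)': [3.5, 3.0],\n",
"    'petal length (cm)': [1.4, 1.4],\n",
"    'petal width (cm)': [0.2, 0.2],\n",
"})\n",
"\n",
"CM_TO_INCHES = 0.393701  # same factor as in the cell above\n",
"\n",
"# Copy the column list first, because the loop adds new columns to the frame.\n",
"for col in list(toy.columns):\n",
"    toy[col.replace('(cm)', '(inches)')] = toy[col] * CM_TO_INCHES\n",
"\n",
"toy.head()"
]
},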
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply\n",
"\n",
"Create a column called `'encoded_species'`:\n",
"- 0 for setosa\n",
"- 1 for versicolor\n",
"- 2 for virginica\n",
"\n",
"\n",
"<details><summary>Hint 1</summary>\n",
"Create a dictionary with the species names as keys and the numbers 0-2 as values\n",
"</details>\n",
"\n",
"<details><summary>Hint 2</summary>\n",
" Use the dictionary from Hint 1 with the <code>.apply()</code> method (or pass it directly to <code>.map()</code>) to create the new column\n",
"</details>"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>species</th>\n",
" <th>sepal_ratio</th>\n",
" <th>petal_ratio</th>\n",
" <th>sepal length (inches)</th>\n",
" <th>petal length (inches)</th>\n",
" <th>sepal width (inches)</th>\n",
" <th>petal width (inches)</th>\n",
" <th>encoded_species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.686275</td>\n",
" <td>0.142857</td>\n",
" <td>2.007875</td>\n",
" <td>0.551181</td>\n",
" <td>1.377954</td>\n",
" <td>0.07874</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.612245</td>\n",
" <td>0.142857</td>\n",
" <td>1.929135</td>\n",
" <td>0.551181</td>\n",
" <td>1.181103</td>\n",
" <td>0.07874</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.680851</td>\n",
" <td>0.153846</td>\n",
" <td>1.850395</td>\n",
" <td>0.511811</td>\n",
" <td>1.259843</td>\n",
" <td>0.07874</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.673913</td>\n",
" <td>0.133333</td>\n",
" <td>1.811025</td>\n",
" <td>0.590552</td>\n",
" <td>1.220473</td>\n",
" <td>0.07874</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>setosa</td>\n",
" <td>0.720000</td>\n",
" <td>0.142857</td>\n",
" <td>1.968505</td>\n",
" <td>0.551181</td>\n",
" <td>1.417324</td>\n",
" <td>0.07874</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" species sepal_ratio petal_ratio sepal length (inches) \\\n",
"0 setosa 0.686275 0.142857 2.007875 \n",
"1 setosa 0.612245 0.142857 1.929135 \n",
"2 setosa 0.680851 0.153846 1.850395 \n",
"3 setosa 0.673913 0.133333 1.811025 \n",
"4 setosa 0.720000 0.142857 1.968505 \n",
"\n",
" petal length (inches) sepal width (inches) petal width (inches) \\\n",
"0 0.551181 1.377954 0.07874 \n",
"1 0.551181 1.181103 0.07874 \n",
"2 0.511811 1.259843 0.07874 \n",
"3 0.590552 1.220473 0.07874 \n",
"4 0.551181 1.417324 0.07874 \n",
"\n",
" encoded_species \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"species_dict = {\n",
" 'setosa': 0,\n",
" 'versicolor': 1,\n",
" 'virginica': 2\n",
"}\n",
"iris['encoded_species'] = iris['species'].map(species_dict)\n",
"iris.head()"
]
},
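{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint 2 mentions `.apply()`, while the solution above uses `.map()`. The sketch below, on a made-up Series, suggests the two agree whenever every value is a key of the dictionary. They differ for unseen labels: `.map()` yields `NaN`, whereas the dictionary lookup inside `.apply()` would raise a `KeyError`. The names `toy_species` and `codes` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-ins for iris['species'] and species_dict.\n",
"toy_species = pd.Series(['setosa', 'virginica', 'versicolor'])\n",
"codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}\n",
"\n",
"# Route 1: hand the dictionary straight to .map().\n",
"via_map = toy_species.map(codes)\n",
"\n",
"# Route 2: look each value up in the dictionary through .apply().\n",
"via_apply = toy_species.apply(lambda name: codes[name])\n",
"\n",
"# Compare the two encodings element by element.\n",
"print((via_map == via_apply).all())\n",
"\n",
"# A label missing from the dictionary becomes NaN under .map().\n",
"print(pd.Series(['setosa', 'unknown']).map(codes).isna().tolist())"
]
},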
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## March Madness\n",
"\n",
"Let's change up the dataset to something different from flowers: March Madness!\n",
"\n",
"Read in the dataset `../data/ncaa-seeds.csv` to an object named `seeds`.\n",
"\n",
"This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:\n",
"\n",
"| team_seed | opponent_seed |\n",
"|-----------|---------------|\n",
"| 01N | 16N |"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>team_seed</th>\n",
" <th>opponent_seed</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>01N</td>\n",
" <td>16N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>02N</td>\n",
" <td>15N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>03N</td>\n",
" <td>14N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>04N</td>\n",
" <td>13N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>05N</td>\n",
" <td>12N</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" team_seed opponent_seed\n",
"0 01N 16N\n",
"1 02N 15N\n",
"2 03N 14N\n",
"3 04N 13N\n",
"4 05N 12N"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seeds = pd.read_csv('../data/ncaa-seeds.csv')\n",
"seeds.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `team_seed`, the `01` is the team's seed and the `N` is its division (North). The first row therefore says that the 1st seed in the North division will play the 16th seed in the same division.\n",
"\n",
"Using the `.apply()` method, create the following new columns:\n",
"- `team_division`\n",
"- `opponent_division`\n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division |\n",
"|-----------|---------------|---------------|-------------------|\n",
"| 01N | 16N | N | N |\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>team_seed</th>\n",
" <th>opponent_seed</th>\n",
" <th>team_division</th>\n",
" <th>opponent_division</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>01N</td>\n",
" <td>16N</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>02N</td>\n",
" <td>15N</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>03N</td>\n",
" <td>14N</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>04N</td>\n",
" <td>13N</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>05N</td>\n",
" <td>12N</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" team_seed opponent_seed team_division opponent_division\n",
"0 01N 16N N N\n",
"1 02N 15N N N\n",
"2 03N 14N N N\n",
"3 04N 13N N N\n",
"4 05N 12N N N"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seeds['team_division'] = seeds['team_seed'].apply(lambda div: div[-1])\n",
"seeds['opponent_division'] = seeds['opponent_seed'].apply(lambda div: div[-1])\n",
"seeds.head()"
]
},
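{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative sketch to the `.apply()` solution above, the pandas `.str` accessor can slice every value in a column at once. The frame `toy_seeds` is a hypothetical stand-in for `seeds`, and the sketch assumes the division is always the final character of the seed string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical two-row stand-in for the seeds data.\n",
"toy_seeds = pd.DataFrame({\n",
"    'team_seed': ['01N', '02S'],\n",
"    'opponent_seed': ['16N', '15S'],\n",
"})\n",
"\n",
"# .str[-1] takes the last character of each string, like lambda div: div[-1].\n",
"toy_seeds['team_division'] = toy_seeds['team_seed'].str[-1]\n",
"toy_seeds['opponent_division'] = toy_seeds['opponent_seed'].str[-1]\n",
"toy_seeds"
]
},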
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have the divisions, change the `team_seed` and `opponent_seed` columns so they hold only the seed numbers, stored as integers.\n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division |\n",
"|-----------|---------------|---------------|-------------------|\n",
"| 1 | 16 | N | N |"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>team_seed</th>\n",
" <th>opponent_seed</th>\n",
" <th>team_division</th>\n",
" <th>opponent_division</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>16</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>15</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>14</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>13</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>12</td>\n",
" <td>N</td>\n",
" <td>N</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" team_seed opponent_seed team_division opponent_division\n",
"0 1 16 N N\n",
"1 2 15 N N\n",
"2 3 14 N N\n",
"3 4 13 N N\n",
"4 5 12 N N"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seeds['team_seed'] = seeds['team_seed'].apply(lambda seed: int(seed[:-1]))\n",
"seeds['opponent_seed'] = seeds['opponent_seed'].apply(lambda seed: int(seed[:-1]))\n",
"seeds.head()"
]
},
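{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same conversion can be sketched without a lambda by slicing with `.str[:-1]` and then casting with `.astype(int)`. This is one possible variant on a hypothetical frame `toy_seeds`, assuming every value is digits followed by exactly one division letter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical two-row stand-in for the seeds data.\n",
"toy_seeds = pd.DataFrame({\n",
"    'team_seed': ['01N', '02N'],\n",
"    'opponent_seed': ['16N', '15N'],\n",
"})\n",
"\n",
"# Drop the trailing division letter, then convert the remaining digits to integers.\n",
"for col in ['team_seed', 'opponent_seed']:\n",
"    toy_seeds[col] = toy_seeds[col].str[:-1].astype(int)\n",
"\n",
"toy_seeds.dtypes"
]
},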
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new column called `seed_delta`, equal to `team_seed` minus `opponent_seed`.\n",
"\n",
"The first row of your result should look as follows:\n",
"\n",
"| team_seed | opponent_seed | team_division | opponent_division | seed_delta |\n",
"|-----------|---------------|---------------|-------------------|------------|\n",
"| 1 | 16 | N | N | -15 |\n",
"\n",
"<br>\n",
"<details><summary>Did you get an error?</summary>\n",
"<code>team_seed</code> and <code>opponent_seed</code> need to be numeric columns before you can perform arithmetic on them.\n",
"</details>"
]
}
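,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way the `seed_delta` step might look is sketched below on a hypothetical frame `toy_seeds` whose seed columns are already integers. Subtracting one column from another works row by row, so the first row gives 1 - 16 = -15, matching the table above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for seeds after the seed columns were converted to integers.\n",
"toy_seeds = pd.DataFrame({'team_seed': [1, 2], 'opponent_seed': [16, 15]})\n",
"\n",
"# Column subtraction is element-wise: each row gets team seed minus opponent seed.\n",
"toy_seeds['seed_delta'] = toy_seeds['team_seed'] - toy_seeds['opponent_seed']\n",
"toy_seeds"
]
}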
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}