{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature engineering in Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading/Exploring the data\n",
    "\n",
    "Load the `iris.csv` file from this repo into a pandas DataFrame. Take a minute to familiarize yourself with the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Pandas\n",
    "\n",
    "Import the `pandas` library as `pd`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the `../data/iris.csv` dataset into an object named `iris`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many different species are in this dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are their names?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many samples are there per species?\n",
    "\n",
    "<details><summary>Hint</summary>Use the <a href=\"http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html\"><code>.value_counts()</code></a> method</details>"
   ]
  },
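  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is a minimal sketch of the counting methods these questions rely on, run on a made-up frame rather than the iris file. The `toy` frame and its `species` column name are assumptions for illustration; check `iris.columns` for the real column name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; the real file may name its columns differently.\n",
    "toy = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica']})\n",
    "\n",
    "print(toy['species'].nunique())       # number of distinct values\n",
    "print(toy['species'].unique())        # the distinct values themselves\n",
    "print(toy['species'].value_counts())  # number of rows per distinct value"
   ]
  },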
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering\n",
    "\n",
    "Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length"
   ]
  },
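  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how a derived column is built, the sketch below divides one column by another on a made-up frame. The `toy` frame, its `width` and `length` columns, and the values are all assumptions; the iris columns have their own names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data with made-up column names and values.\n",
    "toy = pd.DataFrame({'width': [2.0, 3.0], 'length': [4.0, 6.0]})\n",
    "\n",
    "# Dividing two columns is element-wise: each row's width over its length.\n",
    "toy['ratio'] = toy['width'] / toy['length']\n",
    "print(toy)"
   ]
  },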
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a similar column called `'petal_ratio'`: petal width / petal length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create 4 new columns that hold the same measurements as `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, converted to inches."
   ]
  },
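  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to do a unit conversion across several columns is a short loop, sketched here on a made-up two-column frame. The values and the `(in)` naming scheme for the new columns are assumptions, not something the exercise prescribes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "CM_PER_INCH = 2.54  # one inch is defined as exactly 2.54 cm\n",
    "\n",
    "# Hypothetical toy data; only two of the four columns, with made-up values.\n",
    "toy = pd.DataFrame({'sepal length (cm)': [5.1, 4.9], 'petal width (cm)': [0.2, 0.2]})\n",
    "\n",
    "# Copy the column names first so new columns are not visited by the loop.\n",
    "for col in list(toy.columns):\n",
    "    new_name = col.replace('(cm)', '(in)')  # assumed naming scheme\n",
    "    toy[new_name] = toy[col] / CM_PER_INCH\n",
    "\n",
    "print(toy.columns.tolist())"
   ]
  },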
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apply\n",
    "\n",
    "Create a column called `'encoded_species'`:\n",
    "- 0 for setosa\n",
    "- 1 for versicolor\n",
    "- 2 for virginica\n",
    "\n",
    "\n",
    "<details><summary>Hint 1</summary>\n",
    "Create a dictionary using the species as keys and the numbers 0-2 for values\n",
    "</details>\n",
    "\n",
    "<details><summary>Hint 2</summary>\n",
    " Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column\n",
    "</details>"
   ]
  },
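  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two hints can be sketched on a standalone Series as follows. The `labels` Series and the `codes` dictionary are hypothetical names, and the label spellings are an assumption; the values in `iris.csv` may be written differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical labels; check the spelling used in the real data.\n",
    "labels = pd.Series(['setosa', 'virginica', 'versicolor'])\n",
    "codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}\n",
    "\n",
    "# .apply() calls the function once per element; here it looks each label up.\n",
    "# A label missing from the dictionary raises a KeyError.\n",
    "encoded = labels.apply(lambda name: codes[name])\n",
    "print(encoded.tolist())"
   ]
  },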
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## March Madness\n",
    "\n",
    "Let's switch to a dataset about something other than flowers: March Madness!\n",
    "\n",
    "Read the dataset `../data/ncaa-seeds.csv` into an object named `seeds`.\n",
    "\n",
    "This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:\n",
    "\n",
    "| team_seed | opponent_seed |\n",
    "|-----------|---------------|\n",
    "| 01N | 16N |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `team_seed`, `01` is the team's seed and `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.\n",
    "\n",
    "Using the `.apply()` method, create the following new columns:\n",
    "- `team_division`\n",
    "- `opponent_division`\n",
    "\n",
    "The first row of your result should look as follows:\n",
    "\n",
    "| team_seed | opponent_seed | team_division | opponent_division |\n",
    "|-----------|---------------|---------------|-------------------|\n",
    "| 01N | 16N | N | N |\n"
   ]
  },
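  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of pulling the division letter out with `.apply()`, on a made-up frame. It assumes every value is two digits followed by a single letter; only the first row comes from the table above, and the second row is invented for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; the second row is made up.\n",
    "toy = pd.DataFrame({'team_seed': ['01N', '08S'], 'opponent_seed': ['16N', '09S']})\n",
    "\n",
    "# Assumes the division is always the last character of the string.\n",
    "toy['team_division'] = toy['team_seed'].apply(lambda seed: seed[-1])\n",
    "toy['opponent_division'] = toy['opponent_seed'].apply(lambda seed: seed[-1])\n",
    "print(toy)"
   ]
  },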
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you have the divisions, change the `team_seed` and `opponent_seed` columns so that they contain only the seed numbers.\n",
    "\n",
    "The first row of your result should look as follows:\n",
    "\n",
    "| team_seed | opponent_seed | team_division | opponent_division |\n",
    "|-----------|---------------|---------------|-------------------|\n",
    "| 1 | 16 | N | N |"
   ]
  },
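  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to keep only the number is to slice off the first two characters and convert them with `int`, sketched below on a made-up frame. This again assumes a fixed two-digit prefix; the `toy` frame is hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data in the same 'digits + letter' format.\n",
    "toy = pd.DataFrame({'team_seed': ['01N'], 'opponent_seed': ['16N']})\n",
    "\n",
    "for col in ['team_seed', 'opponent_seed']:\n",
    "    # int('01') is 1, so the leading zero disappears and the column becomes numeric.\n",
    "    toy[col] = toy[col].apply(lambda seed: int(seed[:2]))\n",
    "\n",
    "print(toy)\n",
    "print(toy.dtypes)"
   ]
  },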
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a new column called `seed_delta`, which is the team's seed minus its opponent's seed.\n",
    "\n",
    "The first row of your result should look as follows:\n",
    "\n",
    "| team_seed | opponent_seed | team_division | opponent_division | seed_delta |\n",
    "|-----------|---------------|---------------|-------------------|------------|\n",
    "| 1 | 16 | N | N | -15 |\n",
    "\n",
    "<br>\n",
    "<details><summary>Did you get an error?</summary>\n",
    "`team_seed` and `opponent_seed` need to be numerical columns in order for you to perform mathematical operations on them.\n",
    "</details>"
   ]
  },
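  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of the subtraction on a made-up frame whose seed columns are already integers. The `toy` frame and its second row are assumptions; checking `.dtypes` first is one way to confirm the columns are numeric before doing arithmetic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data with integer seeds; the second row is made up.\n",
    "toy = pd.DataFrame({'team_seed': [1, 8], 'opponent_seed': [16, 9]})\n",
    "\n",
    "print(toy.dtypes)  # both columns should report an integer dtype\n",
    "\n",
    "# Column subtraction is element-wise: team seed minus opponent seed per row.\n",
    "toy['seed_delta'] = toy['team_seed'] - toy['opponent_seed']\n",
    "print(toy)"
   ]
  },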
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}