{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <span>\n",
" <p align=\"left\">\n",
" <img align=\"left\" valign=\"center\" src=\"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/5734/GA_Stack_Large_RedBlack_RGB.png\" width=\"80px\">\n",
" </p>\n",
" </span>\n",
" <span>\n",
" <h1>Pandas for Dates and Times</h1>\n",
" </span>\n",
"</div>\n",
"\n",
"\n",
"Pandas has first-class support for dates and times. This is especially helpful for financial analysis, where most measures are recorded against time.\n",
"\n",
"<!-- Overview -->\n",
"<details>\n",
" <summary>Overview</summary>\n",
" <ul>\n",
" <li>In this lesson, we'll continue exploring Pandas with dates and times. Specifically:</li>\n",
" <ul>\n",
" <li>Converting dates and times into a <code>Timestamp</code> object using <code>to_datetime</code>.</li>\n",
" <li>Specifying input and output format arguments</li>\n",
" <li>Extracting components, such as year and day, from a <code>Timestamp</code> object</li>\n",
" <li>Creating <code>DatetimeIndex</code> objects, and their advantages</li>\n",
" </ul>\n",
" </ul>\n",
"</details>\n",
"\n",
"<!-- TOC -->\n",
"<details>\n",
" <summary>Table of Contents</summary>\n",
" <ul>\n",
" <li><a href=\"#import\">Import</a></li>\n",
" <li><a href=\"#objects\">Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#timestamp\">Creating Timestamp Objects</a></li>\n",
" <li><a href=\"#timestampidx\">Creating an Index of Timestamps</a></li>\n",
" <li><a href=\"#period\">Creating Period Objects</a></li>\n",
" <li><a href=\"#periodidx\">Creating an Index of Periods</a></li>\n",
" </ul>\n",
" <li><a href=\"#conversion\">Converting Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#todatetime\">Using .to_datetime()</a></li>\n",
" <ul>\n",
" <li><a href=\"#nulls\">Working with Nulls</a></li>\n",
" </ul>\n",
" <li><a href=\"#extracting\">Extracting Components from Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#year\">Year</a></li>\n",
" <li><a href=\"#day\">Day</a></li>\n",
" </ul>\n",
" </ul>\n",
" <li><a href=\"#exercise\">Exercise - AdventureWorks</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_exercise\">Production.Product Exercise</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_read\">Read in the Dataset</a></li>\n",
" <li><a href=\"#p_resetidx\">Reset Index</a></li>\n",
" <li><a href=\"#p_types\">Identify Types</a></li>\n",
" <li><a href=\"#p_nullcheck\">Check Nulls</a></li>\n",
" <li><a href=\"#p_convert\">Convert Dates</a></li>\n",
" <li><a href=\"#p_newcols\">Create New Columns</a></li>\n",
" </ul>\n",
" <li><a href=\"#p_exercise\">Sales.SalesOrderHeader Exercise</a></li>\n",
" <ul>\n",
" <li><a href=\"#s_read\">Read in the Dataset</a></li>\n",
" <li><a href=\"#s_resetidx\">Reset Index</a></li>\n",
" <li><a href=\"#s_slice\">Slice Dates</a></li>\n",
" </ul>\n",
" </ul>\n",
" </ul>\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"import\"></div>\n",
"<h2>Import Pandas</h2>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pandas v1.0.1\n",
"Numpy v1.18.1\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"objects\"></div>\n",
"<h2>Datetime Objects</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <tr>\n",
" <th>Use</th>\n",
" <th>Class</th>\n",
" <th>Remarks</th>\n",
" <th>How to create</th>\n",
" </tr>\n",
" <tr>\n",
" <td rowspan=\"2\">Time points</td>\n",
" <td>Timestamp</td>\n",
" <td>Represents a single timestamp</td>\n",
" <td>to_datetime, Timestamp</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DatetimeIndex</td>\n",
" <td>Index of Timestamp</td>\n",
" <td>to_datetime, date_range, bdate_range, DatetimeIndex</td>\n",
" </tr>\n",
" <tr>\n",
" <td rowspan=\"2\">Time spans</td>\n",
" <td>Period</td>\n",
" <td>Represents a single time span</td>\n",
" <td>Period</td>\n",
" </tr>\n",
" <tr>\n",
" <td>PeriodIndex</td>\n",
" <td>Index of Period</td>\n",
" <td>period_range, PeriodIndex</td>\n",
" </tr>\n",
"</table>\n",
" \n",
"Above is a list of the date and time types available in pandas. Under the hood, `Timestamp` and `DatetimeIndex` objects are backed by numpy `datetime64[ns]` values, while `Period` and `PeriodIndex` objects store an integer ordinal together with a frequency.\n",
"\n",
"<ul>\n",
" <li>Time points</li>\n",
" <ul>\n",
" <li>The first two objects in the above table, <code>Timestamp</code> and <code>DatetimeIndex</code>, deal with <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-to-timestamps\"><b>discrete points in time</b></a>. These will be the focus of this and future labs.</li>\n",
" </ul>\n",
" <li>Time spans</li>\n",
" <ul>\n",
" <li>The latter two objects, <code>Period</code> and <code>PeriodIndex</code>, deal with <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-span-representation\"><b>spans of time</b></a>. We will only touch on these briefly, because most data ingestion tasks (as opposed to data creation) are handled by coercion into <code>Timestamp</code> objects. In practice, pandas usually works on data that already exists elsewhere, often in an RDBMS queried with SQL.</li>\n",
" </ul>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"timestamp\"></div>\n",
"<h3>Creating Timestamp Objects</h3>\n",
"\n",
"Let's start by making a `Timestamp` object.\n",
"\n",
"```python\n",
"Init signature: pd.Timestamp(ts_input=<object object at 0x7fc5d75bfe60>, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None)\n",
"Docstring: \n",
"Pandas replacement for datetime.datetime\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a `Timestamp` object using the kwargs explicitly:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2018-10-19 12:35:59')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Timestamp(year=2018, month=10, day=19, hour=12, minute=35, second=59)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also parse a string directly, as follows. Note the `T` separator between the date (YYYYMMDD) and the time (HH:MM:SS)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2018-10-29 12:35:59')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Timestamp('20181029T12:35:59')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that, under the hood, a `Timestamp` is stored like a numpy `datetime64[ns]` value: a signed 64-bit integer counting nanoseconds from the Unix epoch (1970-01-01). That gives 2^64 distinct nanoseconds, which covers about 584 years, as the minimum and maximum below show."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1677-09-21 00:12:43.145225\n",
"2262-04-11 23:47:16.854775807\n"
]
}
],
"source": [
"print(f'{pd.Timestamp.min}\\n{pd.Timestamp.max}')"
]
},
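{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sanity check on the 584-year figure, the cell below redoes the arithmetic. It is only a sketch: the year length of 365.25 days is an assumption made for the estimate, and `ns_per_year` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# nanoseconds in one year, assuming a 365.25-day year for the estimate\n",
"ns_per_year = 1e9 * 60 * 60 * 24 * 365.25\n",
"\n",
"# 2**64 nanoseconds expressed in years (roughly 584.5)\n",
"print(2**64 / ns_per_year)\n",
"\n",
"# Timestamp.max sits at the top of the signed 64-bit range\n",
"print(pd.Timestamp.max.value == 2**63 - 1)"
]
},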
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"timestampidx\"></div>\n",
"<h3>Creating an Index of Timestamps</h3>\n",
"\n",
"We can assign these `Timestamp` objects to the index of our `DataFrame` to create a table that is indexed chronologically. A timeseries database, if you will. Pandas has a helper function for this, `pd.date_range`. We will use three of its arguments:\n",
"\n",
"<ol>\n",
" <li><code>start</code>: The beginning of the index</li>\n",
" <li><code>end</code>: The end of the index</li>\n",
" <li><code>freq</code>: The interval for each <code>Timestamp</code></li>\n",
"</ol>\n",
"\n",
"Note that <code>freq</code> takes what is referred to as an <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases\">offset alias</a>, one of a predefined set of strings naming common frequencies, that is, the spacing between consecutive <code>Timestamp</code> values.\n",
"\n",
"```python\n",
"Signature: pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)\n",
"Docstring:\n",
"Return a fixed frequency DatetimeIndex.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create an index starting at Jan 1, 2018, and ending at Jan 1, 2019. It does so with `BM`, or <i>business month end frequency</i>, which is the last working day of each month. (From pandas 2.2 onward this alias is spelled `BME`, and `BM` is deprecated.)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-30', '2018-04-30',\n",
" '2018-05-31', '2018-06-29', '2018-07-31', '2018-08-31',\n",
" '2018-09-28', '2018-10-31', '2018-11-30', '2018-12-31'],\n",
" dtype='datetime64[ns]', freq='BM')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"start = pd.Timestamp(year=2018, month=1, day=1)\n",
"end = pd.Timestamp(year=2019, month=1, day=1)\n",
"pd.date_range(start, end, freq='BM')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then create a series out of this index by specifying it in the <code>index=</code> kwarg of <code>pd.Series</code>:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"date_idx = pd.date_range(start, end, freq='BM')\n",
"date_series = pd.Series(np.random.randn(len(date_idx)), index=date_idx)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2018-01-31 -0.044885\n",
"2018-02-28 0.151402\n",
"2018-03-30 -1.498441\n",
"2018-04-30 0.245860\n",
"2018-05-31 -0.222092\n",
"2018-06-29 -1.082039\n",
"2018-07-31 1.595104\n",
"2018-08-31 -0.981576\n",
"2018-09-28 -0.154903\n",
"2018-10-31 -0.139406\n",
"2018-11-30 -0.086125\n",
"2018-12-31 0.372902\n",
"Freq: BM, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that our index type is a <code>DatetimeIndex</code>."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.indexes.datetimes.DatetimeIndex"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(date_series.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can still slice this series by position, just as we would with an automatically generated integer index. Let's take the first 3 rows:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2018-01-31 -0.044885\n",
"2018-02-28 0.151402\n",
"2018-03-30 -1.498441\n",
"Freq: BM, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_series[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But there's more! We can even select specific dates in the index, as a string, which returns to us the corresponding value in that 'cell':"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-0.04488546780421337"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_series['1/31/2018']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"stringslicing\"></div>\n",
"Finally, we can select <b>ranges</b> of dates, <i>as strings</i>. Note that both endpoints are included in the result:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2018-01-31 -0.044885\n",
"2018-02-28 0.151402\n",
"2018-03-30 -1.498441\n",
"2018-04-30 0.245860\n",
"2018-05-31 -0.222092\n",
"2018-06-29 -1.082039\n",
"2018-07-31 1.595104\n",
"2018-08-31 -0.981576\n",
"Freq: BM, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_series['1/31/2018':'8/31/2018']"
]
},
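{
"cell_type": "markdown",
"metadata": {},
"source": [
"String-based selection also works with partial dates. The cell below is a minimal sketch on a made-up daily series (`daily` is a name introduced here, not part of the lesson data): a year-month string passed to `.loc` selects every row in that month, and a slice of such strings includes both endpoint months."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# hypothetical daily series covering all of 2018\n",
"idx = pd.date_range('2018-01-01', '2018-12-31', freq='D')\n",
"daily = pd.Series(np.arange(len(idx)), index=idx)\n",
"\n",
"# every row in March 2018\n",
"march = daily.loc['2018-03']\n",
"\n",
"# April through June; both endpoint months are included\n",
"q2 = daily.loc['2018-04':'2018-06']\n",
"\n",
"print(len(march), len(q2))  # 31 91"
]
},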
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"period\"></div>\n",
"<h3>Creating Period Objects</h3>\n",
"\n",
"Let's <i>briefly</i> touch on creating `Period` objects. Remember that this represents a <b>span</b> of time, whereas `Timestamp` objects represent a <b>distinct point</b> in time.\n",
"\n",
"Periods are useful to determine <i>if a `Timestamp` is within the bounds of a period</i>.\n",
"\n",
"Note that <b>both</b> object types can be used as an index of a `DataFrame`. Additionally, one can check whether a `Timestamp` falls <i>between</i> (or is greater / less than) other `Timestamp`s, so a `Period` is generally used <i>when you only care <b>which period</b> an event falls in, and not the <b>exact time</b> it occurred</i>.\n",
"\n",
"```python\n",
"Init signature: pd.Period(value=None, freq=None, ordinal=None, year=None, month=None, quarter=None, day=None, hour=None, minute=None, second=None)\n",
"Docstring: \n",
"Represents a period of time\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make a `Period` object. Here, we're using an <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases\">offset alias</a> of `B`, which represents business day frequency (Monday through Friday). Pandas does not assume `B` on its own (a date string such as `'2018-10-19'` is inferred as daily, `D`), so we state the frequency explicitly."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Period('2018-10-19', 'B')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Period(year=2018, month=10, day=19, freq='B')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"periodidx\"></div>\n",
"<h3>Creating an Index of Periods</h3>\n",
"\n",
"Here, we once again have a helper function, `period_range`, which allows us to create sequential `Period` objects within an index.\n",
"\n",
"```python\n",
"Signature: pd.period_range(start=None, end=None, periods=None, freq='D', name=None)\n",
"Docstring:\n",
"Return a fixed frequency PeriodIndex, with day (calendar) as the default\n",
"frequency\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make a period index with a monthly frequency, `M`, running from January 2018 through January 2019. Note that we do not have `BM` (business month) as an available frequency for a `Period` object, unlike our previous `Timestamp` example."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',\n",
" '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',\n",
" '2019-01'],\n",
" dtype='period[M]', freq='M')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"start = pd.Timestamp(year=2018, month=1, day=1)\n",
"end = pd.Timestamp(year=2019, month=1, day=1)\n",
"pd.period_range(start, end, freq='M')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"date_idx = pd.period_range(start, end, freq='M')\n",
"date_series = pd.Series(np.random.randn(len(date_idx)), index=date_idx)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2018-01 -0.202734\n",
"2018-02 -1.288933\n",
"2018-03 -0.929931\n",
"2018-04 -0.896952\n",
"2018-05 1.432703\n",
"2018-06 1.775852\n",
"2018-07 0.599205\n",
"2018-08 -2.111145\n",
"2018-09 -0.652618\n",
"2018-10 0.008192\n",
"2018-11 0.055414\n",
"2018-12 -0.255667\n",
"2019-01 -1.791715\n",
"Freq: M, dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The true power of this lies in <i>the ability to see if a `Timestamp` object <b>exists within the bounds of a `Period` index</b></i>. Here, we are finding out in what 'time slot' Feb 22, 2018 lies within our `Period` index."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([False,  True, False, False, False, False, False, False, False,\n",
" False, False, False, False])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# a Timestamp never compares equal to a Period, so convert it to a monthly Period first\n",
"date_series.index == pd.Timestamp(year=2018, month=2, day=22).to_period('M')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can be used to index the source table as such, returning the row representing the period that contains our target date, Feb 22 2018."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
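"2018-02   -1.288933\n",
"Freq: M, dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# convert the Timestamp to a monthly Period so the mask can match the 2018-02 row\n",
"date_series[ date_series.index == pd.Timestamp(year=2018, month=2, day=22).to_period('M') ]"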
]
},
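{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of a `Timestamp` falling inside a `Period` concrete, here is a small sketch using a single made-up period and timestamp (`p` and `ts` are names introduced here for illustration). It checks membership two ways: against the period's `start_time` and `end_time` bounds, and by converting the timestamp to a period of the same frequency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"p = pd.Period('2018-02', freq='M')   # the whole month of February 2018\n",
"ts = pd.Timestamp('2018-02-22')      # a single point in time\n",
"\n",
"# option 1: compare against the first and last instants of the period\n",
"print(p.start_time <= ts <= p.end_time)  # True\n",
"\n",
"# option 2: convert the timestamp to a monthly period and compare periods\n",
"print(ts.to_period('M') == p)            # True"
]
},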
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"conversion\"></div>\n",
"<h2>Converting Datetime Objects</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"todatetime\"></div>\n",
"<h3>Using .to_datetime()</h3>\n",
"\n",
"Previously, we manually created `Timestamp` and `Period` objects. This assumes we already know the year, day, etc. of our input data - nice and clean, ready to convert. \n",
"\n",
"Of course, data is never clean, and we'd need to parse the input string to feed the individual keyword arguments to a function like `Timestamp` for it to know how to convert it. What a pain! Surely, there must be a better way to parse these pesky strings!\n",
"\n",
"<b>Enter `to_datetime()`.</b>\n",
"\n",
"This function detects and parses a wide range of date strings automatically. A single string comes back as a `Timestamp`; a list of strings comes back as a `DatetimeIndex`, and a `Series` of strings comes back as a datetime `Series`. Nice!\n",
"\n",
"```python\n",
"Signature: pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)\n",
"Docstring:\n",
"Convert argument to datetime.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if it can convert this wild string correctly, 1:55pm and 24 seconds, January 22nd, 1985."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('1985-01-22 13:55:24')"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.to_datetime(\"1:55pm and 24 seconds, January 22nd, 1985\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amazing! Let's try it out on a `DataFrame` object to see how it fares. First, let's make a `DataFrame` containing a few dates we make up. Note that the 3rd entry is `np.nan`, which represents a null value in our dataset."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>myDateColumn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10/1/2018 8:29:59PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1/1/2019 12:01AM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>September 20th, 1992</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" myDateColumn\n",
"0 10/1/2018 8:29:59PM\n",
"1 1/1/2019 12:01AM\n",
"2 NaN\n",
"3 September 20th, 1992"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'myDateColumn': ['10/1/2018 8:29:59PM', '1/1/2019 12:01AM', np.nan, 'September 20th, 1992']})\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's try to convert this column to `Timestamp` values, using `to_datetime()`. Do we think it will work? Note: we are storing the resulting `Series` in a variable, `s`, for later use."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2018-10-01 20:29:59\n",
"1 2019-01-01 00:01:00\n",
"2 NaT\n",
"3 1992-09-20 00:00:00\n",
"Name: myDateColumn, dtype: datetime64[ns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = pd.to_datetime(df['myDateColumn'])\n",
"s.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>It did!</i><br><br><u>Note:</u>\n",
"<ul>\n",
" <li>The dates are stored with the numpy dtype <code>datetime64[ns]</code>, which is the storage type behind a <code>Timestamp</code>.</li>\n",
" <li>Pandas assumes a time of midnight, <code>00:00:00</code>, for entries where no time information is passed to <code>to_datetime()</code>.</li>\n",
" <li>Our null is represented as <code>NaT</code>, which is short for 'Not a Time'.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"nulls\"></div>\n",
"<h4>Working with Nulls</h4>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This brings us to our final point - what happens if our superhero `to_datetime()` function <i>can't parse the input string</i> and arrive at a usable date?\n",
"\n",
"The default for the `errors` argument is `'raise'`, which raises a `ValueError` and exits the function (parsing stops immediately). What if we just want to stick a null in there and move on?\n",
"\n",
"That's what `errors='coerce'` is useful for. If `to_datetime()` can't parse the string, it'll just stick a `NaT` in there instead. In many cases, this is preferable. Make sure to keep an eye on the number of nulls you generate this way, as pandas won't warn you about them."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaT"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the default for errors kwarg is 'raise'\n",
"pd.to_datetime('I am not a date', errors='coerce')"
]
},
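{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to keep an eye on those silently generated nulls is to compare the column before and after parsing. The cell below is a minimal sketch on a made-up series (`raw`, `parsed` and `newly_null` are names introduced here, not part of the lesson data): a row counts as newly null when it held a value before parsing and is `NaT` afterwards."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical input: two good dates, one bad string, one genuine null\n",
"raw = pd.Series(['2018-10-01', 'not a date', None, '2019-01-01'])\n",
"parsed = pd.to_datetime(raw, errors='coerce')\n",
"\n",
"# rows that had a value before parsing but are NaT afterwards\n",
"newly_null = parsed.isna() & raw.notna()\n",
"\n",
"print(newly_null.sum())   # 1\n",
"print(raw[newly_null])    # the offending input values"
]
},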
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"extracting\"></div>\n",
"<h3>Extracting Components from Datetime Objects</h3>\n",
"\n",
"So, we've gotten our messy string values all tidied up using `to_datetime()`. What happens when I want to retrieve the year or day of the data I've stored?\n",
"\n",
"Enter <a href=\"https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties\">series datetimelike properties</a>.\n",
"\n",
"These are a collection of `Series` object properties that are accessible when the datatype is `Timestamp` or `Period`. Basically, <i>it allows us to extract date information from our date column, when it's stored as a date</i>.\n",
"\n",
"We access these properties using the following dot notation:\n",
"\n",
"```python\n",
"pd.Series.dt.<part of the date you want>\n",
"```\n",
"\n",
"If our `Series` object were named `s`, it'd be written as:\n",
"\n",
"```python\n",
"s.dt.<part of the date you want>\n",
"```"
]
},
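{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the accessor, assuming a small made-up series `s_demo` (not part of the lesson data), the cell below pulls the month and the weekday name out of each date. Note that `day_name()` is a method rather than a property, so it needs parentheses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical datetime series with no missing values\n",
"s_demo = pd.Series(pd.to_datetime(['2018-10-01', '1992-09-20']))\n",
"\n",
"# property access: no parentheses\n",
"print(s_demo.dt.month.tolist())       # [10, 9]\n",
"\n",
"# method access: parentheses required\n",
"print(s_demo.dt.day_name().tolist())  # ['Monday', 'Sunday']"
]
},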
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"year\"></div>\n",
"Let's extract just the `year` of the above `Series` object. What do we expect to see?"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2018.0\n",
"1 2019.0\n",
"2 NaN\n",
"3 1992.0\n",
"Name: myDateColumn, dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.dt.year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"day\"></div>\n",
"Now you try - retrieve the `day` of the time series. What is the result?"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"1 1.0\n",
"2 NaN\n",
"3 20.0\n",
"Name: myDateColumn, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.dt.day"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the data type of the resultant conversion? Why does this matter?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Best Practices</b>:\n",
"<ul>\n",
" <li>Whenever possible, store dates as <code>Timestamp</code> or <code>Period</code> objects.</li>\n",
" <li>These datatypes come with methods and properties that are optimized for memory (space) and compute (time).</li>\n",
" <li>Use the component properties above <i>only during reporting or extraction</i>.</li>\n",
" <li>Do not store derived components (year, day, and so on) alongside the parent datetime; computing them on demand avoids redundancy and keeps the dataset small.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise\"></div>\n",
"<h2>Exercise - AdventureWorks</h2>\n",
"<p align=\"right\">\n",
"<img src=\"http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800\">\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_exercise\"></div>\n",
"<h3>Production.Product</h3>\n",
"\n",
"Here's the <i>Production.Product</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which is a description of the fields (columns) in the table (the .csv file we will import below):<br>\n",
"\n",
"<details>\n",
" <summary>Data Dictionary</summary>\n",
" <table>\n",
" <tr>\n",
" <th>Name</th>\n",
" <th>Description</th>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductID</td>\n",
" <td>Primary key for Product records</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Name</td>\n",
" <td>Name of the product</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductNumber</td>\n",
" <td>Unique product identification number</td>\n",
" </tr>\n",
" <tr>\n",
" <td>MakeFlag</td>\n",
" <td>0 = Product is purchased, 1 = Product is manufactured in-house.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>FinishedGoodsFlag</td>\n",
" <td>0 = Product is not a salable item. 1 = Product is salable.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Color</td>\n",
" <td>Product color</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SafetyStockLevel</td>\n",
" <td>Minimum inventory quantity</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ReorderPoint</td>\n",
" <td>Inventory level that triggers a purchase order or work order</td>\n",
" </tr>\n",
" <tr>\n",
" <td>StandardCost</td>\n",
" <td>Standard cost of the product [USD]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ListPrice</td>\n",
" <td>Selling price [USD]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Size</td>\n",
" <td>Product size [units vary, see SizeUnitMeasureCode]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SizeUnitMeasureCode</td>\n",
" <td>Unit of measure for the Size column</td>\n",
" </tr>\n",
" <tr>\n",
" <td>WeightUnitMeasureCode</td>\n",
" <td>Unit of measure for the Weight column</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DaysToManufacture</td>\n",
" <td>Number of days required to manufacture the product</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductLine</td>\n",
" <td>R = Road, M = Mountain, T = Touring, S = Standard</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Class</td>\n",
" <td>H = High, M = Medium, L = Low</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Style</td>\n",
" <td>W = Womens, M = Mens, U = Universal</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductSubcategoryID</td>\n",
" <td>Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductModelID</td>\n",
" <td>Product is a member of this product model. Foreign key to ProductModel.ProductModelID</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SellStartDate</td>\n",
" <td>Date the product was available for sale</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SellEndDate</td>\n",
" <td>Date the product was no longer available for sale</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DiscontinuedDate</td>\n",
" <td>Date the product was discontinued</td>\n",
" </tr>\n",
" <tr>\n",
" <td>rowguid</td>\n",
" <td>ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ModifiedDate</td>\n",
" <td>Date and time the record was last updated</td>\n",
" </tr>\n",
" </table>\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_read\"></div>\n",
"<h4>Read in the Dataset</h4>\n",
"\n",
"We'll use the `read_csv()` function, passing `sep='\\t'` because the file is tab-delimited."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"prod = pd.read_csv('../data/Production.Product.csv', sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ProductID</th>\n",
" <th>Name</th>\n",
" <th>ProductNumber</th>\n",
" <th>MakeFlag</th>\n",
" <th>FinishedGoodsFlag</th>\n",
" <th>Color</th>\n",
" <th>SafetyStockLevel</th>\n",
" <th>ReorderPoint</th>\n",
" <th>StandardCost</th>\n",
" <th>ListPrice</th>\n",
" <th>...</th>\n",
" <th>ProductLine</th>\n",
" <th>Class</th>\n",
" <th>Style</th>\n",
" <th>ProductSubcategoryID</th>\n",
" <th>ProductModelID</th>\n",
" <th>SellStartDate</th>\n",
" <th>SellEndDate</th>\n",
" <th>DiscontinuedDate</th>\n",
" <th>rowguid</th>\n",
" <th>ModifiedDate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Adjustable Race</td>\n",
" <td>AR-5381</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>1000</td>\n",
" <td>750</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{694215B7-08F7-4C0D-ACB1-D734BA44C0C8}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Bearing Ball</td>\n",
" <td>BA-8327</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>1000</td>\n",
" <td>750</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{58AE3C20-4F3A-4749-A7D4-D568806CC537}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>BB Ball Bearing</td>\n",
" <td>BE-2349</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>800</td>\n",
" <td>600</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2008-04-30 00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>{9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E}</td>\n",
" <td>2014-02-08 10:01:36.827000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" ProductID Name ProductNumber MakeFlag FinishedGoodsFlag \\\n",
"0 1 Adjustable Race AR-5381 0 0 \n",
"1 2 Bearing Ball BA-8327 0 0 \n",
"2 3 BB Ball Bearing BE-2349 1 0 \n",
"\n",
" Color SafetyStockLevel ReorderPoint StandardCost ListPrice ... \\\n",
"0 NaN 1000 750 0.0 0.0 ... \n",
"1 NaN 1000 750 0.0 0.0 ... \n",
"2 NaN 800 600 0.0 0.0 ... \n",
"\n",
" ProductLine Class Style ProductSubcategoryID ProductModelID \\\n",
"0 NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"\n",
" SellStartDate SellEndDate DiscontinuedDate \\\n",
"0 2008-04-30 00:00:00 NaN NaN \n",
"1 2008-04-30 00:00:00 NaN NaN \n",
"2 2008-04-30 00:00:00 NaN NaN \n",
"\n",
" rowguid ModifiedDate \n",
"0 {694215B7-08F7-4C0D-ACB1-D734BA44C0C8} 2014-02-08 10:01:36.827000000 \n",
"1 {58AE3C20-4F3A-4749-A7D4-D568806CC537} 2014-02-08 10:01:36.827000000 \n",
"2 {9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E} 2014-02-08 10:01:36.827000000 \n",
"\n",
"[3 rows x 25 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's check out the first 3 rows again, for old time's sake\n",
"prod.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(504, 25)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and the number of rows x cols\n",
"prod.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_resetidx\"></div>\n",
"<h4>Reset Index</h4>\n",
"\n",
"Let's bring our `ProductID` column into the index since it's the PK (primary key) of our table and that's where PKs belong as a best practice."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"prod.set_index('ProductID', inplace=True)"
]
},
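{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what moving a key column into the index gives us, here is a minimal sketch on a toy frame. The three rows are copied from the preview above; the names `toy`, `key_is_unique`, and `name_of_product_2` are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy product table: three rows copied from the preview above\n",
"toy = pd.DataFrame({\n",
"    'ProductID': [1, 2, 3],\n",
"    'Name': ['Adjustable Race', 'Bearing Ball', 'BB Ball Bearing'],\n",
"})\n",
"\n",
"# Move the key column into the index (assignment instead of inplace=True)\n",
"toy = toy.set_index('ProductID')\n",
"\n",
"# A primary key should be unique, which the index can confirm\n",
"key_is_unique = toy.index.is_unique\n",
"\n",
"# Label-based lookup by key value\n",
"name_of_product_2 = toy.loc[2, 'Name']\n",
"```\n",
"\n",
"Assigning the result of `set_index()` back to a variable is an alternative to `inplace=True`; both leave the key in the index rather than in the columns."
]
},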
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_types\"></div>\n",
"<h3>Identify Types</h3>\n",
"\n",
"<ul>\n",
" <li>Print out the column data types</li>\n",
" <li>Which columns in the `prod` dataframe are candidates for datetime conversion? Store the column names in a list named <code>datecols</code></li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name object\n",
"ProductNumber object\n",
"MakeFlag int64\n",
"FinishedGoodsFlag int64\n",
"Color object\n",
"SafetyStockLevel int64\n",
"ReorderPoint int64\n",
"StandardCost float64\n",
"ListPrice float64\n",
"Size object\n",
"SizeUnitMeasureCode object\n",
"WeightUnitMeasureCode object\n",
"Weight float64\n",
"DaysToManufacture int64\n",
"ProductLine object\n",
"Class object\n",
"Style object\n",
"ProductSubcategoryID float64\n",
"ProductModelID float64\n",
"SellStartDate object\n",
"SellEndDate object\n",
"DiscontinuedDate float64\n",
"rowguid object\n",
"ModifiedDate object\n",
"dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"datecols = ['SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'ModifiedDate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_nullcheck\"></div>\n",
"<h4>Check Nulls</h4>\n",
"\n",
"<ul>\n",
" <li>Report the number of nulls for each column contained in <code>datecols</code></li>\n",
" <li>Display this result as a <code>pd.Series</code> object</li>\n",
" <li>Make note of anything that might warrant further investigation</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SellStartDate 0\n",
"SellEndDate 406\n",
"DiscontinuedDate 504\n",
"ModifiedDate 0\n",
"dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dicontinued date is null for every single row in the dataframe. \n",
"# It is also stored as a float. This is because pandas has no information to \n",
"# identify it as a date or string. \n",
"# Moreover, it looks like a large portion (406/504) products have stopped \n",
"# selling but are not discontinued.\n",
"prod[datecols].isnull().sum()"
]
},
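{
"cell_type": "markdown",
"metadata": {},
"source": [
"The observation about `DiscontinuedDate` can be reproduced on a toy frame. This is a sketch with made-up values (the `toy` frame and its dates are hypothetical, not taken from the dataset): an all-null column defaults to `float64`, and `to_datetime()` turns its nulls into `NaT`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy frame (hypothetical values): one date-like string column, one all-null column\n",
"toy = pd.DataFrame({\n",
"    'SellEndDate': ['2012-05-29 00:00:00', None, '2013-05-29 00:00:00'],\n",
"    'DiscontinuedDate': [np.nan, np.nan, np.nan],\n",
"})\n",
"\n",
"# An all-NaN column carries no type information, so pandas defaults to float64\n",
"all_null_dtype = toy['DiscontinuedDate'].dtype\n",
"\n",
"# Nulls per column, returned as a Series\n",
"null_counts = toy.isnull().sum()\n",
"\n",
"# Converting keeps the nulls, now as NaT (the datetime missing-value marker)\n",
"converted = pd.to_datetime(toy['DiscontinuedDate'])\n",
"```\n",
"\n",
"Inspecting `converted.dtype` should show a datetime type even though the column holds no actual dates."
]
},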
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_convert\"></div>\n",
"<h3>Convert the <code>SellStartDate</code> column to a <code>Timestamp</code> object</h3>\n",
"\n",
"Convert the <code>SellStartDate</code> column to a <code>Timestamp</code> object using <a href=\"#todatetime\"><code>to_datetime()</code></a>.\n",
"\n",
"<ul>\n",
" <li>Write a <code>for</code> loop to iterate over the columns in <code>datecols</code></i>\n",
" <li>Using <a href=\"#todatetime\"><code>to_datetime()</code></a>, convert each column to a <code>Timestamp</code> object</li>\n",
" <li>Take the result of this <code>Timestamp</code> object and overwrite each respective source column</li>\n",
" <li>Print the data types of the <code>datecols</code> columns to verify the conversion</li>\n",
" <li>Print the first 3 rows of the <code>datecols</code> columns to verify the conversion</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"for col in datecols:\n",
" prod[col] = pd.to_datetime(prod[col])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SellStartDate datetime64[ns]\n",
"SellEndDate datetime64[ns]\n",
"DiscontinuedDate datetime64[ns]\n",
"ModifiedDate datetime64[ns]\n",
"dtype: object"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod[datecols].dtypes"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SellStartDate</th>\n",
" <th>SellEndDate</th>\n",
" <th>DiscontinuedDate</th>\n",
" <th>ModifiedDate</th>\n",
" </tr>\n",
" <tr>\n",
" <th>ProductID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-04-30</td>\n",
" <td>NaT</td>\n",
" <td>NaT</td>\n",
" <td>2014-02-08 10:01:36.827</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-04-30</td>\n",
" <td>NaT</td>\n",
" <td>NaT</td>\n",
" <td>2014-02-08 10:01:36.827</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-04-30</td>\n",
" <td>NaT</td>\n",
" <td>NaT</td>\n",
" <td>2014-02-08 10:01:36.827</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SellStartDate SellEndDate DiscontinuedDate ModifiedDate\n",
"ProductID \n",
"1 2008-04-30 NaT NaT 2014-02-08 10:01:36.827\n",
"2 2008-04-30 NaT NaT 2014-02-08 10:01:36.827\n",
"3 2008-04-30 NaT NaT 2014-02-08 10:01:36.827"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod[datecols].head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_newcols\"></div>\n",
"<h3>Create New Columns</h3>\n",
"\n",
"Using <a href=\"https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties\">series datetimelike properties</a>, create three new columns:\n",
"\n",
"<ul>\n",
" <li><code>SellStartDate_Year</code>, a column containing the <code>year</code> of the <code>SellStartDate</code> column.</li>\n",
" <li><code>SellStartDate_Month</code>, a column containing the <code>month</code> of the <code>SellStartDate</code> column.</li>\n",
" <li><code>SellStartDate_Day</code>, a column containing the <code>month</code> of the <code>SellStartDate</code> column.</li>\n",
" <li>Print the data types of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>\n",
" <li>Print the first 3 rows of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"prod['SellStartDate_Year'] = prod['SellStartDate'].dt.year\n",
"prod['SellStartDate_Month'] = prod['SellStartDate'].dt.month\n",
"prod['SellStartDate_Day'] = prod['SellStartDate'].dt.day"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SellStartDate datetime64[ns]\n",
"SellStartDate_Year int64\n",
"SellStartDate_Month int64\n",
"SellStartDate_Day int64\n",
"dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod[['SellStartDate', 'SellStartDate_Year', 'SellStartDate_Month', 'SellStartDate_Day']].dtypes"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SellStartDate</th>\n",
" <th>SellStartDate_Year</th>\n",
" <th>SellStartDate_Month</th>\n",
" <th>SellStartDate_Day</th>\n",
" </tr>\n",
" <tr>\n",
" <th>ProductID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-04-30</td>\n",
" <td>2008</td>\n",
" <td>4</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-04-30</td>\n",
" <td>2008</td>\n",
" <td>4</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-04-30</td>\n",
" <td>2008</td>\n",
" <td>4</td>\n",
" <td>30</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SellStartDate SellStartDate_Year SellStartDate_Month \\\n",
"ProductID \n",
"1 2008-04-30 2008 4 \n",
"2 2008-04-30 2008 4 \n",
"3 2008-04-30 2008 4 \n",
"\n",
" SellStartDate_Day \n",
"ProductID \n",
"1 30 \n",
"2 30 \n",
"3 30 "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prod[['SellStartDate', 'SellStartDate_Year', 'SellStartDate_Month', 'SellStartDate_Day']].head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise_product\"></div>\n",
"<h3>Sales.SalesOrderDetail</h3>\n",
"\n",
"Here's the <i>Sales.SalesOrderDetail</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html), which is a description of the fields (columns) in the table (the .csv file we will import below):<br>\n",
"\n",
"<details>\n",
" <summary>Data Dictionary</summary>\n",
" <table>\n",
" <tr>\n",
" <th>Name</th>\n",
" <th>Description</th>\n",
" </tr>\n",
" <tr>\n",
" <td>TBD</td>\n",
" <td>TBD</td>\n",
" </tr>\n",
" </table>\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_read\"></div>\n",
"<h3>Read in the Dataset</h3>\n",
"\n",
"We are using the `read_csv()` method (and the `\\t` separator to specify tab-delimited columns)."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['SalesOrderID', 'SalesOrderDetailID', 'CarrierTrackingNumber',\n",
" 'OrderQty', 'ProductID', 'SpecialOfferID', 'UnitPrice',\n",
" 'UnitPriceDiscount', 'LineTotal', 'rowguid', 'ModifiedDate'],\n",
" dtype='object')"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sod.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_resetidx\"></div>\n",
"<h3>Reset Index</h3>\n",
"\n",
"Using <code>.set_index()</code>, set the <code>sod</code> dataframe to a <a href=\"#timestampidx\"><code>Timestamp</code> index</a>.\n",
"\n",
"<ul>\n",
" <li>Display the index of the <code>sod</code> dataframe as-is.</li>\n",
" <li>Convert <code>ModifiedDate</code> to a <code>Timestamp</code> object.</li>\n",
" <li>Use <code>.set_index()</code> to make this new column the dataframe index.</li>\n",
" <li>Display the index of the <code>sod</code> dataframe after conversion. Has the type changed?</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=121317, step=1)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sod.index"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2011-05-31', '2011-05-31', '2011-05-31', '2011-05-31',\n",
" '2011-05-31', '2011-05-31', '2011-05-31', '2011-05-31',\n",
" '2011-05-31', '2011-05-31',\n",
" ...\n",
" '2014-06-30', '2014-06-30', '2014-06-30', '2014-06-30',\n",
" '2014-06-30', '2014-06-30', '2014-06-30', '2014-06-30',\n",
" '2014-06-30', '2014-06-30'],\n",
" dtype='datetime64[ns]', name='ModifiedDate', length=121317, freq=None)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sod.set_index(pd.to_datetime(sod['ModifiedDate']), inplace=True)\n",
"sod.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_slice\"></div>\n",
"<h3>Slicing Dates</h3>\n",
"\n",
"Using <a href=\"#stringslicing\">string slicing</a> of the index:\n",
"<ul>\n",
" <li>Find how many <code>SalesOrderID</code>s were processed between March 15th, 2013 and March 20th, 2013</li>\n",
" <li>Return your result as an <code>int</code>.</li>\n",
"</ul>"
]
},
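{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch on made-up data (the `toy` frame and its values are hypothetical) shows two details of string slicing that matter for this task: both endpoints are included, and the number of rows in a window can differ from the number of distinct `SalesOrderID`s when an order has several detail lines.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy order lines (hypothetical values): order 1 has two detail rows\n",
"toy = pd.DataFrame(\n",
"    {'SalesOrderID': [9, 1, 1, 2, 3]},\n",
"    index=pd.to_datetime(['2013-03-14', '2013-03-15', '2013-03-15',\n",
"                          '2013-03-20', '2013-03-21']),\n",
")\n",
"\n",
"# String slicing on a sorted DatetimeIndex includes both endpoints\n",
"window = toy.loc['2013-03-15':'2013-03-20']\n",
"\n",
"n_rows = window.shape[0]                     # detail rows in the window\n",
"n_orders = window['SalesOrderID'].nunique()  # distinct orders in the window\n",
"```\n",
"\n",
"Here `n_rows` counts detail lines while `n_orders` counts distinct orders; which one answers the question depends on how *processed* is read."
]
},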
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"55"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sod['3/15/2013':'3/20/2013']['SalesOrderID'].shape[0]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}