{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <span>\n",
" <p align=\"left\">\n",
" <img align=\"left\" valign=\"center\" src=\"https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/5734/GA_Stack_Large_RedBlack_RGB.png\" width=\"80px\">\n",
" </p>\n",
" </span>\n",
" <span>\n",
" <h1>Pandas for Dates and Times</h1>\n",
" </span>\n",
"</div>\n",
"\n",
"\n",
"Pandas can also be used for times and dates! This is especially helpful for financial analysis, as most measures are with respect to time.\n",
"\n",
"<!-- Overview -->\n",
"<details>\n",
" <summary>Overview</summary>\n",
" <ul>\n",
" <li>In this lesson, we'll continue exploring Pandas with dates and times. Specifically:</li>\n",
" <ul>\n",
" <li>Converting dates and times into a <code>Timestamp</code> object using <code>to_datetime</code>.</li>\n",
" <li>Specifying input and output format arguments</li>\n",
" <li>Extracting components, such as year and day, from a <code>Timestamp</code> object</li>\n",
" <li>Creating <code>DatetimeIndex</code> objects, and their advantages</li>\n",
" </ul>\n",
" </ul>\n",
"</details>\n",
"\n",
"<!-- TOC -->\n",
"<details>\n",
" <summary>Table of Contents</summary>\n",
" <ul>\n",
" <li><a href=\"#import\">Import</a></li>\n",
" <li><a href=\"#objects\">Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#timestamp\">Creating Timestamp Objects</a></li>\n",
" <li><a href=\"#timestampidx\">Creating an Index of Timestamps</a></li>\n",
" <li><a href=\"#period\">Creating Period Objects</a></li>\n",
" <li><a href=\"#periodidx\">Creating an Index of Periods</a></li>\n",
" </ul>\n",
" <li><a href=\"#conversion\">Converting Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#todatetime\">Using .to_datetime()</a></li>\n",
" <ul>\n",
" <li><a href=\"#nulls\">Working with Nulls</a></li>\n",
" </ul>\n",
" <li><a href=\"#extracting\">Extracting Components from Datetime Objects</a></li>\n",
" <ul>\n",
" <li><a href=\"#year\">Year</a></li>\n",
" <li><a href=\"#day\">Day</a></li>\n",
" </ul>\n",
" </ul>\n",
" <li><a href=\"#exercise\">Exercise - AdventureWorks</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_exercise\">Production.Product Exercise</a></li>\n",
" <ul>\n",
" <li><a href=\"#p_read\">Read in the Dataset</a></li>\n",
" <li><a href=\"#p_resetidx\">Reset Index</a></li>\n",
" <li><a href=\"#p_types\">Identify Types</a></li>\n",
" <li><a href=\"#p_nullcheck\">Check Nulls</a></li>\n",
" <li><a href=\"#p_convert\">Convert Dates</a></li>\n",
" <li><a href=\"#p_newcols\">Create New Columns</a></li>\n",
" </ul>\n",
" <li><a href=\"#p_exercise\">Sales.SalesOrderHeader Exercise</a></li>\n",
" <ul>\n",
" <li><a href=\"#s_read\">Read in the Dataset</a></li>\n",
" <li><a href=\"#s_resetidx\">Reset Index</a></li>\n",
" <li><a href=\"#s_slice\">Slice Dates</a></li>\n",
" </ul>\n",
" </ul>\n",
" </ul>\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"import\"></div>\n",
"<h2>Import Pandas</h2>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"objects\"></div>\n",
"<h2>Datetime Objects</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table>\n",
" <tr>\n",
" <th>Use</th>\n",
" <th>Class</th>\n",
" <th>Remarks</th>\n",
" <th>How to create</th>\n",
" </tr>\n",
" <tr>\n",
" <td rowspan=\"2\">Time points</td>\n",
" <td>Timestamp</td>\n",
" <td>Represents a single timestamp</td>\n",
" <td>to_datetime, Timestamp</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DatetimeIndex</td>\n",
" <td>Index of Timestamp</td>\n",
" <td>to_datetime, date_range, bdate_range, DatetimeIndex</td>\n",
" </tr>\n",
" <tr>\n",
" <td rowspan=\"2\">Time spans</td>\n",
" <td>Period</td>\n",
" <td>Represents a single time span</td>\n",
" <td>Period</td>\n",
" </tr>\n",
" <tr>\n",
" <td>PeriodIndex</td>\n",
" <td>Index of Period</td>\n",
" <td>period_range, PeriodIndex</td>\n",
" </tr>\n",
"<table>\n",
" \n",
"Above is a list of the possible types of dates and times within pandas. Note that, under the hood, numpy `datetime64` and `timedelta64` objects are being used - the former for `Timestamp` and `DatetimeIndex` objects, and the latter for `Period` and `PeriodIndex` objects, respectively.\n",
"\n",
"<ul>\n",
" <li>Time points</li>\n",
" <ul>\n",
" <li>The first two objects in the above table, `Timestamp` and `DatetimeIndex` deal with <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-to-timestamps\"><b>discrete points in time</b></a>. This will be the focus for this and future labs.</li>\n",
" </ul>\n",
" <li>Time spans</li>\n",
" <ul>\n",
" <li>The latter two objects, `Period` and `PeriodIndex` deal with <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-span-representation\"><b>spans of time</b></a>. We will briefly touch on this, as most data ingestion tasks (versus creation) are handled with coersion into the former `Timestamp` objects. It is assumed, and usually the case, that pandas mutates previously existing data (usually in an RDBMS like SQL).</li>\n",
" </ul>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"timestamp\"></div>\n",
"<h3>Creating Timestamp Objects</h3>\n",
"\n",
"Let's start by making a `Timestamp` object.\n",
"\n",
"```python\n",
"Init signature: pd.Timestamp(ts_input=<object object at 0x7fc5d75bfe60>, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None)\n",
"Docstring: \n",
"Pandas replacement for datetime.datetime\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a `Timestamp` object using the kwargs explicitly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also implicitly cast a string as follows. Note the `T` separator between YYYYMMDD and HH:MM:SS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
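{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the two approaches above (the date and the variable names `ts_kwargs` and `ts_string` are made up for illustration), both calls should produce the same `Timestamp`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Build a Timestamp from explicit keyword arguments (illustrative date)\n",
"ts_kwargs = pd.Timestamp(year=2018, month=1, day=1, hour=12, minute=30, second=15)\n",
"\n",
"# Build the same Timestamp by parsing an ISO 8601 string; note the 'T' separator\n",
"ts_string = pd.Timestamp('2018-01-01T12:30:15')\n",
"\n",
"print(ts_kwargs)\n",
"print(ts_kwargs == ts_string)\n",
"```\n",
"\n",
"Passing keywords is explicit but verbose; the string form is usually more convenient when the input is already text."
]
},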
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that, under the hood, a `Timestamp` object is a `datetime64[ns]` numpy object which has nanosecond resolution and is stored as a 64 bit integer. As such, it's capable of covering about 584 years. That's a lot of nanoseconds! 2^64, to be exact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
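{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check the range of a nanosecond-resolution `Timestamp` is to look at its built-in bounds. This is only a rough sketch; the names `span_ns` and `span_years` are illustrative, and the year length of 365.25 days is an approximation:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Earliest and latest instants a Timestamp can represent\n",
"print(pd.Timestamp.min)\n",
"print(pd.Timestamp.max)\n",
"\n",
"# .value is the integer number of nanoseconds since the Unix epoch.\n",
"# Subtracting plain Python ints avoids overflowing a 64-bit Timedelta.\n",
"span_ns = pd.Timestamp.max.value - pd.Timestamp.min.value\n",
"span_years = span_ns / 1e9 / 86400 / 365.25\n",
"print(round(span_years, 1))\n",
"```\n",
"\n",
"Dates outside these bounds cannot be stored at nanosecond resolution, which matters for historical or far-future data."
]
},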
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"timestampidx\"></div>\n",
"<h3>Creating an Index of Timestamps</h3>\n",
"\n",
"We can assign these `Timestamp` objects to the index of our `DataFrame` to create a table that is indexed chronologically. A timeseries database, if you will. Pandas has a helper function for this, `pd.date_range`. This takes three arguments:\n",
"\n",
"<ol>\n",
" <li><code>start</code>: The beginning of the index</li>\n",
" <li><code>end</code>: The end of the index</li>\n",
" <li><code>freq</code>: The interval for each <code>Timestamp</code></li>\n",
"</ol>\n",
"\n",
"Note that the <code>freq</code> references what is referred to as an <a href=\"https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases\">offset alias</a>, which is pre-loaded set of common frequencies, or <code>Timestamp</code> spans.\n",
"\n",
"```python\n",
"Signature: pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)\n",
"Docstring:\n",
"Return a fixed frequency DatetimeIndex.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create an index starting a Jan 1, 2018, and ending at Jan 1, 2019. It does so with `BM`, or <i>business month end frequency</i>. This is the last work day of each month."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then create a series out of this index by specifying it in the <code>index=</code> kwarg of <code>pd.Series</code>:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
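{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps above could be sketched roughly as follows. The placeholder values and the names `idx` and `s` are assumptions for illustration, and the offset object `pd.offsets.BMonthEnd()` is swapped in for the `BM` string alias so that the sketch runs across pandas versions:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Business month ends from Jan 1, 2018 through Jan 1, 2019\n",
"idx = pd.date_range(start='2018-01-01', end='2019-01-01', freq=pd.offsets.BMonthEnd())\n",
"\n",
"# Placeholder values (0, 1, 2, ...) indexed by those dates\n",
"s = pd.Series(np.arange(len(idx)), index=idx)\n",
"\n",
"print(s.head())\n",
"```\n",
"\n",
"Because the range ends on Jan 1, 2019, the last entry is the final business day of December 2018."
]
},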
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that our index type is a <code>DateTimeIndex</code>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This allows us the distinct advantage of slicing and indexing our index, just as we would with an automatically generated, integer index. Let's take the first 3 rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But there's more! We can even select specific dates in the index, as a string, which returns to us the corresponding value in that 'cell':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"stringslicing\"></div>\n",
"Finally, we can select <b>ranges</b> of dates, <i>as strings</i>:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
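{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three kinds of selection concrete, here is a small sketch on a made-up series (the values and the names `first_three`, `one_day`, and `spring` are illustrative; `.iloc` and `.loc` are used to make the positional and label-based lookups explicit):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# A series indexed by the business month ends of 2018\n",
"idx = pd.date_range(start='2018-01-01', end='2019-01-01', freq=pd.offsets.BMonthEnd())\n",
"s = pd.Series(np.arange(len(idx)), index=idx)\n",
"\n",
"# Positional slice: the first 3 rows\n",
"first_three = s.iloc[:3]\n",
"\n",
"# A single date, given as a string, returns the value stored at that date\n",
"one_day = s.loc['2018-01-31']\n",
"\n",
"# A range of dates, given as strings; both endpoints are included\n",
"spring = s.loc['2018-03':'2018-06']\n",
"\n",
"print(first_three)\n",
"print(one_day)\n",
"print(spring)\n",
"```\n",
"\n",
"Note that a partial string such as `'2018-03'` matches every timestamp that falls within that month."
]
},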
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"conversion\"></div>\n",
"<h2>Converting Datetime Objects</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"todatetime\"></div>\n",
"<h3>Using .to_datetime()</h3>\n",
"\n",
"Previously, we manually created `Timestamp` and `Period` objects. This assumes we know the year, day, etc of our input data - nice and clean, ready to convert. \n",
"\n",
"Of course, data is never clean, and we'd need to parse the input string to feed the individual keyword arguments to a function like `Timestamp` for it to know how to convert it. What a pain! Surely, there must be a better way to parse these pesky strings!\n",
"\n",
"<b>Enter `to_datetime()`.</b>\n",
"\n",
"This function is extremely powerful and automatically detects and parses input dates (as strings) and returns the result as a `Timestamp` object. Nice!\n",
"\n",
"```python\n",
"Signature: pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)\n",
"Docstring:\n",
"Convert argument to datetime.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if it can convert this wild string correctly, 1:55pm and 24 seconds, January 22nd, 1985."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amazing! Let's try it out on a `DataFrame` object to see how it fares. First, let's make a `DataFrame` containing a few dates we make up. Note that the 3rd entry is `np.nan`, which represents a null value in our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's try to convert this column to a `Timestamp` object, using `to_datetime()`. Do we think it will work? Note: we are storing the result `Series` object of the conversion in a variable, `s`, for later use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
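{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this conversion, assuming a made-up `DataFrame` with a single hypothetical column named `date` whose 3rd entry is null:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up dates; the 3rd entry is a null\n",
"df = pd.DataFrame({'date': ['2018-01-01', '2018-02-15', np.nan, '2018-03-31']})\n",
"\n",
"# Convert the string column; the result is stored in s for later use\n",
"s = pd.to_datetime(df['date'])\n",
"\n",
"print(s)\n",
"print(s.dtype)\n",
"```\n",
"\n",
"The original column in `df` is left untouched; assign the result back to the column to overwrite it."
]
},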
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>It did!</i><br><br><u>Note:</u>\n",
"<ul>\n",
" <li>We're storing the date as the numpy object, <code>datetime64[ns]</code>, which is the storage object for a <code>Timestamp</code>.</li>\n",
" <li>We have assumed a time of midnight, <code>00:00:00</code> for days where no time information is passed to <code>to_datetime()</code>.</li>\n",
" <li>We've imputed our null as <code>NaT</code>, which is short for 'Not a Time'</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"nulls\"></div>\n",
"<h4>Working with Nulls</h4>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This brings us to our final point - what happens if our superhero `to_datetime()` function <i>can't parse the input string</i> and arrive at a usable date?\n",
"\n",
"The default behavior is `raise`, which raises a `ValueError` and exits the function (stops parsing immediately). What if we just want to stick a null in there and move on?\n",
"\n",
"That's what `coerce` is useful for. If `to_datetime()` can't parse the string, it'll just stick a `NaT` in there instead. In many cases, this is preferable. Make sure to keep an eye on the number of nulls you generate when using this as it won't warn you."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the default for errors kwarg is 'raise'\n"
]
},
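{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between the two settings can be sketched as follows, using a made-up series (`raw` and `coerced` are illustrative names) that contains one unparseable string:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"raw = pd.Series(['2018-01-01', 'not a date', '2018-03-31'])\n",
"\n",
"# errors='raise' is the default: parsing stops with a ValueError\n",
"try:\n",
"    pd.to_datetime(raw)\n",
"except ValueError as exc:\n",
"    print(f'raise: {exc}')\n",
"\n",
"# errors='coerce' replaces anything unparseable with NaT and carries on\n",
"coerced = pd.to_datetime(raw, errors='coerce')\n",
"print(coerced)\n",
"\n",
"# Count the nulls that coercion introduced, since no warning is given\n",
"print(coerced.isna().sum())\n",
"```\n",
"\n",
"Comparing the null count before and after conversion is a cheap way to spot values that were silently coerced."
]
},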
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"extracting\"></div>\n",
"<h3>Extracting Components from Datetime Objects</h3>\n",
"\n",
"So, we've gotten our messy string values all tidied up using `to_datetime()`. What happens when I want to retrieve the year or day of the data I've stored?\n",
"\n",
"Enter <a href=\"https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties\">series datetimelike properties</a>.\n",
"\n",
"These are a collection of `Series` object properties that are accessible when the datatype is `Timestamp` or `Period`. Basically, <i>it allows us to extract date information from our date column, when it's stored as a date</i>.\n",
"\n",
"We access these properties using the following dot notation:\n",
"\n",
"```python\n",
"pd.Series.dt.<part of the date you want>\n",
"```\n",
"\n",
"If our `Series` object were named `s`, it'd be written as:\n",
"\n",
"```python\n",
"s.dt.<part of the date you want>\n",
"```"
]
},
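{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the dot notation, here is a sketch on a small made-up series (the dates and the names `dates`, `years`, and `months` are assumptions, separate from the `s` used elsewhere in this lesson):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# A made-up Series of datetimes\n",
"dates = pd.Series(pd.to_datetime(['1985-01-22', '2018-06-30']))\n",
"\n",
"# Each .dt property returns a new Series holding that component\n",
"years = dates.dt.year\n",
"months = dates.dt.month\n",
"\n",
"print(years)\n",
"print(months)\n",
"```\n",
"\n",
"The `.dt` accessor is only available when the `Series` holds datetime-like values; calling it on a column of plain strings raises an `AttributeError`."
]
},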
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"year\"></div>\n",
"Let's extract just the `year` of the above `Series` object. What do we expect to see?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"day\"></div>\n",
"Now you try - retrieve the `day` of the time series. What is the result?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the data type of the resultant conversion? Why does this matter?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Best Practices</b>:\n",
"<ul>\n",
" <li>Whenever possible, store dates as <code>Timestamp</code> or <code>Period</code> objects</li>\n",
" <li>These datatypes use methods and properties that are memory (space) and compute (time) optimized</li>\n",
" <li><i>Only during reporting or extraction</i> should the above properties be used</li>\n",
" <li>Any children of the parent datetime object are not stored to help reduce redundancy and database size</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise\"></div>\n",
"<h2>Exercise - AdventureWorks</h2>\n",
"<p align=\"right\">\n",
"<img src=\"http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800\">\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_exercise\"></div>\n",
"<h3>Production.Product</h3>\n",
"\n",
"Here's the <i>Production.Product</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which is a description of the fields (columns) in the table (the .csv file we will import below):<br>\n",
"\n",
"<details>\n",
" <summary>Data Dictionary</summary>\n",
" <table>\n",
" <tr>\n",
" <th>Name</th>\n",
" <th>Description</th>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductID</td>\n",
" <td>Primary key for Product records</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Name</td>\n",
" <td>Name of the product</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductNumber</td>\n",
" <td>Unique product identification number</td>\n",
" </tr>\n",
" <tr>\n",
" <td>MakeFlag</td>\n",
" <td>0 = Product is purchased, 1 = Product is manufactured in-house.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>FinishedGoodsFlag</td>\n",
" <td>0 = Product is not a salable item. 1 = Product is salable.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Color</td>\n",
" <td>Product color</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SafetyStockLevel</td>\n",
" <td>Minimum inventory quantity</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ReorderPoint</td>\n",
" <td>Inventory level that triggers a purchase order or work order</td>\n",
" </tr>\n",
" <tr>\n",
" <td>StandardCost</td>\n",
" <td>Standard cost of the product [USD]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ListPrice</td>\n",
" <td>Selling price [USD]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Size</td>\n",
" <td>Product size [units vary, see SizeUnitMeasureCode]</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SizeUnitMeasureCode</td>\n",
" <td>Unit of measure for the Size column</td>\n",
" </tr>\n",
" <tr>\n",
" <td>WeightUnitMeasureCode</td>\n",
" <td>Unit of measure for the Weight column</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DaysToManufacture</td>\n",
" <td>Number of days required to manufacture the product</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductLine</td>\n",
" <td>R = Road, M = Mountain, T = Touring, S = Standard</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Class</td>\n",
" <td>H = High, M = Medium, L = Low</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Style</td>\n",
" <td>W = Womens, M = Mens, U = Universal</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductSubcategoryID</td>\n",
" <td>Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ProductModelID</td>\n",
" <td>Product is a member of this product model. Foreign key to ProductModel.ProductModelID</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SellStartDate</td>\n",
" <td>Date the product was available for sale</td>\n",
" </tr>\n",
" <tr>\n",
" <td>SellEndDate</td>\n",
" <td>Date the product was no longer available for sale</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DiscontinuedDate</td>\n",
" <td>Date the product was discontinued</td>\n",
" </tr>\n",
" <tr>\n",
" <td>rowguid</td>\n",
" <td>ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ModifiedDate</td>\n",
" <td>Date and time the record was last updated</td>\n",
" </tr>\n",
" </table>\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_read\"></div>\n",
"<h4>Read in the Dataset</h4>\n",
"\n",
"We are using the `read_csv()` method (and the `\\t` separator to specify tab-delimited columns)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# let's check out the first 3 rows again, for old time's sake\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# and the number of rows x cols\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_resetidx\"></div>\n",
"<h4>Reset Index</h4>\n",
"\n",
"Let's bring our `ProductID` column into the index since it's the PK (primary key) of our table and that's where PKs belong as a best practice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_types\"></div>\n",
"<h3>Identify Types</h3>\n",
"\n",
"<ul>\n",
" <li>Print out the column data types</li>\n",
" <li>Which columns in the `prod` dataframe are candidates for datetime conversion? Store the column names in a list named <code>datecols</code></li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_nullcheck\"></div>\n",
"<h4>Check Nulls</h4>\n",
"\n",
"<ul>\n",
" <li>Report the number of nulls for each column contained in <code>datecols</code></li>\n",
" <li>Display this result as a <code>pd.Series</code> object</li>\n",
" <li>Make note of anything that might warrant further investigation</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_convert\"></div>\n",
"<h3>Convert the <code>SellStartDate</code> column to a <code>Timestamp</code> object</h3>\n",
"\n",
"Convert the <code>SellStartDate</code> column to a <code>Timestamp</code> object using <a href=\"#todatetime\"><code>to_datetime()</code></a>.\n",
"\n",
"<ul>\n",
" <li>Write a <code>for</code> loop to iterate over the columns in <code>datecols</code></i>\n",
" <li>Using <a href=\"#todatetime\"><code>to_datetime()</code></a>, convert each column to a <code>Timestamp</code> object</li>\n",
" <li>Take the result of this <code>Timestamp</code> object and overwrite each respective source column</li>\n",
" <li>Print the data types of the <code>datecols</code> columns to verify the conversion</li>\n",
" <li>Print the first 3 rows of the <code>datecols</code> columns to verify the conversion</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"p_newcols\"></div>\n",
"<h3>Create New Columns</h3>\n",
"\n",
"Using <a href=\"https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties\">series datetimelike properties</a>, create three new columns:\n",
"\n",
"<ul>\n",
" <li><code>SellStartDate_Year</code>, a column containing the <code>year</code> of the <code>SellStartDate</code> column.</li>\n",
" <li><code>SellStartDate_Month</code>, a column containing the <code>month</code> of the <code>SellStartDate</code> column.</li>\n",
" <li><code>SellStartDate_Day</code>, a column containing the <code>month</code> of the <code>SellStartDate</code> column.</li>\n",
" <li>Print the data types of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>\n",
" <li>Print the first 3 rows of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"exercise_product\"></div>\n",
"<h3>Sales.SalesOrderDetail</h3>\n",
"\n",
"Here's the <i>Sales.SalesOrderDetail</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html), which is a description of the fields (columns) in the table (the .csv file we will import below).<br>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_read\"></div>\n",
"<h3>Read in the Dataset</h3>\n",
"\n",
"We are using the `read_csv()` method (and the `\\t` separator to specify tab-delimited columns)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_resetidx\"></div>\n",
"<h3>Reset Index</h3>\n",
"\n",
"Using <code>.set_index()</code>, set the <code>sod</code> dataframe to a <a href=\"#timestampidx\"><code>Timestamp</code> index</a>.\n",
"\n",
"<ul>\n",
" <li>Display the index of the <code>sod</code> dataframe as-is.</li>\n",
" <li>Convert <code>ModifiedDate</code> to a <code>Timestamp</code> object.</li>\n",
" <li>Use <code>.set_index()</code> to make this new column the dataframe index.</li>\n",
" <li>Display the index of the <code>sod</code> dataframe after conversion. Has the type changed?</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"s_slice\"></div>\n",
"<h3>Slicing Dates</h3>\n",
"\n",
"Using <a href=\"#stringslicing\">string slicing</a> of the index:\n",
"<ul>\n",
" <li>Find how many <code>SalesOrderID</code>s were processed between March 15th, 2013 and March 20th, 2013</li>\n",
" <li>Return your result as an <code>int</code>.</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}