<div>
    <span>
    <p align="left">
    <img align="left" valign="center" src="https://ga-core.s3.amazonaws.com/production/uploads/program/default_image/5734/GA_Stack_Large_RedBlack_RGB.png" width="80px">
    </p>
    </span>
    <span>
        <h1>Pandas for Dates and Times</h1>
    </span>
</div>


Pandas can also be used for times and dates! This is especially helpful for financial analysis, as most measures are with respect to time.

<!-- Overview -->
<details>
    <summary>Overview</summary>
    <ul>
        <li>In this lesson, we'll continue exploring Pandas with dates and times. Specifically:</li>
        <ul>
            <li>Converting dates and times into a <code>Timestamp</code> object using <code>to_datetime</code>.</li>
            <li>Specifying input and output format arguments</li>
            <li>Extracting components, such as year and day, from a <code>Timestamp</code> object</li>
            <li>Creating <code>DatetimeIndex</code> objects, and their advantages</li>
        </ul>
    </ul>
</details>

<!-- TOC -->
<details>
    <summary>Table of Contents</summary>
    <ul>
        <li><a href="#import">Import</a></li>
        <li><a href="#objects">Datetime Objects</a></li>
        <ul>
            <li><a href="#timestamp">Creating Timestamp Objects</a></li>
            <li><a href="#timestampidx">Creating an Index of Timestamps</a></li>
            <li><a href="#period">Creating Period Objects</a></li>
            <li><a href="#periodidx">Creating an Index of Periods</a></li>
        </ul>
        <li><a href="#conversion">Converting Datetime Objects</a></li>
        <ul>
            <li><a href="#todatetime">Using .to_datetime()</a></li>
            <ul>
                <li><a href="#nulls">Working with Nulls</a></li>
            </ul>
            <li><a href="#extracting">Extracting Components from Datetime Objects</a></li>
            <ul>
                <li><a href="#year">Year</a></li>
                <li><a href="#day">Day</a></li>
            </ul>
        </ul>
        <li><a href="#exercise">Exercise - AdventureWorks</a></li>
        <ul>
            <li><a href="#p_exercise">Production.Product Exercise</a></li>
            <ul>
                <li><a href="#p_read">Read in the Dataset</a></li>
                <li><a href="#p_resetidx">Reset Index</a></li>
                <li><a href="#p_types">Identify Types</a></li>
                <li><a href="#p_nullcheck">Check Nulls</a></li>
                <li><a href="#p_convert">Convert Dates</a></li>
                <li><a href="#p_newcols">Create New Columns</a></li>
            </ul>
            <li><a href="#exercise_product">Sales.SalesOrderDetail Exercise</a></li>
            <ul>
                <li><a href="#s_read">Read in the Dataset</a></li>
                <li><a href="#s_resetidx">Reset Index</a></li>
                <li><a href="#s_slice">Slice Dates</a></li>
            </ul>
        </ul>
    </ul>
</details>


<div id="import"></div>
<h2>Import Pandas</h2>

In [1]:
import pandas as pd
import numpy as np
print(f'Pandas v{pd.__version__}\nNumpy v{np.__version__}')

Pandas v1.0.1
Numpy v1.18.1


<div id="objects"></div>
<h2>Datetime Objects</h2>

<table>
  <tr>
    <th>Use</th>
    <th>Class</th>
    <th>Remarks</th>
    <th>How to create</th>
  </tr>
  <tr>
    <td rowspan="2">Time points</td>
    <td>Timestamp</td>
    <td>Represents a single timestamp</td>
    <td>to_datetime, Timestamp</td>
  </tr>
  <tr>
    <td>DatetimeIndex</td>
    <td>Index of Timestamp</td>
    <td>to_datetime, date_range, bdate_range, DatetimeIndex</td>
  </tr>
  <tr>
    <td rowspan="2">Time spans</td>
    <td>Period</td>
    <td>Represents a single time span</td>
    <td>Period</td>
  </tr>
  <tr>
    <td>PeriodIndex</td>
    <td>Index of Period</td>
    <td>period_range, PeriodIndex</td>
  </tr>
</table>
    
Above is a list of the date and time types available in pandas. Note that, under the hood, `Timestamp` and `DatetimeIndex` objects are backed by numpy `datetime64` values, while `Period` and `PeriodIndex` objects are stored as integer ordinals paired with a frequency (the `period[freq]` dtype).

<ul>
    <li>Time points</li>
    <ul>
        <li>The first two objects in the above table, <code>Timestamp</code> and <code>DatetimeIndex</code>, deal with <a href="https://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-to-timestamps"><b>discrete points in time</b></a>. This will be the focus for this and future labs.</li>
    </ul>
    <li>Time spans</li>
    <ul>
        <li>The latter two objects, <code>Period</code> and <code>PeriodIndex</code>, deal with <a href="https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-span-representation"><b>spans of time</b></a>. We will only briefly touch on these, as most data ingestion tasks (versus creation) are handled by coercion into <code>Timestamp</code> objects. It is assumed, and usually the case, that pandas is working with previously existing data (usually from an RDBMS queried with SQL).</li>
    </ul>
</ul>

<div id="timestamp"></div>
<h3>Creating Timestamp Objects</h3>

Let's start by making a `Timestamp` object.

```python
Init signature: pd.Timestamp(ts_input=<object object at 0x7fc5d75bfe60>, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None)
Docstring:     
Pandas replacement for datetime.datetime
```

We can create a `Timestamp` object using the kwargs explicitly:

In [2]:
pd.Timestamp(year=2018, month=10, day=19, hour=12, minute=35, second=59)

Timestamp('2018-10-19 12:35:59')

We can also implicitly cast a string as follows. Note the `T` separator between YYYYMMDD and HH:MM:SS.

In [3]:
pd.Timestamp('20181029T12:35:59')

Timestamp('2018-10-29 12:35:59')

Note that, under the hood, a `Timestamp` is stored as a numpy `datetime64[ns]` value: a signed 64-bit integer counting nanoseconds from the Unix epoch (1970-01-01). As such, it can cover about 584 years, roughly 292 years on either side of 1970. That's a lot of nanoseconds! About 2^64 of them.

In [4]:
print(f'{pd.Timestamp.min}\n{pd.Timestamp.max}')

1677-09-21 00:12:43.145225
2262-04-11 23:47:16.854775807
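As a rough check on the figures above, the following sketch reads the integer nanosecond count behind a `Timestamp` through its `value` attribute and converts 2^64 nanoseconds into years. The variable names are illustrative only, and a year is assumed to be 365.25 days.

```python
import pandas as pd

ts = pd.Timestamp('2018-10-29T12:35:59')

# Integer nanoseconds since the Unix epoch (1970-01-01)
ns_since_epoch = ts.value

# Rebuilding a Timestamp from that integer gives back the same point in time
round_trip = pd.Timestamp(ns_since_epoch)

# 2**64 nanoseconds expressed in years (assuming 365.25 days per year)
seconds_per_year = 365.25 * 24 * 60 * 60
span_years = 2**64 / 1e9 / seconds_per_year

print(ns_since_epoch)
print(round_trip == ts)
print(span_years)
```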


<div id="timestampidx"></div>
<h3>Creating an Index of Timestamps</h3>

We can assign these `Timestamp` objects to the index of our `DataFrame` to create a table that is indexed chronologically. A timeseries database, if you will. Pandas has a helper function for this, `pd.date_range`. We'll use three of its arguments:

<ol>
    <li><code>start</code>: The beginning of the index</li>
    <li><code>end</code>: The end of the index</li>
    <li><code>freq</code>: The interval for each <code>Timestamp</code></li>
</ol>

Note that <code>freq</code> takes what is referred to as an <a href="https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases">offset alias</a>, which is one of a pre-loaded set of common frequencies, i.e. the spacing between consecutive <code>Timestamp</code>s.

```python
Signature: pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Docstring:
Return a fixed frequency DatetimeIndex.
```

This will create an index starting at Jan 1, 2018, and ending at Jan 1, 2019. It does so with `BM`, or <i>business month end frequency</i>. This is the last work day of each month. (In pandas 2.2 and later, this alias is spelled `BME`.)

In [5]:
start = pd.Timestamp(year=2018, month=1, day=1)
end = pd.Timestamp(year=2019, month=1, day=1)
pd.date_range(start, end, freq='BM')

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-30', '2018-04-30',
               '2018-05-31', '2018-06-29', '2018-07-31', '2018-08-31',
               '2018-09-28', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='BM')
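The signature above also lists a `periods` argument, which can stand in for `end` when we know how many timestamps we want. The sketch below is illustrative only: it uses the `D` (calendar day) and `MS` (month start) aliases rather than `BM`, and the variable names are made up for this example.

```python
import pandas as pd

# Four consecutive calendar days, defined by a start and a count
days = pd.date_range(start='2018-01-01', periods=4, freq='D')

# The first calendar day of each month in 2018 ('MS' = month start)
month_starts = pd.date_range(start='2018-01-01', end='2018-12-31', freq='MS')

print(days)
print(month_starts)
```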

We can then create a series out of this index by specifying it in the <code>index=</code> kwarg of <code>pd.Series</code>:

In [6]:
date_idx = pd.date_range(start, end, freq='BM')
date_series = pd.Series(np.random.randn(len(date_idx)), index=date_idx)

In [7]:
date_series

2018-01-31   -0.044885
2018-02-28    0.151402
2018-03-30   -1.498441
2018-04-30    0.245860
2018-05-31   -0.222092
2018-06-29   -1.082039
2018-07-31    1.595104
2018-08-31   -0.981576
2018-09-28   -0.154903
2018-10-31   -0.139406
2018-11-30   -0.086125
2018-12-31    0.372902
Freq: BM, dtype: float64

Note that our index type is a <code>DatetimeIndex</code>:

In [8]:
type(date_series.index)

pandas.core.indexes.datetimes.DatetimeIndex

This allows us the distinct advantage of slicing and indexing our index, just as we would with an automatically generated, integer index. Let's take the first 3 rows:

In [9]:
date_series[:3]

2018-01-31   -0.044885
2018-02-28    0.151402
2018-03-30   -1.498441
Freq: BM, dtype: float64

But there's more! We can even select specific dates in the index, as a string, which returns to us the corresponding value in that 'cell':

In [10]:
date_series['1/31/2018']

-0.04488546780421337

<div id="stringslicing"></div>
Finally, we can select <b>ranges</b> of dates, <i>as strings</i>:

In [11]:
date_series['1/31/2018':'8/31/2018']

2018-01-31   -0.044885
2018-02-28    0.151402
2018-03-30   -1.498441
2018-04-30    0.245860
2018-05-31   -0.222092
2018-06-29   -1.082039
2018-07-31    1.595104
2018-08-31   -0.981576
Freq: BM, dtype: float64
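String selection also works with partial dates. As a small sketch on a made-up daily series (not the `date_series` above), passing just a year and month to `.loc` selects every row in that month, and a string range is inclusive of both endpoints:

```python
import pandas as pd

# Hypothetical daily series covering the first quarter of 2018
idx = pd.date_range(start='2018-01-01', end='2018-03-31', freq='D')
daily = pd.Series(range(len(idx)), index=idx)

# Partial string: every row that falls in March 2018
march = daily.loc['2018-03']

# String range: both endpoints are included in the result
window = daily.loc['2018-01-30':'2018-02-02']

print(len(march))
print(window)
```

Using `.loc` makes it explicit that the selection is by label rather than by position.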

<div id="period"></div>
<h3>Creating Period Objects</h3>

Let's <i>briefly</i> touch on creating `Period` objects. Remember that this represents a <b>span</b> of time, whereas `Timestamp` objects represent a <b>distinct point</b> in time.

Periods are useful to determine <i>if a `Timestamp` is within the bounds of a period</i>.

Note that <b>both</b> object types can be used as an index of a `DataFrame`. Additionally, one can check whether a `Timestamp` is <i>between</i> (or greater / less than) other `Timestamp`s, so a `Period` is generally used only <i>when you care <b>which period</b> an event falls in, but not the <b>exact time</b> it occurred</i>.

```python
Init signature: pd.Period(value=None, freq=None, ordinal=None, year=None, month=None, quarter=None, day=None, hour=None, minute=None, second=None)
Docstring:     
Represents a period of time
```

Let's make a `Period` object. Here, we're using an <a href="https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases">offset alias</a> of `B`, which represents business day frequency (M-F inclusive). We're stating the frequency explicitly for clarity.

In [12]:
pd.Period(year=2018, month=10, day=19, freq='B')

Period('2018-10-19', 'B')
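To make the idea of a span concrete, a `Period` exposes its bounds as `start_time` and `end_time`, both of which are `Timestamp` objects. The following sketch uses a monthly period chosen for illustration and checks whether a timestamp falls inside it:

```python
import pandas as pd

# A monthly period covering all of October 2018
october = pd.Period('2018-10', freq='M')

# The bounds of the span are ordinary Timestamp objects
print(october.start_time)
print(october.end_time)

# A point in time is inside the period if it lies between the two bounds
ts = pd.Timestamp(year=2018, month=10, day=19, hour=12, minute=35, second=59)
inside = october.start_time <= ts <= october.end_time
print(inside)
```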

<div id="periodidx"></div>
<h3>Creating an Index of Periods</h3>

Here, we once again have a helper function, `period_range`, which allows us to create sequential `Period` objects within an index.

```python
Signature: pd.period_range(start=None, end=None, periods=None, freq='D', name=None)
Docstring:
Return a fixed frequency PeriodIndex, with day (calendar) as the default
frequency
```

Let's make a period index with a monthly frequency, `M`, running from January 2018 through January 2019. Note that we do not have `BM` (business month) as an available frequency for a `Period` object, unlike our previous `Timestamp` example.

In [13]:
start = pd.Timestamp(year=2018, month=1, day=1)
end = pd.Timestamp(year=2019, month=1, day=1)
pd.period_range(start, end, freq='M')

PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
             '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',
             '2019-01'],
            dtype='period[M]', freq='M')

In [14]:
date_idx = pd.period_range(start, end, freq='M')
date_series = pd.Series(np.random.randn(len(date_idx)), index=date_idx)

In [15]:
date_series

2018-01   -0.202734
2018-02   -1.288933
2018-03   -0.929931
2018-04   -0.896952
2018-05    1.432703
2018-06    1.775852
2018-07    0.599205
2018-08   -2.111145
2018-09   -0.652618
2018-10    0.008192
2018-11    0.055414
2018-12   -0.255667
2019-01   -1.791715
Freq: M, dtype: float64

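A natural use of a `Period` index is finding out in what 'time slot' a given `Timestamp` lies. Here, we try this for Feb 22, 2018 by comparing the index with the `Timestamp` directly. Note that every comparison comes back `False`: a `Period` does not compare equal to a `Timestamp`, even when that timestamp falls inside the period.

In [16]:
date_series.index == pd.Timestamp(year=2018, month=2, day=22)

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False])

Used as a boolean mask on the source table, this comparison therefore selects no rows. To return the row representing the period that contains our target date, Feb 22 2018, the `Timestamp` first has to be converted to a `Period` of the same frequency (for example with its `.to_period('M')` method) before comparing.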

In [17]:
date_series[ date_series.index == pd.Timestamp(year=2018, month=2, day=22) ]

Series([], Freq: M, dtype: float64)
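One way to get the matching row is sketched below, assuming a monthly index like the one above but filled with made-up integer values so the result is easy to follow. The timestamp is converted with `to_period('M')` so that both sides of the comparison are `Period` objects:

```python
import numpy as np
import pandas as pd

# Monthly PeriodIndex from Jan 2018 through Jan 2019, with placeholder values
idx = pd.period_range(start='2018-01', end='2019-01', freq='M')
monthly = pd.Series(np.arange(len(idx)), index=idx)

# Convert the point in time into the monthly period that contains it
target = pd.Timestamp(year=2018, month=2, day=22).to_period('M')

# Comparing Period with Period marks exactly one slot as True
mask = monthly.index == target
row = monthly[mask]

print(target)
print(row)
```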

<div id="conversion"></div>
<h2>Converting Datetime Objects</h2>

<div id="todatetime"></div>
<h3>Using .to_datetime()</h3>

Previously, we manually created `Timestamp` and `Period` objects. This assumes we know the year, day, etc of our input data - nice and clean, ready to convert. 

Of course, data is never clean, and we'd need to parse the input string to feed the individual keyword arguments to a function like `Timestamp` for it to know how to convert it. What a pain! Surely, there must be a better way to parse these pesky strings!

<b>Enter `to_datetime()`.</b>

This function is extremely powerful and automatically detects and parses input dates (as strings) and returns the result as a `Timestamp` object. Nice!

```python
Signature: pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)
Docstring:
Convert argument to datetime.
```

Let's see if it can convert this wild string correctly, 1:55pm and 24 seconds, January 22nd, 1985.

In [18]:
pd.to_datetime("1:55pm and 24 seconds, January 22nd, 1985")

Timestamp('1985-01-22 13:55:24')

Amazing! Let's try it out on a `DataFrame` object to see how it fares. First, let's make a `DataFrame` containing a few dates we make up. Note that the 3rd entry is `np.nan`, which represents a null value in our dataset.

In [19]:
df = pd.DataFrame({'myDateColumn': ['10/1/2018 8:29:59PM', '1/1/2019 12:01AM', np.nan, 'September 20th, 1992']})
df.head()

           myDateColumn
0   10/1/2018 8:29:59PM
1      1/1/2019 12:01AM
2                   NaN
3  September 20th, 1992


Now, let's try to convert this column to a `Timestamp` object, using `to_datetime()`. Do we think it will work? Note: we are storing the result `Series` object of the conversion in a variable, `s`, for later use.

In [20]:
s = pd.to_datetime(df['myDateColumn'])
s.head()

0   2018-10-01 20:29:59
1   2019-01-01 00:01:00
2                   NaT
3   1992-09-20 00:00:00
Name: myDateColumn, dtype: datetime64[ns]

<i>It did!</i><br><br><u>Note:</u>
<ul>
    <li>We're storing the date as the numpy object, <code>datetime64[ns]</code>, which is the storage object for a <code>Timestamp</code>.</li>
    <li>We have assumed a time of midnight, <code>00:00:00</code> for days where no time information is passed to <code>to_datetime()</code>.</li>
    <li>We've imputed our null as <code>NaT</code>, which is short for 'Not a Time'</li>
</ul>
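Automatic detection is convenient, but some strings are ambiguous: `01/02/2019` could be January 2nd or February 1st. When the layout is known, the `format` argument from the signature above states it explicitly, and `dt.strftime` controls how dates are written back out as strings. A minimal sketch, using made-up day-first dates:

```python
import pandas as pd

# Hypothetical day-first date strings (DD/MM/YYYY)
raw = pd.Series(['19/10/2018', '01/02/2019'])

# Input format: tell to_datetime exactly how the strings are laid out
parsed = pd.to_datetime(raw, format='%d/%m/%Y')

# Output format: render the timestamps back to strings as YYYY-MM-DD
formatted = parsed.dt.strftime('%Y-%m-%d')

print(parsed)
print(formatted)
```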

<div id="nulls"></div>
<h4>Working with Nulls</h4>

This brings us to our final point - what happens if our superhero `to_datetime()` function <i>can't parse the input string</i> and arrive at a usable date?

The default behavior is `raise`, which raises a `ValueError` and exits the function (stops parsing immediately). What if we just want to stick a null in there and move on?

That's what `coerce` is useful for. If `to_datetime()` can't parse the string, it'll just stick a `NaT` in there instead. In many cases, this is preferable. Make sure to keep an eye on the number of nulls you generate when using this as it won't warn you.

In [21]:
# errors defaults to 'raise'; 'coerce' returns NaT for input that can't be parsed
pd.to_datetime('I am not a date', errors='coerce')

NaT
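Since `coerce` stays silent, one simple safeguard is to compare the null counts before and after conversion. The sketch below uses a made-up `Series` containing one genuine null and one unparseable string:

```python
import pandas as pd

# Hypothetical input: two valid dates, one bad string, one genuine null
raw = pd.Series(['2018-10-01', 'I am not a date', None, '2019-01-01'])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Nulls introduced by the conversion = nulls after minus nulls before
nulls_before = raw.isna().sum()
nulls_after = parsed.isna().sum()
introduced = nulls_after - nulls_before

print(parsed)
print(introduced)
```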

<div id="extracting"></div>
<h3>Extracting Components from Datetime Objects</h3>

So, we've gotten our messy string values all tidied up using `to_datetime()`. What happens when I want to retrieve the year or day of the data I've stored?

Enter <a href="https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties">series datetimelike properties</a>.

These are a collection of `Series` object properties that are accessible when the datatype is `Timestamp` or `Period`. Basically, <i>it allows us to extract date information from our date column, when it's stored as a date</i>.

We access these properties using the following dot notation:

```python
pd.Series.dt.<part of the date you want>
```

If our `Series` object were named `s`, it'd be written as:

```python
s.dt.<part of the date you want>
```

<div id="year"></div>
Let's extract just the `year` of the above `Series` object. What do we expect to see?

In [22]:
s.dt.year

0    2018.0
1    2019.0
2       NaN
3    1992.0
Name: myDateColumn, dtype: float64

<div id="day"></div>
Now you try - retrieve the `day` of the time series. What is the result?

In [23]:
s.dt.day

0     1.0
1     1.0
2     NaN
3    20.0
Name: myDateColumn, dtype: float64

What is the data type of the resultant conversion? Why does this matter?
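As a hint for the question above: the `NaT` in the column forces the extracted component into a float, since `NaN` cannot be held in a regular integer column. If whole numbers are wanted, one option is pandas' nullable `Int64` dtype. This is a sketch on a small made-up series, not the only way to handle it:

```python
import pandas as pd

# Hypothetical datetime series with one missing value
dates = pd.to_datetime(pd.Series(['2018-10-01', None, '1992-09-20']))

# The missing value shows up as NaN, so the years come back as floats
years = dates.dt.year
print(years.dtype)

# The nullable integer dtype keeps whole numbers alongside a missing marker
years_int = years.astype('Int64')
print(years_int)
```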

<b>Best Practices</b>:
<ul>
    <li>Whenever possible, store dates as <code>Timestamp</code> or <code>Period</code> objects</li>
    <li>These datatypes use methods and properties that are memory (space) and compute (time) optimized</li>
    <li><i>Only during reporting or extraction</i> should the above properties be used</li>
    <li>Avoid storing derived components (year, month, day) alongside the parent datetime column; this reduces redundancy and keeps the dataset small</li>
</ul>

<div id="exercise"></div>
<h2>Exercise - AdventureWorks</h2>
<p align="right">
<img src="http://lh6.ggpht.com/_XjcDyZkJqHg/TPaaRcaysbI/AAAAAAAAAFo/b1U3q-qbTjY/AdventureWorks%20Logo%5B5%5D.png?imgmax=800">
</p>

<div id="p_exercise"></div>
<h3>Production.Product</h3>

Here's the <i>Production.Product</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which is a description of the fields (columns) in the table (the .csv file we will import below):<br>

<details>
    <summary>Data Dictionary</summary>
    <table>
        <tr>
            <th>Name</th>
            <th>Description</th>
        </tr>
        <tr>
            <td>ProductID</td>
            <td>Primary key for Product records</td>
        </tr>
        <tr>
            <td>Name</td>
            <td>Name of the product</td>
        </tr>
        <tr>
            <td>ProductNumber</td>
            <td>Unique product identification number</td>
        </tr>
        <tr>
            <td>MakeFlag</td>
            <td>0 = Product is purchased, 1 = Product is manufactured in-house.</td>
        </tr>
        <tr>
            <td>FinishedGoodsFlag</td>
            <td>0 = Product is not a salable item. 1 = Product is salable.</td>
        </tr>
        <tr>
            <td>Color</td>
            <td>Product color</td>
        </tr>
        <tr>
            <td>SafetyStockLevel</td>
            <td>Minimum inventory quantity</td>
        </tr>
        <tr>
            <td>ReorderPoint</td>
            <td>Inventory level that triggers a purchase order or work order</td>
        </tr>
        <tr>
            <td>StandardCost</td>
            <td>Standard cost of the product [USD]</td>
        </tr>
        <tr>
            <td>ListPrice</td>
            <td>Selling price [USD]</td>
        </tr>
        <tr>
            <td>Size</td>
            <td>Product size [units vary, see SizeUnitMeasureCode]</td>
        </tr>
        <tr>
            <td>SizeUnitMeasureCode</td>
            <td>Unit of measure for the Size column</td>
        </tr>
        <tr>
            <td>WeightUnitMeasureCode</td>
            <td>Unit of measure for the Weight column</td>
        </tr>
        <tr>
            <td>DaysToManufacture</td>
            <td>Number of days required to manufacture the product</td>
        </tr>
        <tr>
            <td>ProductLine</td>
            <td>R = Road, M = Mountain, T = Touring, S = Standard</td>
        </tr>
        <tr>
            <td>Class</td>
            <td>H = High, M = Medium, L = Low</td>
        </tr>
        <tr>
            <td>Style</td>
            <td>W = Womens, M = Mens, U = Universal</td>
        </tr>
        <tr>
            <td>ProductSubcategoryID</td>
            <td>Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID</td>
        </tr>
        <tr>
            <td>ProductModelID</td>
            <td>Product is a member of this product model. Foreign key to ProductModel.ProductModelID</td>
        </tr>
        <tr>
            <td>SellStartDate</td>
            <td>Date the product was available for sale</td>
        </tr>
        <tr>
            <td>SellEndDate</td>
            <td>Date the product was no longer available for sale</td>
        </tr>
        <tr>
            <td>DiscontinuedDate</td>
            <td>Date the product was discontinued</td>
        </tr>
        <tr>
            <td>rowguid</td>
            <td>ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample</td>
        </tr>
        <tr>
            <td>ModifiedDate</td>
            <td>Date and time the record was last updated</td>
        </tr>
    </table>
</details>


<div id="p_read"></div>
<h4>Read in the Dataset</h4>

We are using the `read_csv()` function (with the `\t` separator to specify tab-delimited columns).

In [24]:
prod = pd.read_csv('../data/Production.Product.csv', sep='\t')

In [25]:
# let's check out the first 3 rows again, for old time's sake
prod.head(3)

Unnamed: 0,ProductID,Name,ProductNumber,MakeFlag,FinishedGoodsFlag,Color,SafetyStockLevel,ReorderPoint,StandardCost,ListPrice,...,ProductLine,Class,Style,ProductSubcategoryID,ProductModelID,SellStartDate,SellEndDate,DiscontinuedDate,rowguid,ModifiedDate
0,1,Adjustable Race,AR-5381,0,0,,1000,750,0.0,0.0,...,,,,,,2008-04-30 00:00:00,,,{694215B7-08F7-4C0D-ACB1-D734BA44C0C8},2014-02-08 10:01:36.827000000
1,2,Bearing Ball,BA-8327,0,0,,1000,750,0.0,0.0,...,,,,,,2008-04-30 00:00:00,,,{58AE3C20-4F3A-4749-A7D4-D568806CC537},2014-02-08 10:01:36.827000000
2,3,BB Ball Bearing,BE-2349,1,0,,800,600,0.0,0.0,...,,,,,,2008-04-30 00:00:00,,,{9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E},2014-02-08 10:01:36.827000000


In [26]:
# and the number of rows x cols
prod.shape

(504, 25)

<div id="p_resetidx"></div>
<h4>Reset Index</h4>

Let's bring our `ProductID` column into the index since it's the PK (primary key) of our table and that's where PKs belong as a best practice.

In [27]:
prod.set_index('ProductID', inplace=True)

<div id="p_types"></div>
<h4>Identify Types</h4>

<ul>
    <li>Print out the column data types</li>
    <li>Which columns in the <code>prod</code> dataframe are candidates for datetime conversion? Store the column names in a list named <code>datecols</code></li>
</ul>

In [28]:
prod.dtypes

Name                      object
ProductNumber             object
MakeFlag                   int64
FinishedGoodsFlag          int64
Color                     object
SafetyStockLevel           int64
ReorderPoint               int64
StandardCost             float64
ListPrice                float64
Size                      object
SizeUnitMeasureCode       object
WeightUnitMeasureCode     object
Weight                   float64
DaysToManufacture          int64
ProductLine               object
Class                     object
Style                     object
ProductSubcategoryID     float64
ProductModelID           float64
SellStartDate             object
SellEndDate               object
DiscontinuedDate         float64
rowguid                   object
ModifiedDate              object
dtype: object

In [29]:
datecols = ['SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'ModifiedDate']

<div id="p_nullcheck"></div>
<h4>Check Nulls</h4>

<ul>
    <li>Report the number of nulls for each column contained in <code>datecols</code></li>
    <li>Display this result as a <code>pd.Series</code> object</li>
    <li>Make note of anything that might warrant further investigation</li>
</ul>

In [30]:
# DiscontinuedDate is null for every single row in the dataframe.
# It is also stored as a float. This is because pandas has no information to
# identify it as a date or string.
# Moreover, 406 of the 504 products have no SellEndDate. The remaining 98
# have stopped selling, yet none of them has a DiscontinuedDate.
prod[datecols].isnull().sum()

SellStartDate         0
SellEndDate         406
DiscontinuedDate    504
ModifiedDate          0
dtype: int64

<div id="p_convert"></div>
<h4>Convert the Date Columns to <code>Timestamp</code> Objects</h4>

Convert every column in <code>datecols</code> to <code>Timestamp</code> objects using <a href="#todatetime"><code>to_datetime()</code></a>.

<ul>
    <li>Write a <code>for</code> loop to iterate over the columns in <code>datecols</code></li>
    <li>Using <a href="#todatetime"><code>to_datetime()</code></a>, convert each column to a <code>Timestamp</code> object</li>
    <li>Take the result of this <code>Timestamp</code> object and overwrite each respective source column</li>
    <li>Print the data types of the <code>datecols</code> columns to verify the conversion</li>
    <li>Print the first 3 rows of the <code>datecols</code> columns to verify the conversion</li>
</ul>

In [31]:
for col in datecols:
    prod[col] = pd.to_datetime(prod[col])

In [32]:
prod[datecols].dtypes

SellStartDate       datetime64[ns]
SellEndDate         datetime64[ns]
DiscontinuedDate    datetime64[ns]
ModifiedDate        datetime64[ns]
dtype: object

In [33]:
prod[datecols].head(3)

          SellStartDate SellEndDate DiscontinuedDate            ModifiedDate
ProductID
1            2008-04-30         NaT              NaT 2014-02-08 10:01:36.827
2            2008-04-30         NaT              NaT 2014-02-08 10:01:36.827
3            2008-04-30         NaT              NaT 2014-02-08 10:01:36.827
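The loop above can also be written as a single `apply` call, which runs `to_datetime` on each selected column in turn. The sketch below uses a small made-up frame that borrows the column names from this exercise; the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the product table (values are invented)
toy = pd.DataFrame({
    'SellStartDate': ['2008-04-30 00:00:00', '2011-05-31 00:00:00'],
    'SellEndDate': [None, '2012-05-29 00:00:00'],
    'DiscontinuedDate': [np.nan, np.nan],
})
toy_datecols = ['SellStartDate', 'SellEndDate', 'DiscontinuedDate']

# apply() calls pd.to_datetime once per column and returns a new DataFrame
toy[toy_datecols] = toy[toy_datecols].apply(pd.to_datetime)

print(toy.dtypes)
print(toy)
```

Whether the loop or `apply` reads better is largely a matter of taste; both overwrite the source columns with the converted values.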


<div id="p_newcols"></div>
<h4>Create New Columns</h4>

Using <a href="https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties">series datetimelike properties</a>, create three new columns:

<ul>
    <li><code>SellStartDate_Year</code>, a column containing the <code>year</code> of the <code>SellStartDate</code> column.</li>
    <li><code>SellStartDate_Month</code>, a column containing the <code>month</code> of the <code>SellStartDate</code> column.</li>
    <li><code>SellStartDate_Day</code>, a column containing the <code>day</code> of the <code>SellStartDate</code> column.</li>
    <li>Print the data types of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>
    <li>Print the first 3 rows of <code>SellStartDate</code>, <code>SellStartDate_Year</code>, <code>SellStartDate_Month</code>, and <code>SellStartDate_Day</code>.</li>
</ul>

In [34]:
prod['SellStartDate_Year'] = prod['SellStartDate'].dt.year
prod['SellStartDate_Month'] = prod['SellStartDate'].dt.month
prod['SellStartDate_Day'] = prod['SellStartDate'].dt.day

In [35]:
prod[['SellStartDate', 'SellStartDate_Year', 'SellStartDate_Month', 'SellStartDate_Day']].dtypes

SellStartDate          datetime64[ns]
SellStartDate_Year              int64
SellStartDate_Month             int64
SellStartDate_Day               int64
dtype: object

In [36]:
prod[['SellStartDate', 'SellStartDate_Year', 'SellStartDate_Month', 'SellStartDate_Day']].head(3)

          SellStartDate  SellStartDate_Year  SellStartDate_Month  SellStartDate_Day
ProductID
1            2008-04-30                2008                    4                 30
2            2008-04-30                2008                    4                 30
3            2008-04-30                2008                    4                 30


<div id="exercise_product"></div>
<h3>Sales.SalesOrderDetail</h3>

Here's the <i>Sales.SalesOrderDetail</i> table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html), which is a description of the fields (columns) in the table (the .csv file we will import below):<br>

<details>
    <summary>Data Dictionary</summary>
    <table>
        <tr>
            <th>Name</th>
            <th>Description</th>
        </tr>
        <tr>
            <td>TBD</td>
            <td>TBD</td>
        </tr>
    </table>
</details>

<div id="s_read"></div>
<h4>Read in the Dataset</h4>

We are using the `read_csv()` function (with the `\t` separator to specify tab-delimited columns).

In [37]:
sod = pd.read_csv('../data/Sales.SalesOrderDetail.csv', sep='\t')

In [38]:
sod.columns

Index(['SalesOrderID', 'SalesOrderDetailID', 'CarrierTrackingNumber',
       'OrderQty', 'ProductID', 'SpecialOfferID', 'UnitPrice',
       'UnitPriceDiscount', 'LineTotal', 'rowguid', 'ModifiedDate'],
      dtype='object')

<div id="s_resetidx"></div>
<h4>Reset Index</h4>

Using <code>.set_index()</code>, set the <code>sod</code> dataframe to a <a href="#timestampidx"><code>Timestamp</code> index</a>.

<ul>
    <li>Display the index of the <code>sod</code> dataframe as-is.</li>
    <li>Convert <code>ModifiedDate</code> to a <code>Timestamp</code> object.</li>
    <li>Use <code>.set_index()</code> to make this new column the dataframe index.</li>
    <li>Display the index of the <code>sod</code> dataframe after conversion. Has the type changed?</li>
</ul>

In [39]:
sod.index

RangeIndex(start=0, stop=121317, step=1)

In [40]:
sod.set_index(pd.to_datetime(sod['ModifiedDate']), inplace=True)
sod.index

DatetimeIndex(['2011-05-31', '2011-05-31', '2011-05-31', '2011-05-31',
               '2011-05-31', '2011-05-31', '2011-05-31', '2011-05-31',
               '2011-05-31', '2011-05-31',
               ...
               '2014-06-30', '2014-06-30', '2014-06-30', '2014-06-30',
               '2014-06-30', '2014-06-30', '2014-06-30', '2014-06-30',
               '2014-06-30', '2014-06-30'],
              dtype='datetime64[ns]', name='ModifiedDate', length=121317, freq=None)

<div id="s_slice"></div>
<h4>Slicing Dates</h4>

Using <a href="#stringslicing">string slicing</a> of the index:
<ul>
    <li>Find how many <code>SalesOrderID</code>s were processed between March 15th, 2013 and March 20th, 2013</li>
    <li>Return your result as an <code>int</code>.</li>
</ul>

In [41]:
sod['3/15/2013':'3/20/2013']['SalesOrderID'].shape[0]

55
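Note that `.shape[0]` counts rows, and an order with several line items appears on several rows. If the number of distinct orders is wanted instead, `nunique()` is one option. The sketch below illustrates the difference on a tiny made-up frame (not the AdventureWorks data), using `.loc` for the date slice:

```python
import pandas as pd

# Toy order lines: order 2 has two line items on the same day
toy_orders = pd.DataFrame({
    'SalesOrderID': [1, 2, 2, 3],
    'ModifiedDate': ['2013-03-14', '2013-03-15', '2013-03-15', '2013-03-20'],
})
toy_orders['ModifiedDate'] = pd.to_datetime(toy_orders['ModifiedDate'])
toy_orders = toy_orders.set_index('ModifiedDate').sort_index()

# Label-based slice on the DatetimeIndex (both endpoints included)
window = toy_orders.loc['2013-03-15':'2013-03-20']

n_rows = window.shape[0]                      # number of order lines
n_orders = window['SalesOrderID'].nunique()   # number of distinct orders

print(n_rows, n_orders)
```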