{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "

\n", " \n", "

\n", "
\n", " \n", "

Pandas for Dates and Times

\n", "
\n", "
\n", "\n", "\n", "Pandas can also be used for times and dates! This is especially helpful for financial analysis, as most measures are with respect to time.\n", "\n", "\n", "
\n", " Overview\n", " \n", "
\n", "\n", "\n", "
\n", " Table of Contents\n", " \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Import Pandas

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "print(f'Pandas v{pd.__version__}\\nNumpy v{np.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Datetime Objects

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Use | Class | Remarks | How to create |
|---|---|---|---|
| Time points | `Timestamp` | Represents a single timestamp | `to_datetime`, `Timestamp` |
| | `DatetimeIndex` | Index of `Timestamp` | `to_datetime`, `date_range`, `bdate_range`, `DatetimeIndex` |
| Time spans | `Period` | Represents a single time span | `Period` |
| | `PeriodIndex` | Index of `Period` | `period_range`, `PeriodIndex` |
\n", " \n", "The table above lists the main date and time classes in pandas. Under the hood, `Timestamp` and `DatetimeIndex` objects are backed by numpy `datetime64` values, while `Period` and `PeriodIndex` objects are stored as integer ordinals paired with a frequency.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
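As a rough illustration of the four classes described above, the sketch below builds one object of each kind. The dates and frequencies are arbitrary values chosen for the example.

```python
import pandas as pd

# A single point in time
ts = pd.Timestamp('2018-01-01 09:30')

# An index of points in time, one per calendar day
dti = pd.date_range(start='2018-01-01', periods=3, freq='D')

# A single span of time: the whole month of January 2018
p = pd.Period('2018-01', freq='M')

# An index of spans, one per month
pi = pd.period_range(start='2018-01', periods=3, freq='M')

for obj in (ts, dti, p, pi):
    print(type(obj).__name__)
```
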
\n", "

Creating Timestamp Objects

\n", "\n", "Let's start by making a `Timestamp` object.\n", "\n", "```python\n", "Init signature: pd.Timestamp(ts_input=, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None)\n", "Docstring: \n", "Pandas replacement for datetime.datetime\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a `Timestamp` object by passing its keyword arguments explicitly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also parse a string directly, as follows. Note the `T` separator between YYYYMMDD and HH:MM:SS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that, under the hood, a `Timestamp` object is a `datetime64[ns]` numpy object, which has nanosecond resolution and is stored as a 64-bit integer. As such, it's capable of covering about 584 years. That's a lot of nanoseconds! 2^64, to be exact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
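The two ways of creating a `Timestamp` described above can be sketched as follows. The date used is an arbitrary example, and `pd.Timestamp.min` / `pd.Timestamp.max` are shown only to give a feel for the range that nanosecond resolution allows.

```python
import pandas as pd

# Explicit keyword arguments
ts_kwargs = pd.Timestamp(year=2018, month=1, day=1, hour=13, minute=30, second=15)

# A string, with T separating the date from the time
ts_string = pd.Timestamp('2018-01-01T13:30:15')

# Both routes describe the same instant
print(ts_kwargs == ts_string)

# Earliest and latest instants a nanosecond-resolution Timestamp can hold
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# Approximate width of that range, in calendar years
span_years = pd.Timestamp.max.year - pd.Timestamp.min.year
print(span_years)
```
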
\n", "

Creating an Index of Timestamps

\n", "\n", "We can assign these `Timestamp` objects to the index of our `DataFrame` to create a table that is indexed chronologically. A timeseries database, if you will. Pandas has a helper function for this, `pd.date_range`. This takes three arguments:\n", "\n", "
    \n", "
  1. `start`: The beginning of the index\n", "
  2. `end`: The end of the index\n", "
  3. `freq`: The interval for each Timestamp\n", "
\n", "\n", "Note that `freq` takes what is referred to as an offset alias, which is one of a pre-loaded set of common frequencies, or Timestamp spans.\n", "\n", "```python\n", "Signature: pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)\n", "Docstring:\n", "Return a fixed frequency DatetimeIndex.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will create an index starting at Jan 1, 2018, and ending at Jan 1, 2019. It does so with `BM`, or business month end frequency. This is the last work day of each month." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then create a series out of this index by passing it to the `index=` kwarg of `pd.Series`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that our index type is a `DatetimeIndex`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us the distinct advantage of slicing and indexing our index, just as we would with an automatically generated integer index. Let's take the first 3 rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there's more! We can even select a specific date in the index, as a string, which returns the corresponding value in that 'cell':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Finally, we can select ranges of dates, as strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
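Putting the steps of this section together, a minimal sketch might look like the one below. The values in the series are made up, and the sketch passes the `pd.offsets.BusinessMonthEnd()` offset object instead of the `BM` alias string, because newer pandas releases rename that alias to `BME`.

```python
import numpy as np
import pandas as pd

# Business month end dates from Jan 1, 2018 through Jan 1, 2019
idx = pd.date_range(start='2018-01-01', end='2019-01-01',
                    freq=pd.offsets.BusinessMonthEnd())

# A Series of made-up values indexed by those dates
ts = pd.Series(np.arange(len(idx)), index=idx)

print(type(ts.index).__name__)        # the index is a DatetimeIndex
print(ts.iloc[:3])                    # first 3 rows, by position
print(ts.loc['2018-01-31'])           # value on one specific date
print(ts.loc['2018-03':'2018-06'])    # all rows from March through June 2018
```

String slices on a `DatetimeIndex` include both endpoints, so the last line returns the business month ends for March, April, May and June.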
\n", "

Converting Datetime Objects

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Using .to_datetime()

\n", "\n", "Previously, we manually created `Timestamp` and `Period` objects. This assumes we know the year, day, etc. of our input data - nice and clean, ready to convert. \n", "\n", "Of course, data is never clean, and we'd need to parse the input string to feed the individual keyword arguments to a function like `Timestamp` for it to know how to convert it. What a pain! Surely, there must be a better way to parse these pesky strings!\n", "\n", "Enter `to_datetime()`.\n", "\n", "This function is extremely powerful: it automatically detects the format of input dates (as strings), parses them, and returns the result as a `Timestamp` object. Nice!\n", "\n", "```python\n", "Signature: pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)\n", "Docstring:\n", "Convert argument to datetime.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if it can convert this wild string correctly: 1:55pm and 24 seconds, January 22nd, 1985." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazing! Let's try it out on a `DataFrame` object to see how it fares. First, let's make a `DataFrame` containing a few dates we make up. Note that the 3rd entry is `np.nan`, which represents a null value in our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try to convert this column to `Timestamp` values, using `to_datetime()`. Do we think it will work? Note: we are storing the resulting `Series` object in a variable, `s`, for later use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It did!

Note:\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
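For reference, here is a minimal sketch of the two conversions walked through above. The input string and the dates in the `DataFrame` are invented for illustration, and the column name `date` is an assumption rather than the name used in the live-coded cells.

```python
import numpy as np
import pandas as pd

# A single, loosely formatted string
ts = pd.to_datetime('Jan 22, 1985 1:55:24 PM')
print(ts)

# A column of date strings; the 3rd entry is a null
df = pd.DataFrame({'date': ['2018-01-05', '2018-02-10', np.nan, '2018-04-20']})

# Convert the whole column at once and keep the result for later use
s = pd.to_datetime(df['date'])
print(s)

# The null survives the conversion as NaT (not a time)
print(s.isna().sum())
```
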
\n", "

Working with Nulls

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This brings us to our final point - what happens if our superhero `to_datetime()` function can't parse the input string and arrive at a usable date?\n", "\n", "The default behavior is `raise`, which raises a `ValueError` and exits the function (stops parsing immediately). What if we just want to stick a null in there and move on?\n", "\n", "That's what `coerce` is useful for. If `to_datetime()` can't parse the string, it'll just stick a `NaT` in there instead. In many cases, this is preferable. Make sure to keep an eye on the number of nulls you generate when using this as it won't warn you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the default for errors kwarg is 'raise'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
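The difference between `raise` and `coerce` can be sketched as follows, using a small made-up series that contains one value that is not a date.

```python
import pandas as pd

raw = pd.Series(['2018-01-05', 'not a date', '2018-04-20'])

# errors='raise' (the default) stops at the first unparseable value
try:
    pd.to_datetime(raw)
except ValueError as exc:
    print('raise:', type(exc).__name__)

# errors='coerce' replaces unparseable values with NaT and keeps going
coerced = pd.to_datetime(raw, errors='coerce')
print(coerced)

# Count the nulls that coercion introduced, since no warning is emitted
n_new_nulls = coerced.isna().sum() - raw.isna().sum()
print(n_new_nulls)
```

Comparing the null count before and after the conversion is one simple way to keep an eye on how many values were silently coerced.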

Extracting Components from Datetime Objects

\n", "\n", "So, we've gotten our messy string values all tidied up using `to_datetime()`. What happens when we want to retrieve the year or day of the data we've stored?\n", "\n", "Enter series datetimelike properties.\n", "\n", "These are a collection of `Series` object properties that are accessible when the datatype is `Timestamp` or `Period`. They allow us to extract date information from our date column, when it's stored as a date.\n", "\n", "We access these properties using the following dot notation:\n", "\n", "```python\n", "pd.Series.dt.<property>\n", "```\n", "\n", "If our `Series` object were named `s`, it'd be written as:\n", "\n", "```python\n", "s.dt.<property>\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
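As a small sketch of the dot notation above, the snippet below rebuilds a datetime `Series` named `s` from made-up dates and reads a few of its datetimelike properties.

```python
import pandas as pd

# A datetime Series built from invented dates
s = pd.to_datetime(pd.Series(['2018-01-05', '2018-02-10', '2018-04-20']))

print(s.dt.year)         # calendar year of each row
print(s.dt.month)        # month number of each row
print(s.dt.day_name())   # weekday name of each row
```
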
\n", "Let's extract just the `year` of the above `Series` object. What do we expect to see?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Now you try - retrieve the `day` of the time series. What is the result?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the data type of the resultant conversion? Why does this matter?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Best Practices:\n", "
    \n", "
- Whenever possible, store dates as `Timestamp` or `Period` objects\n", "
- These datatypes use methods and properties that are optimized for memory (space) and compute (time)\n", "
- Use the datetimelike properties above only during reporting or extraction\n", "
- Do not store components derived from a datetime (year, month, day) as separate columns; leaving them out reduces redundancy and database size\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Exercise - AdventureWorks

\n", "

\n", "\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Production.Product

\n", "\n", "Here's the Production.Product table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Production.Product.html), which is a description of the fields (columns) in the table (the .csv file we will import below):
\n", "\n", "
\n", " Data Dictionary\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Name | Description |
|---|---|
| ProductID | Primary key for Product records |
| Name | Name of the product |
| ProductNumber | Unique product identification number |
| MakeFlag | 0 = Product is purchased, 1 = Product is manufactured in-house. |
| FinishedGoodsFlag | 0 = Product is not a salable item. 1 = Product is salable. |
| Color | Product color |
| SafetyStockLevel | Minimum inventory quantity |
| ReorderPoint | Inventory level that triggers a purchase order or work order |
| StandardCost | Standard cost of the product [USD] |
| ListPrice | Selling price [USD] |
| Size | Product size [units vary, see SizeUnitMeasureCode] |
| SizeUnitMeasureCode | Unit of measure for the Size column |
| WeightUnitMeasureCode | Unit of measure for the Weight column |
| DaysToManufacture | Number of days required to manufacture the product |
| ProductLine | R = Road, M = Mountain, T = Touring, S = Standard |
| Class | H = High, M = Medium, L = Low |
| Style | W = Womens, M = Mens, U = Universal |
| ProductSubcategoryID | Product is a member of this product subcategory. Foreign key to ProductSubCategory.ProductSubCategoryID |
| ProductModelID | Product is a member of this product model. Foreign key to ProductModel.ProductModelID |
| SellStartDate | Date the product was available for sale |
| SellEndDate | Date the product was no longer available for sale |
| DiscontinuedDate | Date the product was discontinued |
| rowguid | ROWGUIDCOL number uniquely identifying the record. Used to support a merge replication sample |
| ModifiedDate | Date and time the record was last updated |
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Read in the Dataset

\n", "\n", "We are using the `read_csv()` method (and the `\\t` separator to specify tab-delimited columns)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's check out the first 3 rows again, for old time's sake\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and the number of rows x cols\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Set Index

\n", "\n", "Let's bring our `ProductID` column into the index since it's the PK (primary key) of our table and that's where PKs belong as a best practice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Identify Types

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Check Nulls

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Convert the SellStartDate column to a Timestamp object

\n", "\n", "Convert the SellStartDate column to a Timestamp object using to_datetime().\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Create New Columns

\n", "\n", "Using series datetimelike properties, create three new columns:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Sales.SalesOrderDetail

\n", "\n", "Here's the Sales.SalesOrderDetail table [data dictionary](https://www.sqldatadictionary.com/AdventureWorks2014/Sales.SalesOrderDetail.html), which is a description of the fields (columns) in the table (the .csv file we will import below).
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Read in the Dataset

\n", "\n", "We are using the `read_csv()` method (and the `\\t` separator to specify tab-delimited columns)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Set Index

\n", "\n", "Using .set_index(), set the sod dataframe to a Timestamp index.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
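One possible shape for this step is sketched below. The rows are stand-in values rather than the real dataset, and using `ModifiedDate` as the date column is an assumption; substitute whichever date column the exercise calls for.

```python
import pandas as pd

# Stand-in rows; the real sod frame comes from the csv read in above
sod = pd.DataFrame({
    'SalesOrderID': [43659, 43660, 43661],
    'OrderQty': [1, 3, 2],
    'ModifiedDate': ['2011-05-31', '2011-06-01', '2011-06-15'],
})

# Parse the strings first, then move the column into the index
sod['ModifiedDate'] = pd.to_datetime(sod['ModifiedDate'])
sod = sod.set_index('ModifiedDate').sort_index()

print(type(sod.index).__name__)   # the index is now a DatetimeIndex
print(sod.loc['2011-06'])         # rows that fall in June 2011
```

Sorting the index after setting it keeps string-based date selection predictable when the source rows are not already in chronological order.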
\n", "

Slicing Dates

\n", "\n", "Using string slicing of the index:\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }