Intro to Data Science

Learning Objectives

After this lesson, you will be able to:

Apply the data science workflow.
Set up a data science development environment for Python.

What is Data Science?

Harvard Business Review called data scientist the ‘sexiest job of the 21st century’.
Glassdoor determined the profession to be among the most desirable in 2016 and 2017.

Sounds cool, right? But… what is it?

Data Science Examples

Netflix recommendation engine.
Apple FaceID determining if a photo contains your face.
A bank approving a credit card.

Common thread:

All leverage data to make decisions.

Class Question: What is an example of data science you have heard of? What makes your example data science?

Data Science Definition

Courtesy of GA’s Standards Board:

Data science is the practice of: acquiring, organizing, and delivering complex data; discovering relationships and anomalies among variables; building and deploying machine learning models; and synthesizing data to influence decision-making.

tl;dr: Data scientists:

Use data of all kinds (numbers, text, images).
Make explanations and predictive decisions.

Conway Venn Diagram

Specific Data Scientist Roles

What does that break down to?

Machine Learning Engineer
Data Engineer
Research Scientist
Advanced Analyst

Machine Learning Engineer

Identify machine learning applications.
Work in production code.
Manage infrastructure and data pipelines.
“Straddle the line between knowing the mathematics and coding the mathematics.”
- eBay VP of engineering Japjit Tulsi

Data Engineer

Create the architecture that allows data acquisition and machine learning problems to run at scale.
Focus on the algorithm and the analysis.
Don’t work much on the software side.

Research Scientist

PhD-heavy field.
Determines new algorithmic optimizations.
Focused on driving scientific discovery.
Less concerned with pursuing industrial applications.

Applied research scientists:

Specialized research scientist.
Backgrounds in both data science and computer science.
Invaluable members of any AI team.
“They can both pitch in on data science and write code. Finding a good applied research scientist is worth her weight in gold.”
- Japjit Tulsi

Advanced Analysts

Quantitative-minded.
Apply descriptive and inferential statistics, exploratory data analysis, and modeling.

Quick Review

Data science is the practice of:

Acquiring, organizing, and delivering complex data; discovering relationships and anomalies among variables.
Building and deploying machine learning models.
Synthesizing data to influence decision-making.

Specific Data Science Roles Include:

Machine Learning Engineer
Data Engineer
Research Scientist
Advanced Analyst

How Do We…

Go through the data science workflow?
Solve a data science problem?
Craft a data science problem statement?

The Data Science Workflow

Class Discussion: Which step do you believe will be most challenging?

There’s no objectively correct answer!

Teaching Tips:

Draw the workflow on the board to reference.
Keep the example dataset open in a new tab.
Focus on the importance of defining a question, especially following the first class discussion of which workflow component is most challenging.
Consider thinking through your own work and anchoring the discussion of the workflow steps to that example to reduce abstraction.
Make the step-by-step exercise engaging at every component. Let the class guide the problem you want to solve. If you’d like, you can encourage them to converge on a single problem statement you feel most comfortable discussing (like those below).
You may consider running the whole exercise as unstructured time, or guiding step-by-step. Step-by-step is recommended to ensure learners remain on task and do not get stuck.

Talking Points:

While the data science workflow is presented as five sequential steps, reinforce that data science is often iterative across these steps. When an analysis yields an unexpected result, you may revisit the data preparation to ensure it was handled properly.
Defining a question and tying work to an objective is essential to emphasize, because problems that progress without a hypothesis to prove or disprove ultimately become circular. There is a near-infinite number of spurious correlations or “interesting” ideas to consider. Only those that drive you toward an outcome are necessary.
There are caveats to this process. Note the area labeled “these steps are not hard-set rules.”
Frame: Ensure students first identify what factors affect cost. Then, consider how those costs can be reduced. Finally, hypothesize a way to describe or predict whether those factors can be reduced.
Prepare: Encourage learners to consider data integrity. A few easy points to call out: differing ways of reporting “No” (N and No) and missing values (NA). Reassure students that it is quite common to have datasets where the ground truth answers to questions like these are unknown.
Analyze: Reinforce the importance of data preparation and of connecting each analysis to the initial question.
Interpret: Restate the hypothesis you are aiming to prove or disprove. Identify if the limited dataset provides you with anecdotes to validate or invalidate that statement.
Communicate: Provide your best communications tips, written and verbal alike. These persist when using data.

Notes on the Steps

Not hard-set rules.
Really, problem-solving guidelines.

Every problem’s different!

Some projects may not require every step.
It’s normal to repeat certain steps a few times.
The process is cyclical with new findings!

Talking Points:

A recommended problem is like the following:

Frame: Let’s presume the key cost driver for this HR function is twofold: employees turning over early (low total years of service) and a high time to fill (positions going unfilled, costing productivity losses). We’ll aim to minimize turnover. Let’s hypothesize we can explain how long an employee stays with the company based on their university, previous employment, and how they found our retail store, Data Science Wearables (DSW).
Prepare: We would want to create a consistent data standard for Current Employee -> No to N across the dataset. Moreover, we need to hypothesize why NA values exist. E.g. did the second candidate not have a previous employer, or was this data unavailable? (We do not know with the information given.)
Analyze: While we will dive deeper into analysis using Python soon enough, anecdotal evidence from three observations suggests that Candidate Source is a useful explanatory variable. Employees who had previous experience with DSW (Internship and Referral) stayed longer. The relationship between Time to Fill (Days) and employee tenure may be U-shaped: either an employee already knows DSW (short time to fill) and stays because they like it, or DSW waits for the perfect candidate (long time to fill) and that candidate stays for a while. School does not seem to carry a useful signal for employee tenure.
Interpret: It appears many of our explanatory factors helped, but not all, and not in ways we may have anticipated. School does not seem to yield valuable insight, but knowing an employee has been referred or had experience with DSW previously is a key signal. Perhaps DSW should expand their internship and employee referral programs.
Communicate: Our driving thesis is: The best candidates for DSW are those that have connected with the store in a previous way (internship, referral). Investing in these programs is recommended. Visualizations of average employee tenure segmented by these factors are encouraged.
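The Communicate step above recommends showing average tenure segmented by candidate source. A minimal pandas sketch of that summary, assuming the three example rows are typed in by hand; the `Group` column and its labels are illustrative names, not part of the dataset:

```python
import pandas as pd

# Two columns from the three example rows.
staff = pd.DataFrame({
    "Candidate Source": ["Referral", "Internship", "Online"],
    "Years of Service": [1.5, 2.0, 0.5],
})

# Label hires who already had a connection to DSW (hypothetical column).
had_connection = staff["Candidate Source"].isin(["Referral", "Internship"])
staff["Group"] = had_connection.map(
    {True: "Prior connection", False: "No prior connection"}
)

# Average tenure per group; this is the table a bar chart would be drawn from.
tenure_by_group = staff.groupby("Group")["Years of Service"].mean()
print(tenure_by_group)
```

With only three rows, each average rests on one or two employees, so treat the gap between the groups as an anecdote rather than a finding.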

Step 1 is Always “Frame the Problem”

Solving a data science task starts with a clearly defined problem.

Poor results stem from no defined goal.

“A problem well stated is half solved.” — Charles Kettering

From there, you can apply your steps.

The Data Science Workflow: Applied

You need to reduce the costs of staffing.

You have a table of DSW’s current and former retail sales associates across department stores.

The first three rows look like this:

Job Level	Current Employee	Reason for Termination	Years of Service	Candidate Source	Previous Employer	School	Time to Fill (Days)
Associate	N	New offer	1.5	Referral	Jake’s Hawaiian Shirts	University of Minnesota	40
Associate	Y	N/A	2.0	Internship	N/A	University of Iowa	15
Associate	No	Tardiness	0.5	Online	Hats and Caps	University of Nebraska	25

Teaching Tips:

We’ll be referring to this scenario a lot in the next few slides - keep this open or write it on the board.

Talking Points:

Let’s apply our workflow above to an interactive exercise. A given clothing retail company, Data Science Wearables (DSW), is interested in improving their human resource operations. Specifically, as a cost center in the business, this company wants to reduce their expenses associated with staffing the firm’s in-store associates across the United States.
Job Level: The role level. Our dataset is all current or former associates.
Current Employee: If the individual is a current employee, this is “Y”; otherwise, “N”.
Reason for Termination: If the employee no longer works at the retail store, this is why they left
Years of Service: How long did the employee work at DSW?
Candidate Source: Where did this employee learn of DSW?
Previous Employer: Where did the employee previously work?
School: Which university did the individual attend?
Time to Fill (Days): How long did it take to fill this person’s role? Typically, minimizing time to fill is key to lowering costs.

Step One: Frame

We know:

We want to reduce costs associated with staffing.

We don’t know:

What drives up costs of staffing?
Is there an underlying reason for those costs?
What hypothesis can we test to reduce costs?

Class Discussion: What factors affect HR costs? How could we minimize these?

Step Two: Prepare

Class Question: What questions do you have about the dataset?

Job Level	Current Employee	Reason for Termination	Years of Service	Candidate Source	Previous Employer	School	Time to Fill (Days)
Associate	N	New offer	1.5	Referral	Jake’s Hawaiian Shirts	University of Minnesota	40
Associate	Y	N/A	2.0	Internship	N/A	University of Iowa	15
Associate	No	Tardiness	0.5	Online	Hats and Caps	University of Nebraska	25
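Two preparation issues are visible in these rows: `No` versus `N`, and `N/A` entries. A minimal pandas sketch of how they might be handled, assuming four of the columns are typed in by hand (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Four columns from the three example rows, entered by hand.
staff = pd.DataFrame({
    "Current Employee": ["N", "Y", "No"],
    "Reason for Termination": ["New offer", "N/A", "Tardiness"],
    "Years of Service": [1.5, 2.0, 0.5],
    "Previous Employer": ["Jake's Hawaiian Shirts", "N/A", "Hats and Caps"],
})

# Use one code for "not a current employee".
staff["Current Employee"] = staff["Current Employee"].replace({"No": "N"})

# Mark "N/A" as a true missing value instead of a piece of text.
staff = staff.replace("N/A", np.nan)

# Count missing values per column.
missing_counts = staff.isna().sum()
print(staff)
print(missing_counts)
```

The code can count the gaps, but it cannot say why they exist. Whether the second associate had no previous employer or the value was never recorded is still a question for whoever owns the data.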

Step Three: Analyze

We want to:

Create meaning and conduct statistical description and inference.

For example, the average Years of Service is ~1.33 years.

Could we build a machine learning model to predict this?
The data could center on their background (school, previous employers, and application source).

For example, is the relationship between Time to Fill and Years of Service positive or negative?

Positive: when one increases, the other increases.
Negative: when one increases, the other decreases.

Talking Points:

For example, the average Years of Service in this given dataset is (1.5 + 2.0 + 0.5)/3 = 4/3 ≈ 1.33 years. In more complex situations, we may build a machine learning model to predict a given outcome. For example, we may want to predict how long a given candidate will stay in their role based on their background (school, previous employers, and application source).
We may also be interested in visualizing relationships between our variables/columns. For example, do we anticipate that the relationship between Time to Fill and Years of Service is positive (when one increases, the other increases) or negative (when one increases, the other decreases)? Considering questions like this helps us approach the true explanatory factor.
It is common for this step to reinforce and revisit the prior step as we discover anomalies or intriguing relationships.
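The average and the positive/negative question above can both be checked in a few lines of plain Python. This is a sketch using only the three example rows; the helper `pearson_r` is written here for illustration and is not part of the lesson:

```python
from math import sqrt
from statistics import mean

years_of_service = [1.5, 2.0, 0.5]
time_to_fill_days = [40, 15, 25]

average_tenure = mean(years_of_service)


def pearson_r(xs, ys):
    """Pearson correlation: above 0 when xs and ys rise together, below 0 when they move in opposite directions."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - x_bar) ** 2 for x in xs))
    spread_y = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cross_products / (spread_x * spread_y)


r = pearson_r(time_to_fill_days, years_of_service)
direction = "positive" if r > 0 else "negative"

print(f"Average Years of Service: {average_tenure:.2f}")
print(f"Time to Fill vs. Years of Service: {r:.2f} ({direction})")
```

On these three rows the correlation comes out weakly negative (about -0.22). A single number like this assumes a straight-line relationship, so it can hide a curved pattern, and with n=3 it is an anecdote at best.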

Step Four: Interpret

How do our results compare to our initial hypothesis?

What concrete actions do we recommend?

Class Question: Even with an extremely limited dataset (n=3), can you identify hypothesis-validating or invalidating anecdotes?

At this stage, treat metrics and results like “check engine lights.”

Result summaries may point you in the right direction, but they do not necessarily explain the full context at hand.

Step Five: Communicate

Results are only as convincing as the way they are conveyed to key stakeholders!

Back up your statement with evidence, including statistical tests, visualizations, and model results.

Quick Review

The data science workflow: Frame -> Prepare -> Analyze -> Interpret -> Communicate.

Why Python for Data Science

Easy to write

Data science is inherently a cross-functional discipline!
A language for all audiences is key.

Open source

New techniques become available daily!
Developers from around the world race to implement new libraries.
This places Python in contrast to closed source, paid data analysis tools like SAS and SPSS.

Often used for data analysis, scripting, and rapid software development.

Talking Points:

For the first portion of this week, you’ve focused on learning the fundamentals of Python. Why do we (and the community) emphasize Python as a choice language for data science?
For starters, let’s return to a buzzword-heavy and unofficial Python definition
Let’s break down these definitional attributes and discuss their impact on Python being a common language for data:
High level: Python is “far from” our computer’s RAM and CPU, meaning it is less like binary (01101010) and closer to plaintext English. This makes Python comparatively more intuitive as a first language. Because data science is inherently a cross-functional discipline, allowing it to be picked up by all audiences is key.
Open source: Python’s source code is free to use, and anyone can contribute to improve it. Being an open source language is a huge reason Python is a choice language for data science. As new techniques become available daily, developers from around the world race to implement said methods in Python libraries. (This places Python in contrast to closed source, paid data analysis tools like SAS and SPSS.)
Object-oriented: You learned how to create objects and classes for reproducible use cases. Object-oriented languages are generally more familiar for introductory content, lending a helping hand to Python being approachable.

Getting Data Science Tools

We can analyze data to determine what Python is most used for:

Pandas?
- A Python package for exploratory analysis.
- Let’s use it!

You Do: Your Data Science Development Tools

Python packages in DS are ubiquitous:
- Reading CSVs, linear algebra, linear regressions, matrices…

Anaconda (“Conda”):
- Package manager.
- Downloads everything for us!

Follow these steps:

Download Anaconda: https://www.anaconda.com/download/. Select Python 3.6+ for your machine (macOS or PC)
Open the file. Follow the on-screen prompts. Don’t hesitate to ask questions!

Please wait once you have successfully installed Anaconda.

Teaching Tips:

Emphasize that Python has packages as the reason why we see it being a choice language for data science. Connect that packages are the result of Python being an open source technology.
Work with your IAs to debug and assist students as complications (inevitably) pop up. Be sure everyone can successfully explain why we use packages, install Anaconda, and open a Jupyter Notebook.
With the Notebook, go step-by-step with all students. Do not encourage them to skip ahead of where you want them to be. (e.g. when they finish download, clearly tell them when to click through all install instructions. Stop and wait for further instruction.)
Delegate one IA to be the Windows or Mac person, and you take the other. Typically, Macs are more represented in these classes, so it makes most sense for the instructor to be the lead on Mac, and an IA to be the Windows person.

Talking Points:

Notice that when we use Python for data science, we rely heavily on open source packages. Python packages are Python scripts that let us easily perform reproducible actions.
For example, when we read in data to analyze, we could handle the input/output functions, parse the lines of a CSV, and store the datatypes correctly in memory ourselves. Or, we could (and will) use Pandas to simply say pd.read_csv('data.csv') to handle all of that work with a single line of code.
Python libraries are collections of packages. We may install a library for linear algebra or separately a library for linear regressions.
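To make the contrast above concrete, here is a rough sketch of both routes on the same small CSV. The text is held in a string so the example runs anywhere; in class it would live in a file such as `data.csv`. The column choice is an assumption for illustration:

```python
import csv
import io

import pandas as pd

# Two columns from the example dataset, written as CSV text.
raw_csv = (
    "Candidate Source,Years of Service\n"
    "Referral,1.5\n"
    "Internship,2.0\n"
    "Online,0.5\n"
)

# By hand: read each line and convert the numeric column ourselves.
manual_rows = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    row["Years of Service"] = float(row["Years of Service"])
    manual_rows.append(row)

# With pandas: one call parses the text and infers the column types.
staff = pd.read_csv(io.StringIO(raw_csv))

print(manual_rows[0])
print(staff.dtypes)
```

Both routes end with numbers stored as numbers, but the manual route needs a new conversion line for every numeric column, while `pd.read_csv` infers the types on its own.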

So, the question becomes: how do we install all of the necessary Python packages?

Anaconda is a product that solves many of the headaches associated with Python packaging.

Anaconda (maintained by Anaconda, Inc.) is a Python distribution built around a package manager called conda, which is why you will sometimes hear it called “Conda”.
It comes with many of the required tools and products we need to begin doing data science in Python.

Download Anaconda: Select Python 3.6+ for your machine (macOS or PC)

Please: open the file once it finishes downloading, and proceed with the on-screen prompts. There is no need to deviate from the default installation. (If you believe you have a question for your instructor based on your machine, please do not hesitate to raise your hand.)

What Are We Downloading?

Pandas:

The default tool for data exploration and manipulation in Python.

Jupyter Notebooks and Jupyter Lab:

The preferred integrated development environments (IDEs) of data science.
We’ll write our code in this!

NumPy, SciPy, and more:

Other packages for statistical inference, visualization, and parallelizing operations.
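Once a notebook is running, one way to confirm that the installation brought these packages along is to ask Python whether it can find them. This is a sketch using the standard library; the list of import names is an assumption about what a default Anaconda install includes:

```python
from importlib.util import find_spec

# Import names we expect a default Anaconda install to provide (assumed list).
expected_packages = ["pandas", "numpy", "scipy", "notebook"]

# find_spec returns None when Python cannot locate a top-level package.
installed = {name: find_spec(name) is not None for name in expected_packages}

for name, is_installed in installed.items():
    status = "found" if is_installed else "missing"
    print(f"{name}: {status}")
```

Any package reported as missing is worth raising with an instructor before moving on.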

You Do: Launching Jupyter Notebooks

Use your computer’s program search method (Spotlight on Mac) to search “Anaconda Navigator”.
Open Anaconda Navigator
Click “Launch” on Jupyter Notebooks.

wait…

It opens in your browser!

You have a Jupyter Notebook!

Talking Points:

Development environments between Mac and Windows differ, and there are many ways to open Jupyter Notebooks. (Just like there are many ways to open any given file or program on your computer!)
Emphasize that data science is not all about just writing code. (Hardly!) Discuss the importance of justifying methods. The text included in the lesson provides one example of this (mean vs median), copied here. Consider discussing your own experiences here as well.
The methods we’re applying – which are typically far more subjective or indeterministic in contrast to straight software development – are the other half. For example, pretend we’re missing many values. Do you fill in missing values with the mean or the median? The code for doing either of these operations is far less significant than the justifying decision. Jupyter Notebooks make it easy to create code cells next to text cells.
When it comes to markdown, tell students that no one spends time memorizing markdown syntax. Rather, we reference the markdown cheatsheet to remember how to make large headers, bulleted lists, tables, and more. An apt analogy: a mechanic does not memorize the paint colors of every car, but looks up the necessary color codes when it is time for a paint job.

Teaching Tips:

Launch a Jupyter Notebook with students (using the Anaconda Navigator), explain code and markdown cells, and create examples of each.
Be patient with students as they launch their first Jupyter Notebook, and be cognizant of the differences in Mac vs Windows for this exercise.
Go slowly when creating and filling in code cells and markdown cells.

Why Jupyter Notebooks?

Data science is both code and methods

What if we’re missing many values?

Do you fill in missing values with the mean or the median?
Easy to create code cells next to text cells.

Easy to connect to remote computers (datacenters).

Thus, the Jupyter Notebook is in your browser!

Talking Points:

Data science is both code and methods: As data scientists, the code we write is only half the story. The methods we’re applying – which are typically far more subjective or indeterministic in contrast to straight software development – are the other half. For example, pretend we’re missing many values. Do you fill in missing values with the mean or the median? The code for doing either of these operations is far less significant than the justifying decision. Jupyter Notebooks make it easy to create code cells next to text cells.
Connect to remote computing resources: While we will not be doing this in today’s content, Jupyter Notebooks make it easy to connect to remote computers (datacenters). This is why the Jupyter Notebook is in your browser. You’ve created a localhost server – a website for one person: you. The brains for this operation are your computer. We could swap the brains from your computer for stronger computers in a data center, but still write code in your own browser. Wow!
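The mean-versus-median question is easy to see in a code cell. A minimal sketch with hypothetical tenure values (made up for illustration, including one unusually long tenure):

```python
from statistics import mean, median

# Hypothetical tenures in years; None marks a missing entry.
tenures = [1.5, 2.0, 0.5, None, 10.0]

observed = [t for t in tenures if t is not None]
mean_fill = mean(observed)      # pulled upward by the 10.0 value
median_fill = median(observed)  # middle of the observed values

filled_with_mean = [mean_fill if t is None else t for t in tenures]
filled_with_median = [median_fill if t is None else t for t in tenures]

print(f"Fill with mean:   {mean_fill}")
print(f"Fill with median: {median_fill}")
```

The two fills differ by a factor of two (3.5 versus 1.75 years). The code is nearly identical either way; the text cell next to it is where you justify which one fits the data.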

Quick Review

Pandas
- A Python package for exploratory analysis.
Jupyter Notebooks and Jupyter Lab:
- The preferred integrated development environments (IDEs) of data science.
- We’ll write our code in this!

Anaconda helps us download these. You only had to download it once!

We Do: Code Cells

Let’s begin!

Make a code cell: Click the + in the upper left corner.
Inside the code cell, write:
```
print('hello world')
```
Be sure your cursor is inside the cell. Press "control" + Enter.
- Always how you run cells!

Voila!

We Do: Markdown Cells

Write and format plain text.

Make a code cell: Click the + in the upper left corner.
- You’re going to be doing this a lot!
Change this cell to a markdown cell:
- Click: Cell > Cell Type > Markdown.
- (You can also click the dropdown menu that says “Code” and change it to “Markdown”)
Inside the markdown cell, write:
```
## Hello world
```

Run the cell: "control" + Enter
Bam! Pretty formatted text.

Note: We will not spend time learning markdown syntax! Instead, take a look at the cheatsheet and links in Additional Resources.

Closing Down

Exit the tab in your browser.
That doesn’t quit the Notebook!
Open your Terminal (or Anaconda Prompt on Windows).
Hit control + C. This closes the running process.

Summary:

Data scientists:

Use data of all kinds (numbers, text, images).
Make explanations and predictive decisions.

Data Science Workflow:

Frame -> Prepare -> Analyze -> Interpret -> Communicate.

Jupyter Notebooks:

The industry tool!
Interactive with Python.

Additional Resources

What is data science from GA’s Standards Board blog post
Stack Overflow blog (1) posts (2) on Python’s growth
Markdown cheatsheet here
Interactive markdown cheatsheet here