Introduction to Data Science

Lesson Objectives

  1. Intros
  2. What is Data Science?

Intros

  1. Here's a bit about me
  2. This class can be about networking, too! Tell us about yourself!
    • What is Your Name?
    • What Brings You To GA?
    • What Are Your Current Activities?

What is Data Science?

What is it, exactly?

  • A set of tools and techniques used to extract useful information from data.
  • An interdisciplinary, problem-solving oriented subject

What does it consist of?

  • Programming skills
  • Math and Statistics knowledge
  • Business sense
  • Domain Knowledge
  • Communication Skills

[Figure: Venn diagram]

Your Turn: Qualities Of A Data Scientist And You

Let's talk through the following questions in groups:

  1. What do you think are the most important qualities for a data scientist?
  2. Can you think of any other qualities or skills we have not mentioned?
  3. What is your field of expertise?
  4. Do you use tools such as Excel, Stata, R, or Python?
  5. Where are you in the intersection of these skills?

Possible Answers: Qualities Of A Data Scientist And You

  • Ask good questions:
    • What is required?
    • How are results evaluated? (measures of success)
    • What do we currently know? (existing data)
    • What has happened? (descriptive analytics)
    • What will happen (if)? (predictive analytics)
    • What should we do to achieve what we require? (insight)
  • Define and test a hypothesis/run experiments.
  • Scrape and sample business-relevant data.
  • Manipulate, sanitize, and wrangle data.
  • Visualize data.
  • Understand data relationships.
  • Tell the machine how to learn from data.
  • Create data products that deliver actionable insight.
  • Tell relevant business stories from data.

Self Assessment on Data Science Skills

For a given class size, how many people will rate themselves strongest in:

  • Programming Skills?
  • Math and Statistics Knowledge?
  • Business Sense?
  • Domain Knowledge?
  • Communication Skills?

  1. Create a table for the qualities of a data scientist and then rate yourself on each of these skills on a scale from 1-10.
  2. We will then use this data to show how simple statistics fit into the data science workflow.

  Skill                            Value (1-10)
  -------------------------------  ------------
  Programming Skills
  Math and Statistics Knowledge
  Business Sense
  Domain Knowledge
  Communication Skills
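
Once the class has filled in the self-assessment table, the "simple statistics" step might look something like the sketch below. The ratings are invented purely for illustration, and the helper names (summarize, strongest_counts) are hypothetical. Only the Python standard library is used.

```python
from collections import Counter
from statistics import mean, median

SKILLS = [
    "Programming Skills",
    "Math and Statistics Knowledge",
    "Business Sense",
    "Domain Knowledge",
    "Communication Skills",
]

# Hypothetical ratings (1-10), one row per student, in SKILLS order.
# These numbers are made up for illustration; replace them with the class data.
ratings = [
    [7, 5, 4, 6, 8],
    [3, 8, 5, 7, 6],
    [9, 6, 3, 4, 5],
    [4, 4, 8, 7, 7],
]


def summarize(rows):
    """Return {skill: (mean, median)} across all students."""
    columns = zip(*rows)  # turn per-student rows into per-skill columns
    return {skill: (mean(col), median(col)) for skill, col in zip(SKILLS, columns)}


def strongest_counts(rows):
    """Count how many students rated each skill as their strongest.

    Ties go to the skill listed first in SKILLS.
    """
    return Counter(SKILLS[row.index(max(row))] for row in rows)


summary = summarize(ratings)
counts = strongest_counts(ratings)

for skill in SKILLS:
    avg, mid = summary[skill]
    print(f"{skill}: mean={avg:.2f}, median={mid}, strongest for {counts[skill]}")
```

Comparing the mean with the median for each skill is a quick way to see whether a few very high or very low self-ratings are pulling the average.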

The Data Science Workflow

  1. Identify the problem
    • what are we trying to do?
    • ask questions
    • form a hypothesis
  2. Acquire the data
    • get data in its raw form
      • scraping the data from a website
      • downloading a file
      • reading a book/article
  3. Parse the data
    • format the data so that it's all the same
  4. Mine the data
    • collect information from the data
  5. Refine the data
    • clean the data up
      • discard outliers, etc
  6. Build a data model
    • figure out a formula that represents what we are trying to learn
  7. Present the results
    • visualize the results
  8. Deploy and validate
    • create a site
    • publish findings
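
To make the steps above more concrete, here is a toy sketch that maps steps 2 through 7 onto small Python functions. Everything in it is an assumption made for illustration: the in-memory CSV stands in for a downloaded file, the visits/purchases columns are invented, and a least-squares line is just one simple choice of "formula" for the modeling step.

```python
import csv
import io
from statistics import mean

# 2. Acquire: a tiny in-memory CSV stands in for a downloaded file (made-up data).
RAW = """visits,purchases
3,1
 5,2
n/a,0
8,3
"""


def parse(raw):
    """3. Parse: read CSV rows into dicts with whitespace-stripped string values."""
    reader = csv.DictReader(io.StringIO(raw))
    return [{key: value.strip() for key, value in row.items()} for row in reader]


def mine(rows):
    """4. Mine: collect basic information, here the row count and bad-row count."""
    non_numeric = sum(1 for r in rows if not all(v.isdigit() for v in r.values()))
    return {"rows": len(rows), "non_numeric": non_numeric}


def refine(rows):
    """5. Refine: drop rows that cannot be converted to integers."""
    clean = []
    for row in rows:
        try:
            clean.append({key: int(value) for key, value in row.items()})
        except ValueError:
            continue  # skip rows such as "n/a"
    return clean


def build_model(rows):
    """6. Model: fit purchases = slope * visits + intercept by least squares."""
    xs = [r["visits"] for r in rows]
    ys = [r["purchases"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def present(model):
    """7. Present: format the fitted formula for a reader."""
    slope, intercept = model
    return f"purchases = {slope:.2f} * visits {intercept:+.2f}"


parsed = parse(RAW)
info = mine(parsed)
rows = refine(parsed)
model = build_model(rows)
print(info)
print(present(model))
```

Step 1 (identify the problem) and step 8 (deploy and validate) are left out on purpose, since they depend on the question being asked and on where the results will be published.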

[Figure: the data science workflow]

Your Turn: Visualizing The Data Science Workflow

You are a junior data scientist at Amazon. Your boss asks you about the leading indicators that a user will make a new online purchase. How would you go about answering this question?

  1. Identify the problem: What do you think are the indicators?
  2. Acquire Data: What could we do first here? What are some considerations we should make?
  3. Parse Data: How do you format the data so it is all the same?
  4. Mine and Refine: What calculations/transformations do you recommend? How do you determine the presence of outliers?
  5. Data Model: What attributes would you include in the modeling stage? How do you know if the model is performing well?
  6. Present Results: Who is your audience? What is the best way to present your results?
  7. Deploy and Validate: How would this be shared with the community? How will it be validated?
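
For the outlier question in step 4, one common convention (not the only one, and not necessarily the answer this exercise is looking for) is the 1.5 × IQR rule: flag any value that falls more than 1.5 interquartile ranges below the first quartile or above the third. A minimal sketch, using made-up order values and a hypothetical helper named find_outliers:

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order values in dollars, invented for illustration.
order_values = [20, 22, 25, 27, 30, 31, 35, 400]
print(find_outliers(order_values))
```

Whether a flagged value should be discarded is a judgment call: a very large order may be a data-entry error, or it may be exactly the customer behavior the analysis is trying to explain.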