What Is Data Science?
Lesson Objectives
- Talk About Each Other
- Describe What Data Science Is
- Describe the Qualities Of A Data Scientist
- Describe the Data Science Workflow
Talk About Each Other
- Here's a bit about me
- This class can be about networking, too! Tell us about yourself!
- What is Your Name?
- What Brings You To GA?
- What Are Your Current Activities?
Describe What Data Science Is
What is it, exactly?
- A set of tools and techniques used to extract useful information from data.
- An interdisciplinary, problem-solving oriented subject
What does it consist of?
- Programming skills
- Math and Statistics knowledge
- Business sense
- Domain Knowledge
- Communication Skills
Describe the Qualities Of A Data Scientist
Exercise
Let's talk through the following questions in groups:
- What do you think are the most important qualities for a data scientist?
- Can you think of any other quality/skill we have not mentioned?
- What is your field of expertise?
- Do you use tools such as Excel, Stata, R, or Python?
- Where are you in the intersection of these skills?
Possible Answers
- Ask good questions:
  - What is required?
  - How are results evaluated? (measures of success)
  - What do we currently know? (existing data)
  - What has happened? (descriptive analytics)
  - What will happen (if)? (predictive analytics)
  - What should we do to achieve what is required? (insight)
- Define and test a hypothesis/run experiments.
- Scrape and sample business-relevant data.
- Manipulate, sanitize, and wrangle data.
- Visualize data.
- Understand data relationships.
- Tell the machine how to learn from data.
- Create data products that deliver actionable insight.
- Tell relevant business stories from data.
Describe the Data Science Workflow
Self Assessment on Data Science Skills
For a given class size:
- how many people will rate themselves strongest in Programming Skills?
- how many people will rate themselves strongest in Math and Statistics Knowledge?
- how many people will rate themselves strongest in Business Sense?
- how many people will rate themselves strongest in Domain Knowledge?
- how many people will rate themselves strongest in Communication Skills?
What to do:
- Create a table of the qualities of a data scientist, then rate yourself on each skill on a scale from 1 to 10.
- We will then use this data to show how simple statistics fit into the data science workflow.
| Skill | Value |
|---|---|
| Programming Skills | |
| Math and Statistics Knowledge | |
| Business Sense | |
| Domain Knowledge | |
| Communication Skills | |
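
As a minimal sketch of the "simple statistics" step, the snippet below averages each skill across a class and counts how many people rate themselves strongest in each one. The four responses are made-up placeholder values, not real class data, and the names `SKILLS`, `responses`, `averages`, and `strongest_counts` are chosen here purely for illustration.

```python
from collections import Counter
from statistics import mean

SKILLS = [
    "Programming Skills",
    "Math and Statistics Knowledge",
    "Business Sense",
    "Domain Knowledge",
    "Communication Skills",
]

# Placeholder self-ratings (1-10), one list per person, in the order of SKILLS.
# These numbers are invented for illustration; replace them with real answers.
raw_ratings = [
    [3, 6, 5, 8, 7],
    [7, 4, 6, 5, 8],
    [5, 6, 4, 7, 6],
    [9, 5, 8, 7, 4],
]
responses = [dict(zip(SKILLS, row)) for row in raw_ratings]

# Average rating per skill across the class.
averages = {skill: mean(r[skill] for r in responses) for skill in SKILLS}

# For each person, find the skill with the highest rating, then count them.
# On a tie, max() keeps the first skill in SKILLS order.
strongest_counts = Counter(
    max(SKILLS, key=lambda skill: r[skill]) for r in responses
)

for skill in SKILLS:
    print(f"{skill}: average {averages[skill]:.2f}, "
          f"strongest for {strongest_counts[skill]} people")
```

Comparing these counts with the guesses made before the exercise is a small example of forming a hypothesis and then checking it against data.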
The Data Science Workflow
- Identify the problem
  - what are we trying to do?
  - ask questions
  - form a hypothesis
- Acquire the data
  - get data in its raw form
    - scraping the data from a website
    - downloading a file
    - reading a book/article
- Parse the data
  - format the data so that it's all the same
- Mine the data
  - collect information from the data
- Refine the data
  - clean the data up
    - discard outliers, etc.
- Build a data model
  - figure out a formula that represents what we are trying to learn
- Present the results
  - visualize the results
- Deploy and validate
  - create a site
  - publish findings
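
To make the middle steps concrete, here is a minimal sketch of parsing, refining, and modeling on a tiny toy dataset. The raw rows are invented for illustration, `parse_number` and `fit_line` are hypothetical helper names, and a straight-line least-squares fit stands in for "a formula"; a real project would pick a model that suits the problem.

```python
from statistics import mean

# Acquire: raw (x, y) records as they might arrive, with inconsistent formatting.
# These values are invented for illustration.
raw_rows = [(" 1", "2.1"), ("2 ", "3.9"), ("3", "n/a"), ("4", "8.2"), ("5", "9.8")]


def parse_number(text):
    """Parse: convert a raw string to a float, or None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


parsed = [(parse_number(x), parse_number(y)) for x, y in raw_rows]

# Refine: drop rows with a missing value.
clean = [(x, y) for x, y in parsed if x is not None and y is not None]


def fit_line(points):
    """Model: ordinary least-squares fit of y = slope * x + intercept."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(clean)

# Present: report the fitted formula.
print(f"y = {slope:.2f} * x + {intercept:.2f} (fitted on {len(clean)} rows)")
```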
Your Turn: The Data Science Workflow
You are a junior data scientist at Amazon. Your boss asks you about the leading indicators that a user will make a new online purchase. How would you go about solving this question?
- Identify the problem: What do you think are the indicators?
- Acquire Data: What could we do first here? What are some considerations we should make?
- Parse Data: How do you format the data so it is all the same?
- Mine and Refine: What calculations or transformations do you recommend doing? How do you determine the presence of outliers?
- Data Model: What attributes would you include in the modeling stage? How do you know if the model is performing well?
- Present Results: Who is your audience? What is the best way to present your results?
- Deploy and Validate: How would this be shared with the community? How will it be validated?
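
For the outlier question in the Mine and Refine step, one common rule of thumb (not the only option) flags values that lie more than 1.5 interquartile ranges outside the quartiles. The sketch below assumes Python 3.8+ for `statistics.quantiles`; the order totals are invented placeholder values, and `iqr_outliers` is a hypothetical helper name.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Placeholder order totals, invented for illustration only.
order_totals = [12, 15, 14, 13, 16, 15, 14, 250]

outliers = iqr_outliers(order_totals)
print("Flagged as possible outliers:", outliers)
```

Whether a flagged value should be discarded is a judgment call: an unusually large order may be a data entry error or a genuinely important customer.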

