# Introduction to Data Science

## Lesson Objectives

1. Intros
1. What is Data Science?

## Intros

1. Here's a bit about me
1. This class can be about networking, too! Tell us about yourself!
    - What is Your Name?
    - What Brings You To GA?
    - What Are Your Current Activities?

## What is Data Science?

What is it, exactly?

- A set of tools and techniques used to extract useful information from data.
- An interdisciplinary, problem-solving-oriented subject.

What does it consist of?

- Programming skills
- Math and Statistics knowledge
- Business sense
- Domain Knowledge
- Communication Skills

![venn diagram](https://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png)

## Your Turn: Qualities Of A Data Scientist And You

Let's talk through the following questions in groups:

1. What do you think are the most important qualities for a data scientist?
2. Can you think of any other quality/skill we have not mentioned?
3. What is your field of expertise?
4. Do you use tools such as Excel, Stata, R, or Python?
5. Where are you in the intersection of these skills?

## Possible Answers: Qualities Of A Data Scientist And You

- Ask good questions:
    - What is required?
    - How are results evaluated? (measures of success)
    - What do we currently know? (existing data)
    - What has happened? (descriptive analytics)
    - What will happen (if)? (predictive analytics)
    - What to do to achieve what we require? (insight)
- Define and test a hypothesis/run experiments.
- Scrape and sample business-relevant data.
- Manipulate, sanitize, and wrangle data.
- Visualize data.
- Understand data relationships.
- Tell the machine how to learn from data.
- Create data products that deliver actionable insight.
- Tell relevant business stories from data.

## Self Assessment on Data Science Skills

For a given class size:

- How many people will rate themselves strongest in Programming Skills?
- How many people will rate themselves strongest in Math and Statistics Knowledge?
- How many people will rate themselves strongest in Business Sense?
- How many people will rate themselves strongest in Domain Knowledge?
- How many people will rate themselves strongest in Communication Skills?

1. Create a table for the qualities of a data scientist and then rate yourself on each of these skills on a scale from 1 to 10.
1. We will then use the data to show, in action, how simple statistics are part of the data science workflow.

| Skill | Value |
| --- | --- |
| Programming Skills | |
| Math and Statistics Knowledge | |
| Business Sense | |
| Domain Knowledge | |
| Communication Skills | |

## The Data Science Workflow

1. Identify the problem
    - what are we trying to do?
    - ask questions
    - form a hypothesis
1. Acquire the data
    - get data in its raw form
    - scraping the data from a website
    - downloading a file
    - reading a book/article
1. Parse the data
    - format the data so that it's all the same
1. Mine the data
    - collect information from the data
1. Refine the data
    - clean the data up
    - discard outliers, etc.
1. Build a data model
    - figure out a formula that represents what we are trying to learn
1. Present the results
    - visualize the results
1. Deploy and validate
    - create a site
    - publish findings

![data science workflow](https://raw.githubusercontent.com/generalassembly-studio/data-science-101-cwe-materials/master/curriculum/02-materials/code/data-science-workflow-example.jpg)

## Your Turn: Visualizing The Data Science Workflow

You are a junior data scientist at Amazon. Your boss asks you about the leading indicators that a user will make a new online purchase. How would you go about answering this question?

1. Identify the problem: What do you think are the indicators?
1. Acquire Data: What could we do first here? What are some considerations we should make?
1. Parse Data: How do you format the data so it is all the same?
1. Mine and Refine: What calculations or transformations do you recommend? How do you determine the presence of outliers?
1. Data Model: What attributes would you include in the modeling stage? How do you know if the model is performing well?
1. Present Results: Who is your audience? What is the best way to present your results?
1. Deploy and Validate: How would this be shared with the community? How will it be validated?
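
As one possible starting point for the "Mine and Refine" question about outliers, here is a minimal sketch using only Python's standard library. It computes a simple summary (mean and median) and flags values with the common 1.5 × IQR rule of thumb. The `order_totals` numbers and the `iqr_outliers` helper are made up for illustration; they are not part of the exercise, and other outlier rules may suit your data better.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order totals in dollars, invented for this sketch.
order_totals = [12.5, 15.0, 14.2, 13.8, 16.1, 15.5, 14.9, 250.0]

# Mine: collect simple summary statistics.
summary = {
    "mean": statistics.mean(order_totals),
    "median": statistics.median(order_totals),
}

# Refine: flag values that look unusual before deciding what to do with them.
outliers = iqr_outliers(order_totals)

print(summary)
print("Flagged as outliers:", outliers)
```

Comparing the mean with the median is a quick check in its own right: when a single large value pulls the mean far above the median, that is a hint to look for outliers. Whether to discard a flagged value is a judgment call that depends on the problem you identified in step 1.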