Next Steps in Data Science

---

## Learning Objectives

*After this lesson, you will be able to:*

- Identify core libraries in the data science ecosystem.
- Determine how to learn more about the areas that interest you most.
- Discuss hiring in the data science job market and strategies to support a search.

---

## Celebrate

Reflect for a moment - you've:

- Learned the fundamentals of Python, from data types to object-oriented programming.
- Used your first API to build a simple application.
- Applied Pandas to synthesize insights from datasets.

That's a lot! It deserves a huge congratulations.

---

## Discussion: Introspection

- What did you enjoy most?
- What did you find most intriguing?
- What do you want to know more about?
- What caused the most struggle?

This isn't just a feel-good exercise. It helps inform your future data science growth!

---

## Revisiting the data science process

It's important to place our Pandas work into the broader picture of data science. To do so, recall our data science workflow:

![](https://s3.amazonaws.com/ga-instruction/assets/python-fundamentals/Data-Framework-White-BG.png)

---

## Discussion: Condensed Workflow

1. **Identify** the problem
2. **Acquire** the right data
3. **Parse** the data
4. **Mine** our data
5. **Refine** our data
6. **Build** a model
7. **Present** our work

**Class Question**: Where have we focused our work?

---

## Where we focused

1. Identify the problem
2. Acquire the right data
3. **Parse the data. We did this!** Remember the Adventure Works Production.Product dictionary? Did you revisit IMDB's source to understand any columns?
4. **Mine our data. We did this!** We ran subpopulation analyses and, perhaps, feature creation: we filtered to a specific county and potentially created our own IMDB vs. Rotten Tomatoes metrics.
5. **Refine our data. We did this!** We mutated our data using the `.apply()` method to modify the prices and colors of products.
6. Build a model
7. Present our work

---

## Where we did a bit

1. **Identify the problem. We did a bit!** You identified your own question about the IMDB data and answered it.
2. **Acquire the right data. We did a bit!** We used the OMDBApi to obtain Rotten Tomato data for our IMDB dataset.
3. Parse the data
4. Mine our data
5. Refine our data
6. Build a model
7. **Present our work. We did a bit!** Maintaining clean Jupyter Notebooks (right?) and creating takeaway visualizations.

***Whew***! We did cover a lot of ground!

---

## Where we didn't focus

1. Identify the problem
2. Acquire the right data
3. Parse the data
4. Mine our data
5. Refine our data
6. **Build a model. We never did this!**
7. Present our work

> "Hey! I thought that's all data science is! Machine learning, artificial intelligence, neural networks [on the blockchain]!"

---

## The truth about data science (shh)

- Exploratory data analysis is typically **80%** of a data science problem.
- Modeling is **20%**.

What's more: the steps you take in EDA to set up your models ultimately have an outsized impact on the results you achieve.

---

## Apologies in advance for this one

![](https://s3.amazonaws.com/ga-instruction/assets/python-fundamentals/what-i-really-ds.png)

---

## Exceptions

- Many companies structure teams so that some individuals focus 100% of their time on the 20% of the problem solved by modeling.
- We've focused on Pandas EDA - the area where you can make the greatest impact.

---

## Python Data Science Package Ecosystem

We know Pandas! Awesome! It handles:

- Reading in data.
- Exploratory data analysis.
- Munging and wrangling.
- Visualization via matplotlib.

What else is there?

---

## Recommended Libraries for DS

Once you're comfortable with Pandas...

- **Seaborn:**
  - Creates visualizations of greater complexity than Pandas, with a few lines of code, via `matplotlib`.
- **NumPy:**
  - Numerical computation, particularly linear algebra.
- **SciPy:**
  - Scientific computation, especially statistics.
- **Requests:**
  - Making web requests - calling APIs!
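As a tiny taste of NumPy's linear algebra, here's a minimal sketch (the system and numbers are made up for illustration):

```python
import numpy as np

# A small linear system Ax = b - the kind of linear algebra NumPy handles
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve for x in one call
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
```

One call replaces the by-hand elimination you'd otherwise write yourself; the same function scales to systems with thousands of unknowns.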
---

## Other DS Libraries

Not as ubiquitous or popular, but still good:

- **BeautifulSoup:**
  - Easily parse HTML.
- **Statsmodels:**
  - Traditional statistical inference techniques, like linear regression.
- **Scikit-learn:**
  - All-purpose machine learning model construction.
- **NLTK** | **SpaCy:**
  - Natural language processing.
- **TensorFlow** | **PyTorch** | **MXNet:**
  - Neural network research and model construction.
- **PySpark:**
  - Interacting with big data.

---

## Discussion: What-for-what?

At what step would each library be most helpful? The data science steps:

- **Identify** the problem
- **Acquire** the right data
- **Parse** the data
- **Mine** our data
- **Refine** our data
- **Build** a model
- **Present** our work

---

## Discussion: What-for-what?

Match up these libraries:

- **Pandas:** reading in data, exploratory data analysis, munging, wrangling, and visualization via matplotlib.
- **Seaborn:** creating visualizations of greater complexity with a few lines of code, via matplotlib.
- **Requests:** making web requests.
- **NumPy:** numerical computation, particularly linear algebra.
- **SciPy:** scientific computation, especially statistics.

---

## Learning More - How?

- Learn by doing.
- Learning requires consuming and producing (perhaps even in a 50/50 balance).
  - Consume relevant content about what you want to learn (videos, books, etc.).
  - Produce frequent **projects** and **exercises** to practice.

---

## Learning More - Where?

There's an abundance of resources, which can seem overwhelming, but it's actually a huge benefit.

For self-paced and online programs about a specific area, consider:

- DataCamp
- DataQuest
- Coursera

For instructor-led and guided education, come on back to General Assembly! We have expert-led workshops and courses in data science:

- A 10-week, part-time data science course (60 hours).
- The Data Science Immersive, a full-time, three-month program (480 hours).

These classes walk through the full data science lifecycle.
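---

## Example: SciPy for Statistics

To make the SciPy matching above concrete, here's a minimal, hypothetical sketch of a two-sample t-test - all data is invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical ratings from two groups of reviewers (invented data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=7.0, scale=1.0, size=50)
group_b = rng.normal(loc=7.5, scale=1.0, size=50)

# SciPy's two-sample t-test: do the groups' mean ratings differ?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest the two groups really do rate differently; this is exactly the kind of inference question SciPy handles at the Build-a-model step.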
---

## Stretchhhh

![](https://s3.amazonaws.com/ga-instruction/assets/python-fundamentals/panda-lying-down.jpeg)

- Stand up, stretch a bit.
- Or lie down!
- I'm not a cop.

---

## What Do You Really Need?

Data scientists need three core skills:

- **Analytical thinking**
- **Mathematics and statistics proficiency**
- **Coding ability**

Let's break these down.

---

## Analytical thinking

- How well can you structure a data science problem and target an analysis for high-impact output?
- Do you select metrics that align with those goals?
- Do you break a big problem into manageable, component parts?

**Class Question:**

- Imagine you are a data scientist at Facebook.
- Users list high schools they attended - some real, some fake.

How could you verify that a given high school a user listed is the one they attended? How would you measure success?

---

## Mathematics and statistics proficiency

Can you apply fundamental math and statistics to problem solving? Do you have a firm understanding of probability? Linear algebra?

**Class Question:**

- There are 52 cards in a deck: 26 red and 26 black.
- The 52 cards make up four suits (hearts, diamonds, spades, clubs).
- There are 13 cards in each suit (ace through 10, jack, queen, king).
- It is a fair deck of cards.

What is the probability of drawing the 4 of spades OR a club? What is the probability of drawing any 3 OR a spade?

---

## Coding ability

- Can you write readable, maintainable, efficient code?
- Can you translate your thinking skills into programmatic thinking?
- Do you know Python, R, SQL, and/or Scala? *(Yes, you do!)*

**Question:** Do you recall FizzBuzz? Try writing it again here from scratch. Open a new Python file, `fizz.py`.

- Write a program that prints the numbers from 1 to `n` (passed in).
- For multiples of three, print "Fizz" instead of the number.
- For multiples of five, print "Buzz".
- For numbers which are multiples of both three and five, print "FizzBuzz".

---

## Establishing Yourself as a Data Scientist

1. Start a blog.
   - Blogs are incredibly common in technology.
   - They demonstrate your learning process.
2. Share with your network.
   - Keep your friends and coworkers engaged with what you're doing and learning.
   - Opportunities sometimes arise serendipitously.
3. Attend Meetups and other networking opportunities to learn, meet, and share.

---

## Summary

- There are many paths you can take!
- Check the Additional Reading for links to libraries.
  - You probably want Seaborn, NumPy, or SciPy.
- Work on your core skills!
  - Analytical thinking.
  - Mathematics and statistics proficiency.
  - Coding ability.

---

## Additional Reading

- [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/#)
- [Seaborn docs](https://seaborn.pydata.org/)
- [Requests docs](http://docs.python-requests.org/en/master/)
- [NumPy tutorial](https://docs.scipy.org/doc/numpy-1.13.0/user/index.html)
- [SciPy tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html)
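---

## Appendix: A FizzBuzz Sketch

If you want to check your `fizz.py` against a reference after attempting the exercise, here's one possible solution (just a sketch; many variants work):

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:      # multiple of both three and five
            out.append("FizzBuzz")
        elif i % 3 == 0:     # multiple of three
            out.append("Fizz")
        elif i % 5 == 0:     # multiple of five
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print("\n".join(fizzbuzz(15)))
```

Note the order of the checks: testing `i % 15` first matters, because if `% 3` or `% 5` ran first, the "FizzBuzz" branch would never be reached.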