## ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas for EDA
by [@josephofiowa](https://twitter.com/josephofiowa)
 
<!---
This assignment was developed by Joseph Nelson

Questions? Comments?
1. Log an issue to this repo to alert me of a problem.
2. Suggest an edit yourself by forking this repo, making edits, and submitting a pull request with your changes back to our master branch.
3. Hit me up on Slack @sonylnagale
--->

# Pandas Unit Lab

**Woo!** We've made it to the end of our Pandas Unit. Let's put our skills to the test.

We're going to explore data from some of the top movies according to IMDB. This is a guided question-and-response lab: some prompts ask for something specific, and others are open-ended for you to explore.

In this lab, we will:
- Use `movie_app.py` to obtain relevant movie rating data
- Leverage Pandas to conduct exploratory data analysis, including:
    - Assess data integrity
    - Create exploratory visualizations
    - Produce insights on top actors/actresses across films
    
Let's get going!

## The Dataset

We'll work with a dataset of the [top-rated IMDB movies](https://www.imdb.com/search/title?count=100&groups=top_1000&sort=user_rating), ranked by IMDB user rating.


Specifically, we have a CSV that contains:
- IMDB star rating
- Movie title
- Year
- Content rating
- Genre
- Duration
- Gross

_[Details available at the above link]_


### Import our necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline

### Read in the dataset

First, read the dataset, `movies.csv`, into a DataFrame called `movies`. The file is in the `./data` folder.
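A minimal sketch of the read step, assuming the file has a header row. The in-memory CSV below is a made-up stand-in so the snippet runs anywhere; in the lab you would pass `"./data/movies.csv"` to `pd.read_csv` instead. The column names and rows are assumptions, not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for ./data/movies.csv.
# Column names and values are made up for illustration.
sample_csv = io.StringIO(
    "star_rating,title,genre\n"
    "9.1,Film A,Crime\n"
    "8.8,Film B,Drama\n"
)

# In the lab: movies = pd.read_csv("./data/movies.csv")
movies = pd.read_csv(sample_csv)
print(type(movies).__name__, movies.shape)
```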

## Check the dataset basics

Let's first explore our dataset to verify we have what we expect.

Print the first five rows.

How many rows and columns are in the dataset?

What are the column names?

How many unique genres are there?

How many movies are there per genre?
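The basic checks above might look like the sketch below. It uses a small made-up DataFrame, and the column names (`title`, `star_rating`, `genre`) are assumptions about `movies.csv`; substitute whatever `movies.columns` reports.

```python
import pandas as pd

# Made-up rows; column names are assumptions about movies.csv.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "star_rating": [9.1, 8.8, 8.7, 8.5],
    "genre": ["Drama", "Crime", "Drama", "Action"],
})

print(movies.head())            # first five rows (only four exist here)
print(movies.shape)             # (number of rows, number of columns)
print(movies.columns.tolist())  # column names as a plain list

n_genres = movies["genre"].nunique()        # count of distinct genres
per_genre = movies["genre"].value_counts()  # movies per genre, largest first
print(n_genres)
print(per_genre)
```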

## Only run the below cells if you've obtained an [API key!](http://www.omdbapi.com/apikey.aspx)<br>Otherwise, proceed to the `importing movies_rated.csv` section below.

### Obtain more data (with an API call)!

- Let's take advantage of our `OmdbAPI` module (stored in `./OmdbAPI.py`, if you'd like to look under the hood) to obtain movie ratings from the OMDB API. This will let us answer the question: **How do other publications' scores compare to IMDB ratings?** Specifically, where do Rotten Tomatoes critics most disagree with IMDB reviewers?
- Using the OmdbAPI module, we will obtain the `Internet Movie Database`, the `Rotten Tomatoes`, and the `Metacritic` reviews on the top rated IMDB movies. We will store these ratings in new columns in a new `movies_rated` DataFrame. We have also stored the file locally at `./movies_rated.csv`.

In [None]:
import OmdbAPI

In [None]:
# replace e54ad9e7 with your API key
# this may take a minute
movies_rated = OmdbAPI.Omdb(movies, 'e54ad9e7').get_ratings()

Just in case there were movies that the API was unable to get, let's drop nulls.

Let's get the ratings into the same float format using `apply` with some regular expressions. As a best practice, call `.copy()` when you read from and write to the same DataFrame.
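One way the null-dropping and cleaning steps could be sketched. The raw string formats (`"8.9/10"`, `"91%"`, `"80/100"`) and the helper `leading_number` are assumptions for illustration; check what `get_ratings()` actually returns before relying on this pattern.

```python
import re

import pandas as pd

# Made-up raw ratings; the column names and string formats are assumptions.
raw = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "Internet Movie Database": ["8.9/10", "8.5/10", None],
    "Rotten Tomatoes": ["91%", "85%", "70%"],
    "Metacritic": ["80/100", "77/100", "65/100"],
})


def leading_number(text):
    """Return the number at the start of a rating string as a float (hypothetical helper)."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", str(text))
    return float(match.group(1)) if match else float("nan")


# Drop rows with any missing rating, then copy so edits hit an independent frame.
movies_rated = raw.dropna().copy()
for col in ["Internet Movie Database", "Rotten Tomatoes", "Metacritic"]:
    movies_rated[col] = movies_rated[col].apply(leading_number)

print(movies_rated.dtypes)
```

This only extracts the leading number, so the columns stay on their original scales (out of 10 versus out of 100); rescaling would be a separate step.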

Finally, let's write the cleaned result to a local file so we don't have to call the API again and risk exceeding our daily limit.
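A sketch of the save step with a made-up DataFrame. It writes to a temporary directory to stay free of side effects; in the lab the target would presumably be `./movies_rated.csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned ratings standing in for the real movies_rated frame.
movies_rated = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "Rotten Tomatoes": [91.0, 85.0],
})

# Temp path for illustration; in the lab use "./movies_rated.csv".
out_path = os.path.join(tempfile.mkdtemp(), "movies_rated.csv")

# index=False avoids writing the row index as an extra unnamed column.
movies_rated.to_csv(out_path, index=False)

# Read it back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded)
```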

## Importing `movies_rated.csv`

If you just called the API in the previous section, you can skip this and proceed to the `exploratory data analysis` section.

Let's read in the cleaned, rated `movies_rated.csv` file, which was included with this repo just in case you couldn't call the API.

Check our datatypes. Notice anything potentially problematic?
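As an illustration of what a dtype check can surface, the sketch below uses a made-up frame in which a numeric column was stored as text. Whether the real file has this particular problem is an assumption; inspect `movies_rated.dtypes` yourself to see what applies.

```python
import pandas as pd

# Made-up frame: "gross" holds numbers formatted as text (an assumed problem).
movies_rated = pd.DataFrame({
    "star_rating": [8.9, 8.5],
    "gross": ["1,200,000", "950,000"],
})
print(movies_rated.dtypes)  # "gross" shows up as a text column, not a number

# Strip thousands separators, then convert; errors="coerce" turns
# anything unparseable into NaN instead of raising.
movies_rated["gross"] = pd.to_numeric(
    movies_rated["gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(movies_rated.dtypes)
```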

## Exploratory data analysis

Let's transition to asking and answering some questions with our data.

What are the top five R-Rated movies?

*hint: Boolean filters needed! Then sorting!*
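The filter-then-sort pattern from the hint could be sketched as follows, on made-up data. The column names `content_rating` and `star_rating` are assumptions.

```python
import pandas as pd

# Made-up rows; column names are assumptions about the real data.
movies_rated = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "star_rating": [8.4, 9.0, 8.7, 8.9],
    "content_rating": ["R", "PG-13", "R", "R"],
})

# Boolean mask: True for each row whose content rating is "R".
is_r = movies_rated["content_rating"] == "R"

# Keep only masked rows, sort best-first, and take up to five.
top_r = (
    movies_rated[is_r]
    .sort_values("star_rating", ascending=False)
    .head(5)
)
print(top_r)
```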

What is the average Rotten Tomatoes score for the top IMDB films?

What does the five-number summary look like for the IMDB ratings of the top films? Is the distribution skewed?

The average is *slightly* higher than the median, so there's a small positive skew.
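The mean-versus-median comparison can be checked with a sketch like this one. The ratings are made up, so the numbers say nothing about the real dataset; `describe()` on the real column gives a similar summary in one call.

```python
import pandas as pd

# Made-up ratings with one high value pulling the mean upward.
ratings = pd.Series([8.0, 8.1, 8.1, 8.2, 8.3, 8.5, 9.2], name="star_rating")

# Five-number summary: min, Q1, median, Q3, max.
five_num = ratings.quantile([0, 0.25, 0.5, 0.75, 1])
mean, median = ratings.mean(), ratings.median()

print(five_num)
print("mean above median:", mean > median)  # a rough sign of positive skew
```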

Create your own question...then answer it!

In [None]:
# correlation between star rating and Rotten Tomato rating?
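One possible answer to the correlation question, sketched on made-up numbers with assumed column names. `Series.corr` computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up ratings; column names are assumptions.
movies_rated = pd.DataFrame({
    "star_rating": [8.0, 8.2, 8.5, 8.9, 9.1],
    "Rotten Tomatoes": [70.0, 82.0, 80.0, 93.0, 95.0],
})

# Pearson correlation between the two rating columns.
r = movies_rated["star_rating"].corr(movies_rated["Rotten Tomatoes"])
print(round(r, 3))
```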


**Challenge:** Create a DataFrame holding the ratio of each film's IMDB rating to its Rotten Tomatoes rating. Which film has the highest IMDB : Rotten Tomatoes ratio? The lowest?

*[skip this if you are low on time]*
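A sketch of one approach to the challenge, using made-up data. It assumes IMDB ratings are on a 0-10 scale and Rotten Tomatoes on 0-100, and that the columns are named as shown. Note that a Rotten Tomatoes score of 0 would produce an infinite ratio.

```python
import pandas as pd

# Made-up ratings; scales and column names are assumptions.
movies_rated = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "star_rating": [8.5, 8.0, 9.0],          # assumed 0-10 scale
    "Rotten Tomatoes": [50.0, 100.0, 90.0],  # assumed 0-100 scale
})

ratios = movies_rated[["title"]].copy()
# Bring Rotten Tomatoes onto a 0-10 scale before dividing.
ratios["imdb_to_rt"] = movies_rated["star_rating"] / (movies_rated["Rotten Tomatoes"] / 10)

# idxmax/idxmin return the row label of the extreme value.
highest = ratios.loc[ratios["imdb_to_rt"].idxmax(), "title"]
lowest = ratios.loc[ratios["imdb_to_rt"].idxmin(), "title"]
print(highest, lowest)
```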

## Exploratory data analysis with visualizations

For each of these prompts, create a plot to visualize the answer. Consider what plot is *most appropriate* to explore the given prompt.


What is the relationship between IMDB ratings and Rotten Tomatoes ratings?

What is the relationship between IMDB rating and movie duration?

How many movies are there in each genre category? (Remember to create a plot here)

What does the distribution of Rotten Tomatoes ratings look like?
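The four prompts above could be approached with plot types like those in this sketch: scatter plots for relationships between two numeric columns, a bar chart for counts per category, and a histogram for a distribution. The data is made up and the column names are assumptions. In a notebook with `%matplotlib inline`, the figure renders automatically.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; column names are assumptions about the real data.
movies_rated = pd.DataFrame({
    "star_rating": [8.0, 8.2, 8.5, 8.9, 9.1, 8.3],
    "Rotten Tomatoes": [70.0, 82.0, 80.0, 93.0, 95.0, 88.0],
    "duration": [95, 120, 142, 175, 130, 110],
    "genre": ["Drama", "Crime", "Drama", "Action", "Crime", "Drama"],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Two numeric variables: scatter plots.
movies_rated.plot.scatter(x="star_rating", y="Rotten Tomatoes", ax=axes[0, 0])
movies_rated.plot.scatter(x="duration", y="star_rating", ax=axes[0, 1])

# Counts per category: bar chart of value_counts().
genre_counts = movies_rated["genre"].value_counts()
genre_counts.plot.bar(ax=axes[1, 0])

# Distribution of one numeric variable: histogram.
movies_rated["Rotten Tomatoes"].plot.hist(bins=5, ax=axes[1, 1])

fig.tight_layout()
```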

## Bonus

There are many things left unexplored! Consider investigating something about gross revenue and genres.

In [None]:
# histogram of gross sales

In [None]:
# top 10 grossing films
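A starting point for the gross-revenue questions, on made-up values. The `gross` column name is an assumption and its units are unspecified. `nlargest` is a shortcut for sorting in descending order and taking the first rows.

```python
import pandas as pd

# Made-up rows; "gross" values are invented and their units unspecified.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genre": ["Action", "Drama", "Action", "Drama"],
    "gross": [300.0, 50.0, 120.0, 80.0],
})

# Up to ten rows with the largest gross, highest first.
top_grossing = movies.nlargest(10, "gross")

# Median gross per genre, one way to compare genres.
median_by_genre = movies.groupby("genre")["gross"].median()

print(top_grossing[["title", "gross"]])
print(median_by_genre)
```

For the histogram, `movies["gross"].plot.hist()` would be the analogous one-liner.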