Movie Notebook
Movie Notebook
Movie Notebook
Overview
I have been tasked with assisting Microsoft in their venture into the movie industry. My goal was to explore what
type of films are currently doing the best at the box office and to provide these findings to Microsoft's new
movie studio executives. My analysis of the movie industry, achieved by garnering data and utilizing descriptive
statistics and visualizations, has shown that a larger budget is correlated with a higher worldwide box office
gross. By allocating 75 million to 200 million dollars to produce an animated musical movie released in June or
November, or allocating 200 to 400 million dollars to produce a live action super hero movie released in April or
May, the data shows that a movie studio will be extremely likely to succeed. I have also given recommendations
as to which composers should be hired for an animated musical movie, and which directors should be hired for
a super hero movie. Microsoft can use this report to target their production budget, genre, creative type,
production method, release-time, and crew members of their upcoming movie endeavors to generate the
highest amount of revenue possible.
Business Problem
I have been informed that Microsoft wants a piece of the multi-billion dollar movie-making industry, but that they
are unsure of where to begin. The challenge for their new movie studio is that they are ready to jump into the
industry but do not have the necessary knowledge to move forward. To assist them with this goal, I have been
looking at the movies that performed highest in worldwide box office amounts for the past ten years. By
analyzing the movies that have been most successful recently, I can make recommendations about attributes
that Microsoft's movies should have in order to achieve the highest revenue. I have based my analysis on four
main factors:
Movie Type (Genre/Creative Type/Production Method): What types of movie content are currently most
successful?
Release Month: When is the most lucrative time of year to release a movie?
Production Budget: What budget amount tends to achieve the highest box office gross?
Additional Attributes: Based on these findings, what else do top-grossing movies have in common?
I chose these questions after considering the business problem and combing through the data I obtained. I have
determined that the answers to these questions are integral to the steps that should be taken when considering
how to produce the most profitable movie in today's world.
Data Understanding
I utilized three different data sources for my analysis in order to have the most comprehensive view of the
industry as it currently is.
OpusData Movie Data: a free dataset available upon request for academic research, comprised of 1,900
movies with a production year from 2006 to 2018, with a production budget greater than or equal to ten
million dollars. This dataset contains values for movie name, production budget, domestic and international
gross, genre, production method, runtime, and movie board rating.
Web-scraped data from The-Numbers.com: The Numbers is described as "the premier provider of movie
industry data and research services". This website contains domestic, international, and worldwide box
office revenue amounts per movie, and allows filtering and ordering of results based on many different
criteria. Some of the criteria provided on this site that I found especially useful were the same criteria listed
above: title, production budget, domestic and international gross, genre, and production method, in
addition to release date and worldwide gross. For the purposes of this project, I generated and scraped
reports for the top 100 movies per year, in terms of revenue, from 2010 to 2020.
The Movie Database (TMDb) API: The Movie Database is a popular database for movies and TV shows.
Their API is a system made freely available for data acquisition. There is a very large amount of data
available on TMDb; for the purposes of this project, I used it mainly to fill in missing information from my
other two datasets as I moved through my analysis.
In [2]: # importing the packages I will be using for this project
import pandas as pd
# setting pandas display to avoid scientific notation in my dataframes
pd.options.display.float_format = '{:.2f}'.format
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import requests
%matplotlib inline
Out[3]:
movie_name production_year movie_odid production_budget domestic_box_office internation
Madea's
0 Family 2006 8220100 10000000 63257940
Reunion
End of the
2 2006 34620100 10000000 11748661
Spear
A Prairie
3 Home 2006 24910100 10000000 20342852
Companion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1936 entries, 0 to 1935
Data columns (total 13 columns):
movie_name 1936 non-null object
production_year 1936 non-null int64
movie_odid 1936 non-null int64
production_budget 1936 non-null int64
domestic_box_office 1936 non-null int64
international_box_office 1936 non-null int64
rating 1913 non-null object
creative_type 1923 non-null object
source 1915 non-null object
production_method 1925 non-null object
genre 1926 non-null object
sequel 1934 non-null float64
running_time 1822 non-null float64
dtypes: float64(2), int64(5), object(6)
memory usage: 196.8+ KB
2010
My first step was to obtain data for the top 100 grossing movies of 2010. I did this by building a report on The-
Numbers.com, scraping this data, and reading it into a pandas DataFrame.
In [8]: # url for the full customized report of top 100 movies for 2010
url = f"https://www.the-numbers.com/movies/report/All/All/All/All/All/Al
l/All/All/All/None/None/2010/2010/None/None/None/None/None/None?show-rel
ease-date=On&view-order-by=domestic-box-office&show-release-year=On&view
-order-direction=desc&show-production-budget=On&show-domestic-box-office
=On&show-international-box-office=On&show-worldwide-box-office=On&show-g
enre=On&show-production-method=On&show-creative-type=On"
response = requests.get(url)
# creating soup
soup = BeautifulSoup(response.text, 'lxml')
# finding table containing report info
table = soup.find('table')
# converting html of table into a string
table_string = f"""{table}"""
# reading html string into pandas
table_read = pd.read_html(table_string)
# converting into DataFrame
numbers_2010 = table_read[0]
# previewing DataFrame
numbers_2010.head()
Out[8]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Pro
0
Jun 18,
0 1 2010 Toy Story 3 Adventure Digital Animation Kids Fiction
2010
May 7,
2 3 2010 Iron Man 2 Action Live Action Super Hero
2010
The Twilight
Jun 30,
3 4 2010 Saga: Drama Live Action Fantasy
2010
Eclipse
Harry
Nov 19, Potter and Animation/Live
4 5 2010 Adventure Fantasy
2010 the Deathly Action
Hallows:…
Now that I had a DataFrame to work with, I was able to start running some summary statistics and exploring the
data.
In [9]: # getting info for DataFrame
numbers_2010.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
Unnamed: 0 100 non-null int64
Released 100 non-null object
Released.1 100 non-null int64
Title 100 non-null object
Genre 100 non-null object
ProductionMethod 100 non-null object
CreativeType 100 non-null object
ProductionBudget 100 non-null object
DomesticBox Office 100 non-null object
InternationalBox Office 100 non-null object
WorldwideBox Office 100 non-null object
dtypes: int64(2), object(9)
memory usage: 8.7+ KB
Out[10]:
Unnamed: 0 Released.1
In [11]: # retrieving data type for domestic box office column values
numbers_2010['DomesticBox Office'].dtype
Out[11]: dtype('O')
I noted that the describe method for this DataFrame was not very helpful at this point because my dollar
amounts for domestic, international, and worldwide gross were pulled as objects (not floats or integers). I knew
that would require further adjusting in the next stage.
In [12]: # getting value counts for genre column
numbers_2010.Genre.value_counts()
Out[12]: Adventure 21
Drama 18
Comedy 16
Action 13
Thriller/Suspense 11
Horror 9
Romantic Comedy 8
Western 1
Documentary 1
Black Comedy 1
Musical 1
Name: Genre, dtype: int64
Adventure and Drama were the most common movie genres for top grossing movies in 2010.
For 2010 production methods, Live Action was by far the most common, with 85% of the top grossing movies
being of this type.
Contemporary Fiction was the most common creative type by far for the top movies made in 2010.
Since I knew I'd be scraping data for each year in the exact same way as above, I created a function to speed
up the process.
In [15]: def number_scraper(year):
"""
Scrapes 100 top-grossing movies from The-Numbers.com.
Parameters:
year (int): user input 4-digit year for movie gross to be scraped.
Returns:
numbers_df (Pandas DataFrame): A dataframe populated with movie gros
s table values.
"""
# url for the full customized report of top 100 movies for release y
ears in range listed
url = f"https://www.the-numbers.com/movies/report/All/All/All/All/Al
l/All/All/All/All/None/None/{year}/{year}/None/None/None/None/None/None?
show-release-date=On&view-order-by=domestic-box-office&show-release-year
=On&view-order-direction=desc&show-production-budget=On&show-domestic-bo
x-office=On&show-international-box-office=On&show-worldwide-box-office=O
n&show-genre=On&show-production-method=On&show-creative-type=On"
response = requests.get(url)
# creating soup
soup = BeautifulSoup(response.text, 'lxml')
# finding table
table = soup.find('table')
# converting html of table into string
table_string = f"""{table}"""
# reading html string into pandas
table_read = pd.read_html(table_string)
# converting into DataFrame
numbers_df = table_read[0]
return numbers_df
2011
In [16]: # scraping 2011 data and compiling into DataFrame
numbers_2011 = number_scraper(2011)
#previewing DataFrame
numbers_2011.head()
Out[16]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType P
0
Harry Potter
Jul 15, and the Animation/Live
0 1 2011 Adventure Fantasy
2011 Deathly Action
Hallows:…
Transformers:
Jun 29, Animation/Live Science
1 2 2011 Dark of the Action
2011 Action Fiction
Moon
The Twilight
Nov 18, Saga:
2 3 2011 Drama Live Action Fantasy
2011 Breaking
Dawn, Part 1
The
May 26, Contemporary
3 4 2011 Hangover Comedy Live Action
2011 Fiction
Part II
Pirates of the
May 20, Caribbean:
4 5 2011 Adventure Live Action Fantasy
2011 On
Stranger…
2012
Out[17]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeTyp
0
The Dark
Jul 20,
1 2 2012 Knight Action Live Action Super He
2012
Rises
Nov 8, Contempora
3 4 2012 Skyfall Action Live Action
2012 Fictio
The Hobbit:
Dec 14, An Animation/Live
4 5 2012 Adventure Fanta
2012 Unexpected Action
Journey
2013
Out[18]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Prod
0
The
Hunger
Nov 22, Science
0 1 2013 Games: Adventure Live Action
2013 Fiction
Catching
Fire
May 3, Animation/Live
1 2 2013 Iron Man 3 Action Super Hero
2013 Action
Nov 22,
2 3 2013 Frozen Musical Digital Animation Kids Fiction
2013
Jul 3, Despicable
3 4 2013 Adventure Digital Animation Kids Fiction
2013 Me 2
2014
Out[19]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType
0
The
Hunger
Nov 21, Science
1 2 2014 Games: Thriller/Suspense Live Action
2014 Fiction
Mockingjay
- Part 1
Guardians
Aug 1, Animation/Live
2 3 2014 of the Action Super Hero
2014 Action
Galaxy
Captain
Apr 4, America:
3 4 2014 Action Live Action Super Hero
2014 The Winter
Soldier
Out[20]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Prod
0
Star Wars
Dec 18, Ep. VII: Animation/Live Science
0 1 2015 Adventure
2015 The Force Action Fiction
Awakens
Avengers:
May 1, Animation/Live
2 3 2015 Age of Action Super Hero
2015 Action
Ultron
Apr 3, Contemporary
4 5 2015 Furious 7 Action Live Action
2015 Fiction
2016
In [21]: # scraping 2016 data and compiling into DataFrame
numbers_2016 = number_scraper(2016)
# previewing the DataFrame
numbers_2016.head()
Out[21]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Produc
0
Rogue
One: A
Dec 16, Animation/Live Science
0 1 2016 Star Adventure $
2016 Action Fiction
Wars
Story
Captain
May 6,
2 3 2016 America: Action Live Action Super Hero $
2016
Civil War
The
Jul 8, Secret
3 4 2016 Adventure Digital Animation Kids Fiction
2016 Life of
Pets
The
Apr 15, Animation/Live
4 5 2016 Jungle Adventure Fantasy $
2016 Action
Book
2017
Out[22]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Produ
0
Star Wars
Dec 15, Ep. VIII: Animation/Live Science
0 1 2017 Adventure
2017 The Last Action Fiction
Jedi
Beauty
Mar 17, Animation/Live
1 2 2017 and the Musical Fantasy
2017 Action
Beast
Jun 2, Wonder
2 3 2017 Action Live Action Super Hero
2017 Woman
Jumanji:
Dec 20, Welcome Science
3 4 2017 Adventure Live Action
2017 to the Fiction
Jungle
Guardians
May 5, of the Animation/Live
4 5 2017 Action Super Hero
2017 Galaxy Action
Vol 2
2018
Out[23]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Prod
0
Avengers:
Apr 27, Animation/Live
1 2 2018 Infinity Action Super Hero
2018 Action
War
Jurassic
Jun 22, World: Science
3 4 2018 Action Live Action
2018 Fallen Fiction
Kingdom
2019
Out[24]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Produ
0
Star
Dec 20, Wars: The Animation/Live Science
2 3 2019 Adventure
2019 Rise of Action Fiction
Skywalker
Nov 22,
3 4 2019 Frozen II Adventure Digital Animation Kids Fiction
2019
Out[25]:
Unnamed:
Released Released.1 Title Genre ProductionMethod CreativeType Pro
0
Birds of
Prey (And
Feb 7,
2 3 2020 the Action Live Action Super Hero
2020
Fantabulous
Em…
Data Preparation
Now that my data was scraped ready to go, it was time to clean it up and prepare it for analysis.
Cleaning
There were a few columns in this dataset that were not relevant to my analysis: 'movie_odid', 'source', 'sequel',
'running-time', and 'rating'. I began by dropping those.
I then renamed some of the column names to make them easier to work with.
Cleaning
Due to the fact that I compiled my data from tables that were completely filled in, I was pleased that I had no
null values to deal with. I did, however, have an unnecessary column called 'Unnamed: 0', which was used as
the index of each table in its original html format. I began by dropping this.
In [30]: numbers_2010.isnull().sum()
Out[30]: Unnamed: 0 0
Released 0
Released.1 0
Title 0
Genre 0
ProductionMethod 0
CreativeType 0
ProductionBudget 0
DomesticBox Office 0
InternationalBox Office 0
WorldwideBox Office 0
dtype: int64
I then made all the column names lowercase for ease of use.
I also wanted to rename some of the columns to make them more comprehensive.
In [33]: # renaming columns
numbers_2010 = numbers_2010.rename(columns = {'released':'release_date',
'released.1':'release_year', 'productionmethod':'prod_method', 'domestic
box office':'dom_gross', 'internationalbox office':'int_gross', 'worldwi
debox office':'world_gross', 'creativetype': 'creative_type', 'productio
nbudget': 'budget'})
# showing all column names
numbers_2010.columns
Finally, I wanted to convert the dollar amounts to numbers I could actually work with.
Out[34]:
release_date release_year title genre prod_method creative_type budget do
Digital
0 Jun 18, 2010 2010 Toy Story 3 Adventure Kids Fiction 200000000 41
Animation
Alice in Animation/Live
1 Mar 5, 2010 2010 Adventure Fantasy 200000000 33
Wonderland Action
2 May 7, 2010 2010 Iron Man 2 Action Live Action Super Hero 170000000 31
The Twilight
3 Jun 30, 2010 2010 Saga: Drama Live Action Fantasy 68000000 30
Eclipse
Harry
Potter and Animation/Live
4 Nov 19, 2010 2010 Adventure Fantasy 125000000 29
the Deathly Action
Hallows:…
Since this is how I intended to clean all my DataFrames, I wrote a function to execute all the steps I had taken.
In [35]: def clean(df):
"""
Cleans and modifies a given dataframe according to criteria set for
this particular project.
Parameters:
df (Pandas DataFrame): user input dataframe based on previously scra
ped table values from The-Numbers.com.
Returns:
df (Pandas DataFrame): A dataframe cleaned and adjusted as per crite
ria listed above.
"""
# drop 'Unnamed' column
df = df.drop(columns='Unnamed: 0')
# make column names lowercase
df.columns = [x.lower() for x in df.columns]
# rename certain columns
df = df.rename(columns = {'released':'release_date', 'released.1':'r
elease_year', 'productionmethod':'prod_method',
'domesticbox office':'dom_gross', 'interna
tionalbox office':'int_gross', 'worldwidebox office':'world_gross',
'creativetype':'creative_type', 'productio
nbudget': 'budget'})
# removing dollar signs and commas from dollar amounts
# converting dollar amounts from strings into integers
df['dom_gross'] = df['dom_gross'].str.replace(',', '').str.replace(
'$', '').astype(int)
df['int_gross'] = df['int_gross'].str.replace(',', '').str.replace(
'$', '').astype(int)
df['world_gross'] = df['world_gross'].str.replace(',', '').str.repla
ce('$', '').astype(int)
df['budget'] = df['budget'].str.replace(',', '').str.replace('$', ''
).astype(int)
return df
In [36]: # cleaning data
numbers_2011 = clean(numbers_2011)
# previewing cleaned data
numbers_2011.head()
Out[36]:
release_date release_year title genre prod_method creative_type budget d
Harry Potter
and the Animation/Live
0 Jul 15, 2011 2011 Adventure Fantasy 125000000
Deathly Action
Hallows:…
Transformers:
Animation/Live Science
1 Jun 29, 2011 2011 Dark of the Action 195000000
Action Fiction
Moon
The Twilight
Saga:
2 Nov 18, 2011 2011 Drama Live Action Fantasy 127500000
Breaking
Dawn, Part 1
The
Contemporary
3 May 26, 2011 2011 Hangover Comedy Live Action 80000000
Fiction
Part II
Pirates of the
Caribbean:
4 May 20, 2011 2011 Adventure Live Action Fantasy 379000000
On
Stranger…
Combining
Now that all the data had been cleaned, I was ready to combine all my DataFrames into a single DataFrame for
further analysis.
In [46]: # concatenating my DataFrames for box office data from years 2015-2020
numbers_df = pd.concat([numbers_2010, numbers_2011, numbers_2012, number
s_2013,
numbers_2014, numbers_2015, numbers_2016, number
s_2017,
numbers_2018, numbers_2019, numbers_2020], axis=
0, ignore_index=True)
# getting info on my new DataFrame
numbers_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 10 columns):
release_date 1100 non-null object
release_year 1100 non-null int64
title 1100 non-null object
genre 1099 non-null object
prod_method 1100 non-null object
creative_type 1100 non-null object
budget 1100 non-null int64
dom_gross 1100 non-null int64
int_gross 1100 non-null int64
world_gross 1100 non-null int64
dtypes: int64(5), object(5)
memory usage: 86.1+ KB
At this point I realized that, somehow, one singular null value had snuck in there.
Out[47]:
release_date release_year title genre prod_method creative_type budget dom_gross
Science
1091 Jul 10, 2020 2020 Archive NaN Live Action 0 139593
Fiction
Since it was just one null value for genre for the movie 'Archive', it was easy enough to perform a quick search
on IMDb. The primary genre listed for this movie is Drama, so I filled in the value manually.
In [48]: # for movie with title value Archive, set genre to Drama
numbers_df.loc[numbers_df['title']=='Archive', 'genre'] = 'Drama'
Out[49]:
release_date release_year title genre prod_method creative_type budget dom_gross
Science
1091 Jul 10, 2020 2020 Archive Drama Live Action 0 139593
Fiction
Feature Engineering
I was curious about how the timing of a release impacted the revenue. So my next step was to create a column
for the month that each movie was released in.
Out[56]: (1936, 8)
In [58]: # filter DataFrame to only include movies with a production year greater
than or equal to 2010 (or null)
merged_df = merged_df[(merged_df['production_year']>=2010.00) | (merged_
df['production_year'].isnull())]
Out[60]:
release_date release_year title genre prod_method creative_type budget do
Digital
0 2010-06-18 2010.00 Toy Story 3 Adventure Kids Fiction 200000000 41
Animation
Alice in Animation/Live
1 2010-03-05 2010.00 Adventure Fantasy 200000000 33
Wonderland Action
2 2010-05-07 2010.00 Iron Man 2 Action Live Action Super Hero 170000000 31
The Twilight
3 2010-06-30 2010.00 Saga: Drama Live Action Fantasy 68000000 30
Eclipse
Harry
Potter and Animation/Live
4 2010-11-19 2010.00 Adventure Fantasy 125000000 29
the Deathly Action
Hallows:…
Cleaning Duplicates
In [62]: # generating rows where movies are duplicates in terms of title
merged_df.loc[merged_df.duplicated(subset=['title'])]
Out[62]:
release_date release_year title genre prod_method creative_type budget
Harry
Potter and
Animation/Live
100 2011-07-15 2011.00 the Adventure Fantasy 125000000
Action
Deathly
Hallows:…
Robin Historical
887 2018-11-21 2018.00 Action Live Action 99000000
Hood Fiction
Contemporary
1076 2020-10-02 2020.00 The Call Horror Live Action 0
Fiction
Midnight in Romantic
1858 NaT nan Live Action Fantasy 30000000
Paris Comedy
A
Nightmare
1866 NaT nan Horror Live Action Fantasy 35000000
on Elm
Street
Robin Historical
2432 NaT nan Action Live Action 99000000
Hood Fiction
Mary
2435 NaT nan Poppins Musical Live Action Kids Fiction 130000000
Returns
The
Nutcracker
2436 NaT nan and the Adventure Live Action Fantasy 132900000
Four
Realms
2437 NaT nan Aquaman Action Live Action Super Hero 160000000
Ralph
Breaks Digital
2438 NaT nan Adventure Kids Fiction 175000000
The Animation
Internet
In [63]: merged_df.loc[merged_df['title']=='Aquaman']
Out[63]:
release_date release_year title genre prod_method creative_type budget dom_
Animation/Live
804 2018-12-21 2018.00 Aquaman Action Super Hero 160000000 3350
Action
2437 NaT nan Aquaman Action Live Action Super Hero 160000000 3338
In [64]: merged_df.loc[merged_df['title']=='Ralph Breaks The Internet']
Out[64]:
release_date release_year title genre prod_method creative_type budget dom
Ralph
Breaks Digital
813 2018-11-21 2018.00 Adventure Kids Fiction 175000000 201
The Animation
Internet
Ralph
Breaks Digital
2438 NaT nan Adventure Kids Fiction 175000000 200
The Animation
Internet
Out[65]:
release_date release_year title genre prod_method creative_type budget dom
Midnight Romantic
157 2011-05-20 2011.00 Live Action Fantasy 30000000 568
in Paris Comedy
Midnight Romantic
1858 NaT nan Live Action Fantasy 30000000 568
in Paris Comedy
After reviewing many of the duplicates, I was able to determine that if a movie was showing up as a duplicate in
terms of title, it was truly a duplicate entry of the same movie. I wanted to give priority to the data that had come
from numbers_df since it was more current (and had more non-null values) than the data that had come from
opus_df. So I opted to keep the first instance of each duplicate and drop the rest.
In [68]: # creating a list of titles that did not have an associated release_date
value
dateless_titles = merged_df['title'].loc[merged_df['release_date'].isnul
l()]
I created a list of the movie titles that originated from the Opus dataset which did not contain release date
information. I then ran a function to query The Movies Database (TMDb) via API for each specific title and pull
the release date for the title.
Parameters:
title (str): user input movie title.
Returns:
df (Pandas DataFrame): A dataframe cleaned and adjusted as per crite
ria listed above.
"""
title_r = title.replace(' ', '+')
url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&
query={title_r}"
response = requests.get(url)
if len(response.json()['results']) > 0:
rdate = (response.json()['results'][0]['release_date'])
if rdate:
x = datetime.strptime(rdate, '%Y-%m-%d').strftime('%b %d, %
Y')
merged_df.loc[merged_df['title']==title, 'release_date'] = x
else:
pass
In [72]: # getting release dates for list of titles lacking release dates
for title in dateless_titles:
get_date(title)
Out[73]: release_date 6
release_year 474
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 474
release_month 474
production_year 502
dtype: int64
My get_date function successfully took care of almost all the null values for release_date! I had a look at the few
null values that remained.
In [74]: # show rows that contain null values for release date
merged_df[merged_df['release_date'].isnull()]
Out[74]:
release_date release_year title genre prod_method creative_type bu
San cheng
2160 NaT nan Drama Live Action Dramatization 1200
ji
Savva.
Digital
2202 NaT nan Serdtse Adventure Kids Fiction 3000
Animation
voyna
Baahubali
Historical
2282 NaT nan 2: The Action Live Action 3000
Fiction
Conclusion
Contemporary
2345 NaT nan Jìyì dàshī Thriller/Suspense Live Action 2000
Fiction
Based on the fact that all these movies had many columns of null values, I opted to drop them from the data.
Out[76]: release_date 0
release_year 468
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 468
release_month 468
production_year 502
dtype: int64
Next, I dealt with movies that did not have a world_gross value.
In [77]: # creating a list of titles that did not have a world_gross value
worldless_titles = merged_df['title'].loc[merged_df['world_gross'].isnul
l()]
worldless_titles
To fill in this data, I ran a function to add the domestic gross and international gross column values together for
each row to equal the world gross.
Out[80]: release_date 0
release_year 468
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 0
release_month 468
production_year 502
dtype: int64
I no longer needed the production_year column at this point, so I went ahead and dropped that from the
DataFrame.
In [81]: # dropping production_year column
merged_df.drop('production_year', axis=1, inplace=True)
I once again made use of the datetime methods I had used previously for numbers_df prior to the merge, in
order to fill in the null values for 'month' in merged_df.
Out[83]: release_date 0
release_year 468
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 0
release_month 0
dtype: int64
Finally, I had to set the release_year for the titles whose release_dates had been scraped after the merge.
Out[85]: release_date 0
release_year 0
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 0
release_month 0
dtype: int64
In [86]: # getting counts of each value in release_year column
merged_df.release_year.value_counts()
When I checked the value counts for the release_year column, I found 31 values where the release year was not
within my parameters.
In [87]: merged_df.loc[merged_df['release_year']<2010]
Out[87]:
release_date release_year title genre prod_method creative_type bu
Io sono Historical
1795 2009-09-05 2009 Drama Live Action 1000
l’amore Fiction
Contemporary
1805 2009-03-25 2009 Chloe Thriller/Suspense Live Action 1300
Fiction
I Love You,
Contemporary
1808 2009-01-18 2009 Phillip Comedy Live Action 1300
Fiction
Morris
Enter the
1820 2009-06-17 2009 Drama Live Action Fantasy 1600
Void
Youth in Contemporary
1824 2009-09-11 2009 Comedy Live Action 1800
Revolt Fiction
The Last
1826 2009-09-04 2009 Drama Live Action Dramatization 1800
Station
Historical
1833 2009-05-17 2009 Middle Men Comedy Live Action 2000
Fiction
Contemporary
1840 2001-11-16 2001 Stone Drama Live Action 2200
Fiction
Historical
1880 2000-05-26 2000 Shanghai Drama Live Action 5000
Fiction
Les Contemporary
1902 1969-04-01 1969 Comedy Live Action 1080
Intouchables Fiction
1946 1982-06-25 1982 The Thing Horror Live Action Fantasy 3800
The
1948 2006-07-28 2006 Drama Live Action Dramatization 4000
Impossible
Dangerous Historical
1996 1988-06-09 1988 Drama Live Action 2420
Liaisons Fiction
Contemporary
2053 1994-09-23 1994 Redemption Thriller/Suspense Live Action 2300
Fiction
Historical
2095 2007-01-05 2007 Freedom Drama Live Action 1450
Fiction
Contemporary
2129 2005-01-01 2005 Aloha Drama Live Action 3700
Fiction
Contemporary
2161 2005-08-26 2005 Brothers Action Live Action 1300
Fiction
Robinson Digital
2166 1997-05-12 1997 Adventure Kids Fiction 1300
Crusoe Animation
Historical
2210 2008-12-12 2008 Yip Man 3 Action Live Action 3600
Fiction
Contemporary
2253 1968-01-01 1968 Sultan Action Live Action 1100
Fiction
release_date release_year title genre prod_method creative_type bu
Historical
2296 2007-03-16 2007 Silence Drama Live Action 4650
Fiction
Contemporary
2414 2005-09-03 2005 Serenity Thriller/Suspense Live Action 2500
Fiction
All of these were movies with release dates prior to 2010 that had been added from my get_date function after
my earlier filtering. I went ahead and dropped those as well.
In [89]: # filter DataFrame to only include titles with release_year >= 2010
merged_df = merged_df[merged_df['release_year']>=2010.00]
In [91]: merged_df.isnull().sum()
Out[91]: release_date 0
release_year 0
title 0
genre 1
prod_method 1
creative_type 3
budget 0
dom_gross 0
int_gross 0
world_gross 0
release_month 0
dtype: int64
Out[92]:
release_date release_year title genre prod_method creative_type budget dom_gro
Le Contemporary
1976 2012-04-25 2012 NaN Live Action 11000000 616
prénom Fiction
When I searched this entry on IMDb, there was no useful information on this title, so I dropped this as well.
Out[94]:
release_date release_year title genre prod_method creative_type budget dom
The Killer
1807 2010-02-19 2010 Drama Live Action NaN 13000000
Inside Me
In [96]: # filling null values for creative_type with the most common value for t
his column
merged_df['creative_type'].fillna('Contemporary Fiction', inplace=True)
Out[97]: release_date 0
release_year 0
title 0
genre 0
prod_method 1
creative_type 0
budget 0
dom_gross 0
int_gross 0
world_gross 0
release_month 0
dtype: int64
In [98]: # show rows where value for prod_method is null
merged_df.loc[merged_df['prod_method'].isnull()]
Out[98]:
release_date release_year title genre prod_method creative_type budget dom_gro
Dhoom Contemporary
2054 2013-12-18 2013 Action NaN 24000000 80319
3 Fiction
In [100]: # fill null values for prod_method with most common value for column
merged_df['prod_method'].fillna('Live Action', inplace=True)
Data Analysis
In order to answer the questions I had posed at the onset of this project, I performed exploratory analyses on
the basis of genre, creative type, production method, time of year, and budget. For each of these themes, I
utilized statistical methods followed by plotting of visualizations to determine the relationship between each of
these key elements and movie revenue (world gross). While I had the data for domestic and international gross
for most movies as well, I was most focused on world gross because I wanted the total amount of money that
Microsoft could expect to make on a given production.
One thing I realized as soon as I began my analysis was that my dollar amounts were not very plot-friendly. I had
to reassess my gross and budget columns at this point and add new columns for budget in millions and world
gross in millions.
Genre
The Drama genre was most common in this dataset, followed by Adventure and Action.
In [107]: # getting mean and median world gross amounts by genre
genre_stats = merged_df.groupby('genre')['world_gross_mil'].agg(['media
n', 'mean'])
genre_stats.sort_values(by='mean', ascending=False)
Out[107]:
median mean
genre
Out[110]: genre
Action 372.28
Adventure 390.04
Black Comedy 54.42
Comedy 97.13
Concert/Performance 34.49
Documentary 34.69
Drama 84.75
Horror 96.79
Multiple Genres 1.58
Musical 445.58
Romantic Comedy 93.10
Thriller/Suspense 104.24
Western 175.06
Name: world_gross_mil, dtype: float64
For my visualizations, I found it helpful to show my plots both with and without outliers. While outliers paint the
whole picture of the data being analyzed, it can be helpful to see the data more closesly and "zoom in" (both
literally and figuratively) on the trends without outliers.
In [111]: # generating box plot of world gross statistics per genre
plt.figure(figsize=(12,5))
sns.set_style('darkgrid')
sns.boxplot(x='genre', y='world_gross_mil', data=merged_df, palette='vir
idis_r')
plt.xticks(rotation=-80)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xlabel('Genre', fontsize = 16)
plt.title('World Gross by Genre (With Outliers)', fontsize = 18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14);
#saved in images as fig2
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig2.png')
In [112]: # generating box plot of world gross statistics per genre
plt.figure(figsize=(12,5))
sns.set_style('darkgrid')
sns.boxplot(x='genre', y='world_gross_mil', data=merged_df, showfliers=F
alse, palette='viridis_r')
plt.xticks(rotation=-80)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xlabel('Genre', fontsize = 16)
plt.title('World Gross by Genre (Without Outliers)', fontsize = 18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14);
#saved in images as fig3
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig3.png')
Based on mean and median, Musicals appeared to be the most lucrative genre for the time period I had
explored. However, as can be seen from the box plot above, this was also the genre with the largest interquartile
range (IQR), meaning that the middle values of world gross for these movies were the most spread out. After
Musicals, Action and Adventure were the highest-grossing categories, with both high means and medians.
These categories did have high IQR's as well, with a large variance in world gross values. Both Action and
Adventure also had many high-grossing outliers, as can be seen in the first box plot above.
For musical movies in the past ten years, there were five titles that were much more successful than any others.
There was also an obvious dip in the year 2020, thanks to the global COVID-19 pandemic.
In [115]: # creating subset of DataFrame where genre is Musical
musicals = merged_df.loc[merged_df['genre']=='Musical']
musicals.sort_values(by='world_gross_mil', ascending=False).head()
Out[115]:
release_date release_year title genre prod_method creative_type budget dom_g
Digital
302 2013-11-22 2013 Frozen Musical Kids Fiction 150000000 40073
Animation
Beauty
Animation/Live
701 2017-03-17 2017 and the Musical Fantasy 160000000 50401
Action
Beast
Digital
712 2017-11-22 2017 Coco Musical Kids Fiction 175000000 21032
Animation
Digital
610 2016-11-23 2016 Moana Musical Kids Fiction 150000000 24875
Animation
Digital
9 2010-11-24 2010 Tangled Musical Kids Fiction 260000000 20082
Animation
I wanted to see which other attributes the most profitable movies in the Musical genre had shared. What I found
was that, for the top five highest-grossing Musical movies in this time period, 4 out of 5 also happened to be:
Digitally Animated, Kids Fiction, and released in November!
Based on my analysis of genre, my recommendation to Microsoft was to consider making a musical movie.
Creative Type
Contemporary Fiction was the most common creative type in this dataset by far.
In [118]: # getting mean and median world gross amounts by creative type
genre_stats = merged_df.groupby('creative_type')['world_gross_mil'].agg
(['median', 'mean'])
genre_stats.sort_values(by='mean', ascending=False)
Out[118]:
median mean
creative_type
Super Hero movies did substantially better at the box office than all other creative types, with Kids Fiction
coming in second place.
In [121]: # getting interquartile range (IQR) for world gross of each genre
iqr = b - a
iqr
Out[121]: creative_type
Contemporary Fiction 105.01
Dramatization 99.43
Factual 59.02
Fantasy 279.87
Historical Fiction 124.20
Kids Fiction 344.91
Multiple Creative Types 56.62
Science Fiction 329.81
Super Hero 498.54
Name: world_gross_mil, dtype: float64
In [122]: # generating box plot of world gross statistics per creative type
plt.figure(figsize=(10,5))
sns.set_style('darkgrid')
sns.boxplot(x='creative_type', y='world_gross_mil', data=merged_df, pale
tte='viridis_r')
plt.xticks(rotation=-80)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xlabel('Creative Type', fontsize = 16)
plt.title('World Gross by Creative Type', fontsize = 18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
plt.ylim(None, 2000);
#saved in images as fig6
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig6.png')
In [123]: # generating box plot of world gross statistics per creative type
plt.figure(figsize=(10,5))
sns.set_style('darkgrid')
sns.boxplot(x='creative_type', y='world_gross_mil', data=merged_df, show
fliers=False, palette='viridis_r')
plt.xticks(rotation=-80)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xlabel('Creative Type', fontsize = 16)
plt.title('World Gross by Creative Type (without outliers)', fontsize =
18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12);
#saved in images as fig7
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig7.png')
Based on mean and median, Super Hero movies were far above all the other creative types. Kids Fiction was in
second place, with many high-grossing outliers (mentioned previously). Science Fiction and Fantasy had
relatively high means and medians as well, and both creative types also contained many high-grossing outliers.
In [124]: # plotting relationship between world gross and release month for super
hero movies
sns.set_style('darkgrid')
plt.figure(figsize=(14,6))
sns.lineplot(data=time_df.loc[time_df['creative_type']=='Super Hero'], x
="release_date", y="world_gross_mil", markers='o', style=True, dashes=[(
2,2)], linewidth=3, color='darkorange', legend=False)
plt.title('World Gross of Super Hero Movies by Year', fontsize=18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Release Date', fontsize=16)
plt.ylabel('World Gross in Millions', fontsize=16)
for w, x, y, z in zip(time_df['creative_type'], time_df['release_date'],
time_df['world_gross_mil'], time_df['title']):
if (w == 'Super Hero') & (y>1150):
plt.text(x = x, y = y+20, s = z, fontsize=12, color='black').set
_backgroundcolor('white');
#saved in images as fig8
#plt.tight_layout()
#plt.savefig('./images/fig8.png')
Super Hero movies seemed to do consistently well over the past ten years, although the line plot showed some
ups and downs. Still, even the lows for Super Hero movies would be considered highs for other types of movies,
so perspective is important. The plot showed seven titles that did extremely well for their movie type.
In [125]: # creating subset of DataFrame where creative type is Super Hero
superhero = merged_df.loc[merged_df['creative_type']=='Super Hero']
superhero.sort_values(by='world_gross_mil', ascending=False).head(7)
Out[125]:
release_date release_year title genre prod_method creative_type budget dom_
Avengers: Animation/Live
900 2019-04-26 2019 Action Super Hero 400000000 8583
Endgame Action
Avengers:
Animation/Live
801 2018-04-27 2018 Infinity Action Super Hero 300000000 6788
Action
War
The Animation/Live
200 2012-05-04 2012 Action Super Hero 225000000 6233
Avengers Action
Avengers:
Animation/Live
502 2015-05-01 2015 Age of Action Super Hero 365000000 4590
Action
Ultron
Black
800 2018-02-16 2018 Action Live Action Super Hero 200000000 7000
Panther
Captain
602 2016-05-06 2016 America: Action Live Action Super Hero 250000000 4080
Civil War
As I had anticipated, The Avengers movies dominated the Super Hero category. All of these movies were
produced by Live Action in the Action genre, and 6 out of 7 were released in either April or May. Based on my
analysis of creative type, my recommendation to Microsoft was to explore the idea of making a Super Hero
movie.
Production Method
In [126]: # plotting the number of movies per production method in dataset
plt.figure(figsize=(8,4))
sns.countplot(x='prod_method', data=merged_df, palette='viridis')
plt.title('Number of Movies Per Production Method', fontsize=18)
plt.ylabel('Count', fontsize=16)
plt.xlabel('Production Method', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
plt.xticks(rotation=-80);
#saved in images as fig9
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig9.png')
Out[127]:
median mean
prod_method
However, the Animation/Live Action category had the highest mean and median, with Digital Animation coming
in second.
In [128]: # generating box plot of world gross statistics per production method
plt.figure(figsize=(10,4))
sns.set_style('darkgrid')
sns.boxplot(x='prod_method', y='world_gross_mil', data=merged_df, palett
e='viridis_r')
plt.xticks(rotation=-80)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xlabel('Production Method', fontsize = 16)
plt.title('World Gross by Production Method', fontsize = 18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
plt.ylim(None, 2000);
#saved in images as fig10
#plt.subplots_adjust(bottom=0.2)
#plt.savefig('./images/fig10.png')
Based on mean and median, Animation/Live Action and Digital Animation appeared to be the most successful
production methods for the time period I had explored.
In [129]: # creating subset of DataFrame where prod_method is Animation/Live Actio
n or Digital Animation
anim_df = merged_df.loc[(merged_df['prod_method']=='Animation/Live Actio
n') | (merged_df['prod_method']=='Digital Animation')]
anim_df.sort_values(by='world_gross_mil', ascending=False).head(10)
Out[129]:
release_date release_year title genre prod_method creative_type budget d
Avengers: Animation/Live
900 2019-04-26 2019 Action Super Hero 400000000 8
Endgame Action
Star Wars
Ep. VII: Animation/Live Science
500 2015-12-18 2015 Adventure 306000000 9
The Force Action Fiction
Awakens
Avengers:
Animation/Live
801 2018-04-27 2018 Infinity Action Super Hero 300000000 6
Action
War
The Animation/Live
200 2012-05-04 2012 Action Super Hero 225000000 6
Avengers Action
Digital
903 2019-11-22 2019 Frozen II Adventure Kids Fiction 150000000 4
Animation
Avengers:
Animation/Live
502 2015-05-01 2015 Age of Action Super Hero 365000000 4
Action
Ultron
Harry
Potter
and the Animation/Live
1964 2011-07-07 2011 Adventure Fantasy 125000000 3
Deathly Action
Hallows:
Part II
Star Wars
Ep. VIII: Animation/Live Science
700 2017-12-15 2017 Adventure 200000000 6
The Last Action Fiction
Jedi
Digital
302 2013-11-22 2013 Frozen Musical Kids Fiction 150000000 4
Animation
I immediately noticed some overlap between this subset and the Musical and Super Hero subsets. Many of the
top titles for this animation subset were either musicals or super hero movies as well. I also noted that while
Frozen II is generally classified as a musical like the original Frozen, the data had it listed as an adventure movie.
I wondered if there may be other children's animated movies in the data that were musical as well, but labeled
with a different genre.
In [130]: # creating subset of dataframe where production method is digital animat
ion
diganim_df = merged_df.loc[merged_df['prod_method']=='Digital Animation'
]
diganim_df.sort_values(by='world_gross_mil', ascending=False).head(10)
Out[130]:
release_date release_year title genre prod_method creative_type budget d
Digital
903 2019-11-22 2019 Frozen II Adventure Kids Fiction 150000000 4
Animation
Digital
302 2013-11-22 2013 Frozen Musical Kids Fiction 150000000 4
Animation
Incredibles Digital
802 2018-06-15 2018 Adventure Kids Fiction 200000000 6
2 Animation
Digital
505 2015-07-10 2015 Minions Adventure Kids Fiction 74000000 3
Animation
Despicable Digital
708 2017-06-30 2017 Adventure Kids Fiction 75000000 2
Me 3 Animation
Finding Digital
601 2016-06-17 2016 Adventure Kids Fiction 200000000 4
Dory Animation
Digital
606 2016-03-04 2016 Zootopia Adventure Kids Fiction 150000000 3
Animation
Despicable Digital
303 2013-07-03 2013 Adventure Kids Fiction 76000000 3
Me 2 Animation
As I had suspected, many of the top films with a production method of digital animation also had musical
components, but were labeled with a genre other than Musical. All of these were also Kids Fiction creative type,
which was the second most profitable creative type as seen above.
From reviewing the data above and analyzing this plot, I was able to determine that broadening the recomended
movie type from simply musicals to animated movies with a musical component would ensure that no high-
grossing animated movies were excluded. And again, since all of these animated and animated/musical movies
were Kids Fiction as well, they all had a combination of factors that were correlated with high world gross.
Based on my analysis of production methods, my recommendation to Microsoft was that they should prioritize
animation, whether solely digital animation or in conjunction with live action, in order to achieve the highest
possible movie gross.
Release Month: When is the most lucrative time of year to release a movie?
In this dataset, movie release months were fairly evenly distributed throughout the year, with the most releases
in September and the least in April.
In [135]: # getting mean and median world gross amounts by release month
months_df = merged_df['world_gross_mil'].groupby(merged_df['release_mont
h']).agg(['median', 'mean'])
months_df.sort_values(by='mean', ascending=False)
Out[135]:
median mean
release_month
Based on release month alone, May, June, and July seemed to be the most profitable months with the highest
mean value for world box office gross. But I wondered, how much does the type of movie impact the time of
year that a movie does best?
By plotting world gross by release month for only the types of movies I was recommending, super hero and
animated, I was able to determine that there were certain trends that could be observed. While the super hero
movies seemed to do best in April and May, the animated movies seemed to do best in June, July, and
November.
In [140]: # getting mean and median world gross amounts by release month for anima
ted movies
animated_months_df = merged_df.loc[merged_df['animated_or_super']=='anim
ated']
animated_months_gb = animated_months_df['world_gross_mil'].groupby(anima
ted_months_df['release_month']).agg(['median', 'mean'])
animated_months_gb.sort_values(by='mean', ascending=False)
Out[140]:
median mean
release_month
After taking a closer look at release month statistics for animated movies, I was able to confirm that June and
November were the highest-grossing months for an animated movie release.
In [142]: # plotting world gross and release month for animated movies
g = sns.catplot(data=merged_df.loc[merged_df['prod_method']=='Digital An
imation'], kind="swarm", x="release_month", y="world_gross_mil", color=
'royalblue', order=months, s=7, legend=False)
g.fig.set_size_inches(14,4)
plt.title('World Gross by Month for Animated Movies', fontsize=18)
plt.xlabel('Release Month', fontsize=16)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
plt.axhline(y=1000, ls='--', c='green');
#saved in images as fig16
#plt.tight_layout()
#plt.savefig('./images/fig16.png')
Almost all of the animated movies over the one-billion mark for world gross were released in June or November,
with the exception of one in July and one right on the mark in March. The highest-grossing ones were released
in November.
In [143]: # getting mean and median world gross amounts by release month for super
hero movies
super_months_df = merged_df.loc[merged_df['animated_or_super']=='super']
super_months_gb = super_months_df['world_gross_mil'].groupby(super_month
s_df['release_month']).agg(['median', 'mean'])
super_months_gb.sort_values(by='mean', ascending=False)
Out[143]:
median mean
release_month
After taking a closer look at release month statistics for super hero movies, I was able to confirm that April and
May were the highest-grossing months for a super hero movie release.
In [145]: # plotting world gross by month for super hero movies
g = sns.catplot(data=merged_df.loc[merged_df['creative_type']=='Super He
ro'], kind="swarm", x="release_month", y="world_gross_mil", color='darko
range', order=months, s=6, legend=False)
g.fig.set_size_inches(14,5)
plt.title('World Gross by Month for Super Hero Movies', fontsize=18)
plt.xlabel('Release Month', fontsize=16)
plt.ylabel('World Gross in Millions', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
plt.axhline(y=1000, ls='--', c='green');
#saved in images as fig18
#plt.tight_layout()
#plt.savefig('./images/fig18.png')
Most of the super hero movies over the one-billion mark were released in May. The two highest-grossing ones
were released in April. There were a couple released in other months as well.
1. Release a kids fiction animated movie with a musical component in June or November.
2. Release a super hero movie in April or May.
Production Budget: What budget amount tends to achieve the highest box office
gross?
The first thing I did was run the .corr method on merged_df to see if there were any strong correlations in the
numerical data. I then generated a heatmap based on these correlations.
In [146]: # generating correlations
corr = merged_df.corr()
corr
Out[146]:
release_year budget dom_gross int_gross world_gross budget_in_mil world_g
In [148]: # plotting world gross by budget in terms of genre with line of best fit
sns.lmplot(x='budget_in_mil', y='world_gross_mil', data=merged_df, aspec
t=4, line_kws={'color': 'crimson'})
plt.title('World Gross by Budget', fontsize=20)
plt.xlabel('Budget in Millions', fontsize=18)
plt.ylabel('World Gross in Millions', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.ylim(0,None)
plt.xlim(0, None);
#saved in images as fig20
#plt.tight_layout()
#plt.savefig('./images/fig20.png')
The regression line in this plot shows the general increase in movie gross as the budget increases. While this
varies greatly from movie to movie, it was a good starting point as I worked towards combining my findings and
forming my recommendations. This plot confirmed my hypothesis that a higher production budget typically
leads to a higher box office gross.
In [149]: # plotting world gross by budget for animated and super hero movies
sns.set_style('darkgrid')
g = sns.lmplot(x='budget_in_mil', y='world_gross_mil', data=merged_df.lo
c[(merged_df['prod_method']=='Digital Animation') | (merged_df['creative
_type']=='Super Hero')], hue='animated_or_super', aspect=3, legend=False
)
plt.title('World Gross by Budget for Animated and Super Hero Movies', fo
ntsize=20)
plt.xlabel('Budget in Millions', fontsize=18)
plt.ylabel('World Gross in Millions', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
g.fig.set_size_inches(11,6)
plt.legend(loc='upper left', fontsize=14)
#for w, x, y, z in zip(merged_df['animated_or_super'], merged_df['budget
_in_mil'], merged_df['world_gross_mil'], merged_df['title']):
#if (w == 'animated') & (y>1250):
#plt.text(x = x+5, y = y-15, s = z.upper(), fontsize=14, color
='black')
#for w, x, y, z in zip(merged_df['animated_or_super'], merged_df['budget
_in_mil'], merged_df['world_gross_mil'], merged_df['title']):
#if (w == 'super') & (y>1250):
#plt.text(x = x+5, y = y-15, s = z.upper(), fontsize=14, color
='black')
plt.axhline(y=1000, ls='--', c='green');
#saved in images as fig21
#plt.tight_layout()
#plt.savefig('./images/fig21.png')
For the specific movie types we are looking at, we can see that super hero movies benefit from a very large
budget, while animated movies benefit from a moderately large budget. Both regression lines show that as the
budget increases, the world gross tends to increase as well.
In [150]: # getting descriptive stats for animated movie budgets
merged_df.loc[merged_df['prod_method']=='Digital Animation', 'budget_in_
mil'].describe()
Out[150]: count 124.00
mean 87.68
std 59.33
min 0.00
25% 37.50
50% 76.75
75% 135.00
max 260.00
Name: budget_in_mil, dtype: float64
Budget Recommendations
After careful consideration of the findings above, my main observation was that a higher production budget
leads to a higher gross. I had two specific recommendations regarding production budget for the preferred
movie types:
1. For the highest box office gross, an animated move should have a budget of 75 to 200 million dollars.
2. For the highest box office gross, a super hero movie should have a budget of 200 to 400 million dollars.
Now that I had answers to my three main questions regarding the business problem, I wanted to put them all
together into one visualization.
This plot shows the specifics of the top-grossing movies in the animation category. As discussed previously, the
highest-grossing movies were mostly released in June or November and had budgets between 75 and 200
million dollars.
In [173]: # plotting world gross by budget and release month for animated
sns.set_style('darkgrid')
g = sns.relplot(x='budget_in_mil', y='world_gross_mil', data=merged_df.l
oc[(merged_df['prod_method']=='Digital Animation')], hue='release_month'
, hue_order=['January', 'February', 'March', 'April', 'May', 'June', 'Ju
ly', 'August', 'September', 'October', 'November', 'December'], s=120, a
spect=3, palette='Paired')
plt.title('World Gross by Budget and Release Month for Animated Movies',
fontsize=20)
plt.xlabel('Budget in Millions', fontsize=18)
plt.ylabel('World Gross in Millions', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlim(0, None)
g.fig.set_size_inches(16,8)
for v, w, x, y, z in zip(merged_df['genre'], merged_df['prod_method'], m
erged_df['budget_in_mil'], merged_df['world_gross_mil'], merged_df['titl
e']):
if (z=='Frozen'):
plt.text(x=x-18, y=y-15, s=z.upper(), fontsize=14, color='black'
)
elif z=='Toy Story 3':
plt.text(x=x-27, y=y-10, s=z.upper(), fontsize=14, color='black'
)
elif z=='Zootopia':
plt.text(x=x+2, y=y, s=z.upper(), fontsize=14, color='black')
elif (w=='Digital Animation') and y>1000:
plt.text(x = x+2, y = y-15, s = z.upper(), fontsize=14, color='b
lack')
plt.axhline(y=1000, ls='--', c='green');
#saved in images as fig24
#plt.tight_layout()
#plt.savefig('./images/fig24.png')
In [175]: # plotting world gross by budget and release month for super hero movies
sns.set_style('darkgrid')
g = sns.relplot(x='budget_in_mil', y='world_gross_mil', data=merged_df.l
oc[merged_df['creative_type']=='Super Hero'], hue='release_month', hue_o
rder=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'A
ugust', 'September', 'October', 'November', 'December'], s=130, aspect=3
, palette='Paired')
plt.title('World Gross by Budget and Release Month for Super Hero Movie
s', fontsize=20)
plt.xlabel('Budget in Millions', fontsize=18)
plt.ylabel('World Gross in Millions', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlim(0, None)
g.fig.set_size_inches(16,10)
for w, x, y, z in zip(merged_df['animated_or_super'], merged_df['budget_
in_mil'], merged_df['world_gross_mil'], merged_df['title']):
if (z=='Avengers: Endgame'):
plt.text(x=x-60, y=y-15, s=z.upper(), fontsize=12, color='black'
)
elif (z=='Avengers: Infinity War'):
plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
elif (z=='Avengers: Age of Ultron'):
plt.text(x=x-76, y=y-15, s=z.upper(), fontsize=12, color='black'
)
elif (z=='The Avengers'):
plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
elif (z=='Spider-Man: Far From Home'):
plt.text(x=x-80, y=y-15, s=z.upper(), fontsize=12, color='black'
)
elif (z=='Aquaman'):
plt.text(x=x, y=y+10, s=z.upper(), fontsize=12, color='black')
elif (z=='Captain Marvel'):
plt.text(x=x+2, y=y-35, s=z.upper(), fontsize=12, color='black')
elif (w == 'super') & (y>1000):
plt.text(x = x+2, y = y-15, s = z.upper(), fontsize=12, color='b
lack')
plt.axhline(y=1000, ls='--', c='green');
#saved in images as fig25
#plt.tight_layout()
#plt.savefig('./images/fig25.png')
This plot shows the specifics of the top-grossing movies in the super hero category. As discussed previously,
the highest-grossing movies were mostly released in April, May, or July and most had budgets between 200 and
400 million dollars.
These visualizations show the relationship between all the key factors that I analyzed: movie type, release
month, and production budget. Based on the top movies that meet these criteria, super hero movies do best
when given a very high budget and are released in April or May. Animated musicals find the most success when
released in June or November, and don't necessarily need quite as high of a budget to achieve a high gross,
although their gross is also comparatively lower than that of super hero movies.
Directors
Now that we know that a high-budget super hero movie released in April or May is the most rewarding
combination of factors, how can we make this movie even better and ensure that it reaches its full potential?
One way that we can increase our chances of success is by hiring the right director for this type of movie.
To further this point, I utilized the TMDb API to obtain the director names for all the super hero movies in my
dataset.
In [156]: def get_director(title):
"""
Updates director information for movie in dataframe.
Parameters:
title (str): user input movie title.
Returns:
Updated cells in Pandas DataFrame.
"""
title_r = title.replace(' ', '+')
url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&
query={title_r}"
response = requests.get(url)
if len(response.json()['results']) > 0:
movie_id = response.json()['results'][0]['id']
url2 = f"https://api.themoviedb.org/3/movie/{movie_id}/credits?a
pi_key={api_key}"
response2 = requests.get(url2)
crew = response2.json()['crew']
directors = []
for member in crew:
if member['job'] == 'Director':
directors.append(member['name'])
d = str(directors)
d = d.replace('[', '').replace(']', '').replace("'","")
merged_df.loc[merged_df['title']==title, 'director'] = d
else:
pass
In [158]: # getting director info for movies in list and updating data accordingly
for title in superhero_titles:
get_director(title)
In [159]: # getting director value counts
merged_df.director.value_counts()
Out[162]:
release_date release_year title genre prod_method creative_type budget dom_
Captain
America:
403 2014-04-04 2014 The Action Live Action Super Hero 170000000 25974
Winter
Soldier
Captain
602 2016-05-06 2016 America: Action Live Action Super Hero 250000000 4080
Civil War
Avengers:
Animation/Live
801 2018-04-27 2018 Infinity Action Super Hero 300000000 6788
Action
War
Avengers: Animation/Live
900 2019-04-26 2019 Action Super Hero 400000000 8583
Endgame Action
Out[163]:
release_date release_year title genre prod_method creative_type budget dom_
The Animation/Live
200 2012-05-04 2012 Action Super Hero 225000000 6233
Avengers Action
Avengers:
Animation/Live
502 2015-05-01 2015 Age of Action Super Hero 365000000 4590
Action
Ultron
As both the director_stats DataFrame and the Mean Movie Gross by Director plot show, for the past ten years,
the top five directors (or combinations of directors) for super hero movies by mean world gross are:
Hiring one of these top directors to work on a super hero movie can further increase the chances of a
successful movie venture.
Composers
For the animated movie, I had found that the musical ones tended to have the highest box office gross. To that
end, I again used the TMDb API, this time to obtain composer information for all the animated movies in my
dataset. Because not every animated movie was a musical one, they did not all have composer information
available. Only the titles that had an associated composer value available were used in the following analysis.
Parameters:
title (str): user input movie title.
Returns:
Updated cells in Pandas DataFrame.
"""
title_r = title.replace(' ', '+')
url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&
query={title_r}"
response = requests.get(url)
if len(response.json()['results']) > 0:
movie_id = response.json()['results'][0]['id']
url2 = f"https://api.themoviedb.org/3/movie/{movie_id}/credits?a
pi_key={api_key}"
response2 = requests.get(url2)
crew = response2.json()['crew']
composers = []
for member in crew:
if member['job'] == 'Original Music Composer':
composers.append(member['name'])
c = str(composers)
c = c.replace('[', '').replace(']', '').replace("'","")
merged_df.loc[merged_df['title']==title, 'composer'] = c
else:
pass
Out[169]:
release_date release_year title genre prod_method creative_type budget dom_
Digital
302 2013-11-22 2013 Frozen Musical Kids Fiction 150000000 40073
Animation
Frozen Digital
903 2019-11-22 2019 Adventure Kids Fiction 150000000 4773
II Animation
In [170]: # seeing which movies had music composed by Christophe Beck, Robert Lope
z, Kristen Anderson-Lopez
merged_df.loc[merged_df['composer']=='Heitor Pereira, Pharrell Williams'
]
Out[170]:
release_date release_year title genre prod_method creative_type budget do
Despicable Digital
708 2017-06-30 2017 Adventure Kids Fiction 75000000 26
Me 3 Animation
As both the composer_stats DataFrame and the Mean Movie Gross by Composer plot show, for the past ten
years, the top five composers (or combinations of composers) for animated movies by mean world gross are:
These composers have a proven track record of success and should be hired to work on an animated musical
to increase box office success.
Now that I had my general answers to each of the primary business questions that were laid out at the onset of
this project, it was time to combine them into an actionable plan for Microsoft's up-and-coming movie studio.
Evaluation
Microsoft's ticket to worldwide box office success can be attained by releasing a Super Hero movie in April or
May, or by releasing an animated children's musical movie in June or November. My findings have shown that
the more funds a studio invests in their movie production, the more money (worldwide box office gross) they are
likely to receive as a result. Though there are some outliers, the majority of high-grossing movies for these
particular types of movies are ones with higher budgets. For an animated movie, this amount is between 75 and
200 million dollars. For a super hero movie, that amount is between 200 and 400 million dollars. Throughout my
analysis, I have found that the optimal time to release a movie depends on the type of movie that it is. While
animated musical movies tend to fare very well in November and June, super hero movies have seen immense
box office success in April and May. I chose to investigate a bit further and narrow down some additional
attributes that may increase a movie's value, such as the highest-grossing composers and directors for the past
ten years, based on mean world gross.
I am confident that the results I extrapolated from this analysis would generalize beyond the data that I have,
with the exception of this year and next year due to the COVID-19 pandemic. By looking at the data up until this
year, the trends and correlations I found were true for the past ten years, so I am confident that they will again
be true once the world returns to some semblance of normalcy.
If the recommendations that I made are put to use, I am confident that Microsoft will have a successful break
into the movie-making industry. From the data, it is clear that all the attributes I have discussed are correlated
with high worldwide box office gross, which is exactly what Microsoft will want for their first movies and beyond.
Conclusion
In conclusion, I would recommend that Microsoft release one of the following two movies, each with four
specific recommendations that have proven to be successful combinations:
Movie Option #1
an animated kids fiction movie
with a production budget of 75 to 200 million dollars
released in June or November
containing songs by a high-grossing composer with a track record of successful work in digital animation
movies, such as Christophe Beck, Robert Lopez, and Kristen Anderson-Lopez, or Heitor Pereira and
Pharrell Williams
Movie Option #2
a live action/animation super hero movie
with a production budget of 200 to 400 million dollars
released in April or May
directed by a top-grossing director with a history of proven successful superhero movies, such as Anthony
Russo, Joe Russo, or Joss Whedon
While the past ten years of data show that this should be a good recipe for success, one limitation is that we are
currently in a global pandemic, which has negatively affected many facets of the global economy. The
visualizations above displaying movie gross over time clearly show a significant drop in movie gross for this year
(2020). However, since movies take quite a bit of time to produce, the expectation is that the market will be
trending in the right direction by the time a future movie would be released.
In the future, this analysis could be improved by adding additional data as it becomes available. It could also be
expanded upon by determining how much money there is to be made on a streaming platform movie release
while theaters remain at low audience capacity.