Organized

Arthur Nice Passere
05/17/2024
Outline
• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix
Executive Summary
• Summary of methodologies
- Data Collection through API
- Data Collection with Web Scraping
- Data Wrangling
- Exploratory Data Analysis with SQL
- Exploratory Data Analysis with Data Visualization
- Interactive Visual Analytics with Folium
- Machine Learning Prediction
• Summary of all results
- Exploratory Data Analysis result
- Interactive analytics in screenshots
- Predictive Analytics result from Machine Learning Lab
Introduction
SpaceX has disrupted the space industry by offering rocket launches, specifically Falcon 9, for as low as 62 million dollars, while other providers charge upward of 165 million dollars each. Most of this saving comes from SpaceX's idea of reusing the first stage: the rocket is landed so it can fly again on a later mission. Repeating this process should drive the price down even further. From the perspective of a data scientist at a startup rivaling SpaceX, the goal of this project is to create a machine learning pipeline that predicts the landing outcome of the first stage. This prediction is crucial in identifying the right price to bid against SpaceX for a rocket launch.
The problems included:
• Identifying all factors that influence the landing outcome.
• Understanding the relationship between the variables and how each affects the outcome.
• Finding the best conditions for increasing the probability of a successful landing.
Section 1
Methodology
Executive Summary
• Data collection methodology:
- Data was collected using the SpaceX REST API and web scraping from Wikipedia
• Perform data wrangling
- Data was processed using one-hot encoding for categorical features
• Perform exploratory data analysis (EDA) using visualization and SQL
• Perform interactive visual analytics using Folium and Plotly Dash
• Perform predictive analysis using classification models
- How to build, tune, and evaluate classification models
Data Collection
Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. As mentioned, the dataset was collected through the SpaceX REST API and by web scraping Wikipedia.
For the REST API, we started with a GET request. We then decoded the response content as JSON and turned it into a pandas dataframe using json_normalize(). We then cleaned the data, checked for missing values, and filled them where needed.
For web scraping, we used BeautifulSoup to extract the launch records from an HTML table, parsed the table, and converted it to a pandas dataframe for further analysis.
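As a minimal sketch of the API step, the snippet below flattens a small hand-written JSON payload with pandas.json_normalize and fills one missing value with the column mean. The records, the field names, and the mean-fill rule are illustrative assumptions, not the actual SpaceX API response or the notebook's cleaning logic. In the real workflow the list would come from decoding the GET response as JSON.

```python
import pandas as pd

# Hypothetical, hand-written records standing in for a decoded API response.
records = [
    {"flight_number": 1, "rocket": {"name": "Falcon 9"}, "payload": {"mass_kg": 500.0}},
    {"flight_number": 2, "rocket": {"name": "Falcon 9"}, "payload": {"mass_kg": None}},
    {"flight_number": 3, "rocket": {"name": "Falcon 9"}, "payload": {"mass_kg": 700.0}},
]

# Flatten nested dictionaries into dotted column names such as "payload.mass_kg".
df = pd.json_normalize(records)

# Count missing values per column before cleaning.
missing_before = df.isnull().sum()

# Fill the numeric gap with the column mean (an assumed cleaning rule).
df["payload.mass_kg"] = df["payload.mass_kg"].fillna(df["payload.mass_kg"].mean())
```

Mean imputation is only one option; whether it is appropriate depends on how many values are missing and why.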
Data Collection – SpaceX API
• Get request for rocket launch data using API
• Use json_normalize method to convert json result to dataframe
• Performed data cleaning and filling the missing value
From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Collection_yJPxhv2oU.ipynb
Data Collection - Scraping
• Request the Falcon9 Launch Wiki page from url
• Create a BeautifulSoup from the HTML response
• Extract all column/variable names from the HTML header
From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Collection_with_Web_Scraping_nI89VIRCE.ipynb
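The scraping steps can be sketched as follows. To stay self-contained, the example parses a small inline HTML fragment rather than requesting the Wikipedia page; the fragment, its column names, and its single data row are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page.
html = """
<table class="wikitable">
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload</th></tr>
  <tr><td>1</td><td>CCAFS</td><td>Dragon</td></tr>
</table>
"""

# Build the parse tree with the standard-library HTML parser backend.
soup = BeautifulSoup(html, "html.parser")

# Take the first table and read the text of every header cell.
table = soup.find("table")
column_names = [th.get_text(strip=True) for th in table.find_all("th")]

# Read each data row as a list of cell texts, skipping the header row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]
```

The `column_names` and `rows` lists can then be passed to a pandas DataFrame constructor for further analysis.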
Data Wrangling
Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and Exploratory Data Analysis (EDA).
We first calculate the number of launches at each site, then calculate the number and occurrence of mission outcomes per orbit type.
We then create a landing outcome label from the outcome column. This makes further analysis, visualization, and ML easier. Lastly, we export the result to a CSV.
From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Wrangling_9HnvfsJ5G.ipynb
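A minimal sketch of the wrangling steps, on a tiny made-up dataframe. The column names (LaunchSite, Orbit, Outcome, Class) and the outcome strings are assumptions chosen for illustration and may not match the notebook exactly.

```python
import pandas as pd

# Hypothetical sample; column names and outcome strings are assumptions.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "KSC LC 39A", "CCAFS SLC 40", "VAFB SLC 4E"],
    "Orbit": ["LEO", "GTO", "ISS", "LEO"],
    "Outcome": ["True ASDS", "False Ocean", "True RTLS", "None None"],
})

# Number of launches per site.
launches_per_site = df["LaunchSite"].value_counts()

# Outcomes treated as an unsuccessful landing in this sketch.
bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Landing label: 1 for a successful landing, 0 otherwise.
df["Class"] = (~df["Outcome"].isin(bad_outcomes)).astype(int)

# The mean of a 0/1 label is the overall success rate.
success_rate = df["Class"].mean()
```

Calling `df.to_csv(...)` with a file name of your choice would then export the labeled data.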
EDA with Data Visualization
We started by using scatter plots to find the relationships between attributes such as:
• Payload and Flight Number.
• Flight Number and Launch Site.
• Payload and Launch Site.
• Flight Number and Orbit Type.
• Payload and Orbit Type.
Scatter plots show how attributes depend on each other. Once a pattern emerges from the plots, it is easy to see which factors most affect the success of the landing outcome.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_Visualisation_Lab_jJkKVG6F1.ipynb
EDA with Data Visualization
Once the scatter plots hint at the relationships, we use further visualization tools such as bar graphs and line plots for deeper analysis.
Bar graphs are one of the easiest ways to interpret the relationship between attributes. In this case, we use a bar graph to determine which orbits have the highest probability of success.
We then use a line graph to show trends or patterns of an attribute over time, in this case the yearly launch success trend.
Finally, we apply feature engineering by creating dummy variables for the categorical columns, to be used for success prediction in a later module.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_Visualisation_Lab_jJkKVG6F1.ipynb
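The dummy-variable step can be sketched with pandas.get_dummies. The feature names and values below are hypothetical; only the technique (one-hot encoding of categorical columns) is taken from the slides.

```python
import pandas as pd

# Hypothetical feature table with one numeric and two categorical columns.
features = pd.DataFrame({
    "PayloadMass": [500.0, 3000.0, 5500.0],
    "Orbit": ["LEO", "GTO", "LEO"],
    "LaunchSite": ["CCAFS SLC 40", "KSC LC 39A", "CCAFS SLC 40"],
})

# Replace each categorical column with one 0/1 column per category.
features_one_hot = pd.get_dummies(
    features, columns=["Orbit", "LaunchSite"], dtype=float
)
```

Casting the dummies to float keeps the whole table numeric, which most classifiers expect.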
EDA with SQL
Using SQL, we ran several queries to better understand the dataset, for example:
- Displaying the names of the launch sites.
- Displaying 5 records where launch sites begin with the string ‘CCA’.
- Displaying the total payload mass carried by boosters launched by NASA (CRS).
- Displaying the average payload mass carried by booster version F9 v1.1.
- Listing the date when the first successful landing outcome on a ground pad was achieved.
- Listing the names of the boosters which succeeded on a drone ship and have payload mass greater than 4000 but less than 6000.
- Listing the total number of successful and failed mission outcomes.
- Listing the names of the booster_versions which have carried the maximum payload mass.
- Listing the failed landing_outcomes on a drone ship, their booster versions, and launch site names for the year 2015.
- Ranking the count of landing outcomes between 2010-06-04 and 2017-03-20, in descending order.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_SQL__eqznon1EA.ipynb
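A few of the queries listed above can be sketched against an in-memory SQLite database. The table name, column names, and rows are hypothetical stand-ins, and SQLite is used here only because it ships with Python; the notebook may use a different database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, for illustration only.
conn.execute(
    "CREATE TABLE launches (launch_site TEXT, customer TEXT, payload_mass_kg REAL)"
)
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?)",
    [
        ("CCAFS LC-40", "NASA (CRS)", 500.0),
        ("CCAFS LC-40", "SES", 3000.0),
        ("KSC LC-39A", "NASA (CRS)", 2500.0),
    ],
)

# Unique launch site names.
sites = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT launch_site FROM launches ORDER BY launch_site"
    )
]

# Up to 5 records whose launch site begins with 'CCA'.
cca_rows = conn.execute(
    "SELECT * FROM launches WHERE launch_site LIKE 'CCA%' LIMIT 5"
).fetchall()

# Total payload mass for one customer, using a bound parameter.
(nasa_total,) = conn.execute(
    "SELECT SUM(payload_mass_kg) FROM launches WHERE customer = ?",
    ("NASA (CRS)",),
).fetchone()

conn.close()
```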
Build an Interactive Map with Folium
To visualize the launch data on an interactive map, we took the latitude and longitude coordinates of each launch site and added a circle marker around each site, labeled with the site name.
We then assigned the launch outcomes (failure, success) in the dataframe to classes 0 and 1, shown as red and green markers on the map in a MarkerCluster().
We then used the Haversine formula to calculate the distance from the launch sites to various landmarks, to answer questions such as:
• How close are the launch sites to railways, highways, and coastlines?
• How close are the launch sites to nearby cities?
From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Interactive_Visual_Analytics_with_Folium_M8uUhCmHY.ipynb
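For reference, the Haversine distance can be computed with the standard library alone. This is a generic sketch of the formula, not the notebook's code; the function name and the mean Earth radius of 6371 km are assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )

    # Convert the central angle to a distance along the sphere.
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Passing a launch site's coordinates and a landmark's coordinates returns the distance between them in kilometres, which can then be drawn on the Folium map.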
Build a Dashboard with Plotly Dash
• We built an interactive dashboard with Plotly Dash, allowing users to explore the data as they need.
• We plotted pie charts showing the total launches by site.
• We then plotted a scatter graph showing the relationship between Outcome and Payload Mass (Kg) for the different booster versions.
The link to the app.py: https://github.com/farishelmi17/SpaceX/blob/main/spacex_dash_app.py
Predictive Analysis (Classification)
Building the Model
• Load the dataset into NumPy and Pandas
• Transform the data and then split into training and test datasets
• Decide which type of ML to use
• Set the parameters and algorithms to GridSearchCV and fit it to the dataset
Evaluating the Model
• Check the accuracy for each model
• Get tuned hyperparameters for each type of algorithm
• Plot the confusion matrix
Improving the Model
• Use Feature Engineering and Algorithm Tuning
Find the Best Model
• The model with the best accuracy score will be the best performing model
From: https://github.com/farishelmi17/SpaceX/blob/main/spacex_dash_app.py
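The build-and-tune loop can be sketched with scikit-learn as below. The data is synthetic (make_classification), and the parameter grid, split ratio, and random seeds are illustrative assumptions rather than the notebook's settings. Unlike the order listed on the slide, this sketch fits the scaler on the training split only, to avoid leaking test information.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the launch features and landing class.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Standardize features using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Hypothetical hyperparameter grid for a decision tree.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}

# 5-fold cross-validated search over the grid.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
```

Repeating the same search for other classifiers and comparing `test_accuracy` gives the model comparison described above.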
Results
The results fall into three main categories:
• Exploratory data analysis results
• Interactive analytics demo in screenshots
• Predictive analysis results
Section 2
Flight Number vs. Launch Site
This scatter plot shows that the more flights a launch site has had, the greater its success rate. However, site CCAFS SLC40 shows this pattern least clearly.
Payload vs. Launch Site
This scatter plot shows that once the payload mass is greater than 7000 kg, the probability of success increases markedly.
However, there is no clear pattern showing that the success rate at a launch site depends on the payload mass.
Success Rate vs. Orbit Type
This figure shows that the orbit may influence the landing outcome: some orbits, such as SSO, HEO, GEO, and ES-L1, have a 100% success rate, while the SO orbit has a 0% success rate.
However, deeper analysis shows that some of these orbits, such as GEO, SO, HEO, and ES-L1, have only 1 occurrence, which means more data is needed to see a pattern or trend before we draw any conclusion.
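One way to keep the sample size in view is to compute the launch count next to the success rate for each orbit. The dataframe below is made up for illustration, and the column names Orbit and Class are assumptions.

```python
import pandas as pd

# Hypothetical launches: Class is 1 for a successful landing, 0 otherwise.
df = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "GTO", "SSO", "SSO", "GEO"],
    "Class": [1, 0, 0, 1, 1, 1, 1],
})

# Success rate and number of launches per orbit.
summary = df.groupby("Orbit")["Class"].agg(success_rate="mean", launches="count")

# Keep only orbits with more than one launch before comparing rates.
reliable = summary[summary["launches"] > 1]
```

In this toy data GEO shows a 100% rate from a single launch, so it drops out of `reliable`, which mirrors the caveat above.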
Flight Number vs. Orbit Type
This scatter plot shows that, generally, the larger the flight number on each orbit, the greater the success rate (especially for the LEO orbit), except for the GTO orbit, which shows no relationship between the two attributes.
Orbits with only 1 occurrence should be excluded from the above statement, as more data is needed.
Payload vs. Orbit Type
Heavier payloads have a positive impact on the LEO, ISS, and PO orbits.
However, they have a negative impact on the MEO and VLEO orbits.
The GTO orbit seems to show no relationship between the attributes.
Meanwhile, the SO, GEO, and HEO orbits again need more data to show any pattern or trend.
Launch Success Yearly Trend
This figure clearly depicts an increasing trend from 2013 until 2020.
If this trend continues in the coming years, the success rate will steadily increase until it reaches 100%.
All Launch Site Names
We used the keyword DISTINCT to show only unique launch sites from the SpaceX data.
Launch Site Names Begin with 'CCA'
We used the query above to display 5 records where launch sites begin with `CCA`.
Total Payload Mass
We calculated the total payload carried by boosters from NASA as 45596 kg using the query below.
Average Payload Mass by F9 v1.1
We calculated the average payload mass carried by booster version F9 v1.1 as 2928.4 kg.
First Successful Ground Landing Date
We used the MIN() function to find the result.
We observed that the first successful landing outcome on a ground pad was on 22nd December 2015.
Successful Drone Ship Landing with Payload between 4000 and 6000
We used the WHERE clause to filter for boosters that successfully landed on a drone ship and applied an AND condition to keep only landings with a payload mass greater than 4000 but less than 6000.
Total Number of Successful and Failure Mission Outcomes
We used the LIKE wildcard ‘%’ in the WHERE clause to filter for rows where MissionOutcome was a success or a failure.
Boosters Carried Maximum Payload
We determined the boosters that have carried the maximum payload using a subquery in the WHERE clause and the MAX() function.
2015 Launch Records
We used a combination of the WHERE clause and the LIKE, AND, and BETWEEN conditions to filter for failed landing outcomes on a drone ship, their booster versions, and launch site names for the year 2015.
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20
We selected the landing outcomes and the COUNT of landing outcomes from the data and used the WHERE clause to filter for landing outcomes BETWEEN 2010-06-04 and 2017-03-20.
We applied the GROUP BY clause to group the landing outcomes and the ORDER BY clause to order the grouped landing outcomes in descending order.
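The ranking query can be sketched as follows against an in-memory SQLite table. The table name, column names, and rows are hypothetical; dates are stored as ISO-format text so that BETWEEN compares them correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, for illustration only.
conn.execute("CREATE TABLE launches (launch_date TEXT, landing_outcome TEXT)")
conn.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [
        ("2010-06-04", "Failure (parachute)"),
        ("2015-12-22", "Success (ground pad)"),
        ("2016-04-08", "Success (drone ship)"),
        ("2016-05-06", "Success (drone ship)"),
        ("2018-01-01", "Success (drone ship)"),  # outside the date window
    ],
)

# Count outcomes inside the window and rank them, most frequent first.
# The outcome name is a secondary sort key so ties come out in a stable order.
ranking = conn.execute(
    """
    SELECT landing_outcome, COUNT(*) AS outcome_count
    FROM launches
    WHERE launch_date BETWEEN '2010-06-04' AND '2017-03-20'
    GROUP BY landing_outcome
    ORDER BY outcome_count DESC, landing_outcome
    """
).fetchall()

conn.close()
```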
Section 3
Location of all the Launch Sites
We can see that all the SpaceX launch sites are located inside the United States.
Markers showing launch sites with color labels
Launch Sites Distance to Landmarks
Section 4
The success percentage for each site.
The highest launch-success ratio: KSC LC-39A
Payload vs Launch Outcome Scatter Plot
We can see that the success rate for low-weight payloads is higher than for heavy payloads.
Section 5
Classification Accuracy
Using the code below, we identified the decision tree as the best algorithm, with the highest classification accuracy.
Confusion Matrix
The confusion matrix for the decision tree classifier shows that the classifier can distinguish between the different classes. The major problem is false positives, i.e., unsuccessful landings marked as successful by the classifier.
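To make the false-positive count concrete, here is a small sketch with scikit-learn's confusion_matrix. The label vectors are made up for illustration and are not the project's test results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = landed, 0 = did not land.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

# With labels=[0, 1], the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = (
    int(v) for v in confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
)

# Share of predicted landings that actually landed.
precision = tp / (tp + fp)
```

Here two failed landings are predicted as successes (`fp` is 2), which is the kind of error described above.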
Conclusions
We can conclude that:
• The Tree Classifier Algorithm is the best machine learning approach for this dataset.
• Low-weight payloads (defined as 4000 kg and below) performed better than heavy payloads.
• Starting from 2013, the success rate of SpaceX launches increased year over year through 2020, which suggests the launches will eventually be perfected.
• KSC LC-39A has the highest launch success rate of any site: 76.9%.
• The SSO orbit has the highest success rate: 100%, with more than 1 occurrence.