Organized


Arthur Nice Passere

05/17/2024
Outline

• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

Executive Summary
• Summary of methodologies
- Data Collection through API
- Data Collection with Web Scraping
- Data Wrangling
- Exploratory Data Analysis with SQL
- Exploratory Data Analysis with Data Visualization
- Interactive Visual Analytics with Folium
- Machine Learning Prediction
• Summary of all results
- Exploratory Data Analysis result
- Interactive analytics in screenshots
- Predictive Analytics result from Machine Learning Lab
Introduction

SpaceX is a revolutionary company that has disrupted the space industry by offering
rocket launches, specifically Falcon 9, for as little as 62 million dollars, while other
providers charge upward of 165 million dollars each. Most of this saving comes from
SpaceX's idea of reusing the first stage of the launch by landing the rocket so that it
can be used on the next mission. Repeating this process drives the price down even
further. As a data scientist at a startup rivaling SpaceX, the goal of this project is to
create a machine learning pipeline that predicts the landing outcome of the first stage.
This prediction is crucial for identifying the right price to bid against SpaceX for a
rocket launch.
The problems included:
• Identifying all factors that influence the landing outcome.
• The relationships between the variables and how they affect the outcome.
• The best conditions needed to increase the probability of a successful landing.
Section 1

Methodology

Executive Summary
• Data collection methodology:
- Data was collected using the SpaceX REST API and web scraping from Wikipedia
• Perform data wrangling
- Categorical features were processed using one-hot encoding
• Perform exploratory data analysis (EDA) using visualization and SQL
• Perform interactive visual analytics using Folium and Plotly Dash
• Perform predictive analysis using classification models
- How to build, tune, and evaluate classification models

Data Collection

Data collection is the process of gathering and measuring information on targeted
variables in an established system, which then enables one to answer relevant
questions and evaluate outcomes. As mentioned, the dataset was collected through the
REST API and by web scraping from Wikipedia.

For the REST API, we started with a GET request. We then decoded the response
content as JSON and turned it into a pandas dataframe using json_normalize(). Next,
we cleaned the data, checked for missing values, and filled them in where needed.

For web scraping, we used BeautifulSoup to extract the launch records as an
HTML table, parsed the table, and converted it to a pandas dataframe for further
analysis.

Data Collection – SpaceX API

• Send a GET request to the API for rocket launch data
• Use the json_normalize method to convert the JSON result to a dataframe
• Clean the data and fill in the missing values

From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Collection_yJPxhv2oU.ipynb
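
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the endpoint URL, the `payload.mass_kg` column, the mean-fill rule, and the sample records are all assumptions made for the example.

```python
import pandas as pd

# Assumed endpoint; the slides do not state which URL was queried.
SPACEX_URL = "https://api.spacexdata.com/v4/launches/past"


def fetch_launches(url=SPACEX_URL):
    """Send a GET request and return the decoded JSON (a list of dicts)."""
    import requests  # imported lazily so the rest of the sketch runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def launches_to_frame(records):
    """Flatten nested JSON records and fill missing payload masses."""
    df = pd.json_normalize(records)
    mass = pd.to_numeric(df["payload.mass_kg"])
    # Hypothetical cleaning rule: replace missing values with the column mean.
    df["payload.mass_kg"] = mass.fillna(mass.mean())
    return df


# Offline stand-in for the API response (made-up values).
sample = [
    {"flight_number": 1, "payload": {"mass_kg": 500.0}},
    {"flight_number": 2, "payload": {"mass_kg": None}},
    {"flight_number": 3, "payload": {"mass_kg": 1500.0}},
]
launch_df = launches_to_frame(sample)
```

With a network connection, `launches_to_frame(fetch_launches())` would follow the same path, provided the response contains the assumed nested field.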
Data Collection - Scraping

• Request the Falcon 9 launch Wiki page from its URL
• Create a BeautifulSoup object from the HTML response
• Extract all column/variable names from the HTML table header

From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Collection_with_Web_Scraping_nI89VIRCE.ipynb
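
A minimal sketch of the parsing step, assuming the page's HTML has already been downloaded. The helper name `table_to_frame` and the inline HTML are hypothetical stand-ins for the Wikipedia launch table, which is much larger and messier.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html):
    """Parse the first HTML table into a DataFrame (header cells become columns)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Column names come from the <th> cells of the table header.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has no <td> cells
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


# Made-up table standing in for the downloaded page.
sample_html = """
<table>
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload</th></tr>
  <tr><td>1</td><td>Site A</td><td>Demo payload 1</td></tr>
  <tr><td>2</td><td>Site B</td><td>Demo payload 2</td></tr>
</table>
"""
scraped_df = table_to_frame(sample_html)
```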

Data Wrangling

Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and Exploratory Data Analysis (EDA).

We first calculate the number of launches at each site, then the number and
occurrence of mission outcomes per orbit type.

We then create a landing outcome label from the outcome column, which makes
further analysis, visualization, and ML easier. Lastly, we export the result to a CSV.

From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Data_Wrangling_9HnvfsJ5G.ipynb
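
The wrangling steps can be illustrated with a small pandas sketch. The column names (`LaunchSite`, `Orbit`, `Outcome`, `Class`), the outcome strings, and the rows are assumptions chosen for the example, not values taken from the project's dataset.

```python
import pandas as pd

# Made-up records; column names and outcome strings are assumptions.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "CCAFS SLC 40", "KSC LC 39A", "VAFB SLC 4E"],
    "Orbit": ["LEO", "GTO", "GTO", "SSO"],
    "Outcome": ["None None", "False Ocean", "True ASDS", "True RTLS"],
})

# Number of launches at each site.
launches_per_site = df["LaunchSite"].value_counts()

# Number and occurrence of outcomes per orbit type.
outcomes_per_orbit = df.groupby("Orbit")["Outcome"].value_counts()

# Landing label: 0 if the first stage did not land successfully, 1 otherwise.
failed = {"None None", "False Ocean", "False ASDS", "False RTLS"}
df["Class"] = (~df["Outcome"].isin(failed)).astype(int)

# Export to CSV; with no path given, to_csv returns the CSV text.
csv_text = df.to_csv(index=False)
```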

EDA with Data Visualization
We started by using scatter plots to find the relationships between attributes,
such as:
• Payload and Flight Number.
• Flight Number and Launch Site.
• Payload and Launch Site.
• Flight Number and Orbit Type.
• Payload and Orbit Type.

Scatter plots show how the attributes depend on each other. Once a pattern
emerges from the plots, it is much easier to see which factors most affect the
success of the landing outcomes.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_Visualisation_Lab_jJkKVG6F1.ipynb
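
As a rough illustration of one such plot, the sketch below draws Payload against Flight Number and colors each point by landing outcome. The data and column names are made up, and matplotlib is an assumed choice of plotting library.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; column names are assumptions.
df = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4, 5, 6],
    "PayloadMass": [500, 3000, 4500, 2000, 6000, 9000],
    "Class": [0, 0, 1, 0, 1, 1],
})

fig, ax = plt.subplots()
# Red for a failed landing (0), green for a successful one (1).
colors = df["Class"].map({0: "red", 1: "green"}).tolist()
points = ax.scatter(df["FlightNumber"], df["PayloadMass"], c=colors)
ax.set_xlabel("Flight Number")
ax.set_ylabel("Payload Mass (kg)")
ax.set_title("Payload vs. Flight Number by landing outcome")
```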
EDA with Data Visualization
Once the scatter plots hint at the relationships, we use further visualization
tools such as bar graphs and line plots for deeper analysis.
Bar graphs are one of the easiest ways to interpret the relationship between
attributes. In this case, we use a bar graph to determine which orbits have the
highest probability of success.
We then use a line graph to show trends or patterns in an attribute over time,
in this case the yearly launch success trend.
Finally, we apply feature engineering for the later success-prediction module by
creating dummy variables for the categorical columns.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_Visualisation_Lab_jJkKVG6F1.ipynb
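
The numbers behind the bar graph, the line graph, and the dummy variables can be computed as in this sketch. The rows and column names are invented for the example; the resulting series could then be handed to any plotting library.

```python
import pandas as pd

# Made-up records; column names are assumptions.
df = pd.DataFrame({
    "Date": ["2013-03-01", "2014-04-18", "2015-12-22",
             "2016-04-08", "2016-08-14", "2017-06-03"],
    "Orbit": ["ISS", "ISS", "LEO", "ISS", "GTO", "ISS"],
    "Class": [0, 0, 1, 1, 1, 1],
})

# Bar graph input: mean of the 0/1 label is the success rate per orbit.
success_by_orbit = df.groupby("Orbit")["Class"].mean()

# Line graph input: success rate per launch year.
df["Year"] = pd.to_datetime(df["Date"]).dt.year
success_by_year = df.groupby("Year")["Class"].mean()

# Feature engineering: one dummy (one-hot) column per orbit category.
features = pd.get_dummies(df[["Orbit"]], columns=["Orbit"])
```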

EDA with SQL
Using SQL, we ran many queries to get a better understanding of the dataset, for example:
- Displaying the names of the launch sites.
- Displaying 5 records where launch sites begin with the string ‘CCA’.
- Displaying the total payload mass carried by boosters launched by NASA (CRS).
- Displaying the average payload mass carried by booster version F9 v1.1.
- Listing the date when the first successful landing outcome on a ground pad was achieved.
- Listing the names of the boosters that landed successfully on a drone ship and had a payload
mass greater than 4000 but less than 6000.
- Listing the total number of successful and failed mission outcomes.
- Listing the names of the booster_versions that have carried the maximum payload mass.
- Listing the failed landing_outcomes on a drone ship, their booster versions, and launch site
names for the year 2015.
- Ranking the count of landing outcomes (for example, success) between the dates 2010-06-04
and 2017-03-20, in descending order.
https://github.com/farishelmi17/SpaceX/blob/main/notebook:Exploratory_Data_Analysis_with_SQL__eqznon1EA.ipynb
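
A few of these queries are sketched below against an in-memory SQLite database. The table name, column names, and rows are hypothetical and exist only to make the queries runnable; the project's actual schema and database engine may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; rows are made up for illustration.
conn.execute(
    "CREATE TABLE launches (date TEXT, launch_site TEXT, customer TEXT, "
    "booster_version TEXT, payload_mass_kg REAL, landing_outcome TEXT)"
)
conn.executemany("INSERT INTO launches VALUES (?, ?, ?, ?, ?, ?)", [
    ("2010-06-04", "CCAFS LC-40", "Customer A", "F9 v1.0", 0, "Failure (parachute)"),
    ("2013-09-29", "VAFB SLC-4E", "Customer B", "F9 v1.1", 500, "Uncontrolled (ocean)"),
    ("2014-04-18", "CCAFS LC-40", "NASA (CRS)", "F9 v1.1", 2300, "Controlled (ocean)"),
    ("2015-12-22", "CCAFS LC-40", "Customer C", "F9 FT", 2000, "Success (ground pad)"),
    ("2016-05-06", "CCAFS LC-40", "Customer D", "F9 FT", 4700, "Success (drone ship)"),
])


def query(sql, params=()):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


# Unique launch sites.
sites = query("SELECT DISTINCT launch_site FROM launches ORDER BY launch_site")
# Up to 5 records whose launch site begins with 'CCA'.
cca = query("SELECT * FROM launches WHERE launch_site LIKE 'CCA%' LIMIT 5")
# Total payload mass for one customer, average for one booster version.
nasa_total = query(
    "SELECT SUM(payload_mass_kg) FROM launches WHERE customer = ?", ("NASA (CRS)",)
)[0][0]
avg_v11 = query(
    "SELECT AVG(payload_mass_kg) FROM launches WHERE booster_version = 'F9 v1.1'"
)[0][0]
# Earliest successful ground pad landing (ISO dates sort as text).
first_pad = query(
    "SELECT MIN(date) FROM launches WHERE landing_outcome = 'Success (ground pad)'"
)[0][0]
# Drone ship successes with payload strictly between 4000 and 6000.
drone_mid = query(
    "SELECT booster_version FROM launches "
    "WHERE landing_outcome = 'Success (drone ship)' "
    "AND payload_mass_kg > 4000 AND payload_mass_kg < 6000"
)
```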
Build an Interactive Map with Folium
To visualize the launch data on an interactive map, we took the latitude and longitude
coordinates of each launch site and added a circle marker around each site, labeled with
the name of the launch site.

We then assigned the launch outcomes in the dataframe (failure, success) to classes 0 and 1,
shown as red and green markers on the map in a MarkerCluster().

We then used the haversine formula to calculate the distance from the launch sites to
various landmarks, to answer the following questions:
• How close are the launch sites to railways, highways and coastlines?
• How close are the launch sites to nearby cities?
From: https://github.com/farishelmi17/SpaceX/blob/main/notebook:Interactive_Visual_Analytics_with_Folium_M8uUhCmHY.ipynb
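
For reference, a minimal sketch of the haversine distance between two latitude/longitude points. The function name and the 6371 km mean Earth radius are choices made for this example, and the coordinates in the usage line are illustrative rather than actual site locations.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


# Illustrative coordinates only: a site and a nearby point on the coast.
distance_to_coast = haversine_km(28.5, -80.6, 28.5, -80.5)
```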
Build a Dashboard with Plotly Dash

• We built an interactive dashboard with Plotly Dash that lets users explore the data
as they need.
• We plotted pie charts showing the total launches by site.
• We then plotted a scatter graph showing the relationship between Outcome and Payload
Mass (kg) for the different booster versions.

The link to app.py: https://github.com/farishelmi17/SpaceX/blob/main/spacex_dash_app.py

Predictive Analysis (Classification)

Building the Model
• Load the dataset into NumPy and Pandas
• Transform the data and then split it into training and test datasets
• Decide which type of ML to use
• Set the parameters and algorithms for GridSearchCV and fit it to the dataset

Evaluating the Model
• Check the accuracy of each model
• Get the tuned hyperparameters for each type of algorithm
• Plot the confusion matrix

Improving the Model
• Use Feature Engineering and Algorithm Tuning

Find the Best Model
• The model with the best accuracy score is the best-performing model

From: https://github.com/farishelmi17/SpaceX/blob/main/spacex_das
h_app.py
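
The workflow above might look roughly like the following scikit-learn sketch. The synthetic data, the two candidate algorithms, and the parameter grids are assumptions for illustration; the slides do not list the models or grids that were actually searched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the launch features and the 0/1 landing label.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Transform the data, then split it (the order given in the slide).
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Hypothetical candidates and grids.
candidates = {
    "logistic_regression": (LogisticRegression(), {"C": [0.01, 0.1, 1.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 6]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)  # 5-fold cross-validated search
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_accuracy": search.score(X_test, y_test),
    }

# The model with the highest test accuracy is taken as the best performer.
best_model = max(results, key=lambda n: results[n]["test_accuracy"])
```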

Results

The results fall into three main categories:

• Exploratory data analysis results
• Interactive analytics demo in screenshots
• Predictive analysis results

Section 2
Flight Number vs. Launch Site

This scatter plot shows that the larger the number of flights at a launch site,
the greater the success rate. However, site CCAFS SLC40 shows this pattern
least clearly.

Payload vs. Launch Site

This scatter plot shows that once the payload mass is greater than 7000 kg,
the probability of success increases markedly.

However, there is no clear pattern indicating that the success rate at a launch
site depends on the payload mass.

Success Rate vs. Orbit Type

This figure shows that the orbit type may influence the landing outcome: some
orbits, such as SSO, HEO, GEO and ES-L1, have a 100% success rate, while the SO
orbit has a 0% success rate.
However, deeper analysis shows that some of these orbits, such as GEO, SO, HEO
and ES-L1, have only 1 occurrence, which means more data is needed to see a
pattern or trend before we draw any conclusion.

Flight Number vs. Orbit Type

This scatter plot shows that, in general, the larger the flight number in each
orbit, the greater the success rate (especially for the LEO orbit), except for
the GTO orbit, which shows no relationship between the two attributes.

Orbits with only 1 occurrence should also be excluded from the statement above,
as they need more data.

Payload vs. Orbit Type
Heavier payloads have a positive impact on the LEO, ISS and PO orbits.
However, they have a negative impact on the MEO and VLEO orbits.
The GTO orbit seems to show no relationship between the attributes.

Meanwhile, the SO, GEO and HEO orbits again need more data before any
pattern or trend can be seen.

Launch Success Yearly Trend
This figure clearly shows an increasing trend from the year 2013 until 2020.
If this trend continues in the coming years, the success rate will keep
increasing steadily toward a 100% success rate.

All Launch Site Names

We used the keyword DISTINCT to show only unique launch sites
from the SpaceX data.

Launch Site Names Begin with 'CCA'

We queried for 5 records where the launch site names begin with `CCA`.

Total Payload Mass

We calculated the total payload mass carried by boosters launched by NASA (CRS)
as 45596 kg.

Average Payload Mass by F9 v1.1

We calculated the average payload mass carried by booster version F9 v1.1 as 2928.4

First Successful Ground Landing Date
We used the min() function to find the result.
We observed that the date of the first successful landing outcome on a ground
pad was 22nd December 2015.

Successful Drone Ship Landing with Payload between 4000 and 6000

We used the WHERE clause to filter for boosters that successfully landed on a
drone ship, and applied the AND condition to keep only the landings with a
payload mass greater than 4000 but less than 6000.

Total Number of Successful and Failure Mission Outcomes

We used the LIKE operator with the ‘%’ wildcard in the WHERE clause to match MissionOutcome values that indicate a success or a failure.

Boosters Carried Maximum Payload

We determined the boosters that have carried the maximum payload using a
subquery in the WHERE clause and the MAX() function.
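
A minimal sketch of that pattern, run against an in-memory SQLite table. The table name, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with made-up rows.
conn.execute("CREATE TABLE launches (booster_version TEXT, payload_mass_kg REAL)")
conn.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [("B1", 5000), ("B2", 15600), ("B3", 15600), ("B4", 9000)],
)

# The subquery finds the maximum payload; the outer WHERE keeps every
# booster that carried exactly that mass (there may be more than one).
max_boosters = conn.execute(
    "SELECT DISTINCT booster_version FROM launches "
    "WHERE payload_mass_kg = (SELECT MAX(payload_mass_kg) FROM launches) "
    "ORDER BY booster_version"
).fetchall()
```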

2015 Launch Records

We used a combination of the WHERE clause and the LIKE, AND, and BETWEEN
conditions to filter for failed landing outcomes on a drone ship, their booster
versions, and launch site names for the year 2015.

Rank Landing Outcomes Between 2010-06-04 and 2017-03-20

We selected the landing outcomes and the COUNT of landing outcomes from the
data and used the WHERE clause to filter for landing outcomes BETWEEN
2010-06-04 and 2017-03-20.
We applied the GROUP BY clause to group the landing outcomes and the ORDER BY
clause to order the grouped landing outcomes in descending order.
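
The ranking query can be sketched as follows on an in-memory SQLite table. The schema and rows are made up; the dates are stored as ISO strings so that BETWEEN compares them correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with made-up rows.
conn.execute("CREATE TABLE launches (date TEXT, landing_outcome TEXT)")
conn.executemany("INSERT INTO launches VALUES (?, ?)", [
    ("2010-06-04", "Failure (parachute)"),
    ("2012-05-22", "No attempt"),
    ("2014-01-06", "No attempt"),
    ("2015-12-22", "Success (ground pad)"),
    ("2016-04-08", "Success (drone ship)"),
    ("2016-07-18", "Success (ground pad)"),
    ("2016-08-14", "No attempt"),
    ("2018-02-06", "Success (ground pad)"),  # outside the date range
])

# Count each outcome inside the date range and rank by count, highest first.
# The outcome name is a secondary sort key so ties come back in a fixed order.
ranked = conn.execute(
    "SELECT landing_outcome, COUNT(*) AS n FROM launches "
    "WHERE date BETWEEN '2010-06-04' AND '2017-03-20' "
    "GROUP BY landing_outcome "
    "ORDER BY n DESC, landing_outcome"
).fetchall()
```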

Section 3
Location of all the Launch Sites
We can see that all the SpaceX launch sites are located inside the United States.

Markers showing launch sites with color labels

Launch Sites Distance to Landmarks

Section 4
The success percentage for each site

The highest launch-success ratio: KSC LC-39A

Payload vs Launch Outcome Scatter Plot
We can see that the success rate for low-weight payloads is higher than for heavy payloads.

Section 5
Classification Accuracy
Comparing the classification accuracy of each model, we identified the Tree Algorithm as the
best algorithm, since it has the highest classification accuracy.

Confusion Matrix
The confusion matrix for the decision tree classifier shows that the classifier can
distinguish between the different classes. The major problem is the false positives,
i.e., unsuccessful landings marked as successful landings by the classifier.
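
To make the false-positive count concrete, here is a small scikit-learn sketch. The labels are invented for illustration and are not the project's test-set results.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = did not land, 1 = landed.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes and columns are predicted classes, in label order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
# fp counts unsuccessful landings that the classifier marked as successful.
```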

Conclusions

We can conclude that:

• The Tree Classifier Algorithm is the best Machine Learning approach for this dataset.
• Low-weight payloads (defined as 4000 kg and below) performed better than heavy
payloads.
• Starting from the year 2013, the success rate of SpaceX launches increased year
over year through 2020; if this continues, launches will eventually approach a
perfect success rate.
• KSC LC-39A has the most successful launches of any site: 76.9%.
• The SSO orbit has the highest success rate: 100%, with more than 1 occurrence.

