Analyzing Orbital Parameters To Classify Asteroids

Predict whether a Near-Earth Asteroid is potentially hazardous or not based on its
characteristics and identify which NEA categories are more likely to be PHAs.
By: Avishka Edirisuriya (coadds211f-011)
BACHELOR OF SCIENCE (HONS) IN DATA SCIENCE
NATIONAL INSTITUTE OF BUSINESS MANAGEMENT (NIBM)
Colombo, Sri Lanka
Date of submission: 30th April 2023
DECLARATION
I hereby declare that the work presented in this project report was carried out independently by
myself and that I have cited the work of others and given due reference diligently.
………………………
Date
……………………....
Avishka Edirisuriya
I certify that the above student carried out his/her project under my supervision and guidance.
………………………
Date
……………………....
Thurairasa Balakumar
Acknowledgment
I would like to express my gratitude to the individuals who contributed to the success of this
project. First and foremost, I am deeply grateful to my research supervisor, Mr. Thurairasa
Balakumar, for his exceptional guidance, patience, and continuous support, without which this
project would not have been possible.
I would also like to thank Dr. (Mrs.) M.D.T. Atyagalle, the Head of the Department of Statistics,
as well as all the senior lecturers, lecturers, and demonstrators of the department for their
guidance and assistance throughout this project. Additionally, I extend my sincere thanks to our
course coordinator and Senior Lecturer, Dr. (Mrs.) J.H.D.S.P. Tissera, for leading me on the
right path.
I am grateful to my friends for their valuable ideas, motivation, and willingness to help whenever
I needed. Finally, I am thankful to my parents, and family members, for their unwavering
support and encouragement throughout my studies.
Executive Summary
This project aimed to develop a logistic regression model that can accurately predict whether a
near-Earth asteroid (NEA) is potentially hazardous or not. The dataset that was used included
information about NEAs' diameter, orbital characteristics, proximity to Earth, and whether or not
they were classified as potentially hazardous. The logistic regression model achieved an overall
accuracy of 82.3%. However, the comprehensive investigation shows that the model's sensitivity
and specificity for identifying potentially hazardous asteroids (PHAs) were not as high as
intended, with a significant number of false negatives and a low number of true positives.
Further investigation revealed that the NEA class Atiras, followed by Apollos and Atens, has the
highest likelihood of being a PHA, while Amors has the lowest likelihood. Overall, the
researchers concluded that logistic regression models have the capacity to correctly predict the
potential hazards of NEAs based on their characteristics, although more study and improvement
may be required to increase the model's sensitivity and specificity for detecting PHAs.
Abstract
A near-Earth asteroid (NEA) is an asteroid whose orbit around the Sun brings it into proximity
with the Earth. In astronomical terminology, a near-Earth object (NEO) has a trajectory that
brings it within 1.3 astronomical units of the Sun and hence within 0.3 astronomical units of
the Earth's orbit. Many
asteroids are classified as potentially hazardous asteroids (PHAs) because they have orbits that
bring them close to the Earth and have the potential to collide with our planet. In this research
study, I aimed to investigate the connection between orbital parameters and the hazardousness
of asteroids by analyzing a large dataset of near-Earth asteroids from NASA’s Center for
Near-Earth Object Studies, which was published on Kaggle. I used statistical techniques, such as
regression and hypothesis testing, to analyze the data and identify trends or patterns in the orbital
parameters of PHAs. The findings of this research will have important implications for the
detection and mitigation of PHAs and will contribute to our understanding of the role of orbital
parameters in the hazardousness of asteroids.
Contents
Chapter 1 : Introduction
1.1 Background
1.2 Research problem
1.3 Objectives
1.4 Research questions
1.5 Scope of the research
1.6 Justification of the research
1.7 Expected limitations
1.8 Proposed work schedule
Chapter 2 : Literature Review
2.1 Introduction to the research theme
2.2 Theoretical explanation of the keywords in the topic
2.3 Findings by other researchers
2.4 Research gap
2.5 Table of Variables
Chapter 3 : Data Preparation Process - Data Pre-processing and Data Wrangling
3.1 Data acquisition
3.2 Data pre-processing
3.3 Data Wrangling
Chapter 4: Methodology
4.1 Introduction
4.2 Population, Sample, and Sampling technique
4.3 Type of Data to be Collected and Data Sources
4.4 Data collection tools and plan
4.5 Methods, Techniques, and Tools
Chapter 5 : Data Analysis, Visualization, and Interpretation
5.1 Descriptive Statistics
5.2 Feature Selection
5.3 Regression Analysis
Chapter 6: Discussion & Conclusion
6.1 Discussion
References
Table of Tables
Table 1-1 Proposed Work Schedule
Table 2-1 Theoretical explanation of the keywords in the topic
Table 2-2 Table of Variables
Table of Figures
Figure 1 Before removing unwanted columns
Figure 2 After removing unwanted columns
Figure 3 Count of null value per columns in data1_new
Figure 4 Added new column Nea_Class to the data frame
Figure 5 Added new column Hazardous to the data frame
Figure 6 After concatenating PHA data frame in to data1_new
Figure 7 After concatenating Nea data frame into data2_new
Figure 8 Count of Null Values per Column in Data1_new & Data2_new
Figure 9 Summary statistics of data1_new
Figure 10 Histograms of Numerical variables
Figure 11 Correlation Heatmap of numerical pairs
Figure 12 Top Negative & Positive strongest correlation coefficient
Figure 13 Most impactful features for prediction
Figure 14 Train Logistic Regression Model
Figure 15 Generate Prediction
Figure 16 Prediction Data Frame
Figure 17 Accuracy of Logistic Regression model
Figure 18 Classification Report
Figure 19 Confusion Matrix
Figure 20 Train regression model
Figure 21 Predict the probabilities
Figure 22 New data frame with Predicted Probability
Figure 23 Rank the categories by their predicted probabilities
Chapter 1 : Introduction
1.1 Background
Near-Earth Asteroids are any small solar system bodies that orbit the sun and pass relatively
close to the Earth’s orbit. They are pushed into orbits that allow them to enter Earth’s
neighborhood due to the gravitational pull of nearby planets. These asteroids can originate from
various places, such as the main asteroid belt between Mars and Jupiter, and can be made of
different substances including metal, rock, and ice. In technical terms, an asteroid is an NEA if
its perihelion distance is less than 1.3 AU, which is approximately 195 million kilometers.
They are separated into several groups according to their size, shape, and orbital parameters.
According to NASA’s Center for Near Earth Object Studies, the most common classification
scheme for NEAs is the Apollo, Amor, Aten, and Atira categories, which are based on the
asteroid's orbital distance from the Sun.
Some NEAs are also classified as potentially hazardous asteroids (PHAs). A PHA is an asteroid
whose orbit passes close to Earth’s orbit and that is larger than about 140 meters (460 feet) in
diameter. In technical terms, an NEA is a potentially hazardous asteroid if its minimum orbit
intersection distance (MOID) is less than 0.05 AU and its absolute
magnitude is equal to or less than 22. PHAs are thus a subset of NEAs that have the potential to
impact the Earth and cause tremendous damage.
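The two PHA criteria quoted above can be written as a short check. The sketch below is illustrative only and is not the code used in this project. The function names are hypothetical, and the diameter estimate relies on the standard relation D = 1329 km / sqrt(albedo) x 10^(-H/5) with an assumed albedo of 0.14, neither of which appears in the dataset. Under that assumption, an absolute magnitude of 22 corresponds to a diameter of roughly 140 meters, which is why the two thresholds are usually quoted together.

```python
import math

MOID_LIMIT_AU = 0.05  # MOID threshold quoted in the text
H_LIMIT = 22.0        # absolute-magnitude threshold quoted in the text


def is_pha(moid_au: float, abs_magnitude: float) -> bool:
    """Return True when both PHA criteria from the text are met."""
    return moid_au < MOID_LIMIT_AU and abs_magnitude <= H_LIMIT


def diameter_km(abs_magnitude: float, albedo: float = 0.14) -> float:
    """Estimate a diameter from H; the albedo of 0.14 is an assumption."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-abs_magnitude / 5.0)


print(is_pha(0.03, 21.5))                # close orbit and bright enough
print(is_pha(0.03, 24.0))                # close orbit, but too faint
print(round(diameter_km(22.0) * 1000))   # approximate diameter in meters
```

A smaller absolute magnitude means a brighter, and therefore usually larger, asteroid, so the "equal to or less than 22" condition acts as a lower bound on size.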
NEAs are typically discovered by ground-based survey telescopes and by space-based telescopes
such as the Wide-field Infrared Survey Explorer (WISE). Once an NEA is discovered, its orbit is
tracked and monitored to determine its
potential risk to Earth. NEAs can also be used for space exploration. For example, some NEAs
have been proposed as a potential destination for robotic spacecraft, as they can provide a
relatively easy and safe target for exploration.
Scientists study NEAs to learn more about the solar system and to understand these asteroids'
potential benefits and drawbacks. In general, the study of orbital parameters is crucial to asteroid
exploration since it clarifies the characteristics and behavior of these small bodies and the risks
they can pose to the Earth.
1.2 Research problem
The classification of NEAs and the identification of PHAs is already an active area of research
and will continue to grow in the future. There is still much to learn about these tiny bodies and
their possible impact on the Earth.
This study analyzes the orbital properties of near-Earth asteroids to learn more about them. Orbital
parameters are the characteristics that describe the shape and position of an asteroid's orbit
around the sun. Some examples of orbital parameters are perihelion distance, aphelion distance,
the semi-major axis, and eccentricity.
This study involves classifying Near-Earth Asteroids according to standard methods from
NASA’s Center for Near-Earth Object Studies, predicting whether a Near-Earth Asteroid is
potentially hazardous or not based on its characteristics, and identifying which NEA categories
are more likely to be PHAs.
1.3 Objectives
More specific objectives derived from the research problem are listed below.
• Explore the dataset to gain insights into the distribution of asteroid characteristics.
• Identify which NEA categories are more likely to be PHAs.
• Identify which asteroid characteristics are most strongly associated with whether an
asteroid is potentially hazardous or not.
• Train a logistic regression model to predict whether an asteroid is potentially hazardous or
not based on its characteristics.
• Evaluate the performance of the logistic regression model.
1.4 Research questions
The following questions will be answered at the conclusion of this study.
• Which asteroid characteristics have the strongest association with whether an asteroid is
potentially hazardous or not?
• Is it possible to accurately predict whether an asteroid is potentially hazardous or not using
its characteristics?
• What is the performance of the logistic regression model in predicting whether an asteroid is
potentially hazardous or not?
• Are there any subgroups of asteroids with higher or lower risk of being potentially
hazardous?
1.5 Scope of the research
The primary purpose of this research is to analyze orbital parameters and other properties of
asteroids, such as relative velocity, to classify Near-Earth Asteroids, determine which category is
most likely to be potentially hazardous, and predict whether a Near-Earth Asteroid is potentially
hazardous or not based on its characteristics. This could involve using standard asteroid
classifying criteria used by NASA’s Center for Near Earth Object Studies, statistical methods,
and other analytical approaches. This research could help better understand the interconnections
of orbital and non-orbital parameters and the potential hazardousness of Near-Earth Asteroids.
1.6 Justification of the research
In the case of predicting whether a Near-Earth Asteroid is potentially hazardous or not, there
may be a variety of factors that affect an asteroid's orbital characteristics and hazardousness,
and statistical approaches can be used to discover patterns and correlations that may not be
immediately apparent. This can help to gain a deeper understanding of the processes that
underlie asteroids' behavior and the dangers that they might pose to Earth.
In addition, statistical analysis enables researchers to test hypotheses and make predictions
regarding asteroids' behavior based on their orbital parameters. This will aid future research and
policy decisions in classifying NEAs and identifying relationships between orbital and non-
orbital factors.
Overall, predicting whether a Near-Earth Asteroid is potentially hazardous or not using statistical
methods can offer a more comprehensive and profound understanding of NEAs and their
categories, and can facilitate more effective decision-making.
1.7 Expected limitations
Statistical analysis is only as good as the data that is used to perform it, and the quantity of
data that is currently accessible for the study of NEAs may be constrained. In addition, statistical
analysis can only estimate the relationships between variables in terms of probability. There is
always a chance that these estimates will not be precisely accurate, and the outcomes cannot be
guaranteed.
Only asteroids with circular and elliptical orbits are considered in this study, because they are
more likely to be NEAs. Asteroids with parabolic or hyperbolic trajectories are therefore
disregarded.
Furthermore, statistical analysis may not be able to capture all the complexity of the systems
being studied. Statistical analysis is only one tool that can be used to study NEAs, and it may not
be appropriate for all research objectives or for all types of discoveries. It may be necessary to
use additional techniques, such as simulations or observational research, to enhance,
complement, or supplement the findings of statistical analysis.
Although statistical research can provide insights into the relationships between orbital
parameters and non-orbital parameters and the hazardousness of NEAs, it is crucial to be aware
of its limitations. Consequently, the data obtained to conduct this research is subject to these
limitations.
1.8 Proposed work schedule
Schedule period (by week):
December: 3rd week, 4th week
January: 1st week, 2nd week, 3rd week, 4th week
February: 1st week, 2nd week, 3rd week, 4th week
Tasks (as listed in the schedule):
Topic selection
Data acquisition
Study articles & research papers
Project proposal submission
Data cleaning
Data analyzing
First draft preparation
Final report submission
Table 1-1 Proposed Work Schedule
Chapter 2 : Literature Review
2.1 Introduction to the research theme
NEAs are small, rocky objects that orbit the Sun and are believed to be remnants from the inner
solar system. Understanding the characteristics and behavior of asteroids is a crucial area of
research in planetary science, since it can give insight into the origins and evolution of the solar
system and help determine the likelihood of asteroid impacts.
Several techniques have been developed to categorize NEAs and identify PHAs. One common
method is based on the asteroids' orbital characteristics, such as semi-major axis, aphelion
distance, and perihelion distance. Studies have investigated the characteristics of NEAs and
PHAs using a range of data sources, including observations from ground-based telescopes,
space-based telescopes, and spacecraft missions.
The study of orbital characteristics and asteroids' hazardousness using statistical methods has
gained increasing attention in recent years. This research theme aims to understand how orbital
parameters can be used to classify asteroids and investigate interconnections between orbital and
non-orbital parameters and hazardousness. Researchers can develop models to discover patterns
and trends that may be helpful in categorizing asteroids and determining their hazardousness by
studying data on the orbital characteristics of known NEAs.
As a result, this chapter addresses previous studies and discusses the approaches used as well as
the research gap that was found.
2.2 Theoretical explanation of the keywords in the topic
Keywords | Theoretical definition
Near Earth Asteroids (NEA) | A near-Earth asteroid is a type of asteroid that orbits the Sun and comes relatively close to the Earth's orbit.
Potentially Hazardous Asteroids (PHA) | A potentially hazardous asteroid is a near-Earth asteroid that has the potential to impact the Earth and cause significant damage or destruction. PHAs are a subset of NEAs.
Astronomical Unit (AU) | An astronomical unit (AU) is a unit of distance used to measure distances within the solar system. It is defined as the average distance between the Earth and the Sun, which is approximately 150 million kilometers.
Inner solar system | The inner solar system is the region of the solar system that includes the four terrestrial planets (Mercury, Venus, Earth, and Mars) and the asteroid belt.
Spectrophotometry | Spectrophotometry is a method used to measure the absorbance of light by a sample as a function of the wavelength of the light.
Ascending node's longitude | The ascending node's longitude is the angle, measured in the reference plane (the ecliptic) from the vernal equinox, to the point where the object's orbit crosses that plane from south to north.
Argument of perihelion | The argument of perihelion is the angle, measured in the object's orbital plane, from the ascending node to the point of the orbit closest to the central body (the Sun).
Bus-DeMeo taxonomy | The Bus-DeMeo taxonomy is an asteroid classification system designed by Francesca DeMeo, Schelte Bus, and Stephen Slivan in 2009. It is based on reflectance spectrum characteristics.
Amor | Earth-approaching NEAs with orbits exterior to Earth’s but interior to Mars’s. (Named after the asteroid (1221) Amor)
Apollo | Earth-crossing NEAs with semi-major axis larger than Earth’s semi-major axis. (Named after the asteroid (1862) Apollo)
Aten | Earth-crossing NEAs with semi-major axis smaller than Earth’s semi-major axis. (Named after the asteroid (2062) Aten)
Atira | NEAs whose orbits are contained entirely within the orbit of the Earth. (Named after the asteroid (163693) Atira)
Table 2-1 Theoretical explanation of the keywords in the topic
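The four NEA group definitions in the table above can be expressed as a small rule-based function. This is a minimal sketch rather than the project's own code. It assumes the numeric boundaries commonly quoted by NASA’s Center for Near Earth Object Studies (Earth's perihelion at about 0.983 AU and aphelion at about 1.017 AU), which are not stated in this report, and the function name and the example orbital elements are illustrative.

```python
EARTH_PERIHELION_AU = 0.983  # assumed boundary value (not from the report)
EARTH_APHELION_AU = 1.017    # assumed boundary value (not from the report)
NEA_PERIHELION_LIMIT_AU = 1.3


def nea_class(semi_major_axis_au: float,
              perihelion_au: float,
              aphelion_au: float) -> str:
    """Assign an NEA group from semi-major axis, perihelion and aphelion."""
    if perihelion_au >= NEA_PERIHELION_LIMIT_AU:
        return "Not an NEA"
    if semi_major_axis_au < 1.0:
        # Orbit smaller than Earth's: entirely inside it (Atira) or crossing it (Aten)
        return "Atira" if aphelion_au < EARTH_PERIHELION_AU else "Aten"
    # Orbit larger than Earth's: crossing it (Apollo) or staying outside it (Amor)
    return "Apollo" if perihelion_au < EARTH_APHELION_AU else "Amor"


# Illustrative, rounded orbital elements (a, q, Q) in AU
print(nea_class(1.47, 0.65, 2.29))  # Earth-crossing, a > 1 AU
print(nea_class(1.92, 1.08, 2.76))  # approaches Earth from outside
print(nea_class(0.97, 0.79, 1.14))  # Earth-crossing, a < 1 AU
print(nea_class(0.74, 0.50, 0.98))  # entirely inside Earth's orbit
```

Only three quantities are needed, which is why the semi-major axis, perihelion distance, and aphelion distance are sufficient to derive a class label for every asteroid in a dataset.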
2.3 Findings by other researchers
There are numerous resources available for further study about classifying asteroids using their
orbital parameters and examining the relationship between those parameters and the
hazardousness of asteroids using statistical techniques. Some potential sources of information
include academic journals, online databases, repositories, and websites.
A 2019 study by Kathleen Jacinda Mcintyre, “Applying Machine Learning to Asteroid
Classification Utilizing Spectroscopically Derived Spectrophotometry”, applied a machine
learning algorithm to spectrophotometric data of asteroids and compared the results with
current Bus-DeMeo taxonomy (DeMeo, et al. 2009) categorization efforts. It then
investigated the relationship between classified asteroids and observational characteristics by
integrating historical methods and perspectives with modern computing capabilities.
The research titled “Prediction of orbital parameters for undiscovered potentially hazardous
asteroids using machine learning” (2018) published by Vadym Pasko analyzed PHAs in the
Amor group, and the results showed a relationship between the density of the PHA distribution
and the ascending node's longitude and argument of perihelion.
Another study, “Radar imaging and physical characterization of Near-Earth Asteroid
(162421) 2000 ET70”, published in the journal Icarus in 2013, showed that a reliable trajectory
for NEA (162421) 2000 ET70 could be predicted by utilizing precise measurements of the
asteroid's range and velocity, which are non-orbital parameters. These were measured by
obtaining continuous wave spectra and range-Doppler
images with range resolutions as fine as 15 m. NEA (162421) 2000 ET70 was observed during
its closest approach to Earth in February 2012 over a period of 12 days by using the Arecibo and
Goldstone radar systems.
These are merely a few instances of the various types of discoveries that have been made in
asteroid studies in recent years. More information on
this subject can be found in online databases and repositories like the Minor Planet Center and
the NASA Astrophysics Data System (ADS), as well as in academic publications like the
Planetary Science Journal and the Monthly Notices of the Royal Astronomical Society.
2.4 Research gap
There are various types of research about celestial bodies like planets, exoplanets, comets, and
asteroids. Nevertheless, as far as I know, only a few studies classify NEAs and predict the
hazardousness of Near-Earth Asteroids based on their orbital and non-orbital parameters by using
statistical approaches. Kathleen Jacinda Mcintyre’s research leaves a gap because the
author considered only spectrophotometric data to identify and categorize asteroids, and
used the Bus-DeMeo classification method, an asteroid taxonomic system that
relies only on reflectance spectrum characteristics.
Furthermore, the study by Vadym Pasko leaves a gap because the author
investigated only one NEA group, the Amors, and its results are limited to the
relationship between the density of the PHA distribution and several orbital parameters.
The research named “Radar imaging and physical characterization of near-Earth Asteroid
(162421) 2000 ET70” leaves a gap because the author investigated only a single
NEA that can be potentially hazardous. In addition, critical orbital parameters were not
considered and no major statistical approaches were taken.
Finally, there is still much we don't know about the relationship between asteroids' orbital
properties and their hazardousness. While some studies have found correlations between certain
orbital parameters and the likelihood of an asteroid impacting Earth, there is still a lot to be
discovered about these relationships. Therefore, further research is needed to reveal other
insightful information about these interconnections.
2.5 Table of Variables
Variable name | Theoretical definition | Result meaning
NEO Reference ID | Reference ID refers to a unique identifier assigned to a specific NEA. | Reference ID
Absolute magnitude | For an asteroid, absolute magnitude is defined as the apparent magnitude the NEA would have if it were located 1 AU from both the Sun and the observer, at zero phase angle. | Absolute magnitude
Minimum orbit intersection distance (MOID) | MOID is defined as the minimum distance between the orbits of the NEA and the Earth at any point in their respective cycles. | MOID in astronomical units
Semi-major axis | It is defined as half of the length of the longest line that can be drawn through the NEA’s orbit. | Semi-major axis length in astronomical units
Perihelion distance | The perihelion distance of an NEA is the distance between the asteroid and the Sun at the point in its orbit when it is closest to the Sun. | Perihelion distance in astronomical units
Aphelion distance | The aphelion distance of an NEA is the distance between the asteroid and the Sun at the point in its orbit when it is farthest from the Sun. | Aphelion distance in astronomical units
Eccentricity | The eccentricity of an asteroid is a measure of how much its orbit around the Sun deviates from a perfect circle. It is a value between 0 and 1, with 0 representing a perfect circle and values closer to 1 indicating a highly eccentric orbit. | Value between 0 and 1
Relative velocity | The speed at which an asteroid is moving relative to a frame of reference, such as the Earth or the Sun, is known as its relative velocity. It is a measurement of the asteroid's speed and direction of motion. | Relative velocity in km/h
Hazardousness | This indicates whether a near-Earth asteroid can cause a significant impact on the Earth. | PHA, Non-PHA
NEA-Class | This indicates which category the NEA belongs to from the groups of Apollo, Amor, Aten, and Atira. | Apollo, Amor, Aten, Atira
Asc Node Longitude | The longitude of the point at which an asteroid's orbit crosses the ecliptic plane (the plane of Earth's orbit around the Sun) from south to north, measured as an angular distance from the vernal equinox. The vernal equinox is a reference point on the celestial sphere: it is the point where the ecliptic intersects the celestial equator (the imaginary line in the sky directly above the Earth's equator) and defines the start of the celestial coordinate system. | Angle measured in degrees
Perihelion Arg | It describes the orientation of the asteroid's orbit within its orbital plane. It is the angle between the asteroid's ascending node (the point where the asteroid crosses the ecliptic from south to north) and the point where the asteroid reaches perihelion (its closest approach to the Sun). | Angle measured in degrees
Mean Anomaly | Mean anomaly describes the position of a celestial object, such as a Near-Earth Asteroid (NEA), along its elliptical orbit. It is the angle from perihelion (the point in its orbit where it is closest to the Sun) that the object would have covered if it moved at a constant angular speed since its last perihelion passage, measured in the plane of the object's orbit. | Angle measured in degrees
Mean Motion | Average angular speed at which an asteroid travels in its orbit around the Sun. | Average angular speed of an asteroid in its orbit around the Sun
Est Dia in KM (min) | Minimum estimated diameter of the Near-Earth Asteroid (NEA) in kilometers. | Estimated diameter in kilometers
Est Dia in KM (max) | Maximum estimated diameter of the Near-Earth Asteroid (NEA) in kilometers. | Estimated diameter in kilometers
Inclination | Inclination is a measure of the tilt or angle of an asteroid's orbit relative to the plane of the ecliptic, which is the plane of Earth's orbit around the Sun. | Angle measured in degrees
Miss Dist. (Astronomical) | The closest distance between the NEO and the Earth, in astronomical units (AU), at the time of closest approach. | Distance in astronomical units
Table 2-2 Table of Variables
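Several of the variables defined above are not independent of one another. For an elliptical orbit around the Sun, the perihelion and aphelion distances follow from the semi-major axis a and the eccentricity e as q = a(1 - e) and Q = a(1 + e), and Kepler's third law links a to the orbital period and hence to the mean motion. The sketch below illustrates these standard relations. The function names are hypothetical, and the use of 365.25 days per year and of degrees per day for mean motion are assumptions made for the example rather than properties confirmed for the dataset.

```python
DAYS_PER_YEAR = 365.25  # assumed year length for the example


def perihelion_au(a_au: float, e: float) -> float:
    """Closest distance to the Sun: q = a(1 - e)."""
    return a_au * (1.0 - e)


def aphelion_au(a_au: float, e: float) -> float:
    """Farthest distance from the Sun: Q = a(1 + e)."""
    return a_au * (1.0 + e)


def orbital_period_days(a_au: float) -> float:
    """Kepler's third law for a solar orbit: P [years] = a [AU] ** 1.5."""
    return DAYS_PER_YEAR * a_au ** 1.5


def mean_motion_deg_per_day(a_au: float) -> float:
    """Average angular speed: one full revolution (360 degrees) per period."""
    return 360.0 / orbital_period_days(a_au)


a, e = 2.0, 0.5  # illustrative values, not taken from the dataset
print(perihelion_au(a, e), aphelion_au(a, e))
print(orbital_period_days(a), mean_motion_deg_per_day(a))
```

Because of these relations, strong pairwise correlations between the semi-major axis, orbital period, mean motion, and aphelion distance are to be expected in any dataset of orbital elements.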
Chapter 3 : Data Preparation Process - Data Pre-processing and Data Wrangling
3.1 Data acquisition
The first step in the data preparation process is to acquire the data needed for the research. This
study uses a dataset published on "Kaggle", which makes it a secondary data source. The original
owner of the data in this dataset, however, is NASA's Jet Propulsion Laboratory.
3.2 Data pre-processing
Once the data has been acquired, it may need to be pre-processed to make it suitable for analysis.
This may involve cleaning the data by removing errors and inconsistencies and by handling missing
values, using methods such as outlier identification, imputation, or interpolation. After the data
has been cleaned, it may need to be transformed into a suitable format for analysis, for example by
converting data from one type to another (such as from string to numeric) or by scaling data to a
common range.
3.2.1 Remove unwanted columns.
The first data cleaning step, carried out after importing the required libraries and the dataset,
was to remove unnecessary columns. Figure 1 shows the complete dataset prior to data cleaning,
which consisted of 39 columns. After cleaning, as shown in Figure 2, the number of columns was
reduced to 17. The columns not contained in a specified list were removed from the original data
frame, and the result was copied into a new data frame called ‘data1_new’. The columns specified
for inclusion are: “Neo Reference ID”, “Absolute Magnitude”, “Est Dia in KM(min)”,
“Relative Velocity km per hr”, “Est Dia in KM(max)”, “Miss Dist.(Astronomical)”, “Minimum
Orbit Intersection”, “Eccentricity”, “Semi Major Axis”, “Inclination”, “Asc Node Longitude”,
“Orbital Period”, “Perihelion Distance”, “Aphelion Dist”, “Mean Anomaly”, “Mean Motion”,
“Perihelion Arg”, and “Orbit Uncertainity”.
Figure 1 Before removing unwanted columns.
Figure 2 After removing unwanted columns.
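The column selection described above can be sketched in pandas as follows. This is a minimal illustration, not the report's own code: the helper name keep_listed_columns and the tiny stand-in frame raw are assumptions, and only the column names come from the list in the text.

```python
import pandas as pd

# Column names as listed in the text (spellings follow the dataset headers).
KEEP_COLUMNS = [
    "Neo Reference ID", "Absolute Magnitude", "Est Dia in KM(min)",
    "Relative Velocity km per hr", "Est Dia in KM(max)",
    "Miss Dist.(Astronomical)", "Minimum Orbit Intersection", "Eccentricity",
    "Semi Major Axis", "Inclination", "Asc Node Longitude", "Orbital Period",
    "Perihelion Distance", "Aphelion Dist", "Mean Anomaly", "Mean Motion",
    "Perihelion Arg", "Orbit Uncertainity",
]


def keep_listed_columns(data, columns):
    """Return a copy of `data` holding only the listed columns it contains."""
    present = [name for name in columns if name in data.columns]
    return data[present].copy()


# Hypothetical stand-in for the Kaggle CSV, with two columns that are not listed.
raw = pd.DataFrame({
    "Neo Reference ID": [1, 2],
    "Name": ["a", "b"],
    "Absolute Magnitude": [21.6, 20.3],
    "Orbiting Body": ["Earth", "Earth"],
})

data1_new = keep_listed_columns(raw, KEEP_COLUMNS)
print(data1_new.columns.tolist())
```

Returning a copy keeps later edits to data1_new from affecting the original data frame.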

3.2.2 Checking Null Values
Checking for missing values is an important step in data preparation, because missing values must
be recognized and handled properly to ensure data quality. Here, the number of null values in each
column of the data1_new data frame was computed. The count is zero for every column, so the
data1_new data frame does not contain any missing values.
Figure 3 Count of null value per columns in data1_new.
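A per-column null count of this kind can be obtained with isnull().sum(). The sketch below uses a small hypothetical frame with one deliberately missing value, so that the check has something to find; the report's own data frame had none.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
data1_new = pd.DataFrame({
    "Absolute Magnitude": [21.6, 20.3, np.nan],
    "Eccentricity": [0.42, 0.35, 0.22],
})

# Count the null entries in each column.
null_counts = data1_new.isnull().sum()
print(null_counts)

# True if at least one column contains a missing value.
has_missing = bool(null_counts.any())
print(has_missing)
```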
3.3 Data Wrangling
Depending on the research objectives, the pre-processed data may need to be wrangled or
transformed. This may involve filtering the data on certain criteria, such as the type of asteroid
or the range of orbital parameters. If the research combines data from multiple sources, the
sources must be merged consistently and meaningfully. Making the data more suitable for analysis
may also involve creating new variables or features from the data or reshaping the data into a
different format.
A new column named “Nea_Class” has been added to the data1_new data frame to categorize NEAs
into distinct groups according to their orbital characteristics. The standard criteria established
by NASA's Center for Near Earth Object Studies have been used for this purpose. The specific
criteria, with all distances in astronomical units (AU), are listed below:
- Nea_Class value “Atens” has been allocated to NEAs with Semi Major Axis less than 1
and Aphelion Distance larger than 0.983.
- Nea_Class value “Atiras” has been allocated to NEAs with Semi Major Axis less than 1
and Aphelion Distance less than 0.983.
- Nea_Class value “Apollos” has been allocated to NEAs with Semi Major Axis larger
than 1 and a Perihelion Distance less than 1.017.
- Nea_Class value “Amors” has been allocated to NEAs with Semi Major Axis greater
than 1 and a Perihelion Distance between 1.017 and 1.3.
- Nea_Class value “NA” has been allocated to NEAs that match none of the criteria above and
whose classification would otherwise be null.
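The criteria above can be expressed as a vectorized rule. The following is a minimal sketch using numpy.select; the function name classify_nea and the sample rows are hypothetical, while the thresholds and column names are taken from the text.

```python
import numpy as np
import pandas as pd


def classify_nea(df):
    """Assign an NEA class from semi-major axis, perihelion and aphelion (AU)."""
    a = df["Semi Major Axis"]
    q = df["Perihelion Distance"]
    big_q = df["Aphelion Dist"]
    conditions = [
        (a < 1) & (big_q > 0.983),          # Atens
        (a < 1) & (big_q < 0.983),          # Atiras
        (a > 1) & (q < 1.017),              # Apollos
        (a > 1) & (q > 1.017) & (q < 1.3),  # Amors
    ]
    labels = ["Atens", "Atiras", "Apollos", "Amors"]
    # Rows matching no condition fall back to "NA".
    values = np.select(conditions, labels, default="NA")
    return pd.Series(values, index=df.index, name="Nea_Class")


# Hypothetical orbits, one per outcome.
sample = pd.DataFrame({
    "Semi Major Axis": [0.9, 0.7, 1.5, 1.8, 2.5],
    "Perihelion Distance": [0.7, 0.5, 0.9, 1.2, 1.5],
    "Aphelion Dist": [1.1, 0.9, 2.1, 2.4, 3.5],
})
sample["Nea_Class"] = classify_nea(sample)
print(sample["Nea_Class"].tolist())
```

Because the four conditions are mutually exclusive, the order in which they are listed does not change the result.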
Another new column labeled "Hazardous" has been added to the data1_new data frame. Its purpose
is to separate NEAs into two categories, "PHA" and “NON_PHA”, using the standards established
by NASA’s Center for Near Earth Object Studies. The value of the Hazardous column has been set
to “PHA” for rows where the Minimum Orbit Intersection value is less than or equal to 0.05 AU
and the Absolute Magnitude value is less than or equal to 22. It has been set to “NON_PHA” for
all remaining rows, whose value would otherwise be null.
Figure 4 Added new column Nea_Class to the data frame.
Figure 5 Added new column Hazardous to the data frame.
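The PHA rule can be sketched with numpy.where. The sample values below are hypothetical; the two thresholds and the column names are the ones stated in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering both outcomes, including the boundary case.
data1_new = pd.DataFrame({
    "Minimum Orbit Intersection": [0.02, 0.04, 0.30, 0.05],
    "Absolute Magnitude": [20.1, 24.5, 19.0, 22.0],
})

# Both conditions must hold for an asteroid to be labeled PHA.
is_pha = (
    (data1_new["Minimum Orbit Intersection"] <= 0.05)
    & (data1_new["Absolute Magnitude"] <= 22)
)
data1_new["Hazardous"] = np.where(is_pha, "PHA", "NON_PHA")
print(data1_new["Hazardous"].tolist())
```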
This project relies heavily on logistic regression, so to make the analysis easier, categorical
values were transformed into binary indicator values using the “get_dummies()” function from the
pandas library. The PHA data frame, obtained by applying get_dummies() to the Hazardous column
of the data1_new data frame, was concatenated with data1_new using the pandas “concat()”
function. Similarly, the Nea data frame, obtained by applying get_dummies() to the Nea_Class
column of the data1_new data frame, was concatenated into a new data frame called “data2_new”,
which is used for a specific objective of this project.
Figure 6 After concatenating PHA data frame in to data1_new
Figure 7 After concatenating Nea data frame into data2_new.
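The encoding and concatenation steps might look like the sketch below. It assumes that only the "PHA" indicator column is kept from the Hazardous dummies and that the indicators are stored as uint8; neither detail is confirmed by the text, and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration.
data1_new = pd.DataFrame({
    "Absolute Magnitude": [20.1, 24.5, 19.0],
    "Hazardous": ["PHA", "NON_PHA", "NON_PHA"],
    "Nea_Class": ["Apollos", "Amors", "Atens"],
})

# One indicator column per category; keep only "PHA" as the binary target (assumption).
PHA = pd.get_dummies(data1_new["Hazardous"], dtype=np.uint8)[["PHA"]]
data1_new = pd.concat([data1_new, PHA], axis=1)

# One indicator column per NEA class, joined into a second data frame.
Nea = pd.get_dummies(data1_new["Nea_Class"], dtype=np.uint8)
data2_new = pd.concat([data1_new, Nea], axis=1)

print(data2_new.columns.tolist())
```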
After adding the two new columns, the data frames were re-checked for missing values to ensure
that the data was complete and ready for analysis. Neither data frame contained any null values.
Figure 8 Count of Null Values per Column in Data1_new & Data2_new.
Chapter 4: Methodology
4.1 Introduction
This chapter describes how the asteroids are examined and classified, how relationships between
orbital parameters and other characteristics of NEAs are investigated, and how the hazardousness
of the classified NEA groups is assessed using statistical methods. It also covers the population,
sample, sampling technique, data collection techniques, and data sources.
4.2 Population, Sample, and Sampling technique
In statistical research, it is often necessary to study a population in order to understand certain
characteristics or patterns. A population is a collection of individuals or objects with common
features relevant to the investigation. This study analyzes orbital parameters to classify
asteroids and investigates the relationship between orbital parameters, non-orbital parameters,
and the hazardousness of asteroids, so its population can be taken as all known asteroids in the
solar system.
Researchers often use a sample, which is a subset of the population, to represent the population
because it is typically difficult or impractical to deal with the whole population. Out of the
entire population of asteroids, near-Earth asteroids are the sample selected for this research.
4.3 Type of Data to be Collected and Data Sources
This research can involve both quantitative and qualitative data. The quantitative data could
include measurements of orbital parameters, such as the semi-major axis, eccentricity, perihelion
distance, and aphelion distance, as well as data on the physical properties of the asteroids, such
as size, shape, and composition. The qualitative data could include categorical variables or text
data, such as the classification of the asteroids. Data could be gathered from a variety of
sources, including primary data sources such as observations or experiments, and secondary data
sources such as published papers, databases, or online resources. The data used for this study is
a dataset published on “Kaggle”, a data science and machine learning community, so it is a
secondary data source. However, the data in this dataset is originally owned by NASA’s Jet
Propulsion Laboratory ( http://neo.jpl.nasa.gov/ ), and the API that provides it is maintained by
the SpaceRocks Team: David Greenfield, Arezu Sarvestani, Jason English, and Peter Baunach. The
data was gathered in CSV format, making it easier to handle and comprehend. All collected
information will be kept confidential and secure.
4.4 Data collection tools and plan
As mentioned before, the dataset was obtained from "Kaggle". The data published on Kaggle comes
from NeoWs (Near Earth Object Web Service), a web service that provides information about
near-Earth asteroids. With NeoWs a user can search for asteroids based on their closest approach
date to Earth, look up a specific asteroid by its NASA JPL small body ID, and browse the overall
dataset. Since the data has already been processed and published, no further data collection
methods are required.
4.5 Methods, Techniques, and Tools
At the beginning of this research, all NEAs are classified with the standard asteroid
classification criteria of NASA’s Center for Near Earth Object Studies, using the data provided in
the dataset. The standard NASA criteria are likewise used to determine the hazardousness of NEAs
from the relevant data available in the dataset gathered from Kaggle. The study applies several
statistical techniques, such as descriptive statistics, linear regression, logistic regression, and
hypothesis testing, and it may also discuss the assumptions and limitations of these techniques
and how they are applied to the data. Furthermore, visualization techniques such as scatter plots,
histograms, bar charts, and box plots are used to better understand the outcomes of the analysis
and to identify relationships and patterns between variables that cannot be detected by looking
at the raw data alone. Data visualization helps readers of this study gain visual insight
alongside the interpretations. All of these techniques are performed using the R or Python
programming languages.
Chapter 5: Data Analysis, Visualization, and Interpretation
5.1 Descriptive Statistics
This section provides a useful framework for analyzing and summarizing large datasets in a
meaningful way. Descriptive statistical analysis can help users comprehend the distribution of
various variables and their correlations with one another in the context of determining whether or
not an asteroid is potentially hazardous based on its characteristics. By identifying patterns and
trends in the data, we can gain valuable insights into the factors that are most important for
predicting asteroid hazard potential.
Figure 9 Summary statistics of data1_new.
The summary statistics table gives an overview of the main descriptive statistics (count, mean,
standard deviation, minimum, maximum, and quartiles) for each numerical column in the
“data1_new” data frame. Columns that are not appropriate for summarization, specifically
'Neo Reference ID' and 'PHA', have been dropped from the table using the “drop()” function in
pandas.
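A summary table of this kind can be produced by chaining drop() and describe(). The values below are hypothetical placeholders; only the two dropped column names come from the text.

```python
import pandas as pd

# Hypothetical rows for illustration.
data1_new = pd.DataFrame({
    "Neo Reference ID": [1, 2, 3, 4],
    "Absolute Magnitude": [21.6, 21.3, 20.3, 27.4],
    "Eccentricity": [0.43, 0.35, 0.35, 0.22],
    "PHA": [1, 0, 1, 0],
})

# Exclude the identifier and the binary target before summarizing.
summary = data1_new.drop(columns=["Neo Reference ID", "PHA"]).describe()
print(summary)
```

The identifier and the 0/1 target are numeric, so describe() would otherwise report a mean and quartiles for them that carry no useful meaning.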
Figure 10 Histograms of Numerical variables.
First, the columns "Neo Reference ID" and "PHA" are dropped from data1_new to create a new data
frame called “hist_dataframe”. A histogram is then drawn for each remaining numerical column,
that is, for every numerical column of the data1_new data frame except "Neo Reference ID" and
"PHA".
Correlation is also important because it helps to identify the strength and direction of the
relationships between the variables in the dataset. Variables that are highly correlated with the
target variable, or with each other, can be identified by computing the correlation coefficients.
The following correlation heatmap displays the correlation coefficients between all pairs of
numerical variables in the dataset in a matrix format.
Figure 11 Correlation Heatmap of numerical pairs.
The following list shows the ten strongest positive and the ten strongest negative correlations.
The strongest positive correlations are between Est Dia in KM (min) and Est Dia in KM (max),
Aphelion Dist and Semi Major Axis, Aphelion Dist and Eccentricity, and so on. These variables
tend to increase and decrease together, although correlation alone does not show that a change in
one causes a change in the other. The strongest negative correlations are between Mean Motion
and Perihelion Distance, Est Dia in KM (max) and Absolute Magnitude, Est Dia in KM (min) and
Absolute Magnitude, Mean Motion and Aphelion Distance, Mean Motion and Semi Major Axis, and so
on. In these pairs the variables are inversely related: higher values of one tend to go with lower
values of the other.
Figure 12 Top Negative & Positive strongest correlation coefficient
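A ranked list of pairwise correlations can be derived from the correlation matrix by keeping each pair once. The sketch below is one way to do this, under stated assumptions: the helper name ranked_correlations and the sample values are hypothetical, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def ranked_correlations(df):
    """Return each distinct variable pair with its Pearson correlation, sorted."""
    corr = df.corr()
    # Keep the strict upper triangle: drops self-pairs and mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False)


# Hypothetical values: the two diameters scale together, magnitude moves oppositely.
sample = pd.DataFrame({
    "Est Dia in KM(min)": [0.1, 0.2, 0.4, 0.8],
    "Est Dia in KM(max)": [0.22, 0.44, 0.88, 1.76],
    "Absolute Magnitude": [24.0, 22.5, 21.0, 19.5],
})

ranked = ranked_correlations(sample)
print(ranked.head(10))  # strongest positive pairs
print(ranked.tail(10))  # strongest negative pairs
```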
5.2 Feature Selection
Figure 13 Most impactful features for prediction
Figure 13 displays the ten most significant features taken from the dataset for predicting the
'PHA' (Potentially Hazardous Asteroid) variable. “Absolute Magnitude”, “Est Dia
in KM (min)”, “Est Dia in KM (max)”, “Relative Velocity km/hr”, “Orbit Uncertainty”,
“Minimum Orbit Intersection”, “Eccentricity”, “Perihelion Distance”, “Aphelion Distance”, and
“Mean Anomaly” are the features listed in the Index section. These features were selected using
the SelectKBest method, which applies the f_classif scoring function (an ANOVA F-test) to
evaluate the significance of each feature in relation to the target variable. The chosen features
are expected to have the most impact on determining whether or not an asteroid is potentially
hazardous. Concentrating analysis and modeling efforts on these features may help the prediction
model become more accurate and efficient.
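The selection step can be illustrated with scikit-learn's SelectKBest and f_classif. The sketch below runs on synthetic data with hypothetical feature names and k=1, purely to show the mechanics; the report selects ten features from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature tracks the target, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Score every feature with the ANOVA F-test and keep the best k.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()
print(scores)
print(selected)
```

f_classif scores each feature on its own, so it does not account for features that are only useful in combination or that duplicate one another.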
5.3 Regression Analysis
5.3.1 Training Logistic Regression Model
Figure 14 Train Logistic Regression Model.

In this case the dependent variable is “PHA”, the target variable to be predicted. The
independent variables are the features chosen by the feature selection method. The output
“LogisticRegression (random_state=0, solver='liblinear')” represents the logistic regression
model built with the supplied hyperparameters. Because the random_state parameter is set to 0,
the model produces the same results whenever it is fitted with the same input data and
hyperparameters. The solver parameter is set to 'liblinear', the algorithm the model uses to
optimize its coefficients. Once fitted, the model can be used to predict the probability of an
asteroid being potentially hazardous based on the selected features in X.
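The training step can be sketched as follows. The hyperparameters match the output quoted above; the synthetic data, the two-feature input, and the 70/30 split are assumptions made only so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features (hypothetical values).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Minimum Orbit Intersection": rng.uniform(0.0, 0.1, size=n),
    "Absolute Magnitude": rng.uniform(15.0, 30.0, size=n),
})
# Target built from the PHA rule stated earlier in the report.
y = (
    (X["Minimum Orbit Intersection"] <= 0.05) & (X["Absolute Magnitude"] <= 22)
).astype(int)

# Hold out part of the data for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(random_state=0, solver="liblinear")
model.fit(X_train, y_train)
print(model)
```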
5.3.2 Generate Predictions
Figure 15 Generate Prediction.
The output of this code is an array of predicted values for the target variable (PHA), produced
by applying the logistic regression model trained on the training dataset to the test dataset.
Because the target variable (PHA) is binary, the predicted values are also binary (0 or 1). In
this particular case, the result is an array of unsigned 8-bit integers (uint8) containing zeros
and ones. A predicted value of 0 indicates that the asteroid is not potentially hazardous
(NON_PHA), whereas a predicted value of 1 indicates that it is potentially hazardous (PHA). The
array is represented as a data frame in the following figure.
Figure 16 Prediction Data Frame.
The Data Frame has two columns, “PHA” and “PHA_Pred”. The PHA column contains the
actual PHA labels for the corresponding samples in the test set. The PHA_Pred column contains
the predicted PHA labels for the same samples.
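A two-column comparison frame of this kind could be assembled as sketched below. The synthetic single-feature data and the split ratio are assumptions; only the column names “PHA” and “PHA_Pred” come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem (hypothetical feature and labels).
rng = np.random.default_rng(1)
X = pd.DataFrame({"feature": rng.normal(size=200)})
noise = rng.normal(0, 0.5, size=200)
y = ((X["feature"] + noise) > 0).astype("uint8").rename("PHA")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LogisticRegression(random_state=0, solver="liblinear").fit(X_train, y_train)

# Predicted class labels for the held-out rows.
y_pred = model.predict(X_test)

# Actual and predicted labels side by side.
results = pd.DataFrame({"PHA": y_test.to_numpy(), "PHA_Pred": y_pred})
print(results.head())
```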
5.3.3 Evaluating Regression Model
Figure 17 Accuracy of Logistic Regression model.
Figure 17 displays the logistic regression model's accuracy score on the test set. The accuracy
score is the proportion of instances in the test set that were correctly classified. In this case,
the model has an accuracy of 0.8230, or 82.3%, meaning that 82.3% of the instances in the test set
were classified correctly. However, accuracy alone may not be a sufficient measure of a model's
performance, especially with imbalanced datasets. This dataset is imbalanced, with a
substantially higher proportion of cases in the negative class (PHA = 0) than in the positive
class (PHA = 1). A classification report is therefore needed to evaluate the performance of the
model more comprehensively.
Figure 18 Classification Report
With a precision of 0.83 and a recall of 0.98, the report demonstrates that the model has
high precision and recall in predicting NON-PHA observations (class 0). Its performance in
predicting the PHA observations (class 1) is far worse, with a low precision of 0.17 and a recall
of only 0.02. The model's weighted average F1-score is 0.76, which summarizes its performance
across both classes but is dominated by the larger class. The weak results for class 1 may be due
to the imbalanced nature of the dataset, where there are many more non-hazardous asteroids than
hazardous ones.
Figure 19 Confusion Matrix.

In this case, the first row of the matrix shows the observations whose actual class is 0. Of these
1173 observations, 1154 were correctly classified as 0 (TN), while 19 were misclassified as 1 (FP).
The second row shows the observations whose actual class is 1. Of these 234 observations, only 4
were correctly classified as 1 (TP), whereas 230 were mistakenly classified as 0 (FN). Overall,
the model's performance is poor, as it correctly identifies only a small fraction of the actual
PHAs. The confusion matrix shows that the model has very low sensitivity for the category of
"potentially hazardous asteroids," as indicated by the high number of false negatives (FN) and the
low number of true positives (TP). Its specificity, by contrast, is high, because few
non-hazardous asteroids are flagged as hazardous.
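The accuracy, precision, and recall figures reported earlier can be re-derived from the four counts in the confusion matrix. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Counts read from the confusion matrix in the text.
tn, fp, fn, tp = 1154, 19, 230, 4
total = tn + fp + fn + tp

# Share of all test observations that were classified correctly.
accuracy = (tp + tn) / total

# Class 1 (PHA): precision and recall (recall is also called sensitivity).
precision_pha = tp / (tp + fp)
recall_pha = tp / (tp + fn)

# Class 0 (NON_PHA): precision and recall (recall here is the specificity).
precision_non_pha = tn / (tn + fn)
specificity = tn / (tn + fp)

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (PHA)   = {precision_pha:.2f}")
print(f"recall (PHA)      = {recall_pha:.2f}")
print(f"precision (NON)   = {precision_non_pha:.2f}")
print(f"specificity (NON) = {specificity:.2f}")
```

The calculation makes the imbalance visible: 1154 of the 1158 correct predictions are non-hazardous asteroids, which is why the accuracy stays high while only 4 of the 234 PHAs are detected.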
5.3.4 Identify which NEA categories are more likely to be PHAs.
Figure 20 Train regression model.
The dataset is split into training and testing sets, and the code trains a logistic regression
model with a random state of 0 on the training set. The output displays the logistic regression
object, with its remaining parameters at their default values.
Figure 21 Predict the probabilities.
Using the trained logistic regression model, the code determines the predicted probability that
each data point in the test set will belong to the positive class (PHA). The result is an array with
these probabilities for every single data point.
Figure 22 New data frame with Predicted Probability
The code builds a data frame that pairs the Nea_Class of each asteroid in the test set with the
probability of being a potentially hazardous asteroid (PHA) that the logistic regression model
predicts for it.
Figure 23 Rank the categories by their predicted probabilities.
The result displays, for each NEA class (Atiras, Apollos, Atens, and Amors), the highest
predicted probability that an asteroid in that class is a potentially hazardous asteroid (PHA).
The NEA class Atiras has the highest probability of being a PHA, followed by Apollos and Atens,
while Amors has the lowest probability of being a PHA. This knowledge can be helpful in
identifying and ranking NEAs for further observation and research.
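The probability and ranking steps can be sketched as below. Everything in the data is synthetic: the class labels "A" to "D" are stand-ins for the four NEA classes, and the hazard rates are invented so that one class stands out. The output therefore does not reproduce the ranking reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classes and labels (hypothetical; class "B" is made more often positive).
rng = np.random.default_rng(0)
n = 400
classes = rng.choice(["A", "B", "C", "D"], size=n)
X = pd.get_dummies(pd.Series(classes, name="Nea_Class"), dtype=np.uint8)
y = (rng.random(n) < np.where(classes == "B", 0.6, 0.1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LogisticRegression(random_state=0).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (PHA = 1).
proba = model.predict_proba(X_test)[:, 1]

# Recover each row's class from its indicator columns, then rank classes
# by their highest predicted probability.
frame = pd.DataFrame({
    "Nea_Class": X_test.idxmax(axis=1).to_numpy(),
    "Predicted_Probability": proba,
})
ranking = (
    frame.groupby("Nea_Class")["Predicted_Probability"]
    .max()
    .sort_values(ascending=False)
)
print(ranking)
```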
Chapter 6: Discussion & Conclusion
6.1 Discussion
The logistic regression model developed in this project was intended to predict the potential
hazardousness of Near-Earth Asteroids (NEAs) based on their orbital and physical
characteristics. The model was trained on a dataset containing the various NEA classes, and its
performance was evaluated using accuracy, a classification report, and a confusion matrix. The
results showed that the model had a very low sensitivity for identifying potentially hazardous
asteroids despite its high specificity, which indicates that the model needs further improvement.
One possible reason for the model's poor performance could be the quality of the input data. The
dataset used in this project was derived from the JPL Small-Body Database, which provides
information on the orbits and physical properties of NEAs. However, the database is not
complete, and there may be biases in the data due to the way observations are made. This may
have affected the model's ability to accurately predict the hazardousness of NEAs.
Another possible explanation could be the choice of features used in the model. The logistic
regression model was trained on a set of features that were selected based on their potential
relevance to the hazardousness of NEAs. However, there may be other important features that
were not included in the model, which could have contributed to the model's poor performance.
Despite these limitations, the logistic regression model developed in this project has some
potential practical applications. For example, once improved, it could be used to prioritize the
observation of NEAs that are most likely to be potentially hazardous, which can help to identify
asteroids that pose a threat to Earth.
References
- NASA: Asteroids Classification (n.d.) available from <https://www.kaggle.com/datasets/shrutimehta/nasa-asteroids-classification> [14 December 2022]
- NEO Basics (n.d.) available from <https://cneos.jpl.nasa.gov/about/neo_groups.html> [17 December 2022]
- Vadym Pasko (n.d.) available from <http://vadym-pasko.com/#research> [17 December 2022]
- ‘Near-Earth Object’ (2022) in Wikipedia [online] available from <https://en.wikipedia.org/w/index.php?title=Near-Earth_object&oldid=1127646254> [18 December 2022]
- ‘Orbital Eccentricity’ (2022) in Wikipedia [online] available from <https://en.wikipedia.org/w/index.php?title=Orbital_eccentricity&oldid=1097430348> [20 December 2022]
- Naidu, S.P., Margot, J.-L., Busch, M.W., Taylor, P.A., Nolan, M.C., Brozovic, M., Benner, L.A.M., Giorgini, J.D., and Magri, C. (2013) ‘Radar Imaging and Physical Characterization of Near-Earth Asteroid (162421) 2000 ET70’. Icarus [online] 226 (1), 323–335. available from <https://www.sciencedirect.com/science/article/pii/S001910351300225X> [19 December 2022]
- Selliez, L., Briois, C., Carrasco, N., Thirkell, L., Gaubicher, B., Lebreton, J.-P., and Colin, F. (2019) ‘Analytical Performances of the LAb-CosmOrbitrap Mass Spectrometer for Astrobiology’. Planetary and Space Science [online] 225, 105607. available from <https://www.sciencedirect.com/science/article/pii/S0032063322001933> [19 December 2022]
- Center for NEO Studies (n.d.) available from <https://cneos.jpl.nasa.gov/> [21 December 2022]
- Definitions & Assumptions - NEO (n.d.) available from <https://neo.ssa.esa.int/definitions-assumptions> [21 December 2022]
