Analyzing Orbital Parameters To Classify Asteroids
DECLARATION
I hereby declare that the work presented in this project report was carried out independently by
myself and that I have cited the work of others and given due reference diligently.
………………………
Date
……………………....
Avishka Edirisuriya
I certify that the above student carried out his/her project under my supervision and guidance.
………………………
Date
……………………....
Thurairasa Balakumar
Acknowledgment
I would like to express my gratitude to the individuals who contributed to the success of this
Balakumar, for his exceptional guidance, patience, and continuous support, without which this
I would also like to thank Dr. (Mrs.) M.D.T.Atyagalle, the Head of the Department of Statistics,
as well as all the senior lecturers, lecturers, and demonstrators of the department for their
guidance and assistance throughout this project. Additionally, I extend my sincere thanks to our
course coordinator and Senior Lecturer, Dr. (Mrs.) J.H.D.S.P. Tissera, for leading me on the
right path.
I am grateful to my friends for their valuable ideas, motivation, and willingness to help whenever
I needed it. Finally, I am thankful to my parents and family members for their unwavering
Executive Summary
This project aimed to develop a logistic regression model that predicts whether a near-Earth
asteroid (NEA) is potentially hazardous. The dataset included information on each NEA's
diameter, orbital characteristics, proximity to Earth, and whether it was classified as potentially
hazardous. The logistic regression model achieved an overall accuracy of 82.3%. However, closer
examination showed that the model's sensitivity and specificity for identifying potentially
hazardous asteroids (PHAs) were not as high as intended, with a significant number of false
negatives and a low number of true positives. Further investigation revealed that, among the
NEA classes, Atiras, followed by Apollos and Atens, have the highest likelihood of being PHAs,
while Amors have the lowest. Overall, the study concluded that logistic regression models can
predict the potential hazard of NEAs from their characteristics, although further study and
improvement are needed to increase the model's sensitivity and specificity for detecting PHAs.
Abstract
astronomical terminology, an NEO is an object whose trajectory brings it within 0.3
astronomical units of the Earth's orbit and within 1.3 astronomical units of the Sun. Many
asteroids are classified as potentially hazardous asteroids (PHAs) because their orbits bring
them close to the Earth and they have the potential to collide with our planet. This research
study investigated the connection between orbital parameters and the hazardousness of
asteroids by analyzing a large dataset of near-Earth asteroids from NASA’s Center for
Near-Earth Object Studies, which was published on Kaggle. Statistical techniques, such as
regression and hypothesis testing, were used to analyze the data and identify trends or patterns
in the orbital parameters of PHAs. The findings of this research have important implications for
the detection and mitigation of PHAs and contribute to our understanding of the role of orbital
Contents
Chapter 1 : Introduction
1.1 Background
1.2 Research problem
1.3 Objectives
1.4 Research questions
1.5 Scope of the research
1.6 Justification of the research
1.7 Expected limitations
1.8 Proposed work schedule
Chapter 2 : Literature Review
2.1 Introduction to the research theme
2.2 Theoretical explanation of the keywords in the topic
2.3 Findings by other researchers
2.4 Research gap
2.5 Table of Variables
Chapter 3 : Data Preparation Process - Data Pre-processing and Data Wrangling
3.1 Data acquisition
3.2 Data pre-processing
3.3 Data Wrangling
Chapter 4: Methodology
4.1 Introduction
4.2 Population, Sample, and Sampling technique
4.3 Type of Data to be Collected and Data Sources
4.4 Data collection tools and plan
4.5 Methods, Techniques, and Tools
Chapter 5 : Data Analysis, Visualization, and Interpretation
5.1 Descriptive Statistics
5.2 Feature Selection
5.3 Regression Analysis
Chapter 6: Discussion & Conclusion
6.1 Discussion
References
Chapter 1 : Introduction
1.1 Background
Near-Earth Asteroids (NEAs) are small solar system bodies that orbit the Sun and pass relatively
close to the Earth’s orbit. The gravitational pull of nearby planets pushes them into orbits that
allow them to enter Earth’s neighborhood. These asteroids can originate from various places,
such as the main asteroid belt between Mars and Jupiter, and can be made of different
substances, including metal, rock, and ice. In technical terms, an asteroid is an NEA if its
perihelion distance is less than 1.3 AU, which is approximately 195 million kilometers.
They are separated into several groups according to their size, shape, and orbital parameters.
According to NASA’s Center for Near Earth Object Studies, the most common classification
scheme for NEAs is the Apollo, Amor, Aten, and Atira categories, which are based on the
Some NEAs are also classified as potentially hazardous asteroids (PHAs). A PHA is an asteroid
whose orbit passes close to Earth’s orbit and that is larger than about 140 meters (460 feet) in
diameter. In technical terms, an NEA is a potentially hazardous asteroid if its minimum orbit
intersection distance (MOID) is 0.05 AU or less and its absolute magnitude is 22 or less.
PHAs are also a subset of NEAs that have the potential to
NEAs are typically discovered by ground-based survey telescopes and by space telescopes such
as the Wide-field Infrared Survey Explorer (WISE). Once an NEA is discovered, its orbit is
tracked and monitored to determine its potential risk to Earth. NEAs can also be used for space
exploration. For example, some NEAs
have been proposed as a potential destination for robotic spacecraft, as they can provide a
Scientists study NEAs to learn more about the solar system and to understand the potential
benefits and drawbacks of these asteroids. In general, the study of orbital parameters is crucial to
asteroid exploration since it clarifies the characteristics and behavior of these small bodies and the risks
The classification of NEAs and the identification of PHAs is already an active area of research
and will continue to grow in the future. There is still much to learn about these tiny bodies and
This study analyzes the orbital properties of near-Earth asteroids to learn more about them.
Orbital parameters are the characteristics that describe the shape and position of an asteroid's
orbit around the Sun. Some examples of orbital parameters are perihelion distance, aphelion distance,
This study involves classifying Near-Earth Asteroids according to the standard criteria of
NASA’s Center for Near-Earth Object Studies, predicting whether a Near-Earth Asteroid is
potentially hazardous based on its characteristics, and identifying which NEA categories are
1.3 Objectives
More specific objectives derived from the research problem are listed below.
Explore the dataset to gain insights into the distribution of asteroid characteristics.
Identify which asteroid characteristics are most strongly associated with whether an
Which asteroid characteristics have the strongest association with whether an asteroid is
its characteristics?
What is the performance of the logistic regression model in predicting whether an asteroid is
Are there any subgroups of asteroids with higher or lower risk of being potentially
hazardous?
1.5 Scope of the research
The primary purpose of this research is to analyze orbital parameters and other properties of
asteroids, such as relative velocity, in order to classify Near-Earth Asteroids, determine which
category is most likely to be potentially hazardous, and predict whether a Near-Earth Asteroid is
potentially hazardous based on its characteristics. This involves the standard asteroid
classification criteria used by NASA’s Center for Near Earth Object Studies, statistical methods,
and other analytical approaches. This research could help clarify the interconnections between
orbital and non-orbital parameters and the potential hazardousness of Near-Earth Asteroids.
When predicting whether a Near-Earth Asteroid is potentially hazardous, a variety of factors
may affect its orbital characteristics and hazardousness, and statistical approaches can be used
to discover patterns and correlations that may not be immediately apparent. This can help to
gain a deeper understanding of the processes that underlie asteroids' behavior and the dangers
that they might pose to Earth.
In addition, statistical analysis enables researchers to test hypotheses and make predictions
regarding asteroids' behavior based on their orbital parameters. This will aid future research and
policy decisions in classifying NEAs and identifying relationships between orbital and non-
orbital factors.
Overall, predicting whether a Near-Earth Asteroid is potentially hazardous or not using statistical
methods can offer a more comprehensive and profound understanding of NEAs, and their
1.7 Expected limitations
Statistical analysis is only as good as the data used to perform it. The quantity of data
currently available for the study of NEAs may be limited. In addition, statistical analysis can
only estimate the relationships between variables in terms of probability.
There is always a chance that these estimates won't be precisely accurate or that
Only asteroids with circular and elliptical orbits are considered in this study because they are
more likely to be NEAs. Therefore, asteroids with parabolic trajectories and hyperbolic
Furthermore, statistical analysis may not be able to capture all the complexity of the systems
being studied. Statistical analysis is only one tool that can be used to study NEAs, and it may not
be appropriate for all research objectives or for all types of discoveries. It may be necessary to
Although statistical research can provide insights into the relationships between orbital
parameters and non-orbital parameters and the hazardousness of NEAs, it is crucial to be aware
of its limitations. Consequently, the data obtained to conduct this research is subject to these
limitations.
1.8 Proposed work schedule
Schedule period: December (3rd and 4th weeks), January (1st to 4th weeks), February (1st to 4th weeks)
Tasks, in order:
Topic selection
Data acquisition
Data analyzing
First draft preparation
Final report submission
NEAs are small, rocky objects that orbit the Sun and are believed to be remnants from the inner
solar system. Since it can give insight into the origins and evolution of the solar system and help
determine the likelihood of asteroid impacts, understanding the characteristics and behavior of
Several techniques have been developed to categorize NEAs and identify PHAs. One common
method is based on the asteroids' orbital characteristics, such as semi-major axis, aphelion
distance, and perihelion distance. Various studies have investigated the characteristics of NEAs and
PHAs using various data sources, including observations from ground-based telescopes, space-
The study of orbital characteristics and asteroids' hazardousness using statistical methods has
gained increasing attention in recent years. This research theme aims to understand how orbital
parameters can be used to classify asteroids and investigate interconnections between orbital and
non-orbital parameters and hazardousness. Researchers can develop models to discover patterns
and trends that may be helpful in categorizing asteroids and determining their hazardousness by
As a result, this chapter addresses previous studies and discusses the approaches used as well as
Earth's orbit.
Potentially Hazardous Asteroids (PHA) A potentially hazardous asteroid is a subset of a
destruction.
Inner solar system The inner solar system is the region of the solar
asteroid belt.
orbit at the same distance from the central body.
characteristics.
(163693) Atira)
There are numerous resources available for further study about classifying asteroids using their
orbital parameters and examining the relationship between those parameters and the
Asteroids using a machine learning algorithm to identify spectrophotometric data, then compare
it with current Bus-DeMeo taxonomy (DeMeo, et al. 2009) categorization efforts. Then
The research titled “Prediction of orbital parameters for undiscovered potentially hazardous
asteroids using machine learning” (2018), published by Vadym Pasko, analyzed PHAs in the
Amor group, and the results showed a relationship between the density of the PHA distribution
Another study, “Radar imaging and physical characterization of Near-Earth Asteroid
(162421) 2000 ET70”, published in the journal Icarus in 2013, shows that its author was able
to predict a reliable trajectory for NEA (162421) 2000 ET70 by utilizing precise measurements
of the range and velocity of the asteroid, which are non-orbital parameters. These were
measured by obtaining continuous wave spectra and range-Doppler images with range
resolutions as fine as 15 m. NEA (162421) 2000 ET70 was observed during its closest approach
to Earth in February 2012 over a period of 12 days by using the Arecibo and
These are merely a few instances of the various types of discoveries that have been made in the
subject of asteroid studies throughout the past years. I suggest looking up more information on
this subject in online databases and repositories like the Minor Planet Center and the NASA
Astrophysics Data System (ADS), as well as academic publications like the Planetary Science
There are various types of research about celestial bodies like planets, exoplanets, comets, and
asteroids. Nevertheless, as far as I know, only a few studies classify NEAs and predict the
hazardousness of Near-Earth Asteroids based on their orbital and non-orbital parameters by using
statistical approaches. Kathleen Jacinda Mcintyre’s research left a gap in the literature since the
author considered only spectrophotometric data to identify and categorize asteroids, and the
author used the Bus-DeMeo classification method, which is an asteroid taxonomic system that
Furthermore, the gap in the study by Vadym Pasko was that the author investigated only one
NEA group, the Amors, and reported results on the relationship between the density of the PHA
distribution and several orbital parameters.
The research named “Radar imaging and physical characterization of near-Earth Asteroid
(162421) 2000 ET70” left a gap in the literature because the author investigated only a single
NEA that can be potentially hazardous. Also, critical orbital parameters have not been
Finally, there is still much we don't know about the relationship between asteroids' orbital
properties and their hazardousness. While some studies have found correlations between certain
orbital parameters and the likelihood of an asteroid impacting Earth, there is still a lot to be
discovered about these relationships. Therefore, further research is needed to reveal other
NEA.
respective cycles.
Perihelion distance The perihelion distance of an Perihelion distance in
the Sun.
the asteroid's motion speed and
direction.
NEA-Class This implies which category the Apollo, Amor, Aten, Atira
Atira.
Perihelion Arg It describes the orientation of the Angle measured in degrees
perihelion.
object's orbit.
an asteroid travel in its orbit asteroid in its orbit around the
in kilometers.
in kilometers.
North Pole.
Miss Dist. It means closest distance Distance in astronomical units.
approach.
The first step in the data preparation process is to acquire the data needed for the research. This
study relies on a dataset from "Kaggle", a secondary data source. However, the Jet
Propulsion Laboratory of NASA is the original owner of the data in this dataset.
Once the data has been acquired, it may need to be pre-processed to make it suitable for analysis.
This may involve cleaning the data by removing errors, inconsistencies, or missing values, which
could entail applying methods like outlier identification, imputation, or data
interpolation. After the data has been cleaned, it may need to be transformed into a suitable
format for analysis. This could involve converting data from one type to another, such as from
After the required libraries and the dataset were imported, data cleaning began with the removal
of unnecessary columns. Figure 3.1 depicts the complete dataset prior to data cleaning,
which consisted of 39 columns. Following the data cleaning process, as illustrated in Figure 3.2,
the number of columns was reduced to 17. The columns not contained in a specified list were
removed from the original data frame, and a new data frame called ‘data1_new’ was created as
a copy of the result. The columns that were
specified for inclusion are: “Neo Reference ID”, “Absolute Magnitude”, “Est Dia in KM(min)”,
“Relative Velocity km per hr”, “Est Dia in KM(max)”, “Miss Dist.(Astronomical)”, “Minimum
Orbit Intersection”, “Eccentricity”, “Semi Major Axis”, “Inclination”, “Asc Node Longitude”,
“Orbital Period”, “Perihelion Distance”, “Aphelion Dist”, “Mean Anomaly”, “Mean Motion”,
Checking for missing values is an important step in the preparation of data. Missing values must
be recognized and handled properly to ensure data quality. In this case, the number of null values
present in each column of the data1_new data frame was computed, and every count was zero.
Therefore, it can be
concluded that the data1_new data frame does not contain any missing values.
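The column selection and the null check described above can be sketched with pandas as follows. This is a minimal illustration, not the report's own code: the small `data1` frame and its values are hypothetical stand-ins for the imported Kaggle CSV, and `keep_columns` contains only the sixteen names that are legible in the text, since the list in the report is cut off after "Mean Motion".

```python
import pandas as pd

# Column names as listed in the text (the list is cut off after "Mean Motion").
keep_columns = [
    "Neo Reference ID", "Absolute Magnitude", "Est Dia in KM(min)",
    "Relative Velocity km per hr", "Est Dia in KM(max)",
    "Miss Dist.(Astronomical)", "Minimum Orbit Intersection", "Eccentricity",
    "Semi Major Axis", "Inclination", "Asc Node Longitude", "Orbital Period",
    "Perihelion Distance", "Aphelion Dist", "Mean Anomaly", "Mean Motion",
]

# Hypothetical stand-in for the imported dataset (pd.read_csv(...) in practice).
data1 = pd.DataFrame({
    "Neo Reference ID": [1001, 1002],
    "Name": [1001, 1002],                 # dropped: not in the list
    "Absolute Magnitude": [21.6, 21.3],
    "Semi Major Axis": [1.407, 1.107],
    "Orbiting Body": ["Earth", "Earth"],  # dropped: not in the list
})

# Keep only the listed columns that are present, as an independent copy.
data1_new = data1.loc[:, data1.columns.isin(keep_columns)].copy()

# Count nulls per column; all zeros means there are no missing values.
null_counts = data1_new.isnull().sum()
print(null_counts)
```

Taking a `.copy()` keeps later column additions from touching the original frame.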
Depending on the individual research objectives, the pre-processed data may need to be
wrangled or transformed. This may involve filtering the data based on certain criteria, such as the
type of asteroid or the range of orbital parameters. If the research involves combining data from
multiple sources, it could be required to do so consistently and meaningfully. To make the data
more suitable for analysis, it may be necessary to transform it in various ways. This may involve
creating new variables or features from the data or reshaping the data into a different format.
A new column named “Nea_Class” was added to the data1_new data frame in order to
categorize NEAs into distinct groups according to their orbital characteristics. The
standard criteria established by NASA's Center for Near Earth Object Studies were used
to categorize the near-Earth asteroids. The specific criteria used for this purpose are listed below:
- Nea_Class value “Atens” has been allocated to NEAs with Semi Major Axis less than 1
- Nea_Class value “Atiras” has been allocated to NEAs with Semi Major Axis less than 1
- Nea_Class value “Apollos” has been allocated to NEAs with Semi Major Axis larger
- Nea_Class value “Amors” has been allocated to NEAs with Semi Major Axis greater
- The Nea_Class column now has the value “NA” for those NEAs whose classification is
still null.
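Because the thresholds in the list above are cut off in this copy, the following sketch uses the group definitions as published by NASA's Center for Near Earth Object Studies (semi-major axis a, perihelion distance q, and aphelion distance Q, all in AU). It is an illustration under that assumption, not necessarily the exact code used in this project; the function name `classify_nea` and the sample orbits are hypothetical.

```python
import pandas as pd

def classify_nea(a: float, q: float, Q: float) -> str:
    """Return the NEA group for semi-major axis a, perihelion q, aphelion Q (AU)."""
    if a < 1.0 and Q < 0.983:
        return "Atiras"    # orbit entirely inside Earth's orbit
    if a < 1.0 and Q > 0.983:
        return "Atens"     # Earth-crossing, semi-major axis below 1 AU
    if a > 1.0 and q < 1.017:
        return "Apollos"   # Earth-crossing, semi-major axis above 1 AU
    if a > 1.0 and 1.017 < q < 1.3:
        return "Amors"     # Earth-approaching, orbit outside Earth's
    return "NA"            # none of the four definitions applies

# Hypothetical sample orbits, one per group.
orbits = pd.DataFrame({
    "Semi Major Axis": [0.74, 0.97, 1.47, 1.92],
    "Perihelion Distance": [0.50, 0.79, 0.65, 1.09],
    "Aphelion Dist": [0.98, 1.14, 2.29, 2.75],
})

orbits["Nea_Class"] = orbits.apply(
    lambda row: classify_nea(
        row["Semi Major Axis"], row["Perihelion Distance"], row["Aphelion Dist"]
    ),
    axis=1,
)
print(orbits)
```

Rows that fall exactly on a boundary or outside all four definitions receive "NA", which mirrors the fallback described in the last list item.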
Another new column labeled "Hazardous" has been added to the data1_new data frame. Its
purpose is to categorize NEAs into two categories, "PHA" and “NON_PHA”, using the
standards established by NASA’s Center for Near Earth Object Studies. The specific criteria
are as follows. The value of the Hazardous column has been set to “PHA” for rows where the
Minimum Orbit Intersection column value is less than or equal to 0.05 and the Absolute
Magnitude column value is less than or equal to 22. The value of the Hazardous column has
been set to “NON_PHA” for rows where the value is still null.
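The labeling rule above can be expressed in a single vectorized step. The sketch below is one possible way to do it, assuming the column names given in the text; the four sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the text.
data1_new = pd.DataFrame({
    "Minimum Orbit Intersection": [0.02, 0.05, 0.30, 0.01],
    "Absolute Magnitude": [20.0, 22.0, 19.5, 25.1],
})

# PHA rule from the text: MOID <= 0.05 AU and absolute magnitude <= 22.
is_pha = (data1_new["Minimum Orbit Intersection"] <= 0.05) & (
    data1_new["Absolute Magnitude"] <= 22
)

# Every row that does not satisfy the rule is labeled NON_PHA.
data1_new["Hazardous"] = np.where(is_pha, "PHA", "NON_PHA")
print(data1_new)
```

The third row fails the MOID condition and the fourth fails the magnitude condition, so both are labeled NON_PHA.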
This project relies heavily on logistic regression. To simplify the analysis, categorical values
were transformed into binary indicator columns using the “get_dummies()” function from the
pandas library. The PHA data frame obtained by applying get_dummies() to the Hazardous
column of the data1_new data frame was
concatenated with the data1_new data frame using the “concat()” function from pandas.
Likewise, the Nea data frame obtained by applying get_dummies() to the Nea_Class column of
the data1_new data frame was concatenated with the new data frame called “data2_new”
After adding two new columns, the missing values were re-examined to ensure that the data was
complete and ready for analysis. It was found that the data frames had no null values.
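The encoding step described in this section might look roughly like the sketch below. Several details are assumptions rather than facts from the report: `drop_first=True` is used so that the Hazardous column yields a single 0/1 column named "PHA", `dtype=np.uint8` is set explicitly to match the uint8 predictions mentioned later, and the name `data3_new` for the final frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with the two categorical columns from the text.
data1_new = pd.DataFrame({
    "Hazardous": ["PHA", "NON_PHA", "NON_PHA"],
    "Nea_Class": ["Apollos", "Amors", "Atens"],
})

# Hazardous -> one 0/1 column: categories sort as NON_PHA, PHA, and
# drop_first removes NON_PHA, leaving a column named "PHA".
pha = pd.get_dummies(data1_new["Hazardous"], drop_first=True, dtype=np.uint8)
data2_new = pd.concat([data1_new, pha], axis=1)

# Nea_Class -> one indicator column per class.
nea = pd.get_dummies(data1_new["Nea_Class"], dtype=np.uint8)
data3_new = pd.concat([data2_new, nea], axis=1)  # hypothetical name

print(data3_new)
```

Keeping all class indicators is convenient for descriptive work; when they are used together as regressors with an intercept, one of them is usually dropped to avoid perfect collinearity.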
Chapter 4: Methodology
4.1 Introduction
This chapter aims to examine and analyze asteroids to classify them and investigate relationships
between orbital parameters and other factors of NEAs and investigate the hazardousness of
classified NEA groups utilizing statistical methods. Accordingly, this chapter also covers the
4.2 Population, Sample, and Sampling technique
features or characteristics relevant to the investigation. The population of this study may be all
known asteroids in the solar system in the context of a statistical study on orbital parameter
analysis to classify asteroids and investigate the relationship between orbital parameters, non-
Researchers often use a sample, which is a subset of the population, to represent the
population because it is typically difficult or impractical to deal with the whole population. Out
of the entire population of asteroids, near-Earth asteroids are the sample selected for this
research.
This can involve gathering both quantitative and qualitative data. The quantitative data could
include measurements of orbital parameters, such as the semi-major axis, eccentricity, perihelion
distance, and aphelion distance, as well as data on the physical properties of the asteroids, such
as size, shape, and composition. The qualitative data could include categorical variables or text
data, such as the classification of the asteroids. Data for this kind of research can be gathered
from a variety of sources, including primary data
sources such as observations or experiments, and secondary data sources such as published
papers, databases, or online resources. The data used for this study is a dataset published on
“Kaggle”, which is a data science and machine learning community. Therefore, it is
a secondary data source. However, the data in this dataset is originally owned by NASA’s Jet
Propulsion Laboratory ( http://neo.jpl.nasa.gov/ ). This API is maintained by the SpaceRocks Team:
David Greenfield, Arezu Sarvestani, Jason English, and Peter Baunach. The data was
gathered in CSV format, making it easier to handle and comprehend. All collected information
As mentioned before, "Kaggle" was used to obtain the dataset. The source used to publish this
data on Kaggle is NeoWs (Near Earth Object Web Service), a web service that provides
information about near-Earth asteroids. With NeoWs a user can search for asteroids based on
their closest approach date to Earth, look up a specific asteroid by its NASA JPL small body
id, and browse the overall dataset. Since the data has already been processed and
All NEAs are classified using the standard asteroid classification criteria of NASA’s
Center for Near-Earth Object Studies, applied to the data provided in the dataset at the beginning
of this research. The standard NASA criteria are likewise used to determine the
hazardousness of NEAs from the relevant data in the dataset gathered from Kaggle.
The study draws on various statistical techniques, such as descriptive statistics, linear
regression, logistic regression, and hypothesis testing, and considers the
assumptions and limitations of these techniques and how they are applied to the data.
Furthermore, visualization techniques, such as scatter plots, histograms, bar
charts, and box plots, are used to better understand the outcomes of this analysis and to identify
relationships and patterns between variables that cannot be detected by just looking at the data
while the analysis is being performed. Data visualization will help readers of this study gain
visual insight alongside the interpretations. All these techniques are
This section provides a useful framework for analyzing and summarizing large datasets in a
meaningful way. Descriptive statistical analysis can help users comprehend the distribution of
various variables and their correlations with one another in the context of determining whether or
not an asteroid is potentially hazardous based on its characteristics. By identifying patterns and
trends in the data, we can gain valuable insights into the factors that are most important for
The summary statistics table for the “data1_new” data frame gives an overview of the most
important descriptive statistics, including count, mean, standard deviation, minimum, maximum,
and quartiles, for each numerical column. Columns that are not appropriate for summarization,
specifically 'Neo Reference ID' and 'PHA', have been dropped from the table using the “drop()”
function in pandas.
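A summary table of this kind can be produced with `drop()` followed by `describe()`. The sketch below uses a small hypothetical frame in place of data1_new.

```python
import pandas as pd

# Hypothetical stand-in for data1_new with an identifier and the target column.
data1_new = pd.DataFrame({
    "Neo Reference ID": [1, 2, 3, 4],
    "Absolute Magnitude": [20.0, 22.0, 24.0, 26.0],
    "Eccentricity": [0.1, 0.2, 0.3, 0.4],
    "PHA": [1, 0, 0, 0],
})

# Drop the identifier and the binary target, then summarize what is left.
summary = data1_new.drop(columns=["Neo Reference ID", "PHA"]).describe()
print(summary)
```

The identifier is excluded because its mean or quartiles carry no meaning, and the target is excluded because it is a 0/1 label rather than a measurement.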
First, the columns "Neo Reference ID" and "PHA" are dropped from data1_new to generate a
new data frame called “hist_dataframe”. A histogram is then drawn for each remaining
numerical column, that is, for every numerical column of data1_new except "Neo Reference ID"
and "PHA".
Correlation is also important because it helps to identify the strength and direction of
the relationships between the variables in the dataset. The variables that have a high correlation
with the target variable can be identified by computing the correlation coefficients between the
features. The following correlation heatmap displays the correlation coefficients between all
Figure 11 Correlation Heatmap of numerical pairs.
The following list shows the ten strongest positive and the ten strongest negative correlations.
Among the positive correlations, there are strong relationships between Est Dia in KM (min) and
Est Dia in KM (max), Aphelion Dist and Semi Major Axis, Aphelion Dist and
Eccentricity, etc. This implies that these variables move together: when one increases, the other
tends to increase as well. Among the negative correlations, there are strong relationships
between Mean Motion and Perihelion Distance,
Est Dia in KM (max) and Absolute Magnitude, Est Dia in KM (min) and Absolute Magnitude,
Mean Motion and Aphelion Distance, Mean Motion and Semi Major Axis, etc. This suggests
that these pairs are inversely related: when one variable increases, the other tends to decrease.
A correlation by itself does not show that a change in one variable causes the change in the other.
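One way to rank variable pairs by correlation is sketched below. The data are synthetic: the orbital period and mean motion are derived from a few hypothetical semi-major axis values through Kepler's third law, which is why the pair involving Mean Motion comes out negative, as in the list above. The variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

# Synthetic orbits: period from Kepler's third law (P in days, a in AU),
# mean motion in degrees per day.
df = pd.DataFrame({"Semi Major Axis": [0.8, 1.0, 1.5, 2.0, 2.5]})
df["Orbital Period"] = 365.25 * df["Semi Major Axis"] ** 1.5
df["Mean Motion"] = 360.0 / df["Orbital Period"]

# Pearson correlation matrix of all numerical columns.
corr = df.corr()

# Keep each unordered pair once and sort from most negative to most positive.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values()

print(pairs.head(10))  # strongest negative correlations
print(pairs.tail(10))  # strongest positive correlations
```

Listing each pair once avoids the duplicate (A, B) and (B, A) entries and the trivial self-correlations of 1 that appear in the full matrix.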
5.2 Feature Selection
Figure 5.5 displays the top ten most significant characteristics taken from the dataset for
predicting the 'PHA' (Potentially Hazardous Asteroid) variable. “Absolute Magnitude”, “Est Dia
“Mean Anomaly” are among the features provided in the Index section. These features were
selected using the SelectKBest method, which applies the f_classif scoring function (an ANOVA
F-test between each feature and the class label) to evaluate the significance of each feature in
relation to the target variable. The chosen characteristics are
expected to have the most impact on determining whether or not an asteroid is potentially
hazardous. By concentrating analysis and modeling efforts on the most important features, this
information might help the prediction model become more accurate and effective.
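The selection step can be illustrated with scikit-learn's `SelectKBest` and `f_classif`. The sketch below runs on synthetic data with hypothetical feature names and `k=1`; the report selects the top ten features of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature tracks the binary target, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})

# Score every feature with the ANOVA F-test and keep the k best.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()]

print(scores)
print(list(selected))
```

`f_classif` scores each feature on its own, so two features that carry nearly the same information (for example, a minimum and a maximum diameter estimate) can both rank highly.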
predict. The independent variables are the features that were selected using the feature selection
method. The logistic regression model built using the supplied hyperparameters is represented by
parameter is set to 0, the model always produces the same outcomes when used with the same
input data and hyperparameters. The solver parameter is set to 'liblinear', which is an algorithm
used by the model to optimize its coefficients. This output confirms that the logistic regression
model can be used to predict the probability of an asteroid being potentially hazardous based on
The output of this code is an array of predicted values for the target variable (PHA), produced
by the logistic regression model trained on the training dataset and applied to the test dataset.
Because the target variable (PHA) is binary, the predicted values are also binary. In this
particular case, the result is an array of unsigned 8-bit integers (uint8) containing only zeros (0)
and ones (1). A predicted value of 0 indicates that the asteroid is not potentially hazardous
(NON_PHA), whereas a predicted value of 1 indicates that the asteroid is potentially hazardous
(PHA). The array is represented as a data frame in the following figure.
Figure 16 Prediction Data Frame.
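The fitting and prediction steps described above can be sketched as follows. Only solver='liblinear', random_state=0, and the column names PHA and PHA_Pred come from the text; the synthetic data, the 70/30 split, and the variable names are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the selected asteroid features (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)
# The split ratio is an assumption; the text does not state it here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters named in the text: liblinear solver, random_state=0.
model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(X_train, y_train)

# predict() returns one hard class label (0 or 1) per test sample.
y_pred = model.predict(X_test)

# Actual and predicted labels side by side, one row per test sample.
results = pd.DataFrame({"PHA": y_test, "PHA_Pred": y_pred})
print(results.head())
```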
The Data Frame has two columns, “PHA” and “PHA_Pred”. The PHA column contains the
actual PHA labels for the corresponding samples in the test set. The PHA_Pred column contains
Figure 5.9 displays the logistic regression model's accuracy score on the test set. The accuracy
score is the proportion of instances in the test set that were correctly classified. In this
case, the model has an accuracy of 0.8230, or 82.3%, meaning that 82.3% of the instances in
the test set were classified correctly. However, accuracy alone may not be a
sufficient measure of a model's performance, especially when dealing with imbalanced datasets.
There is a class imbalance in the dataset, with a substantially higher proportion of cases falling
into the negative class (PHA = 0) than the positive class (PHA = 1). Therefore, a classification
report is needed to evaluate the performance of the model more comprehensively.
With a precision of 0.83 and a recall of 0.98, the report shows that the model predicts
NON-PHA observations (class 0) with high precision and recall. On the other hand,
the model's performance in predicting the PHA observations (class 1) is far worse, with a low
precision of 0.17 and a recall of only 0.02. The model's weighted average F1-score is 0.76;
because this average is weighted by class size, it mainly reflects performance on the majority
class. The poor results for class 1 may be due to the imbalanced nature of the dataset, where
there are many more non-hazardous asteroids than hazardous ones.
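The figures in the report can be checked against the confusion-matrix counts given in this chapter (1173 class 0 observations with 19 false positives, and 234 class 1 observations with 4 true positives). The sketch below rebuilds label arrays from those counts and passes them to scikit-learn; the reconstruction and the variable names are assumptions, since the original test-set labels are not reproduced here.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Rebuild labels from the reported counts: TN = 1173 - 19 = 1154, FN = 234 - 4 = 230.
y_true = np.array([0] * 1173 + [1] * 234)
y_pred = np.array([0] * 1154 + [1] * 19 + [0] * 230 + [1] * 4)

# Text report for reading, dictionary form for programmatic access.
print(classification_report(y_true, y_pred, target_names=["NON_PHA", "PHA"]))
report = classification_report(
    y_true, y_pred, target_names=["NON_PHA", "PHA"], output_dict=True
)

# The weighted average weights each class's F1-score by its number of observations.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1: {weighted_f1:.2f}")
```

Rounded to two decimals, these values agree with the precision, recall, and weighted F1-score quoted above.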
In the first row of the confusion matrix, 1154 of the 1173 class 0 observations were correctly
classified as 0 (TN), while 19 were misclassified as 1 (FP).
The actual number of observations for class 1 is displayed in the second row. 4 of the 234
observations were correctly classified as 1 (TP), whereas 230 were mistakenly classified as 0
(FN). Overall, the model's performance is poor for the class of interest, as it correctly predicts
only a small fraction of the actual PHAs. The confusion matrix shows that the model has a very
low sensitivity for the category of "potentially hazardous asteroids" (4/234, about 0.02), as
indicated by the high number of false negatives (FN) and the low number of true positives (TP),
even though its specificity is high (1154/1173, about 0.98).
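The summary measures follow directly from the four confusion-matrix counts. The sketch below works through the arithmetic; the function name binary_metrics is hypothetical, and TN is taken as 1173 - 19 = 1154 from the counts reported above.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute standard binary-classification measures from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),  # share of actual PHAs that are detected
        "specificity": tn / (tn + fp),  # share of actual non-PHAs that are cleared
        "precision": tp / (tp + fp),    # share of predicted PHAs that are real PHAs
    }


metrics = binary_metrics(tn=1154, fp=19, fn=230, tp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The accuracy of 0.823 is driven almost entirely by the 1154 true negatives, which is why it hides a sensitivity of roughly 0.017.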
Using the training set produced by splitting the dataset into training and testing sets, the code
trains a logistic regression model with a random state of 0. The logistic regression object with its
Figure 21 Predict the probabilities.
Using the trained logistic regression model, the code determines the predicted probability that
each data point in the test set will belong to the positive class (PHA). The result is an array with
The code then builds a Data Frame that pairs the Nea_Class of each asteroid in the test set with
the probability, predicted by the logistic regression model, that the asteroid is a potentially
hazardous asteroid (PHA).
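The probability step, and a per-class summary of the resulting Data Frame, can be sketched as follows. The data are synthetic and the class labels are assigned at random, so the numbers carry no astronomical meaning; the column name PHA_Probability and all variable names are assumptions, while Nea_Class and the four class names come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features and labels, plus randomly assigned NEA classes (assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
rng = np.random.default_rng(0)
nea_class = rng.choice(["Atiras", "Apollos", "Atens", "Amors"], size=len(y))

X_train, X_test, y_train, y_test, cls_train, cls_test = train_test_split(
    X, y, nea_class, test_size=0.3, random_state=0
)
model = LogisticRegression(solver="liblinear", random_state=0).fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class (PHA = 1).
pha_prob = model.predict_proba(X_test)[:, 1]
prob_df = pd.DataFrame({"Nea_Class": cls_test, "PHA_Probability": pha_prob})

# Highest predicted PHA probability within each NEA class, largest first.
max_by_class = (
    prob_df.groupby("Nea_Class")["PHA_Probability"].max().sort_values(ascending=False)
)
print(max_by_class)
```

The per-class maximum reflects the single most extreme asteroid in each class; a mean or median would describe the class as a whole.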
The result displays, for each NEA class (Atiras, Apollos, Atens, and Amors), the highest
predicted probability that an asteroid in that class is a potentially hazardous asteroid (PHA).
The NEA class Atiras has the highest such probability, followed by Apollos and Atens,
while Amors has the lowest. This knowledge can be helpful in
6.1 Discussion
The logistic regression model developed in this project was able to predict the potential
characteristics. The model was trained on a dataset consisting of various NEA classes, and its
performance was evaluated using a confusion matrix. The results showed that the model had a
very low sensitivity for identifying potentially hazardous asteroids, despite a high specificity,
which indicates
One possible reason for the model's poor performance could be the quality of the input data. The
dataset used in this project was derived from the JPL Small-Body Database, which provides
information on the orbits and physical properties of NEAs. However, the database is not
complete, and there may be biases in the data due to the way observations are made. This may
have affected the model's ability to accurately predict the hazardousness of NEAs.
Another possible explanation could be the choice of features used in the model. The logistic
regression model was trained on a set of features that were selected based on their potential
relevance to the hazardousness of NEAs. However, there may be other important features that
were not included in the model, which could have contributed to the model's poor performance.
Despite these limitations, a logistic regression model of this kind has some
potential practical applications. For example, once its sensitivity to PHAs has been improved, it
could be used to prioritize the observation of NEAs that are most likely to be potentially
hazardous, which can help to identify asteroids that pose a threat to Earth.