Analyzing Orbital Parameters To Classify Asteroids


Predict whether a Near-Earth Asteroid is potentially

hazardous or not based on its characteristics and identify


which NEA categories are more likely to be PHAs.

By: Avishka Edirisuriya (coadds211f-011)

BACHELOR OF SCIENCE (HONS) IN DATA SCIENCE NATIONAL INSTITUTE OF


BUSINESS MANAGEMENT (NIBM) Colombo, Sri Lanka

Date of submission: 30th April 2023

DECLARATION

I hereby declare that the work presented in this project report was carried out independently by me, and that I have cited the work of others and given due reference diligently.

………………………

Date

……………………....

Avishka Edirisuriya

I certify that the above student carried out his/her project under my supervision and guidance.

………………………

Date

……………………....

Thurairasa Balakumar

Acknowledgment

I would like to express my gratitude to the individuals who contributed to the success of this

project. First and foremost, I am deeply grateful to my research supervisor, Mr. Thurairasa

Balakumar, for his exceptional guidance, patience, and continuous support, without which this

project would not have been possible.

I would also like to thank Dr. (Mrs.) M.D.T.Atyagalle, the Head of the Department of Statistics,

as well as all the senior lecturers, lecturers, and demonstrators of the department for their

guidance and assistance throughout this project. Additionally, I extend my sincere thanks to our

course coordinator and Senior Lecturer, Dr. (Mrs.) J.H.D.S.P. Tissera, for leading me on the

right path.

I am grateful to my friends for their valuable ideas, motivation, and willingness to help whenever

I needed. Finally, I am thankful to my parents, and family members, for their unwavering

support and encouragement throughout my studies.

Executive Summary

This project aimed to develop a logistic regression model that can accurately predict whether a near-Earth asteroid (NEA) is potentially hazardous. The dataset included information about each NEA's diameter, orbital characteristics, proximity to Earth, and whether it was classified as potentially hazardous. The logistic regression model achieved an overall accuracy of 82.3%. However, closer inspection of the results shows that the model's sensitivity and specificity for identifying potentially hazardous asteroids (PHAs) were not as high as intended, with a significant number of false negatives and a low number of true positives. Further investigation revealed that Atiras, followed by Apollos and Atens, have the highest likelihood of being PHAs, while Amors have the lowest. Overall, the study concluded that logistic regression models have the capacity to predict the potential hazard of NEAs from their characteristics, although further study and improvement may be required to increase the model's sensitivity and specificity for detecting PHAs.

Abstract

A near-Earth asteroid (NEA) is an asteroid whose orbit around the Sun brings it into proximity with the Earth's orbit. In astronomical terminology, a near-Earth object (NEO) has a trajectory that brings it within 1.3 astronomical units (AU) of the Sun, and hence within 0.3 AU of the Earth's orbit. Some NEAs are classified as potentially hazardous asteroids (PHAs) because their orbits bring them close to the Earth and they have the potential to collide with our planet. This study investigates the connection between orbital parameters and the hazardousness of asteroids by analyzing a large dataset of near-Earth asteroids from NASA’s Center for Near-Earth Object Studies, as published on Kaggle. I used statistical techniques, such as regression and hypothesis testing, to analyze the data and identify trends or patterns in the orbital parameters of PHAs. The findings of this research have implications for the detection and mitigation of PHAs and contribute to our understanding of the role of orbital parameters in the hazardousness of asteroids.

Contents

Chapter 1 : Introduction
1.1 Background
1.2 Research problem
1.3 Objectives
1.4 Research questions
1.5 Scope of the research
1.6 Justification of the research
1.7 Expected limitations
1.8 Proposed work schedule
Chapter 2 : Literature Review
2.1 Introduction to the research theme
2.2 Theoretical explanation of the keywords in the topic
2.3 Findings by other researchers
2.4 Research gap
2.5 Table of Variables
Chapter 3 : Data Preparation Process - Data Pre-processing and Data Wrangling
3.1 Data acquisition
3.2 Data pre-processing
3.3 Data Wrangling
Chapter 4: Methodology
4.1 Introduction
4.2 Population, Sample, and Sampling technique
4.3 Type of Data to be Collected and Data Sources
4.4 Data collection tools and plan
4.5 Methods, Techniques, and Tools
Chapter 5 : Data Analysis, Visualization, and Interpretation
5.1 Descriptive Statistics
5.2 Feature Selection
5.3 Regression Analysis
Chapter 6: Discussion & Conclusion
6.1 Discussion
References

Table of Tables

Table 1-1 Proposed Work Schedule

Table 2-1 Theoretical explanation of the keywords in the topic

Table 2-2 Table of Variables

Table of Figures

Figure 1 Before removing unwanted columns

Figure 2 After removing unwanted columns

Figure 3 Count of null values per column in data1_new

Figure 4 Added new column Nea_Class to the data frame

Figure 5 Added new column Hazardous to the data frame

Figure 6 After concatenating PHA data frame into data1_new

Figure 7 After concatenating Nea data frame into data2_new

Figure 8 Count of Null Values per Column in Data1_new & Data2_new

Figure 9 Summary statistics of data1_new

Figure 10 Histograms of Numerical variables

Figure 11 Correlation Heatmap of numerical pairs

Figure 12 Top Negative & Positive strongest correlation coefficient

Figure 13 Most impactful features for prediction

Figure 14 Train Logistic Regression Model

Figure 15 Generate Prediction

Figure 16 Prediction Data Frame

Figure 17 Accuracy of Logistic Regression model

Figure 18 Classification Report

Figure 19 Confusion Matrix

Figure 20 Train regression model

Figure 21 Predict the probabilities

Figure 22 New data frame with Predicted Probability

Figure 23 Rank the categories by their predicted probabilities

Chapter 1 : Introduction

1.1 Background

Near-Earth Asteroids (NEAs) are small solar system bodies that orbit the Sun and pass relatively close to the Earth’s orbit. The gravitational pull of nearby planets pushes them into orbits that bring them into Earth’s neighborhood. These asteroids can originate from various places, such as the main asteroid belt between Mars and Jupiter, and can be made of different substances, including metal, rock, and ice. In technical terms, an asteroid is an NEA if its perihelion distance is less than 1.3 AU, which is approximately 195 million kilometers.

NEAs can be grouped according to their size, shape, and orbital parameters. According to NASA’s Center for Near Earth Object Studies, the most common classification scheme divides NEAs into the Apollo, Amor, Aten, and Atira categories, which are based on the size of the asteroid's orbit and its closest and farthest distances from the Sun (semi-major axis, perihelion distance, and aphelion distance).
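The four-way grouping can be sketched as a small decision function. This is only an illustration: the numeric thresholds (1.0 AU, 0.983 AU, 1.017 AU, 1.3 AU) are the values commonly quoted from the CNEOS group definitions and are not stated in this report, the function name `classify_nea` is hypothetical, and the sample orbits are made-up values rather than rows from the dataset.

```python
def classify_nea(a, q, Q):
    """Return the NEA group for an orbit, or None if it is not an NEA.

    a: semi-major axis (AU)
    q: perihelion distance (AU)
    Q: aphelion distance (AU)
    Thresholds are assumed from the commonly quoted CNEOS definitions.
    """
    if q >= 1.3:
        return None  # perihelion too far from the Sun to count as an NEA
    if a < 1.0:
        # Orbit smaller than Earth's: Atira if it stays inside Earth's
        # orbit entirely, Aten if it reaches out across it.
        return "Atira" if Q < 0.983 else "Aten"
    # Orbit at least as large as Earth's: Apollo if it crosses Earth's
    # orbit, Amor if it only approaches from outside.
    return "Apollo" if q < 1.017 else "Amor"


# Illustrative (made-up) orbits, one per group.
examples = {
    "crossing, large orbit": (1.47, 0.65, 2.29),
    "crossing, small orbit": (0.97, 0.79, 1.14),
    "inside Earth's orbit": (0.74, 0.50, 0.98),
    "outside Earth's orbit": (1.92, 1.08, 2.76),
}
for label, (a, q, Q) in examples.items():
    print(label, "->", classify_nea(a, q, Q))
```

An orbit with a semi-major axis of exactly 1.0 AU falls into the Apollo/Amor branch here; that tie-break is a choice made for this sketch.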

Some NEAs are also classified as potentially hazardous asteroids (PHAs). A PHA is an NEA whose orbit passes close to Earth’s orbit and which is larger than about 140 meters (460 feet) in diameter. In technical terms, an NEA is a potentially hazardous asteroid if its minimum orbit intersection distance (MOID) with the Earth is less than 0.05 AU and its absolute magnitude is equal to or less than 22. PHAs are therefore a subset of NEAs that have the potential to impact the Earth and cause tremendous damage.
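The two criteria, and the link between the magnitude limit and the 140-meter size, can be sketched as follows. The diameter formula is the standard conversion from absolute magnitude and albedo; the albedo of 0.14 is an assumption (it is the conventional value under which a magnitude of 22 corresponds to roughly 140 m), and the function names are hypothetical.

```python
import math


def is_pha(moid_au, abs_magnitude):
    """Apply the two PHA criteria as stated in the text above."""
    return moid_au < 0.05 and abs_magnitude <= 22.0


def diameter_km(abs_magnitude, albedo=0.14):
    """Estimate diameter in km from absolute magnitude H.

    Uses D = 1329 / sqrt(albedo) * 10 ** (-H / 5).
    The default albedo of 0.14 is an assumed typical value.
    """
    return 1329.0 / math.sqrt(albedo) * 10 ** (-abs_magnitude / 5.0)


print(is_pha(0.03, 20.5))                 # close orbit, bright enough
print(is_pha(0.20, 18.0))                 # large but orbit never comes close
print(round(diameter_km(22.0) * 1000))    # size limit in meters at H = 22
```

Because a smaller magnitude means a brighter and, at fixed albedo, larger body, the condition "magnitude at most 22" acts as a stand-in for "diameter at least about 140 m" when no direct size measurement exists.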

NEAs are typically discovered by survey telescopes, both ground-based observatories and space-based instruments such as the Wide-field Infrared Survey Explorer (WISE). Once an NEA is discovered, its orbit is tracked and monitored to determine its potential risk to Earth. NEAs can also be used for space exploration. For example, some NEAs have been proposed as potential destinations for robotic spacecraft, as they can provide a relatively easy and safe target for exploration.

Scientists study NEAs to learn more about the solar system and to understand these asteroids' potential benefits and drawbacks. In general, the study of orbital parameters is crucial to asteroid exploration, since it clarifies the characteristics and behavior of these small bodies and the risks they pose to the Earth.

1.2 Research problem

The classification of NEAs and the identification of PHAs is already an active area of research

and will continue to grow in the future. There is still much to learn about these tiny bodies and

their possible impact on the Earth.

This study analyzes the orbital properties of near-Earth asteroids to learn more about them. Orbital parameters are the characteristics that describe the shape and position of an asteroid's orbit around the Sun. Examples of orbital parameters include the perihelion distance, the aphelion distance, the semi-major axis, and the eccentricity.
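These four parameters are not independent: for an elliptical orbit, the perihelion and aphelion distances follow from the semi-major axis a and eccentricity e as q = a(1 - e) and Q = a(1 + e). A minimal sketch of the conversion in both directions, with hypothetical function names and made-up input values:

```python
def perihelion_aphelion(a, e):
    """Return (q, Q) in AU for semi-major axis a (AU) and eccentricity e."""
    if not 0.0 <= e < 1.0:
        raise ValueError("only circular and elliptical orbits (0 <= e < 1)")
    return a * (1.0 - e), a * (1.0 + e)


def axis_eccentricity(q, Q):
    """Invert the relation: return (a, e) from perihelion q and aphelion Q."""
    return (q + Q) / 2.0, (Q - q) / (Q + q)


q, Q = perihelion_aphelion(2.0, 0.5)
print(q, Q)                      # closest and farthest distance from the Sun
print(axis_eccentricity(q, Q))   # recovers the original a and e
```

One practical consequence is that a dataset listing all four columns carries redundant information, which tends to show up as strong pairwise correlations.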

This study involves classifying Near-Earth Asteroids according to the standard criteria of NASA’s Center for Near-Earth Object Studies, predicting whether a Near-Earth Asteroid is potentially hazardous based on its characteristics, and identifying which NEA categories are more likely to be PHAs.

1.3 Objectives

More specific objectives derived from the research problem are listed below.

• Explore the dataset to gain insights into the distribution of asteroid characteristics.

• Identify which NEA categories are more likely to be PHAs.

• Identify which asteroid characteristics are most strongly associated with whether an asteroid is potentially hazardous or not.

• Train a logistic regression model to predict whether an asteroid is potentially hazardous or not based on its characteristics.

• Evaluate the performance of the logistic regression model.
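For the evaluation objective, overall accuracy alone can hide poor detection of the rarer hazardous class, so sensitivity and specificity are worth computing alongside it. The sketch below shows how these follow from confusion-matrix counts; the labels are hypothetical (1 = PHA, 0 = non-PHA) and are not results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = PHA)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity (recall on PHAs) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels: 3 hazardous asteroids out of 10, only 1 detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(evaluate(y_true, y_pred))
```

In this made-up case the classifier is right 70% of the time yet finds only one of three hazardous objects, which is the kind of gap between accuracy and sensitivity that an imbalanced dataset can produce.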

1.4 Research questions

The following questions will be answered at the conclusion of this study.

• Which asteroid characteristics have the strongest association with whether an asteroid is potentially hazardous or not?

• Is it possible to accurately predict whether an asteroid is potentially hazardous or not using its characteristics?

• What is the performance of the logistic regression model in predicting whether an asteroid is potentially hazardous or not?

• Are there any subgroups of asteroids with higher or lower risk of being potentially hazardous?

1.5 Scope of the research

The primary purpose of this research is to analyze orbital parameters and other properties of asteroids, such as relative velocity, in order to classify Near-Earth Asteroids, determine which category is most likely to be potentially hazardous, and predict whether a Near-Earth Asteroid is potentially hazardous based on its characteristics. This could involve using the standard asteroid classification criteria of NASA’s Center for Near Earth Object Studies, statistical methods, and other analytical approaches. This research could help to better understand the interconnections between orbital and non-orbital parameters and the potential hazardousness of Near-Earth Asteroids.

1.6 Justification of the research

In predicting whether a Near-Earth Asteroid is potentially hazardous, a variety of factors may affect an asteroid's orbital characteristics and hazardousness, and statistical approaches can be used to discover patterns and correlations that may not be immediately apparent. This can help to gain a deeper understanding of the processes that underlie asteroids' behavior and the dangers that they might pose to Earth.

In addition, statistical analysis enables researchers to test hypotheses and make predictions

regarding asteroids' behavior based on their orbital parameters. This will aid future research and

policy decisions in classifying NEAs and identifying relationships between orbital and non-

orbital factors.

Overall, predicting whether a Near-Earth Asteroid is potentially hazardous using statistical methods can offer a more comprehensive understanding of NEAs and their categories, and can facilitate more effective decision-making.

1.7 Expected limitations

Statistical analysis is only as good as the data used to perform it, and the quantity of data currently accessible for the study of NEAs may be limited. Moreover, statistical analysis can only estimate the relationships between variables in probabilistic terms. There is always a chance that these estimates will not be precisely accurate, and the outcomes cannot be guaranteed.

Only asteroids with circular and elliptical orbits are considered in this study, because they are more likely to be NEAs. Asteroids with parabolic or hyperbolic trajectories are therefore disregarded.

Furthermore, statistical analysis may not be able to capture all the complexity of the systems

being studied. Statistical analysis is only one tool that can be used to study NEAs, and it may not

be appropriate for all research objectives or for all types of discoveries. It may be necessary to

use additional techniques, such as simulations or observational research, to enhance,

complement, or supplement the findings of statistical analysis.

Although statistical research can provide insights into the relationships between orbital

parameters and non-orbital parameters and the hazardousness of NEAs, it is crucial to be aware

of its limitations. Consequently, the data obtained to conduct this research is subject to these

limitations.

1.8 Proposed work schedule
Weeks covered: December (3rd and 4th week), January (1st to 4th week), February (1st to 4th week)

Scheduled tasks, in order:
• Topic selection
• Data acquisition
• Study articles & research papers
• Project proposal submission
• Data cleaning
• Data analyzing
• First draft preparation
• Final report submission

Table 1-1 Proposed Work Schedule

Chapter 2 : Literature Review

2.1 Introduction to the research theme

NEAs are small, rocky objects that orbit the Sun and are believed to be remnants from the inner solar system. Understanding the characteristics and behavior of asteroids is a crucial area of research in planetary science, since it can give insight into the origins and evolution of the solar system and help determine the likelihood of asteroid impacts.

Several techniques have been developed to categorize NEAs and identify PHAs. One common method is based on the asteroids' orbital characteristics, such as semi-major axis, aphelion distance, and perihelion distance. Studies have investigated the characteristics of NEAs and PHAs using a range of data sources, including observations from ground-based telescopes, space-based telescopes, and spacecraft missions.

The study of orbital characteristics and asteroids' hazardousness using statistical methods has

gained increasing attention in recent years. This research theme aims to understand how orbital

parameters can be used to classify asteroids and investigate interconnections between orbital and

non-orbital parameters and hazardousness. Researchers can develop models to discover patterns

and trends that may be helpful in categorizing asteroids and determining their hazardousness by

studying data on the orbital characteristics of known NEAs.

As a result, this chapter addresses previous studies and discusses the approaches used as well as

the research gap that was found.

2.2 Theoretical explanation of the keywords in the topic

Keyword | Theoretical definition
Near Earth Asteroids (NEA) | A near-Earth asteroid is a type of asteroid that orbits the Sun and comes relatively close to the Earth's orbit.
Potentially Hazardous Asteroids (PHA) | A potentially hazardous asteroid is a Near-Earth asteroid that has the potential to impact the Earth and cause significant damage or destruction. PHAs are a subset of NEAs.
Astronomical Unit (AU) | An astronomical unit (AU) is a unit of distance used to measure distances within the solar system. It is defined as the average distance between the Earth and the Sun, which is approximately 150 million kilometers.
Inner solar system | The inner solar system is the region of the solar system that includes the four terrestrial planets (Mercury, Venus, Earth, and Mars) and the asteroid belt.
Spectrophotometry | Spectrophotometry is a method used to measure the absorbance of light by a sample as a function of the wavelength of the light.
Ascending node's longitude | The ascending node's longitude is the angle, measured in the plane of the ecliptic from the direction of the vernal equinox, to the ascending node, the point where the object's orbit crosses the ecliptic from south to north.
Argument of perihelion | The argument of perihelion is the angle, measured in the orbital plane in the direction of motion, from the ascending node to the perihelion, the point of the orbit closest to the central body (the Sun).
Bus-DeMeo taxonomy | The Bus-DeMeo taxonomy is an asteroid classification system designed by Francesca DeMeo, Schelte Bus, and Stephen Slivan in 2009. It is based on reflectance spectrum characteristics.
Amor | Earth-approaching NEAs with orbits exterior to Earth’s but interior to Mars’s. (Named after the asteroid (1221) Amor)
Apollo | Earth-crossing NEAs with semi-major axis larger than Earth’s semi-major axis. (Named after the asteroid (1862) Apollo)
Aten | Earth-crossing NEAs with semi-major axis smaller than Earth’s semi-major axis. (Named after the asteroid (2062) Aten)
Atira | NEAs whose orbits are contained entirely within the orbit of the Earth. (Named after the asteroid (163693) Atira)

Table 2-1 Theoretical explanation of the keywords in the topic

2.3 Findings by other researchers

There are numerous resources available for further study on classifying asteroids using their orbital parameters and on examining the relationship between those parameters and the hazardousness of asteroids using statistical techniques. Potential sources of information include academic journals, online databases and repositories, and websites.

A 2019 study by Kathleen Jacinda Mcintyre, “Applying Machine Learning to Asteroid Classification Utilizing Spectroscopically Derived Spectrophotometry”, used a machine learning algorithm to classify asteroids from spectrophotometric data and compared the results with existing Bus-DeMeo taxonomy (DeMeo, et al. 2009) categorization efforts. It then investigated the relationship between classified asteroids and observational characteristics by integrating historical methods and perspectives with modern computing capabilities.

The research titled “Prediction of orbital parameters for undiscovered potentially hazardous

asteroids using machine learning” (2018) published by Vadym Pasko analyzed PHAs in the

Amor group, and the results showed a relationship between the density of the PHA distribution

and the ascending node's longitude and argument of perihelion.

Another study, “Radar imaging and physical characterization of Near-Earth Asteroid (162421) 2000 ET70”, published in the journal Icarus in 2013, shows that its author was able to predict a reliable trajectory for NEA (162421) 2000 ET70 by utilizing precise measurements of the range and velocity of the asteroid, which are non-orbital parameters. These were measured by obtaining continuous-wave spectra and range-Doppler images with range resolutions as fine as 15 m. NEA (162421) 2000 ET70 was observed during its closest approach to Earth in February 2012, over a period of 12 days, using the Arecibo and Goldstone radar systems.

These are merely a few instances of the various types of discoveries that have been made in the

subject of asteroid studies throughout the past years. I suggest looking up more information on

this subject in online databases and repositories like the Minor Planet Center and the NASA

Astrophysics Data System (ADS), as well as academic publications like the Planetary Science

Journal and the Journal of the Royal Astronomical Society.

2.4 Research gap

There are various types of research about celestial bodies like planets, exoplanets, comets, and asteroids. Nevertheless, as far as I know, only a few studies classify NEAs and predict the hazardousness of Near-Earth Asteroids based on their orbital and non-orbital parameters using statistical approaches. Kathleen Jacinda Mcintyre’s research leaves a gap because the author considered only spectrophotometric data to identify and categorize asteroids and used the Bus-DeMeo classification method, an asteroid taxonomic system that relies only on reflectance spectrum characteristics.

Furthermore, the study by Vadym Pasko leaves a gap because the author investigated only one NEA group, the Amors, and its results concern only the relationship between the density of the PHA distribution and several orbital parameters.

The research named “Radar imaging and physical characterization of near-Earth Asteroid (162421) 2000 ET70” leaves a gap because the author investigated only a single NEA that can be potentially hazardous. In addition, critical orbital parameters were not considered, and no major statistical approaches were taken.

Finally, there is still much we don't know about the relationship between asteroids' orbital

properties and their hazardousness. While some studies have found correlations between certain

orbital parameters and the likelihood of an asteroid impacting Earth, there is still a lot to be

discovered about these relationships. Therefore, further research is needed to reveal other

insightful information about these interconnections.

2.5 Table of Variables

Variable name | Theoretical definition | Recorded value / unit
NEO Reference ID | A unique identifier assigned to a specific NEA. | Reference ID
Absolute magnitude | The apparent magnitude an NEA would have if it were located 1 AU from both the Sun and the observer and viewed at zero phase angle. Lower values correspond to brighter objects. | Absolute magnitude
Minimum orbit intersection distance (MOID) | The minimum distance between the orbit of the NEA and the orbit of the Earth, taken over all points along the two orbits. | MOID in astronomical units
Semi-major axis | Half of the length of the longest line that can be drawn through the NEA’s orbit. | Semi-major axis length in astronomical units
Perihelion distance | The distance between the asteroid and the Sun at the point in its orbit when it is closest to the Sun. | Perihelion distance in astronomical units
Aphelion distance | The distance between the asteroid and the Sun at the point in its orbit when it is farthest from the Sun. | Aphelion distance in astronomical units
Eccentricity | A measure of how much the asteroid's orbit around the Sun deviates from a perfect circle. It is a value between 0 and 1, with 0 representing a perfect circle and values closer to 1 indicating a highly eccentric orbit. | Value between 0 and 1
Relative velocity | The speed at which an asteroid is moving relative to a frame of reference, such as the Earth or the Sun. It is a measurement of the speed and direction of the asteroid's motion. | Relative velocity in km/h
Hazardousness | Whether a near-Earth asteroid can cause a significant impact on the Earth. | PHA, Non-PHA
NEA-Class | The category to which the NEA belongs: Apollo, Amor, Aten, or Atira. | Apollo, Amor, Aten, Atira
Asc Node Longitude | The longitude of the ascending node, the point at which the asteroid's orbit crosses the ecliptic plane (the plane of Earth's orbit around the Sun) from south to north. It is the angle measured in the ecliptic plane from the vernal equinox, a reference direction on the celestial sphere where the ecliptic intersects the celestial equator, to that node. | Angle measured in degrees
Perihelion Arg | The orientation of the asteroid's orbit within its orbital plane. It is the angle between the asteroid's ascending node (the point where the asteroid crosses the ecliptic from south to north) and the point where the asteroid reaches perihelion, its closest approach to the Sun. | Angle measured in degrees
Mean Anomaly | The position of a celestial object, such as a Near-Earth Asteroid (NEA), along its elliptical orbit, expressed as an angle. It is the angle the object would have swept out since perihelion (the point in its orbit where it is closest to the Sun) if it moved at its constant mean motion, measured in the plane of the object's orbit. | Angle measured in degrees
Mean Motion | The average angular speed at which an asteroid travels in its orbit around the Sun. | Average angular speed of an asteroid in its orbit around the Sun
Est Dia in KM (min) | Minimum estimated diameter of the Near-Earth Asteroid (NEA) in kilometers. | Estimated diameter in kilometers
Est Dia in KM (max) | Maximum estimated diameter of the Near-Earth Asteroid (NEA) in kilometers. | Estimated diameter in kilometers
Inclination | The tilt or angle of an asteroid's orbit relative to the plane of the ecliptic, which is the plane of Earth's orbit around the Sun. | Angle measured in degrees
Miss Dist. (Astronomical) | The closest distance between the NEO and the Earth at the time of closest approach, in astronomical units (AU). | Distance in astronomical units

Table 2-3 Table of Variables

Chapter 3: Data Preparation Process - Data Pre-processing and Data Wrangling

3.1 Data acquisition

The first step in the data preparation process is to acquire the data needed for the research. This

study relies on a dataset from "Kaggle," a secondary data source, for its data. However, the Jet

Propulsion Laboratory of NASA is the original owner of the data in this dataset.

3.2 Data pre-processing

Once the data has been acquired, it may need to be pre-processed to make it suitable for analysis. This may involve cleaning the data by removing errors, inconsistencies, or missing values, using methods such as outlier identification, imputation, or data interpolation. After the data has been cleaned, it may need to be transformed into a suitable format for analysis. This could involve converting data from one type to another, such as from string to numeric, or scaling data to a common range.

3.2.1 Remove unwanted columns.

The first step of data cleaning, carried out after importing the required libraries and the dataset, was to remove unnecessary columns. Figure 1 shows the complete dataset before cleaning, which consisted of 39 columns. After cleaning, as illustrated in Figure 2, the number of columns was reduced to 17. Columns not contained in a specified list were removed from the original data frame, and a copy of the result was stored in a new data frame called ‘data1_new’. The columns specified for inclusion are: “Neo Reference ID”, “Absolute Magnitude”, “Est Dia in KM(min)”, “Relative Velocity km per hr”, “Est Dia in KM(max)”, “Miss Dist.(Astronomical)”, “Minimum Orbit Intersection”, “Eccentricity”, “Semi Major Axis”, “Inclination”, “Asc Node Longitude”, “Orbital Period”, “Perihelion Distance”, “Aphelion Dist”, “Mean Anomaly”, “Mean Motion”, “Perihelion Arg”, and “Orbit Uncertainity”.

Figure 1 Before removing unwanted columns.

Figure 2 After removing unwanted columns.
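The column filtering in this step could be sketched as follows. This is only a minimal illustration and not the project's actual code: the small stand-in frame `data`, its values, and the shortened `keep_columns` list are assumptions made for the example, whereas the real list contains all the columns named in the text.

```python
import pandas as pd

# Stand-in for the imported Kaggle dataset (hypothetical rows and columns).
data = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955],
    "Name": [3703080, 3723955],
    "Absolute Magnitude": [21.6, 21.3],
    "Close Approach Date": ["1995-01-01", "1995-01-01"],
    "Minimum Orbit Intersection": [0.025282, 0.186935],
})

# Shortened for illustration; the report keeps the full list given in the text.
keep_columns = [
    "Neo Reference ID",
    "Absolute Magnitude",
    "Minimum Orbit Intersection",
]

# Drop every column that is not in the list, then copy the result so that
# later edits to data1_new do not touch the original frame.
unwanted = [col for col in data.columns if col not in keep_columns]
data1_new = data.drop(columns=unwanted).copy()

print(data1_new.shape)
```

Dropping by exclusion keeps the original column order, which makes a before-and-after comparison of the two frames straightforward.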


3.2.2 Checking Null Values

Checking for missing values is an important step in the preparation of data. Missing values must be recognized and handled effectively in order to ensure data quality. In this case, the number of null values present in each column of the data1_new data frame was computed, and every count was zero. Therefore, it can be concluded that the data1_new data frame does not contain any missing values.

Figure 3 Count of null value per columns in data1_new.
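A per-column null count of this kind can be obtained in pandas roughly as shown below. The three-row frame is a hypothetical stand-in for data1_new, used only so that the sketch runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for data1_new with two of its columns.
data1_new = pd.DataFrame({
    "Absolute Magnitude": [21.6, 21.3, 20.3],
    "Eccentricity": [0.4255, 0.3517, 0.3482],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = data1_new.isnull().sum()
print(null_counts)

# True if at least one column contains a missing value.
has_missing = bool(null_counts.any())
print(has_missing)
```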

3.3 Data Wrangling

Depending on the individual research objectives, the pre-processed data may need to be

wrangled or transformed. This may involve filtering the data based on certain criteria, such as the

type of asteroid or the range of orbital parameters. If the research involves combining data from

multiple sources, it could be required to do so consistently and meaningfully. To make the data

more suitable for analysis, it may be necessary to transform it in various ways. This may involve

creating new variables or features from the data or reshaping the data into a different format.

A new column named “Nea_Class” has been attached to the data1_new data frame with the

intention of categorizing NEAs into distinct groups according to their orbital characteristics. The

standard criteria established by NASA's Center for Near Earth Object Studies have been utilized

to categorize near-earth asteroids. The specific criteria used for this purpose are listed below:

- Nea_Class value “Atens” has been allocated to NEAs with Semi Major Axis less than 1

and Aphelion Distance larger than 0.983.

- Nea_Class value “Atiras” has been allocated to NEAs with Semi Major Axis less than 1

and Aphelion Distance less than 0.983.

- Nea_Class value “Apollos” has been allocated to NEAs with Semi Major Axis larger

than 1 and a Perihelion Distance less than 1.017.

- Nea_Class value “Amors” has been allocated to NEAs with Semi Major Axis greater

than 1 and a Perihelion Distance between 1.017 and 1.3.

- The Nea_Class column now has the value “NA” for those NEAs whose classification is

still null.
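The rules listed above can be expressed as a small function. This is a sketch under stated assumptions: the function name `classify_nea` and the four example orbits are hypothetical, and all distances are taken to be in astronomical units.

```python
def classify_nea(semi_major_axis, perihelion_dist, aphelion_dist):
    """Return the NEA group for one asteroid, following the listed criteria."""
    if semi_major_axis < 1.0 and aphelion_dist > 0.983:
        return "Atens"
    if semi_major_axis < 1.0 and aphelion_dist < 0.983:
        return "Atiras"
    if semi_major_axis > 1.0 and perihelion_dist < 1.017:
        return "Apollos"
    if semi_major_axis > 1.0 and 1.017 < perihelion_dist < 1.3:
        return "Amors"
    # No rule matched, so the class stays unassigned.
    return "NA"


# Hypothetical orbits as (semi-major axis, perihelion, aphelion).
examples = [
    (0.92, 0.75, 1.09),
    (0.74, 0.50, 0.98),
    (1.46, 0.83, 2.09),
    (1.92, 1.08, 2.76),
]
labels = [classify_nea(a, q, Q) for a, q, Q in examples]
print(labels)
```

On a data frame, the same function could be applied row by row to the Semi Major Axis, Perihelion Distance, and Aphelion Dist columns to fill Nea_Class.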

Another new column labeled "Hazardous" has been attached to the data1_new data frame. Its purpose is to sort NEAs into two categories, "PHA" and “NON_PHA”, using the standards established by NASA’s Center for Near Earth Object Studies. The Hazardous column has been set to “PHA” for rows where the Minimum Orbit Intersection value is less than or equal to 0.05 and the Absolute Magnitude value is less than or equal to 22. It has been set to “NON_PHA” for all remaining rows, whose value was still null after that step.

Figure 4 Added new column Nea_Class to the data frame.

Figure 5 Added new column Hazardous to the data frame.
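The hazard rule described in this section can be written as a two-line test. The function name `hazard_label` and the sample values below are assumptions for illustration; the thresholds are the ones stated in the text.

```python
def hazard_label(min_orbit_intersection, absolute_magnitude):
    """Label one asteroid using the PHA criteria stated in the text."""
    if min_orbit_intersection <= 0.05 and absolute_magnitude <= 22.0:
        return "PHA"
    # Every row that fails either condition falls back to NON_PHA.
    return "NON_PHA"


# Hypothetical (Minimum Orbit Intersection, Absolute Magnitude) pairs.
samples = [(0.025, 21.6), (0.187, 21.3), (0.030, 24.0), (0.050, 22.0)]
labels = [hazard_label(moid, h) for moid, h in samples]
print(labels)
```

Both conditions are inclusive, so an asteroid sitting exactly on the two thresholds is labeled PHA.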

This project relies heavily on logistic regression. To make the analysis easier, categorical values were transformed into binary values using the “get_dummies()” function from the pandas library. The PHA data frame, obtained by applying get_dummies() to the Hazardous column of data1_new, was concatenated with data1_new using the pandas “concat()” function. Similarly, the Nea data frame, obtained by applying get_dummies() to the Nea_Class column of data1_new, was concatenated into a new data frame called “data2_new”, which serves a specific objective of this project.

Figure 6 After concatenating PHA data frame in to data1_new

Figure 7 After concatenating Nea data frame into data2_new.
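The encoding and concatenation steps might look roughly like this. The three-row frame is hypothetical, and keeping only the "PHA" indicator column as the binary target is an assumption about how the PHA data frame was built; `dtype=int` is used here so the indicators print as 0 and 1.

```python
import pandas as pd

# Hypothetical stand-in for data1_new with the two categorical columns.
data1_new = pd.DataFrame({
    "Hazardous": ["PHA", "NON_PHA", "NON_PHA"],
    "Nea_Class": ["Apollos", "Amors", "Atens"],
})

# One 0/1 indicator column per category. Keeping only the "PHA" indicator
# yields a single binary target (1 = PHA, 0 = NON_PHA).
PHA = pd.get_dummies(data1_new["Hazardous"], dtype=int)[["PHA"]]
data1_new = pd.concat([data1_new, PHA], axis=1)

# Indicator columns for every NEA class, appended to form data2_new.
Nea = pd.get_dummies(data1_new["Nea_Class"], dtype=int)
data2_new = pd.concat([data1_new, Nea], axis=1)

print(data2_new)
```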

After adding two new columns, the missing values were re-examined to ensure that the data was

complete and ready for analysis. It was found that the data frames had no null values.

Figure 8 Count of Null Values per Column in Data1_new & Data2_new.

Chapter 4: Methodology

4.1 Introduction

This chapter describes how the asteroids are examined and classified, how relationships between the orbital parameters and other characteristics of NEAs are investigated, and how the hazardousness of the classified NEA groups is assessed using statistical methods. It also covers the population, sample, sampling technique, data collection techniques, and data sources.

4.2 Population, Sample, and Sampling technique

In statistical research, it is often necessary to study a population in order to understand certain characteristics or patterns. A population is a collection of individuals or objects with common features or characteristics relevant to the investigation. For this study, which analyzes orbital parameters to classify asteroids and investigates the relationship between orbital parameters, non-orbital parameters, and the hazardousness of asteroids, the population can be taken as all known asteroids in the solar system.

Researchers often use a sample, which is a subset of the population, to represent the population because it is typically difficult or impractical to deal with the whole population. Out of the entire population of asteroids, near-Earth asteroids are the sample selected for this research.

4.3 Type of Data to be Collected and Data Sources

The data collected for a study of this kind can be both quantitative and qualitative. The quantitative data could include measurements of orbital parameters, such as the semi-major axis, eccentricity, perihelion distance, and aphelion distance, as well as data on the physical properties of the asteroids, such as size, shape, and composition. The qualitative data could include categorical variables or text data, such as the classification of the asteroids. Data for such research could be gathered from a variety of sources, including primary data sources such as observations or experiments, and secondary data sources such as published papers, databases, or online resources. The data used for this study is a dataset published on “Kaggle”, a data science and machine learning community, so it is a secondary data source. However, the data in this dataset is originally owned by NASA’s Jet Propulsion Laboratory ( http://neo.jpl.nasa.gov/ ). This API is maintained by the SpaceRocks Team: David Greenfield, Arezu Sarvestani, Jason English, and Peter Baunach. The data was gathered in CSV format, making it easier to handle and comprehend. All collected information will be kept confidential and secure.

4.4 Data collection tools and plan

As mentioned before, "Kaggle" was used to obtain the dataset. The source of the data published on Kaggle is NeoWs (Near Earth Object Web Service), a web service that provides information about near-Earth asteroids. With NeoWs a user can search for asteroids based on their closest approach date to Earth, look up a specific asteroid by its NASA JPL small body ID, and browse the overall dataset. Since the data has already been processed and published, no further data collection methods are required.

4.5 Methods, Techniques, and Tools

At the beginning of this research, all NEAs are classified using the standard asteroid classification criteria of NASA’s Center for Near Earth Object Studies, applied to the data provided in the dataset. The standard NASA criteria are likewise used to determine the hazardousness of NEAs from the relevant data available in the dataset gathered from Kaggle. The study applies several statistical techniques, such as descriptive statistics, linear regression, logistic regression, and hypothesis testing. It may also include information on the assumptions and limitations of these techniques and how they are applied to the data.

Visualization techniques, such as scatter plots, histograms, bar charts, and box plots, are used to better understand the outcomes of the analysis and to identify relationships and patterns between variables that cannot be detected by just looking at the data. Data visualization gives the users of this study visual insight alongside the interpretations. All these techniques are performed using the R or Python programming languages.

Chapter 5: Data Analysis, Visualization, and Interpretation

5.1 Descriptive Statistics

This section provides a useful framework for analyzing and summarizing large datasets in a

meaningful way. Descriptive statistical analysis can help users comprehend the distribution of

various variables and their correlations with one another in the context of determining whether or

not an asteroid is potentially hazardous based on its characteristics. By identifying patterns and

trends in the data, we can gain valuable insights into the factors that are most important for

predicting asteroid hazard potential.

Figure 9 Summary statistics of data1_new.

The summary statistics table gives an overview of the most significant descriptive statistics (count, mean, standard deviation, minimum, maximum, and quartiles) for each numerical column in the “data1_new” data frame. Columns that are not appropriate for summarization, specifically 'Neo Reference ID' and 'PHA', were dropped before computing the table using the “drop()” function in pandas.
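A summary table of this kind can be produced along the following lines. The frame below is a hypothetical three-row stand-in for data1_new, so the printed numbers are not those of the report.

```python
import pandas as pd

# Hypothetical stand-in for data1_new.
data1_new = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955, 2446862],
    "PHA": [1, 0, 1],
    "Absolute Magnitude": [21.6, 21.3, 20.3],
    "Eccentricity": [0.43, 0.35, 0.35],
})

# Drop the identifier and the target before summarizing, because their
# mean or quartiles carry no useful meaning.
summary = data1_new.drop(columns=["Neo Reference ID", "PHA"]).describe()
print(summary)
```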

Figure 10 Histograms of Numerical variables.

To produce the histograms, the columns "Neo Reference ID" and "PHA" are first removed from data1_new to create a new data frame called “hist_dataframe”. A histogram is then drawn for each remaining numerical column.

Correlation is also important because it helps to identify the strength and direction of the relationships between the variables in the dataset. The variables that have a high correlation with the target variable can be identified by computing the correlation coefficients between the features. The following correlation heatmap displays the correlation coefficients between all pairs of numerical variables in the dataset in a matrix format.

Figure 11 Correlation Heatmap of numerical pairs.
The following list shows the ten strongest positive and the ten strongest negative correlations. Among the positive correlations, there are strong relationships between Est Dia in KM (min) and Est Dia in KM (max), Aphelion Dist and Semi Major Axis, Aphelion Dist and Eccentricity, and others. These variables tend to increase and decrease together, although correlation alone does not show that a change in one causes a change in the other. Among the negative correlations, there are strong relationships between Mean Motion and Perihelion Distance, Est Dia in KM (max) and Absolute Magnitude, Est Dia in KM (min) and Absolute Magnitude, Mean Motion and Aphelion Distance, Mean Motion and Semi Major Axis, and others. In these pairs the variables are inversely related: as one increases, the other tends to decrease.

Figure 12 Top negative and positive strongest correlation coefficients
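One way to extract the strongest correlated pairs from a correlation matrix is sketched below. The four-row frame is hypothetical and the pair-walking loop is an assumption about the approach, not the project's code; it keeps each pair once and skips the diagonal.

```python
import pandas as pd

# Hypothetical values for three of the numerical columns.
df = pd.DataFrame({
    "Est Dia in KM(min)": [0.10, 0.20, 0.30, 0.40],
    "Est Dia in KM(max)": [0.22, 0.45, 0.67, 0.89],
    "Absolute Magnitude": [22.0, 20.5, 19.6, 19.0],
})

# Pearson correlation coefficients between all pairs of columns.
corr = df.corr()

# Walk the upper triangle so that each pair appears exactly once.
pairs = []
cols = list(corr.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        pairs.append((first, second, corr.loc[first, second]))

# Sort from the most positive to the most negative coefficient.
pairs.sort(key=lambda item: item[2], reverse=True)
strongest_positive = pairs[0]
strongest_negative = pairs[-1]
print(strongest_positive)
print(strongest_negative)
```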

5.2 Feature Selection

Figure 13 Most impactful features for prediction

Figure 13 displays the ten most significant features taken from the dataset for predicting the 'PHA' (Potentially Hazardous Asteroid) variable. The features listed in the index of the output are “Absolute Magnitude”, “Est Dia in KM (min)”, “Est Dia in KM (max)”, “Relative Velocity km/hr”, “Orbit Uncertainty”, “Minimum Orbit Intersection”, “Eccentricity”, “Perihelion Distance”, “Aphelion Distance”, and “Mean Anomaly”. These features were selected using the SelectKBest method, which applies the f_classif scoring function (an ANOVA F-test) to evaluate the significance of each feature in relation to the target variable. The chosen features are expected to have the most impact on determining whether or not an asteroid is potentially hazardous. Concentrating the analysis and modeling effort on these features may help the prediction model become more accurate and efficient.
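The selection step can be illustrated with scikit-learn's SelectKBest and f_classif on synthetic data. Everything in this sketch is assumed for illustration: the data are random, the feature names are made up, and k is 1 rather than the 10 used in the report.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature that shifts with the class, two pure-noise ones.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(scale=0.3, size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)

X = np.column_stack([noise_a, informative, noise_b])
feature_names = ["noise_a", "informative", "noise_b"]

# Score every feature with the ANOVA F-test and keep the best k.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
print(selected)
```

The F-test scores each feature on its own, so it does not account for redundancy between features such as the minimum and maximum estimated diameters.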

5.3 Regression Analysis

5.3.1 Training Logistic Regression Model

Figure 14 Train Logistic Regression Model.


In this case the dependent variable is “PHA”, the target variable that we are trying to predict. The independent variables are the features chosen by the feature selection method. The output “LogisticRegression(random_state=0, solver='liblinear')” represents the logistic regression model built with the supplied hyperparameters. Because the random_state parameter is set to 0, the model produces the same results whenever it is fitted with the same input data and hyperparameters. The solver parameter is set to 'liblinear', the algorithm the model uses to optimize its coefficients. This output confirms that the model has been fitted and can now be used to predict the probability of an asteroid being potentially hazardous based on the selected features in X.

5.3.2 Generate Predictions

Figure 15 Generate Prediction.

The output of this code is an array of predicted values for the target variable (PHA), produced by applying the logistic regression model trained on the training dataset to the test dataset. Because the target variable is binary, the predicted values are either 0 or 1; in this case they are stored as an array of unsigned 8-bit integers (uint8). A predicted value of 0 indicates that the asteroid is not potentially hazardous (NON_PHA), whereas a predicted value of 1 indicates that it is potentially hazardous (PHA). The array is represented as a data frame in the following figure.

Figure 16 Prediction Data Frame.

The Data Frame has two columns, “PHA” and “PHA_Pred”. The PHA column contains the

actual PHA labels for the corresponding samples in the test set. The PHA_Pred column contains

the predicted PHA labels for the same samples.
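The training and prediction steps of Sections 5.3.1 and 5.3.2 could be sketched together as follows. The synthetic features, the 30% test split, and the variable names are assumptions made so the example runs on its own; only the hyperparameters `random_state=0` and `solver='liblinear'` come from the output quoted in the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features X and the binary target PHA.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 30% of the rows for testing (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same hyperparameters as in the output quoted in Section 5.3.1.
model = LogisticRegression(random_state=0, solver="liblinear")
model.fit(X_train, y_train)

# Hard 0/1 predictions for the test rows, side by side with the true labels.
y_pred = model.predict(X_test)
predictions = pd.DataFrame({"PHA": y_test, "PHA_Pred": y_pred})
print(predictions.head())
```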

5.3.3 Evaluating Regression Model

Figure 17 Accuracy of Logistic Regression model.

Figure 17 displays the logistic regression model's accuracy score on the test set. The accuracy score is the proportion of instances in the test set that were correctly classified. In this case, the model has an accuracy of 0.8230, meaning that 82.3% of the instances in the test set were classified correctly. However, accuracy alone may not be a sufficient measure of a model's performance, especially when dealing with imbalanced datasets. There is a class imbalance in this dataset, with a substantially higher proportion of cases falling into the negative class (PHA = 0) than the positive class (PHA = 1). Therefore, a classification report is needed to evaluate the performance of the model more comprehensively.

Figure 18 Classification Report

With a precision of 0.83 and a recall of 0.98, the report shows that the model has high precision and recall when predicting NON-PHA observations (class 0). On the other hand, the model's performance in predicting PHA observations (class 1) is much weaker, with a low precision of 0.17 and a recall of only 0.02. This may be due to the imbalanced nature of the dataset, where there are many more non-hazardous asteroids than hazardous ones. The model's weighted average F1-score is 0.76, which summarizes its performance across both classes but is dominated by the majority class.


Figure 19 Confusion Matrix.


In this case, the matrix's first row displays the observations whose actual class is 0. Of these 1173 observations, 1154 were correctly classified as 0 (TN), while 19 were misclassified as 1 (FP). The second row displays the observations whose actual class is 1. Of these 234 observations, 4 were correctly classified as 1 (TP), whereas 230 were mistakenly classified as 0 (FN). Overall, the model's performance is poor, as it correctly predicts only a small fraction of the actual PHAs. The high number of false negatives (FN) and the low number of true positives (TP) show that the model has a very low sensitivity for the category of "potentially hazardous asteroids", even though its specificity is high because few non-hazardous asteroids are flagged as hazardous.
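The reported accuracy, precision, and recall can be recomputed directly from the four counts in the confusion matrix, which is a useful consistency check. The sketch below uses only those counts; the variable names are chosen for this example, and the majority-class baseline is an added comparison rather than a figure from the report.

```python
# Counts read from the confusion matrix.
tn, fp, fn, tp = 1154, 19, 230, 4

total = tn + fp + fn + tp

# Share of all test observations that were classified correctly.
accuracy = (tp + tn) / total

# Class 1 (PHA): how many predicted PHAs are real, and how many real PHAs
# are found (recall is also called sensitivity).
precision_pha = tp / (tp + fp)
recall_pha = tp / (tp + fn)

# Class 0 (NON_PHA): precision, and recall (also called specificity).
precision_non_pha = tn / (tn + fn)
specificity = tn / (tn + fp)

# Accuracy of a trivial model that always predicts class 0.
baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (PHA)   = {precision_pha:.2f}")
print(f"recall (PHA)      = {recall_pha:.2f}")
print(f"precision (NON)   = {precision_non_pha:.2f}")
print(f"specificity       = {specificity:.2f}")
print(f"always-0 baseline = {baseline:.4f}")
```

The baseline line shows why accuracy is misleading here: labeling every asteroid as non-hazardous would already be right for 1173 of the 1407 test observations.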

5.3.4 Identify which NEA categories are more likely to be PHAs.

Figure 20 Train regression model.

Using the training set produced by splitting the dataset into training and testing sets, the code

trains a logistic regression model with a random state of 0. The logistic regression object with its

default parameter values is displayed in the output.

Figure 21 Predict the probabilities.
Using the trained logistic regression model, the code determines the predicted probability that

each data point in the test set will belong to the positive class (PHA). The result is an array with

these probabilities for every single data point.

Figure 22 New data frame with Predicted Probability

The code generates a Data Frame that pairs the Nea_Class of each asteroid in the test set with the probability of being a potentially hazardous asteroid (PHA) predicted for it by the logistic regression model.

Figure 23 Rank the categories by their predicted probabilities.

The result displays, for each NEA class (Atiras, Apollos, Atens, and Amors), the highest predicted probability that an asteroid in that class is a potentially hazardous asteroid (PHA). The NEA class Atiras has the highest probability of being a PHA, followed by Apollos and Atens, while Amors has the lowest probability of being a PHA. This knowledge can be helpful in identifying and ranking NEAs for further observation and research.
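Ranking the classes by their highest predicted probability could be done with a pandas group-by, as sketched below. The probabilities and the column name `PHA_Probability` are invented for this example and do not reproduce the report's results.

```python
import pandas as pd

# Hypothetical test-set rows: NEA class and predicted probability of PHA.
prob_df = pd.DataFrame({
    "Nea_Class": ["Apollos", "Amors", "Apollos", "Atens", "Amors"],
    "PHA_Probability": [0.41, 0.05, 0.28, 0.33, 0.12],
})

# Take the highest probability within each class, then sort descending.
ranking = (
    prob_df.groupby("Nea_Class")["PHA_Probability"]
    .max()
    .sort_values(ascending=False)
)
print(ranking)
```

Because a maximum depends on a single asteroid per class, replacing `.max()` with `.mean()` may give a more stable ranking when class sizes differ widely.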

Chapter 6: Discussion & Conclusion

6.1 Discussion

The logistic regression model developed in this project was intended to predict the potential hazardousness of Near-Earth Asteroids (NEAs) based on their orbital and physical characteristics. The model was trained on a dataset consisting of various NEA classes, and its performance was evaluated using a confusion matrix. The results showed that the model had a low sensitivity for identifying potentially hazardous asteroids, which indicates that the model needs further improvement.

One possible reason for the model's poor performance could be the quality of the input data. The

dataset used in this project was derived from the JPL Small-Body Database, which provides

information on the orbits and physical properties of NEAs. However, the database is not

complete, and there may be biases in the data due to the way observations are made. This may

have affected the model's ability to accurately predict the hazardousness of NEAs.

Another possible explanation could be the choice of features used in the model. The logistic

regression model was trained on a set of features that were selected based on their potential

relevance to the hazardousness of NEAs. However, there may be other important features that

were not included in the model, which could have contributed to the model's poor performance.

Despite these limitations, the logistic regression model developed in this project has some

potential practical applications. For example, it can be used to prioritize the observation of NEAs

that are most likely to be potentially hazardous, which can help to identify asteroids that pose a

threat to Earth.

References

- NASA: Asteroids Classification (n.d.) available from <https://www.kaggle.com/datasets/shrutimehta/nasa-asteroids-classification> [14 December 2022]

- NEO Basics (n.d.) available from <https://cneos.jpl.nasa.gov/about/neo_groups.html> [17 December 2022]

- Vadym Pasko (n.d.) available from <http://vadym-pasko.com/#research> [17 December 2022]

- ‘Near-Earth Object’ (2022) in Wikipedia [online] available from <https://en.wikipedia.org/w/index.php?title=Near-Earth_object&oldid=1127646254> [18 December 2022]

- ‘Orbital Eccentricity’ (2022) in Wikipedia [online] available from <https://en.wikipedia.org/w/index.php?title=Orbital_eccentricity&oldid=1097430348> [20 December 2022]

- Naidu, S.P., Margot, J.-L., Busch, M.W., Taylor, P.A., Nolan, M.C., Brozovic, M., Benner, L.A.M., Giorgini, J.D., and Magri, C. (2013) ‘Radar Imaging and Physical Characterization of Near-Earth Asteroid (162421) 2000 ET70’. Icarus [online] 226 (1), 323–335. available from <https://www.sciencedirect.com/science/article/pii/S001910351300225X> [19 December 2022]

- Selliez, L., Briois, C., Carrasco, N., Thirkell, L., Gaubicher, B., Lebreton, J.-P., and Colin, F. (2019) ‘Analytical Performances of the LAb-CosmOrbitrap Mass Spectrometer for Astrobiology’. Planetary and Space Science [online] 225, 105607. available from <https://www.sciencedirect.com/science/article/pii/S0032063322001933> [19 December 2022]

- Center for NEO Studies (n.d.) available from <https://cneos.jpl.nasa.gov/> [21 December 2022]

- Definitions & Assumptions - NEO (n.d.) available from <https://neo.ssa.esa.int/definitions-assumptions> [21 December 2022]
