Epics Springer

Genomic Risk Assessment and Early Intervention for Rare Genetic Disorders


K. Vaibhav1, Dr G. Kalyani*2, N. Ram Tarun3, S. Sashank Sai4

1,3,4 Dept of IT, VR Siddhartha Engineering College, Vijayawada
2 Associate Professor, Dept of IT, VR Siddhartha Engineering College, Vijayawada
[email protected], [email protected],

[email protected], [email protected]

Abstract:
Genetic diseases pose intricate challenges because of their complex nature, and
they call for specialized techniques for identification and classification.
Unlike hereditary diseases, these conditions demand advanced methods such as
random forest classification, artificial neural networks, and support vector
machines to extract information from intricate genetic data. The goal of this
work is to raise individuals' awareness of their genetic health and so
facilitate their participation in prevention and early intervention. Using
machine learning algorithms serves the overarching objective of averting
adverse outcomes linked to genetic diseases by giving priority to disease
progression. The research contributes to healthcare by providing a
comprehensive framework for accurate classification that integrates newer
techniques such as the chain classifier approach and improves diagnostic
accuracy. The complexity of genetic diseases arises from intricate genetic
interactions, which is why advanced diagnostic and classification methods,
including machine learning, are needed. The approach aims to systematically
identify and classify diverse genetic diseases, offering a data-driven solution
to challenges in genetics and contributing to global health.
Keywords: Genetic Disorders, Hereditary, Random Forest, Artificial Neural Network
(ANN), Support Vector Machine (SVM), Machine Learning, Chain Classifier Approach.

1 Introduction

Genetic disorders are caused by alterations in a person's genetic code (DNA) that
lead to changes in how the organism functions. Genomic data provides vital health
indicators and is analyzed to understand genetic diseases. Genomics, a branch of
bioinformatics, focuses on the study of genomes and their abnormalities [1]. Genetic
disorders include single-gene, mitochondrial, and multifactorial inheritance disorders,
each examined on the basis of DNA structure. Single-gene disorders result from mutations
in a single gene, while mitochondrial disorders stem from deletions or replacements in
mitochondrial DNA, which lies outside the cell nucleus. Multifactorial disorders are
caused by a combination of environmental factors and mutations in multiple genes [2].
Genes in DNA encode information for various proteins, and alterations can lead to
abnormal protein formation, which affects cellular function and causes genetic disorders.
Of the 7.8 billion people on Earth, an estimated 350 million (nearly 4.5%) are living
with a rare genetic disorder. At first glance this seems a small proportion, but 95
percent of these diseases are incurable [3]. About 85% of the people who learned of
their condition were diagnosed late, and the reason is the lack of public awareness
about genetic disorders. Genetic disorders fall into three classes: mitochondrial,
single-gene, and multifactorial [4]. To help address this problem, we built a machine
learning model that predicts the class of genetic disorder into which a particular
person's data falls [5].

2 Literature Survey
Collectively, the surveyed papers highlight the transformative role that machine
learning can play in genetic disease prediction and genomic medicine. They point to
possible applications in medical fields such as cancer, dementia, and diabetes, and
they call for collaboration and data sharing to accelerate progress in genomic medicine
and to address the challenge of identifying disease mutations. The unique challenges
and opportunities of rare genetic disease research in India are explored through the
Genomics for Understanding Rare Diseases: India Alliance Network (GUaRDIAN), which
demonstrates the contribution of genomics to the understanding of rare diseases [6].
A study of the application of machine learning in genetics and genomics highlights the
opportunities to advance biomedical research [7]. One paper proposes a new chain
classification method for predicting disease and genetic subtypes and reports improved
accuracy compared to traditional methods. Another paper demonstrates the effectiveness
of machine learning in identifying disease genes using the functional similarity of
genes, assessed through Gene Ontology (GO) annotations. Together, these papers
demonstrate the growing importance of incorporating machine learning into genomics to
provide insight into the genetic basis of disease and to pave the way for personalized
medicine and better patient outcomes [5;8].

3.1 Proposed Architecture

Figure 1: Architecture of the proposed work


Fig. 1 shows the flow of the work. The process starts with data collection. The data
is then visualized, which helps in finding hidden patterns. In the feature engineering
stage the data is normalized and then balanced in order to avoid biased results [9].
The resulting data is split into two parts: 20% is held out for testing, and the
remaining 80% is used to train the three algorithms (SVM, Random Forest, ANN) that
form the chain classifier model. The held-out part is used to test the model.
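
The 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the toy data, and the choice to stratify by class are assumptions made here so that both parts keep similar class proportions.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, test_fraction=0.2, seed=42):
    """Split rows into train/test parts, keeping class proportions similar.

    Stratification is an assumption of this sketch; the paper only states
    that 20% of the data is held out for testing.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(rows[i], labels[i]) for i in sorted(train_idx)]
    test = [(rows[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Hypothetical toy data: 10 samples of class 0 and 5 samples of class 1.
toy_rows = [[float(i)] for i in range(15)]
toy_labels = [0] * 10 + [1] * 5
train_part, test_part = stratified_split(toy_rows, toy_labels)
```

With the toy data, the test part receives two samples of class 0 and one of class 1, and the remaining twelve samples go to training.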

3.2 Proposed Methodology


3.2.1 Data Preprocessing
Data preprocessing is a vital phase in both data analysis and machine learning
workflows, because it refines raw data for effective use. Data cleaning addresses
missing values, removes duplicates, and rectifies errors. Data transformation
covers normalizing the numerical features, standardizing the data, and encoding the
categorical variables [10]. Techniques such as dimensionality reduction and feature
selection contribute to data reduction. Handling outliers and addressing class
imbalance help keep model performance robust. Further, the division of data
into training, validation, and test sets, along with considerations for temporal
aspects in time series data, forms an integral part of a comprehensive preprocessing
approach [9:11]. The process is iterative, which allows it to adapt to the particular
characteristics of the data and ultimately improves the data's quality and suitability
for specific analytical or modeling tasks.
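
Two of the transformations named above, normalizing numerical features and encoding categorical variables, can be illustrated with a short sketch. Min-max scaling and one-hot encoding are assumed here as representative choices; the paper does not state which variants were used, and the function names are hypothetical.

```python
def min_max_scale(values):
    """Rescale a numeric column to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors.

    Returns the encoded rows and the category order used for the columns.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical toy columns.
scaled_age = min_max_scale([2, 4, 6])
encoded_gender, gender_categories = one_hot(["F", "M", "F"])
```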
3.2.2 Data Wrangling and Cleaning
In the dataset considered, numerous columns were deemed irrelevant for meaningful
analysis, including patient name, family name, patient id, location of institute,
institute name, father's name, and place of birth, because they did not contribute to
insights related to patient health or hereditary factors. To address missing values,
we filled numeric gaps with the column median and applied a backfill method to the
categorical variables to ensure data continuity. The Genetic Disorder and Subclass
Disorder columns, which are the target variables, also had missing values; to preserve
data integrity, we excluded rows with NA values in these columns. The resulting dataset
for the Genetic Disorder neural network comprised 65,190 data points and 46 columns,
with Genetic Disorder as the target variable. Similarly, the dataset for the Subclass
Disorder neural network comprised 65,190 data points and 47 columns, with Subclass
Disorder as the target variable.
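
The cleaning rules described above (median fill for numeric gaps, backfill for categorical gaps, and removal of rows with a missing target) can be sketched in plain Python. The column names and the order of operations (dropping rows first, then imputing) are assumptions of this sketch, and `None` stands in for a missing value.

```python
from statistics import median


def clean_records(records, numeric_cols, categorical_cols, target_cols):
    """Drop rows with a missing target, then impute the remaining gaps."""
    # Keep only rows where every target column is present.
    kept = [dict(r) for r in records
            if all(r.get(t) is not None for t in target_cols)]

    # Numeric gaps: fill with the median of the observed values.
    for col in numeric_cols:
        observed = [r[col] for r in kept if r.get(col) is not None]
        if not observed:
            continue
        fill = median(observed)
        for r in kept:
            if r.get(col) is None:
                r[col] = fill

    # Categorical gaps: backfill, i.e. copy the next observed value upward.
    # Gaps after the last observed value stay missing.
    for col in categorical_cols:
        next_value = None
        for r in reversed(kept):
            if r.get(col) is None:
                r[col] = next_value
            else:
                next_value = r[col]
    return kept


# Hypothetical toy records; the column names are illustrative only.
toy_records = [
    {"age": 4, "gender": None, "disorder": "Mitochondrial"},
    {"age": None, "gender": "F", "disorder": "Single-gene"},
    {"age": 10, "gender": "M", "disorder": None},
    {"age": 6, "gender": "M", "disorder": "Multifactorial"},
]
cleaned = clean_records(toy_records, ["age"], ["gender"], ["disorder"])
```

In the toy example the third record is dropped because its target is missing, the missing age is replaced by the median of 4 and 6, and the missing gender is taken from the following record.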
3.2.3 Exploratory Data Analysis
A correlation matrix is a statistical tool used in exploratory data analysis to
quantify the degree and direction of relationships between the numerical variables
in a dataset. It consists of correlation coefficients that range from -1 to 1,
indicating negative to positive association, with 0 implying no linear correlation.
Visualizing the matrix as a heat map gives a quick overview and helps identify
significant correlations and potential patterns among variables.
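
As a small illustration of how the entries of such a matrix are obtained, the sketch below computes Pearson correlation coefficients for a few toy columns. The column names and values are hypothetical, and the Pearson coefficient is assumed here because it is the usual default; the paper does not name the coefficient used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric columns.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy columns: y rises with x, z falls as x rises.
toy_columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_columns)
```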

Figure 2: Correlation Matrix

3.2.4 Density Plot and Table of Results

A density plot illustrates the distribution of data points for the three genetic
disorder classes: mitochondrial, single-gene, and multifactorial. In such a plot, the
x-axis denotes a relevant variable (here, blood cell count and white blood cell count,
as in Figures 3 and 4), while the y-axis represents the probability density [12]. The
resulting curves show how the data points for each disorder are concentrated and
spread, with peaks indicating regions of higher density. This visualization supports a
comparative analysis of the characteristics and prevalence of each genetic disorder
within the dataset.
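
The curve in a density plot is typically a kernel density estimate. The following sketch shows one way such a curve could be computed, assuming a Gaussian kernel and a normal-reference rule-of-thumb bandwidth; plotting libraries may use different defaults, and the sample values are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import stdev


def gaussian_kde(samples, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5).
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical blood cell counts for one disorder class.
toy_counts = [4.7, 4.8, 4.9, 5.0, 5.1]
density = gaussian_kde(toy_counts)
```

One such curve would be estimated per disorder class and the curves drawn on the same axes for comparison.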

Figure 3: Density Plot for Blood Cell Count (mcL)

Figure 4: Density Plot for White Blood Cell Count (thousand per microliter)

Table 1: Density Plot Results for Blood Cell Count (mcL) grouped by ‘Genetic Disorder’
Genetic Disorder   Count   Mean   Std     Min     25%     50%     75%     Max
0                  12348   4.89   0.200   4.092   4.763   4.900   5.033   5.592
1                   2071   4.89   0.199   4.248   4.762   4.892   5.032   5.515
2                   7664   4.89   0.198   4.215   4.763   4.899   5.034   5.609

Table 2: Density Plot Results for White Blood Cell Count (thousand per microliter)
grouped by ‘Genetic Disorder’
Genetic Disorder   Count   Mean   Std     Min   25%     50%     75%     Max
0                  12348   7.47   2.514   3.0   5.666   7.477   9.279   12.0
1                   2071   7.52   2.644   3.0   5.529   7.477   9.516   12.0
2                   7664   7.48   2.498   3.0   5.666   7.477   9.238   12.0

3.2.5 Box Plot

Figure 5 shows box plots of "Blood cell count (mcL)" and "White Blood cell count
(thousand per microliter)" against the target variable. Box plots depict the
distribution and central tendency of numerical data and help identify potential
outliers. Each plot shows the interquartile range, the median, and any points that
fall beyond the whiskers, which gives insight into the spread and skewness of the
blood cell counts for each class of the target variable [11:13]. The visualization
helps in discerning patterns and variations in the dataset associated with these
blood cell attributes, and it allows the distributions to be compared across groups.
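
The quantities a box plot displays can be computed directly. The sketch below assumes the common convention of whiskers at 1.5 times the interquartile range and linear-interpolation quartiles; the function name and toy values are hypothetical.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Quartiles, IQR, whisker fences, and outliers for one numeric column."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "outliers": outliers,
    }


# Hypothetical white blood cell counts with one extreme value.
toy_values = [3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
toy_stats = box_stats(toy_values)
```

For the toy values the quartiles are 5.5, 7.0, and 8.5, so the fences sit at 1.0 and 13.0 and the value 30.0 is flagged as an outlier.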

Figure 5: Box Plot for Blood Cell Count and White Blood Cell Count

Table 3: Box Plot Results for Blood Cell Count (mcL) grouped by ‘Genetic Disorder’
Genetic Disorder   Count   Mean   Std     Min     25%     50%     75%     Max
0                  12348   4.89   0.200   4.092   4.763   4.900   5.033   5.592
1                   2071   4.89   0.199   4.248   4.762   4.892   5.032   5.515
2                   7664   4.89   0.198   4.215   4.763   4.899   5.034   5.609

Table 4: Box Plot Results for White Blood Cell Count (thousand per microliter) grouped
by ‘Genetic Disorder’
Genetic Disorder   Count   Mean    Std     Min   25%     50%     75%     Max
0                  12348   7.479   2.514   3.0   5.666   7.477   9.279   12.0
1                   2071   7.522   2.644   3.0   5.529   7.477   9.516   12.0
2                   7664   7.484   2.498   3.0   5.666   7.477   9.238   12.0

3.2.6 Count Plot


We use Seaborn to create count plots for the categorical data in the dataset. The
first count plot displays the distribution of the target variable by counting the
occurrences of each category along the x-axis. The second count plot compares the
target variable with patient age, using the hue parameter to distinguish the
categories of the target variable. These count plots visually represent the frequency
of each category, giving an overall view of the target variable's distribution and a
more detailed view of its relationship with patient age.
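
The numbers behind the two count plots are simple frequency tables. The sketch below computes them without any plotting library; the labels and ages are hypothetical, and grouping by (age, category) pairs mirrors what a hue split shows.

```python
from collections import Counter


def category_counts(labels):
    """Frequency of each target category (the bars of the first count plot)."""
    return Counter(labels)


def grouped_counts(ages, labels):
    """Frequency of each (age, category) pair (the hue-split bars)."""
    return Counter(zip(ages, labels))


# Hypothetical toy data.
toy_labels = ["Mitochondrial", "Single-gene", "Mitochondrial",
              "Multifactorial", "Mitochondrial"]
toy_ages = [2, 2, 5, 5, 2]

overall = category_counts(toy_labels)
by_age = grouped_counts(toy_ages, toy_labels)
```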

Figure 6: Count Plot

3.2.7 Algorithm

For the classification strategy, Support Vector Machines (SVM), Random Forest, and
Artificial Neural Networks (ANN) were deliberately selected to cater to distinct
dataset characteristics. SVM is suited to scenarios with clear class margins, Random
Forest's robustness and interpretability contribute to accuracy and comprehension,
and ANN excels at discerning intricate patterns in datasets with non-linearities. The
algorithmic choices were aimed at balancing the strengths of each model. Furthermore,
a sequential learning approach was introduced that integrates SVM, Random Forest, and
ANN: the output predictions of one model are used as supplementary input features for
the next model, so that each algorithm's individual strengths are passed along the
chain. This ensemble method is particularly advantageous where relationships are
intricate and patterns are diverse, and it improves the precision and resilience of
the overall classification system [14:15]. The steps involved are as follows.

Step 1: Data Collection
Collect genetic data to serve as input for the chain classification model.
Step 2: Labeling Genetic Dataset
Assign labels to the genetic dataset, specifying the targeted sections or classes.
Step 3: Data Pre-processing
Perform data pre-processing, including tasks like handling missing values and
normalizing features, to enhance data quality.
Step 4: Data Splitting
Divide the genetic dataset into training and testing sets for model evaluation.
Step 5: Model Training and Validation
Train the chain classification model using algorithms such as SVM, Random Forest,
and ANN, implementing a chain classification approach.
Step 6: Chain Classification Implementation
Execute the chain classification approach by utilizing output predictions from one
model as input features for the subsequent model.
Step 7: Model Validation
Validate the model's performance using the testing dataset.
Step 8: Accuracy Evaluation
Assess the accuracy measures to gauge the effectiveness of the chain classification
model.
Step 9: Fine-Tuning and Optimization
Fine-tune model parameters and optimize algorithmic choices based on accuracy
measures.
Step 10: Results Analysis and Interpretation.
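
The chaining idea in Steps 5 and 6 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `ChainClassifier` is a hypothetical name, class labels are assumed to be numeric, and a tiny nearest-centroid classifier stands in for SVM, Random Forest, and ANN so that the example runs without external libraries. Any model exposing `fit` and `predict` could be placed in the chain instead.

```python
class NearestCentroid:
    """Tiny stand-in classifier (NOT one of the paper's models)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {c: [s / counts[c] for s in acc]
                           for c, acc in sums.items()}
        return self

    def predict(self, X):
        def nearest(row):
            return min(
                self.centroids_,
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(row, self.centroids_[c])),
            )
        return [nearest(row) for row in X]


class ChainClassifier:
    """Feed each model's predictions to the next model as an extra feature."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        features = [list(row) for row in X]
        for model in self.models:
            model.fit(features, y)
            preds = model.predict(features)
            # Append this model's output as a new input column.
            features = [row + [float(p)] for row, p in zip(features, preds)]
        return self

    def predict(self, X):
        features = [list(row) for row in X]
        preds = []
        for model in self.models:
            preds = model.predict(features)
            features = [row + [float(p)] for row, p in zip(features, preds)]
        return preds  # predictions of the last model in the chain


# Hypothetical toy data with two well-separated classes.
toy_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
chain = ChainClassifier([NearestCentroid(), NearestCentroid(), NearestCentroid()])
chain.fit(toy_X, toy_y)
chain_predictions = chain.predict([[0.05, 0.05], [1.0, 1.0]])
```

One design point worth noting: this sketch trains each later model on in-sample predictions of the earlier ones, which is the simplest form of chaining; using out-of-fold predictions instead is a common way to reduce the risk of overfitting.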

4 Experimental Investigations
4.1 Dataset
The Genetic Variant Classification dataset on Kaggle is a collection of over 650,000
genetic variants and their corresponding clinical classifications, created by the
ClinVar consortium. It includes information on each variant's location on the genome,
its type (e.g., single nucleotide polymorphism, deletion), and its clinical
significance (e.g., benign, pathogenic). The dataset is used by researchers and
clinicians to study the genetics of human disease, identify new disease-causing
variants, develop diagnostic tests, and inform genetic counseling. The dataset has
65188 records and is formatted as CSV. Its features include chromosome, position on
the chromosome, reference allele, alternate allele, clinical classification, variant
type [7], significance sign, sequence ontology term, gene symbol, depth, allele
frequency, genotype, filter status, and additional information about the variant.
The Genomes and Genetics database on Bio-Portal is a large collection of ontologies
and resources related to genomics and genetics. It has 31897 rows and 45 columns and
offers a central location where researchers and clinicians can access this information
to advance their understanding of the genetic basis of human disease. The database
contains over 100 ontologies, including Gene Ontology (GO), Human Gene Mutation
Database (HGMD), Online Mendelian Inheritance in Man (OMIM), Clinical Genome
Interpretation (CGI) Ontology, and Disease Ontology (DO). It features a search and
browse interface, annotations and links to external resources, and a download option
so that users can apply the data in their own research and applications [9].

4.2 Discussion on Results


● Table Overview: The table shows precision, recall, F1 score, and support for
Class 1 and Class 2.
● Class 1: With a precision of 0.92, recall of 0.93, and F1 score of 0.92, the
model is highly effective at identifying Class 1 instances.
● Class 2: The precision is 0.89, recall is 0.90, and F1 score is 0.89, indicating
reliable performance for Class 2.
● Overall Accuracy: The model achieves an overall accuracy of 0.91,
demonstrating its effectiveness in identifying genetic disease risks.
● Support: The dataset contains 2470 samples for Class 1 and 1533 for Class 2,
showing a class imbalance.
● Average Metrics: The average precision, recall, and F1 score are 0.91,
reflecting strong and consistent model performance.
● Implications: The results suggest the model is accurate and resilient, suitable
for identifying genetic disease risks across different data sets.
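
The per-class metrics listed above follow from counts of true positives, false positives, and false negatives. The sketch below shows the standard definitions; the function name and the toy labels are hypothetical and are not the paper's data.

```python
def classification_report(y_true, y_pred):
    """Per-class precision, recall, F1, and support, plus overall accuracy."""
    report = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples of this class
        }
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return report, accuracy


# Hypothetical toy labels for two classes.
toy_true = [1, 1, 1, 2, 2]
toy_pred = [1, 1, 2, 2, 1]
toy_report, toy_accuracy = classification_report(toy_true, toy_pred)
```

As a consistency check on the reported figures, the support-weighted mean of the two class precisions is (0.92 × 2470 + 0.89 × 1533) / 4003 ≈ 0.91, which agrees with the average precision stated above.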

The accuracy curves indicate that both training and validation accuracy quickly rise and
then stabilize around the 10th epoch. The similar trends and stability of both curves
suggest the model generalizes well and has converged without overfitting.

Figure 7: Loss and Accuracy curves of ANN model



Table 5: Classification Report of Chain Classifier Based Model

Algorithm                    Accuracy   Precision   Recall   F1-Score   Support
Support Vector Machine       0.90       0.91        0.89     0.90       1553
Random Forest                0.92       0.91        0.89     0.90       1533
Artificial Neural Networks   0.93       0.92        0.89     0.88       1531
Chain Classifier Model       0.94       0.91        0.90     0.90       1551

5 Conclusion
The system we developed aims to identify genetic diseases by carefully testing three
different machine learning methods: Support Vector Machine, Random Forest, and
Artificial Neural Networks. Each of these methods has its own strengths, so we
compare them to find out which one works best for each specific disease. After selecting
the best method for each disease, we present the results with easy-to-understand graphs,
making it clear how each method performs. By using these three algorithms together,
we significantly improve the accuracy of our predictions, helping us to correctly
identify diseases more often. We also use special techniques called balancing methods
to make sure that our test scores are fair and accurate for each disease, even if some
diseases are less common than others. The graphs we provide give a clear picture of
how well each algorithm performs, showing differences in accuracy, speed, and other
important factors. This makes it easier to see which method works best for different
situations, and helps us choose the most reliable approach for diagnosing genetic
diseases.

References
1. Ali Raza, Furqan Rustam, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto
Lee and Imran Ashraf: Predicting Genetic Disorder and Types of Disorder Using
chain classifier approach.
2. Taher M. Ghazal, Hussam Al Hamadi, Muhammad Umar Nasir, Atta-ur-Rahman,
Mohammed Gollapalli, Muhammad Zubair, Muhammad Adnan Khan:
Supervised Machine Learning Empowered Multifactorial Genetic Inheritance
Disorder Prediction.
3. Dhavendra Kumar: Disorders of the genome architecture: a review.
4. Muhammad Asif, Hugo F. M. C. M. Martiniano, Astrid M. Vicente, Jianjun Hu:
Identifying disease genes using machine learning and gene functional similarities,
assessed through Gene Ontology.
5. Vinod Scaria: Genomics of rare genetic diseases—experiences from India:
https://humgenomics.biomedcentral.com/articles/10.1186/s40246-019-0215-5
6. Gholson J Lyon & Kai Wang: Identifying disease mutations in genomic medicine
settings: https://genomemedicine.biomedcentral.com/articles/10.1186/gm359
7. E. Lee, Kaela S. Singleton, Melissa Wallin, Victor Faundez: Rare Genetic
Diseases: Nature's Experiments on Human Development.

8. Kotze, M., Lückhoff, H., Peeters, A., Baatjes, K., Schoeman, M., Merwe, L., Grant,
K., Fisher, L., Merwe, N., Pretorius, J., Velden, D., Myburgh, E., Pienaar, F.,
Rensburg, S., Yako, Y., September, A., Moremi, K., Cronjé, F., Tiffin, N., Bouwens,
C., Bezuidenhout, J., Apffelstaedt, J., Hough, F., Erasmus, R., & Schneider, J.
(2015). Genomic medicine and risk prediction across the disease spectrum. Critical
Reviews in Clinical Laboratory Sciences, 52, 120 - 137.
https://doi.org/10.3109/10408363.2014.997930.

9. Khoury, M., Yang, Q., Gwinn, M., Little, J., & Flanders, W. (2004). An
epidemiologic assessment of genomic profiling for measuring susceptibility to
common diseases and targeting interventions. Genetics in Medicine, 6, 38-47.
https://doi.org/10.1097/01.GIM.0000105751.71430.79.

10. Weitzel, J., Blazer, K., Macdonald, D., Culver, J., & Offit, K. (2011). Genetics,
genomics, and cancer risk assessment. CA: A Cancer Journal for Clinicians, 61.
https://doi.org/10.3322/caac.20128.

11. Cichon, S., Craddock, N., Daly, M., Faraone, S., Gejman, P., Kelsoe, J., Lehner, T.,
Levinson, D., Moran, A., Sklar, P., & Sullivan, P. (2009). Genomewide association
studies: history, rationale, and prospects for psychiatric disorders. The American
Journal of Psychiatry, 166(5), 540-556.
https://doi.org/10.1176/appi.ajp.2008.08091354.
