Epics Springer
1,3,4
Dept of IT, VR Siddhartha Engineering College, Vijayawada
2
Associate Professor, Dept of IT, VR Siddhartha Engineering College, Vijayawada.
[email protected], [email protected],
[email protected], [email protected]
Abstract:
Genetic diseases pose intricate challenges: their complex nature calls for
specialized techniques for identification and classification. Unlike hereditary
diseases, these conditions demand advanced methods such as random forest
classification, artificial neural networks, and support vector machines to extract
information from intricate genetic data. The program's goal is to enhance
individuals' awareness of their genetic health, facilitating participation in
prevention and early intervention. Using Machine Learning algorithms supports the
overarching objective of averting adverse outcomes linked to genetic diseases by
prioritizing disease progression. The research contributes to healthcare by
providing a comprehensive framework for accurate classification, integrating newer
techniques such as the chain classifier approach, and improving diagnostic
accuracy. The complexity of genetic diseases arises from intricate genetic
interactions, which necessitates advanced diagnostic and classification methods,
including machine learning. This approach aims to systematically identify and
classify diverse genetic diseases, offering a data-driven solution to challenges in
genetics and contributing to global health.
Keywords: Genetic Disorders, Hereditary, Random Forest, Artificial Neural Network
(ANN), Support Vector Machine (SVM), Machine Learning, Chain Classifier Approach.
1 Introduction
Genetic disorders are caused by alterations in a person's genetic code (DNA) that
lead to changes in an organism's function. Genomic data provides vital health
indicators and is analyzed to understand genetic diseases. Genomics, a branch of
bioinformatics, focuses on the study of genomes and their abnormalities [1]. Genetic
disorders include single gene, mitochondrial, and multifactorial inheritance disorders,
each examined based on DNA structure. Single gene disorders result from mutations in
a single gene, while mitochondrial disorders stem from deletions or replacements in
mitochondrial DNA, the non-nuclear part of the DNA structure. Multifactorial disorders
are caused by a combination of environmental factors and mutations in multiple genes
[2]. Genes in DNA encode information for various proteins, and alterations can lead to
abnormal protein formation, affecting cellular function and causing genetic disorders.
Of the roughly 7.8 billion people on Earth, an estimated 350 million (nearly 4.5%)
live with a rare genetic disorder. At first glance this seems a small proportion, but
95 percent of these diseases are incurable [3]. 85% of the people who learned of their
condition were diagnosed late, and the reason is the lack of awareness about genetic
disorders. Genetic disorders fall into three classes: mitochondrial, single gene, and
multifactorial [4]. In the hope of helping people, we have created a machine learning
model that predicts the class of genetic disorder that a particular person's data
falls into [5].
2 Literature Survey
Collectively, the surveyed papers highlight the transformative role that machine
learning can play in genetic disease prediction and genomic medicine. They point to
possible applications in various medical fields such as cancer, dementia and diabetes,
and they call for collaboration and data sharing to accelerate progress in genomic
medicine and to address the challenge of identifying disease mutations. The unique
challenges and opportunities of rare genetic disease research in India are explored
through the Genomics for Understanding Rare Diseases: India Alliance Network
(GUaRDIAN), demonstrating the contribution of genomics to the understanding of rare
diseases [6]. A study of the application of machine learning in genetics and genomics
highlights the opportunities to revolutionize biomedical research [7]. One paper
proposes a new chain classification method for predicting disease and genetic
subtypes, demonstrating improved accuracy compared to traditional methods. Another
paper demonstrates the effectiveness of machine learning in identifying disease genes
using the functional similarity of genes assessed through Gene Ontology (GO)
annotations. Together, these papers demonstrate the growing importance of
incorporating machine learning into genomics to provide insight into the genetic basis
of disease and pave the way for personalized medicine and better patient outcomes
[5;8].
Fig. 1 shows the flow of the work. The process starts with data collection. The data
is then visualized, which helps in finding hidden patterns. In the feature engineering
stage, the data is normalized and then balanced to avoid biased results [9]. The
resulting data is split into two parts: 20% for testing and the remaining 80% for
training. The training part is used to train the three algorithms (SVM, Random Forest,
ANN) that form the chain classifier model, and the testing part is used to evaluate
the model.
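The splitting and balancing stages of this flow can be illustrated with a minimal, standard-library sketch. It is not the exact procedure used in this work: the helper names `stratified_split` and `oversample` are hypothetical, random oversampling is only one possible balancing method, and the sketch assumes a variant in which the data is split first and only the training portion is balanced, so that duplicated rows cannot appear in the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class gives ~test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

def oversample(indices, labels, seed=0):
    """Randomly duplicate minority-class indices until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(balanced)
    return balanced

# Toy, imbalanced labels for three classes (illustrative values, not the study data)
labels = [0] * 60 + [1] * 10 + [2] * 30
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced_train = oversample(train_idx, labels)

print("test rows:", len(test_idx), "train rows:", len(train_idx))
print("class counts after balancing:", Counter(labels[i] for i in balanced_train))
```

Splitting per class keeps the 20% test portion representative of all three classes even when one class is rare.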
network encompassed 65,190 data points and 47 columns, with Subclass Disorder as
the target variable.
3.2.3 Exploratory Data Analysis:
A correlation matrix is a statistical tool used in exploratory data analysis to
quantify the degree and direction of relationships between numerical variables in a
dataset. It consists of correlation coefficients that range from -1 to 1, indicating
negative to positive associations, with 0 implying no linear correlation. Visualizing
the matrix through techniques like heat maps provides a quick overview, aiding in the
identification of significant correlations and potential patterns among variables.
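As a small illustration of how such a matrix is built, the sketch below computes Pearson correlation coefficients with the standard library only. The column names and values are made up for the example, and the choice of the Pearson coefficient is an assumption, since the text does not name the coefficient used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns (not taken from the study dataset)
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # rises with "a"
    "c": [4.0, 3.0, 2.0, 1.0],   # falls as "a" rises
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

A heat map is then simply this matrix drawn with one colored cell per pair of variables.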
A density plot can effectively illustrate the distribution of data points for three distinct
genetic disorders – mitochondrial, single-gene, and multifactorial. In this plot, the x-
axis typically denotes a relevant variable such as age of onset or severity, while the y-
axis represents the probability density [12]. The resulting curves showcase the
concentration and distribution patterns of data points for each disorder, with peaks
indicating areas of higher density. This visualization facilitates a comparative analysis,
enabling a clear understanding of the unique characteristics and prevalence of each
genetic disorder within the dataset.
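Density curves of this kind are commonly obtained by kernel density estimation. The sketch below is one minimal way to compute such a curve per class, assuming a Gaussian kernel with a fixed bandwidth; the sample values, the bandwidth, and the function name `gaussian_kde` are illustrative assumptions, not details taken from this work.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density of `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical white blood cell counts (thousand per microliter) for two classes
groups = {
    0: [5.1, 6.4, 7.2, 7.9, 8.8, 9.6],
    1: [4.2, 5.0, 7.5, 9.9, 10.8, 11.5],
}
bandwidth = 0.8  # assumed smoothing width; larger values give smoother curves
curves = {label: gaussian_kde(values, bandwidth) for label, values in groups.items()}

grid = [3.0 + 0.5 * i for i in range(19)]  # x-axis positions from 3.0 to 12.0
for label, curve in curves.items():
    peak = max(grid, key=curve)
    print(f"class {label}: density is highest near x = {peak}")
```

Plotting each curve over the same x-axis gives the overlaid per-class view described above.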
Figure 4: Density Plot for White Blood Cell Count (thousand per microliter)
Table 1: Density Plot Results for Blood Cell Count (mcL) grouped by ‘Genetic
Disorder’
Table 2: Density Plot Results for White Blood Cell Count (1000 per microliter) grouped
by ‘Genetic Disorder’
Figure 5 shows box plots of "Blood cell count (mcL)" and "White Blood cell count
(thousand per microliter)" against the target variable. Box plots depict the
distribution and central tendency of numerical data and help identify potential
outliers. Each plot shows the interquartile range, the median, and potential outliers
beyond the whiskers, providing insight into the spread and skewness of blood cell
counts for each class of the target variable [11:13]. This visualization aids in
discerning patterns and variations within the dataset associated with the two blood
cell attributes. Figure 5 conveys these key statistics and marks outliers, which helps
to compare distributions across groups.
Figure 5: Box Plot for Blood Cell Count and White Blood Cell Count
Table 3: Box Plot Results for Blood Cell Count (mcL) grouped by ‘Genetic Disorder’
Genetic Disorder  Count  Mean   Std    Min    25%    50%    75%    Max
0                 12348  4.89   0.200  4.092  4.763  4.900  5.033  5.592
1                 2071   4.89   0.199  4.248  4.762  4.892  5.032  5.515
2                 7664   4.89   0.198  4.215  4.763  4.899  5.034  5.609
Table 4: Box Plot Results for White Blood Cell Count (1000 per microliter) grouped by
‘Genetic Disorder’
Genetic Disorder  Count  Mean   Std    Min  25%    50%    75%    Max
0                 12348  7.479  2.514  3.0  5.666  7.477  9.279  12.0
1                 2071   7.522  2.644  3.0  5.529  7.477  9.516  12.0
2                 7664   7.484  2.498  3.0  5.666  7.477  9.238  12.0
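The quartiles in Table 3 are enough to work out where box plot whiskers would end under the common 1.5 × IQR rule. The sketch below applies that rule to the Table 3 values; the rule itself is an assumption here, since the text does not state which whisker convention the plots use, and `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Min, 25%, 75% and Max per class, copied from Table 3 (Blood Cell Count, mcL)
table3 = {
    0: {"min": 4.092, "q1": 4.763, "q3": 5.033, "max": 5.592},
    1: {"min": 4.248, "q1": 4.762, "q3": 5.032, "max": 5.515},
    2: {"min": 4.215, "q1": 4.763, "q3": 5.034, "max": 5.609},
}

outlier_flags = {}
for group, stats in table3.items():
    low, high = tukey_fences(stats["q1"], stats["q3"])
    # A minimum below the lower fence (or maximum above the upper fence)
    # means at least one point would be drawn as an outlier on that side.
    outlier_flags[group] = (stats["min"] < low, stats["max"] > high)
    print(f"class {group}: fences = ({low:.3f}, {high:.3f}), "
          f"low outliers: {outlier_flags[group][0]}, high outliers: {outlier_flags[group][1]}")
```

For class 0 the interquartile range is 5.033 - 4.763 = 0.270, so the fences fall at 4.358 and 5.438; the recorded minimum and maximum lie outside both, which is consistent with outliers appearing on each side of the box.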
3.2.7 Algorithm
Assign labels to the genetic dataset, specifying the targeted sections or classes.
Step 3: Data Pre-processing
Perform data pre-processing, including tasks like handling missing values and
normalizing features, to enhance data quality.
Step 4: Data Splitting
Divide the genetic dataset into training and testing sets for model evaluation.
Step 5: Model Training and Validation
Train the chain classification model using algorithms such as SVM, Random Forest,
and ANN, implementing a chain classification approach.
Step 6: Chain Classification Implementation
Execute the chain classification approach by utilizing output predictions from one
model as input features for the subsequent model.
Step 7: Model Validation
Validate the model's performance using the testing dataset.
Step 8: Accuracy Evaluation
Assess the accuracy measures to gauge the effectiveness of the chain classification
model.
Step 9: Fine-Tuning and Optimization
Fine-tune model parameters and optimize algorithmic choices based on accuracy
measures.
Step 10: Results Analysis and Interpretation.
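The chaining idea in Step 6 can be sketched in a few lines: each model is fitted on the original features plus the predictions of all earlier models. The example below is a simplified, standard-library illustration, not the implementation used in this work. `NearestCentroid` is a tiny stand-in classifier defined here only so the example runs; in practice the three slots would hold SVM, Random Forest, and ANN models. Appending a numeric class label as a feature is also a simplification, and `fit_chain` and `predict_chain` are hypothetical names.

```python
class NearestCentroid:
    """Stand-in classifier: predicts the class whose mean feature vector is closest."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}
        return self

    def predict(self, X):
        def dist(row, centre):
            return sum((a - b) ** 2 for a, b in zip(row, centre))
        return [min(self.centroids, key=lambda c: dist(row, self.centroids[c])) for row in X]


def fit_chain(models, X, y):
    """Fit models in order; each later model also sees earlier predictions as features."""
    features = [list(row) for row in X]
    for model in models:
        model.fit(features, y)
        preds = model.predict(features)
        features = [row + [float(p)] for row, p in zip(features, preds)]
    return models


def predict_chain(models, X):
    """Pass rows through the fitted chain and return the last model's predictions."""
    features = [list(row) for row in X]
    preds = []
    for model in models:
        preds = model.predict(features)
        features = [row + [float(p)] for row, p in zip(features, preds)]
    return preds


# Toy data with three well-separated classes labelled 0, 1, 2 (illustrative only)
X_train = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1], [10.0, 0.0]]
y_train = [0, 0, 1, 1, 2, 2]
X_test = [[0.1, 0.2], [5.1, 5.0], [9.9, 0.2]]

models = fit_chain([NearestCentroid(), NearestCentroid(), NearestCentroid()], X_train, y_train)
print(predict_chain(models, X_test))
```

One design point worth noting: fitting later models on predictions made for the same rows they were trained on can make the chain overconfident, so predictions produced on held-out folds are often used for the appended features instead.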
4 Experimental Investigations
4.1 Dataset
The Genetic Variant Classification dataset on Kaggle is a collection of over 65,000
genetic variants and their corresponding clinical classifications, created by the
ClinVar consortium. It includes information on the variant's location on the genome,
type (e.g., single nucleotide polymorphism, deletion), and clinical significance
(e.g., benign, pathogenic). The dataset is used by researchers and clinicians to study
the genetics of human disease, identify new disease-causing variants, develop
diagnostic tests, and inform genetic counseling. The dataset contains 65,188 records
and is formatted in CSV. It includes features such as chromosome, position on the
chromosome, reference allele, alternate allele, clinical classification, variant type
[7], significance sign, sequence ontology term, gene symbol, depth, allele frequency,
genotype, filter status, and additional information about the variant.
The Genomes and Genetics database on Bio-Portal is a vast collection of ontologies and
resources related to genomics and genetics. It has 31897 rows and 45 columns and
offers a central location for researchers and clinicians to access and utilize this
information to advance their understanding of the genetic basis of human disease. The
database contains over 100 ontologies, including Gene Ontology (GO), Human Gene
Mutation Database (HGMD), Online Mendelian Inheritance in Man (OMIM), Clinical
Genome Interpretation (CGI) Ontology, and Disease Ontology (DO). The database
features a search and browse interface, annotations and links to external resources, and
a download option for users to use in their research and applications [9].
The accuracy curves indicate that both training and validation accuracy quickly rise and
then stabilize around the 10th epoch. The similar trends and stability of both curves
suggest the model generalizes well and has converged without overfitting.
5 Conclusion
The system we developed aims to identify genetic diseases by carefully testing three
different machine learning methods: Support Vector Machine, Random Forest, and
Artificial Neural Networks. Each of these methods has its own strengths, so we
compare them to find out which one works best for each specific disease. After selecting
the best method for each disease, we present the results with easy-to-understand graphs,
making it clear how each method performs. By using these three algorithms together,
we significantly improve the accuracy of our predictions, helping us to correctly
identify diseases more often. We also use special techniques called balancing methods
to make sure that our test scores are fair and accurate for each disease, even if some
diseases are less common than others. The graphs we provide give a clear picture of
how well each algorithm performs, showing differences in accuracy, speed, and other
important factors. This makes it easier to see which method works best for different
situations, and helps us choose the most reliable approach for diagnosing genetic
diseases.
References
1. Ali Raza, Furqan Rustam, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto
Lee and Imran Ashraf : Predicting Genetic Disorder and Types of Disorder Using
chain classifier approach.
2. Taher M. Ghazal, Hussam Al Hamadi, Muhammad Umar Nasir, Atta-ur-Rahman,
Mohammed Gollapalli, Muhammad Zubair, Muhammad Adnan Khan:
Supervised Machine Learning Empowered Multifactorial Genetic Inheritance
Disorder Prediction.
3. Dhavendra Kumar: Disorders of the genome architecture: a review
4. Muhammad Asif, Hugo F. M. C. M. Martiniano, Astrid M. Vicente, Jianjun Hu:
Identifying disease genes using machine learning and gene functional similarities,
assessed through Gene Ontology.
5. Vinod Scaria: Genomics of rare genetic diseases—experiences from India:
https://humgenomics.biomedcentral.com/articles/10.1186/s40246-019-0215-5
6. Gholson J Lyon & Kai Wang: Identifying disease mutations in genomic medicine
settings: https://genomemedicine.biomedcentral.com/articles/10.1186/gm359
7. E. Lee, Kaela S. Singleton, Melissa Wallin, Victor Faundez: Rare Genetic
Diseases: Nature's Experiments on Human Development.
8. Kotze, M., Lückhoff, H., Peeters, A., Baatjes, K., Schoeman, M., Merwe, L., Grant,
K., Fisher, L., Merwe, N., Pretorius, J., Velden, D., Myburgh, E., Pienaar, F.,
Rensburg, S., Yako, Y., September, A., Moremi, K., Cronjé, F., Tiffin, N., Bouwens,
C., Bezuidenhout, J., Apffelstaedt, J., Hough, F., Erasmus, R., & Schneider, J.
(2015). Genomic medicine and risk prediction across the disease spectrum. Critical
Reviews in Clinical Laboratory Sciences, 52, 120 - 137.
https://doi.org/10.3109/10408363.2014.997930.
9. Khoury, M., Yang, Q., Gwinn, M., Little, J., & Flanders, W. (2004). An
epidemiologic assessment of genomic profiling for measuring susceptibility to
common diseases and targeting interventions. Genetics in Medicine, 6, 38-47.
https://doi.org/10.1097/01.GIM.0000105751.71430.79.
10. Weitzel, J., Blazer, K., Macdonald, D., Culver, J., & Offit, K. (2011). Genetics,
genomics, and cancer risk assessment. CA: A Cancer Journal for Clinicians, 61.
https://doi.org/10.3322/caac.20128.
11. Cichon, S., Craddock, N., Daly, M., Faraone, S., Gejman, P., Kelsoe, J., Lehner, T.,
Levinson, D., Moran, A., Sklar, P., & Sullivan, P. (2009). Genomewide association
studies: history, rationale, and prospects for psychiatric disorders. The American
Journal of Psychiatry, 166(5), 540-556.
https://doi.org/10.1176/appi.ajp.2008.08091354.