ML and Biochemistry
Chemico-Biological Interactions
Keywords: Reactive oxygen species; Machine learning; Oxidative damage; Signal analysis

Artificial intelligence (AI) and machine learning models are today frequently used for classification and prediction of various biochemical processes and phenomena. In recent years, numerous research efforts have been focused on developing such models for assessment, categorization, and prediction of oxidative stress. Supervised machine learning can successfully automate the process of evaluation and quantification of oxidative damage in biological samples, as well as extract useful data from the abundance of experimental results. In this concise review, we cover the possible applications of neural networks, decision trees and regression analysis as three common strategies in machine learning. We also review recent works on the various weaknesses and limitations of artificial intelligence in biochemistry and related scientific areas. Finally, we discuss future innovative approaches through which AI can contribute to the automation of oxidative stress measurement and diagnosis of diseases associated with oxidative damage.
University of Belgrade, Faculty of Medicine, Institute of Medical Physiology, Visegradska 26/II, RS, 11129, Belgrade, Serbia.
E-mail addresses: [email protected], [email protected] (I. Pantic).
Received 31 January 2022; Received in revised form 4 March 2022; Accepted 9 March 2022
Available online 13 March 2022
0009-2797/© 2022 Elsevier B.V. All rights reserved.
I. Pantic et al. Chemico-Biological Interactions 358 (2022) 109888
models of machine learning that are able to recognize and classify cell and tissue damage, as well as to predict various biochemical processes and physiological mechanisms. Most of these models have not yet become part of contemporary research and clinical practice, and it is assumed that this will take a long time to achieve. However, a large number of authors believe that artificial intelligence has a bright future when it comes to certain methodological approaches in toxicology [7,8].

Recently, many new and innovative machine learning models have been developed and implemented for prediction and classification of oxidative stress in cells and tissues. Oxidative stress usually refers to the generation of reactive oxygen species (ROS) such as superoxide, peroxides, hydroxyl radical or singlet oxygen [9–11]. Changes in oxidative status in tissues and cell cultures have been reported as a possible result of exposure to toxic chemical agents. Also, oxidative stress may be a contributing factor to numerous diseases and conditions in internal medicine, neurology and oncology. For example, some neurodegenerative disorders such as Parkinson’s and Alzheimer’s disease may in certain conditions be associated with the generation of reactive oxygen species [12]. Certain substances with antioxidant properties are thought to be beneficial for human health and may contribute to the prevention of a large number of chronic noncommunicable diseases. The process of physiological aging is also sometimes associated with cellular and DNA damage related to oxidative stress [13].

This concise narrative review focuses on the development and use of contemporary AI and machine learning models for detection, prediction, and classification of oxidative stress. We cover the applications of neural networks, decision trees and regression analysis as three common strategies in machine learning. We also discuss future innovative approaches through which AI can contribute to the automation of oxidative stress measurement and diagnosis of diseases associated with oxidative damage.

2. Biomarkers of oxidative stress that can be used as input data for machine learning

Today, there are many methods for assessment of oxidative stress in biological samples. Direct measurement of reactive oxygen species is possible, although relatively rare when compared to indirect approaches. Superoxide anion radical can, for example, be directly determined using electron paramagnetic resonance (EPR) spectroscopy and so-called “spin trapping”. Spin trapping with 5,5-dimethyl-1-pyrroline-N-oxide (DMPO) is a common way to monitor superoxide levels in a cell population in vitro, and sometimes it can additionally be used for ex vivo assessments [14]. Apart from superoxide, this nitrone spin trap approach can be used for quantification of hydroxyl radical. An alternative way of spin trapping for determination of superoxide and hydroxyl radical levels is the use of 5-(diethoxyphosphoryl)-5-methyl-1-pyrroline-N-oxide (DEPMPO) for in vitro measurements. Another common way to directly measure superoxide is fluorescence analysis with hydrocyanines. Less frequently, studies use spin probes based on cyclic hydroxylamines or other hydroxylamines (for superoxide anion), or strategies involving PBN (α-phenyl-N-t-butylnitrone) or POBN [α-(4-pyridyl-1-oxide)-N-tert-butylnitrone] (superoxide anion measurements, in vivo experiments). Also, in some cases, acetylated cytochrome c reduction or fluorescence analysis of hydroethidine (dihydroethidium) might be used. Intracellular hydrogen peroxide can be directly quantified with the HyPer fluorescent sensing system, while extracellularly, the Amplex Red assay is successfully used today [14–16].

Free radicals in biological samples are sometimes short-lived, so their direct measurement can be difficult and impractical. Therefore, a number of indirect methods exist and are frequently used in biomedical research. These include detection of various biomarkers such as products of lipid peroxidation, oxidative damage to DNA and proteins, and others. Lipid peroxidation is associated with the oxidative damage of cell membranes and is relatively easy to determine. One of the most important products of this process is malondialdehyde (MDA), which can be detected with the thiobarbituric acid reactive substances (TBARS) assay. Even relatively small ROS generation in pathological conditions may in some circumstances result in high levels of MDA. Malondialdehyde is also a relatively sensitive indicator of ROS concentrations in tissue homogenates such as liver. Another, less common way to evaluate the consequences of lipid peroxidation is to measure 4-hydroxynonenal (4-HNE or HNE) by using high-performance liquid chromatography and 2,4-dinitrophenylhydrazine and 1,3-cyclohexanedione probes. Finally, prostaglandin-like compounds such as isoprostanes can be determined to achieve the same effect [14,17].

Of all DNA damage markers of oxidative stress, 8-hydroxydeoxyguanosine (8-OHdG) is probably the most frequently used in biological research. This is essentially a DNA base modification that occurs when guanine is exposed to reactive oxygen species (mainly hydroxyl radical). Today various assays exist for 8-OHdG determination in biological samples, including ones based on ELISA kits. DNA damage that takes place due to exposure to various toxic environmental factors is sometimes assessed using this methodological approach [18].

The third set of techniques for indirect measurement of oxidative stress refers to the quantification of protein oxidation and nitration. These techniques rely on the fact that ROS and ROS-associated compounds can induce structural modifications in proteins such as the formation of protein carbonyl groups, advanced oxidation protein products (AOPP) or advanced glycation end products [19]. Protein carbonyl content is probably the most commonly measured biomarker, and there is a variety of commercially available assays and kits designed for this purpose. These include methods based on Western blotting, ELISA or relatively simple spectrophotometric analysis [19].

Finally, it should be noted that oxidative stress can also be indirectly assessed by quantifying compounds that contribute to antioxidant defense [14,20–22]. These may include enzymes such as superoxide dismutase, catalase or glutathione peroxidase. Non-enzymatic antioxidant defenses consisting of glutathione, lipoic acid, transferrin, and vitamins C and E may also be evaluated. For example, a common methodological approach for establishing oxidative status in tissue homogenates would be to measure MDA as an indicator of lipid peroxidation in combination with glutathione, superoxide dismutase and catalase.

The above-mentioned biomarkers are just a few in a wide spectrum of oxidative stress indicators that can today be determined in cells, tissues or other biological samples. Most, if not all, of these indicators can be used as inputs for training and testing contemporary machine learning models for prediction or classification of oxidative damage (Table 2). In certain experimental conditions, a single marker may be sensitive enough to distinguish between damaged and intact specimens. Frequently, however, this is not the case, and a variety of parameters need to be measured. Artificial intelligence and machine learning strategies may offer a fast and affordable route to increasing the sensitivity and specificity of these parameters.

3. Artificial neural networks and deep learning

An artificial neural network (ANN) is basically a composition of nodes connected in a way that may partially resemble the biological organization of brain synapses. The nodes are referred to as artificial neurons and are often clustered in layers. The first layer of neurons is the input layer and the last is the output layer [23–25]. A neural network may have a large number of so-called hidden layers between the input and the output layer. The neurons and their connections (synapses) are associated with so-called “weights” which influence the probability of signal transmission. The weights are modified during the learning process, in which the network is exposed to a series of examples of inputs and (the correct) outputs. Generally, the objective of learning here is to minimize the observed errors by weight adjustment. During the network training, a cost function is defined in order to evaluate the reduction of the error
grouped into two main categories that differ on whether the outcome is a class or a real number: classification trees and regression trees. In biomedicine, contemporary statistical and data mining programs are usually able to perform a variety of tree analyses ranging from the classical classification and regression tree (CART) to Chi-square automatic interaction detection (CHAID) and QUEST (Quick, Unbiased, Efficient Statistical Tree). Random decision forests are composed of decision trees that are constructed during the learning process. In molecular biology, random forests are commonly used for classification of biological samples or other data types, with the class selected by most trees taken as the output. Random forests often have higher accuracy and discriminatory power in comparison to individual decision trees, although this is not always the case [8,34].

One of the possible applications of random forests in label-free discrimination and quantitative analysis of cytotoxicity due to oxidative stress is covered in the recent work of Zhang et al. (2020). Here the authors combined work on cell cultures (A549 cells) treated with toxic diesel exhaust particles, Raman spectroscopy imaging and a number of different machine learning methods (apart from random forests, support vector machine, k nearest neighbors, linear discriminant analysis and many others). The models were used for classification and evaluation of the effects of the toxic particles and antioxidants (resveratrol and mesobiliverdin IXα) on cell behavior [35]. Random forests proved to have relatively good classification accuracy, although their performance was not as good as that of some other models such as k nearest neighbors.

It may be possible that random forest models that use data on genes associated with oxidative stress can be utilized to design a diagnostic strategy for some of the most common diseases and disorders. This includes a diagnostic model of acute myocardial infarction as described by Yifan et al. (2021), where the authors used data representing the expression of different ferroptosis-related genes. Ferroptosis is a type of iron-dependent programmed cell death in which lipid peroxides accumulate as the result of the dysfunction of antioxidant defenses. In this study, the authors were able to create a random forest (RF) model based on genes in circulating endothelial cells with strong diagnostic performance for infarction and areas under the ROC curve higher than 85% in the validation data set [36].

Random forest can be integrated with support vector machines to achieve good prediction results, at least those related to post-translational modification of proteins during oxidative stress. Such integration was achieved by Hasan et al. (2019), who used this strategy to predict cysteine S-nitrosylation, which is often related to antioxidative defense and redox-based cell signaling. The value of this study lies in the fact that experimental identification of this process demands significant material and other resources, while the machine learning methodology is inexpensive, fast and efficient [37].

Another important recent study on the application of the random forest model in oxidative stress research is the one by Ho Thanh Lam and associates (2020). Here the authors use a benchmark set of sequencing data to develop various models in order to identify antioxidant proteins based on their highest performance [38]. Random forest is presented as the model with high accuracy and a relatively good balance between specificity and sensitivity during the identification of proteins. These data indicate that in certain conditions RF may be superior in comparison with more complex deep learning approaches.

Random forests as well as decision trees have been shown to be capable of predicting the activity, as well as classifying, various drug molecules associated with some oxidative stress signaling pathways such as the Nrf2-antioxidant response element pathway [39]. One such study used activity information on a total of 10 486 molecules and compared ordinary decision trees, random forests, AdaBoost, a linear model and neural networks. When used for binary classification by oversampling, random forest had the largest area under the ROC curve (86%), and similar results were obtained for binary classification by undersampling.

5. Linear, logistic and other regression approaches

Linear and logistic regression machine learning models are probably among the simplest supervised ML approaches that can be effectively used for prediction and classification of biological phenomena. Linear regression essentially tries to predict a dependent variable (e.g. a parameter of ROS generation) based on the values of an independent variable (e.g. some other biochemical parameter), with the presumption of a linear relationship between the variables. Logistic regression is commonly used for binary classification problems, for example to predict whether a cell is damaged or intact using the available biochemical or other data. For the creation of both linear and logistic regression ML models, training data with known inputs and outcomes need to be presented to the machine, and later the model is tested for classification accuracy, discriminatory power or another performance indicator.

One of the recent examples of the application of a linear regression ML model in oxidative stress research is the work by Shemshaki and associates [40]. Here the authors conducted a clinical study on infertile male patients and quantified various biochemical parameters in semen, including reactive oxygen species. A linear regression model was used to predict ROS from citric acid, fructose, BMI (body mass index), BMR (basal metabolic rate), sperm motility and sperm morphology. Linear regression in some circumstances showed relatively good performance, similar to support vector machine, and even better in comparison to artificial neural networks (although worse than random forests). The machine learning approach in this study helped reveal the potential connection between BMI and ROS generation [40].

In 2015, Lavender and associates used a logistic regression approach in combination with multifactor dimensionality reduction to evaluate the impact of oxidative stress response related genetic variants, antioxidants and prooxidants on prostate cancer risk and aggressiveness. The study included a relatively large sample of 2286 subjects. Using this sophisticated statistical analysis, numerous gene-gene and gene-environment interactions were investigated, which made it possible to identify the factors associated with oxidative stress that contribute most to prostate carcinogenesis [41]. Another, more recent example of using a binomial logistic regression model would be the study done by Zhang et al. (2020), where the model was able to distinguish the active chemicals inducing oxidative stress from inactive compounds. Data from a total of 638 active and 3632 inactive chemicals were used to develop this ML strategy, which resulted in a prediction accuracy of almost 70% and a satisfactory area under the receiver operating characteristic curve [42].

A specific machine learning approach based on the least absolute shrinkage and selection operator (LASSO)/elastic net regression algorithm was developed by Kim et al. (2021) and used for quantitative determination of oxidative stress risks in healthy human subjects. In this work, measurement of malondialdehyde (an indicator of lipid peroxidation) was used for evaluation of oxidative stress, although many other biochemical parameters were also used for the creation of the ML model. The model was applied for classification and differentiation between individuals with pathological oxidative status and healthy controls. The authors reported an outstanding discriminatory power of the model with an area under the ROC curve of 0.949 and excellent sensitivity and specificity for selected confidence intervals [43].

Simple regression models such as the ones relying on linear and binomial logistic regression are often underestimated and avoided by some data scientists who prefer more complex neural network designs. In some circumstances this is a mistake, since regression models can be equally sensitive or even more sensitive in the prediction of biological phenomena. For example, in a recent study done by our laboratory, we compared binomial logistic regression, decision trees and multilayer perceptrons for classification of damaged and intact cells after ethanol-induced toxicity [44]. We found that all three models have similar classification accuracy (tree-based learning algorithm 80.6%, multilayer perceptron 83.3% and binomial logistic regression algorithm 83.2%).
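
As an illustration of how such a comparison can be set up, the following minimal Python sketch trains a binomial logistic regression and a depth-1 decision tree on the same training set and scores both on a held-out test set. It is not the analysis from [44]: the data are synthetic, the two features (an MDA-like and a glutathione-like value) and their distributions are assumptions made only for this example, all function names are hypothetical, and the multilayer perceptron is omitted for brevity.

```python
import math
import random

def make_synthetic_cells(n, rng):
    """Hypothetical data: two made-up biomarker values per cell, label 1 = damaged."""
    data = []
    for _ in range(n):
        damaged = rng.random() < 0.5
        # Assumption: damaged cells have higher MDA-like and lower GSH-like values.
        mda = rng.gauss(2.0 if damaged else 1.0, 0.5)
        gsh = rng.gauss(1.0 if damaged else 2.0, 0.5)
        data.append(((mda, gsh), 1 if damaged else 0))
    return data

def sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(train, epochs=200, lr=0.1):
    """Binomial logistic regression fitted by plain stochastic gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in train:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # gradient of the log-loss with respect to the linear score
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b

def predict_logistic(model, x):
    w, b = model
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0

def fit_stump(train):
    """Depth-1 decision tree: the single feature/threshold split with fewest training errors."""
    best = None  # (errors, feature index, threshold, label assigned above threshold)
    for f in range(2):
        for x, _ in train:
            t = x[f]
            for label_above in (0, 1):
                errors = sum(
                    1 for xi, yi in train
                    if (label_above if xi[f] > t else 1 - label_above) != yi
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, label_above)
    return best[1:]

def predict_stump(model, x):
    f, t, label_above = model
    return label_above if x[f] > t else 1 - label_above

def accuracy(predict, model, data):
    return sum(1 for x, y in data if predict(model, x) == y) / len(data)

rng = random.Random(0)  # fixed seed so the run is reproducible
train = make_synthetic_cells(200, rng)
test = make_synthetic_cells(200, rng)

logit_model = fit_logistic(train)
stump_model = fit_stump(train)
logit_acc = accuracy(predict_logistic, logit_model, test)
stump_acc = accuracy(predict_stump, stump_model, test)
print(f"logistic regression accuracy: {logit_acc:.3f}")
print(f"decision stump accuracy:      {stump_acc:.3f}")
```

Both models see exactly the same training and test samples, so the difference in accuracy reflects the models and not the data split. With real measurements, the synthetic generator would be replaced by the experimental feature table, and a cross-validated comparison would be preferable to a single split.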
Fig. 1. The proposed use of an MLP model for prediction of oxidative damage based on contemporary image analysis methods such as textural GLCM (gray level co-occurrence matrix) and wavelet analysis. For details on the methods, the reader is referred to the recent publication (Davidovic et al., 2022).
