
Genomics Proteomics Bioinformatics 20 (2022) 850–866

Genomics Proteomics Bioinformatics


www.elsevier.com/locate/gpb
www.sciencedirect.com

REVIEW

Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis
Yawei Li 1, Xin Wu 2, Ping Yang 3, Guoqian Jiang 4, Yuan Luo 1,*

1 Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
2 Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
3 Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
4 Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA

Received 4 March 2022; revised 3 October 2022; accepted 17 November 2022


Available online 1 December 2022

Handled by Feng Gao

KEYWORDS Omics dataset; Imaging dataset; Feature extraction; Prediction; Immunotherapy

Abstract The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.

* Corresponding author.
E-mail: [email protected] (Luo Y).
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
https://doi.org/10.1016/j.gpb.2022.11.003
1672-0229 © 2022 The Authors. Published by Elsevier B.V. and Science Press on behalf of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Introduction

Lung cancer is one of the most frequently diagnosed cancers and the leading cause of cancer deaths worldwide. About 2.20 million new patients are diagnosed with lung cancer each year [1], and 75% of them die within five years of diagnosis [2]. High intra-tumor heterogeneity (ITH) and complexity of cancer cells giving rise to drug resistance make cancer treatment more challenging [3]. Over the past decades, the continuous evolution of technologies in cancer research has contributed to many large collaborative cancer projects, which have generated numerous clinical, medical imaging, and sequencing databases [4–6]. These databases facilitate researchers in investigating comprehensive patterns of lung cancer from diagnosis, treatment, and responses to clinical outcomes [7]. In particular, current studies on -omics analysis, such as genomics, transcriptomics, proteomics, and metabolomics, have expanded our tools and capabilities for research. Cancer studies are undergoing a shift toward the integration of multiple data types and mega sizes. However, using diverse and high-dimensional data types for clinical tasks requires significant time and expertise even with assistance from dimension reduction methods such as matrix and tensor factorizations [8–11], and analyzing the exponentially growing cancer-associated
Li Y et al / Machine Learning Applications in Lung Cancer 851

databases poses a major challenge to researchers. Therefore, using machine learning (ML) models to automatically learn the internal characteristics of different data types to assist physicians’ decision-making has become increasingly important.

ML is a subgroup of artificial intelligence (AI) that focuses on making predictions by identifying patterns in data using mathematical algorithms [12]. It has served as an assisting tool in cancer phenotyping and therapy for decades [13–19], and has been widely implemented in advanced approaches for early detection, cancer type classification, signature extraction, tumor microenvironment (TME) deconvolution, prognosis prediction, and drug response evaluation [20–27]. Herein, we present an overview of the main ML algorithms that have been used to integrate complex biomedical data (e.g., imaging or sequencing data) for different aspects of lung cancer (Figure 1; Tables S1 and S2), and outline major challenges and opportunities for future applications of ML in lung cancer clinical research and practice. We hope that this review promotes a better understanding of the roles and potentialities of ML in this field.

Apply ML for early detection and auxiliary diagnosis of lung cancer

ML on early detection and diagnosis using medical imaging datasets

Early diagnosis is an important procedure for reducing deaths related to lung cancer. Chest screening using low-dose computed tomography (CT) is the primary approach for the surveillance of people with increased lung cancer risk. To promote diagnostic efficiency, the computer-aided diagnosis (CAD) system was developed to assist physicians in the interpretation of medical imaging data [28,29], which has been demonstrated as a useful second opinion for physicians [30]. The traditional feature-based CAD task can be broken into three steps: nodule segmentation, feature extraction and selection, and clinical judgment inference (classification) (Figure 2). Some approaches apply the measured texture features of specified nodules in CT images combined with the patient’s clinical variables as input features to train an ML classifier, including logistic regression (LR) [31–33] or linear discriminant analysis (LDA) [34], for malignancy risk estimation. Typically, these measurements include nodule size, nodule type, nodule location, nodule count, nodule boundary, and emphysema information in CT images, and the clinical variables include the patient’s age, gender, specimen collection timing, family history of lung cancer, smoking exposure, and more. However, these features are mostly subjective and arbitrarily defined, and usually fail to achieve a complete and quantitative description of malignant nodule appearances.

With the development of deep learning (DL) algorithms, especially convolutional neural networks (CNNs), more studies have been conducted to apply DL-based models in the CAD system to improve its accuracy and reduce its false positive rate and execution time during lung tumor detection (Table 1) [35,36]. Similar to feature-based CAD systems, the workflow of these models usually consists of three steps: nodule detection and segmentation, nodule feature extraction, and clinical judgment inference [37]. Compared with traditional feature-based CAD systems, the DL-based CAD system can automatically retrieve and extract intrinsic features of a suspicious nodule [38,39], and can model the 3D shape of a nodule (Figure 2). For example, Ciompi et al. [40] designed a model based on OverFeat [41,42] by extracting three 2D-view-feature vectors (axial, coronal, and sagittal) of the nodule from CT scans. The recently integrated CNN models facilitate a global and comprehensive inspection of nodules for feature characterization from CT images. Buty et al. [37] designed a complementary CNN model, where a spherical harmonic model [43] for nodule segmentation was used to obtain the shape descriptions (“shape” feature) of the segmented nodule and a deep convolutional neural network (DCNN)-based model [41] to extract the texture and intensity features (“appearance” feature) of the nodule. The downstream classification relied on the combination of “shape” and “appearance” features. Similarly, Venkadesh et al. [44] used an ensemble model from two different models, 2D-ResNet50-based [45] and 3D-Inception-V1 [46], to respectively extract two features of a pulmonary nodule, and then concatenated the two features as the input features for classification. An advantage of the ensemble CNN model is that it can accurately identify malignant nodules from different sizes of nodules using the raw CT images. Using the features extracted by state-of-the-art CNN models, clinical judgment inference can be implemented through common ML techniques, including LR, random forest (RF), support vector machine (SVM), and neural networks (NNs). Notably, some studies also employed CNN models for final clinical judgment inference. Ardila et al. [47] proposed an end-to-end approach to systematically model both localization and lung cancer risk categorization tasks using the input CT data alone. Their approach was based on a combination of three CNN models: a Mask-RCNN [48] model for lung tissue segmentation, a modified RetinaNet [49] model for cancer region of interest (ROI) detection, and a full-volume model based on 3D-inflated Inception-V1 [50,51] for malignancy risk prediction.

In addition to CT images, CNN-based models are also widely used in histological imaging to help with lung cancer diagnosis. Compared with CT imaging, histological imaging can provide more biological information about cancer at the cellular level. To this end, AbdulJabbar et al. [52] used the Micro-Net [53] model to identify tissue boundaries followed by an SC-CNN [54] model to segment individual cells from hematoxylin and eosin (H&E)-stained and immunohistochemistry (IHC) images. The segmented cells were then used for cell type classification to evaluate the proportions of each cell type in the images. This model helps to identify the differential evolution and immune evasion mechanisms between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) with high resolution. Another study [55] utilized the Inception-V3 network [51] to classify whether the tissue was LUAD, LUSC, or normal from H&E-stained histopathology whole-slide images. A highlight of this study is that the model can also predict whether a given tissue has somatic mutations in several lung cancer driver genes, including STK11, EGFR, FAT1, SETBP1, KRAS, and TP53. Note that considering the high complexity and large resources of the datasets, some studies utilized transfer learning to improve their efficiency and robustness when training new models [38,55].

Though these ML algorithms are already widely used in CAD, the challenge is that only a limited number of the images

Figure 2 Feature-based CAD and DL-based CAD systems


Differences in the development process of feature-based CAD systems and CNN-based CAD systems. Compared with feature-based CAD
systems, the DL-based CAD systems can automatically retrieve and extract intrinsic features of a suspicious nodule. CNN, convolutional
neural network; LR, logistic regression; SVM, support vector machine; RF, random forest.

are labeled. Training a complex CNN model using a limited number of training sets may result in overfitting. Recently, generative adversarial network (GAN)-based models have been used to improve the performance of discriminative classifiers by generating pseudo images [56]. Chuquicusma et al. [57] first employed a deep convolutional GAN (DCGAN) [58] model to generate synthetic lung nodule CT scans. With their work, more recent studies have integrated the GAN models with other CNN models to address the overfitting problem in lung cancer classification. Lin et al. [59] used a two-step model: a DCGAN to generate synthetic lung cancer images and an AlexNet [41] for lung cancer classification using both original and synthetic datasets. Similar work was also done by Ren and colleagues [60]. They also used DCGAN [58] for data augmentation. To improve performance, they then designed a regularization-enhanced transfer learning model called VGG-DF for data discrimination to prevent overfitting problems with pre-trained model auto-selection.

ML on early detection and diagnosis using -omics sequencing datasets

Although periodic medical imaging tests are recommended for high-risk populations, implementation has been complicated by a high false discovery rate [61,62]. Therefore, there is a critical need for new techniques in early detection of lung cancers. Recent sequencing technologies enable diverse methods for early detection of lung cancer [63]. In the meantime, accurately classifying lung cancer subtypes is crucial in guiding optimal therapeutic decision-making. LUAD (~45%) and LUSC (~25%) are the two most common subtypes of lung cancer but are often treated similarly except for targeted therapy [64]. However, studies have indicated that LUAD and LUSC have drastically different biological signatures, and they have suggested that LUAD and LUSC should be classified and treated as different cancers [65,66]. From a computational perspective, both early detection and subtype identification are part of the classification task. Previous ML studies have shown the efficiency and advancement of early detection and cancer type classification in large pan-cancer sequencing datasets [67–75], which may provide evidence for lung cancer diagnosis. It is known that cancer cells are characterized by many genetic variations, and the accumulation of these genetic variations can be signatures that document the mutational patterns of different cancer types [3,5,76,77]. For this reason, recent studies have concentrated on extracting better genomic signatures as input features to boost the accuracy of their ML models. For early detection, blood-based liquid biopsy, including

Figure 1 Applications of ML model in lung cancer
We presented an overview of ML methodologies for different aspects of lung cancer therapies, including CAD from imaging datasets, lung
cancer early detection based on sequencing technologies, data integration and biomarker extraction from multi-omics datasets, treatment
response and prognosis prediction, and immunotherapy studies. ML, machine learning; IC50, half-maximal inhibitory concentration;
HLA, human leukocyte antigen; CT, computed tomography; MALDI, matrix-assisted laser desorption/ionization; DL, deep learning;
cfDNA, cell-free DNA; CAD, computer-aided diagnosis; CNV, copy number variation; RECIST, Response Evaluation Criteria in Solid
Tumors; TIL, tumor-infiltrating lymphocyte.
Table 1 Publications relevant to ML on early detection and diagnosis using imaging data

Publication | Feature extraction | Classification model | Sample size | Imaging data type | Performance | Validation method | Feature selection/input | Highlight/advantage | Shortcoming
McWilliams et al. [31] | NA | LR | 2961 | CT images | AUC (0.907–0.960) | Hold-out | Clinical risk factors + nodule characteristics on CT images | Using the extracted feature as input, the classifier can achieve high AUC in small nodules (< 10 mm) | The selection of nodule characteristics affects the predictive performance of the model
Riel et al. [32] | NA | LR | 300 | CT images | AUC (0.706–0.932) | Hold-out | Clinical factors + nodule characteristics on CT images | The classifier can perform equivalently as human observers for malignant and benign classification | The performance heavily relies on nodule size as the discriminator, and is not robust in small nodules
Kriegsmann et al. [34] | NA | LDA | 326 | MALDI | Accuracy (0.991) | Hold-out | Mass spectra from ROIs of MALDI image | The model maintains high accuracy on FFPE biopsies | The performance relies on the quality of the MALDI stratification
Buty et al. [37] | Spherical harmonics [44]; DCNN [41] | RF | 1018 | CT images | Accuracy (0.793–0.824) | 10-fold cross-validation | CT imaging patches + radiologists’ binary nodule segmentations | The model reaches higher predictive accuracy by integrating shape and appearance nodule imaging features | No benchmarking comparisons were used in the study
Hussein et al. [38] | 3D CNN-based multi-task model | 3D CNN-based multi-task model | 1018 | CT images | Accuracy (0.9126) | 10-fold cross-validation | 3D CT volume feature | The model achieves higher accuracy than other benchmarked models | The ground truth scores defined by radiologists for the benchmark might be arbitrary
Khosravan et al. [39] | 3D CNN-based multi-task model | 3D CNN-based multi-task model | 6960 | CT images | Segmentation DSC (0.91); classification accuracy (0.97) | 10-fold cross-validation | 3D CT volume feature | The model integration of clustering and sparsification algorithms helps to accurately extract potential attentional regions | Segmentation might fail if the ROIs are outside the lung regions
Ciompi et al. [40] | OverFeat [42] | SVM; RF | 1729 | CT images | AUC (0.868) | 10-fold cross-validation | 3D CT volume feature, nodule position coordinate, and maximum diameter | This is the first study attempting to classify whether the diagnosed nodule is benign or malignant | The model requires specifying the position and diameter of the nodule as input, but many nodules could not be located on the CT images
Venkadesh et al. [44] | 2D-ResNet50-based [45]; 3D-Inception-V1 [46] | An ensemble model based on two CNN models | 16,429 | CT images | AUC (0.86–0.96) | 10-fold cross-validation | 3D CT volume feature and nodule coordinates | The model achieves higher AUC than other benchmarked models | The model requires specifying the position of the nodule, but many nodules are unable to be located on the CT images
Ardila et al. [47] | Mask-RCNN [48]; RetinaNet [49]; 3D-inflated Inception-V1 [50,51] | Mask-RCNN [48]; RetinaNet [49]; 3D-inflated Inception-V1 [50,51] | 14,851 | CT images | AUC (0.944) | Hold-out | Patient’s current and prior (if available) 3D CT volume features | The model achieves higher AUC than radiologists when samples do not have prior CT images | The training cohort is from only one dataset, although the sample size is large
AbdulJabbar et al. [52] | Micro-Net [53]; SC-CNN [54] | An ensemble model based on SC-CNN [54] | 100 | Histological images | Accuracy (0.913) | Hold-out | Image features of H&E-stained tumor section histological slides | The model can annotate cell types at the single-cell level using histological images only | The annotation accuracy is affected by the used reference dataset
Coudray et al. [55] | Multi-task CNN model based on Inception-V3 [51] | Multi-task CNN model based on Inception-V3 network [51] | 1634 | Histological images | AUC (0.733–0.856) | Hold-out | Transformed 512 × 512-pixel tiles from nonoverlapping ‘patches’ of the whole-slide images | The model can predict whether a given tissue has somatic mutations in genes STK11, EGFR, FAT1, SETBP1, KRAS, and TP53 | The accuracy of the gene mutation prediction is not very high
Lin et al. [59] | DCGAN [58] + AlexNet [41] | DCGAN [58] + AlexNet [41] | 22,489 | CT images | Accuracy (0.9986) | Hold-out | Initial + synthetic CT images | The model uses GAN to generate synthetic lung cancer images to reduce overfitting | No benchmarking comparisons were used
Ren et al. [60] | DCGAN [58] + VGG-DF | DCGAN [58] + VGG-DF | 15,000 | Histopathological images | Accuracy (0.9984); F1-score (99.84%) | Hold-out | Initial + synthetic histopathological images | The model uses GAN to generate synthetic lung cancer images and a regularization-enhanced model to reduce overfitting | The dimension of images by generator (64 × 64) is not sufficient for biomedical domain

Note: ML, machine learning; NA, not applicable; LR, logistic regression; AUC, area under the curve; CT, computed tomography; LDA, linear discriminant analysis; MALDI, matrix-assisted laser
desorption/ionization; ROI, region of interest; FFPE, formalin-fixed paraffin-embedded; CNN, convolutional neural network; DSC, dice similarity coefficient; SVM, support vector machine; RF,
random forest; DCNN, deep convolutional neural network; SC-CNN, spatially constrained convolutional neural network; DCGAN, deep convolutional generative adversarial network; RCNN,
Region-CNN; H&E, hematoxylin and eosin; 2D, two dimensional; 3D, three dimensional. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between
possible splits in training, validation, and test data. However, cross-validation is more time consuming than using the simple hold-out method.
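
The trade-off described in the note above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic one-dimensional scores rather than imaging features, the midpoint-threshold classifier is a toy stand-in for the models in Table 1, and the function names are hypothetical. The code is pure Python so it stays dependency-free; in practice a library such as scikit-learn would normally handle the splitting.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D score per sample; class 1 is shifted upward (illustrative only)."""
    return [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit_threshold(train):
    """Toy classifier: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, test):
    """Fraction of test samples on the correct side of the threshold."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)

def holdout(data, rng, test_fraction=0.3):
    """One random split: fit once on 70% of the data, score on the other 30%."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return accuracy(fit_threshold(train), test)

def k_fold(data, rng, k=10):
    """Each sample is tested exactly once; the model is refit k times."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(fit_threshold(train), folds[i]))
    return statistics.mean(scores)

rng = random.Random(0)
data = make_data(200, rng)
# Repeat both procedures with different random splits of the same data.
holdout_scores = [holdout(data, rng) for _ in range(20)]
kfold_scores = [k_fold(data, rng) for _ in range(20)]
print("hold-out spread across splits:", statistics.pstdev(holdout_scores))
print("10-fold spread across splits: ", statistics.pstdev(kfold_scores))
```

Repeating each procedure over several random splits shows how much the estimate depends on the particular split, while the ten refits per estimate in `k_fold` reflect its extra computational cost.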

Figure 3 Omics analysis in lung cancer studies


Different sequencing techniques allow for the simultaneous measurement of multiple molecular features of a biological sample. To
improve efficiency and reduce overfitting, statistical and ML tools perform differential analysis or feature selection. Further ML models
concatenate the obtained omics features with clinical features as input for lung cancer diagnostic/prognostic prediction. DEG,
differentially expressed gene; RFE, recursive feature elimination; UAF, univariate association filtering.

cell-free DNA (cfDNA) fragments, circulating tumor DNA (ctDNA), microRNA (miRNA), methylation, exosomes, and circulating tumor cells (CTCs), to explore potential circulating tumor signatures is considered a reliable method [63] (Figure 3). Integrating these liquid biopsy signatures, many discriminative models (SVM, RF, and LR) have been used to detect tumors with high discovery rates [78–81]. For lung cancer subtype classification, somatic mutations, including single-nucleotide variants (SNVs), insertions, and deletions, usually have specific cancer type profiles [82]. Thus, studies have leveraged somatic mutations as input features to train classifiers for LUAD–LUSC classification [83]. Many of these mutations, especially driver mutations, can change expression levels, which impact gene function and interrupt cellular signaling processes [82]. As a result, different cancer types show different expression levels of certain proteins [84,85]. Imposed by these unique expression profiles of cancer type, ML models can leverage RNA sequencing as input data to categorize the malignancy (benign or malignant) and subtypes (LUAD or LUSC) of patients [86–89]. Similarly, copy number variation (CNV) is reported to be highly correlated with differential gene expression [90], and can be ubiquitously detected in cancer cells. As such, CNVs can also be used to train ML models for cancer type classification in lung cancer studies [81,91,92]. Note that Daemen et al. [92] proposed a recurrent hidden Markov model (HMM) for the identification of extended chromosomal regions of altered copy numbers, which offers high accuracy for classification. More recently, Jurmeister et al. [93] used DNA methylation profiles as input features to determine if the detected malignant nodule is primary lung cancer or the metastasis of another cancer. Directly using all generated genes as an input feature may result in overfitting [94]. Thus, many studies used different computational approaches to select multiple cancer-associated genes to enhance their ML models (Figure 3). Some studies used ML-based algorithms for feature selection. For example, Liang et al. [80] and Whitney et al. [86] employed the least absolute shrinkage and selection operator (LASSO) method to select the optimal markers for model training; Aliferis et al. [89] utilized recursive feature elimination (RFE) [95] and univariate association filtering (UAF) models to select highly cancer-associated genes. In addition, using unsupervised models for sample population subtype clustering, and then identifying each cluster’s marker genes is also seen in many studies [96,97]. Apart from ML-based models, some studies used statistical methods for feature selection. Raman et al. [81] designed a copy number profile abnormality (CPA) score to reinforce the CNV feature which is more robust and less subject to variable sample quality than directly using CNVs as the input feature. Daemen et al. [92] integrated several statistical tests (ordinary fold changes, ordinary t-statistics, SAM-statistics, and moderated t-statistics) to select a robust differential expression gene set. Aside from these single-measured signatures, some studies [81,86,88] combined the -omics signatures with clinical signatures to achieve better results. Using these tumor-type specific -omics signatures, many algorithms, K-nearest neighbors (KNN), naive Bayes (NB), SVM, decision tree (DT), LR, RF, LDA, gradient boosting, and NN, have demonstrated their ability to accurately detect and classify different lung cancer patterns (Table 2). Note that to improve the accuracy of ML models, Kobayashi et al. [83] added an element-wise input scaling for the NN model, which allows the model to maintain its accuracy with a small number of learnable parameters for optimization.
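
As a rough illustration of how a LASSO penalty performs feature selection, the sketch below fits a LASSO by cyclic coordinate descent with soft-thresholding. It is a minimal sketch under stated assumptions, not the procedure of any cited study: it uses a squared-error loss without an intercept (the cited works may use penalized logistic regression), the eight "genes" are synthetic Gaussian features of which only the first two influence the response, and the function names and the penalty `lam=0.3` are hypothetical choices.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything inside [-lam, lam] becomes exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(v * (r + w[j] * v) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                w[j] = new_w
    return w

# Synthetic data (assumption): 8 candidate features, only features 0 and 1 matter.
rng = random.Random(0)
n_samples, n_genes = 200, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0.0, 0.5) for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, coef in enumerate(weights) if coef != 0.0]
print("selected feature indices:", selected)
print("coefficients:", [round(coef, 3) for coef in weights])
```

Because the soft-threshold step sets weakly associated coefficients to exactly zero, the surviving indices act as the selected feature set, and the retained coefficients are shrunk relative to their unpenalized values.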
Table 2 Publications relevant to ML on early detection and diagnosis using sequencing data

Publication | ML method | Sample size | Sequencing data type | Performance | Validation method | Feature selection | Highlight/advantage | Shortcoming
Mathios et al. [78] | LR model with a LASSO penalty | 799 | cfDNA fragment | AUC (0.98) | 10-fold cross-validation | cfDNA fragment features, clinical risk factors, and CT imaging features | This study provides a framework for combining cfDNA fragmentation profiles with other markers for lung cancer detection | DNA variations in late-stage disease may affect cfDNA detection
Lung-CLiP [79] | 5-nearest neighbor; 3-nearest neighbor; NB; LR; DT | 160 | cfDNA | AUC (0.69–0.98) | Leave-one-out cross-validation | SNV + CNV features | This study establishes an ML framework for the early detection of lung cancers using cfDNA | Sampling bias exists (most are smokers) in the training dataset
Liang et al. [80] | LR | 296 | ctDNA | AUC (0.816) | 10-fold cross-validation | Nine DNA methylation markers | This study establishes an ML framework for the early detection of lung cancers using DNA methylation markers | The selected features are comprised of only nine methylation biomarkers, which poses a limitation on assay performance
Raman et al. [81] | RF; SVM; LR with ridge, elastic net; LASSO regularization | 843 | cfDNA | mAUC (0.896–0.936) | Leave-one-out cross-validation | Copy number profiling of cfDNA | The model provides a framework for using copy number profiling of cfDNA as a biomarker in lung cancer detection | Feature selection methods can be used to reduce overfitting and may have the potential to achieve higher AUC
Kobayashi et al. [83] | Diet Networks with EIS | 954 | Somatic mutation | Accuracy (0.8) | 5-fold cross-validation | SNVs, insertions, and deletions across 1796 genes | The EIS helps to stabilize the training process of Diet Networks | The interpretable hidden interpretations obtained from EIS may vary between different datasets
Whitney et al. [86] | LR | 299 | RNA-seq of BECs | AUC (0.81) | 10-fold cross-validation | Lung cancer-associated and clinical covariate RNA markers | The model keeps sensitivity for small and peripheral suspected lesions | The selected genes vary greatly under different feature selection processes and parameters
Podolsky et al. [87] | KNN; NB normal distribution of attributes; NB distribution through histograms; SVM; C4.5 DT | 529 | RNA-seq | AUC (0.91) | Hold-out | RNA-seq | This study systematically compares different models of lung cancer subtype classification across different datasets | Feature selection methods can be used to reduce overfitting
Choi et al. [88] | An ensemble model based on elastic net LR; SVM; hierarchical LR | 2285 | RNA-seq of bronchial brushing samples | AUC (0.74) | 5-fold cross-validation | RNA-seq of 1232 genes with clinical covariates | The model integrates RNA-seq features and clinical information to improve the accuracy of risk prediction | Sample sizes in certain subgroups are small and may cause unbalanced training
Aliferis et al. [89] | Linear SVM; polynomial-kernel SVM; KNN; NN | 203 | RNA-seq | AUC (0.8783–0.9980) | 5-fold cross-validation | RNA-seq of selected genes using RFE and UAF | The study uses different gene selection algorithms to improve the classification accuracy | The selected genes vary greatly across different training cohorts
Aliferis et al. [91] | DT; KNN; linear SVM; polynomial-kernel SVM; RBF-kernel SVM; NN | 37 | CNV measured by CGH | Accuracy (0.892) | Leave-one-out cross-validation | Copy number of 80 selected genes based on linear SVM | The study systematically compares different models of lung cancer subtype classification | The sample size is small
Daemen et al. [92] | HMM; weighted LS-SVM | 89 | CNV measured by CGH | Accuracy (0.880–0.955) | 10-fold cross-validation | CNV measured by CGH | The use of recurrent HMMs for CNV detection provides high accuracy for cancer classification | Benchmarked comparisons are needed to demonstrate the superiority of using the HMM model
Jurmeister et al. [93] | NN; SVM; RF | 972 | DNA methylation | Accuracy (0.878–0.964) | 5-fold cross-validation | Top 2000 variable CpG sites | The study provides a framework for using DNA methylation data to predict tumor metastases | The model cannot accurately predict samples with low tumor cellularity through methylation data

Note: LASSO, least absolute shrinkage and selection operator; cfDNA, cell-free DNA; NB, naive Bayes; DT, decision tree; SNV, single-nucleotide variant; CNV, copy number variation; ctDNA,
circulating tumor DNA; mAUC, mean area under the curve; EIS, element-wise input scaling; BEC, bronchial epithelial cell; KNN, K-nearest neighbors; NN, neural network; RFE, recursive feature
elimination; UAF, univariate association filtering; CGH, comparative genomic hybridization; HMM, hidden Markov model; LS-SVM, least squares support vector machines; RNA-seq, RNA
sequencing. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between possible splits in training, validation, and test data. However, cross-validation is
more time consuming than using the simple hold-out method.
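
Several entries in Table 2 pair leave-one-out cross-validation with an AUC. Since each leave-one-out fold holds a single test sample, an AUC cannot be computed per fold; one common workaround, sketched below, is to pool the out-of-fold scores and compute a single AUC over all samples. Whether the cited studies did exactly this is an assumption. The data are synthetic one-dimensional scores, the 5-nearest-neighbor scorer is a toy stand-in, and all function names are hypothetical.

```python
import random

def knn_score(train, x, k=5):
    """Fraction of the k nearest training samples (1-D distance) that are labeled 1."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return sum(label for _, label in nearest) / k

def leave_one_out_scores(data, k=5):
    """Score every sample with a model built from all the other samples."""
    return [knn_score(data[:i] + data[i + 1:], x, k) for i, (x, _) in enumerate(data)]

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data (assumption): 120 samples, class 1 shifted upward by 1.5.
rng = random.Random(1)
data = [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(120)]
labels = [y for _, y in data]

# One held-out prediction per sample, then a single AUC over the pooled predictions.
scores = leave_one_out_scores(data)
print("pooled leave-one-out AUC:", round(auc(scores, labels), 3))
```

The pairwise `auc` helper is quadratic in the number of samples, which is acceptable for cohorts of the sizes listed in Table 2 but would be replaced by a rank-based computation for larger datasets.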

Apply ML to lung cancer treatment response and survival prediction

Prognosis and therapy response prediction

Sophisticated ML models have acted as supplements for cancer intervention response evaluation and prediction [98,99], and have demonstrated advances in optimizing therapy decisions that improve chances of successful recovery (Figure 4; Table 3) [100,101]. Several metrics are available for evaluating cancer therapy response, including the Response Evaluation Criteria in Solid Tumors (RECIST) [102]. The definition of RECIST relies on imaging data, mainly CT and magnetic resonance imaging (MRI), to determine how tumors grow or shrink in patients [103]. To track the tumor volume changes from CT images, Jiang et al. [104] designed an integrated CNN model. Their CNN model used two deep networks based on a full-resolution residual network [105] model by adding multiple residual streams of varying resolutions, so that they could simultaneously combine features at different resolutions for segmenting lung tumors. Using the RECIST criterion, Qureshi [106] set up an RF model to predict the RECIST level under EGFR tyrosine kinase inhibitor (TKI) therapy given the patient’s mutation profile in the gene EGFR. To improve the prediction performance, the model integrated clinical information, geometrical features, and energy features obtained from a patient’s EGFR mutant drug complex as input to train the classifiers. In a recent study, the authors defined a different metric, tumor proportional scoring (TPS), calculated as the percentage of tumor cells in digital pathology images, to evaluate the lung cancer treatment response [107]. They applied the Otsu threshold [108] with an auxiliary classifier generative adversarial network (AC-GAN) model to identify positive tumor cell regions (TC+) and negative tumor cell regions (TC−). They ultimately used the ratio between the pixel count of the TC+ regions and the pixel count of all detected tumor cell regions to evaluate the TPS number. Another study from Geeleher et al. [109] used half-maximal inhibitory concentration (IC50) to evaluate drug response. In their model, the authors applied a ridge regression model [110] to estimate IC50 values for different cell lines in terms of their whole-genome expression level. More recently, Quiros et al. [111] established a phenotype representation learning (PRL) approach through self-supervised learning and community detection for spatial clustering cell type annotation on histopathological images. Their clustering results can be further used for tracking histological tumor growth patterns and identifying tumor recurrence. Indeed, their model has also demonstrated good performance in the LUAD and LUSC classifications.

Survival prediction

Prognosis and survival prediction as a part of clinical oncology is a tough but essential task for physicians, as knowing the survival period can inform treatment decisions and benefit patients in managing costs [112–114]. For most of medical history, predictions relied primarily on the physician’s knowledge and experience based on prior patient histories and medical records. However, studies have indicated that physicians tend to perform poorly in predicting the prognosis and survival expectancy, often over-predicting survival time [115–117]. Statistical algorithms, such as the Cox proportional-hazards model [118], have been implemented to assist physicians’ prediction in many studies [119–122], but they are not particularly accurate [12]. As a comparison, ML has shown its potential to predict a patient’s prognosis and survival in genomic, transcriptomic, proteomic, radiomic, and other datasets (Figure 4; Table 3). Chen et al. [123] used 3-year survival as a threshold to split the patients into high-risk (survival time < 36 months) and low-risk (survival time > 36 months) groups, and then constructed an NN model to make a binary prediction of a patient’s risk from their gene expression data and clinical variables. In their model, they tested four

Figure 4 Diagram of ML applications in treatment response and survival prediction


Table 3 Publications relevant to ML on treatment response and survival prediction
Jiang et al. [104]
  Feature extraction method: MRRN-based model
  Prediction model: MRRN-based model
  Sample size: 1210
  Data type: CT images
  Performance: DSC (0.68–0.75)
  Validation method: 5-fold cross-validation
  Feature selection/input: 3D image features
  Highlight/advantage: The model can accurately track the tumor volume changes from CT images across multiple image resolutions
  Shortcoming: The model does not predict accurately enough when the tumor size is small

Qureshi [106]
  Feature extraction method: NA
  Prediction model: RF; SVM; KNN; LDA; CART
  Sample size: 201
  Data type: Molecular structure and somatic mutations of EGFR
  Performance: Accuracy (0.975)
  Validation method: 10-fold cross-validation
  Feature selection/input: 4 clinical features + 4 protein drug interaction features + 5 geometrical features
  Highlight/advantage: The model integrates multiple features for data training, and achieves better performance than other benchmarked models
  Shortcoming: Among the possible 594 EGFR mutations available in the COSMIC database, the model only considers the most common 33 EGFR mutations for model training

Kapil et al. [107]
  Feature extraction method: AC-GAN
  Prediction model: AC-GAN
  Sample size: 270
  Data type: Digital pathology images
  Performance: Lcc (0.94); Pcc (0.95); MAE (8.03)
  Validation method: Hold-out
  Feature selection/input: PD-L1-stained tumor section histological slides
  Highlight/advantage: The model achieves better performance than other benchmarked, fully supervised models
  Shortcoming: In the experiments, the use of PD-L1 staining for TPS evaluation may not be accurate enough

Geeleher et al. [109]
  Feature extraction method: NA
  Prediction model: Ridge regression model
  Sample size: 62
  Data type: RNA-seq
  Performance: Accuracy (0.89)
  Validation method: Leave-one-out cross-validation
  Feature selection/input: Removed low variable genes
  Highlight/advantage: The model can accurately predict the drug response using RNA-seq profiles only
  Shortcoming: The training sample size is small

Chen et al. [123]
  Feature extraction method: Chi-square test + NN
  Prediction model: NN
  Sample size: 440
  Data type: RNA-seq
  Performance: Accuracy (0.83)
  Validation method: Hold-out
  Feature selection/input: RNA-seq of 5 genes
  Highlight/advantage: The model uses multiple laboratory datasets for training to improve its robustness
  Shortcoming: The model doesn’t consider demographic and clinical features, which may affect the prediction

LUADpp [125]
  Feature extraction method: Top genes with most significant mutation frequency difference
  Prediction model: SVM
  Sample size: 371
  Data type: Somatic mutations
  Performance: Accuracy (0.81)
  Validation method: 5-fold cross-validation
  Feature selection/input: Somatic mutation features in 85 genes
  Highlight/advantage: The model can predict with high accuracy with only seven gene mutation features
  Shortcoming: Mutation frequency may be impacted by the sampling bias across datasets; LD may also affect the feature selection

Cho et al. [126]
  Feature extraction method: Information gain; Chi-squared test; minimum redundancy maximum relevance; correlation algorithm
  Prediction model: NB; KNN; SVM; DT
  Sample size: 471
  Data type: Somatic mutations
  Performance: Accuracy (0.68–0.88)
  Validation method: 5-fold cross-validation
  Feature selection/input: Somatic mutation features composed of 19 genes
  Highlight/advantage: To improve performance, the model uses four different methods for feature selection
  Shortcoming: The training cohort consists of only one dataset

Yu et al. [128]
  Feature extraction method: Information gain ratio; hierarchical clustering
  Prediction model: RF
  Sample size: 538
  Data type: Multi-omics (histology, pathology reports, RNA, proteomics)
  Performance: AUC (> 0.8)
  Validation method: Leave-one-out cross-validation
  Feature selection/input: 15 gene set features
  Highlight/advantage: The study uses an integrative omics-pathology model to improve the accuracy in predicting patients’ prognosis
  Shortcoming: Cox models may be overfitted in multiple-dimension data

Asada et al. [130]
  Feature extraction method: Autoencoder + Cox-PH + K-means + ANOVA
  Prediction model: SVM
  Sample size: 364
  Data type: Multi-omics (miRNA, mRNA)
  Performance: Accuracy (0.81)
  Validation method: Hold-out
  Feature selection/input: 20 miRNAs + 25 mRNAs
  Highlight/advantage: The study uses ML algorithms to systematically model feature extraction from multi-omics datasets
  Shortcoming: The model does not consider the impact of clinical and demographic variances in data training

Takahashi et al. [131]
  Feature extraction method: Autoencoder + Cox-PH + K-means + XGBoost/LightGBM
  Prediction model: LR
  Sample size: 483
  Data type: Multi-omics (mRNA, somatic mutation, CNV, methylation, RPPA)
  Performance: AUC (0.43–0.99 under different omics data)
  Validation method: Hold-out
  Feature selection/input: 12 mRNAs, 3 miRNAs, 3 methylations, 5 CNVs, 3 somatic mutations, and 3 RPPA
  Highlight/advantage: The study uses ML algorithms to systematically model feature extraction from multi-omics datasets
  Shortcoming: The datasets collected in this study contain uncommon samples between different omics datasets, which may cause bias in model evaluation

Wiesweg et al. [136]
  Feature extraction method: Lasso regression
  Prediction model: SVM
  Sample size: 122
  Data type: RNA-seq
  Performance: Significant hazard ratio differences
  Validation method: Hold-out
  Feature selection/input: 7 genes from feature selection model + 25 cell type-specific genes
  Highlight/advantage: The ML-based feature extraction model performs better than using any single immune marker for immunotherapy response prediction
  Shortcoming: The metrics used in this study do not give an intuitive measure of performance; using accuracy or AUC may be better

Trebeschi et al. [137]
  Feature extraction method: LR; RF
  Prediction model: LR; RF
  Sample size: 262
  Data type: CT imaging
  Performance: AUC (0.76–0.83)
  Validation method: Hold-out
  Feature selection/input: 10 radiographic features
  Highlight/advantage: The model can extract potential predictive CT-derived radiomic biomarkers to improve immunotherapy response prediction
  Shortcoming: The predictive performance between different cancer types is not robust

Saltz et al. [142]
  Feature extraction method: CAE [143]
  Prediction model: VGG16 [144] + DeconvNet [145]
  Sample size: 4612
  Data type: Histological images (13 cancer types)
  Performance: AUC (0.9544)
  Validation method: Hold-out
  Feature selection/input: Image features of H&E-stained tumor section histological slides
  Highlight/advantage: The model outperforms pathologists and other benchmarked models
  Shortcoming: The predictive performance between different cancer types is not robust

Note: MRRN, resolution residually connected network; CART, classification and regression trees; AC-GAN, auxiliary classifier generative adversarial networks; Lcc, Lin’s concordance coefficient; Pcc,
Pearson correlation coefficient; MAE, mean absolute error; TPS, tumor proportional scoring; LD, linkage disequilibrium; Cox-PH, Cox proportional-hazards; ANOVA, analysis of variance; miRNA,
microRNA; RPPA, reverse phase protein array; CAE, convolutional autoencoder; mRNA, messenger RNA; PD-L1, programmed cell death 1 ligand 1; COSMIC, the Catalogue Of Somatic Mutations
In Cancer; EGFR, epidermal growth factor receptor. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between possible splits in training, validation,
and test data. However, cross-validation is more time consuming than using the simple holdout method.
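
The trade-off described in the note to Table 3 can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the function names, the 80/20 hold-out fraction, k = 5, and the majority-class stand-in model are all assumptions chosen for illustration. Hold-out fits one model and tests on a single subset, whereas k-fold cross-validation fits k models so that every sample is used for testing exactly once.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a single random hold-out split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]


def kfold_splits(n_samples, k=5, seed=0):
    """Yield k (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def majority_class(X_train, y_train, X_test):
    """Toy stand-in for a real model: always predict the most common training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


def evaluate(fit_predict, X, y, splits):
    """Return the test accuracy of `fit_predict` on each (train_idx, test_idx) split."""
    scores = []
    for train_idx, test_idx in splits:
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return scores
```

Calling `evaluate(model, X, y, [holdout_split(len(X))])` gives a single score that depends on one random split, while `evaluate(model, X, y, kfold_splits(len(X), k=5))` gives five scores whose spread indicates how sensitive the result is to the split, at roughly five times the training cost.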

microarray gene expression datasets and achieved an overall accuracy of 83.0% with only five identified genes correlated with survival time. Liu et al. [124] also utilized gene expression data for a 3-year survival classification. Unlike Chen et al. [123], the authors integrated three types of sequencing data (RNA sequencing, DNA methylation, and DNA mutation) to select a total of 22 genes to improve their model’s stability. Meanwhile, LUADpp [125] and Cho et al. [126] used the somatic mutations as input features to model a 3-year survival risk classification. To select the genes associated with the highest significant mortality, Cho et al. [126] used chi-squared tests, and LUADpp [125] used a published genome-wide rate comparison test [127] that was able to balance statistical power and precision to compare gene mutation rates. Due to the complexity of survival prediction, multi-omics tumor data have been integrated for analysis in many studies. Compared with single-omics data, it is more challenging to accurately extract the most significant genes for prediction from multi-omics data. To address the issue, several studies [128–131] designed a similar workflow. They first constructed a matrix representing the similarity between patients based on their multi-omics data. Using the obtained matrix, they then employed an unsupervised clustering model (usually autoencoder with K-means clustering) to categorize the patients into two clusters. The two clusters were labeled “high-risk” and “low-risk” in terms of the different survival outcomes between the two clusters in the Kaplan–Meier analysis. Following the survival outcome differences, the genes associated with mortality were extracted using a statistical model [128,129] or an ML model [130,131] for downstream analyses.

Apply ML to lung cancer immunotherapy

Immunotherapy response prediction

Immunotherapy has become increasingly important in recent years. It enables a patient’s own immune system to fight cancer, in most cases, by stimulating T cells. To date, distinct novel immunotherapy treatments are being tested for lung cancer, and a variety of them have become standard parts of immunotherapy. Immune checkpoint inhibitors (ICIs), especially programmed cell death protein 1 (PD-1)/programmed cell death protein ligand 1 (PD-L1) blockade therapy [132], have been demonstrated to be valuable in the treatment of patients with non-small cell lung cancer (NSCLC) [133,134]. However, immunotherapy is not yet as widely used as surgery, chemotherapy, or radiation therapies. One interpretation is that it does not work for all patients due to the uniqueness of a patient’s tumor immune microenvironment (TIME). Therefore, estimating whether a patient will respond to immunotherapy is important for cancer treatment. Recently, AI-based technologies have been developed to predict immunotherapy responses based on immune sequencing signatures and medical imaging signatures (Figure 4; Table 3) [135]. To predict the response to PD-1/PD-L1 blockade therapy, Wiesweg et al. [136] utilized gene expression profiles of 7 significant genes extracted from ML models plus 25 cell type-specific genes as input features to train an SVM classifier for RECIST classification. Aside from sequencing data, features from CT scans can also be used to assess the RECIST level of a patient. Two recent studies [137,138] used radiomic biomarkers as well as other imaging features of tumor lesions from contrast-enhanced computed tomography (CE-CT) scans to train a classifier, including LR and RF, for RECIST classification.

Tumor-infiltrating lymphocyte evaluation

The proportion of tumor-infiltrating lymphocytes (TILs) is another important metric for immunotherapy response evaluation. To this end, using transcriptomics data, DeepTIL [139] optimized the cell deconvolution model CIBERSORT [140] to automatically compute the abundance of the leucocyte subsets (B cells, CD4+ T cells, CD8+ T cells, γδ T cells, Mo-Ma-DC cells, and granulocytes) within a tumor sample. A different approach [141] utilized a total of 84 radiomic features from the CE-CT scans, along with RNA sequencing of 20,530 genes as biomarkers to train a linear elastic-net regression model to predict the abundance of CD8+ T cells. Another study [142] created a DL model to identify TILs in digitized H&E-stained images (Table 3). The methodology consisted of two unique CNN modules to evaluate TILs at different scales: a lymphocyte infiltration classification CNN (lymphocyte CNN) and a necrosis segmentation CNN (necrosis CNN). The “lymphocyte CNN” aimed to categorize the input image into with- and without-lymphocyte infiltration regions. It consists of two steps: a convolutional autoencoder (CAE) [143] for feature extraction, followed by a VGG 16-layer network [144] for TIL region classification. The “necrosis CNN” aimed to detect TILs within a necrosis region. They used the DeconvNet [145] model for TIL segmentation in “necrosis CNN” as the model has been shown to achieve high accuracy with several benchmark imaging datasets.

Neoantigen prediction

In addition to immunotherapy response prediction, ML algorithms have shed light on neoantigen prediction for immunotherapy. Neoantigens are tumor-specific mutated peptides generated by somatic mutations in tumor cells, which can induce antitumor immune responses [146–148]. Recent work has demonstrated that immunogenic neoantigens are beneficial to the development and optimization of neoantigen-targeted immune therapies [149–152]. In accordance with neoantigen studies in clinical trials, state-of-the-art ML approaches have been implemented to identify neoantigens based on human leukocyte antigen (HLA) class I and II processing and presentation [153–157]. Using the identified somatic mutations, ML models can estimate the binding affinity of the encoded mutated peptides to the patient’s HLA alleles (peptide–HLA binding affinity). The neoantigens can be further predicted based on the estimated peptide–HLA binding affinity. NetMHC [158,159] utilized a receptor–ligand dataset consisting of 528 peptide–HLA binding interactions measured by Buus et al. [160] to train a combination of several NNs for neo-peptide affinity prediction. To make the prediction more accurate, NetMHCpan [161,162] used a larger dataset consisting of 37,384 unique peptide–HLA interactions covering 24 HLA-A alleles and 18 HLA-B alleles (26,503 and 10,881 for the A and B alleles, respectively) to train their NN model. Both tools have been implemented to study the neoantigen landscape in lung cancers [146,163–165].
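
The neoantigen workflow described above (somatic mutation, then mutated peptides, then peptide–HLA binding affinity, then candidate neoantigens) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the interface of NetMHC or NetMHCpan: `mutant_peptides`, `rank_binders`, and the `predict_ic50_nm` callable are hypothetical names, the 8 to 11 residue window reflects typical HLA class I peptide lengths, and the 500 nM cutoff is a commonly used convention that is assumed here rather than taken from the review.

```python
def mutant_peptides(protein, position, alt_residue, lengths=(8, 9, 10, 11)):
    """Enumerate peptides that span a single amino-acid substitution.

    protein: wild-type amino-acid sequence (one-letter codes).
    position: 0-based index of the substituted residue.
    alt_residue: the mutant amino acid at that position.
    lengths: peptide lengths to enumerate (assumed HLA class I range).
    """
    if not 0 <= position < len(protein):
        raise ValueError("position lies outside the protein sequence")
    if protein[position] == alt_residue:
        raise ValueError("alt_residue equals the wild-type residue")
    mutated = protein[:position] + alt_residue + protein[position + 1:]
    peptides = []
    for k in lengths:
        # Every window of length k that still contains the mutated residue.
        first = max(0, position - k + 1)
        last = min(position, len(mutated) - k)
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


def rank_binders(peptides, predict_ic50_nm, threshold_nm=500.0):
    """Keep peptides with predicted IC50 below the cutoff, strongest binders first.

    predict_ic50_nm is a hypothetical callable standing in for a trained
    peptide-HLA affinity predictor; it maps a peptide string to an IC50 in nM.
    """
    scored = [(p, predict_ic50_nm(p)) for p in peptides]
    return sorted((s for s in scored if s[1] < threshold_nm), key=lambda s: s[1])
```

In practice the placeholder predictor would be replaced by the output of a trained affinity model for each of the patient’s HLA alleles, and the ranked list would be filtered further, for example by expression of the mutated gene.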

Challenges and future perspectives

Despite the widespread use of ML studies in lung cancer clinical practice and research, there are still challenges to be addressed. Here, we present some examples of recent ML algorithms, especially the increasingly popular and important DL algorithms of the past decade, to shed light on lung cancer therapy analyses, as well as the challenges for future lung cancer studies.

Imaging data analysis

Learning how to effectively extract nuance from imaging data is critical for clinical use. In the earlier ML-based CAD system, feature extractions were typically based on the image intensity, shape, and texture of a suspicious region along with other clinical variables [166]. However, these approaches are arbitrarily defined and may not retrieve the intrinsic features of a suspicious nodule. To this end, a DL-based CAD system was developed leveraging CNN models to extract features directly from raw imaging data with multilevel representations and hierarchical abstraction [167–169]. Contrary to previous methods, features from a CNN model are not designed by humans, and reflect the intrinsic features of the nodule in an objective and comprehensive manner. Recently, the Vision Transformer (ViT) has emerged as the current state-of-the-art in computer vision [170,171]. In comparison to CNN, ViT outperformed by almost 4× in terms of computational efficiency and accuracy, and was more robust when training on smaller datasets [172]. Although, to our knowledge, ViT models have not been implemented in any lung cancer imaging studies, they have shown their potential as a competitive alternative to CNN in imaging data analysis.

Omics dataset analysis

DL is a subfield of ML, which uses programmable NNs to make accurate decisions. It particularly shines when it comes to complex problems such as image classification. In this study, we reviewed the utility of DL models in imaging datasets. Compared with imaging datasets, DL algorithms were less frequent in lung cancer clinical studies using omics data. However, DL models have been extensively applied in other fields of omics analysis. For example, the genomics data are continuous sequences, thus recurrent neural network (RNN) models [173] and CNN models [174] are good tools for the population genetics analysis. Moreover, considering the input dimension of the omics data is usually very high, to improve efficiency and reduce overfitting, many studies have used autoencoders or deep generative models for feature extraction and dimensionality reduction [175]. In the meantime, self-supervised representation learning models can overcome the curse of dimensionality and integrate multi-omics data to combine information about different aspects of the same tissue samples [176]. Accompanied by the development of single-cell-based [177] and spatial-based [178] technologies that have been applied in molecular studies, numerous DL models are becoming more popular for computationally intensive analysis. To deal with the complexity of large genomics data, unsupervised deep clustering tools have been built for population structure identification [179] or cell population subtype annotation [180–183]. In addition, to process the complex structure of multi-omics data, graph neural network (GNN) models are increasingly popular in dataset integration [184], biomedical classification [185], prognosis prediction [186], and so on. Though these studies have not been directly applied to lung cancer clinical analysis, they are a good inspiration for using DL tools to address complex lung cancer omics datasets.

Multi-view data and multi-database integration

It is common to access large amounts of imaging data, multi-omics data, and clinical records from a single patient nowadays. Integrating these data provides a comprehensive insight into the molecular functions of lung cancer studies. However, these data types are typically obtained from different platforms, so platform noise inevitably exists between these data types. For example, imaging data analysis, especially radiomics, usually comes with the challenges of complicated data normalization, data fusion, and data integration. To overcome this limitation, multimodality medical segmentation networks have been developed to jointly process multimodality medical images [187]. Similarly, for sequencing data types, batch noise also exists between different databases (i.e., batch effect). Removing batch effects and integrating datasets from multiple platforms together in a framework that allows us to further analyze the mechanisms of cancer drug resistance and recurrence is important for cancer therapies. Though biomedical studies have experimented and/or benchmarked integrative tools [188–191], they are not comprehensive and discriminating enough to address the choice of tools in the context of biological questions of interest.

Model generalizability and robustness

In this review, we find that the performance of an ML algorithm usually varies across different datasets. One interpretation might be the existence of a database batch effect that we discussed earlier. However, the absence of generalizability and robustness might be other factors that hinder these ML models in clinical studies. In addition, to reduce overfitting, most studies used either statistical models or ML models to select marker genes before classification. However, these marker genes are usually quite different between studies, indicating that the identified marker genes lack generalizability and biological interpretability. To improve the generalizability and robustness of a model, it is important to develop a better understanding of robustness issues in different ML architectures and bridge the gap in robustness techniques among different domains. For example, recent studies have applied transfer learning to use a pre-trained model when training their own datasets in lung cancer imaging data analysis [38,55,192], and have improved the efficiency and robustness of their CNN-based models. For sequencing datasets, transfer learning has also been used in deep NNs to improve generalizability [193], which could be a good example of building a general and robust model for lung cancer sequencing data analysis. In addition, DL is a complex black-box model. Understanding the mechanisms of a DL system in clinical studies could help to build a standardized and unified DL framework to improve its performance and robustness. The

explainable AI (XAI) models have provided a tool for model-specific and model-agnostic analysis [194,195]. These methods can provide the explanations of a model at local and global levels, which further helps the researchers to fine-tune hyperparameters from different models with high efficacy [196,197].

Metrics for performance evaluation

Studies usually focus on the development of algorithms for clinical studies. However, metrics selection for performance assessment of these algorithms is usually neglected, though it is an important component of ML systems [198]. Based on this review (Tables 1–3), accuracy and area under the curve (AUC) are the two most conventional metrics, whereas these metrics do not always reflect the clinical needs and should be translated into clinically explainable metrics. Compared with accuracy, sensitivity or specificity might be more associated with clinical needs under certain circumstances, for example, patients at high risk of emergency department visits [199].

Clinical decision-making

A recent study estimated that the overall costs for lung cancer therapy would exceed $50,000 [200] for most patients, and that the cost would be high for most families. Thus, accurate prognosis prediction and decision-making will pave the way for personalized treatment. Recent DL models have been used to predict the effectiveness of a therapy/drug and optimize the combination of different therapies/drugs [201,202]. However, most existing DL models for clinical decision-making have difficulty in keeping up with knowledge evolution and/or dynamic health care data change [203]. Clinical decision support systems, including IBM Watson Health and Google DeepMind Health, have been implemented in lung cancer treatments in recent years [204,205]. Although the efficiency of clinical work has improved with the help of these systems, they are still far from perfect in terms of clinical trials, and cannot replace physicians at this stage [205].

Conclusion

AI grants us a different perspective on lung cancer research and allows for exploring the implementation of decision support tools to facilitate precision oncology. In this review, we surveyed the current advances of ML algorithms in various areas of lung cancer therapy, including early detection, diagnosis decision, prognosis prediction, drug response evaluation, and immunotherapy practice. To aid future ML development in lung cancer therapies, we thoroughly summarized the datasets (Table S1), baseline methods (Table S2), and characteristics of the methods (Tables 1–3). Lastly, we highlighted the current challenges that need to be addressed, such as the current lack of quantity and quality of medical data labels for training, the importance of model robustness and biomedical explanations for clinical use, the concern of the metrics used for performance evaluation, and the need for data integration and batch removal. As this review indicates, future lung cancer therapies will include both imaging data and omics data, so an ML clinical decision-making tool should be a multi-modal system that considers both imaging data and omics data treatment, and the integration of multiple data types. Finally, we expect that these challenges could motivate further studies to focus on lung cancer therapies.

CRediT author statement

Yawei Li: Conceptualization, Data curation, Writing - original draft, Visualization. Xin Wu: Data curation, Writing - original draft, Visualization. Ping Yang: Writing - review & editing. Guoqian Jiang: Writing - review & editing. Yuan Luo: Conceptualization, Funding acquisition, Writing - review & editing. All authors have read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

This study is supported in part by the National Institutes of Health, USA (Grant Nos. U01TR003528 and R01LM013337).

Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2022.11.003.

ORCID

ORCID 0000-0001-9699-5118 (Yawei Li)
ORCID 0000-0003-2386-6344 (Xin Wu)
ORCID 0000-0002-8588-847X (Ping Yang)
ORCID 0000-0003-2940-0019 (Guoqian Jiang)
ORCID 0000-0003-0195-7456 (Yuan Luo)

References

[1] Thai AA, Solomon BJ, Sequist LV, Gainor JF, Heist RS. Lung cancer. Lancet 2021;398:535–54.
[2] Svoboda E. Artificial intelligence is improving the detection of lung cancer. Nature 2020;587:S20–2.
[3] Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, et al. Extremely high genetic diversity in a single tumor points to prevalence of non-darwinian cell evolution. Proc Natl Acad Sci U S A 2015;112:E6496–505.
[4] International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. International network of cancer genome projects. Nature 2010;464:993–8.
[5] Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
[6] Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26:1045–57.
[7] Pavlopoulou A, Spandidos DA, Michalopoulos I. Human cancer databases (review). Oncol Rep 2015;33:3–18.
[8] Luo Y, Wang F, Szolovits P. Tensor factorization toward precision medicine. Brief Bioinform 2016;18:511–4.

[9] Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
[10] Chao G, Mao C, Wang F, Zhao Y, Luo Y. Supervised nonnegative matrix factorization to predict icu mortality risk. Proceedings (IEEE Int Conf Bioinformatics Biomed) 2018;2018:1189–94.
[11] Chi EC, Kolda TG. On tensors, sparsity, and nonnegative factorizations. SIAM J Matrix Anal Appl 2012;33:1272–99.
[12] Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med 2021;13:152.
[13] Zeng Z, Yao L, Roy A, Li X, Espino S, Clare SE, et al. Identifying breast cancer distant recurrences from electronic health records using machine learning. J Healthc Inform Res 2019;3:283–99.
[14] Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2007;2:59–77.
[15] Wang H, Li Y, Khan SA, Luo Y. Prediction of breast cancer distant recurrence using natural language processing and knowledge-guided convolutional neural network. Artif Intell Med 2020;110:101977.
[16] Cochran AJ. Prediction of outcome for patients with cutaneous melanoma. Pigment Cell Res 1997;10:162–7.
[17] Zeng Z, Li X, Espino S, Roy A, Kitsch K, Clare S, et al. Contralateral breast cancer event detection using natural language processing. AMIA Annu Symp Proc 2018;2017:1885–92.
[18] Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 2015;13:8–17.
[19] Luo Y, Xin Y, Hochberg E, Joshi R, Uzuner O, Szolovits P. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text. J Am Med Inform Assoc 2015;22:1009–19.
[20] Benzekry S. Artificial intelligence and mechanistic modeling for clinical decision making in oncology. Clin Pharmacol Ther 2020;108:471–86.
[21] Li Y, Luo Y. Optimizing the evaluation of gene-targeted panels for tumor mutational burden estimation. Sci Rep 2021;11:21072.
[22] Bhinder B, Gilvary C, Madhukar NS, Elemento O. Artificial intelligence in cancer research and precision medicine. Cancer Discov 2021;11:900–15.
[23] Zeng Z, Espino S, Roy A, Li X, Khan SA, Clare SE, et al. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics 2018;19:498.
[24] Luo Y, Sohani AR, Hochberg EP, Szolovits P. Automatic lymphoma classification with sentence subgraph mining from pathology reports. J Am Med Inform Assoc 2014;21:824–32.
[25] Luchini C, Pea A, Scarpa A. Artificial intelligence in oncology: current applications and future perspectives. Br J Cancer 2022;126:4–9.
[26] Zeng Z, Vo A, Li X, Shidfar A, Saldana P, Blanco L, et al. Somatic genetic aberrations in benign breast disease and the risk of subsequent breast cancer. NPJ Breast Cancer 2020;6:24.
[27] Na J, Zong N, Wang C, Midthun DE, Luo Y, Yang P, et al. Characterizing phenotypic abnormalities associated with high-risk individuals developing lung cancer using electronic health records from the All of Us researcher workbench. J Am Med Inform Assoc 2021;28:2313–24.
[28] Fujita H. AI-based computer-aided diagnosis (AI-CAD): the latest review to read first. Radiol Phys Technol 2020;13:6–19.
[29] Yanase J, Triantaphyllou E. A systematic survey of computer-aided diagnosis in medicine: past and present developments. Expert Syst Appl 2019;138:112821.
[30] Abe Y, Hanai K, Nakano M, Ohkubo Y, Hasizume T, Kakizaki T, et al. A computer-aided diagnosis (CAD) system in lung cancer screening with computed tomography. Anticancer Res 2005;25:483–8.
[32] van Riel SJ, Ciompi F, Wille MMW, Dirksen A, Lam S, Scholten ET, et al. Malignancy risk estimation of pulmonary nodules in screening CTs: comparison between a computer model and human observers. PLoS One 2017;12:e0185032.
[33] Wille MMW, van Riel SJ, Saghir Z, Dirksen A, Pedersen JH, Jacobs C, et al. Predictive accuracy of the pancan lung cancer risk prediction model - external validation based on CT from the Danish Lung Cancer Screening Trial. Eur Radiol 2015;25:3093–9.
[34] Kriegsmann M, Casadonte R, Kriegsmann J, Dienemann H, Schirmacher P, Kobarg JH, et al. Reliable entity subtyping in non-small cell lung cancer by matrix-assisted laser desorption/ionization imaging mass spectrometry on formalin-fixed paraffin-embedded tissue specimens. Mol Cell Proteomics 2016;15:3081–9.
[35] Mohammad BA, Brennan PC, Mello-Thoms C. A review of lung cancer screening and the role of computer-aided detection. Clin Radiol 2017;72:433–42.
[36] Armato 3rd SG, Li F, Giger ML, MacMahon H, Sone S, Doi K. Lung cancer: performance of automated lung nodule detection applied to cancers missed in a CT screening program. Radiology 2002;225:685–92.
[37] Buty M, Xu Z, Gao M, Bagci U, Wu A, Mollura D. Characterization of lung nodule malignancy using hybrid shape and appearance features. In: Ourselin S, Joskowicz L, Sabuncu M, Unal G, Wells W, editors. Medical image computing and computer-assisted intervention. Cham: Springer; 2016, p.662–70.
[38] Hussein S, Cao K, Song Q, Bagci U. Risk stratification of lung nodules using 3D CNN-based multi-task learning. In: Niethammer M, Styner M, Aylward S, Zhu H, Oguz I, Yap PT, editors. Information processing in medical imaging. Cham: Springer; 2017, p.249–60.
[39] Khosravan N, Celik H, Turkbey B, Jones EC, Wood B, Bagci U. A collaborative computer aided diagnosis (C-CAD) system with eye-tracking, sparse attentional model, and deep learning. Med Image Anal 2019;51:101–15.
[40] Ciompi F, de Hoop B, van Riel SJ, Chung K, Scholten ET, Oudkerk M, et al. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Med Image Anal 2015;26:195–202.
[41] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM 2017;60:84–90.
[42] Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y. OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv 2014;1312.6229.
[43] Gu X, Wang Y, Chan TF, Thompson PM, Yau ST. Genus zero surface conformal mapping and its application to brain surface mapping. Inf Process Med Imaging 2003;18:172–84.
[44] Venkadesh KV, Setio AAA, Schreuder A, Scholten ET, Chung KM, Wille MMW, et al. Deep learning for malignancy risk estimation of pulmonary nodules detected at low-dose screening CT. Radiology 2021;300:438–47.
[45] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Conf Comput Vis Pattern Recognit 2016:770–8.
[46] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. IEEE Conf Comput Vis Pattern Recognit 2015:1–9.
[47] Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:954–61.
[48] He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. IEEE Int Conf Comput Vis 2017:2980–8.
[31] McWilliams A, Tammemagi MC, Mayo JR, Roberts H, Liu G, [49] Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for
Soghrati K, et al. Probability of cancer in pulmonary nodules dense object detection. IEEE Trans Pattern Anal Mach Intell
detected on first screening CT. N Engl J Med 2013;369:910–9. 2020;42:318–27.
Li Y et al / Machine Learning Applications in Lung Cancer 863

[50] Carreira J, Zisserman A. Quo vadis, action recognition? A new [70] Eraslan G, Avsec Z, Gagneur J, Theis FJ. Deep learning: new
model and the kinetics dataset. IEEE Conf Comput Vis Pattern computational modelling techniques for genomics. Nat Rev
Recognit 2017:4724–33. Genet 2019;20:389–403.
[51] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking [71] Luo Y, Mao C. Panther: pathway augmented nonnegative tensor
the inception architecture for computer vision. IEEE Conf factorization for higher-order feature learning. Proc AAAI Conf
Comput Vis Pattern Recognit 2016:2818–26. Artif Intell 2021:37–180.
[52] AbdulJabbar K, Raza SEA, Rosenthal R, Jamal-Hanjani M, [72] Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar
Veeriah S, Akarca A, et al. Geospatial immune variability RCT, et al. Diffuse large B-cell lymphoma outcome prediction
illuminates differential evolution of lung adenocarcinoma. Nat by gene-expression profiling and supervised machine learning.
Med 2020;26:1054–62. Nat Med 2002;8:68–74.
[53] Raza SEA, Cheung L, Shaban M, Graham S, Epstein D, [73] Kononenko I. Machine learning for medical diagnosis: history,
Pelengaris S, et al. Micro-Net: a unified model for segmentation state of the art and perspective. Artif Intell Med 2001;23:89–109.
of various objects in microscopy images. Med Image Anal [74] Luo Y, Mao C. ScanMap: supervised confounding aware non-
2019;52:160–73. negative matrix factorization for polygenic risk modeling. Proc
[54] Sirinukunwattana K, Raza SEA, Tsang YW, Snead DRJ, Cree Mach Learn Res 2020;126:27–45.
IA, Rajpoot NM. Locality sensitive deep learning for detection [75] Zeng Z, Mao C, Vo A, Li X, Nugent JO, Khan SA, et al. Deep
and classification of nuclei in routine colon cancer histology learning for cancer type classification and driver gene identifica-
images. IEEE Trans Med Imaging 2016;35:1196–206. tion. BMC Bioinformatics 2021;22:491.
[55] Ocampo P, Moreira A, Coudray N, Sakellaropoulos T, Narula [76] Zhang Y, Li Y, Li T, Shen X, Zhu T, Tao Y, et al. Genetic load
N, Snuderl M, et al. Classification and mutation prediction from and potential mutational meltdown in cancer cell populations.
non-small cell lung cancer histopathology images using deep Mol Biol Evol 2019;36:541–52.
learning. J Thorac Oncol 2018;13:S562. [77] Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR,
[56] Yao Q, Xiao L, Liu P, Zhou SK. Label-free segmentation of Behjati S, Biankin AV, et al. Signatures of mutational processes
COVID-19 lesions in lung CT. IEEE Trans Med Imaging in human cancer. Nature 2013;500:415–21.
2021;40:2808–19. [78] Mathios D, Johansen JS, Cristiano S, Medina JE, Phallen J,
[57] Chuquicusma MJM, Hussein S, Burt J, Bagci U. How to fool Larsen KR, et al. Detection and characterization of lung cancer
radiologists with generative adversarial networks? A visual using cell-free DNA fragmentomes. Nat Commun 2021;12:5060.
turing test for lung cancer diagnosis. IEEE Int Symp Biomed [79] Chabon JJ, Hamilton EG, Kurtz DM, Esfahani MS, Moding EJ,
Imaging 2018:240–4. Stehr H, et al. Integrating genomic features for non-invasive
[58] Li J, Jia J, Xu D. Unsupervised representation learning of image- early lung cancer detection. Nature 2020;580:245–51.
based plant disease with deep convolutional generative adver- [80] Liang W, Zhao Y, Huang W, Gao Y, Xu W, Tao J, et al. Non-
sarial networks. 37th Chinese Control Conference 2018:9159– invasive diagnosis of early-stage lung cancer using high-through-
63. put targeted DNA methylation sequencing of circulating tumor
[59] Lin CH, Lin CJ, Li YC, Wang SH. Using generative adversarial DNA (ctDNA). Theranostics 2019;9:2056–70.
networks and parameter optimization of convolutional neural [81] Raman L, van der Linden M, van der Eecken K, Vermaelen K,
networks for lung tumor classification. Appl Sci 2021;11:480. Demedts I, Surmont V, et al. Shallow whole-genome sequencing
[60] Ren Z, Zhang Y, Wang S. A hybrid framework for lung cancer of plasma cell-free DNA accurately differentiates small from
classification. Electronics 2022;11:1614. non-small cell lung carcinoma. Genome Med 2020;12:35.
[61] Pinsky PF, Gierada DS, Black W, Munden R, Nath H, Aberle [82] Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P.
D, et al. Performance of lung-RADS in the national lung Molecular biology of the cell. 4th ed. New York: Garland
screening trial: a retrospective assessment. Ann Intern Med Science; 2002.
2015;162:485–91. [83] Kobayashi K, Bolatkan A, Shiina S, Hamamoto R. Fully-
[62] National Lung Screening Trial Research Team, Church TR, connected neural networks with reduced parameterization for
Black WC, Aberle DR, Berg CD, Clingan KL, et al. Results of predicting histological types of lung cancer from somatic
initial low-dose computed tomographic screening for lung mutations. Biomolecules 2020;10:1249.
cancer. N Engl J Med 2013;368:1980–91. [84] Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway
[63] Herath S, Rad HS, Radfar P, Ladwa R, Warkiani M, O’Byrne LA, Golub TR, et al. Discovery and saturation analysis of cancer
K, et al. The role of circulating biomarkers in lung cancer. Front genes across 21 tumour types. Nature 2014;505:495–501.
Oncol 2021;11:801269. [85] Uhlen M, Zhang C, Lee S, Sjostedt E, Fagerberg L, Bidkhori G,
[64] Politi K, Herbst RS. Lung cancer in the era of precision et al. A pathology atlas of the human cancer transcriptome.
medicine. Clin Cancer Res 2015;21:2213–20. Science 2017;357:eaan2507.
[65] Relli V, Trerotola M, Guerra E, Alberti S. Abandoning the [86] Whitney DH, Elashoff MR, Porta-Smith K, Gower AC, Vachani
notion of non-small cell lung cancer. Trends Mol Med A, Ferguson JS, et al. Derivation of a bronchial genomic
2019;25:585–94. classifier for lung cancer in a prospective study of patients
[66] Chen JW, Dhahbi J. Lung adenocarcinoma and lung squamous undergoing diagnostic bronchoscopy. BMC Med Genomics
cell carcinoma cancer classification, biomarker identification, 2015;8:18.
and gene expression analysis using overlapping feature selection [87] Podolsky MD, Barchuk AA, Kuznetcov VI, Gusarova NF,
methods. Sci Rep 2021;11:13323. Gaidukov VS, Tarakanov SA. Evaluation of machine learning
[67] Zeng Z, Vo AH, Mao C, Clare SE, Khan SA, Luo Y. Cancer algorithm utilization for lung cancer classification based on gene
classification and pathway discovery using non-negative matrix expression levels. Asian Pac J Cancer Prev 2016;17:835–8.
factorization. J Biomed Inform 2019;96:103247. [88] Choi Y, Qu J, Wu S, Hao Y, Zhang J, Ning J, et al. Improving
[68] Jiao W, Atwal G, Polak P, Karlic R, Cuppen E, Danyi A, et al. lung cancer risk stratification leveraging whole transcriptome
A deep learning system accurately classifies primary and RNA sequencing and machine learning across multiple cohorts.
metastatic cancers using passenger mutation patterns. Nat BMC Med Genomics 2020;13:151.
Commun 2020;11:728. [89] Aliferis CF, Tsamardinos I, Massion PP, Statnikov A, Fanana-
[69] Li Y, Luo Y. Performance-weighted-voting model: an ensemble pazir N, Hardin D. Machine learning models for classification of
machine learning method for cancer type classification using lung cancer and selection of genomic markers using array gene
whole-exome sequencing mutation. Quant Biol 2020;8:347–58.
864 Genomics Proteomics Bioinformatics 20 (2022) 850–866

expression data. Proceedings of the 16th International Florida [108] Otsu N. A threshold selection method from gray-level his-
Artificial Intelligence Research Society Conference 2003:67–71. tograms. IEEE Trans Syst Man Cybern 1979;9:62–6.
[90] Shao X, Lv N, Liao J, Long JB, Xue R, Ai N, et al. Copy [109] Geeleher P, Cox NJ, Huang RS. Clinical drug response can be
number variation is highly correlated with differential gene predicted using baseline gene expression levels and in vitro drug
expression: a pan-cancer study. BMC Med Genet 2019;20:175. sensitivity in cell lines. Genome Biol 2014;15:R47.
[91] Aliferis CF, Hardin D, Massion PP. Machine learning models [110] Cule E, De Iorio M. Ridge regression in prediction problems:
for lung cancer classification using array comparative genomic automatic choice of the ridge parameter. Genet Epidemiol
hybridization. Proc AMIA Symp 2002:7–11. 2013;37:704–14.
[92] Daemen A, Gevaert O, Leunen K, Legius E, Vergote I, De Moor [111] Quiros AC, Coudray N, Yeaton A, Yang X, Chiriboga L,
B. Supervised classification of array CGH data with HMM- Karimkhan A, et al. Self-supervised learning unveils morpho-
based feature selection. Pac Symp Biocomput 2009:468–79. logical clusters behind lung cancer types and prognosis. arXiv
[93] Jurmeister P, Bockmayr M, Seegerer P, Bockmayr T, Treue D, 2022;2205.01931.
Montavon G, et al. Machine learning analysis of DNA methy- [112] Gensheimer MF, Aggarwal S, Benson KRK, Carter JN, Henry
lation profiles distinguishes primary lung squamous cell carci- AS, Wood DJ, et al. Automated model versus treating physician
nomas from head and neck metastases. Sci Transl Med 2019;11: for predicting survival time of patients with metastatic cancer. J
eaaw8513. Am Med Inform Assoc 2021;28:1108–16.
[94] Luo Y, Riedlinger G, Szolovits P. Text mining in cancer gene [113] Doppalapudi S, Qiu RG, Badr Y. Lung cancer survival period
and pathway prioritization. Cancer Inform 2014;13:69–79. prediction and understanding: deep learning approaches. Int J
[95] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for Med Inform 2021;148:104371.
cancer classification using support vector machines. Mach Learn [114] Nair M, Sandhu SS, Sharma AK. Prognostic and predictive
2002;46:389–422. biomarkers in cancer. Curr Cancer Drug Targets
[96] Mirhadi S, Tam S, Li Q, Moghal N, Pham NA, Tong J, et al. 2014;14:477–504.
Integrative analysis of non-small cell lung cancer patient-derived [115] Chow E, Davis L, Panzarella T, Hayter C, Szumacher E, Loblaw
xenografts identifies distinct proteotypes associated with patient A, et al. Accuracy of survival prediction by palliative radiation
outcomes. Nat Commun 2022;13:1811. oncologists. Int J Radiat Oncol Biol Phys 2005;61:870–3.
[97] Xu JY, Zhang C, Wang X, Zhai L, Ma Y, Mao Y, et al. [116] Lakin JR, Robinson MG, Bernacki RE, Powers BW, Block SD,
Integrative proteomic characterization of human lung adenocar- Cunningham R, et al. Estimating 1-year mortality for high-risk
cinoma. Cell 2020;182:245–61. primary care patients using the ‘‘surprise” question. JAMA
[98] El-Deredy W, Ashmore SM, Branston NM, Darling JL, Intern Med 2016;176:1863–5.
Williams SR, Thomas DG. Pretreatment prediction of the [117] White N, Reid F, Harris A, Harries P, Stone P. A systematic
chemotherapeutic response of human glioma cell cultures using review of predictions of survival in palliative care: how accurate
nuclear magnetic resonance spectroscopy and artificial neural are clinicians and who are the experts? PLoS One 2016;11:
networks. Cancer Res 1997;57:4196–9. e0161407.
[99] Zeng Z, Amin A, Roy A, Pulliam NE, Karavites LC, Espino S, [118] Cox DR. Regression models and life-tables. J R Stat Soc B
et al. Preoperative magnetic resonance imaging use and onco- 1972;34:187–220.
logic outcomes in premenopausal breast cancer patients. NPJ [119] Wang X, Yao S, Xiao Z, Gong J, Liu Z, Han B, et al.
Breast Cancer 2020;6:49. Development and validation of a survival model for lung
[100] Chang Y, Park H, Yang HJ, Lee S, Lee KY, Kim TS, et al. adenocarcinoma based on autophagy-associated genes. J Transl
Cancer drug response profile scan (CDRscan): a deep learning Med 2020;18:149.
model that predicts drug effectiveness from cancer genomic [120] Zhang YH, Lu Y, Lu H, Zhou YM. Development of a survival
signature. Sci Rep 2018;8:8857. prognostic model for non-small cell lung cancer. Front Oncol
[101] Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, 2020;10:362.
Ballester PJ, et al. Machine learning prediction of cancer cell [121] Yu KH, Zhang C, Berry GJ, Altman RB, Re C, Rubin DL, et al.
sensitivity to drugs based on genomic and chemical properties. Predicting non-small cell lung cancer prognosis by fully auto-
PLoS One 2013;8:e61318. mated microscopic pathology image features. Nat Commun
[102] Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent 2016;7:12474.
D, Ford R, et al. New response evaluation criteria in solid [122] Hatlen P, Gronberg BH, Langhammer A, Carlsen SM, Amund-
tumours: revised RECIST guideline (version 1.1). Eur J Cancer sen T. Prolonged survival in patients with lung cancer with
2009;45:228–47. diabetes mellitus. J Thorac Oncol 2011;6:1810–7.
[103] Adam G, Rampasek L, Safikhani Z, Smirnov P, Haibe-Kains B, [123] Chen YC, Ke WC, Chiu HW. Risk classification of cancer
Goldenberg A. Machine learning approaches to drug response survival using ANN with gene expression data from multiple
prediction: challenges and recent progress. NPJ Precis Oncol laboratories. Comput Biol Med 2014;48:1–7.
2020;4:19. [124] Liu Y, Yang M, Sun W, Zhang M, Sun J, Wang W, et al.
[104] Jiang J, Hu YC, Liu CJ, Halpenny D, Hellmann MD, Deasy JO, Developing prognostic gene panel of survival time in lung
et al. Multiple resolution residually connected feature streams for adenocarcinoma patients using machine learning. Transl Cancer
automatic lung tumor segmentation from CT images. IEEE Res 2020;9:3860–9.
Trans Med Imaging 2019;38:134–44. [125] Yu J, Hu Y, Xu Y, Wang J, Kuang J, Zhang W, et al. LUADpp:
[105] Pohlen T, Hermans A, Mathias M, Leibe B. Full-resolution an effective prediction model on prognosis of lung adenocarci-
residual networks for semantic segmentation in street scenes. nomas based on somatic mutational features. BMC Cancer
IEEE Conf Comput Vis Pattern Recognit 2017:3309–18. 2019;19:263.
[106] Qureshi R. Personalized drug-response prediction model for lung [126] Cho HJ, Lee S, Ji YG, Lee DH. Association of specific gene
cancer patients using machine learning. TechRxiv 2020; mutations derived from machine learning with survival in lung
13273319.v1. adenocarcinoma. PLoS One 2018;13:e0207204.
[107] Kapil A, Meier A, Zuraw A, Steele KE, Rebelatto MC, Schmidt [127] Hui X, Hu Y, Sun MA, Shu X, Han R, Ge Q, et al. EBT: a
G, et al. Deep semi supervised generative learning for automated statistic test identifying moderate size of significant features with
tumor proportion scoring on NSCLC tissue needle biopsies. Sci balanced power and precision for genome-wide rate compar-
Rep 2018;8:17343. isons. Bioinformatics 2017;33:2631–41.
Li Y et al / Machine Learning Applications in Lung Cancer 865

[128] Yu KH, Berry GJ, Rubin DL, Re C, Altman RB, Snyder M. [147] Roudko V, Greenbaum B, Bhardwaj N. Computational predic-
Association of omics features with histopathology patterns in tion and validation of tumor-associated neoantigens. Front
lung adenocarcinoma. Cell Syst 2017;5:620–7. Immunol 2020;11:27.
[129] Ramazzotti D, Lal A, Wang B, Batzoglou S, Sidow A. Multi- [148] Zhang Z, Lu M, Qin Y, Gao W, Tao L, Su W, et al. Neoantigen:
omic tumor data reveal diversity of molecular mechanisms that a new breakthrough in tumor immunotherapy. Front Immunol
correlate with survival. Nat Commun 2018;9:4453. 2021;12:672356.
[130] Asada K, Kobayashi K, Joutard S, Tubaki M, Takahashi S, [149] Hilf N, Kuttruff-Coqui S, Frenzel K, Bukur V, Stevanovic S,
Takasawa K, et al. Uncovering prognosis-related genes and Gouttefangeas C, et al. Actively personalized vaccination trial
pathways by multi-omics analysis in lung cancer. Biomolecules for newly diagnosed glioblastoma. Nature 2019;565:240–5.
2020;10:524. [150] Carreno BM, Magrini V, Becker-Hapak M, Kaabinejadian S,
[131] Takahashi S, Asada K, Takasawa K, Shimoyama R, Sakai A, Hundal J, Petti AA, et al. A dendritic cell vaccine increases the
Bolatkan A, et al. Predicting deep learning based multi-omics breadth and diversity of melanoma neoantigen-specific T cells.
parallel integration survival subtypes in lung cancer using reverse Science 2015;348:803–8.
phase protein array data. Biomolecules 2020;10:1460. [151] Ott PA, Hu Z, Keskin DB, Shukla SA, Sun J, Bozym DJ, et al.
[132] Xia L, Liu Y, Wang Y. PD-1/PD-L1 blockade therapy in An immunogenic personal neoantigen vaccine for patients with
advanced non-small-cell lung cancer: current status and future melanoma. Nature 2017;547:217–21.
directions. Oncologist 2019;24:S31–41. [152] Keskin DB, Anandappa AJ, Sun J, Tirosh I, Mathewson ND,
[133] Doroshow DB, Sanmamed MF, Hastings K, Politi K, Rimm Li S, et al. Neoantigen vaccine generates intratumoral T cell
DL, Chen L, et al. Immunotherapy in non-small cell lung cancer: responses in phase Ib glioblastoma trial. Nature 2019;565:
facts and hopes. Clin Cancer Res 2019;25:4592–602. 234–9.
[134] Lim SM, Hong MH, Kim HR. Immunotherapy for non-small [153] Zhao W, Sher X. Systematically benchmarking peptide-MHC
cell lung cancer: current landscape and future perspectives. binding predictors: from synthetic to naturally processed epi-
Immune Netw 2020;20:e10. topes. PLoS Comput Biol 2018;14:e1006457.
[135] Xu Z, Wang X, Zeng S, Ren X, Yan Y, Gong Z. Applying [154] Racle J, Michaux J, Rockinger GA, Arnaud M, Bobisse S,
artificial intelligence for cancer immunotherapy. Acta Pharm Sin Chong C, et al. Robust prediction of HLA class II epitopes by
B 2021;11:3393–405. deep motif deconvolution of immunopeptidomes. Nat Biotech-
[136] Wiesweg M, Mairinger F, Reis H, Goetz M, Kollmeier J, Misch nol 2019;37:1283–6.
D, et al. Machine learning reveals a PD-L1-independent predic- [155] Chen BB, Khodadoust MS, Olsson N, Wagar LE, Fast E, Liu
tion of response to immunotherapy of non-small cell lung cancer CL, et al. Predicting HLA class II antigen presentation through
by gene expression context. Eur J Cancer 2020;140:76–85. integrated deep learning. Nature Biotechnol 2019;37:1332–43.
[137] Trebeschi S, Drago SG, Birkbak NJ, Kurilova I, Calin AM, Pizzi [156] O’Donnell TJ, Rubinsteyn A, Bonsack M, Riemer AB, Laserson
AD, et al. Predicting response to cancer immunotherapy using U, Hammerbacher J. MHCflurry: open-source class I MHC
noninvasive radiomic biomarkers. Ann Oncol 2019;30:998–1004. binding affinity prediction. Cell Syst 2018;7:129–32.
[138] Coroller TP, Agrawal V, Narayan V, Hou Y, Grossmann P, Lee [157] Bulik-Sullivan B, Busby J, Palmer CD, Davis MJ, Murphy T,
SW, et al. Radiomic phenotype features predict pathological Clark A, et al. Deep learning using tumor HLA peptide mass
response in non-small cell lung cancer. Radiother Oncol spectrometry datasets improves neoantigen identification. Nat-
2016;119:480–6. ure Biotechnol 2019;37:55–63.
[139] Tosolini M, Pont F, Poupot M, Vergez F, Nicolau-Travers ML, [158] Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O,
Vermijlen D, et al. Assessment of tumor-infiltrating TCRVc9Vd2 Nielsen M. NetMHC-3.0: accurate web accessible predictions of
cd lymphocyte abundance by deconvolution of human cancers human, mouse and monkey MHC class I affinities for peptides of
microarrays. Oncoimmunology 2017;6:e1284723. length 8–11. Nucleic Acids Res 2008;36:W509–12.
[140] Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, [159] Nielsen M, Lundegaard C, Worning P, Lauemoller SL, Lam-
et al. Robust enumeration of cell subsets from tissue expression berth K, Buus S, et al. Reliable prediction of T-cell epitopes
profiles. Nat Methods 2015;12:453–7. using neural networks with novel sequence representations.
[141] Sun R, Limkin EJ, Vakalopoulou M, Dercle L, Champiat S, Han Protein Sci 2003;12:1007–17.
SR, et al. A radiomics approach to assess tumour-infiltrating [160] Buus S, Stryhn A, Winther K, Kirkby N, Pedersen LO.
CD8 cells and response to anti-PD-1 or anti-PD-L1 immunother- Receptor-ligand interactions measured by an improved spun
apy: an imaging biomarker, retrospective multicohort study. column chromatography technique. a high efficiency and high
Lancet Oncol 2018;19:1180–91. throughput size separation method. Biochim Biophys Acta
[142] Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, et al. 1995;1243:453–60.
Spatial organization and molecular correlation of tumor-infil- [161] Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M.
trating lymphocytes using deep learning on pathology images. NetMHCpan-4.0: Improved peptide–MHC class I interaction
Cell Rep 2018;23:181–93. predictions integrating eluted ligand and peptide binding affinity
[143] Hou L, Nguyen V, Kanevsky AB, Samaras D, Kurc TM, Zhao data. J Immunol 2017;199:3360–8.
T, et al. Sparse autoencoder for unsupervised nucleus detection [162] Nielsen M, Lundegaard C, Blicher T, Lamberth K, Harndahl M,
and representation in histopathology images. Pattern Recognit Justesen S, et al. NetMHCpan, a method for quantitative
2019;86:188–200. predictions of peptide binding to any HLA-A and -B locus
[144] Xu Y, Jia ZP, Ai Y, Zhang F, Lai M, Chang EIC. Deep protein of known sequence. PLoS One 2007;2:e796.
convolutional activation features for large scale brain tumor [163] Ye L, Creaney J, Redwood A, Robinson B. The current lung
histopathology image classification and segmentation. IEEE Int cancer neoantigen landscape and implications for therapy. J
Conf Acoust Spee Signal Process 2015:947–51. Thorac Oncol 2021;16:922–32.
[145] Noh H, Hong S, Han B. Learning deconvolution network for [164] Gong L, He R, Xu Y, Luo T, Jin K, Yuan W, et al. Neoantigen
semantic segmentation. IEEE Int Conf Comp Vis 2015:1520–8. load as a prognostic and predictive marker for stage II/III non-
[146] De Mattos-Arruda L, Vazquez M, Finotello F, Lepore R, Porta small cell lung cancer in chinese patients. Thorac Cancer
E, Hundal J, et al. Neoantigen prediction and computational 2021;12:2170–81.
perspectives towards clinical benefit: Recommendations from the [165] Zhang W, Yin Q, Huang H, Lu J, Qin H, Chen S, et al. Personal
ESMO precision medicine working group. Ann Oncol neoantigens from patients with NSCLC induce efficient antitu-
2020;31:978–90. mor responses. Front Oncol 2021;11:628456.
866 Genomics Proteomics Bioinformatics 20 (2022) 850–866

[166] Zou L, Yu S, Meng T, Zhang Z, Liang X, Xie Y. A technical [187] Zhang YD, Dong Z, Wang SH, Yu X, Yao X, Zhou Q, et al.
review of convolutional neural network-based mammographic Advances in multimodal data fusion in neuroimaging: overview,
breast cancer diagnosis. Comput Math Methods Med challenges, and novel orientation. Inf Fusion 2020;64:149–87.
2019;2019:6509357. [188] Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E,
[167] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature Mauck 3rd WM, et al. Comprehensive integration of single-cell
2015;521:436–44. data. Cell 2019;177:1888–902.
[168] Mao C, Yao L, Luo Y. ImageGCN: multi-relational image [189] Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-
graph convolutional networks for disease identification with omics data integration, interpretation, and its application.
chest X-rays. IEEE Trans Med Imaging 2022;41:1990–2003. Bioinform Biol Insights 2020;14:1177932219899051.
[169] Mao C, Yao L, Pan Y, Zeng Z, Luo Y. Deep generative [190] Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C,
classifiers for thoracic disease diagnosis with chest X-ray images. Macosko EZ. Single-cell multi-omic integration compares and
Proceedings (IEEE Int Conf Bioinformatics Biomed) contrasts features of brain cell identity. Cell 2019;177:1873–87.
2018;2018:1209–14. [191] Luo Y, Eran A, Palmer N, Avillach P, Levy-Moonshine A,
[170] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Szolovits P, et al. A multidimensional precision medicine
Unterthiner T, et al. An image is worth 16  16 words: approach identifies an autism subtype characterized by dyslipi-
transformers for image recognition at scale. arXiv demia. Nat Med 2020;26:1375–9.
2020;2010.11929. [192] Diao L, Guo H, Zhou Y, He Y. Bridging the gap between
[171] Khan S, Naseer M, Hayat M, Zamir SW, Khan F, Shah M. outputs: domain adaptation for lung cancer IHC segmentation.
Transformers in vision: a survey. arXiv 2022;2101.01169. IEEE Int Conf Image Process 2021:6–10.
[172] Boesch G. Vision transformers (ViT) in image recognition - 2022 [193] Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Buttner
guide [Internet]. https://viso.ai/deep-learning/vision-transformer- M, Wagenstetter M, et al. Mapping single-cell data to reference
vit/. atlases by transfer learning. Nat Biotechnol 2022;40:121–30.
[173] Hie B, Zhong ED, Berger B, Bryson B. Learning the language of [194] Arrieta AB, Diaz-Rodriguez N, Ser JD, Bennetot A, Tabik S,
viral evolution and escape. Science 2021;371:284–8. Barbado A, et al. Explainable artificial intelligence (XAI):
[174] Flagel L, Brandvain Y, Schrider DR. The unreasonable effec- concepts, taxonomies, opportunities and challenges toward
tiveness of convolutional neural networks in population genetic responsible AI. In Fusion 2020;58:82–115.
inference. Mol Biol Evol 2019;36:220–38. [195] Salahuddin Z, Woodruff HC, Chatterjee A, Lambin P. Trans-
[175] Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep parency of deep neural networks for medical image analysis: a
generative modeling for single-cell transcriptomics. Nat Methods review of interpretability methods. Comput Biol Med 2021;140:
2018;15:1053–8. 105111.
[176] Hashim S, Ali M, Nandakumar K, Yaqub M. SubOmiEmbed: [196] Antoniadi AM, Du Y, Guendouz Y, Wei L, Mazo C, Becker BA,
self-supervised representation learning of multi-omics data for et al. Current challenges and future opportunities for XAI in
cancer type classification. 10th Int Conf Bioinform Comput Biol machine learning-based clinical decision support systems: a
2022:66–72. systematic review. Appl Sci 2021;11:5088.
[177] Lee J, Hyeon DY, Hwang D. Single-cell multiomics: technolo- [197] Kourou K, Exarchos KP, Papaloukas C, Sakaloglou P, Exar-
gies and data analysis methods. Exp Mol Med 2020;52: chos T, Fotiadis DI. Applied machine learning in cancer
1428–42. research: a systematic review for patient diagnosis, classification
[178] Rao A, Barkley D, Franca GS, Yanai I. Exploring tissue and prognosis. Comput Struct Biotechnol J 2021;19:5546–55.
architecture using spatial transcriptomics. Nature [198] Maier-Hein L, Reinke A, Godau P, Tizabi MD, Christodoulou
2021;596:211–20. E, Glocker B, et al. Metrics reloaded: pitfalls and recommenda-
[179] Li Y, Liu Q, Zeng Z, Luo Y. Using an unsupervised clustering tions for image analysis validation. arXiv 2022;2206.01653.
model to detect the early spread of SARS-CoV-2 worldwide. [199] Meropol NJ, Donegan J, Rich AS. Progress in the application of
Genes 2022;13:648. machine learning algorithms to cancer research and care. JAMA
[180] Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-seq Netw Open 2021;4:e2116063.
data with a model-based deep learning approach. Nat Mach [200] Sheehan DF, Criss SD, Chen YF, Eckel A, Palazzo L,
Intell 2019;1:191–8. Tramontano AC, et al. Lung cancer costs by treatment strategy
[181] Hu J, Li X, Hu G, Lyu Y, Susztak K, Li M. Iterative transfer and phase of care among patients enrolled in medicare. Cancer
learning with neural network for clustering and cell type Med 2019;8:94–103.
classification in single-cell RNA-seq analysis. Nat Mach Intell [201] Mao C, Yao L, Luo Y. MedGCN: medication recommendation
2020;2:607–18. and lab test imputation via graph convolutional networks. J
[182] Brbic M, Zitnik M, Wang S, Pisco AO, Altman RB, Darmanis S, Biomed Inform 2022;127:104000.
et al. MARS: discovering novel cell types across heterogeneous [202] Zhu J, Wang J, Wang X, Gao M, Guo B, Gao M, et al.
single-cell experiments. Nat Methods 2020;17:1200–6. Prediction of drug efficacy from transcriptional profiles with
[183] Shen H, Li Y, Feng M, Shen X, Wu D, Zhang C, et al. Miscell: deep learning. Nat Biotechnol 2021;39:1444–52.
an efficient self-supervised learning approach for dissecting [203] Luo Y, Wunderink RG, Lloyd-Jones D. Proactive vs reactive
single-cell transcriptome. iScience 2021;24:103200. machine learning in health care: lessons from the COVID-19
[184] Song Q, Su J, Zhang W. scGCN is a graph convolutional pandemic. JAMA 2022;327:623–4.
networks algorithm for knowledge transfer in single cell omics. [204] You HS, Gao CX, Wang HB, Luo SS, Chen SY, Dong YL, et al.
Nat Commun 2021;12:3826. Concordance of treatment recommendations for metastatic non-
[185] Wang T, Shao W, Huang Z, Tang H, Zhang J, Ding Z, et al. small-cell lung cancer between watson for oncology system and
Mogonet integrates multi-omics data using graph convolutional medical team. Cancer Manag Res 2020;12:1947–58.
networks allowing patient classification and biomarker identifi- [205] Liu C, Liu X, Wu F, Xie M, Feng Y, Hu C. Using artificial
cation. Nat Commun 2021;12:3445. intelligence (watson for oncology) for treatment recommenda-
[186] Wang Y, Zhang Z, Chai H, Yang Y. Multi-omics cancer tions amongst chinese patients with lung cancer: feasibility study.
prognosis analysis based on graph convolution network. J Med Internet Res 2018;20:e11087.
Proceedings (IEEE Int Conf Bioinformatics Biomed)
2021;2021:1564–8.

You might also like