Genomics Proteomics Bioinformatics 20 (2022) 850–866

Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis
Yawei Li, Xin Wu, Ping Yang, Guoqian Jiang, Yuan Luo

Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA

Received 4 March 2022; revised 3 October 2022; accepted 17 November 2022

Available online 1 December 2022

Handled by Feng Gao

KEYWORDS Omics dataset; Imaging dataset; Feature extraction; Prediction; Immunotherapy

Abstract The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.

Introduction

Lung cancer is one of the most frequently diagnosed cancers and the leading cause of cancer deaths worldwide. About 2.20 million new patients are diagnosed with lung cancer each year [1], and 75% of them die within five years of diagnosis [2]. High intra-tumor heterogeneity (ITH) and complexity of cancer cells giving rise to drug resistance make cancer treatment more challenging [3]. Over the past decades, the continuous evolution of technologies in cancer research has contributed to many large collaborative cancer projects, which have generated numerous clinical, medical imaging, and sequencing databases [4–6]. These databases facilitate researchers in investigating comprehensive patterns of lung cancer from diagnosis, treatment, and responses to clinical outcomes [7]. In particular, current studies on -omics analysis, such as genomics, transcriptomics, proteomics, and metabolomics, have expanded our tools and capabilities for research. Cancer studies are undergoing a shift toward the integration of multiple data types and mega sizes. However, using diverse and high-
Li Y et al / Machine Learning Applications in Lung Cancer 851

databases poses a major challenge to researchers. Therefore, using machine learning (ML) models to automatically learn the internal characteristics of different data types to assist physicians’ decision-making has become increasingly important.

ML is a subgroup of artificial intelligence (AI) that focuses on making predictions by identifying patterns in data using mathematical algorithms [12]. It has served as an assisting tool in cancer phenotyping and therapy for decades [13–19], and has been widely implemented in advanced approaches for early detection, cancer type classification, signature extraction, tumor microenvironment (TME) deconvolution, prognosis prediction, and drug response evaluation [20–27]. Herein, we present an overview of the main ML algorithms that have been used to integrate complex biomedical data (e.g., imaging or sequencing data) for different aspects of lung cancer (Figure 1; Tables S1 and S2), and outline major challenges and opportunities for future applications of ML in lung cancer clinical research and practice. We hope that this review promotes a better understanding of the roles and potentialities of ML in this field.

Apply ML for early detection and auxiliary diagnosis of lung cancer

ML on early detection and diagnosis using medical imaging

Early diagnosis is an important procedure for reducing deaths related to lung cancer. Chest screening using low-dose computed tomography (CT) is the primary approach for the surveillance of people with increased lung cancer risk. To promote diagnostic efficiency, the computer-aided diagnosis (CAD) system was developed to assist physicians in the interpretation of medical imaging data [28,29], which has been demonstrated as a useful second opinion for physicians [30]. The traditional feature-based CAD task can be broken into three steps: nodule segmentation, feature extraction and selection, and clinical judgment inference (classification) (Figure 2). Some approaches apply the measured texture features of specified nodules in CT images combined with the patient’s clinical variables as input features to train an ML classifier, including logistic regression (LR) [31–33] or linear discriminant analysis (LDA) [34], for malignancy risk estimation. Typically, these measurements include nodule size, nodule type, nodule location, nodule count, nodule boundary, and emphysema information in CT images, and the clinical variables include the patient’s age, gender, specimen collection timing, family history of lung cancer, smoking exposure, and more. However, these features are mostly subjective and arbitrarily defined, and usually fail to achieve a complete and quantitative description of malignant nodule appearances.

With the development of deep learning (DL) algorithms, especially convolutional neural networks (CNNs), more studies have been conducted to apply DL-based models in the CAD system to improve its accuracy and reduce its false positive rate and execution time during lung tumor detection (Table 1) [35,36]. Similar to the feature-based CAD system, the workflow of these models usually consists of three steps: nodule detection and segmentation, nodule feature extraction, and clinical judgment inference [37]. Compared with traditional feature-based CAD systems, the DL-based CAD system can automatically retrieve and extract intrinsic features of a suspicious nodule [38,39], and can model the 3D shape of a nodule (Figure 2). For example, Ciompi et al. [40] designed a model based on OverFeat [41,42] by extracting three 2D-view-feature vectors (axial, coronal, and sagittal) of the nodule from CT scans. The recently integrated CNN models facilitate a global and comprehensive inspection of nodules for feature characterization from CT images. Buty et al. [37] designed a complementary CNN model, where a spherical harmonic model [43] for nodule segmentation was used to obtain the shape descriptions ("shape" feature) of the segmented nodule and a deep convolutional neural network (DCNN)-based model [41] to extract the texture and intensity features ("appearance" feature) of the nodule. The downstream classification relied on the combination of "shape" and "appearance" features. Similarly, Venkadesh et al. [44] used an ensemble model from two different models, 2D-ResNet50-based [45] and 3D-Inception-V1 [46], to respectively extract two features of a pulmonary nodule, and then concatenated the two features as the input features for classification. An advantage of the ensemble CNN model is that it can accurately identify malignant nodules from different sizes of nodules using the raw CT images. Benefiting from the features extracted from state-of-the-art CNN models, clinical judgment inference can be implemented through common ML techniques, including LR, random forest (RF), support vector machine (SVM), and neural networks (NNs). Notably, some studies also employed CNN models for final clinical judgment inference. Ardila et al. [47] proposed an end-to-end approach to systematically model both localization and lung cancer risk categorization tasks using the input CT data alone. Their approach was based on a combination of three CNN models: a Mask-RCNN [48] model for lung tissue segmentation, a modified RetinaNet [49] model for cancer region of interest (ROI) detection, and a full-volume model based on 3D-inflated Inception-V1 [50,51] for malignancy risk prediction.

In addition to CT images, CNN-based models are also widely used in histological imaging to help with lung cancer diagnosis. Compared with CT imaging, histological imaging can provide more biological information about cancer at the cellular level. To this end, AbdulJabbar et al. [52] used the Micro-Net [53] model to identify tissue boundaries followed by an SC-CNN [54] model to segment individual cells from hematoxylin and eosin (H&E)-stained and immunohistochemistry (IHC) images. The segmented cells were then applied for cell type classification to evaluate the proportions of each cell type in the images. This model helps to identify the differential evolution and immune evasion mechanisms between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) with high resolution. Another study [55] utilized the Inception-V3 network [51] to classify whether the tissue was LUAD, LUSC, or normal from H&E-stained histopathology whole-slide images. A highlight of this study is that the model can also predict whether a given tissue has somatic mutations in several lung cancer driver genes, including STK11, EGFR, FAT1, SETBP1, KRAS, and TP53. Note that considering the high complexity and large resources of the datasets, some studies utilized transfer learning to improve their efficiency and robustness when training new models [38,55].

Though these ML algorithms are already widely used in CAD, the challenge is that only a limited number of the images

Figure 2 Feature-based CAD and DL-based CAD systems

Differences in the development process of feature-based CAD systems and CNN-based CAD systems. Compared with feature-based CAD
systems, the DL-based CAD systems can automatically retrieve and extract intrinsic features of a suspicious nodule. CNN, convolutional
neural network; LR, logistic regression; SVM, support vector machine; RF, random forest.
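
As a rough illustration of the feature-based branch in Figure 2, the sketch below fits a logistic regression malignancy-risk model on hand-measured nodule features. The eight nodules, the two features (nodule diameter and family history), and all function names are hypothetical and chosen only for illustration; the sketch does not reproduce the cited models [31–33].

```python
import math

# Hypothetical toy data (not from any cited study):
# each row is (nodule diameter in cm, family history of lung cancer as 0/1).
FEATURES = [(0.4, 0), (0.5, 1), (0.6, 0), (0.7, 1),
            (1.5, 0), (1.8, 1), (2.0, 0), (2.5, 1)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = malignant, 0 = benign


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_risk(weights, bias, x):
    """Estimated malignancy probability for one feature vector."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def fit_logistic_regression(xs, ys, lr=0.1, epochs=2000):
    """Plain batch gradient descent on the mean log-loss."""
    n, p = len(xs), len(xs[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(epochs):
        # Prediction errors are computed once per epoch, before any update.
        errors = [predict_risk(weights, bias, x) - y for x, y in zip(xs, ys)]
        for j in range(p):
            grad_j = sum(e * x[j] for e, x in zip(errors, xs)) / n
            weights[j] -= lr * grad_j
        bias -= lr * sum(errors) / n
    return weights, bias


weights, bias = fit_logistic_regression(FEATURES, LABELS)
small_nodule_risk = predict_risk(weights, bias, (0.5, 0))
large_nodule_risk = predict_risk(weights, bias, (2.2, 0))
print(f"risk(5 mm) = {small_nodule_risk:.2f}, risk(22 mm) = {large_nodule_risk:.2f}")
```

In a DL-based CAD system, the hand-measured feature tuples would instead be replaced by feature vectors produced by a CNN, while the final classifier could remain a model of this kind.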

are labeled. Training a complex CNN model using a limited number of training sets may result in overfitting. Recently, generative adversarial network (GAN)-based models have been used to improve the performance of discriminative classifiers by generating pseudo images [56]. Chuquicusma et al. [57] first employed a deep convolutional GAN (DCGAN) [58] model to generate synthetic lung nodule CT scans. Building on their work, more recent studies have integrated the GAN models with other CNN models to address the overfitting problem in lung cancer classification. Lin et al. [59] used a two-step model: a DCGAN to generate synthetic lung cancer images and an AlexNet [41] for lung cancer classification using both original and synthetic datasets. Similar work was also done by Ren and colleagues [60]. They also used DCGAN [58] for data augmentation. To improve performance, they then designed a regularization-enhanced transfer learning model called VGG-DF for data discrimination to prevent overfitting problems with pre-trained model auto-selection.

ML on early detection and diagnosis using -omics sequencing datasets

Although periodic medical imaging tests are recommended for high-risk populations, implementation has been complicated by a high false discovery rate [61,62]. Therefore, there is a critical need for new techniques in early detection of lung cancers. Recent sequencing technologies enable diverse methods for early detection of lung cancer [63]. In the meantime, accurately classifying lung cancer subtypes is crucial in guiding optimal therapeutic decision-making. LUAD (~45%) and LUSC (~25%) are the two most common subtypes of lung cancer but are often treated similarly except for targeted therapy [64]. However, studies have indicated that LUAD and LUSC have drastically different biological signatures, and they have suggested that LUAD and LUSC should be classified and treated as different cancers [65,66]. From a computational perspective, both early detection and subtype identification are part of the classification task. Previous ML studies have shown the efficiency and advancement of early detection and cancer type classification in large pan-cancer sequencing datasets [67–75], which may provide evidence for lung cancer diagnosis. It is known that cancer cells are characterized by many genetic variations, and the accumulation of these genetic variations can be signatures that document the mutational patterns of different cancer types [3,5,76,77]. For this reason, recent studies have concentrated on extracting better genomic signatures as input features to boost the accuracy of their ML models.

For early detection, blood-based liquid biopsy, including

Figure 1 Applications of ML model in lung cancer
We presented an overview of ML methodologies for different aspects of lung cancer therapies, including CAD from imaging datasets, lung
cancer early detection based on sequencing technologies, data integration and biomarker extraction from multi-omics datasets, treatment
response and prognosis prediction, and immunotherapy studies. ML, machine learning; IC50, half-maximal inhibitory concentration;
HLA, human leukocyte antigen; CT, computed tomography; MALDI, matrix-assisted laser desorption/ionization; DL, deep learning;
cfDNA, cell-free DNA; CAD, computer-aided diagnosis; CNV, copy number variation; RECIST, Response Evaluation Criteria in Solid
Tumors; TIL, tumor-infiltrating lymphocyte.
Table 1 Publications relevant to ML on early detection and diagnosis using imaging data
Publication Feature extraction Classification model Sample size Imaging data type Performance Validation method Feature selection/input Highlight/advantage Shortcoming

McWilliams et al. [31] NA LR 2961 CT images AUC (0.907–0.960) Hold-out Clinical risk factors + nodule Using the extracted feature as input, the The selection of nodule
characteristics on CT images classifier can achieve high AUC in small characteristics affects the
nodules (< 10 mm) predictive performance of the
Riel et al. [32] NA LR 300 CT images AUC (0.706–0.932) Hold-out Clinical factors + nodule The classifier can perform equivalently The performance heavily relies
characteristics on CT images as human observers for malignant and on nodule size as the
benign classification discriminator, and is not robust
in small nodules
Kriegsmann et al. [34] NA LDA 326 MALDI Accuracy (0.991) Hold-out Mass spectra from ROIs of MALDI The model maintains high accuracy on The performance relies on the
image FFPE biopsies quality of the MALDI


Buty et al. [37] Spherical harmonics [44]; RF 1018 CT images Accuracy (0.793–0.824) 10-fold cross-validation CT imaging patches + radiologists’ The model reaches higher predictive No benchmarking comparisons
DCNN [41] binary nodule segmentations accuracy by integrating shape and were used in the study
appearance nodule imaging features
Hussein et al. [38] 3D CNN-based multi-task model 3D CNN-based multi-task 1018 CT images Accuracy (0.9126) 10-fold cross-validation 3D CT volume feature The model achieves higher accuracy The ground truth scores defined
model than other benchmarked models by radiologists for the
benchmark might be arbitrary
Khosravan et al. [39] 3D CNN-based multi-task model 3D CNN-based multi-task 6960 CT images Segmentation DSC (0.91); 10-fold cross-validation 3D CT volume feature The model integration of clustering and Segmentation might fail if the
model classification accuracy (0.97) sparsification algorithms helps to ROIs are outside the lung
accurately extract potential attentional regions
Ciompi et al. [40] OverFeat [42] SVM; RF 1729 CT images AUC (0.868) 10-fold cross-validation 3D CT volume feature, nodule position This is the first study attempting to The model requires specifying
coordinate, and maximum diameter classify whether the diagnosed nodule is the position and diameter of the
benign or malignant nodule as input, but many
nodules could not be located on
the CT images
Venkadesh et al. [44] 2D-ResNet50-based [45]; An ensemble model based 16,429 CT images AUC (0.86–0.96) 10-fold cross-validation 3D CT volume feature and nodule The model achieves higher AUC than The model requires specifying
3D-Inception-V1 [46] on two CNN models coordinates other benchmarked models the position of the nodule, but
many nodules are unable to be
located on the CT images
Ardila et al. [47] Mask-RCNN [48]; Mask-RCNN [48]; 14,851 CT images AUC (0.944) Hold-out Patient’s current and prior (if available) The model achieves higher AUC than The training cohort is from only
RetinaNet [49]; RetinaNet [49]; 3D CT volume features radiologists when samples do not have one dataset, although the sample
3D-inflated Inception-V1 [50,51] 3D-inflated prior CT images size is large
Inception-V1 [50,51]
AbdulJabbar et al. [52] Micro-Net [53]; SC-CNN [54] An ensemble model based 100 Histological images Accuracy (0.913) Hold-out Image features of H&E-stained tumor The model can annotate cell types at the The annotation accuracy is
on SC-CNN [54] section histological slides single-cell level using histological affected by the used reference
images only dataset
Coudray et al. [55] Multi-task CNN model based on Multi-task CNN model 1634 Histological images AUC (0.733–0.856) Hold-out Transformed 512 × 512-pixel tiles from The model can predict whether a given The accuracy of the gene
Inception-V3 [51] based on Inception-V3 nonoverlapping ‘patches’ of the whole- tissue has somatic mutations in genes mutation prediction is not very
network [51] slide images STK11, EGFR, FAT1, SETBP1, high
KRAS, and TP53
Lin et al. [59] DCGAN [58] + AlexNet [41] DCGAN [58] + 22,489 CT images Accuracy (0.9986) Hold-out Initial + synthetic CT images The model uses GAN to generate No benchmarking comparisons
AlexNet [41] synthetic lung cancer images to reduce were used
Ren et al. [60] DCGAN [58] + VGG-DF DCGAN [58] + 15,000 Histopathological images Accuracy (0.9984); Hold-out Initial + synthetic histopathological The model uses GAN to generate The dimension of images by
VGG-DF F1-score (99.84%) images synthetic lung cancer images and a generator (64 × 64) is not
regularization-enhanced model to sufficient for biomedical domain
reduce overfitting

Note: ML, machine learning; NA, not applicable; LR, logistic regression; AUC, area under the curve; CT, computed tomography; LDA, linear discriminant analysis; MALDI, matrix-assisted laser
desorption/ionization; ROI, region of interest; FFPE, formalin-fixed paraffin-embedded; CNN, convolutional neural network; DSC, dice similarity coefficient; SVM, support vector machine; RF,
random forest; DCNN, deep convolutional neural network; SC-CNN, spatially constrained convolutional neural network; DCGAN, deep convolutional generative adversarial network; RCNN,
Region-CNN; H&E, hematoxylin and eosin; 2D, two dimensional; 3D, three dimensional. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between
possible splits in training, validation, and test data. However, cross-validation is more time consuming than using the simple holdout method.
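
The trade-off described in the note above can be made concrete with a small sketch that scores the same classifier once with a single hold-out split and once with 5-fold cross-validation. The one-dimensional synthetic data and the nearest-centroid classifier are assumptions made purely for illustration and are unrelated to the studies in Table 1.

```python
import random

random.seed(0)

# Hypothetical 1-D toy data: one noisy "feature" per sample, two classes.
samples = [(random.gauss(0.0, 1.0), 0) for _ in range(50)] + \
          [(random.gauss(2.0, 1.0), 1) for _ in range(50)]
random.shuffle(samples)


def fit_centroids(train):
    """Nearest-centroid classifier: store the mean feature of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def accuracy(centroids, test):
    """Fraction of test samples assigned to the class with the closest centroid."""
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test)


def hold_out_score(data, test_fraction=0.3):
    """One fixed split: a single model fit, but the estimate depends on that split."""
    n_test = int(len(data) * test_fraction)
    test, train = data[:n_test], data[n_test:]
    return accuracy(fit_centroids(train), test)


def k_fold_scores(data, k=5):
    """k model fits: every sample is used for testing exactly once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(fit_centroids(train), folds[i]))
    return scores


hold_out = hold_out_score(samples)
cv_scores = k_fold_scores(samples)
cv_mean = sum(cv_scores) / len(cv_scores)
print(f"hold-out accuracy: {hold_out:.2f}")
print(f"5-fold accuracies: {[round(s, 2) for s in cv_scores]}, mean {cv_mean:.2f}")
```

The spread of the five fold accuracies gives a sense of the split-to-split variance that a single hold-out number hides, at the cost of fitting the model five times instead of once.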

Figure 3 Omics analysis in lung cancer studies

Different sequencing techniques allow for the simultaneous measurement of multiple molecular features of a biological sample. To
improve efficiency and reduce overfitting, statistical and ML tools perform differential analysis or feature selection. Further ML models
concatenate the obtained omics features with clinical features as input for lung cancer diagnostic/prognostic prediction. DEG,
differentially expressed gene; RFE, recursive feature elimination; UAF, univariate association filtering.
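
The workflow in Figure 3 (select a small set of omics features, then concatenate them with clinical features) can be sketched as follows. This is a minimal illustration under stated assumptions: the expression matrix, the two "informative" genes, and the clinical covariates are synthetic, and the filter is a simple per-gene Welch t-statistic ranking, which stands in for a univariate filter in general rather than for the specific UAF or RFE procedures used in the cited studies.

```python
import random
import statistics

random.seed(1)

N_PER_CLASS, N_GENES, TOP_K = 20, 30, 3
INFORMATIVE = {4, 17}  # hypothetical genes whose expression shifts in tumors

# Hypothetical expression matrix: rows are samples, columns are genes.
labels = [0] * N_PER_CLASS + [1] * N_PER_CLASS
expression = [
    [random.gauss(3.0 if (y == 1 and g in INFORMATIVE) else 0.0, 1.0)
     for g in range(N_GENES)]
    for y in labels
]
# Hypothetical clinical covariates per sample: (age in decades, smoker 0/1).
clinical = [(random.uniform(4, 8), random.randint(0, 1)) for _ in labels]


def welch_t(values_a, values_b):
    """Welch's t-statistic between two groups of expression values."""
    var_a, var_b = statistics.variance(values_a), statistics.variance(values_b)
    se = (var_a / len(values_a) + var_b / len(values_b)) ** 0.5
    return (statistics.mean(values_a) - statistics.mean(values_b)) / se


def rank_genes(matrix, y):
    """Score every gene independently (a univariate filter) and sort by |t|."""
    scores = []
    for g in range(len(matrix[0])):
        tumor = [row[g] for row, lab in zip(matrix, y) if lab == 1]
        normal = [row[g] for row, lab in zip(matrix, y) if lab == 0]
        scores.append((abs(welch_t(tumor, normal)), g))
    return [g for _, g in sorted(scores, reverse=True)]


selected = sorted(rank_genes(expression, labels)[:TOP_K])
# Concatenate the kept omics features with the clinical features per sample.
model_input = [
    [row[g] for g in selected] + list(clin)
    for row, clin in zip(expression, clinical)
]
print("selected genes:", selected)
print("input width:", len(model_input[0]))
```

In practice the selection step would be fitted on the training portion of each split only, since ranking genes on the full dataset leaks label information into the evaluation.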

cell-free DNA (cfDNA) fragments, circulating tumor DNA (ctDNA), microRNA (miRNA), methylation, exosomes, and circulating tumor cells (CTCs), to explore potential circulating tumor signatures is considered a reliable method [63] (Figure 3). Integrating these liquid biopsy signatures, many discriminative models (SVM, RF, and LR) have been used to detect tumors with high discovery rates [78–81]. For lung cancer subtype classification, somatic mutations, including single-nucleotide variants (SNVs), insertions, and deletions, usually have specific cancer type profiles [82]. Thus, studies have leveraged somatic mutations as input features to train classifiers for LUAD–LUSC classification [83]. Many of these mutations, especially driver mutations, can change expression levels, which impact gene function and interrupt cellular signaling processes [82]. As a result, different cancer types show different expression levels of certain proteins [84,85]. Owing to these unique expression profiles of each cancer type, ML models can leverage RNA sequencing as input data to categorize the malignancy (benign or malignant) and subtypes (LUAD or LUSC) of patients [86–89]. Similarly, copy number variation (CNV) is reported to be highly correlated with differential gene expression [90], and can be ubiquitously detected in cancer cells. As such, CNVs can also be used to train ML models for cancer type classification in lung cancer studies [81,91,92]. Note that Daemen et al. [92] proposed a recurrent hidden Markov model (HMM) for the identification of extended chromosomal regions of altered copy numbers, which offers high accuracy for classification. More recently, Jurmeister et al. [93] used DNA methylation profiles as input features to determine if the detected malignant nodule is primary lung cancer or the metastasis of another cancer.

Directly using all generated genes as an input feature may result in overfitting [94]. Thus, many studies used different computational approaches to select multiple cancer-associated genes to enhance their ML models (Figure 3). Some studies used ML-based algorithms for feature selection. For example, Liang et al. [80] and Whitney et al. [86] employed the least absolute shrinkage and selection operator (LASSO) method to select the optimal markers for model training; Aliferis et al. [89] utilized recursive feature elimination (RFE) [95] and univariate association filtering (UAF) models to select highly cancer-associated genes. In addition, using unsupervised models for sample population subtype clustering, and then identifying each cluster’s marker genes is also seen in many studies [96,97]. Apart from ML-based models, some studies used statistical methods for feature selection. Raman et al. [81] designed a copy number profile abnormality (CPA) score to reinforce the CNV feature, which is more robust and less subject to variable sample quality than directly using CNVs as the input feature. Daemen et al. [92] integrated several statistical tests (ordinary fold changes, ordinary t-statistics, SAM-statistics, and moderated t-statistics) to select a robust differential expression gene set. Aside from these single-measured signatures, some studies [81,86,88] combined the -omics signatures with clinical signatures to achieve better results. Using these tumor-type specific -omics signatures, many algorithms, including K-nearest neighbors (KNN), naive Bayes (NB), SVM, decision tree (DT), LR, RF, LDA, gradient boosting, and NN, have demonstrated their ability to accurately detect and classify different lung cancer patterns (Table 2). Note that to improve the accuracy of ML models, Kobayashi et al. [83] added an element-wise input scaling for the NN model, which allows the model to maintain its accuracy with a small number of learnable parameters for optimization.
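
To illustrate why a LASSO penalty acts as a marker selector, the following minimal sketch solves a LASSO-penalized linear regression by cyclic coordinate descent on synthetic data. The data, the penalty value, and the use of a linear (rather than logistic) model are simplifying assumptions for illustration; this is not the pipeline of the cited studies [80,86].

```python
import random

random.seed(2)

N_SAMPLES, N_MARKERS = 60, 5

# Hypothetical marker matrix; only marker 0 truly drives the outcome.
X = [[random.gauss(0.0, 1.0) for _ in range(N_MARKERS)] for _ in range(N_SAMPLES)]
y = [2.0 * row[0] + random.gauss(0.0, 0.1) for row in X]


def soft_threshold(value, lam):
    """Shrink toward zero; values within [-lam, lam] become exactly zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w


weights = lasso_coordinate_descent(X, y, lam=0.2)
selected_markers = [j for j, wj in enumerate(weights) if wj != 0.0]
print("weights:", [round(wj, 3) for wj in weights])
print("selected markers:", selected_markers)
```

The soft-thresholding step is what sets weakly associated markers to exactly zero, so the surviving non-zero weights double as the selected feature set; the price is that the retained weight is shrunk below its true value.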
Table 2 Publications relevant to ML on early detection and diagnosis using sequencing data
Publication ML method Sample size Sequencing data type Performance Validation method Feature selection Highlight/advantage Shortcoming

Mathios et al. [78] LR model with a LASSO penalty 799 cfDNA fragment AUC (0.98) 10-fold cross-validation cfDNA fragment features, This study provides a framework for DNA variations in late-stage disease may
clinical risk factors, and CT imaging features combining cfDNA fragmentation profiles with affect cfDNA detection
other markers for lung cancer detection


Lung-CLiP [79] 5-nearest neighbor; 3-nearest neighbor; 160 cfDNA AUC (0.69–0.98) Leave-one-out cross-validation SNV + CNV features This study establishes an ML framework for Sampling bias exists (most are smokers) in the
NB; LR; DT the early detection of lung cancers using training dataset
Liang et al. [80] LR 296 ctDNA AUC (0.816) 10-fold cross-validation Nine DNA methylation markers This study establishes an ML framework for The selected features are comprised of only
the early detection of lung cancers using DNA nine methylation biomarkers, which poses a
methylation markers limitation on assay performance
Raman et al. [81] RF; SVM; LR with ridge, elastic net; 843 cfDNA mAUC (0.896–0.936) Leave-one-out cross-validation Copy number profiling of cfDNA The model provides a framework for using Feature selection methods can be used to
LASSO regularization copy number profiling of cfDNA as a reduce overfitting and may have the potential
biomarker in lung cancer detection to achieve higher AUC
Kobayashi et al. [83] Diet Networks with EIS 954 Somatic mutation Accuracy (0.8) 5-fold cross-validation SNVs, insertions, and deletions across 1796 The EIS helps to stabilize the training process The interpretable hidden interpretations
genes of Diet Networks obtained from EIS may vary between different
Whitney et al. [86] LR 299 RNA-seq of BECs AUC (0.81) 10-fold cross-validation Lung cancer-associated and clinical covariate The model keeps sensitivity for small and The selected genes vary greatly under different
RNA markers peripheral suspected lesions feature selection processes and parameters
Podolsky et al. [87] KNN; NB normal distribution of attributes; 529 RNA-seq AUC (0.91) Hold-out RNA-seq This study systematically compares different Feature selection methods can be used to
NB distribution through histograms; models of lung cancer subtype classification reduce overfitting
SVM; C4.5 DT across different datasets
Choi et al. [88] An ensemble model based on elastic net LR; 2285 RNA-seq of bronchial AUC (0.74) 5-fold cross-validation RNA-seq of 1232 genes with clinical The model integrates RNA-seq features and Sample sizes in certain subgroups are small
SVM; hierarchical LR brushing samples covariates clinical information to improve the accuracy and may cause unbalanced training
of risk prediction
Aliferis et al. [89] Linear SVM; polynomial-kernel SVM; KNN; NN 203 RNA-seq AUC (0.8783–0.9980) 5-fold cross-validation RNA-seq of selected genes using RFE and The study uses different gene selection The selected genes vary greatly across different
UAF algorithms to improve the classification training cohorts
Aliferis et al. [91] DT; KNN; linear SVM; polynomial-kernel SVM; 37 CNV measured by CGH Accuracy (0.892) Leave-one-out cross-validation Copy number of 80 selected genes based on The study systematically compares different The sample size is small
RBF-kernel SVM; NN linear SVM models of lung cancer subtype classification
Daemen et al. [92] HMM; weighted LS-SVM 89 CNV measured by CGH Accuracy (0.880–0.955) 10-fold cross-validation CNV measured by CGH The use of recurrent HMMs for CNV Benchmarked comparisons are needed to
detection provides high accuracy for cancer demonstrate the superiority of using the
classification HMM model
Jurmeister et al. [93] NN; SVM; RF 972 DNA methylation Accuracy (0.878–0.964) 5-fold cross-validation Top 2000 variable CpG sites The study provides a framework for using The model cannot accurately predict samples
DNA methylation data to predict tumor with low tumor cellularity through
metastases methylation data

Note: LASSO, least absolute shrinkage and selection operator; cfDNA, cell-free DNA; NB, naive Bayes; DT, decision tree; SNV, single-nucleotide variant; CNV, copy number variation; ctDNA,
circulating tumor DNA; mAUC, mean area under the curve; EIS, element-wise input scaling; BEC, bronchial epithelial cell; KNN, K-nearest neighbors; NN, neural network; RFE, recursive feature
elimination; UAF, univariate association filtering; CGH, comparative genomic hybridization; HMM, hidden Markov model; LS-SVM, least squares support vector machines; RNA-seq, RNA
sequencing. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between possible splits in training, validation, and test data. However, cross-validation is
more time consuming than using the simple holdout method.

Apply ML to lung cancer treatment response and survival prediction

Prognosis and therapy response prediction

Sophisticated ML models have acted as supplements for cancer intervention response evaluation and prediction [98,99], and have demonstrated advances in optimizing therapy decisions that improve chances of successful recovery (Figure 4; Table 3) [100,101]. There are several metrics that are available for evaluating cancer therapy response, including the Response Evaluation Criteria in Solid Tumors (RECIST) [102]. The definition of RECIST relies on imaging data, mainly CT and magnetic resonance imaging (MRI), to determine how tumors grow or shrink in patients [103]. To track the tumor volume changes from CT images, Jiang et al. [104] designed an integrated CNN model. Their CNN model used two deep networks based on a full-resolution residual network [105] model by adding multiple residual streams of varying resolutions, so that they could simultaneously combine features at different resolutions for segmenting lung tumors. Using the RECIST criterion, Qureshi [106] set up an RF model to predict the RECIST level under EGFR tyrosine kinase inhibitor (TKI) therapy given the patient’s mutation profile in gene EGFR. To improve the prediction performance, the model integrated clinical information, geometrical features, and energy features obtained from a patient’s EGFR mutant drug complex as input to train the classifiers. In a recent study, the authors defined a different metric, tumor proportional scoring (TPS) calculated as the percentage of tumor cells in digital pathology images, to evaluate the lung cancer treatment response [107]. They applied the Otsu threshold [108] with an auxiliary classifier generative adversarial network (AC-GAN) model to identify positive tumor cell regions (TC+) and negative tumor cell regions (TC−). They ultimately used the ratio between the pixel count of the TC+ regions and the pixel count of all detected tumor cell regions to evaluate the TPS number. Another study from Geeleher et al. [109] used half-maximal inhibitory concentration (IC50) to evaluate drug response. In their model, the authors applied a ridge regression model [110] to estimate IC50 values for different cell lines in terms of their whole-genome expression level. More recently, Quiros et al. [111] established a phenotype representation learning (PRL) through self-supervised learning and community detection for spatial clustering cell type annotation on histopathological images. Their clustering results can be further used for tracking histological tumor growth patterns and identifying tumor recurrence. Indeed, their model has also demonstrated good performance in the LUAD and LUSC classifications.

Survival prediction

Prognosis and survival prediction as a part of clinical oncology is a tough but essential task for physicians, as knowing the survival period can inform treatment decisions and benefit patients in managing costs [112–114]. For most of medical history, predictions relied primarily on the physician’s knowledge and experience based on prior patient histories and medical records. However, studies have indicated that physicians tend to perform poorly in predicting the prognosis and survival expectancy, often over-predicting survival time [115–117]. Statistical algorithms, such as the Cox proportional-hazards model [118], have been implemented to assist physicians’ prediction in many studies [119–122], but they are not particularly accurate [12]. As a comparison, ML has shown its potential to predict a patient’s prognosis and survival in genomic, transcriptomic, proteomic, radiomic, and other datasets (Figure 4; Table 3). Chen et al. [123] used 3-year survival as a threshold to split the patients into high-risk (survival time < 36 months) and low-risk (survival time > 36 months) groups, and then constructed an NN model to predict a patient’s risk as a binary outcome using their gene expression data and clinical variables. In their model, they tested four

Figure 4 Diagram of ML applications in treatment response and survival prediction

Table 3 Publications relevant to ML on treatment response and survival prediction
| Publication | Feature extraction method | Prediction model | Sample size | Data type | Performance | Validation method | Feature selection/input | Highlight/advantage | Shortcoming |
|---|---|---|---|---|---|---|---|---|---|
| Jiang et al. [104] | MRRN-based model | MRRN-based model | 1210 | CT images | DSC (0.68–0.75) | 5-fold cross-validation | 3D image features | The model can accurately track the tumor volume changes from CT images across multiple image resolutions | The model does not predict accurately enough when the tumor size is small |
| Qureshi [106] | NA | RF; SVM; KNN; LDA; CART | 201 | Molecular structure and somatic mutations of EGFR | Accuracy (0.975) | 10-fold cross-validation | 4 clinical features + 4 protein drug interaction features + 5 geometrical features | The model integrates multiple features for data training, and achieves better performance than other benchmarked models | Among the possible 594 EGFR mutations available in the COSMIC database, the model only considers the most common 33 EGFR mutations for model training |
| Kapil et al. [107] | AC-GAN | AC-GAN | 270 | Digital pathology images | Lcc (0.94); Pcc (0.95); MAE (8.03) | Hold-out | PD-L1-stained tumor section histological slides | The model achieves better performance than other benchmarked, fully supervised models | In the experiments, the use of PD-L1 staining for TPS evaluation may not be accurate enough |
| Geeleher et al. [109] | NA | Ridge regression model | 62 | RNA-seq | Accuracy (0.89) | Leave-one-out cross-validation | Removed low-variability genes | The model can accurately predict the drug response using RNA-seq profiles | The training sample size is small |
| Chen et al. [123] | Chi-square test + NN | NN | 440 | RNA-seq | Accuracy (0.83) | Hold-out | RNA-seq of 5 genes | The model uses multiple laboratory datasets for training to improve its robustness | The model doesn’t consider demographic and clinical features, which may affect the prediction |
| LUADpp [125] | Top genes with most significant mutation frequency difference | SVM | 371 | Somatic mutations | Accuracy (0.81) | 5-fold cross-validation | Somatic mutation features in 85 genes | The model can predict with high accuracy with only seven gene mutation features | Mutation frequency may be impacted by the sampling bias across datasets; LD may also affect the feature selection |
| Cho et al. [126] | Information gain; Chi-squared test; minimum redundancy maximum relevance; correlation | NB; KNN; SVM; DT | 471 | Somatic mutations | Accuracy (0.68–0.88) | 5-fold cross-validation | Somatic mutation features composed of 19 genes | To improve performance, the model uses four different methods for feature selection | The training cohort consists of only one dataset |
| Yu et al. [128] | Information gain ratio; hierarchical clustering | RF | 538 | Multi-omics (histology, pathology reports, RNA, proteomics) | AUC (> 0.8) | Leave-one-out cross-validation | 15 gene set features | The study uses an integrative omics-pathology model to improve the accuracy in predicting patients’ | Cox models may be overfitted in multiple-dimension data |
| Asada et al. [130] | Autoencoder + Cox-PH + K-means + ANOVA | SVM | 364 | Multi-omics (miRNA, mRNA) | Accuracy (0.81) | Hold-out | 20 miRNAs + 25 mRNAs | The study uses ML algorithms to systematically model feature extraction from multi-omics datasets | The model does not consider the impact of clinical and demographic variances in data training |
| Takahashi et al. [131] | Autoencoder + Cox-PH + K-means + XGBoost/LightGBM | LR | 483 | Multi-omics (mRNA, somatic mutation, CNV, methylation, RPPA) | AUC (0.43–0.99 under different omics data) | Hold-out | 12 mRNAs, 3 miRNAs, 3 methylations, 5 CNVs, 3 somatic mutations, and 3 RPPA | The study uses ML algorithms to systematically model feature extraction from multi-omics datasets | The datasets collected in this study contain uncommon samples between different omics datasets, which may cause bias in model evaluation |
| Wiesweg et al. [136] | Lasso regression | SVM | 122 | RNA-seq | Significant hazard ratio differences | Hold-out | 7 genes from feature selection model + 25 cell type-specific genes | The ML-based feature extraction model performs better than using any single immune marker for immunotherapy response prediction | The metrics used in this study do not give an intuitive sense of performance; accuracy or AUC may be better |
| Trebeschi et al. [137] | LR; RF | LR; RF | 262 | CT imaging | AUC (0.76–0.83) | Hold-out | 10 radiographic features | The model can extract potential predictive CT-derived radiomic biomarkers to improve immunotherapy response prediction | The predictive performance between different cancer types is not robust |
| Saltz et al. [142] | CAE [143] | VGG16 [144] + DeconvNet [145] | 4612 (13 cancer types) | Histological images | AUC (0.9544) | Hold-out | Image features of H&E-stained tumor section histological slides | The model outperforms pathologists and other benchmarked models | The predictive performance between different cancer types is not robust |

Note: MRRN, multiple resolution residually connected network; CART, classification and regression trees; AC-GAN, auxiliary classifier generative adversarial networks; Lcc, Lin’s concordance coefficient; Pcc, Pearson correlation coefficient; MAE, mean absolute error; TPS, tumor proportional scoring; LD, linkage disequilibrium; Cox-PH, Cox proportional-hazards; ANOVA, analysis of variance; miRNA, microRNA; RPPA, reverse phase protein array; CAE, convolutional autoencoder; mRNA, messenger RNA; PD-L1, programmed cell death 1 ligand 1; COSMIC, the Catalogue Of Somatic Mutations In Cancer; EGFR, epidermal growth factor receptor. Compared with hold-out, cross-validation is usually more robust, and accounts for more variance between possible splits in training, validation, and test data. However, cross-validation is more time-consuming than the simple hold-out method.
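
The contrast drawn in the note between hold-out and cross-validation can be illustrated with a small sketch. It is not taken from any of the studies in Table 3: the one-feature cohort is synthetic, the nearest-class-mean classifier is a toy stand-in for the listed models, and the function names (`holdout_split`, `kfold_splits`, `evaluate`) are hypothetical. The point is only that a hold-out estimate depends on a single split, whereas k-fold cross-validation tests every sample once and exposes the spread across splits, at the cost of training the model k times.

```python
import random
from statistics import mean, stdev


def holdout_split(n, test_fraction=0.2, seed=0):
    """Return (train, test) index lists for a single random hold-out split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def kfold_splits(n, k=5, seed=0):
    """Return k (train, test) index pairs; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    splits = []
    for i in range(k):
        test = idx[i::k]
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        splits.append((train, test))
    return splits


def fit_class_means(xs, ys):
    """Toy classifier: store the mean feature value of each class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def accuracy(class_means, xs, ys):
    """Predict the class with the nearest mean and score against the labels."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = min(class_means, key=lambda c: abs(x - class_means[c]))
        hits += pred == y
    return hits / len(ys)


def evaluate(xs, ys, train, test):
    """Fit on the training indices and report accuracy on the test indices."""
    model = fit_class_means([xs[i] for i in train], [ys[i] for i in train])
    return accuracy(model, [xs[i] for i in test], [ys[i] for i in test])


# Synthetic one-feature cohort (hypothetical data, two overlapping classes).
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(1.5, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

holdout_acc = evaluate(xs, ys, *holdout_split(len(xs)))
cv_accs = [evaluate(xs, ys, tr, te) for tr, te in kfold_splits(len(xs), k=5)]

print(f"hold-out accuracy: {holdout_acc:.2f}")
print(f"5-fold accuracy:   {mean(cv_accs):.2f} +/- {stdev(cv_accs):.2f}")
```

Rerunning the hold-out split with different seeds changes its single estimate, while the cross-validated mean and standard deviation summarize that variability directly.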

microarray gene expression datasets and achieved an overall accuracy of 83.0% with only five identified genes correlated with survival time. Liu et al. [124] also utilized gene expression data for a 3-year survival classification. Unlike Chen et al. [123], the authors integrated three types of sequencing data (RNA sequencing, DNA methylation, and DNA mutation) to select a total of 22 genes to improve their model’s stability. Meanwhile, LUADpp [125] and Cho et al. [126] used somatic mutations as input features to model a 3-year survival risk classification. To select the genes most significantly associated with mortality, Cho et al. [126] used chi-squared tests, and LUADpp [125] used a published genome-wide rate comparison test [127] that was able to balance statistical power and precision to compare gene mutation rates. Due to the complexity of survival prediction, multi-omics tumor data have been integrated for analysis in many studies. Compared with single-omics data, it is more challenging to accurately extract the most significant genes for prediction from multi-omics data. To address the issue, several studies [128–131] designed a similar workflow. They first constructed a matrix representing the similarity between patients based on their multi-omics data. Using the obtained matrix, they then employed an unsupervised clustering model (usually autoencoder with K-means clustering) to categorize the patients into two clusters. The two clusters were labeled “high-risk” and “low-risk” in terms of the different survival outcomes between the two clusters in the Kaplan–Meier analysis. Following the survival outcome differences, the genes associated with mortality were extracted using a statistical model [128,129] or an ML model [130,131] for downstream analyses.

Apply ML to lung cancer immunotherapy

Immunotherapy response prediction

Immunotherapy has become increasingly important in recent years. It enables a patient’s own immune system to fight cancer, in most cases, by stimulating T cells. To date, distinct novel immunotherapy treatments are being tested for lung cancer, and a variety of them have become standard parts of immunotherapy. Immune checkpoint inhibitors (ICIs), especially programmed cell death protein 1 (PD-1)/programmed cell death protein ligand 1 (PD-L1) blockade therapy [132], have been demonstrated to be valuable in the treatment of patients with non-small cell lung cancer (NSCLC) [133,134]. However, immunotherapy is not yet as widely used as surgery, chemotherapy, or radiation therapies. One interpretation is that it does not work for all patients due to the uniqueness of a patient’s tumor immune microenvironment (TIME). Therefore, estimating whether a patient will respond to immunotherapy is important for cancer treatment. Recently, AI-based technologies have been developed to predict immunotherapy responses based on immune sequencing signatures and medical imaging signatures (Figure 4; Table 3) [135]. To predict the response to PD-1/PD-L1 blockade therapy, Wiesweg et al. [136] utilized gene expression profiles of 7 significant genes extracted from ML models plus 25 cell type-specific genes as input features to train an SVM classifier for RECIST classification. Aside from sequencing data, features from CT scans can also be used to assess the RECIST level of a patient. Two recent studies [137,138] used radiomic biomarkers as well as other imaging features of tumor lesions from contrast-enhanced computed tomography (CE-CT) scans to train a classifier, including LR and RF, for RECIST classification.

Tumor-infiltrating lymphocyte evaluation

The proportion of tumor-infiltrating lymphocytes (TILs) is another important metric for immunotherapy response evaluation. To this end, using transcriptomics data, DeepTIL [139] optimized the cell deconvolution model CIBERSORT [140] to automatically compute the abundance of the leucocyte subsets (B cells, CD4+ T cells, CD8+ T cells, γδ T cells, Mo-Ma-DC cells, and granulocytes) within a tumor sample. A different approach [141] utilized a total of 84 radiomic features from the CE-CT scans, along with RNA sequencing of 20,530 genes as biomarkers to train a linear elastic-net regression model to predict the abundance of CD8+ T cells. Another study [142] created a DL model to identify TILs in digitized H&E-stained images (Table 3). The methodology consisted of two unique CNN modules to evaluate TILs at different scales: a lymphocyte infiltration classification CNN (lymphocyte CNN) and a necrosis segmentation CNN (necrosis CNN). The “lymphocyte CNN” aimed to categorize the input image into with- and without-lymphocyte infiltration regions. It consisted of two steps: a convolutional autoencoder (CAE) [143] for feature extraction, followed by a VGG 16-layer network [144] for TIL region classification. The “necrosis CNN” aimed to detect TILs within a necrosis region. They used the DeconvNet [145] model for TIL segmentation in the “necrosis CNN”, as the model has been shown to achieve high accuracy with several benchmark imaging datasets.

Neoantigen prediction

In addition to immunotherapy response prediction, ML algorithms have shed light on neoantigen prediction for immunotherapy. Neoantigens are tumor-specific mutated peptides generated by somatic mutations in tumor cells, which can induce antitumor immune responses [146–148]. Recent work has demonstrated that immunogenic neoantigens are beneficial to the development and optimization of neoantigen-targeted immune therapies [149–152]. In accordance with neoantigen studies in clinical trials, state-of-the-art ML approaches have been implemented to identify neoantigens based on human leukocyte antigen (HLA) class I and II processing and presentation [153–157]. Using the identified somatic mutations, ML models can estimate the binding affinity of the encoded mutated peptides to the patient’s HLA alleles (peptide–HLA binding affinity). The neoantigens can be further predicted based on the estimated peptide–HLA binding affinity. NetMHC [158,159] utilized a receptor–ligand dataset consisting of 528 peptide–HLA binding interactions measured by Buus et al. [160] to train a combination of several NNs for neo-peptide affinity prediction. To make the prediction more accurate, NetMHCpan [161,162] used a larger dataset consisting of 37,384 unique peptide–HLA interactions covering 24 HLA-A alleles and 18 HLA-B alleles (26,503 and 10,881 for the A and B alleles, respectively) to train their NN model. Both tools have been implemented to study the neoantigen landscape in lung cancers [146,163–165].
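
The step from somatic mutations to candidate neoantigens via predicted peptide–HLA binding affinity can be sketched as follows. This is a minimal illustration and not the NetMHC or NetMHCpan interface: `predict_ic50` is a hypothetical callable standing in for any affinity predictor, the 8–11 residue window and the 500 nM cutoff are commonly used conventions for HLA class I that are assumed here rather than taken from the studies above, and the protein sequence and toy predictor are invented for the example.

```python
from typing import Callable, Iterable, List, Tuple


def mutant_peptides(protein: str, pos: int, alt: str,
                    lengths: Iterable[int] = (8, 9, 10, 11)) -> List[str]:
    """Enumerate peptides that span a single substituted residue.

    `pos` is the 0-based index of the mutated residue in `protein`;
    `alt` is the amino acid introduced by the somatic mutation.
    """
    mutated = protein[:pos] + alt + protein[pos + 1:]
    peptides = []
    for k in lengths:
        first = max(0, pos - k + 1)        # earliest window still covering pos
        last = min(pos, len(mutated) - k)  # latest window that fits in the protein
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


def candidate_neoantigens(peptides: Iterable[str],
                          predict_ic50: Callable[[str], float],
                          threshold_nm: float = 500.0) -> List[Tuple[str, float]]:
    """Keep peptides whose predicted IC50 (nM) is below the threshold.

    Lower IC50 means stronger predicted binding, so results are sorted
    from strongest to weakest predicted binder.
    """
    scored = [(p, predict_ic50(p)) for p in peptides]
    return sorted((ps for ps in scored if ps[1] < threshold_nm), key=lambda ps: ps[1])


# Toy usage: a made-up 20-residue sequence with an M -> A substitution at index 10,
# and a placeholder predictor (hypothetical; a trained model would go here).
toy_protein = "ACDEFGHIKLMNPQRSTVWY"


def toy_predictor(peptide: str) -> float:
    return 100.0 if peptide.startswith("K") else 1000.0


nine_mers = mutant_peptides(toy_protein, pos=10, alt="A", lengths=(9,))
hits = candidate_neoantigens(nine_mers, toy_predictor)
print(len(nine_mers), hits)
```

Passing the predictor as a parameter keeps the enumeration and filtering logic independent of whichever affinity model is used for a given HLA allele.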

Challenges and future perspectives

Despite the widespread use of ML studies in lung cancer clinical practice and research, there are still challenges to be addressed. Here, we present some examples of recent ML algorithms, especially the increasingly popular and important DL algorithms of the past decade, that may inform lung cancer therapy analyses, as well as the challenges for future lung cancer studies.

Imaging data analysis

Learning how to effectively extract nuance from imaging data is critical for clinical use. In the earlier ML-based CAD system, feature extraction was typically based on the image intensity, shape, and texture of a suspicious region along with other clinical variables [166]. However, these approaches are arbitrarily defined and may not retrieve the intrinsic features of a suspicious nodule. To this end, a DL-based CAD system was developed leveraging CNN models to extract features directly from raw imaging data with multilevel representations and hierarchical abstraction [167–169]. Contrary to previous methods, features from a CNN model are not designed by humans, and reflect the intrinsic features of the nodule in an objective and comprehensive manner. Recently, the Vision Transformer (ViT) has emerged as the current state-of-the-art in computer vision [170,171]. In comparison to CNN, ViT outperformed it by almost 4× in terms of computational efficiency and accuracy, and was more robust when training on smaller datasets [172]. Although, to our knowledge, ViT models have not been implemented in any lung cancer imaging studies, they have shown their potential as a competitive alternative to CNN in imaging data analysis.

Omics dataset analysis

DL is a subfield of ML, which uses programmable NNs to make accurate decisions. It particularly shines when it comes to complex problems such as image classification. In this study, we reviewed the utility of DL models in imaging datasets. Compared with imaging datasets, DL algorithms were less frequent in lung cancer clinical studies using omics data. However, DL models have been extensively applied in other fields of omics analysis. For example, genomics data are continuous sequences, so recurrent neural network (RNN) models [173] and CNN models [174] are good tools for population genetics analysis. Moreover, considering that the input dimension of omics data is usually very high, many studies have used autoencoders or deep generative models for feature extraction and dimensionality reduction to improve efficiency and reduce overfitting [175]. In the meantime, self-supervised representation learning models can overcome the curse of dimensionality and integrate multi-omics data to combine information about different aspects of the same tissue samples [176]. Accompanied by the development of single-cell-based [177] and spatial-based [178] technologies that have been applied in molecular studies, numerous DL models are becoming more popular for computationally intensive analysis. To deal with the complexity of large genomics data, unsupervised deep clustering tools have been built for population structure identification [179] or cell population subtype annotation [180–183]. In addition, to process the complex structure of multi-omics data, graph neural network (GNN) models are increasingly popular in dataset integration [184], biomedical classification [185], prognosis prediction [186], and so on. Though these studies have not been directly applied to lung cancer clinical analysis, they are a good inspiration for using DL tools to address complex lung cancer omics datasets.

Multi-view data and multi-database integration

It is common to access large amounts of imaging data, multi-omics data, and clinical records from a single patient nowadays. Integrating these data provides a comprehensive insight into the molecular functions of lung cancer studies. However, these data types are typically obtained from different platforms, so platform noise inevitably exists between these data types. For example, imaging data analysis, especially radiomics, usually comes with the challenges of complicated data normalization, data fusion, and data integration. To overcome this limitation, multimodality medical segmentation networks have been developed to jointly process multimodality medical images [187]. Similarly, for sequencing data types, batch noise also exists between different databases (i.e., batch effect). Removing batch effects and integrating datasets from multiple platforms together in a framework that allows us to further analyze the mechanisms of cancer drug resistance and recurrence is important for cancer therapies. Though biomedical studies have experimented with and/or benchmarked integrative tools [188–191], they are not comprehensive and discriminating enough to address the choice of tools in the context of biological questions of interest.

Model generalizability and robustness

In this review, we find that the performance of an ML algorithm usually varies across different datasets. One interpretation might be the existence of a database batch effect that we discussed earlier. However, the absence of generalizability and robustness might be other factors that hinder these ML models in clinical studies. In addition, to reduce overfitting, most studies used either statistical models or ML models to select marker genes before classification. However, these marker genes are usually quite different between studies, indicating that the identified marker genes lack generalizability and biological interpretability. To improve the generalizability and robustness of a model, it is important to develop a better understanding of robustness issues in different ML architectures and bridge the gap in robustness techniques among different domains. For example, recent studies have applied transfer learning to use a pre-trained model when training their own datasets in lung cancer imaging data analysis [38,55,192], and have improved the efficiency and robustness of their CNN-based models. For sequencing datasets, transfer learning has also been used in deep NNs to provide a generalizable approach [193], which could be a good example of building a general and robust model for lung cancer sequencing data analysis. In addition, DL is a complex black-box model. Understanding the mechanisms of a DL system in clinical studies could help to build a standardized and unified DL framework to improve its performance and robustness. The
explainable AI (XAI) models have provided a tool for model-specific and model-agnostic analysis [194,195]. These methods can provide explanations of a model at local and global levels, which further helps researchers to fine-tune hyperparameters from different models with high efficacy [196,197].

Metrics for performance evaluation

Studies usually focus on the development of algorithms for clinical studies. However, metrics selection for performance assessment of these algorithms is usually neglected, though it is an important component of ML systems [198]. Based on this review (Tables 1–3), accuracy and area under the curve (AUC) are the two most conventional metrics, whereas these metrics do not always reflect the clinical needs and should be translated into clinically explainable metrics. Compared with accuracy, sensitivity or specificity might be more associated with clinical needs under certain circumstances, for example, for patients at high risk of emergency department visits [199].

Clinical decision-making

A recent study estimated that the overall costs for lung cancer therapy would exceed $50,000 [200] for most patients, and that the cost would be high for most families. Thus, accurate prognosis prediction and decision-making will pave the way for personalized treatment. Recent DL models have been used to predict the effectiveness of a therapy/drug and optimize the combination of different therapies/drugs [201,202]. However, most existing DL models for clinical decision-making have difficulty in keeping up with knowledge evolution and/or dynamic health care data change [203]. Currently, clinical decision support systems, including IBM Watson Health and Google DeepMind Health, have been implemented in lung cancer treatments in recent years [204,205]. Although the efficiency of clinical work has improved with the help of these systems, they are still far from perfect in terms of clinical trials, and cur-

system that considers both imaging data and omics data treatment, and the integration of multiple data types. Finally, we expect that these challenges could motivate further studies to focus on lung cancer therapies.

CRediT author statement

Yawei Li: Conceptualization, Data curation, Writing - original draft, Visualization. Xin Wu: Data curation, Writing - original draft, Visualization. Ping Yang: Writing - review & editing. Guoqian Jiang: Writing - review & editing. Yuan Luo: Conceptualization, Funding acquisition, Writing - review & editing. All authors have read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

This study is supported in part by the National Institutes of Health, USA (Grant Nos. U01TR003528 and R01LM013337).

Supplementary material

Supplementary data to this article can be found online at

ORCID

ORCID 0000-0001-9699-5118 (Yawei Li)
ORCID 0000-0003-2386-6344 (Xin Wu)
ORCID 0000-0002-8588-847X (Ping Yang)
ORCID 0000-0003-2940-0019 (Guoqian Jiang)
ORCID 0000-0003-0195-7456 (Yuan Luo)
