Deep Learning in Image-Based Breast and Cervical Cancer Detection: A Systematic Review and Meta-Analysis
ARTICLE OPEN
Accurate early detection of breast and cervical cancer is vital for treatment success. Here, we conduct a meta-analysis to assess the
diagnostic performance of deep learning (DL) algorithms for early breast and cervical cancer identification. Four subgroups are also
investigated: cancer type (breast or cervical), validation type (internal or external), imaging modalities (mammography, ultrasound,
cytology, or colposcopy), and DL algorithms versus clinicians. Thirty-five studies are deemed eligible for systematic review, 20 of
which are meta-analyzed, with a pooled sensitivity of 88% (95% CI 85–90%), specificity of 84% (79–87%), and AUC of 0.92
(0.90–0.94). Acceptable diagnostic performance with analogous DL algorithms was highlighted across all subgroups. Therefore, DL
algorithms could be useful for detecting breast and cervical cancer using medical imaging, having equivalent performance to
human clinicians. However, this tentative assertion is based on studies with relatively poor designs and reporting, which likely
caused bias and overestimated algorithm performance. Evidence-based, standardized guidelines around study methods and
reporting are required to improve the quality of DL research.
npj Digital Medicine (2022)5:19 ; https://doi.org/10.1038/s41746-022-00559-z
1Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical
College, Beijing 100730, China. 2National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union
Medical College, Beijing 100021, China. 3School of Humanities and Social Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730,
China. 4Faculty of Health and Medicine, Division of Health Research, Lancaster University, Lancaster LA1 4YW, United Kingdom. 5These authors contributed equally: Peng Xue,
Jiaxu Wang. ✉email: [email protected]; [email protected]
npj Digital Medicine (2022) 19 Published in partnership with Seoul National University Bundang Hospital
Table 1. Study design and basic demographics.
First author and year | Inclusion criteria | Exclusion criteria | N | Age (SD; range)
Xiao et al.49 * | Had breast lesions clearly visualized by ultrasound; Underwent biopsy and had pathological results; provided informed consent. | Patients who were pregnant or lactating; patients who had breast biopsy or were undergoing neoadjuvant chemotherapy or radiotherapy. | 389 | 46.86 (13.03; 19–84)
Zhang et al.50 * | NR | Pathological results were neither benign nor malignant; Patients with BI-RADS 1 or 2 and abnormal mammography results; patients who were diagnosed with Paget’s disease but had no masses in the breasts. | 2062 | NR
Zhou et al.51 * | Images were scanned under the same MR protocol; The lesion had complete pathology results; Imaging reports had definite BI-RADS category diagnosed; Lesions were a) solitary in one breast or b) in both breasts with the same BI-RADS and pathological results. | Normal or typical background parenchyma enhancement in bilateral breasts was eliminated. | 1537 | 47.5 (11.8; NR)
Agnes et al.52 | NR | NR | NR | NR
Tanaka et al.53 * | women with breast masses who were referred for further examination after their initial screening examination of breast cancer and then underwent ultrasonography and pathological examination. | Typical cysts; mass lesions ≥ 4.5 cm diameter | NR | NR
Becker et al.35 | Patients with postsurgical scars, initially indeterminate, or malignant lesions with histological diagnoses or 2 years follow up. | Patients with normal breast ultrasound, and all patients with lesions classified as clearly benign, except for patients with prior breast-conserving surgical treatment. | 632 | 53 (15; 15–91)
Kyono et al.54 | Women recalled after routine breast screening between ages of 47–73 or women with a family history of breast cancer attending annual screening between ages of 40–49. | NR | 2000 | NR (NR; 47–73)
Qi et al.55 * | NR | NR | 2047 | NR
Mckinney et al.60 | NR | Cases without follow-up were excluded from the test set. | 28953 | NR
Shen et al.61 | NR | NR | 1249 | NR
Suh et al.62 * | 18 years or older and not having a history of previous breast surgery. | Subjects without medical records or pathological confirmation for a suspicious breast lesion, missing mammograms, or having poor-quality mammograms. | 1501 | 48.9 (11.1; NR)
O’Connell et al.63 | Adult females or males recommended for ultrasound-guided breast lesion biopsy or ultrasound follow-up with at least one suspicious lesion; age ≥ 18 years. | Unable to read and understand English at the University of Rochester; patients with diagnosis of breast cancer in the same quadrant; unwilling to undergo study procedures and informed consent. | 299 | 52.3 (NR; NR)
Ruiz et al.64 | Women presenting for screening with no symptoms or concerns. | Women with implants and/or a history of breast cancer. | 240 | 62 (53–66; 39–89)
Adachi et al.65 * | Patients who underwent DCE breast MRI; patients who were diagnosed with benign or malignant lesions by pathology or a follow-up examination at more than one year. | Patients who were treated with breast surgery, hormonal therapy, chemotherapy, or radiation therapy; age ≤ 20 years. | 371 | NR
Samala et al.66 | NR | NR | 2242 | 51.7 (NR; 24–82)
Schaffter et al.67 | NR | NR | 153588 | 56.1 (NR; NR)
Kim et al.68 * | NR | NR | 172230 | 50.3 (10; NR)
Wang et al.69 | All nodules of patients were newly discovered and untreated; patients had undertaken ABUS scan; definite pathological benign and malignant; the image quality of ABUS examination was good enough to show the entire margin of the lesion, no matter distinct or indistinct. | Non-nodular breast disease; ABUS artifact was obvious and the poor images quality; ABUS was not available; patients received chemotherapy, radiation therapy or surgical local resection before ABUS scan. | 264 | 54.31 (9.68; 37–75)
Yu et al.70 | Pathological results clearly; at least 2D mode US images available, but | A foreign-body in the breast; other metastatic tumors or co-infection with | 3623 | 42.5 (NR; 11–95)
Sasaki et al.71 * | Patients undergone bilateral mammography; patients in whom ultrasonography had established the presence or absence of a lesion; patients in whom a lesion, if present, had been diagnosed as being benign or malignant by cytology or histology; normal patients in whom ultrasonography had revealed no lesion and who had been followed up | NR | 310 | 50 (NR; 20–93)
Xue et al.39 * | Aged 24-65 years with indications for the need for colposcopy imaging and biopsy, and those who were pathologically confirmed. | Empty or invalid images, low quality, unsatisfactory images, information loss. | 19435 | NR (NR; 24–65)
Yu et al.80 * | NR | NR | 679 | NR
Yuan et al.40 * | NR | Without complete clinical and pathological information; without biopsies; pathologically diagnosed as invasive cervical cancer or glandular intraepithelial lesions; poor-quality colposcopy images. | 22330 | NR (NR; 20–66)
DCE dynamic contrast enhanced, NR not reported, MRI magnetic resonance imaging, BI-RADS breast imaging reporting and data system, MR magnetic resonance, ABUS automated breast ultrasound, CDFI color
doppler flow imaging, PW pulsed wave, HIV human immunodeficiency virus, HRME high-resolution microendoscopy, DS dual stained.
*20 studies included in the meta-analysis.
First author and year | Focus | Reference standard | Type of internal validation | External validation | DL versus clinician
Xiao et al.49 * | Breast cancer | Histopathology | NR | Yes | Yes
Zhang et al.50 * | Breast cancer | Histopathology, immunohistochemistry | Random split-sample validation | Yes | No
Zhou et al.51 * | Breast cancer | Histopathology, expert consensus | Random split-sample validation | No | Yes
Agnes et al.52 | Breast cancer | Histopathology | NR | No | No
Tanaka et al.53 * | Breast cancer | Histopathology, two-year follow-up | Random split-sample validation | No | No
Becker et al.35 | Breast cancer | Histopathology, two-year follow-up | Random split-sample validation | No | No
Kyono et al.54 | Breast cancer | Histopathology, follow-up, expert consensus | Ten-fold cross validation | No | No
Qi et al.55 * | Breast cancer | Histopathology | Random split-sample validation | No | No
Salim et al.56 * | Breast cancer | Histopathology, two-year follow-up | NR | Yes | Yes
Zhang et al.57 | Breast cancer | Histopathology | NR | No | No
Wang et al.58 | Breast cancer | Histopathology, two-year follow-up | Five-fold cross validation | No | No
Li et al.59 | Breast cancer | Histopathology | Five-fold cross validation | No | No
Mckinney et al.60 | Breast cancer | Histopathology, multiple years of follow-up | NR | Yes | No
Shen et al.61 | Breast cancer | Histopathology | Random split-sample validation | No | No
Suh et al.62 * | Breast cancer | Histopathology | Random split-sample validation | No | No
O’Connell et al.63 | Breast cancer | Histopathology, two-year follow-up | NR | Yes | No
Ruiz et al.64 | Breast cancer | Histopathology, one-year follow-up | NR | Yes | Yes
O’Connell et al.63 | Ultrasound | NR | No | CNN | No | Prospective study, data from University of Rochester and University Hospital Palermo, Italy. | NR/NR/299 | 2018–2019 | No
Ruiz et al.64 | Mammography | Yes | No | CNN | No | Retrospective study, data from two institutes in the US and Europe. | NR/NR/240 | 2013–2017 | No
Adachi et al.65 * | MRI | NR | No | RetinaNet | No | Retrospective study, data from Tokyo Medical and Dental University hospital. | 286/85/NR | 2014.03–2018.10 | No
Samala et al.66 | Mammography | NR | No | ImageNet DCNN | Yes | Retrospective study, data from University of Michigan Health System and the Digital Database for Screening Mammography. | 1335/907/NR | 2001–2006 | No
Schaffter et al.67 | Mammography | NR | No | Faster-RCNN | No | Retrospective study, data from Kaiser Permanente Washington and Karolinska Institute. | 100974/43257/166578 | 2016.09–2017.11 | No
Kim et al.68 * | Mammography | NR | Yes | ResNet-34 | No | Retrospective study, data from five institutions in South Korea, USA. | 166968/3262/320 | 2000.01–2018.12 | No
Sasaki et al.71 * | Mammography | NR | No | Transpara | No | Retrospective study, data from Sagara Hospital Affiliated Breast Center, Japan. | NR/NR/620 | 2018.01–2018.10 | No
Zhang et al.72 | Mammography | NR | Yes | MVNN | No | Retrospective study, data from Digital Database for Screening Mammography. | 5194/512/NR | NR | Yes
Bao et al.73 * | Cytology | NR | No | DL | No | Retrospective study, data from a cervical cancer screening program. | 103793/NR/69906 | 2017.01–2018.12 | No
Holmström et al.74 * | Cytology | NR | No | CNN | No | Retrospective study, data from a rural clinic in Kenya. | 350/390/NR | 2018–2019 | No
Cho et al.75 * | Colposcopy | NR | Yes | Inception-Resnet-v2, Resnet-152 | No | Retrospective study, data from three university affiliated hospitals. | 675/116/NR | 2015–2018 | No
Bao et al.76 * | Cytology | NR | No | VGG16 | No | Retrospective study, data from eight tertiary hospitals in China. | 15083/NR/2145 | 2017.05–2018.10 | No
Hu et al.77 * | Cervicography | NR | Yes | Faster R-CNN | Yes | Retrospective study, data from Guanacaste Costa Rica cohort. | 744/8917/NR | 1993–2001 | No
Hunt et al.78 * | Microendoscopy | NR | Yes | CNN | No | Prospective study, data from Barretos Cancer Hospital. | 870/616/NR | NR | No
Wentzensen et al.79 * | Cytology | NR | No | CNN4, Inception-v3 | No | Retrospective study, data from Kaiser Permanente Northern California and the University of Oklahoma. | 193/409/NR | 2009–2014 | No
NR not reported, CNN convolutional neural network, DL deep learning, YOLO you only look once, DNN deep neural network, DCNN deep convolutional neural network, MRI magnetic resonance imaging, DenseNet dense convolutional network, MA-CNN multiattention convolutional neural network, VGG visual geometry group network, ResNet deep residual network, FPN feature pyramid networks, MVNN multiview feature
nor was it reduced to an acceptable level, with all subgroup I² values remaining high. Therefore, we could infer whether different validation types, cancer types, and imaging modalities were likely to have influenced DL algorithm performances for detecting breast and cervical cancer.
To further investigate this finding, we performed meta-regression analysis with these covariates (see Supplementary
Quality assessment
exclusion. For the index test domain, only one study was considered high or at unclear risk of bias due to having no predefined threshold, whereas the others were considered at low risk of bias.
For the reference standard domain, three studies were
DISCUSSION
Fig. 2 Pooled overall performance of DL algorithms. a Receiver operator characteristic (ROC) curves of all studies included in the meta-
analysis (20 studies with 55 tables), and b ROC curves of studies reporting the highest accuracy (20 studies with 20 tables).
Fig. 3 Pooled performance of DL algorithms using different validation types. a Receiver operator characteristic (ROC) curves of studies with
internal validations (15 studies with 40 tables), b ROC curves of studies with external validations (8 studies with 15 tables).
Fig. 4 Pooled performance of DL algorithms using different cancer types. a Receiver operator characteristic (ROC) curves of studies in
detecting breast cancer (10 studies with 36 tables), and b ROC curves of studies in detecting cervical cancer (10 studies with 19 tables).
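Each contingency table behind the ROC curves above contributes one point in ROC space. As a minimal sketch of that step (not the authors' code), the Python below derives sensitivity, specificity, and 95% intervals from a single 2×2 table. The function names and the counts are hypothetical, and the use of the Wilson score interval is an assumption rather than something stated in this article.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def diagnostic_metrics(tp, fp, fn, tn):
    """Summarise one 2x2 contingency table (index test vs. reference standard)."""
    sensitivity = tp / (tp + fn)  # diseased cases correctly flagged
    specificity = tn / (tn + fp)  # non-diseased cases correctly cleared
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity_ci": wilson_interval(tn, tn + fp),
        # One point on an ROC plot: (false positive rate, true positive rate)
        "roc_point": (1 - specificity, sensitivity),
    }


# Hypothetical counts for illustration; not taken from any included study.
example = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
print(example)
```

A study that reports several thresholds or several algorithms yields several such tables, which is why the number of tables in the figure captions exceeds the number of studies.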
cancers were analyzed with more than three studies, which not only inhibits generalizability but highlights the need for further DL algorithm performance research in the field of medical imaging. In identifying pathologies, Aggarwal et al.24 found that DL algorithms have high diagnostic performance. However, the authors also found high heterogeneity which was attributed to combining distinct methods and perhaps through unspecified terms. They concluded that we need to be cautious when
Fig. 5 Pooled performance of DL algorithms using different imaging modalities. a Receiver operator characteristic (ROC) curves of studies
using mammography (4 studies with 15 tables), b ROC curves of studies using ultrasound (4 studies with 17 tables), c ROC curves of studies
using cytology (4 studies with 6 tables), and d ROC curves of studies using colposcopy (4 studies with 11 tables).
Fig. 6 Pooled performance of DL algorithms versus human clinicians using the same sample. a Receiver operator
characteristic (ROC) curves of studies using DL algorithms (11 studies with 29 tables), and b ROC curves of studies using human clinicians
(11 studies with 18 tables).
considering the diagnostic accuracy of DL algorithms and that there is a need to develop (and apply) AI guidelines. This was also apparent in this study and therefore we would reiterate this sentiment.
While the findings from the aforementioned studies are incredibly valuable, at present there is a need to expand upon the emerging knowledge-base for metastatic tumor diagnosis. The only other review in this field was conducted by Zheng et al.25, who found that DL algorithms are beneficial in radiological imaging, with equivalent or in some instances better performance than healthcare professionals. Again, however, there were methodological deficiencies which must be considered before we adopt these technologies into clinical practice. Also, we must strive to identify the best available DL algorithm and then develop
Fig. 7 Summary estimate of pooled performance using forest plot. Forest plot of studies included in the meta-analysis
(20 studies).
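The summary estimate in a forest plot such as Fig. 7 comes from weighting study-level results. The sketch below is a simplified, univariate illustration: it pools logit-transformed sensitivities with a DerSimonian–Laird random-effects model and reports I² as a heterogeneity measure. This section does not state which model the authors fitted, and diagnostic accuracy meta-analyses commonly model sensitivity and specificity jointly, so the function name, the 0.5 continuity correction, and the input counts should all be read as assumptions for illustration only.

```python
import math


def pool_logit_proportions(events, totals, z=1.96):
    """Random-effects pooling of proportions on the logit scale.

    Hypothetical helper: events[i] of totals[i] are, e.g., true positives
    among diseased cases in study i. Uses a 0.5 continuity correction.
    """
    effects, variances = [], []
    for e, n in zip(events, totals):
        pos, neg = e + 0.5, (n - e) + 0.5
        effects.append(math.log(pos / neg))   # logit of corrected proportion
        variances.append(1 / pos + 1 / neg)   # approximate variance of the logit

    # Fixed-effect weights are needed to compute Cochran's Q.
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights and pooled estimate.
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return {
        "pooled": inv_logit(pooled),
        "ci": (inv_logit(pooled - z * se), inv_logit(pooled + z * se)),
        "i2_percent": i2,
        "tau2": tau2,
    }


# Hypothetical study-level counts; not extracted from the included studies.
result = pool_logit_proportions(events=[90, 170, 45], totals=[100, 200, 60])
print(result)
```

When I² stays high within every subgroup, as reported for this meta-analysis, the pooled value is better read as an average over heterogeneous settings than as a single shared effect.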
it to enhance identification and reduce the number of false positives and false negatives beyond that which is humanly possible. As such, we need to continue to use systematic reviews to identify gaps in research and we should not only consider technology-specific reviews, but also disease-specific systematic reviews. Of course, DL algorithms are in an almost constant state of development but the purpose of this study was to critically appraise potential issues with study methods and reporting standards. By doing so, we hoped to make recommendations and to drive further research in this field so that the most effective technology is adopted into clinical practice sooner rather than later.
This systematic review with meta-analysis suggests that deep learning algorithms can be used for the detection of breast and cervical cancer using medical imaging. Evidence also suggests that while the deep learning algorithms are not yet superior, neither are they inferior in terms of performance when compared to clinicians. Acceptable diagnostic performance with analogous deep learning algorithms was observed in both breast and cervical cancer despite having dissimilar workflows with different imaging modalities. This finding also suggests that these algorithms could be deployed across both breast and cervical imaging, and potentially across all types of cancer which utilize imaging technologies to identify cases early. However, we must also critically consider some of the issues which emerged during our systematic analysis of this evidence base.
Overall, there were very few prospective studies and few clinical trials. In fact, most included studies were retrospective, which may be because of the relative newness of DL algorithms in medical imaging. However, the data sources used were from either pre-existing electronic medical records or online open-access databases, which were not explicitly intended for algorithmic analysis in real clinical settings. Of course, we must first test these technologies using retrospective datasets to see whether they are appropriate and with a view to modifying and enhancing accuracy, perhaps for specific populations or for specific types of cancer. We also encourage more prospective DL studies in the future. If possible, we should investigate the potential rules of breast or cervical images through more prospective studies, and identify possible image feature correlations and diagnostic logic for risk predictions. Most studies constructed and trained algorithms using small sets of labeled breast or cervical images, with labels which were rarely quality-checked by a clinical specialist. This design fault is likely to have created ambiguous ground-truth inputs which may have caused unintended adverse model effects. Of course, the knock-on effect is that there are likely to be diagnostic inaccuracies through unidentified biases. This is certainly an issue which should be considered when designing future deep learning-based studies.
It is important to note that no matter how well-constructed an algorithm is, its diagnostic performance depends largely upon the volume and quality of raw data26. Most studies included in this systematic review mentioned a data augmentation method which adopted some form of affine image transformation strategy, e.g. translation, rotation or flipping, in order to compensate for data deficiencies. This, one could argue, is symptomatic of the paucity of annotated datasets for model training, and of prospective studies for model validation. Fortunately, there has been a substantial increase in the number of openly available datasets around cervical or breast cancer. However, given the necessity for
this research, one would like to see institutions collaborating more frequently to establish cloud sharing platforms which would increase the availability (and breadth) of annotated datasets. Moreover, training DL algorithms requires reliable, high-quality image inputs, which may not be readily available, as some pre-analytical factors such as incorrect specimen preparation and processing, unstandardized image digitalization acquisition, improper device calibration and maintenance could lower image quality. Complete standardization of all procedures and reagents in clinical practice is required to optimally prepare pre-analytical image inputs in order to develop more robust and accurate DL algorithms. Having these would drive developments in this field and would benefit clinical practice, perhaps serving as a cost-effective replacement diagnostic tool or an initial method of risk categorization. However, this is beyond the scope of this study and would require further research to consider in detail.
Of the 35 included studies, only 11 studies performed external validation, which means that an assessment of DL model performance was conducted with either an out-of-sample dataset or with an open-access dataset. Indeed, most of the studies included here split a single sample by either randomly or non-randomly assigning individuals’ data from one center into either a development dataset or an internal validation dataset. We found that studies with internal validation reported higher performance than externally validated studies for early detection of cervical and breast cancer. However, this was to be expected because an internal dataset used to validate models is more likely to be homogeneous and may lead to an overestimated diagnostic performance. This finding highlights the need for out-of-sample external validation in all predictive models. A possible method for improving external validation would be to establish an alliance of institutions wherein trained deep learning algorithms are shared and performances tested externally. This might provide insight into subgroups and variations between various ethnic groups, although we would also need to maintain patient anonymity and security, as several researchers have previously noted27,28.
Most of the studies were retrospective, using narrowly defined binary or multi-class tests that focused on diagnostic performance in the field of DL algorithms rather than clinical practice. This is a direct consequence of poor reporting, and the lack of real-world prospective clinical practice, which has resulted in inadequate data availability and therefore may limit our ability to gauge the applicability of these DL algorithms to clinical settings. Accordingly, there is uncertainty around the estimates of diagnostic performance provided in our meta-analysis and adherence levels should be interpreted with caution.
Recently, several AI-related method guides have been published, with many still under development29,30. We found most of the included studies we analyzed were probably conceived or performed before these guidelines were available. Therefore, it is reasonable to assume that design features, reporting adequacy and transparency of studies used to evaluate the diagnostic performance of DL algorithms will be improved in the future. Even though our findings suggest that DL is not inferior in terms of performance compared to clinicians for the early detection of breast or cervical cancer, this is based on relatively few studies. Therefore, the uncertainty which exists is, at least in part, due to the in silico context in which clinicians are being evaluated.
We should also acknowledge that most of the current DL studies are publications of positive results. We must be aware that this may be a form of researcher-based reporting bias (rather than publication-based bias), which is likely to skew the dataset and add complexity to comparison between DL algorithms and clinicians31,32. Differences in reference standard definitions, grader capabilities (i.e. the degrees of expertise), imaging modalities and detection thresholds for classification of early breast or cervical cancer also make direct comparisons between studies and algorithms very difficult. Furthermore, non-trivial applications of DL models in the healthcare setting will need clinicians to optimize clinical workflow integration. However, we found only two studies which mentioned DL versus clinicians and versus DL combined with clinicians. This hindered our meta-analysis of DL algorithms but highlighted the need for strict and reliable assessment of DL performance in real clinical settings. Indeed, the scientific discourse should change from a DL versus clinicians dichotomy to a more realistic DL-clinician combination, which would improve workflows.
Thirty-five studies met the eligibility criteria for the systematic review, yet only 20 studies could be used to develop contingency tables. Some DL algorithm studies from computer science journals only reported precision, dice coefficient, F1 score, recall, and competition performance metric, whereas indicators such as AUC, accuracy, sensitivity, and specificity are more familiar to healthcare professionals25. Bridging the gap between computer science and clinical research would seem prudent if we are to manage interdepartmental research and the transition to a more digitized healthcare system. Moreover, we found the term “validation” is used casually in DL model studies. Some authors used it for assessing the diagnostic performance of the final algorithm; others defined it as a dataset for model tuning during the development process. This confuses readers and makes it difficult to judge the function of datasets. We combined experts’ opinions33, and proposed to distinguish datasets used in the development and validation of DL algorithms. In keeping with the language used for nomogram development, a dataset for training the model should be named ‘training set’, while datasets used for tuning should be referred to as the ‘tuning set’. Likewise, during the validation phase, the hold-back subset split from the entire dataset should be referred to as ‘internal’ validation, which has the same condition/image types as the training set, while a completely independent dataset for out-of-sample validation should be referred to as ‘external’ validation34.
Most of the issues discussed here could be avoided through more robust designs and high-quality reporting, although several hurdles must be overcome before DL algorithms are used in practice for breast and cervical cancer identification. The black box nature of DL models without clear interpretability of the basis for the clinical situations is a well-recognized challenge. For example, a clinician considers whether breast nodules represent breast cancer based on mammographic images against a series of judgement criteria. Therefore, a clinician developing a clear rationale for a proposed diagnosis may be the desired state, whereas having a DL model which merely states the diagnosis may be viewed with more skepticism. Scientists have actively investigated possible methods for inspecting and explaining algorithmic decisions. An important example is the use of salience or heat maps to provide the location of salient lesion features within the image rather than defining the lesion characteristics themselves35,36. This raises questions around human-technology interactions, and particularly around transparency and patient-practitioner communications which ought to be studied in conjunction with DL modeling in medical imaging.
Another common problem limiting DL algorithms is model generalizability. There may be potential factors in the training data that would affect the performance of DL models in different data distribution settings28. For example, a model only trained in the US may not perform well in Asia because a model trained using data from predominantly Caucasian patients may not perform well across other ethnicities. One solution to improve generalizability and reduce bias is to conduct large, multicenter studies which can enable the analysis of nationalities, ethnicities, hospital specifics, and population distribution characteristics37. Societal biases can also affect the performance of DL models and of course, bias exists in DL algorithms because a training dataset may not include appropriate proportions of minority groups. For example, a DL algorithm for melanoma diagnosis in dermatological study may
lack diversity in terms of skin color and genomic data, but this may also cause an under-representation of minority groups38. To eliminate embedded prejudice, efforts should be made to carry out DL algorithm research which provides a more realistic representation of global populations.
As we have seen, the included studies were mostly retrospective with extensive variation in methods and reporting. More high-quality studies such as prospective studies and clinical trials are needed to enhance the current evidence base. We also focused on DL algorithms for breast and cervical cancer detection using medical imaging. Therefore, we made no attempt to generalize our findings to other types of AI, such as conventional machine learning models. While there were a reasonable number of studies for this meta-analysis, the number of studies for each imaging modality, such as cytology or colposcopy, was limited. Therefore, the results of the subgroup analyses around imaging modality need to be interpreted with caution. We also selected only studies in which histopathology was used as the reference standard. Consequently, some DL studies that may have shown promise but did not have confirmatory histopathologic results were excluded. Even though publication bias was not identified through funnel plot analysis in Supplementary Fig. 3 based on data extracted from 20 studies, the lack of prospective studies and the potential absence of studies with negative results can cause bias. As such, we would encourage deep learning researchers in medical imaging to report studies which do not reject the null hypothesis because this will ensure evidence clusters around true effect estimates.
It remains necessary to promote deep learning in medical imaging studies for breast or cervical cancer detection. However, we suggest improving breast and cervical data quality and establishing unified standards. Developing DL algorithms requires reliable and high-quality images tagged with appropriate histopathological labels. Likewise, it is important to establish unified standards to improve the quality of the digital image production, the collection process, imaging reports, and final histopathological diagnosis. Combining DL algorithm results with other biomarkers may prove useful to improve risk discrimination for breast or cervical cancer detection. An example would be a DL model for cervical imaging that combines with additional clinical information, i.e. cytology and HPV typing, which could improve overall diagnostic performance39,40. Secondly, we need to improve the error correction ability and DL algorithm compatibility. Prophase developing DL algorithms are more generalizable and less susceptible to bias but may require larger and multicenter datasets which incorporate diverse nationalities and ethnicities, as
to be robust, and scientifically validated for clinical and personal utility.
We tentatively suggest that DL algorithms could be useful for detecting breast and cervical cancer using medical imaging, with equivalent performance to human clinicians, in terms of sensitivity and specificity. However, this finding is based on poor study designs and reporting which could lead to bias and overestimating algorithmic performances. Standardized guidelines around study methods and reporting are needed to improve the quality of DL model research. This may help to facilitate the transition into clinical practice although further research is required.
METHODS
Protocol registration and study design
The study protocol was registered with the PROSPERO International register of systematic reviews, number CRD42021252379. The study was conducted according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines47. No ethical approval or informed consent was required for the current systematic review and meta-analysis.
Search strategy and eligibility criteria
In this study, we searched Medline, Embase, IEEE and the Cochrane library until April 2021. No restrictions were applied around regions, languages, or publication types; however, letters, scientific reports, conference abstracts, and narrative reviews were excluded. The full search strategy for each database was developed in collaboration with a group of experienced clinicians and medical researchers. Please see Supplementary Note 1 for further details.
Eligibility assessment was conducted by two independent investigators, who screened titles and abstracts, and selected all relevant citations for full-text review. Disagreements were resolved through discussion with another collaborator. We included studies that reported the diagnostic performance of DL model(s) for the early detection of breast or cervical cancer using medical imaging. Studies reporting any diagnostic outcome, such as accuracy, sensitivity, or specificity, could be included. There was no restriction on participant characteristics, type of imaging modality or the intended context for using DL models.
Only histopathology was accepted as the study reference standard. As such, imperfect ground truths, such as expert opinion or consensus, and other clinical testing were rejected. Likewise, medical waveform data or investigations into the performance of image segmentation were excluded because these could not be synthesized with histopathological data. Animal studies or non-human samples were also excluded and duplicates were removed. The primary outcomes were various diagnostic performance metrics. Secondary analysis included an assessment of study
well as those with different socioeconomic status etc., if we are to methodologies and reporting standards.
implement algorithms into real-world settings.
This also highlights the need for international reporting Data extraction
guidelines for DL algorithms in medical imaging. Existing Two investigators independently extracted study characteristics and
reporting guidelines such as STARD41 for diagnostic accuracy diagnostic performance data using predetermined data extraction sheet.
studies, and TRIPOD42 for conventional prediction models are not Again, uncertainties were resolved by a third investigator. Binary
available to DL model study. The recent publication of CONSORT- diagnostic accuracy data were extracted directly into contingency tables
AI43 and SPIRIT-AI44 guidelines are welcomed but we await which included true-positives, false-positives, true-negatives, and false-
disease-specific DL guidelines. Furthermore, we would encourage negatives. These were then used to calculate pooled sensitivity, pooled
organizations to develop diverse teams, combining computer specificity, and other metrics. If a study provided multiple contingency
scientists and clinicians to solve clinical problems using DL tables for the same or for different DL algorithms, we assumed that they
were independent of each other.
algorithms. Even though DL algorithms appear like black boxes
with unexplainable decision-making outputs, these technologies
need to be discussed for development and require additional Quality assessment
clinical information45,46. Finally, medical computer vision algo- The risk of bias and applicability concerns of the included studies were
rithms do not exist in a vacuum, we must integrate DL algorithms assessed by the three investigators using the quality assessment of
into routine clinical workflows and across entire healthcare diagnostic accuracy studies 2 (QUADAS-2) tool48.
systems to assist doctors and augment decision-making. There-
fore, it is crucial that clinicians understand the information each Statistical analysis
algorithm provides and how this can be integrated into clinical Hierarchical summary receiver operating characteristic (SROC) curves were
decisions which enhance efficiency without absorbing resources. used to assess the diagnostic performance of DL algorithms. 95%
For any algorithm to be incorporated into existing workflows it has confidence intervals (CI) and prediction regions were generated around
averaged sensitivity, specificity, and AUC estimates in SROC figures. Further meta-analysis was performed to report the best accuracy in studies with multiple DL algorithms from contingency tables. Heterogeneity was assessed using the I² statistic. We also conducted subgroup meta-analyses and regression analyses to explore potential sources of heterogeneity. The random effects model was implemented because of the assumed differences between studies. Publication bias was assessed visually using funnel plots.
Four separate meta-analyses were conducted: (1) according to validation type, DL algorithms were categorized as either internal or external. Internal validation meant that studies were validated using an in-sample dataset, while external validation studies were validated using an out-of-sample dataset; (2) according to cancer type, i.e., breast or cervical cancer; (3) according to imaging modalities, such as mammography, ultrasound, cytology, and colposcopy; (4) according to the pooled performance for DL algorithms versus human clinicians using the same dataset.
Meta-analysis was only performed where there were at least three original studies. STATA (version 15.1) and SAS (version 9.4) were used for data analyses. The threshold for statistical significance was set at p < 0.05, and all tests were two-sided.

Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The search strategy and aggregated data contributing to the meta-analysis are available in the appendix.

Received: 24 June 2021; Accepted: 22 December 2021;

REFERENCES
1. Arbyn, M. et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis. Lancet Glob. Health 8, e191–e203 (2020).
2. Li, N. et al. Global burden of breast cancer and attributable risk factors in 195 countries and territories, from 1990 to 2017: results from the Global Burden of Disease Study 2017. J. Hematol. Oncol. 12, 140 (2019).
3. Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 71, 209–249 (2021).
4. Ginsburg, O. et al. Changing global policy to deliver safe, equitable, and affordable care for women’s cancers. Lancet 389, 871–880 (2017).
5. Allemani, C. et al. Global surveillance of trends in cancer survival 2000-14 (CONCORD-3): analysis of individual records for 37 513 025 patients diagnosed with one of 18 cancers from 322 population-based registries in 71 countries. Lancet 391, 1023–1075 (2018).
6. Shah, S. C., Kayamba, V., Peek, R. M. Jr. & Heimburger, D. Cancer Control in Low- and Middle-Income Countries: Is It Time to Consider Screening? J. Glob. Oncol. 5, 1–8 (2019).
7. Wentzensen, N., Chirenje, Z. M. & Prendiville, W. Treatment approaches for women with positive cervical screening results in low- and middle-income countries. Prev. Med 144, 106439 (2021).
8. Britt, K. L., Cuzick, J. & Phillips, K. A. Key steps for effective breast cancer prevention. Nat. Rev. Cancer 20, 417–436 (2020).
9. Brisson, M. et al. Impact of HPV vaccination and cervical screening on cervical cancer elimination: a comparative modelling analysis in 78 low-income and lower-middle-income countries. Lancet 395, 575–590 (2020).
10. Yang, L. et al. Performance of ultrasonography screening for breast cancer: a systematic review and meta-analysis. BMC Cancer 20, 499 (2020).
11. Conti, A., Duggento, A., Indovina, I., Guerrisi, M. & Toschi, N. Radiomics in breast cancer classification and prediction. Semin Cancer Biol. 72, 238–250 (2021).
12. Xue, P., Ng, M. T. A. & Qiao, Y. The challenges of colposcopy for cervical cancer screening in LMICs and solutions by artificial intelligence. BMC Med 18, 169 (2020).
13. William, W., Ware, A., Basaza-Ejiri, A. H. & Obungoloch, J. A review of image analysis and machine learning techniques for automated cervical cancer screening from pap-smear images. Comput Methods Prog. Biomed. 164, 15–22 (2018).
14. Muse, E. D. & Topol, E. J. Guiding ultrasound image capture with artificial intelligence. Lancet 396, 749 (2020).
15. Mandal, R. & Basu, P. Cancer screening and early diagnosis in low and middle income countries: Current situation and future perspectives. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz 61, 1505–1512 (2018).
16. Torode, J. et al. National action towards a world free of cervical cancer for all women. Prev. Med 144, 106313 (2021).
17. Coiera, E. The fate of medicine in the time of AI. Lancet 392, 2331–2332 (2018).
18. Kleppe, A. et al. Designing deep learning studies in cancer diagnostics. Nat. Rev. Cancer 21, 199–211 (2021).
19. Nagendran, M. et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368, m689 (2020).
20. Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit Med 3, 118 (2020).
21. Liu, X., Rivera, S. C., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 370, m3164 (2020).
22. Bengtsson, E. & Malm, P. Screening for cervical cancer using automated analysis of PAP-smears. Comput Math. Methods Med 2014, 842037 (2014).
23. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 1, e271–e297 (2019).
24. Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med 4, 65 (2021).
25. Zheng, Q. et al. Artificial intelligence performance in detecting tumor metastasis from medical radiology imaging: A systematic review and meta-analysis. EClinicalMedicine 31, 100669 (2021).
26. Moon, J. H. et al. How much deep learning is enough for automatic identification to be reliable? Angle Orthod. 90, 823–830 (2020).
27. Beam, A. L., Manrai, A. K. & Ghassemi, M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA 323, 305–306 (2020).
28. Trister, A. D. The Tipping Point for Deep Learning in Oncology. JAMA Oncol. 5, 1429–1430 (2019).
29. Kim, D. W., Jang, H. Y., Kim, K. W., Shin, Y. & Park, S. H. Design Characteristics of Studies Reporting the Performance of Artificial Intelligence Algorithms for Diagnostic Analysis of Medical Images: Results from Recently Published Papers. Korean J. Radio. 20, 405–410 (2019).
30. England, J. R. & Cheng, P. M. Artificial Intelligence for Medical Image Analysis: A Guide for Authors and Reviewers. AJR Am. J. Roentgenol. 212, 513–519 (2019).
31. Cook, T. S. Human versus machine in medicine: can scientific literature answer the question? Lancet Digit Health 1, e246–e247 (2019).
32. Simon, A. B., Vitzthum, L. K. & Mell, L. K. Challenge of Directly Comparing Imaging-Based Diagnoses Made by Machine Learning Algorithms With Those Made by Human Clinicians. J. Clin. Oncol. 38, 1868–1869 (2020).
33. Altman, D. G. & Royston, P. What do we mean by validating a prognostic model? Stat. Med 19, 453–473 (2000).
34. Kim, D. W. et al. Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging. PLoS One 15, e0238908 (2020).
35. Becker, A. S. et al. Classification of breast cancer in ultrasound imaging using a generic deep learning analysis software: a pilot study. Br. J. Radio. 91, 20170576 (2018).
36. Becker, A. S. et al. Deep Learning in Mammography: Diagnostic Accuracy of a Multipurpose Image Analysis Software in the Detection of Breast Cancer. Invest Radio. 52, 434–440 (2017).
37. Wang, F., Casalino, L. P. & Khullar, D. Deep Learning in Medicine-Promise, Progress, and Challenges. JAMA Intern Med 179, 293–294 (2019).
38. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med 25, 44–56 (2019).
39. Xue, P. et al. Development and validation of an artificial intelligence system for grading colposcopic impressions and guiding biopsies. BMC Med 18, 406 (2020).
40. Yuan, C. et al. The application of deep learning based diagnostic system to cervical squamous intraepithelial lesions recognition in colposcopy images. Sci. Rep. 10, 11639 (2020).
41. Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
42. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 350, g7594 (2015).
43. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med 26, 1364–1374 (2020).
44. Cruz Rivera, S., Liu, X., Chan, A. W., Denniston, A. K. & Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med 26, 1351–1363 (2020).
45. Huang, S. C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit Med 3, 136 (2020).
46. Guo, H. et al. Heat map visualization for electrocardiogram data analysis. BMC Cardiovasc Disord. 20, 277 (2020).
47. Moher, D., Liberati, A., Tetzlaff, J. & Altman, D. G. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 339, b2535 (2009).
48. Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern Med 155, 529–536 (2011).
49. Xiao, M. et al. Diagnostic Value of Breast Lesions Between Deep Learning-Based Computer-Aided Diagnosis System and Experienced Radiologists: Comparison the Performance Between Symptomatic and Asymptomatic Patients. Front Oncol. 10, 1070 (2020).
50. Zhang, X. et al. Evaluating the Accuracy of Breast Cancer and Molecular Subtype Diagnosis by Ultrasound Image Deep Learning Model. Front Oncol. 11, 623506 (2021).
51. Zhou, J. et al. Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images. J. Magn. Reson Imaging 50, 1144–1151 (2019).
52. Agnes, S. A., Anitha, J., Pandian, S. I. A. & Peter, J. D. Classification of Mammogram Images Using Multiscale all Convolutional Neural Network (MA-CNN). J. Med Syst. 44, 30 (2019).
53. Tanaka, H., Chiu, S. W., Watanabe, T., Kaoku, S. & Yamaguchi, T. Computer-aided diagnosis system for breast ultrasound images using deep learning. Phys. Med Biol. 64, 235013 (2019).
54. Kyono, T., Gilbert, F. J. & van der Schaar, M. Improving Workflow Efficiency for Mammography Using Machine Learning. J. Am. Coll. Radio. 17, 56–63 (2020).
55. Qi, X. et al. Automated diagnosis of breast ultrasonography images using deep neural networks. Med Image Anal. 52, 185–198 (2019).
56. Salim, M. et al. External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncol. 6, 1581–1588 (2020).
57. Zhang, Q. et al. Dual-mode artificially-intelligent diagnosis of breast tumours in shear-wave elastography and B-mode ultrasound using deep polynomial networks. Med Eng. Phys. 64, 1–6 (2019).
58. Wang, Y. et al. Breast Cancer Classification in Automated Breast Ultrasound Using Multiview Convolutional Neural Network with Transfer Learning. Ultrasound Med Biol. 46, 1119–1132 (2020).
59. Li, Y., Wu, W., Chen, H., Cheng, L. & Wang, S. 3D tumor detection in automated breast ultrasound using deep convolutional neural network. Med Phys. 47, 5669–5680 (2020).
60. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
61. Shen, L. et al. Deep Learning to Improve Breast Cancer Detection on Screening Mammography. Sci. Rep. 9, 12495 (2019).
62. Suh, Y. J., Jung, J. & Cho, B. J. Automated Breast Cancer Detection in Digital Mammograms of Various Densities via Deep Learning. J. Pers. Med 10, 211 (2020).
63. O'Connell, A. M. et al. Diagnostic Performance of An Artificial Intelligence System in Breast Ultrasound. J. Ultrasound Med. 41, 97–105 (2021).
64. Rodriguez-Ruiz, A. et al. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J. Natl Cancer Inst. 111, 916–922 (2019).
65. Adachi, M. et al. Detection and Diagnosis of Breast Cancer Using Artificial Intelligence Based assessment of Maximum Intensity Projection Dynamic Contrast-Enhanced Magnetic Resonance Images. Diagnostics (Basel) 10, 330 (2020).
66. Samala, R. K. et al. Multi-task transfer learning deep convolutional neural network: application to computer-aided diagnosis of breast cancer on mammograms. Phys. Med Biol. 62, 8894–8908 (2017).
67. Schaffter, T. et al. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw. Open 3, e200265 (2020).
68. Kim, H. E. et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health 2, e138–e148 (2020).
69. Wang, F. et al. Study on automatic detection and classification of breast nodule using deep convolutional neural network system. J. Thorac. Dis. 12, 4690–4701 (2020).
70. Yu, T. F. et al. Deep learning applied to two-dimensional color Doppler flow imaging ultrasound images significantly improves diagnostic performance in the classification of breast masses: a multicenter study. Chin. Med J. (Engl.) 134, 415–424 (2021).
71. Sasaki, M. et al. Artificial intelligence for breast cancer detection in mammography: experience of use of the ScreenPoint Medical Transpara system in 310 Japanese women. Breast Cancer 27, 642–651 (2020).
72. Zhang, C., Zhao, J., Niu, J. & Li, D. New convolutional neural network model for screening and diagnosis of mammograms. PLoS One 15, e0237674 (2020).
73. Bao, H. et al. The artificial intelligence-assisted cytology diagnostic system in large-scale cervical cancer screening: A population-based cohort study of 0.7 million women. Cancer Med 9, 6896–6906 (2020).
74. Holmström, O. et al. Point-of-Care Digital Cytology With Artificial Intelligence for Cervical Cancer Screening in a Resource-Limited Setting. JAMA Netw. Open 4, e211740 (2021).
75. Cho, B. J. et al. Classification of cervical neoplasms on colposcopic photography using deep learning. Sci. Rep. 10, 13652 (2020).
76. Bao, H. et al. Artificial intelligence-assisted cytology for detection of cervical intraepithelial neoplasia or invasive cancer: A multicenter, clinical-based, observational study. Gynecol. Oncol. 159, 171–178 (2020).
77. Hu, L. et al. An Observational Study of Deep Learning and Automated Evaluation of Cervical Images for Cancer Screening. J. Natl Cancer Inst. 111, 923–932 (2019).
78. Hunt, B. et al. Cervical lesion assessment using real-time microendoscopy image analysis in Brazil: The CLARA study. Int J. Cancer 149, 431–441 (2021).
79. Wentzensen, N. et al. Accuracy and Efficiency of Deep-Learning-Based Automation of Dual Stain Cytology in Cervical Cancer Screening. J. Natl Cancer Inst. 113, 72–79 (2021).
80. Yu, Y., Ma, J., Zhao, W., Li, Z. & Ding, S. MSCI: A multistate dataset for colposcopy image classification of cervical cancer screening. Int J. Med Inf. 146, 104352 (2021).

ACKNOWLEDGEMENTS
This study was supported by CAMS Innovation Fund for Medical Sciences (Grant #: CAMS 2021-I2M-1-004).

AUTHOR CONTRIBUTIONS
P.X., Y.J., and Y.Q. conceptualised the study, P.X., J.W., D.Q., and H.Y. designed the study, extracted data, conducted the analysis and wrote the manuscript. P.X. and S.S. revised the manuscript. All authors approved the final version of the manuscript and take accountability for all aspects of the work. P.X. and J.W. contributed equally to this article.

COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-022-00559-z.
Correspondence and requests for materials should be addressed to Yu Jiang or Youlin Qiao.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2022