Deep Learning in Image-Based Breast and Cervical Cancer Detection: A Systematic Review and Meta-Analysis


www.nature.com/npjdigitalmed

ARTICLE OPEN

Peng Xue1,5, Jiaxu Wang1,5, Dongxu Qin1, Huijiao Yan2, Yimin Qu1, Samuel Seery3,4, Yu Jiang1✉ and Youlin Qiao1✉

Accurate early detection of breast and cervical cancer is vital for treatment success. Here, we conduct a meta-analysis to assess the
diagnostic performance of deep learning (DL) algorithms for early breast and cervical cancer identification. Four subgroups are also
investigated: cancer type (breast or cervical), validation type (internal or external), imaging modalities (mammography, ultrasound,
cytology, or colposcopy), and DL algorithms versus clinicians. Thirty-five studies are deemed eligible for systematic review, 20 of
which are meta-analyzed, with a pooled sensitivity of 88% (95% CI 85–90%), specificity of 84% (79–87%), and AUC of 0.92
(0.90–0.94). Acceptable diagnostic performance with analogous DL algorithms was highlighted across all subgroups. Therefore, DL
algorithms could be useful for detecting breast and cervical cancer using medical imaging, having equivalent performance to
human clinicians. However, this tentative assertion is based on studies with relatively poor designs and reporting, which likely
caused bias and overestimated algorithm performance. Evidence-based, standardized guidelines around study methods and
reporting are required to improve the quality of DL research.
npj Digital Medicine (2022)5:19 ; https://doi.org/10.1038/s41746-022-00559-z

1Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China. 2National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China. 3School of Humanities and Social Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China. 4Faculty of Health and Medicine, Division of Health Research, Lancaster University, Lancaster LA1 4YW, United Kingdom. 5These authors contributed equally: Peng Xue, Jiaxu Wang. ✉email: [email protected]; [email protected]

Published in partnership with Seoul National University Bundang Hospital

INTRODUCTION
Female breast and cervical cancer remain major contributors to the burden of cancer1,2. The World Health Organization (WHO) reported that approximately 2.86 million new cases (14.8% of all cancer cases) and 1.03 million deaths (10.3% of all cancer deaths) were recorded worldwide in 20203. This disproportionately affects women, especially in low- and middle-income countries (LMICs), which can be largely attributed to more advanced-stage diagnoses, limited access to early diagnostics, and suboptimal treatment4,5. Population-based cancer screening in high-income countries might not be as effective in LMICs, due to limited resources for treatment and palliative care6,7. Integrative screening for cancer is a complex procedure that needs to take biological and social determinants, as well as ethical constraints, into consideration, and, as is already known, early detection of breast and cervical cancers is associated with improved prognosis and survival8,9. Therefore, it is vital to select the most accurate and reliable technologies that are capable of identifying early symptoms.

Medical imaging plays an essential role in tumor detection, especially within progressively digitized cancer care services. For example, mammography and ultrasound, as well as cytology and colposcopy, are commonly used in clinical practice10–14. However, fragmented health systems in LMICs may lack the infrastructure and perhaps the manpower required to ensure high-quality screening, diagnosis, and treatment. This hinders the universality of the traditional detection technologies mentioned above, which require sophisticated training15. Furthermore, there may be substantial inter- and intraoperator variability, which affects both machine and human performance. Therefore, the interpretation of medical imaging is vulnerable to human error. Of course, experienced doctors tend to be more accurate, although their expertise is not always readily available for marginalized populations or for those living in remote areas. Resource-based testing and deployment of effective interventions together could reduce cancer morbidity and mortality in LMICs16. In line with this, an ideal detection technology for LMICs should at least have low training needs.

Deep learning (DL), as a subset of artificial intelligence (AI), could be applied to medical imaging and has shown promise in automatic detection17,18. While media headlines tend to overemphasize the polarization of DL model findings19, few have demonstrated inferiority or superiority. However, the Food and Drug Administration (FDA) has approved a select number of DL-based diagnosis tools for clinical practice, even though further critical appraisal and independent quality assessments are pending20,21. To date, there are few medical imaging specialty-specific systematic reviews such as this, which assess the diagnostic performance of DL algorithms, particularly in breast and cervical cancer.

RESULTS
Study selection and characteristics
Our search initially identified 2252 records, of which 2028 were screened after removing 224 duplicates. 1957 were also excluded as they did not fulfil our predetermined inclusion criteria. We assessed 71 full-text articles, and a further 36 articles were excluded. 25 of these articles focused on breast cancer, and 10 were on cervical cancer (see Fig. 1). Study characteristics are summarized in Tables 1–3.

Thirty-three studies utilized retrospective data. Only two studies used prospective data. Two studies also used data from open access sources. No studies reported a prespecified sample size calculation. Eight studies excluded low-quality images, while 27 studies did not report anything about image quality. Eleven studies performed external validation using an out-of-sample dataset, while the others performed internal validation using an in-sample dataset. Twelve studies compared DL algorithms against human clinicians using the same dataset. Additionally, medical imaging modalities were categorized into cytology (n = 4), colposcopy (n = 4), cervicography (n = 1), microendoscopy (n = 1), mammography (n = 12), ultrasound (n = 11), and MRI (n = 2).

Fig. 1 PRISMA flowchart of study selection. Displayed is the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow of the search methodology and literature selection process.

Pooled performance of DL algorithms
Among the 35 studies in this sample, 20 provided sufficient data to create contingency tables for calculating diagnostic performance and were therefore included for synthesis at the meta-analysis stage. Hierarchical SROC curves for these studies (i.e., 55 contingency tables) are provided in Fig. 2a. When averaging across studies, the pooled sensitivity and specificity were 88% (95% CI 85–90%) and 84% (95% CI 79–87%), respectively, with an AUC of 0.92 (95% CI 0.90–0.94) for all DL algorithms.

Most studies used more than one DL algorithm to report diagnostic performance; we therefore also pooled the highest-accuracy DL algorithm from each included study, giving 20 contingency tables. The pooled sensitivity and specificity were 89% (86–92%) and 85% (79–90%), respectively, with an AUC of 0.93 (0.91–0.95). Please see Fig. 2b for further details.

Subgroup meta-analyses
Four separate meta-analyses were conducted:
I. Validation types—15 studies with 40 contingency tables included in the meta-analysis were validated with an in-sample dataset and had a pooled sensitivity of 89% (87–91%) and a pooled specificity of 83% (78–86%), with an AUC of 0.93 (0.91–0.95); see Fig. 3a for details. Only 8 studies with 15 contingency tables performed an external validation, for which the pooled sensitivity and specificity were 83% (77–88%) and 85% (73–92%), respectively, with an AUC of 0.90 (0.87–0.92); see Fig. 3b.
II. Cancer types—10 studies with 36 contingency tables targeting breast cancer had a pooled sensitivity of 90% (87–92%) and specificity of 85% (80–89%), with an AUC of 0.94 (0.91–0.96); see Fig. 4a. 10 studies with 19 contingency tables considered cervical cancer, with a pooled sensitivity and specificity of 83% (78–88%) and 80% (70–88%), respectively, with an AUC of 0.89 (0.86–0.91); see Fig. 4b for details.
III. Imaging modalities—4 mammography studies with 15 contingency tables had a pooled sensitivity of 87% (82–91%) and a pooled specificity of 88% (79–93%), with an AUC of 0.93 (0.91–0.95); see Fig. 5a. There were 4 ultrasound studies with 17 contingency tables, with a pooled sensitivity of 91% (89–93%), a pooled specificity of 85% (80–89%), and an AUC of 0.95 (0.93–0.96); see Fig. 5b. There were 4 cytology studies with 6 contingency tables, which had a pooled sensitivity of 87% (82–90%), a pooled specificity of 86% (68–95%), and an AUC of 0.91 (0.88–0.93); see Fig. 5c. There were 4 colposcopy studies with 11 contingency tables, which had a pooled sensitivity of 78% (69–84%), a pooled specificity of 78% (63–87%), and an AUC of 0.84 (0.81–0.87); see Fig. 5d.
IV. DL algorithms versus human clinicians—of the 20 included studies, 11 compared diagnostic performance between DL algorithms and human clinicians using the same dataset, with 29 contingency tables for DL algorithms and 18 contingency tables for human clinicians. The pooled sensitivity was 87% (84–90%) for DL algorithms, while human clinicians had 88% (81–93%). The pooled specificity was 83% (76–88%) for DL algorithms and 82% (72–88%) for human clinicians. The AUC was 0.92 (0.89–0.94) for DL algorithms and 0.92 (0.89–0.94) for human clinicians (Fig. 6a, b).

Heterogeneity analysis
All included studies found that DL algorithms are useful for the detection of breast and cervical cancer using medical imaging when compared with histopathological analysis as the gold standard; however, extreme heterogeneity was observed. Sensitivity (SE) had an I² = 97.65%, while specificity (SP) had an I² = 99.90% (p < 0.0001); see Fig. 7.

A funnel plot was produced to assess publication bias. The p value of 0.41 suggests there is no publication bias, although studies were widely dispersed around the regression line. See Supplementary Fig. 3 for further details. In order to identify the source or sources of such extreme heterogeneity, we conducted subgroup analyses and found:
I. Validation types—internal validation (SE, I² = 97.60%; SP, I² = 99.19%; p < 0.0001) and external validation (SE, I² = 96.15%; SP, I² = 99.96%; p < 0.0001). See Supplementary Fig. 4.
II. Cancer types—breast cancer (SE, I² = 95.84%; SP, I² = 99.86%; p < 0.0001) and cervical cancer (SE, I² = 98.16%; SP, I² = 99.89%; p < 0.0001). Please see Supplementary Fig. 5 for further details.
III. Imaging modalities—mammography (SE, I² = 97.01%; SP, I² = 99.93%; p < 0.0001), ultrasound (SE, I² = 86.49%; SP, I² = 96.06%; p < 0.0001), cytology (SE, I² = 89.97%; SP, I² = 99.90%; p < 0.0001), and colposcopy (SE, I² = 98.12%; SP, I² = 99.59%; p < 0.0001); see Supplementary Fig. 6.
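The sensitivity, specificity, and I² figures reported throughout these results are derived from per-study 2×2 contingency tables (TP, FP, FN, TN). The review itself fits hierarchical bivariate SROC models, which the following sketch does not reproduce; it is only a minimal, invented-numbers illustration of how a contingency table yields sensitivity and specificity, how logit-scale inverse-variance pooling produces a summary estimate, and how Cochran's Q yields the I² statistic used in the heterogeneity analysis.

```python
import math

# Hypothetical 2x2 contingency tables: (TP, FP, FN, TN) per study.
# These numbers are invented for illustration only.
tables = [(90, 12, 10, 88), (170, 30, 30, 170), (45, 8, 5, 42)]

def sens_spec(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def pool_logit(props, ns):
    """Fixed-effect inverse-variance pooling on the logit scale, plus
    Cochran's Q and the I^2 statistic (returned as a fraction)."""
    logits, weights = [], []
    for p, n in zip(props, ns):
        logits.append(math.log(p / (1 - p)))
        weights.append(n * p * (1 - p))  # 1 / delta-method variance of the logit
    mean = sum(w * x for w, x in zip(weights, logits)) / sum(weights)
    q = sum(w * (x - mean) ** 2 for w, x in zip(weights, logits))
    df = len(props) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    pooled = 1.0 / (1.0 + math.exp(-mean))  # back-transform to a proportion
    return pooled, i2

# Pool sensitivity over the diseased group (TP + FN); specificity would
# pool analogously over the non-diseased group (TN + FP).
sens = [sens_spec(*t)[0] for t in tables]
n_diseased = [tp + fn for tp, _, fn, _ in tables]
pooled_se, i2_se = pool_logit(sens, n_diseased)
print(f"pooled sensitivity = {pooled_se:.3f}, I^2 = {100 * i2_se:.1f}%")
```

Here I² = max(0, (Q − df)/Q) expresses the share of between-study variability beyond what chance would produce; values near the 97–99% reported above mean almost none of the spread between contingency tables is sampling error. Because the review's pooled estimates come from hierarchical models that treat sensitivity and specificity jointly, this simplified fixed-effect sketch will not reproduce its exact numbers.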
Table 1. Study design and basic demographics.

First author and year Participants


Inclusion criteria Exclusion criteria N Mean or median age
(SD; range)

Xiao et al.49 * Had breast lesions clearly visualized by ultrasound; Underwent biopsy and Patients who were pregnant or lactating; patients who had breast biopsy 389 46.86 (13.03; 19–84)
had pathological results; provided informed consent. or were undergoing neoadjuvant chemotherapy or radiotherapy.
Zhang et al.50 * NR Pathological results were neither benign nor malignant; Patients with BI- 2062 NR
RADS 1 or 2 and abnormal mammography results; patients who were
diagnosed with Paget’s disease but had no masses in the breasts.
Zhou et al.51 * Images were scanned under the same MR protocol; The lesion had Normal or typical background parenchyma enhancement in bilateral 1537 47.5 (11.8; NR)
complete pathology results; Imaging reports had definite BI-RADS breasts was eliminated.
category diagnosed; Lesions were a) solitary in one breast or b) in both
breasts with the same BI-RADS and pathological results.
Agnes et al.52 NR NR NR NR
Tanaka et al.53 * women with breast masses who were referred for further examination Typical cysts; mass lesions ≥ 4.5 cm diameter NR NR
after their initial screening examination of breast cancer and then
underwent ultrasonography and pathological examination.
Becker et al.35 Patients with postsurgical scars, initially indeterminate, or malignant Patients with normal breast ultrasound, and all patients with lesions 632 53 (15; 15–91)
lesions with histological diagnoses or 2 years follow up. classified as clearly benign, except for patients with prior breast-
conserving surgical treatment.
Kyono et al.54 Women recalled after routine breast screening between ages of 47–73 or NR 2000 NR (NR; 47–73)
women with a family history of breast cancer attending annual screening
between ages of 40–49.
Qi et al.55 * NR NR 2047 NR



Salim et al.56 * Women aged 40–74 years who were diagnosed as having breast cancer, With a cancer diagnosis that had ≥ 12 months between the examination 8805 54.5 (16.1; 40–74)
who had a complete screening examination prior to diagnosis, had no and diagnosis date.
prior breast cancer, did not have implants.
Zhang et al.57 NR NR 121 NR
Wang et al.58 NR NR 263 51.4 (9.8; 28–76)
Li et al.59 NR NR 124 NR
P. Xue et al.

Mckinney et al.60 NR Cases without follow-up were excluded from the test set. 28953 NR
Shen et al.61 NR NR 1249 NR
Suh et al.62 * 18 years or older and not having a history of previous breast surgery. Subjects without medical records or pathological confirmation for a 1501 48.9 (11.1; NR)
suspicious breast lesion, missing mammograms, or having poor-quality
mammograms.
O’Connell et al.63 Adult females or males recommended for ultrasound-guided breast lesion Unable to read and understand English at the University of Rochester; 299 52.3 (NR; NR)
biopsy or ultrasound follow-up with at least one suspicious lesion; age ≥ patients with diagnosis of breast cancer in the same quadrant; unwilling
18 years. to undergo study procedures and informed consent.
Ruiz et al.64 Women presenting for screening with no symptoms or concerns. Women with implants and/or a history of breast cancer. 240 62 (53–66; 39–89)
Adachi et al.65 * Patients who underwent DCE breast MRI; patients who were diagnosed Patients who were treated with breast surgery, hormonal therapy, 371 NR
with benign or malignant lesions by pathology or a follow-up examination chemotherapy, or radiation therapy; age ≤ 20 years.
at more than one year.
Samala et al.66 NR NR 2242 51.7 (NR; 24–82)
Schaffter et al.67 NR NR 153588 56.1 (NR; NR)
Kim et al.68 * NR NR 172230 50.3 (10; NR)
Wang et al.69 All nodules of patients were newly discovered and untreated; patients had Non-nodular breast disease; ABUS artifact was obvious and the poor 264 54.31 (9.68; 37–75)
undertaken ABUS scan; definite pathological benign and malignant; the images quality; ABUS was not available; patients received chemotherapy,
image quality of ABUS examination was good enough to show the entire radiation therapy or surgical local resection before ABUS scan.
margin of the lesion, no matter distinct or indistinct.
Yu et al.70 Pathological results clear; at least 2D-mode US images available, but preferably CDFI and PW mode images; without blurred images or color overflow. A foreign body in the breast; other metastatic tumors or co-infection with HIV; measurement markers, arrows, or puncture needles within the image. 3623 42.5 (NR; 11–95)

Table 1 continued

First author and year Participants


Inclusion criteria Exclusion criteria N Mean or median age
(SD; range)

Sasaki et al.71 * Patients undergone bilateral mammography; patients in whom NR 310 50 (NR; 20–93)
ultrasonography had established the presence or absence of a lesion;
patients in whom a lesion, if present, had been diagnosed as being benign
or malignant by cytology or histology; normal patients in whom
ultrasonography had revealed no lesion and who had been followed up

for at least 1 year.
Zhang et al.72 NR NR 2620 NR
Bao et al.73 * Aged 20–65 years participated in the program. NR 703103 NR (NR; 20–65)
Holmström et al.74 * Nonpregnant aged between 18-64 years, confirmed HIV positivity, and NR 740 41.8 (10.3; 18–64)
signed informed consent.
Cho et al.75 * Age ≥18 years, not pregnant, had no history of cervical surgery, and had NR 791 NR (NR; 18–94)
Pap test results. All lesions were pathologically confirmed by conization
biopsy, and normal were defined as those with normal Pap test results.
Bao et al.76 * Aged 25–64 years; samples were processed with the liquid-based method, tested for HPV, and diagnosed by colposcopy-directed biopsy. NR 2145 38.4 (6.7; 25–46)
Hu et al.77 * NR No image, multiple colpo sessions, inadequate histology. 9406 35 (NR;18–94)
Hunt et al.78 * Abnormal cervical screening test; age ≥18 years; intact uterine cervix; not pregnant; no known allergy to the fluorescent dye used for HRME imaging; does not belong to an indigenous Brazilian population. Unable to provide informed consent; prior treatment history; pregnant; other clinical considerations. 1486 40 (12.1; NR)
Wentzensen et al.79 * Women aged ≥18 years referred to colposcopy. NR 4253 NR
P. Xue et al.

Xue et al.39 * Aged 24-65 years with indications for the need for colposcopy imaging Empty or invalid images, low quality, unsatisfactory images, 19435 NR (NR; 24–65)
and biopsy, and those who were pathologically confirmed. information loss.
Yu et al.80 * NR NR 679 NR
Yuan et al.40 * NR Without complete clinical and pathological information; without biopsies; 22330 NR (NR; 20–66)
pathologically diagnosed as invasive cervical cancer or glandular
intraepithelial lesions; poor-quality colposcopy images.
DCE dynamic contrast enhanced, NR not reported, MRI magnetic resonance imaging, BI-RADS breast imaging reporting and data system, MR magnetic resonance, ABUS automated breast ultrasound, CDFI color
doppler flow imaging, PW pulsed wave, HIV human immunodeficiency virus, HRME high-resolution microendoscopy, DS dual stained.
*20 studies included in the meta-analysis.



Table 2. Methods of model training and validation.

First author and year Focus Reference standard Type of internal validation External validation DL versus clinician
Xiao et al.49 * Breast cancer Histopathology NR Yes Yes
Zhang et al.50 * Breast cancer Histopathology, immunohistochemistry Random split-sample validation Yes No
Zhou et al.51 * Breast cancer Histopathology, expert consensus Random split-sample validation No Yes
Agnes et al.52 Breast cancer Histopathology NR No No
Tanaka et al.53 * Breast cancer Histopathology, two-year follow-up Random split-sample validation No No
Becker et al.35 Breast cancer Histopathology, two-year follow-up Random split-sample validation No No
Kyono et al.54 Breast cancer Histopathology, follow-up, expert consensus Ten-fold cross validation No No
Qi et al.55 * Breast cancer Histopathology Random split-sample validation No No
Salim et al.56 * Breast cancer Histopathology, two-year follow-up NR Yes Yes
Zhang et al.57 Breast cancer Histopathology NR No No
Wang et al.58 Breast cancer Histopathology, two-year follow-up Five-fold cross validation No No
Li et al.59 Breast cancer Histopathology Five-fold cross validation No No
Mckinney et al.60 Breast cancer Histopathology, multiple years of follow-up NR Yes No
Shen et al.61 Breast cancer Histopathology Random split-sample validation No No
Suh et al.62 * Breast cancer Histopathology Random split-sample validation No No
O’Connell et al.63 Breast cancer Histopathology, two-year follow-up NR Yes No
Ruiz et al.64 Breast cancer Histopathology, one-year follow-up NR Yes Yes



Adachi et al.65 * Breast cancer Histopathology, at least one-year follow-up Random split-sample validation No Yes
Samala et al.66 Breast cancer Histopathology N-fold cross validation No No
Schaffter et al.67 Breast cancer Histopathology, follow-up Random split-sample validation Yes No
Kim et al.68 * Breast cancer Histopathology, at least one-year follow-up Random split-sample validation Yes Yes
Wang et al.69 Breast cancer Histopathology Random split-sample validation No No

Yu et al.70 Breast cancer Histopathology Random split-sample validation No No


Sasaki et al.71 * Breast cancer Histopathology, cytology, at least one-year follow-up NR Yes Yes
Zhang et al.72 Breast cancer Histopathology NR No No
Bao et al.73 * Cervical cancer Histopathology NR Yes Yes
Holmström et al.74 * Cervical cancer Histopathology NR No Yes
Cho et al.75 * Cervical cancer Histopathology Random split-sample validation No No
Bao et al.76 * Cervical cancer Histopathology NR Yes No
Hu et al.77 * Cervical cancer Histopathology Random split-sample validation No No
Hunt et al.78 * Cervical cancer Histopathology Random split-sample validation No Yes
Wentzensen et al.79 * Cervical cancer Histopathology Random split-sample validation No Yes
Xue et al.39 * Cervical cancer Histopathology Random split-sample validation No Yes
Yu et al.80 * Cervical cancer Histopathology Random split-sample validation No No
Yuan et al.40 * Cervical cancer Histopathology Random split-sample validation No No
NR not reported, DL deep learning. *20 studies included in the meta-analysis.



Table 3. Indicators, algorithms and data sources.

First author Indicator definition Algorithm Data source


and year
Device Exclusion of Heatmap Algorithm Transfer Source of data Number of Data range Open
poor-quality provided architecture learning images for access data
imaging applied training/
internal/
external



Xiao et al.49 * Ultrasound NR No CNN No Prospective study, data from Peking NR/NR/451 2018.01–2018.12 No
Union Medical College Hospital.
Zhang et al.50 * Ultrasound Yes No CNN Yes Retrospective study, training data from 2822/447/210 NR No
Harbin Medical University Cancer
Hospital; external data from the First
Affiliated Hospital of Harbin Medical
University.
Zhou et al.51 * MRI Yes Yes DenseNet No Retrospective study, data from Chinese 1230/307/NR 2013.03–2016.12 No
University of Hong Kong.
Agnes et al.52 Mammography NR No MA-CNN No Retrospective study, data from mini- 322/NR/NR NR Yes
Mammographic Image Analysis Society
database.
Tanaka et al.53 * Ultrasound NR Yes VGG19, ResNet152 Yes Retrospective study, data from Japan 1382/154/NR 2011.11–2015.12 No
Association of Breast Thyroid Sonology.
Becker et al.35 Ultrasound NR Yes CNN No Retrospective study, data from University Hospital of Zurich, Switzerland. 445/192/NR 2014.01–2014.12 No
Kyono et al.54 Mammography NR No CNN No Retrospective study, data from UK 1800/200/NR NR No
National Health Service Breast
Screening Program Centers.
Qi et al.55 * Ultrasound NR Yes GoogLeNet Yes Retrospective study, data from West 6786/1359/NR 2014.10–2017.08 No
China Hospital, Sichuan University.
Salim et al.56 * Mammography NR No ResNet-34, No Retrospective study, data from NR/NR/113663 2008–2015 No
MobileNet secondary analysis of a population-
based mammography screening cohort
in Swedish Cohort of Screen-
Age Women.
Zhang et al.57 Ultrasound NR No Deep polynomial No Retrospective study, data source is NR/NR/NR NR No
networks not clear.
Wang et al.58 Ultrasound NR No Inception-v3 CNN No Retrospective study, data from Jeonbuk 252/64/NR 2012.03–2018.03 No
National University Hospital.
Li et al.59 Ultrasound NR No YOLO-v3 No Retrospective study, data from Peking 3124/10812/NR 2018.10–2019.03 No
University People’s Hospital.
Mckinney et al.60 Mammography NR No CNN No Retrospective study, data 1 from two screening centers in England, data 2 from one medical center in the USA. 25856/NR/3097 2001–2018 No
Shen et al.61 Mammography NR Yes VGG, Resnet Yes Retrospective study, data from CBIS- 2102/376/NR NR Yes
DDSM website.
Suh et al.62 * Mammography Yes Yes DenseNet-169, No Retrospective study, data from Hallym 2701/301/NR 2007.02–2015.05 No
EfficientNet-B5 University Sacred Heart Hospital.



Table 3 continued
First author Indicator definition Algorithm Data source
and year
Device Exclusion of Heatmap Algorithm Transfer Source of data Number of Data range Open
poor-quality provided architecture learning images for access data
imaging applied training/
internal/
external

O’Connell et al.63 Ultrasound NR No CNN No Prospective study, data from University of Rochester and University Hospital Palermo, Italy. NR/NR/299 2018–2019 No
Ruiz et al.64 Mammography Yes No CNN No Retrospective study, data from two NR/NR/240 2013–2017 No
institutes in the US and Europe.
Adachi et al.65 * MRI NR No RetinaNet No Retrospective study, data from Tokyo 286/85/NR 2014.03–2018.10 No
Medical and Dental University hospital.
Samala et al.66 Mammography NR No ImageNet DCNN Yes Retrospective study, data from 1335/907/NR 2001–2006 No
University of Michigan Health System
and the Digital Database for Screening
Mammography.
Schaffter et al.67 Mammography NR No Faster-RCNN No Retrospective study, data from Kaiser Permanente Washington and Karolinska Institute. 100974/43257/166578 2016.09–2017.11 No
Kim et al.68 * Mammography NR Yes ResNet-34 No Retrospective study, data from five institutions in South Korea and the USA. 166968/3262/320 2000.01–2018.12 No



Wang et al.69 Ultrasound Yes No 3D U-Net No Retrospective study, data from the First 254/73/NR 2018.06–2019.05 No
Affiliated Hospital of Xi’an Jiao tong
University.
Yu et al.70 Ultrasound Yes No ResNet50, FPN No Retrospective study, data from 13 7835/7813/NR 2016.01–2019.12 No
Chinese hospitals.

Sasaki et al.71 * Mammography NR No Transpara No Retrospective study, data from Sagara NR/NR/620 2018.01–2018.10 No
Hospital Affiliated Breast Center, Japan.
Zhang et al.72 Mammography NR Yes MVNN No Retrospective study, data from Digital 5194/512/NR NR Yes
Database for Screening Mammography.
Bao et al.73 * Cytology NR No DL No Retrospective study, data from a cervical cancer screening program. 103793/NR/69906 2017.01–2018.12 No
Holmström et al.74 * Cytology NR No CNN No Retrospective study, data from a rural clinic in Kenya. 350/390/NR 2018–2019 No
Cho et al.75 * Colposcopy NR Yes Inception-Resnet-v2, No Retrospective study, data from three 675/116/NR 2015–2018 No
Resnet-152 university affiliated hospitals.
Bao et al.76 * Cytology NR No VGG16 No Retrospective study, data from eight 15083/NR/2145 2017.05–2018.10 No
tertiary hospitals in China.
Hu et al.77 * Cervicography NR Yes Faster R-CNN Yes Retrospective study, data from the Guanacaste, Costa Rica cohort. 744/8917/NR 1993–2001 No
Hunt et al.78 * Microendoscopy NR Yes CNN No Prospective study, data from Barretos Cancer Hospital. 870/616/NR NR No
Wentzensen et al.79 * Cytology NR No CNN4, Inception-v3 No Retrospective study, data from Kaiser Permanente Northern California and the University of Oklahoma. 193/409/NR 2009–2014 No

Table 3 continued
First author and year Device Exclusion of poor-quality imaging Heatmap provided Algorithm architecture Transfer learning applied Source of data Number of images for training/internal/external Data range Open access data
Xue et al.39 * Colposcopy Yes No U-Net, YOLO Yes Retrospective study, data from six multicenter hospitals across China. 77788/23479/NR 2018.01–2018.12 No
Yu et al.80 * Colposcopy NR No C-GCNN, GRU No Retrospective study, data from the First Affiliated Hospital of the University of Science and Technology of China. 3802/951/NR 2013.08–2019.05 No
Yuan et al.40 * Colposcopy Yes No ResNet, U-Net, Mask R-CNN Yes Retrospective study, data from Women’s Hospital, School of Medicine, Zhejiang University. 40194/4466/NR 2013.07–2016.09 No
NR not reported, CNN convolutional neural network, DL deep learning, YOLO you only look once, DNN deep neural network, DCNN deep convolutional neural network, MRI magnetic resonance imaging, DenseNet dense convolutional network, MA-CNN multiattention convolutional neural network, VGG visual geometry group network, ResNet deep residual network, FPN feature pyramid networks, MVNN multiview feature fusion neural network, GRU gate recurrent unit.
*20 studies included in the meta-analysis.

However, heterogeneity was not aligned to a specific subgroup, nor was it reduced to an acceptable level; all subgroup I² values remained high. Therefore, we could infer that differences in validation type, cancer type, and imaging modality were likely to have influenced DL algorithm performance for detecting breast and cervical cancer.

To further investigate this finding, we performed a meta-regression analysis with these covariates (see Supplementary Table 1). The results highlighted a statistically significant difference, which is in line with the subgroup and meta-analytical sensitivity analyses.

Quality assessment
The quality of the included studies was assessed using QUADAS-2, and a summary of findings has been provided with an appropriate diagram in the supplementary materials as Supplementary Fig. 1. A detailed assessment of each item, based on the risk of bias and applicability concern domains, has also been provided as Supplementary Fig. 2. For the patient selection domain of risk of bias, 13 studies were considered at high or unclear risk of bias due to unreported inclusion or exclusion criteria, or improper exclusions. For the index test domain, only one study was considered at high or unclear risk of bias due to having no predefined threshold, whereas the others were considered at low risk of bias.

For the reference standard domain, three studies were considered at high or unclear risk of bias due to reference standard inconsistencies. There was no mention of whether the threshold was determined in advance or whether blinding was implemented. For the flow and timing domain, five studies were considered at high or unclear risk of bias because the authors had not mentioned whether there was an appropriate time gap or whether the same gold standard was applied.

In the applicability concern domain, 12 studies were considered to have high or unclear applicability concerns in patient selection. One study also had unclear applicability in the reference standard domain; there were no applicability concerns in the index test domain.

DISCUSSION
Artificial intelligence in medical imaging is without question improving; however, we must subject emerging knowledge to the same rigorous testing we would apply to any other diagnostic procedure. Deep learning could reduce over-reliance on experienced clinicians and could, with relative ease, be extended to rural communities and LMICs. While this relatively inexpensive approach may help to bridge inequality gaps across healthcare systems generally, evidence is increasingly highlighting the value of deep learning in cancer diagnostics and care. Within the field of female cancer diagnosis, one of the representative technologies is computer-assisted cytology image diagnosis, such as the FDA-approved PAPNET and AutoPap systems, which date back to at least the 1970s22. As rapid progress in AI technology is made, such technologies are becoming an increasingly important element of automated image-based cytology analysis systems. These technologies have the potential to reduce the time spent on, and to improve, the cytology reading process. Here, we attempted to ascertain which is the most accurate and reliable detection technology presently available in the field of breast and cervical cancer diagnostics.

A systematic search for pertinent articles identified three systematic reviews with meta-analyses which investigated DL algorithms in medical imaging. However, these were in diverse domains, which makes it difficult to compare them directly with the present review. For example, Liu et al.23 found that DL algorithm performance in medical imaging might be equivalent to that of healthcare professionals. However, only breast and dermatological

Fig. 2 Pooled overall performance of DL algorithms. a Receiver operating characteristic (ROC) curves of all studies included in the meta-analysis (20 studies with 55 tables), and b ROC curves of studies reporting the highest accuracy (20 studies with 20 tables).

Fig. 3 Pooled performance of DL algorithms using different validation types. a Receiver operating characteristic (ROC) curves of studies with internal validation (15 studies with 40 tables), and b ROC curves of studies with external validation (8 studies with 15 tables).

Fig. 4 Pooled performance of DL algorithms using different cancer types. a Receiver operating characteristic (ROC) curves of studies detecting breast cancer (10 studies with 36 tables), and b ROC curves of studies detecting cervical cancer (10 studies with 19 tables).

cancers were analyzed with more than three studies, which not only inhibits generalizability but highlights the need for further DL algorithm performance research in the field of medical imaging. In identifying pathologies, Aggarwal et al.24 found that DL algorithms have high diagnostic performance. However, the authors also found high heterogeneity, which was attributed to combining distinct methods and perhaps through unspecified terms. They concluded that we need to be cautious when


Fig. 5 Pooled performance of DL algorithms using different imaging modalities. a Receiver operating characteristic (ROC) curves of studies using mammography (4 studies with 15 tables), b ROC curves of studies using ultrasound (4 studies with 17 tables), c ROC curves of studies using cytology (4 studies with 6 tables), and d ROC curves of studies using colposcopy (4 studies with 11 tables).

Fig. 6 Pooled performance of DL algorithms versus human clinicians using the same sample. a Receiver operating characteristic (ROC) curves of studies using DL algorithms (11 studies with 29 tables), and b ROC curves of studies using human clinicians (11 studies with 18 tables).

considering the diagnostic accuracy of DL algorithms and that there is a need to develop (and apply) AI guidelines. This was also apparent in this study and therefore we would reiterate this sentiment.
While the findings from the aforementioned studies are incredibly valuable, at present there is a need to expand upon the emerging knowledge-base for metastatic tumor diagnosis. The only other review in this field was conducted by Zheng et al.25, who found that DL algorithms are beneficial in radiological imaging, with equivalent, or in some instances better, performance than healthcare professionals. Although again, there were methodological deficiencies which must be considered before we adopt these technologies into clinical practice. Also, we must strive to identify the best available DL algorithm and then develop


Fig. 7 Summary estimate of pooled performance using a forest plot. Data are presented as a forest plot of the studies included in the meta-analysis (20 studies).

it to enhance identification and reduce the number of false positives and false negatives beyond that which is humanly possible. As such, we need to continue to use systematic reviews to identify gaps in research, and we should not only consider technology-specific reviews but also disease-specific systematic reviews. Of course, DL algorithms are in an almost constant state of development, but the purpose of this study was to critically appraise potential issues with study methods and reporting standards. By doing so, we hoped to make recommendations and to drive further research in this field so that the most effective technology is adopted into clinical practice sooner rather than later.
This systematic review with meta-analysis suggests that deep learning algorithms can be used for the detection of breast and cervical cancer using medical imaging. Evidence also suggests that the deep learning algorithms are not yet superior, nor are they inferior, in terms of performance when compared to clinicians. Acceptable diagnostic performance with analogous deep learning algorithms was observed in both breast and cervical cancer despite dissimilar workflows with different imaging modalities. This finding also suggests that these algorithms could be deployed across both breast and cervical imaging, and potentially across all types of cancer which utilize imaging technologies to identify cases early. However, we must also critically consider some of the issues which emerged during our systematic analysis of this evidence base.
Overall, there were very few prospective studies and few clinical trials. In fact, most included studies were retrospective, which may be the case because of the relative newness of DL algorithms in medical imaging. However, the data sources used were from either pre-existing electronic medical records or online open-access databases, which were not explicitly intended for algorithmic analysis in real clinical settings. Of course, we must first test these technologies using retrospective datasets to see whether they are appropriate, with a view to modifying and enhancing accuracy, perhaps for specific populations or for specific types of cancer. We also encourage more prospective DL studies in the future. If possible, we should investigate the potential rules of breast or cervical images through more prospective studies, and identify possible image feature correlations and diagnostic logic for risk predictions. Most studies constructed and trained algorithms using small collections of labeled breast or cervical images, with labels which were rarely quality-checked by a clinical specialist. This design fault is likely to have created ambiguous ground-truth inputs, which may have caused unintended adverse model effects. Of course, the knock-on effect is that there are likely to be diagnostic inaccuracies through unidentified biases. This is certainly an issue which should be considered when designing future deep learning-based studies.
It is important to note that no matter how well-constructed an algorithm is, its diagnostic performance depends largely upon the volume and quality of raw data26. Most studies included in this systematic review mentioned a data augmentation method which adopted some form of affine image transformation strategy, e.g. translation, rotation, or flipping, in order to compensate for data deficiencies. This, one could argue, is symptomatic of the paucity of annotated datasets for model training, and of prospective studies for model validation. Fortunately, though, there has been a substantial increase in the number of openly available datasets around cervical or breast cancer.
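The affine-style augmentation strategy described above (translation, rotation, flipping) can be sketched minimally as follows. This is an illustration in NumPy, not code from any included study; the flip probabilities and shift range are arbitrary assumptions:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly augmented copy of a 2-D image array.

    Applies the simple label-preserving transforms mentioned above:
    horizontal/vertical flips, 90-degree rotations, and small integer
    translations (implemented here as circular shifts).
    """
    out = image.copy()
    if rng.random() < 0.5:                      # horizontal flip
        out = np.flip(out, axis=1)
    if rng.random() < 0.5:                      # vertical flip
        out = np.flip(out, axis=0)
    out = np.rot90(out, k=rng.integers(0, 4))   # rotate by 0/90/180/270 degrees
    shift = rng.integers(-2, 3, size=2)         # small random translation
    out = np.roll(out, shift=tuple(shift), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)               # toy 4x4 "image"
augmented = [augment(img, rng) for _ in range(8)]
```

Because every transform here is a permutation of pixels, each augmented variant keeps the original label, multiplying the effective training set size without any new annotation effort.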

However, given the necessity for this research, one would like to see institutions collaborating more frequently to establish cloud sharing platforms, which would increase the availability (and breadth) of annotated datasets.
Moreover, training DL algorithms requires reliable, high-quality image inputs, which may not be readily available, as some pre-analytical factors such as incorrect specimen preparation and processing, unstandardized image digitalization and acquisition, and improper device calibration and maintenance could lower image quality. Complete standardization of all procedures and reagents in clinical practice is required to optimally prepare pre-analytical image inputs in order to develop more robust and accurate DL algorithms. Having these would drive developments in this field and would benefit clinical practice, perhaps serving as a cost-effective replacement diagnostic tool or an initial method of risk categorization, although this is beyond the scope of this study and would require further research to consider in detail.
Of the 35 included studies, only 11 performed external validation, meaning that an assessment of DL model performance was conducted with either an out-of-sample dataset or an open-access dataset. Indeed, most of the studies included here split a single sample by either randomly or non-randomly assigning individuals' data from one center into a development dataset and an internal validation dataset. We found that performance in internally validated studies was higher than in externally validated studies for early detection of cervical and breast cancer. However, this was to be expected, because an internal validation dataset is more likely to be homogeneous and may lead to an overestimated diagnostic performance. This finding highlights the need for out-of-sample external validation in all predictive models. A possible method for improving external validation would be to establish an alliance of institutions wherein trained deep learning algorithms are shared and performances tested externally. This might provide insight into subgroups and variations between various ethnic groups, although we would also need to maintain patient anonymity and security, as several researchers have previously noted27,28.
Most of the retrospective studies used narrowly defined binary or multi-class tests, focusing on diagnostic performance in the field of DL algorithms rather than clinical practice. This is a direct consequence of poor reporting and the lack of real-world prospective clinical practice, which has resulted in inadequate data availability and therefore may limit our ability to gauge the applicability of these DL algorithms to clinical settings. Accordingly, there is uncertainty around the estimates of diagnostic performance provided in our meta-analysis, and adherence levels should be interpreted with caution.
Recently, several AI-related method guides have been published, with many still under development29,30. We found that most of the included studies were probably conceived or performed before these guidelines were available. Therefore, it is reasonable to assume that design features, reporting adequacy, and transparency of studies used to evaluate the diagnostic performance of DL algorithms will improve in the future. Even though our findings suggest that DL is not inferior in terms of performance compared to clinicians for the early detection of breast or cervical cancer, this is based on relatively few studies. Therefore, the uncertainty which exists is, at least in part, due to the in silico context in which clinicians are being evaluated.
We should also acknowledge that most of the current DL studies are publications of positive results. We must be aware that this may be a form of researcher-based reporting bias (rather than publication-based bias), which is likely to skew the dataset and adds complexity to comparisons between DL algorithms and clinicians31,32. Differences in reference standard definitions, grader capabilities (i.e. degrees of expertise), imaging modalities, and detection thresholds for classification of early breast or cervical cancer also make direct comparisons between studies and algorithms very difficult. Furthermore, non-trivial applications of DL models in the healthcare setting will need clinicians to optimize clinical workflow integration. However, we found only two studies which compared DL versus clinicians and versus DL combined with clinicians. This hindered our meta-analysis of DL algorithms but highlighted the need for strict and reliable assessment of DL performance in real clinical settings. Indeed, the scientific discourse should change from a DL-versus-clinicians dichotomy to a more realistic DL-clinician combination, which would improve workflows.
Thirty-five studies met the eligibility criteria for the systematic review, yet only 20 studies could be used to develop contingency tables. Some DL algorithm studies from computer science journals only reported precision, Dice coefficient, F1 score, recall, and competition performance metrics, whereas indicators such as AUC, accuracy, sensitivity, and specificity are more familiar to healthcare professionals25. Bridging the gap with computer science research would seem prudent if we are to manage interdepartmental research and the transition to a more digitized healthcare system. Moreover, we found the term "validation" is used casually in DL model studies. Some authors used it for assessing the diagnostic performance of the final algorithm; others defined it as a dataset for model tuning during the development process. This confuses readers and makes it difficult to judge the function of datasets. We combined experts' opinions33 and proposed to distinguish the datasets used in the development and validation of DL algorithms. In keeping with the language used for nomogram development, a dataset for training the model should be named the 'training set', while datasets used for tuning should be referred to as the 'tuning set'. Likewise, during the validation phase, the hold-back subset split from the entire dataset should be referred to as 'internal' validation, which involves the same conditions/image types as the training set, while a completely independent dataset for out-of-sample validation should be referred to as 'external' validation34.
Most of the issues discussed here could be avoided through more robust designs and high-quality reporting, although several hurdles must be overcome before DL algorithms are used in practice for breast and cervical cancer identification. The black box nature of DL models, without clear interpretability of the basis for clinical decisions, is a well-recognized challenge. For example, a clinician considers whether breast nodules represent breast cancer based on mammographic images against a series of judgement criteria. Therefore, a clinician developing a clear rationale for a proposed diagnosis may be the desired state, whereas a DL model which merely states the diagnosis may be viewed with more skepticism. Scientists have actively investigated possible methods for inspecting and explaining algorithmic decisions. An important example is the use of salience or heat maps to provide the location of salient lesion features within the image rather than defining the lesion characteristics themselves35,36. This raises questions around human-technology interactions, and particularly around transparency and patient-practitioner communications, which ought to be studied in conjunction with DL modeling in medical imaging.
Another common problem limiting DL algorithms is model generalizability. There may be potential factors in the training data that would affect the performance of DL models in different data distribution settings28. For example, a model trained only in the US may not perform well in Asia, because a model trained using data from predominantly Caucasian patients may not perform well across other ethnicities. One solution to improve generalizability and reduce bias is to conduct large, multicenter studies which can enable the analysis of nationalities, ethnicities, hospital specifics, and population distribution characteristics37. Societal biases can also affect the performance of DL models and, of course, bias exists in DL algorithms because a training dataset may not include appropriate proportions of minority groups.

For example, a DL algorithm for melanoma diagnosis in a dermatological study may lack diversity in terms of skin color and genomic data, but this may also cause an under-representation of minority groups38. To eliminate embedded prejudice, efforts should be made to carry out DL algorithm research which provides a more realistic representation of global populations.
As we have seen, the included studies were mostly retrospective, with extensive variation in methods and reporting. More high-quality studies, such as prospective studies and clinical trials, are needed to enhance the current evidence base. We also focused on DL algorithms for breast and cervical cancer detection using medical imaging. Therefore, we made no attempt to generalize our findings to other types of AI, such as conventional machine learning models. While there were a reasonable number of studies for this meta-analysis, the number of studies for some imaging modalities, such as cytology and colposcopy, was limited. Therefore, the results of the subgroup analyses around imaging modality need to be interpreted with caution. We also selected only studies in which histopathology was used as the reference standard. Consequently, some DL studies that may have shown promise but did not have confirmatory histopathologic results were excluded. Even though publication bias was not identified through funnel plot analysis in Supplementary Fig. 3, based on data extracted from 20 studies, the lack of prospective studies and the potential absence of studies with negative results can cause bias. As such, we would encourage deep learning researchers in medical imaging to report studies which do not reject the null hypothesis, because this will ensure evidence clusters around true effect estimates.
It remains necessary to promote deep learning in medical imaging studies for breast or cervical cancer detection. However, we suggest improving breast and cervical data quality and establishing unified standards. Developing DL algorithms needs to feed on reliable, high-quality images tagged with appropriate histopathological labels. Likewise, it is important to establish unified standards to improve the quality of digital image production, the collection process, imaging reports, and final histopathological diagnosis. Combining DL algorithm results with other biomarkers may prove useful to improve risk discrimination for breast or cervical cancer detection. An example would be a DL model for cervical imaging that combines additional clinical information, i.e. cytology and HPV typing, which could improve overall diagnostic performance39,40. Secondly, we need to improve error correction ability and DL algorithm compatibility. Making DL algorithms more generalizable and less susceptible to bias early in development may require larger, multicenter datasets which incorporate diverse nationalities and ethnicities, as well as those with different socioeconomic status etc., if we are to implement algorithms into real-world settings.
This also highlights the need for international reporting guidelines for DL algorithms in medical imaging. Existing reporting guidelines, such as STARD41 for diagnostic accuracy studies and TRIPOD42 for conventional prediction models, are not directly applicable to DL model studies. The recent publication of the CONSORT-AI43 and SPIRIT-AI44 guidelines is welcome, but we await disease-specific DL guidelines. Furthermore, we would encourage organizations to develop diverse teams, combining computer scientists and clinicians to solve clinical problems using DL algorithms. Even though DL algorithms appear like black boxes with unexplainable decision-making outputs, these technologies need to be discussed during development and require additional clinical information45,46. Finally, medical computer vision algorithms do not exist in a vacuum; we must integrate DL algorithms into routine clinical workflows and across entire healthcare systems to assist doctors and augment decision-making. Therefore, it is crucial that clinicians understand the information each algorithm provides and how this can be integrated into clinical decisions which enhance efficiency without absorbing resources. For any algorithm to be incorporated into existing workflows, it has to be robust and scientifically validated for clinical and personal utility.
We tentatively suggest that DL algorithms could be useful for detecting breast and cervical cancer using medical imaging, with equivalent performance to human clinicians in terms of sensitivity and specificity. However, this finding is based on poor study designs and reporting, which could lead to bias and overestimated algorithmic performances. Standardized guidelines around study methods and reporting are needed to improve the quality of DL model research. This may help to facilitate the transition into clinical practice, although further research is required.

METHODS
Protocol registration and study design
The study protocol was registered with the PROSPERO International register of systematic reviews, number CRD42021252379. The study was conducted according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines47. No ethical approval or informed consent was required for the current systematic review and meta-analysis.

Search strategy and eligibility criteria
In this study, we searched Medline, Embase, IEEE and the Cochrane library until April 2021. No restrictions were applied around regions, languages, or publication types; however, letters, scientific reports, conference abstracts, and narrative reviews were excluded. The full search strategy for each database was developed in collaboration with a group of experienced clinicians and medical researchers. Please see Supplementary Note 1 for further details.
Eligibility assessment was conducted by two independent investigators, who screened titles and abstracts, and selected all relevant citations for full-text review. Disagreements were resolved through discussion with another collaborator. We included studies that reported the diagnostic performance of a DL model(s) for the early detection of breast or cervical cancer using medical imaging. Studies reporting any diagnostic outcome, such as accuracy, sensitivity, and specificity etc., could be included. There was no restriction on participant characteristics, type of imaging modality, or the intended context for using DL models.
Only histopathology was accepted as the study reference standard. As such, imperfect ground truths, such as expert opinion or consensus, and other clinical testing were rejected. Likewise, medical waveform data or investigations into the performance of image segmentation were excluded because these could not be synthesized with histopathological data. Animal studies or non-human samples were also excluded, and duplicates were removed. The primary outcomes were various diagnostic performance metrics. Secondary analysis included an assessment of study methodologies and reporting standards.

Data extraction
Two investigators independently extracted study characteristics and diagnostic performance data using a predetermined data extraction sheet. Again, uncertainties were resolved by a third investigator. Binary diagnostic accuracy data were extracted directly into contingency tables which included true-positives, false-positives, true-negatives, and false-negatives. These were then used to calculate pooled sensitivity, pooled specificity, and other metrics. If a study provided multiple contingency tables for the same or for different DL algorithms, we assumed that they were independent of each other.

Quality assessment
The risk of bias and applicability concerns of the included studies were assessed by the three investigators using the quality assessment of diagnostic accuracy studies 2 (QUADAS-2) tool48.

Statistical analysis
Hierarchical summary receiver operating characteristic (SROC) curves were used to assess the diagnostic performance of DL algorithms.
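Each 2×2 contingency table extracted in the Data extraction step reduces to the indicator pair (sensitivity, specificity) that these analyses pool and plot in SROC space. A minimal sketch, with hypothetical counts:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the accuracy indicators derived from one 2x2 contingency
    table: sensitivity (true-positive rate), specificity (true-negative
    rate), and overall accuracy."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"sensitivity": sensitivity,
            "specificity": specificity,
            "accuracy": accuracy}

# Hypothetical table: 88 true-positives, 16 false-positives,
# 12 false-negatives, 84 true-negatives.
m = diagnostic_metrics(tp=88, fp=16, fn=12, tn=84)
```

One such (sensitivity, specificity) point per table is what the hierarchical model summarizes; the pooled estimates themselves additionally weight each study's contribution, which this sketch does not attempt.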

95% confidence intervals (CI) and prediction regions were generated around averaged sensitivity, specificity, and AUC estimates in SROC figures. Further meta-analysis was performed to report the best accuracy in studies with multiple DL algorithms from contingency tables. Heterogeneity was assessed using the I2 statistic. We also conducted subgroup meta-analyses and regression analyses to explore potential sources of heterogeneity. The random effects model was implemented because of the assumed differences between studies. Publication bias was assessed visually using funnel plots.
Four separate meta-analyses were conducted: (1) according to validation type, DL algorithms were categorized as either internal or external; internal validation meant that studies were validated using an in-sample dataset, while external validation studies were validated using an out-of-sample dataset; (2) according to cancer type, i.e., breast or cervical cancer; (3) according to imaging modalities, such as mammography, ultrasound, cytology, and colposcopy; (4) according to the pooled performance for DL algorithms versus human clinicians using the same dataset.
Meta-analysis was only performed where there were three or more original studies. STATA (version 15.1) and SAS (version 9.4) were used for data analyses. The threshold for statistical significance was set at p < 0.05, and all tests were two-sided.

Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The search strategy and aggregated data contributing to the meta-analysis are available in the appendix.

Received: 24 June 2021; Accepted: 22 December 2021;

REFERENCES
1. Arbyn, M. et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis. Lancet Glob. Health 8, e191–e203 (2020).
2. Li, N. et al. Global burden of breast cancer and attributable risk factors in 195 countries and territories, from 1990 to 2017: results from the Global Burden of Disease Study 2017. J. Hematol. Oncol. 12, 140 (2019).
3. Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
4. Ginsburg, O. et al. Changing global policy to deliver safe, equitable, and affordable care for women's cancers. Lancet 389, 871–880 (2017).
5. Allemani, C. et al. Global surveillance of trends in cancer survival 2000-14 (CONCORD-3): analysis of individual records for 37 513 025 patients diagnosed with one of 18 cancers from 322 population-based registries in 71 countries. Lancet 391, 1023–1075 (2018).
6. Shah, S. C., Kayamba, V., Peek, R. M. Jr. & Heimburger, D. Cancer control in low- and middle-income countries: is it time to consider screening? J. Glob. Oncol. 5, 1–8 (2019).
7. Wentzensen, N., Chirenje, Z. M. & Prendiville, W. Treatment approaches for women with positive cervical screening results in low- and middle-income countries. Prev. Med. 144, 106439 (2021).
8. Britt, K. L., Cuzick, J. & Phillips, K. A. Key steps for effective breast cancer prevention. Nat. Rev. Cancer 20, 417–436 (2020).
9. Brisson, M. et al. Impact of HPV vaccination and cervical screening on cervical cancer elimination: a comparative modelling analysis in 78 low-income and lower-middle-income countries. Lancet 395, 575–590 (2020).
10. Yang, L. et al. Performance of ultrasonography screening for breast cancer: a systematic review and meta-analysis. BMC Cancer 20, 499 (2020).
11. Conti, A., Duggento, A., Indovina, I., Guerrisi, M. & Toschi, N. Radiomics in breast cancer classification and prediction. Semin. Cancer Biol. 72, 238–250 (2021).
12. Xue, P., Ng, M. T. A. & Qiao, Y. The challenges of colposcopy for cervical cancer screening in LMICs and solutions by artificial intelligence. BMC Med. 18, 169 (2020).
13. William, W., Ware, A., Basaza-Ejiri, A. H. & Obungoloch, J. A review of image analysis and machine learning techniques for automated cervical cancer screening from
15. Mandal, R. & Basu, P. Cancer screening and early diagnosis in low and middle income countries: current situation and future perspectives. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz 61, 1505–1512 (2018).
16. Torode, J. et al. National action towards a world free of cervical cancer for all women. Prev. Med. 144, 106313 (2021).
17. Coiera, E. The fate of medicine in the time of AI. Lancet 392, 2331–2332 (2018).
18. Kleppe, A. et al. Designing deep learning studies in cancer diagnostics. Nat. Rev. Cancer 21, 199–211 (2021).
19. Nagendran, M. et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368, m689 (2020).
20. Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3, 118 (2020).
21. Liu, X., Rivera, S. C., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ 370, m3164 (2020).
22. Bengtsson, E. & Malm, P. Screening for cervical cancer using automated analysis of PAP-smears. Comput. Math. Methods Med. 2014, 842037 (2014).
23. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).
24. Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit. Med. 4, 65 (2021).
25. Zheng, Q. et al. Artificial intelligence performance in detecting tumor metastasis from medical radiology imaging: a systematic review and meta-analysis. EClinicalMedicine 31, 100669 (2021).
26. Moon, J. H. et al. How much deep learning is enough for automatic identification to be reliable? Angle Orthod. 90, 823–830 (2020).
27. Beam, A. L., Manrai, A. K. & Ghassemi, M. Challenges to the reproducibility of machine learning models in health care. JAMA 323, 305–306 (2020).
28. Trister, A. D. The tipping point for deep learning in oncology. JAMA Oncol. 5, 1429–1430 (2019).
29. Kim, D. W., Jang, H. Y., Kim, K. W., Shin, Y. & Park, S. H. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J. Radiol. 20, 405–410 (2019).
30. England, J. R. & Cheng, P. M. Artificial intelligence for medical image analysis: a guide for authors and reviewers. AJR Am. J. Roentgenol. 212, 513–519 (2019).
31. Cook, T. S. Human versus machine in medicine: can scientific literature answer the question? Lancet Digit. Health 1, e246–e247 (2019).
32. Simon, A. B., Vitzthum, L. K. & Mell, L. K. Challenge of directly comparing imaging-based diagnoses made by machine learning algorithms with those made by human clinicians. J. Clin. Oncol. 38, 1868–1869 (2020).
33. Altman, D. G. & Royston, P. What do we mean by validating a prognostic model? Stat. Med. 19, 453–473 (2000).
34. Kim, D. W. et al. Inconsistency in the use of the term "validation" in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging. PLoS One 15, e0238908 (2020).
35. Becker, A. S. et al. Classification of breast cancer in ultrasound imaging using a generic deep learning analysis software: a pilot study. Br. J. Radiol. 91, 20170576 (2018).
36. Becker, A. S. et al. Deep learning in mammography: diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Invest. Radiol. 52, 434–440 (2017).
37. Wang, F., Casalino, L. P. & Khullar, D. Deep learning in medicine: promise, progress, and challenges. JAMA Intern. Med. 179, 293–294 (2019).
38. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
39. Xue, P. et al. Development and validation of an artificial intelligence system for grading colposcopic impressions and guiding biopsies. BMC Med. 18, 406 (2020).
40. Yuan, C. et al. The application of deep learning based diagnostic system to cervical squamous intraepithelial lesions recognition in colposcopy images. Sci. Rep. 10, 11639 (2020).
41. Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
42. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 350, g7594 (2015).
43. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
pap-smear images. Comput Methods Prog. Biomed. 164, 15–22 (2018). 44. Cruz Rivera, S., Liu, X., Chan, A. W., Denniston, A. K. & Calvert, M. J. Guidelines for
14. Muse, E. D. & Topol, E. J. Guiding ultrasound image capture with artificial intel- clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-
ligence. Lancet 396, 749 (2020). AI extension. Nat. Med 26, 1351–1363 (2020).
45. Huang, S. C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit. Med. 3, 136 (2020).
46. Guo, H. et al. Heat map visualization for electrocardiogram data analysis. BMC Cardiovasc. Disord. 20, 277 (2020).
47. Moher, D., Liberati, A., Tetzlaff, J. & Altman, D. G. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ 339, b2535 (2009).
48. Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011).
49. Xiao, M. et al. Diagnostic Value of Breast Lesions Between Deep Learning-Based Computer-Aided Diagnosis System and Experienced Radiologists: Comparison the Performance Between Symptomatic and Asymptomatic Patients. Front. Oncol. 10, 1070 (2020).
50. Zhang, X. et al. Evaluating the Accuracy of Breast Cancer and Molecular Subtype Diagnosis by Ultrasound Image Deep Learning Model. Front. Oncol. 11, 623506 (2021).
51. Zhou, J. et al. Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images. J. Magn. Reson. Imaging 50, 1144–1151 (2019).
52. Agnes, S. A., Anitha, J., Pandian, S. I. A. & Peter, J. D. Classification of Mammogram Images Using Multiscale all Convolutional Neural Network (MA-CNN). J. Med. Syst. 44, 30 (2019).
53. Tanaka, H., Chiu, S. W., Watanabe, T., Kaoku, S. & Yamaguchi, T. Computer-aided diagnosis system for breast ultrasound images using deep learning. Phys. Med. Biol. 64, 235013 (2019).
54. Kyono, T., Gilbert, F. J. & van der Schaar, M. Improving Workflow Efficiency for Mammography Using Machine Learning. J. Am. Coll. Radiol. 17, 56–63 (2020).
55. Qi, X. et al. Automated diagnosis of breast ultrasonography images using deep neural networks. Med. Image Anal. 52, 185–198 (2019).
56. Salim, M. et al. External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncol. 6, 1581–1588 (2020).
57. Zhang, Q. et al. Dual-mode artificially-intelligent diagnosis of breast tumours in shear-wave elastography and B-mode ultrasound using deep polynomial networks. Med. Eng. Phys. 64, 1–6 (2019).
58. Wang, Y. et al. Breast Cancer Classification in Automated Breast Ultrasound Using Multiview Convolutional Neural Network with Transfer Learning. Ultrasound Med. Biol. 46, 1119–1132 (2020).
59. Li, Y., Wu, W., Chen, H., Cheng, L. & Wang, S. 3D tumor detection in automated breast ultrasound using deep convolutional neural network. Med. Phys. 47, 5669–5680 (2020).
60. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
61. Shen, L. et al. Deep Learning to Improve Breast Cancer Detection on Screening Mammography. Sci. Rep. 9, 12495 (2019).
62. Suh, Y. J., Jung, J. & Cho, B. J. Automated Breast Cancer Detection in Digital Mammograms of Various Densities via Deep Learning. J. Pers. Med. 10, 211 (2020).
63. O'Connell, A. M. et al. Diagnostic Performance of An Artificial Intelligence System in Breast Ultrasound. J. Ultrasound Med. 41, 97–105 (2021).
64. Rodriguez-Ruiz, A. et al. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J. Natl Cancer Inst. 111, 916–922 (2019).
65. Adachi, M. et al. Detection and Diagnosis of Breast Cancer Using Artificial Intelligence Based Assessment of Maximum Intensity Projection Dynamic Contrast-Enhanced Magnetic Resonance Images. Diagnostics (Basel) 10, 330 (2020).
66. Samala, R. K. et al. Multi-task transfer learning deep convolutional neural network: application to computer-aided diagnosis of breast cancer on mammograms. Phys. Med. Biol. 62, 8894–8908 (2017).
67. Schaffter, T. et al. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw. Open 3, e200265 (2020).
68. Kim, H. E. et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit. Health 2, e138–e148 (2020).
69. Wang, F. et al. Study on automatic detection and classification of breast nodule using deep convolutional neural network system. J. Thorac. Dis. 12, 4690–4701 (2020).
70. Yu, T. F. et al. Deep learning applied to two-dimensional color Doppler flow imaging ultrasound images significantly improves diagnostic performance in the classification of breast masses: a multicenter study. Chin. Med. J. (Engl.) 134, 415–424 (2021).
71. Sasaki, M. et al. Artificial intelligence for breast cancer detection in mammography: experience of use of the ScreenPoint Medical Transpara system in 310 Japanese women. Breast Cancer 27, 642–651 (2020).
72. Zhang, C., Zhao, J., Niu, J. & Li, D. New convolutional neural network model for screening and diagnosis of mammograms. PLoS One 15, e0237674 (2020).
73. Bao, H. et al. The artificial intelligence-assisted cytology diagnostic system in large-scale cervical cancer screening: A population-based cohort study of 0.7 million women. Cancer Med. 9, 6896–6906 (2020).
74. Holmström, O. et al. Point-of-Care Digital Cytology With Artificial Intelligence for Cervical Cancer Screening in a Resource-Limited Setting. JAMA Netw. Open 4, e211740 (2021).
75. Cho, B. J. et al. Classification of cervical neoplasms on colposcopic photography using deep learning. Sci. Rep. 10, 13652 (2020).
76. Bao, H. et al. Artificial intelligence-assisted cytology for detection of cervical intraepithelial neoplasia or invasive cancer: A multicenter, clinical-based, observational study. Gynecol. Oncol. 159, 171–178 (2020).
77. Hu, L. et al. An Observational Study of Deep Learning and Automated Evaluation of Cervical Images for Cancer Screening. J. Natl Cancer Inst. 111, 923–932 (2019).
78. Hunt, B. et al. Cervical lesion assessment using real-time microendoscopy image analysis in Brazil: The CLARA study. Int. J. Cancer 149, 431–441 (2021).
79. Wentzensen, N. et al. Accuracy and Efficiency of Deep-Learning-Based Automation of Dual Stain Cytology in Cervical Cancer Screening. J. Natl Cancer Inst. 113, 72–79 (2021).
80. Yu, Y., Ma, J., Zhao, W., Li, Z. & Ding, S. MSCI: A multistate dataset for colposcopy image classification of cervical cancer screening. Int. J. Med. Inf. 146, 104352 (2021).

ACKNOWLEDGEMENTS
This study was supported by CAMS Innovation Fund for Medical Sciences (Grant #: CAMS 2021-I2M-1-004).

AUTHOR CONTRIBUTIONS
P.X., Y.J., and Y.Q. conceptualised the study. P.X., J.W., D.Q., and H.Y. designed the study, extracted data, conducted the analysis, and wrote the manuscript. P.X. and S.S. revised the manuscript. All authors approved the final version of the manuscript and take accountability for all aspects of the work. P.X. and J.W. contributed equally to this article.

COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-022-00559-z.

Correspondence and requests for materials should be addressed to Yu Jiang or Youlin Qiao.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2022
Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2022) 19