Aggarwal et al. - 2021
Deep learning (DL) has the potential to transform medical diagnostics. However, the diagnostic accuracy of DL is uncertain. Our aim
was to evaluate the diagnostic accuracy of DL algorithms to identify pathology in medical imaging. Searches were conducted in
Medline and EMBASE up to January 2020. We identified 11,921 studies, of which 503 were included in the systematic review. Eighty-
two studies in ophthalmology, 82 in breast disease and 115 in respiratory disease were included for meta-analysis. Two hundred
twenty-four studies in other specialities were included for qualitative review. Peer-reviewed studies that reported on the diagnostic
accuracy of DL algorithms to identify pathology using medical imaging were included. Primary outcomes were measures of
diagnostic accuracy, study design and reporting standards in the literature. Estimates were pooled using random-effects meta-
analysis. In ophthalmology, AUCs ranged between 0.933 and 1 for diagnosing diabetic retinopathy, age-related macular degeneration and glaucoma on retinal fundus photographs and optical coherence tomography. In respiratory imaging, AUCs ranged between 0.864 and 0.937 for diagnosing lung nodules or lung cancer on chest X-ray or CT scan. For breast imaging, AUCs ranged between 0.868 and 0.909 for diagnosing breast cancer on mammogram, ultrasound, MRI and digital breast tomosynthesis.
Heterogeneity was high between studies and extensive variation in methodology, terminology and outcome measures was noted.
This can lead to an overestimation of the diagnostic accuracy of DL algorithms on medical imaging. There is an immediate need for
the development of artificial intelligence-specific EQUATOR guidelines, particularly STARD, in order to provide guidance around key
issues in this field.
npj Digital Medicine (2021)4:65 ; https://doi.org/10.1038/s41746-021-00438-z
1Institute of Global Health Innovation, Imperial College London, London, UK. 2Singapore Eye Research Institute, Singapore National Eye Center, Singapore, Singapore. ✉email: h.ashrafi[email protected]
[...] high heterogeneity across all studies (see Table 2).
Diabetic retinopathy: Twenty-five studies with 48 different patient cohorts reported diagnostic accuracy data for all, referable or vision-threatening DR on RFP. Twelve studies and 16 cohorts [...]
[...] images and visual fields) or for identifying other diagnoses (pseudopapilloedema, retinal vein occlusion and retinal detachment). These studies were not included in the meta-analysis.
Respiratory imaging
One hundred and fifteen studies with 244 separate patient cohorts report on diagnostic accuracy of DL on respiratory disease (see Table 3 and Supplementary References 2). Lung nodules were largely identified on CT scans, whereas chest X-rays (CXR) were used to diagnose a wide spectrum of conditions from simply being ‘abnormal’ to more specific diagnoses, such as pneumothorax, pneumonia and tuberculosis.
Only two studies62,63 used prospectively collected data and 13 (refs. 63–75) studies validated algorithms on external data. No studies provided a prespecified sample size calculation. Twenty-one54,63–67,70,72,76–88 studies compared algorithm performance against healthcare professionals. Reference standards varied greatly, as did the method of internal validation used. There was high heterogeneity across all studies (see Table 3).
Lung nodules: Fifty-six studies with 74 separate patient cohorts reported diagnostic accuracy for identifying lung nodules on CT scans on a per lesion basis, compared with nine studies and 14 patient cohorts on CXR. AUC was 0.937 (95% CI 0.924–0.949) for CT versus 0.884 (95% CI 0.842–0.925) for CXR. Seven studies reported on diagnostic accuracy for identifying lung nodules on CT scans on a per scan basis; these were not included in the meta-analysis.
Lung cancer or mass: Six studies with nine patient cohorts reported diagnostic accuracy for identifying mass lesions or lung cancer on CT scans compared with eight studies and ten cohorts on CXR. AUC was 0.887 (95% CI 0.847–0.928) for CT versus 0.864 (95% CI 0.827–0.901) for CXR.
Abnormal chest X-ray: Twelve studies reported diagnostic accuracy for abnormal CXR with 13 different patient cohorts. AUC was 0.917 (95% CI 0.869–0.966), sensitivity was 0.873 (95% CI 0.762–0.985) and specificity was 0.894 (95% CI 0.860–0.929).
Pneumothorax: Ten studies reported diagnostic accuracy for pneumothorax on CXR with 14 different patient cohorts. AUC was 0.910 (95% CI 0.863–0.957), sensitivity was 0.718 (95% CI 0.433–1.004) and specificity was 0.918 (95% CI 0.870–0.965). Five patient cohorts from two studies73,89 provided contingency tables with raw diagnostic accuracy. When averaging across the cohorts, the pooled sensitivity was 0.70 (95% CI 0.45–0.87) and pooled [...]

Fig. 1 PRISMA (preferred reporting items for systematic reviews and meta-analyses) flow diagram of included studies. Records identified: 11,921 (4902 from MEDLINE, 6952 from Embase, 67 from other sources); 2437 duplicates removed; 9484 records screened; 8721 records excluded; 763 full-text articles assessed for eligibility; 260 excluded (114 conference papers, 72 segmentation, 42 prediction, 12 not imaging, 11 no outcomes, 9 not pathology); 503 articles included in qualitative synthesis, of which 224 studies were in other medical specialities (1 Cardiology, 15 Dermatology, 19 Endocrine/Thyroid, 2 ENT, 24 Gastroenterology/Hepatology, 2 Haematology, 11 Maxillofacial Surgery, 1 Metabolic Medicine, 78 Neurology/Neurosurgery, 9 Oncology, 28 Orthopaedics, 5 Rheumatology, 3 GI surgery, 25 Urology, 1 Vascular Surgery); 279 studies included in meta-analysis (115 in respiratory medicine, 82 in ophthalmology, 82 in breast cancer).
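As context for the pooled per-cohort estimates quoted above (for example, the pneumothorax sensitivity averaged across cohorts that provided contingency tables), the sketch below shows one common way such a pooled proportion and its I² can be computed: per-cohort sensitivities on the logit scale combined with DerSimonian-Laird random-effects weights. The counts are hypothetical and the authors' exact random-effects model may differ; this is an illustration, not a reproduction of the paper's analysis.

```python
import numpy as np

# Hypothetical (TP, FN) counts per cohort -- illustrative only, not taken from
# any study in the review. Per-cohort sensitivity is TP / (TP + FN).
cohorts = [(45, 15), (30, 18), (60, 22), (25, 14), (52, 9)]

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-cohort sensitivity on the logit scale with its approximate variance
# (delta method: var[logit(p_hat)] ~= 1/TP + 1/FN).
theta = np.array([logit(tp / (tp + fn)) for tp, fn in cohorts])
v = np.array([1.0 / tp + 1.0 / fn for tp, fn in cohorts])

# DerSimonian-Laird random-effects pooling.
w = 1.0 / v
theta_fixed = np.sum(w * theta) / np.sum(w)
q = np.sum(w * (theta - theta_fixed) ** 2)            # Cochran's Q
df = len(cohorts) - 1
tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_re = 1.0 / (v + tau2)
pooled = np.sum(w_re * theta) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))

ci = expit(pooled + np.array([-1.96, 1.96]) * se)     # back-transformed 95% CI
i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I^2 heterogeneity

print(f"pooled sensitivity = {expit(pooled):.3f} "
      f"(95% CI {ci[0]:.3f}-{ci[1]:.3f}), I2 = {i2:.0f}%")
```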
Table 1. Summary estimates of pooled speciality and imaging modality specific diagnostic accuracy metrics.
Imaging modality Diagnosis AUC 95% CI I2 Sensitivity 95% CI I2 Specificity 95% CI I2 PPV 95% CI I2 NPV 95% CI I2 Accuracy 95% CI I2 F1 score 95% CI I2
Ophthalmology imaging
RFP DR 0.939 0.920–0.958 99.9 0.976 0.975–0.977 99.9 0.902 0.889–0.916 99.7 0.389 0.166–0.612 99.7 1 1 90.6 0.927 0.899–0.955 96.3
RFP AMD 0.963 0.948–0.979 99.3 0.973 0.971–0.974 99.9 0.924 0.896–0.952 99.6 0.797 0.719–0.875 99.9
RFP Glaucoma 0.933 0.924–0.942 99.6 0.883 0.862–0.904 99.9 0.918 0.898–0.938 99.7 0.881 0.847–0.915 97.7
RFP ROP 0.96 0.913–1.008 99.5 0.907 0.749–1.066 99.8
OCT DR 1 0.999–1.0 98.1 0.954 0.937–0.972 98.9 0.993 0.991–0.994 98.2 0.97 0.959–0.981 97.5
OCT AMD 0.969 0.955–0.983 99.4 0.997 0.996–0.997 99.7 0.932 0.914–0.950 98.9 0.936 0.906–0.965 99.6
OCT Glaucoma 0.964 0.941–0.986 77.7
Respiratory imaging
CT Lung nodules 0.937 0.924–0.949 97 0.86 0.831–0.890 99.7 0.896 0.871–0.921 99.2 0.785 0.711–0.858 99.2 0.889 0.870–0.908 98.4 0.79 0.747–0.834 97.9
CT Lung cancer 0.887 0.847–0.928 95.9 0.837 0.780–0.894 94.6 0.826 0.735–0.918 98.1 0.827 0.784–0.870 81.7
X-ray Nodules 0.884 0.842–0.925 99.6 0.75 0.634–0.866 99 0.944 0.912–0.976 98.4 0.86 0.736–0.984 99.8 0.894 0.842–0.945 81.4
X-ray Mass 0.864 0.827–0.901 99.7 0.801 0.683–0.919 99.7
X-ray Abnormal 0.917 0.869–0.966 99.9 0.873 0.762–0.985 99.9 0.894 0.860–0.929 98.7 0.85 0.567–1.133 100 0.859 0.736–0.983 99 0.76 0.558–0.962 99.7
X-ray Atelectasis 0.824 0.783–0.866 99.7
X-ray Cardiomegaly 0.905 0.871–0.938 99.7
X-ray Consolidation 0.875 0.800–0.949 99.9 0.914 0.816–1.013 99.5 0.751 0.637–0.866 98.6 0.897 0.828–0.966 96.4
X-ray Pulmonary oedema 0.893 0.843–0.944 99.9
X-ray Effusion 0.906 0.862–0.950 99.8
X-ray Emphysema 0.885 0.855–0.916 99.7
X-ray Fibrosis 0.834 0.796–0.872 99.7
X-ray Hiatus hernia 0.894 0.858–0.930 99.8
Breast imaging
MRI Breast cancer 0.868 0.850–0.886 27.8 0.786 0.710–0.861 80.5 0.788 0.697–0.880 86.2
DBT Breast cancer 0.908 0.880–0.937 63.2 0.831 0.675–0.988 97.6 0.918 0.905–0.930 0
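For readers interpreting Table 1, the pooled metrics follow the standard definitions for a 2 × 2 contingency table of true/false positives and negatives, and I² is the usual heterogeneity statistic derived from Cochran's Q. These are textbook formulas restated here for convenience; they are not drawn from the paper itself.

\[
\text{Sensitivity}=\frac{TP}{TP+FN},\qquad
\text{Specificity}=\frac{TN}{TN+FP},\qquad
\text{PPV}=\frac{TP}{TP+FP},\qquad
\text{NPV}=\frac{TN}{TN+FN},
\]
\[
\text{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN},\qquad
F_1=\frac{2\,TP}{2\,TP+FP+FN},\qquad
I^2=\max\!\left(0,\ \frac{Q-\mathrm{df}}{Q}\right)\times 100\%.
\]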
Table 2. Characteristics of ophthalmology imaging studies.
Study Model Prospective? Test set Population Test datasets Type of internal validation External validation Reference standard AI vs clinician? Imaging modality Body system/disease
Abramoff et al. 2016 AlexNet/VGG No 1748 Photographs Messidor-2 NR No Expert consensus No Retinal fundus photography Referable DR
Abramoff et al. AlexNet/VGG Yes 819 Patients Prospective cohort from 10 NR Yes Expert consensus No Retinal fundus More than mild DR
201814 primary care practice photography
sites in USA
Ahn et al. 2018 (a) Inception-v3; (b) No (a) 464; (b) 464 Images Kim’s Eye Hospital, Korea Random split No Expert consensus No Retinal fundus Early and advanced
customised CNN photography glaucoma
(Spectralis
device)
ElTanboly et al. 2016 Deep fusion No 12 OCT scans Hold- No NR No OCT Early DR
classification out method
network (DFCN)
Gargeya et al. 201726 CNN No (a) 15,000 (b) Photographs (a) EyePACS-1; (b) Messidor-2; Random split Yes Expert consensus No Retinal fundus DR
1748; (c) 463 (c) E-Opthma photography
Gomez-Valverde VGG-19 No 494 Photographs ESPERANZA Random split No Expert consensus Yes Retinal fundus Glaucoma suspect or
et al. 201952 photographs glaucoma
Grassman et al. Ensemble: No (a) 12,019; Images (a) AREDS dataset; (b) KORA Random split Yes Reading No Retinal fundus AMD-AREDS 9 step
201827 random forest (b) 5555 dataset centre grader photography
Gulshan et al. 201917 Inception-v3 Yes 3049 Photographs Prospective NA Yes Expert consensus Yes Retinal fundus Referable DR
photographs
Gulshan et al. 201628 Inception-v3 No (a) 8788; Photographs (a) EyePACS-1; (b) Messidor-2 Random split Yes Reading Yes Retinal fundus Referable DR
(b) 1745 centre grader photography
Hwang et al. 201929 (a) ResNet50; (b) VGG- No (a–c) 3872; Images (a–c) Department of Random split Yes Expert consensus Yes OCT AMD-AREDS 4 step
16; (c) Inception-v3; (d–f ) 750 Ophthalmology of Taipei
(d) ResNet50; (e) VGG- Veterans General Hospital; (d–f )
16; (f ) Inception-v3 External validation
Jammal et al. 201953 ResNet34 No 490 Images Randomly drawn No Reading Yes Retinal fundus Glaucomatous optic
from test sample centre grader photographs neuropathy
Kanagasingham et al. DCNN Yes 398 Patients Primary Care Practice, Midland, NA Yes Reading No Retinal fundus Referable DR
201821 Western Australia centre grader photography
Karri et al. 2017 GoogLeNet No 21 Scans Duke University Random split No NR No OCT (a) DME; (b) dry AMD
Keel et al. 201818 Inception-v3 Yes 93 Images St Vincent’s Hospital Melbourne NA Yes Reading No Retinal fundus Referable DR
and University Hospital centre grader photography
Geelong, Barwon Health
Krause et al. 201831 CNN No 1958 Images EyePACS-2 Hold- Yes Expert consensus No Retinal fundus Referable DR
out method photographs
Lee et al. 2017 VGG-16 No 2151 Scans Random split No Routine No OCT AMD
clinical notes
Lee et al. 2019 CNN No 200 Photographs Seoul National University Hold- No Other imaging No Retinal fundus Glaucoma
Hospital out method technique photographs
Li et al. 2018108 Inception-v3 No 8000 Scans Guangdong (China) Random split No Expert graders No Retinal fundus Glaucomatous optic
photography neuropathy
Li et al. 201955 VGG-16 No 1000 Images Shiley Eye Institute of the Random split No Expert consensus No OCT Choroidal
University of California San neovascularisation vs DME
Diego, the California Retinal vs drusen Vs normal
Research Foundation, Medical
Centre Ophthalmology
Associates, the Shanghai First
People’s Hospital, and Beijing
Tongren Eye Centre
Li et al. 2019 OCT-NET No 859 Scans Wenzhou Medical University Random split No Expert graders No OCT Early DR
Li et al. 201933 Inception-v3 No 800 Images Messidor-2 Random split Yes Reading No Retinal fundus Referable DR
centre grader photographs
Li et al. 2019 ResNet50 No 1635 Images Shanghai Zhongshan Hospital Random split No Reading Yes OCT DME
and the Shanghai First People’s centre grader
Hospital
Lin et al. 2019109 CC-Cruiser Yes— 350 Images Multicentre RCT NA NA Expert consensus Yes Slit-lamp Childhood cataracts
Li Z et al. 201833 CNN No 35,201 Photographs NIEHS, SiMES, AusDiab Random split Yes Reading No Retinal fundus Referable DR
centre grader photographs
Liu et al. 201835 ResNet50 No (a) 754; (b) 30 Photographs (a) NR; (b) HRF Random split Yes Reading Yes Retinal fundus Glaucomatous optic discs
centre grader photographs
Liu et al. 201934 CNN No (a) 28,569; (b) Photographs (a) Local Validation (Chinese Random split Yes Consensus No Retinal fundus Glaucomatous optic
20,466; (c) Glaucoma Study Alliance); (b) involving experts photographs neuropathy
12,718; (d) Beijing Tongren Hospital; (c) and non-experts
9305; (e) Peking University Third
Nagasato et al. 2019 VGG-16 No 466 Images NR K-fold cross No NR No Retinal fundus Retinal vein occlusion
validation photography
(optos)
Nagasato et al. DNN No 322 Scans Tsukazaki Hospital and K-fold cross No Expert graders Yes OCT Retinal vein occlusion
201958 Tokushima University Hospital validation
Nagasawa et al. 2019 VGG-16 No 378 Images Tsukazaki Hospital and K-fold cross No Expert graders No Retinal fundus Proliferative diabetic
Tokushima University Hospital validation photography retinopathy
(optos)
Ohsugi et al. 2017 DCNN No 166 Images Tsukazaki Hospital Random split No Expert consensus No Retinal fundus Rhegmatogenous retinal
photography detachment
(optos)
Peng et al. 201959 Inception-v3 No 900 Images AREDS Random split No Reading Yes Retinal fundus Age-related macular
centre grader photography degeneration-AREDS 4 step
Perdomo et al. 2019 OCT-NET No 2816 Images SERI-CUHK data set Random split No Expert graders No OCT DME
Phan et al. 2019 DenseNet201 No 828 Images Yamanashi Koseiren Hospital No Expert consensus No Retinal fundus Glaucoma
+ further imaging photography
Phene et al. 201937 Inception-v3 No (a) 1205; (b) Images (a) EyePACS, Inoveon, the Random split Yes Reading Yes Retinal fundus Glaucomatous optic
9642; (c) 346 United Kingdom Biobank, the centre grader photographs neuropathy
Age-Related Eye Disease Study,
and Sankara Nethralaya; (b)
Atlanta Veterans Affairs (VA)
Eye Clinic; (c) Dr. Shroff’s
Charity Eye Hospital, New
Delhi, India
Prahs et al. 2017 GoogLeNet No 5358 Images Heidelberg Eye Explorer, Random split No Expert graders No OCT Injection vs No injection
Heidelberg Engineering for AMD
Raju et al. 2017 CNN No 53,126 Images EyePACS-1 Random split No NR No Retinal fundus Referable DR
photography
Ramachandran et al. Visiona intelligent No (a) 485; Photographs (a) ODEMS; (b) Messidor NA Yes Expert graders No Retinal fundus Referable DR
201838 diabetic retinopathy (b) 1200 photographs
screening platform
Raumviboonsuk et al. Inception-v4 No (a–c) 25,348; Images National screening program for NA Yes Expert consensus Yes Retinal fundus (a) Moderate non-
201939 (d) 24,332 DR in Thailand photography proliferative DR or worse;
(b) severe non-proliferative
DR or worse; (c) proliferative
DR; (d) referable DME
Redd et al. 2018 Inception-v1 and U- No 4861 Images Multicentre i-ROP study NR No Expert graders + No Retinal fundus Plus disease in ROP
Net further imaging photography
Rogers et al. 201945 Pegasus (ResNet50) No 94 Photographs EODAT NA Yes Reading Yes Retinal fundus Glaucomatous optic
centre grader photographs neuropathy
Sandhu et al. 201819 Deep fusion SNCAE Yes 160 Scans University of Waikato NA No Clinical diagnosis No Retinal fundus Non-proliferative DR
photographs
Sayres et al. 201940 Inception-v4 No 2000 Images EyePACS-2 NA Yes Expert consensus Yes Retinal fundus Referable DR
photographs
Shibata et al. 201860 (a) ResNet; (b) VGG-16 No 110 Images Matsue Red Cross Hospital Random split No Expert consensus Yes Retinal fundus Glaucoma
photography
Stevenson et al. 2019 Inception-v3 No (a) 2333; (b) Photographs Publicly available databases Random split No Existing diagnosis No Retinal fundus (a) Glaucoma; (b) DR;
2283; (c) 2105 from source data photographs (c) AMD
Ting et al. 201741 VGGNet No (a) 71,896; (b) Images (a) Singapore National Diabetic Random split Yes Expert consensus No Retinal fundus Referable DR
15,798; (c) Retinopathy Screening photography
3052; (d) 4512; Program 2014–2015; (b)
(e) 1936; (f) Guangdong (China); (c)
1052; (g) 1968; Singapore Malay Eye Study; (d)
(h) 2302; (i) Singapore Indian Eye Study; (e)
1172; (j) 1254; Singapore Chinese Eye Study;
(k) 7706; (l) (f) Beijing Eye Study; (g) African
35,948; American Eye Disease Study; (h)
(m) 35,948 Royal Victoria Eye and Ear
Hospital; (i) Mexican; (j) Chinese
University of Hong Kong, (k, l)
Singapore National Diabetic
Retinopathy Screening
Program 2014–2015
Ting et al. 201942 VGGNet No 85,902 Images Combined eight datasets NA Yes Consensus No Retinal fundus (a) Any DR; (b) referable DR;
Yang et al. 2019 VGGNet No 500 Photographs Intelligent Ophthalmology Hold- No Expert consensus No Retinal fundus Referable DR
Database of Zhejiang Society out method photographs
for Mathematical Medicine
in China
Yoo et al. 2019 VGG-19 No 900 Scans Project Macula Random split No NR No (a) OCT; (b) retinal AMD
fundus
photographs
Zhang et al. 201961 VGG-16 No 1742 Images Telemed-R screening Random split No Expert consensus Yes Retinal fundus ROP
photographs
Zheng et al. 201920 Inception-v3 Yes 102 Scans Joint Shantou International Eye Hold- No NR No OCT Glaucomatous optic
Centre of Shantou University out method neuropathy
and the Chinese University of
Hong Kong (JSIEC)
Table 3. Characteristics of respiratory imaging studies.
Study Model Prospective? Test set Population Test datasets Type of internal validation External validation Reference standard AI vs clinician? Imaging modality Body system/disease
Abiyev et al. 2018 CNN No 380 Images Chest X-ray14 Random split No Routine clinical reports No X-ray Abnormal X-ray
Al-Shabi et al. 2019 Local-Global No 848 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Alakwaa et al. 2017 U-Net No 419 Scans Kaggle Data Science Bowl Random split No Expert reader, existing No CT Lung cancer
labels in dataset
Ali et al. 2018 3D CNN No 668 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Annarumma et al. CNN No 15,887 Images Kings College London Hold-out method No Routine clinical reports No X-ray (a) Critical radiographs;
in Kampala, Uganda
Behzadi-Khormouji (a) ChestNet; (b) VGG-16; No 582 X-rays Guangzhou Women and NR No Expert readers No X-ray Consolidation
et al. 2020 (c) DenseNet121 Children’s Medical Centre
Beig et al. 2019 CNN No 145 Scans Erlangen Germany, Random split No Histopathology No CT Lung cancer
Waukesha Wis, Cleveland
Ohio, Tochigi-ken Japan
Causey et al. 2018 CNN No (a) 424; (b) 213 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Cha et al. 201976 ResNet50 No (a) 1483; (b) 500 X-rays Samsung Medical Random split No Other imaging, expert Yes X-ray (a) Lung cancer; (b) T1
Centre, Seoul readers lung cancer
Chae et al. 201977 Ct-LUNGNET No 60 Nodules Chonbuk National Random split No Expert readers, Yes CT Nodules
University Hospital histopathology,
follow up
Chakravarthy Probabilistic neural No 119 Scans LIDC/IDRI NR No NR No CT Lung cancer
et al. 2019 network
Chen et al. 2019 3D CNN No 3674 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Cheng et al. 2016 Stacked denoising No 1400 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
autoencoder
Cicero et al. 2017 GoogLeNet No 2443 Images Department of Medical Random split No Expert readers, routine No X-ray (a) Effusion; (b) oedema;
Imaging, St Michael’s clinical reports (c) consolidation; (d)
Hospital, Toronto cardiomegaly; (e)
pneumothorax
Ciompi et al. 201778 ConvNet No 639 Nodules Danish Lung Cancer Random split No Non-expert readers Yes CT (a) Nodules—solid; (b)
Screening Trial (DLCST) nodules—calcified; (c)
nodules—part-solid; (d)
nodules—non-solid; (e)
nodules—perifissural; (f)
nodules—spiculated
Correa et al. 2018 CNN No 60 Images Lima, Peru NR No Expert readers No Ultrasound Paediatric pneumonia
da Silva et al. 2017 Evolutionary CNN No 200 Nodules LIDC-IDRI Hold-out method No Expert readers No CT Nodules
da Silva et al. 2018 Particle swarm No 2000 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
optimisation algorithm
within CNN
Dai et al. 2018 3D DenseNet-40 No 211 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Dou et al. 2017 3D CNN No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Dunnmon et al. ResNet18 No 533 Images Stanford University Hold-out method No Expert consensus Yes X-ray Abnormal X-ray
201979
Gao et al. 2018 CNN No 20 Scans University Hospitals Random split No NR No CT Interstitial lung disease
of Geneva
Gong et al. 2019 3D SE-ResNet No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Gonzalez et al. 2018 CNN No 1000 Scans ECLIPSE study Random split No NR No CT COPD
Gruetzemacher DNN No 1186 Nodules LUNA16 Ninefold cross No NR No CT Nodules
et al. 2018 validation
Gu et al. 2018 3D CNN No 1186 Nodules LUNA16 Tenfold cross No Expert readers No CT Nodules
validation
Hamidian et al. 2017 3D CNN No 104 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Han et al. 2018 Multi-CNNs No 812 Regions of LIDC-IDRI Random split No NR No CT Ground glass opacity
interest
Heo et al. 2019 VGG-19 No 37,677 X-rays Yonsei University Hospital, Hold-out method No Expert readers No X-ray Tuberculosis
South Korea
Hua et al. 2015 (a) CNN; (b) deep belief No 2545 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
network
Huang et al. 2019 R-CNN No 176 Scans LIDC-IDRI Random split No Expert readers No CT Nodules
Huang et al. 2019 Amalgamated-CNN No 1795 Nodules LIDC/IDRI and Ali Tianchi Random split No Expert readers No CT Nodules
medical
Hussein et al. 2019 VGG No 1144 Nodules LIDC/IDRI Random split No Expert readers No CT Lung cancer
Hwang et al. 201867 DCNN No (a) 450; (b) 183; X-rays (a) Internal validation; (b) Random split Yes Expert readers Yes X-ray Tuberculosis
(c) 140; (d) 173; Seoul National University
(e) 170; (f) 132; Hospital; (c) Boromae
(g) 646 Hospital; (d) Kyunghee
Jung et al. 2018 3D DCNN No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Kang et al. 2017 3D multi view-CNN No 776 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Kermany et al. 2018 Inception-v3 No 624 X-rays Guangzhou Women and Random split No Expert readers Yes X-ray Pneumonia
Children’s Medical Centre
Kim et al. 2019 MGI-CNN No 1186 Nodules LIDC/IDRI NR No Expert readers No CT Nodules
Lakhani et al. 201790 (a) AlexNet; (b) No 150 X-rays Montgomery County MD, Random split No Routine clinical reports, No X-ray Tuberculosis
GoogLeNet; (c) Ensemble Shenzhen China, Belarus TB expert reader,
(AlexNet + GoogLeNet); public Health Program, histopathology
(d) Radiologist augmented Thomas Jefferson University
Hospital
Li et al. 2016 CNN No 8937 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Li et al. 201981 DL-CAD No 812 Nodules Shenzhen Hospital NR No Expert consensus Yes CT Nodules
Li et al. 201980 CNN No 200 Scans Massachusetts General Random split No Routine clinical reports Yes CT Pneumothorax
Hospital
Liang et al. 202068 CNN No 100 Images Kaohsiung Veterans General NA Yes Other imaging No X-ray Nodules
Hospital, Taiwan
Liang et al. 2019 (a) Custom CNN; (b) VGG- No 624 X-rays Guangzhou Women and Random split No Expert readers No X-ray Pneumonia
16; (c) DenseNet121; (d) Children’s Medical Centre
Inception-v3; (e) Xception
Liu et al. 2017 3D CNN No 326 Nodules National Lung Cancer Fivefold cross No Histopathology, No CT Nodules
Screening Trial and Early validation follow up
Lung Cancer Action
Liu H et al. 2019 Segmentation-based deep No 112,120 X-rays Chest X-ray14 NR No Routine clinical reports No X-ray (a) Atelectasis; (b)
fusion network cardiomegaly; (c)
effusion; (d) infiltration;
(e) mass; (f) nodule; (g)
pneumonia; (h)
pneumothorax; (i)
consolidation; (j) oedema;
(k) emphysema; (l)
fibrosis; (m) fibrosis; (n)
Nibali et al. 2017 ResNet No 166 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Nishio et al. 2018 VGG-16 No 123 Nodules Kyoto University Hospital Random split No NR No CT Nodules
Onishi et al. 2019 AlexNet No 60 Nodules NR NR No Histopathology, No CT Nodules
follow up
Onishi et al. 2019 Wasserstein generative No 60 Nodules Fujita Health University NR No Histopathology, No CT Nodules
adversarial network Hospital follow up
Park et al. 201989 YOLO No 503 X-rays Asan Medical Centre and Hold-out method No Expert reader No X-ray Pneumothorax
Seoul National University
Bundang Hospital
Park et al. 201983 CNN No 200 Images Asan Medical Centre and Hold-out method No Expert consensus Yes X-ray (a) Nodules; (b) opacity;
Seoul National University (c) effusion; (d)
Bundang Hospital pneumothorax; (e)
abnormal chest X-ray
Pasa et al. 2019 Custom CNN No 220 X-rays NIH Tuberculosis Chest X- Random split No NR No X-ray Tuberculosis
ray dataset and Belarus
Tuberculosis Portal dataset
Patel et al. 201984 CheXMax No 50 X-rays Stanford University Hold-out method No Expert reader, other Yes X-ray Pneumonia
imaging, clinical notes
Paul et al. 2018 VGG-s CNN No 237 Nodules National Lung Cancer Hold-out method No Expert readers, No CT Nodules
Screening Trial follow up
Pesce et al. 2019 Convolution networks No 7850 X-rays Guy’s and St. Thomas’ NHS Random split No Routine clinical reports No X-ray Lung lesions
with attention feedback Foundation Trust
(CONAF)
Pezeshk et al. 2019 3D CNN No 128 Nodules LUNA16 Random split No Expert readers No CT Nodules
Qin et al. 201970 (a) Lunit; (b) qXR (Qure.ai); No 1196 X-rays Nepal and Cameroon NA Yes Expert readers Yes X-ray Tuberculosis
(c) CAD4TB
Rajpurkar et al. 201885 CNN No 420 X-rays ChestXray-14 Random split No Routine clinical reports Yes X-ray (a) Atelectasis; (b)
cardiomegaly; (c)
consolidation; (d)
oedema; (e) effusion; (f)
emphysema; (g) fibrosis;
(h) hernia; (i) infiltration;
(j) mass; (k) nodule; (l)
pleural thickening; (m)
pneumonia; (n)
pneumothorax
Ren et al. 2019 Manifold regularized No 98 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
classification deep neural
network
Sahu et al. 2019 Multi-section CNN No 130 Nodules LIDC-IDRI Tenfold cross No Expert readers No CT Nodules
validation
Schwyzer et al. 2018 CNN No 100 Patients NR NR No NR No FDG-PET Lung cancer
Setio et al. 201671 ConvNet No (a) 1186; (b) 50; (a) Nodules; (b) LIDC-IDRI Fivefold cross Yes (a) Expert readers; No CT Nodules
(c) 898 scans; (c) nodules validation (b, c) NR
Shaffie et al. 2018 Deep autoencoder No 727 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Shen et al. 2017 Multiscale CNN No 1375 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Sim et al. 201972 ResNet50 No 800 Images Freiberg University Hospital NA Yes Other imaging, Yes X-ray Nodules
Freiburg, Massachusetts histopathology
General Hospital Boston,
Samsung Medical Centre
Seoul, Severance
Hospital Seoul
Singh et al. 201886 Qure-AI No 724 Chest Chest X-ray8 Random split No Routine clinical reports Yes X-ray (a) Lesions; (b) effusion;
radiographs (c) hilar prominence; (d)
cardiomegaly
Song et al. 2017 (a) CNN; (b) DNN; (c) No 5024 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
stacked autoencoder
Stephen et al. 2019 CNN No 2134 Images Guangzhou Women and Random split No NR No X-ray Pneumonia
Children’s Medical Centre
Sun et al. 2017 (a) CNN; (b) deep belief No 88,948 Samples LIDC-IDRI Tenfold cross No Expert readers No CT Nodules
network; (c) stacked validation
denoising autoencoder
VGG-16
Tran et al. 2019 LdcNet No 1186 Nodules LUNA16 Tenfold cross No Expert readers No CT Nodules
validation
Tu et al. 2017 CNN No 20 Nodules LIDC-IDRI Tenfold cross No Expert readers No CT (a) Nodules—non-solid;
validation (b) nodules—part-solid;
(c) nodules—solid
Uthoff et al. 201974 CNN No 100 Nodules INHALE STUDY NA Yes Histopathology, No CT Nodules
follow up
Walsh et al. 201887 Inception-ResNet-v2 No 150 Scans La Fondazione Policlinico Random split No Expert readers Yes CT Interstitial lung disease
Universitario A Gemelli
IRCCS, Rome, Italy, and
University of Parma,
Parma, Italy
Wang et al. 2017 AlexNet No 230 X-rays Japanese Society of Tenfold cross No Other imaging No X-ray Nodules
Radiological Technology validation
(JSRT) database
Wang et al. 201888 3D CNN No 200 Scans Fudan University Shanghai Random split No Expert readers, Yes HRCT Lung cancer
Cancer Centre histopathology
Wang et al. 2018 VGG-16 No 744 X-rays JSRT, OpenI, SZCX and MC Random split No Other imaging No X-ray (a) Abnormal chest X-ray;
(b) normal chest X-ray
Wang et al. 2019 ChestNet No 442 X-rays Zhejiang University School Random split No Expert readers No X-ray Pneumothorax
of Medicine (ZJU-2) and
Chest X-ray14
Wang et al. 2019 ResNet152 No 25,596 X-rays Chest X-ray14 Random split No Routine clinical reports No X-ray (a) Atelectasis; (b)
cardiomegaly; (c)
effusion; (d) infiltration;
(e) mass; (f) nodule; (g)
pneumonia; (h)
pneumothorax; (i)
consolidation; (j) oedema;
(k) emphysema; (l)
fibrosis; (m) pleural
network validation
Zhang C et al. 2019 3D CNN Yes 50 Images Guangdong Lung Cancer Random split Yes Histopathology, Yes CT Nodules
Institute follow up
Zhang et al. 201963 Mask R-CNN No 134 Slices Shenzhen Hospital Random split No Expert readers No CT/PET Lung cancer
Zhang S et al. 2019 Le-Net5 No 762 Nodules LIDC/IDRI Random split No Expert readers No CT Nodules
Zhang T et al. 2017 Deep Belief Network No 1664 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhao X et al. 2018 Agile CNN No 743 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhao X et al. 2019 (a) AlexNet; (b) No 2028 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
GoogLeNet; (c) ResNet; (d)
VifarNet
Zheng et al. 2019 CNN No 1186 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhou et al. 2019 Inception-v3 and No 600 Images Chest X-ray8 Random split No Routine clinical reports No X-ray Cardiomegaly
ResNet50
Table 4. Characteristics of breast imaging studies.
Study Model Prospective? Test set Population Test datasets Type of internal validation External validation Reference standard AI vs clinician? Imaging modality Body system/disease
Becker et al. DNN No 192 Lesions Private Random split No Follow up, histology Yes Ultrasound Breast cancer
201862
Bevilacqua VGG-S No 39 Images NR NR No NR No Digital breast Breast cancer
et al. 2019 tomosynthesis
Byra et al. 201999 VGG-19 No (a) 150; (b) 163; Images (a) Moores Cancer Center, Random split No (a) Follow up, Yes Ultrasound Breast cancer
(c) 100 University of California; (b) histology; (b) expert
UDIAT (c) OASBUD reader; (c) expert
reader, histology,
follow up
Cai et al. 2019 CNN No 99 Images SYSUCC and Foshan, China Random split No Histology No Mammogram Breast cancer
Cao et al. 2019 SSD300 + ZFNet No 183 Lesions Sichuan Provincial People’s Random split No Expert consensus No Ultrasound Breast cancer
Hospital
Cao et al. 2019 NF-Net No 272 Lesions Sichuan Provincial People’s Random split No Histology No Ultrasound Breast cancer
Hospital
Cheng et al. 2016 Stacked denoising No 520 Lesions Taipei Veterans General NR No Histology No Ultrasound Breast Nodules
autoencoder Hospital
Chiao et al. 2019 Mask R-CNN No 61 Images China Medical University Random split No Histology, routine No Ultrasound Breast cancer
Hospital clinical report
Choi et al. 2019100 CNN No 253 Lesions Samsung Medical NR No Follow up, histology Yes Ultrasound Breast cancer
Centre, Seoul
Chougrad Inception-v3 No (a) 5316; (b) Images (a) DDSM; (b) Inbreast; Random split No (a) Follow up, No Mammogram Breast cancer
et al. 2018 600; (c) 200 (c) BCDR histology, expert
reader; (b) expert
reader, histology; (c)
clinical reports
Ciritsis et al. dCNN No (a) 101; (b) 43 Images (a) Internal validation; (b) Random split Yes Follow up, histology Yes Ultrasound Breast cancer
201992 external validation
Cogan et al. ResNet-101 Faster R- No 124 Images INbreast NA Yes Expert reader, No Mammogram Breast cancer
201993 CNN histology,
Dalmis et al. 2018 U-Net No 66 Images NR Random split No Follow up, histology No MRI Breast cancer
Dalmis et al. DenseNet No 576 Lesions Raboud University NR No Follow up, histology Yes MRI Breast cancer
2019101 Medical Center
Dhungel CNN No 82 Images INbreast Random split No Expert reader, No Mammogram Breast cancer
et al. 2017 histology,
Duggento CNN No 378 Images Curated Breast Imaging Random split No Expert reader No Mammogram Breast cancer
et al. 2019 SubSet of DDSM (CBIS-DDSM)
Fan et al. 2019 Faster R-CNN No 182 Images Fudan University Affiliated Random split No Histology No Digital breast Breast cancer
Cancer Centre tomosynthesis
Fujioka et al. GoogleNet No 120 Lesions Private Random split No Follow up, histology Yes Ultrasound Breast cancer
2019102
Gao et al. 2018 SD-CNN No (a) 49; (b) 89 (a) Lesions; (a) Mayo Clinic Arizona; (b) NR No (a) Histology; (b) No (a) Contrast enhanced Breast cancer
(b) images Inbreast expert reader, digital mammogram;
histology (b) mammogram
Ha et al. 2019 CNN No 60 Images Columbia University Random split No Follow up, histology No Mammogram DCIS
Medical Center
Han et al. 2017 GoogleNet No 829 Lesions Samsung Medical Random split No Histology No Ultrasound Breast cancer
Centre, Seoul
Herent et al. 2019 ResNet50 No 168 Lesions Journees Francophones de Random split No NR No MRI Breast cancer
Radiologie 2018
Hizukuri CNN No 194 Images Mie University Hospital Random split No Follow up, histology No Ultrasound Breast cancer
et al. 2018
Huyng et al. 2016 AlexNet No 607 Images University of Chicago NR No Histology No Mammogram Breast cancer
Jadoon et al. 2016 CNN-DW No 2976 Images IRMA NR No Histology No Mammogram Breast cancer
Jiao et al. 2016 CNN No 300 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
expert reader
Jiao et al. 2018 (a) AlexNet; (b) No (a) 150; (b) 150 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
parasitic metric expert reader
learning layers
Jung et al. 2018 RetinaNet No (a) 410; (b) 222 Images (a) Inbreast; (b) GURO Random split No (a) Expert reader; (b) No Mammogram Breast cancer
histology
database histology,
Kooi T et al. 2017 CNN No 1804 Images Netherlands screening Hold-out method No Expert reader, No Mammogram Breast Cancer
database histology,
Li et al. 2019 DenseNet-II No 2042 Images First Hospital of Shanxi Tenfold cross No Expert reader No Mammogram Breast cancer
Medical University validation
Li et al. 2019 VGG-16 No (a) 1854; Images Nanfang Hospital Fivefold cross No Follow up, histology No (a) Digital breast Breast cancer
(b) 1854 validation tomosynthesis; (b)
mammogram
Lin et al. 2014 FCMNN No 65 Images Far Eastern Memorial Tenfold cross No Histology No Ultrasound Breast cancer
Hospital, Taiwan validation
McKinney et al. MobileNetV2 - No (a) 25,856; Images (a) UK; (b) USA Random split Yes Follow up, histology Yes Mammogram Breast cancer
202094 ResNet-v2-50, (b) 3097
ResNet-v1-50
Mendel et al. 2018 VGG-19 No (a) 78; (b) 78 Images University of Chicago Leave-one- No Follow up, histology No (a) Mammogram; (b) Breast cancer
out method digital breast
tomosynthesis
Peng et al. 201695 ANN No (a) 100; (b) 100 Images (a) MIAS; (b) BancoWeb Hold-out method Yes Expert reader No Mammogram Breast cancer
Qi et al. 2019 Inception-Resnet-v2 No 1359 Images West China Hospital, Sichuan Random split No Expert consensus No Ultrasound Breast cancer
University
Qiu et al. 2017 CNN No 140 Images Private Random split No Histology No Mammogram Breast cancer
Ragab et al. 2019 AlexNet No (a) 676; Images (a) Digital database for Random split No Follow up, histology, No Mammogram Breast cancer
(b) 1581 screening mammography expert reader
Ribli et al. 201896 VGG-16 No 115 Images INbreast NA Yes Expert reader, No Mammogram Breast cancer
histology
Rodriguez-Ruiz CNN No 240 Images Two datasets combined NA Yes Expert reader, Yes Mammogram Breast cancer
et al. 201897 histology, follow up
Rodriguez-Ruiz CNN No 2642 Images Combined nine datasets NA Yes Follow up, histology Yes Mammogram Breast cancer
et al. 201998
Samala et al. 2016 DCNN No 94 Images University of Michigan Random split No Expert reader No Digital breast Breast cancer
Tao et al. 2019 RefineNet + No 253 Lesions Huaxi Hospital and China- Random split No Expert reader No Ultrasound Breast cancer
DenseNet121 Japan Friendship Hospital
Teare et al. 2017 Inception-v3 No 352 Images DDSM + Zebra Random split No Follow up, histology No Mammogram Breast cancer
Mammography Dataset
Truhn et al. CNN No 129 Lesions RWTH Aachen University, Random split No Follow up, histology Yes MRI Breast cancer
2018104
Wang et al. 2016 Inception-v3 No 74 Images Breast Cancer Digital Random split No Expert reader, No Mammogram Breast cancer
Repository (BCDR) histology
Wang et al. 2016 Stacked autoencoder No 204 Images Sun Yat-sen University Cancer Hold-out method No Histology No Mammogram Breast cancer
Center (Guangzhou, China)
and Nanhai Affiliated Hospital
of Southern Medical
University (Foshan, China)
Wang et al. 2017 CNN No 292 Images University of Chicago Random split No Histology No Mammogram Breast cancer
Wang et al. 2018 DNN No 292 Images University of Chicago Random split No Histology No Mammogram Breast cancer
Wu et al. 2019105 ResNet-22 No (a) 401; Images NYU Hold-out method No Histology Yes Mammogram Breast cancer
(b) 1440
Xiao et al. 2019 Inception-v3, No 206 Images Third Affiliated Hospital of Random split No Surgical confirmation, No Ultrasound Breast cancer
ResNet50, Xception Sun Yat-sen University histology
Yala et al. 2019106 ResNet18 No 26,540 Images Massachusetts General Random split No Clinical reports, follow Yes Mammogram Breast cancer
Hospital, Harvard Medical up, histology
School,
Yala et al. 2019111 ResNet18 No 8751 Images Massachusetts General Random split No Clinical reports, follow No Mammogram Breast cancer
Hospital, Harvard Medical up, histology
School,
Yap et al. 2018 FCN-AlexNet No (a) 306; (b) 163 Lesions (a) Private; (b) UDIAT NR No Expert reader No Ultrasound Breast cancer
Yap et al. 2019 FCN-8s No 94 Lesions Two datasets combined NR No Expert reader No Ultrasound Breast cancer
Yousefi et al. 2018 DCNN No 28 Images MGH Random split No Expert consensus No Digital breast Breast cancer
tomosynthesis
Zhou et al. 3D DenseNet No 307 Lesions Private Random split No Follow up, histology Yes MRI Breast cancer
2019107
Data
Image pre-processing, augmentation and preparation: Are data augmentation techniques such as cropping, padding and flipping used? Is there quality control of the images being used to train the algorithm, i.e., were poor quality images excluded? Were relevant images manually selected?
Study design: Retrospective or prospective data collection.
Image eligibility: How are images chosen for inclusion in the study? Were the data from private or open-access repositories?
Training, validation, test sets: Are each of the three sets independent of each other, without overlap? Does data from the same patient appear in multiple datasets?
Datasets: Are the datasets used single or multicentre? Is a public or open-source dataset used?
Size of datasets: Wide variation in size of datasets for training and testing. Is the size of the datasets justified? Are sample size statistical considerations applied for the test set?
Use of ‘external’ test sets for final reporting: Is an independent test set used for ‘external validation’? Is the independent test set constructed using an unenriched representative sample?
Multi-vendor images: Are images from different scanners and vendors included in the datasets to enhance generalisability? Are imaging acquisition parameters described?
Algorithm
Index test: Was sufficient detail given on the algorithm to allow replication and independent validation? What type of algorithm was used, e.g., CNN, autoencoder, SVM? Was the algorithm made publicly or commercially available? Was the construct or architecture of the algorithm made available?
Additional AI algorithmic information: Is the algorithm a static model or is it continuously evolving?
Demonstrate how algorithm makes decisions: Is there a specific design for end-user interpretability, e.g., saliency or probability maps?
Methods
Transfer learning: Was transfer learning used for training and validation?
Cross validation: Was k-fold cross validation used during training to reduce the effects of randomness in dataset splits?
Reference standard: Is the reference standard used of high quality and widely accepted in the field? What was the rationale for choosing the reference standard?
Additional clinical information: Was additional clinical information given to healthcare professionals to simulate the normal clinical process?
Performance benchmarking: What was the performance of the algorithm benchmarked to? What is the expertise level and level of consensus of the healthcare professionals, if used?
Results
Raw diagnostic accuracy data: Are raw diagnostic accuracy data reported in a contingency table demonstrating TP, FP, FN, TN?
Metrics for estimating diagnostic accuracy performance: Which diagnostic accuracy metrics are reported? Sensitivity, specificity, PPV, NPV, accuracy, AUROC.
Unit of assessment: Which unit of assessment is reported, e.g., per patient, per scan or per lesion?
Rows in bold are part of STARD-2015 criteria.
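One recurring issue flagged in the list above (see the "Training, validation, test sets" row) is patient-level overlap between splits. The following is a minimal sketch, using hypothetical patient identifiers, of how an image-level dataset can be partitioned at the patient level so that no patient contributes images to more than one of the training, validation and test sets; it is illustrative only and not the procedure of any study in the review.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical image-level records: several images can share one patient ID.
patient_ids = np.array([f"P{idx:03d}" for idx in rng.integers(0, 300, size=1000)])

# Split at the PATIENT level, not the image level, so that no patient
# appears in more than one of the training/validation/test sets.
unique_patients = rng.permutation(np.unique(patient_ids))
n = len(unique_patients)
train_p = set(unique_patients[: int(0.7 * n)])
val_p = set(unique_patients[int(0.7 * n): int(0.85 * n)])
test_p = set(unique_patients[int(0.85 * n):])

train_idx = np.where([p in train_p for p in patient_ids])[0]
val_idx = np.where([p in val_p for p in patient_ids])[0]
test_idx = np.where([p in test_p for p in patient_ids])[0]

# Sanity check: the three patient sets are disjoint.
assert not (train_p & val_p) and not (train_p & test_p) and not (val_p & test_p)
print(len(train_idx), len(val_idx), len(test_idx))
```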
[...] studies is not fully applicable to clinical DL studies115. The variation in reporting makes it very difficult to formally evaluate the performance of algorithms. Furthermore, differences in reference standards, grader capabilities, disease definitions and thresholds for diagnosis make direct comparison between studies and algorithms very difficult. This can only be improved with well-designed and executed studies that explicitly address questions concerning transparency, reproducibility, ethics and effectiveness116 and specific reporting standards for AI studies115,117.
The QUADAS-2 (ref. 118) assessment tool was used to systematically evaluate the risk of bias and any applicability concerns of the diagnostic accuracy studies. Although this tool was not designed for DL diagnostic accuracy studies, the evaluation allowed us to judge that a majority of studies in this field are at risk of bias or concerning for applicability. Of particular concern was the applicability of reference standards and patient selection.
Despite our results demonstrating that DL algorithms have a high diagnostic accuracy in medical imaging, it is currently difficult to determine if they are clinically acceptable or applicable. This is partially due to the extensive variation and risk of bias identified in the literature to date. Furthermore, the definition of what threshold is acceptable for clinical use and tolerance for errors varies greatly across diseases and clinical scenarios119.

Limitations in the literature
Dataset. There are broad methodological deficiencies among the included studies. Most studies were performed using retrospectively collected data, using reference standards and labels
that were not intended for the purposes of DL analysis. Minimal prospective studies, and only two randomised studies109,120 evaluating the performance of DL algorithms in clinical settings, were identified in the literature. Proper acquisition of test data is essential to interpret model performance in a real-world clinical setting. Poor quality reference standards may result in decreased model performance due to suboptimal data labelling in the validation set28, which could be a barrier to understanding the true capabilities of the model on the test set. This is symptomatic of the larger issue that there is a paucity of gold-standard, prospectively collected, representative datasets for the purposes of DL model testing. However, as there are many advantages to using retrospectively collected data, the resourceful use of retrospective or synthetic data with the use of labels of varying modality and quality represent important areas of research in DL121.

Fig. 2 QUADAS-2 summary plots. Risk of bias and applicability concerns summary about each QUADAS-2 domain presented as percentages across the 82 included studies in ophthalmic imaging (a), 115 in respiratory imaging (b) and 82 in breast imaging (c).

Study methodology. Many studies did not undertake external validation of the algorithm in a separate test set and relied upon results from the internal validation data; the same dataset used to train the algorithm initially. This may lead to an overestimation of the diagnostic accuracy of the algorithm. The problem of overfitting has been well described in relation to machine learning algorithms122. True demonstration of the performance of these algorithms can only be assumed if they are externally validated on separate test sets with previously unseen data that are representative of the target population.
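To illustrate the overestimation described above, the following toy sketch fits a deliberately over-parameterised classifier on synthetic, uninformative data and compares its apparent AUC on the data used to develop it with its AUC on previously unseen data. All names and numbers here are hypothetical; this is not a re-analysis of any included study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic "internal" dataset: 200 cases, 500 pure-noise features and random
# labels, deliberately over-parameterised so the model can memorise noise.
X_internal = rng.normal(size=(200, 500))
y_internal = rng.integers(0, 2, size=200)

# A separate "external" test set drawn from the same uninformative distribution.
X_external = rng.normal(size=(200, 500))
y_external = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X_internal, y_internal)

# Apparent performance on the development data is inflated (typically close to 1),
# whereas performance on unseen external data sits near chance (AUC ~0.5).
print("internal AUC:", roc_auc_score(y_internal, model.predict_proba(X_internal)[:, 1]))
print("external AUC:", roc_auc_score(y_external, model.predict_proba(X_external)[:, 1]))
```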
Surprisingly, few studies compared the diagnostic accuracy of DL algorithms against expert human clinicians for medical imaging. This would provide a more objective standard that would enable better comparison of models across studies. Furthermore, application of the same test dataset for diagnostic performance assessment of DL algorithms versus healthcare professionals was identified in only select studies13. This methodological deficiency limits the ability to gauge the clinical applicability of these algorithms in clinical practice. Similarly, this issue can extend to model-versus-model comparisons. Specific methods of model training or model architecture may not be described well enough to permit emulation for comparison123. Thus, standards for model development and comparison against controls will be needed as DL architectures and techniques continue to develop and are applied in medical contexts.

Reporting. There was varying terminology and a lack of transparency in DL studies with regards to the validation or test sets used. The term ‘validation’ was used interchangeably to describe either an external test set for the final algorithm or an internal dataset that is used to fine tune the model prior to ‘testing’. Furthermore, the inconsistent terminology led to difficulties understanding whether an independent external test set was used to test diagnostic performance13.
Crucially, we found broad variation in the metrics used as outcomes for the performance of the DL algorithms in the literature. Very few studies reported true positives, false positives, true negatives and false negatives in a contingency table, as should be the minimum for diagnostic accuracy studies114. Moreover, some studies only reported metrics, such as the dice coefficient, F1 score, competition performance metric and Top-1 accuracy, that are often used in computer science but may be unfamiliar to clinicians13. Metrics such as AUC, sensitivity, specificity, PPV and NPV should be reported, as these are more widely understood by healthcare professionals. However, it is noted that NPV and PPV are dependent on the underlying prevalence of disease; as many test sets are artificially constructed or balanced, reporting the NPV or PPV may not be valid. The wide range of metrics reported also leads to difficulty in comparing the performance of algorithms on similar datasets.
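The prevalence dependence noted above follows directly from the standard relations for predictive values (Bayes' theorem); these are textbook formulas rather than anything specific to this review:

\[
\mathrm{PPV}=\frac{\text{sensitivity}\times\text{prevalence}}{\text{sensitivity}\times\text{prevalence}+(1-\text{specificity})\times(1-\text{prevalence})},
\]
\[
\mathrm{NPV}=\frac{\text{specificity}\times(1-\text{prevalence})}{\text{specificity}\times(1-\text{prevalence})+(1-\text{sensitivity})\times\text{prevalence}}.
\]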
Study strengths and limitations
This systematic review and meta-analysis statistically appraises pooled data collected from 279 studies. It is the largest study to date examining the diagnostic accuracy of DL on medical imaging. However, our findings must be viewed in consideration of several limitations. Firstly, as we believe that many studies have methodological deficiencies or are poorly reported, these studies may not be a reliable source for evaluating diagnostic accuracy. Consequently, the estimates of diagnostic performance provided in our meta-analysis are uncertain and may represent an overestimation of the true accuracy. Secondly, we did not conduct a quality assessment for the transparency of reporting in this review. This was because current guidelines to assess diagnostic accuracy reporting standards (STARD-2015114) were not designed for DL studies and are not fully applicable to the specifics and nuances of DL research115. Thirdly, due to the nature of DL studies, we were not able to perform classical statistical comparison of measures of diagnostic accuracy between different imaging modalities. Fourthly, we were unable to separate each imaging modality into different subsets, to enable comparison across subsets and allow the heterogeneity and variance to be broken down. This was because our study aimed to provide an overview of the literature in each specific speciality, and it was beyond the scope of this review to examine each modality individually. The inherent differences in imaging technology, patient populations, pathologies and study designs meant that attempting to derive common lessons across the board did not always offer easy comparisons. Finally, our review concentrated on DL for speciality-specific medical imaging, and therefore it may not be appropriate to generalise our findings to other forms of medical imaging or AI studies.

Future work
For the quality of DL research to flourish in the future, we believe that the adoption of the following recommendations is required as a starting point.

Availability of large, open-source, diverse anonymised datasets with annotations. This can be achieved through governmental support and will enable greater reproducibility of DL models124.

Collaboration with academic centres to utilise their expertise in pragmatic trial design and methodology125. Rather than classical trials, novel experimental and quasi-experimental methods to evaluate DL have been proposed and should be evaluated126. This may include ongoing evaluation of algorithms once in clinical practice, as they continue to learn and adapt to the population that they are implemented in.

Creation of AI-specific reporting standards. A major reason for the difficulties encountered in evaluating the performance of DL on medical imaging is inconsistent and haphazard reporting. Although DL is widely considered a ‘predictive’ model (where TRIPOD may be applied), the majority of AI interventions close to translation currently published are predominantly in the field of diagnostics (with specifics on index tests, reference standards and true/false positives/negatives and summary diagnostic scores, centred directly in the domain of STARD). Existing reporting guidelines for diagnostic accuracy studies (STARD)114, prediction models (TRIPOD)127, randomised trials (CONSORT)128 and interventional trial protocols (SPIRIT)129 do not fully cover DL research due to specific considerations in methodology, data and interpretation required for these studies. As such, we applaud the recent publication of the CONSORT-AI117 and SPIRIT-AI130 guidelines, and await AI-specific amendments of the TRIPOD-AI131 and STARD-AI115 statements (which we are convening). We trust that when these are published, studies being conducted will have a framework that enables higher quality and more consistent reporting.

Development of specific tools for determining the risk of study bias and applicability. An update to the QUADAS-2 tool taking into account the nuances of DL diagnostic accuracy research should be considered.

Updated specific ethical and legal framework. Outdated policies need to be updated and key questions answered in terms of liability in cases of medical error, doctor and patient understanding, control over algorithms and protection of medical data132. The World Health Organisation133 and others have started to develop guidelines and principles to regulate the use of AI. These regulations will need to be adapted by each country to fit their own political and healthcare context134. Furthermore, these guidelines will need to proactively and objectively evaluate technology to ensure best practices are developed and implemented in an evidence-based manner135.

CONCLUSION
DL is a rapidly developing field that has great potential in all aspects of healthcare, particularly radiology. This systematic review and meta-analysis appraised the quality of the literature and provided pooled diagnostic accuracy for DL techniques in three medical specialities. While the results demonstrate that DL
currently has a high diagnostic accuracy, it is important that these findings are assumed in the presence of poor design, conduct and reporting of studies, which can lead to bias and overestimating the power of these algorithms. The application of DL can only be improved with standardised guidance around study design and reporting, which could help clarify clinical utility in the future. There is an immediate need for the development of AI-specific STARD and TRIPOD statements to provide robust guidance around key issues in this field before the potential of DL in diagnostic healthcare is truly realised in clinical practice.
Exclusion criteria
Articles were excluded if the article was not written in English. Abstracts, conference articles, pre-prints, reviews and meta-analyses were not considered because an aim of this review was to appraise the methodology, reporting standards and quality of primary research studies being published in peer-reviewed journals. Studies that investigated the accuracy of image segmentation or predicting disease rather than identification or classification were excluded.

Data extraction and quality assessment
Two investigators (R.A. and V.S.) independently extracted demographic and diagnostic accuracy data from the studies, using a predefined electronic data extraction spreadsheet. The data fields were chosen subsequent to an initial scoping review and were, in the opinion of the investigators, sufficient to fulfil the aims of this review. Data were extracted on (i) first author, (ii) year of publication, (iii) type of neural network, (iv) population, (v) dataset—split into training, validation and test sets, (vi) imaging modality, (vii) body system/disease, (viii) internal/external validation methods, (ix) reference standard, (x) diagnostic accuracy raw data—true and false positives and negatives, (xi) percentages of AUC, accuracy, sensitivity, specificity, PPV, NPV and other metrics reported.
Three investigators (R.A., V.S. and GM) assessed study methodology using the QUADAS-2 checklist to evaluate the risk of bias and any applicability concerns of the studies118.
REFERENCES
1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Obermeyer, Z. & Emanuel, E. J. Predicting the future — big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).
3. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
4. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
5. Bluemke, D. A. et al. Assessing radiology research on artificial intelligence: a brief guide for authors, reviewers, and readers—from the radiology editorial board. Radiology 294, 487–489 (2020).
6. Wahl, B., Cossy-Gantner, A., Germann, S. & Schwalbe, N. R. Artificial intelligence (AI) and global health: how can AI contribute to health in resource-poor settings? BMJ Glob. Health 3, e000798–e000798 (2018).
7. Zhang, L., Wang, H., Li, Q., Zhao, M.-H. & Zhan, Q.-M. Big data and medical research in China. BMJ 360, j5910 (2018).
8. Nakajima, Y., Yamada, K., Imamura, K. & Kobayashi, K. Radiologist supply and workload: international comparison. Radiat. Med. 26, 455–465 (2008).
9. Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
10. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
11. Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. npj Digital Med. 3, 118 (2020).
12. Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA 319, 1317–1318 (2018).
13. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digital Health 1, e271–e297 (2019).
14. Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digital Med. 1, 39 (2018).
15. Bellemo, V. et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digital Health 1, e35–e44 (2019).
16. Christopher, M. et al. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci. Rep. 8, 16685 (2018).
17. Gulshan, V. et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 137, 987–993 (2019).
18. Keel, S., Wu, J., Lee, P. Y., Scheetz, J. & He, M. Visualizing deep learning models for the detection of referable diabetic retinopathy and glaucoma. JAMA Ophthalmol. 137, 288–292 (2019).
19. Sandhu, H. S. et al. Automated diagnosis and grading of diabetic retinopathy using optical coherence tomography. Investig. Ophthalmol. Vis. Sci. 59, 3155–3160 (2018).
20. Zheng, C. et al. Detecting glaucoma based on spectral domain optical coherence tomography imaging of peripapillary retinal nerve fiber layer: a comparison study between hand-crafted features and deep learning model. Graefes Arch. Clin. Exp. Ophthalmol. 258, 577–585 (2020).
21. Kanagasingam, Y. et al. Evaluation of artificial intelligence-based grading of diabetic retinopathy in primary care. JAMA Netw. Open 1, e182665–e182665 (2018).
22. Alqudah, A. M. AOCT-NET: a convolutional network automated classification of multiclass retinal diseases using spectral-domain optical coherence tomography images. Med. Biol. Eng. Comput. 58, 41–53 (2020).
23. Asaoka, R. et al. Validation of a deep learning model to screen for glaucoma using images from different fundus cameras and data augmentation. Ophthalmol. Glaucoma 2, 224–231 (2019).
24. Bhatia, K. K. et al. Disease classification of macular optical coherence tomography scans using deep learning software: validation on independent, multicenter data. Retina 40, 1549–1557 (2020).
25. Chan, G. C. Y. et al. Fusing results of several deep learning architectures for automatic classification of normal and diabetic macular edema in optical coherence tomography. In Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vol. 2018, 670–673 (IEEE, 2018).
26. Gargeya, R. & Leng, T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 124, 962–969 (2017).
27. Grassmann, F. et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology 125, 1410–1420 (2018).
28. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
29. Hwang, D. K. et al. Artificial intelligence-based decision-making for age-related macular degeneration. Theranostics 9, 232–245 (2019).
30. Keel, S. et al. Development and validation of a deep-learning algorithm for the detection of neovascular age-related macular degeneration from colour fundus photographs. Clin. Exp. Ophthalmol. 47, 1009–1018 (2019).
31. Krause, J. et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology 125, 1264–1272 (2018).
32. Li, F. et al. Automatic detection of diabetic retinopathy in retinal fundus photographs based on deep learning algorithm. Transl. Vis. Sci. Technol. 8, 4 (2019).
33. Li, Z. et al. An automated grading system for detection of vision-threatening referable diabetic retinopathy on the basis of color fundus photographs. Diabetes Care 41, 2509–2516 (2018).
34. Liu, H. et al. Development and validation of a deep learning system to detect glaucomatous optic neuropathy using fundus photographs. JAMA Ophthalmol. 137, 1353–1360 (2019).
35. Liu, S. et al. A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs. Ophthalmol. Glaucoma 1, 15–22 (2018).
36. MacCormick, I. J. C. et al. Accurate, fast, data efficient and interpretable glaucoma diagnosis with automated spatial analysis of the whole cup to disc profile. PLoS ONE 14, e0209409 (2019).
37. Phene, S. et al. Deep learning and glaucoma specialists: the relative importance of optic disc features to predict glaucoma referral in fundus photographs. Ophthalmology 126, 1627–1639 (2019).
38. Ramachandran, N., Hong, S. C., Sime, M. J. & Wilson, G. A. Diabetic retinopathy screening using deep neural network. Clin. Exp. Ophthalmol. 46, 412–416 (2018).
39. Raumviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digital Med. 2, 25 (2019).
40. Sayres, R. et al. Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology 126, 552–564 (2019).
41. Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318, 2211–2223 (2017).
42. Ting, D. S. W. et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. npj Digital Med. 2, 24 (2019).
43. Verbraak, F. D. et al. Diagnostic accuracy of a device for the automated detection of diabetic retinopathy in a primary care setting. Diabetes Care 42, 651 (2019).
44. Van Grinsven, M. J., van Ginneken, B., Hoyng, C. B., Theelen, T. & Sánchez, C. I. Fast convolutional neural network training using selective data sampling: application to hemorrhage detection in color fundus images. IEEE Trans. Med. Imaging 35, 1273–1284 (2016).
45. Rogers, T. W. et al. Evaluation of an AI system for the automated detection of glaucoma from stereoscopic optic disc photographs: the European Optic Disc Assessment Study. Eye 33, 1791–1797 (2019).
46. Al-Aswad, L. A. et al. Evaluation of a deep learning system for identifying glaucomatous optic neuropathy based on color fundus photographs. J. Glaucoma 28, 1029–1034 (2019).
47. Brown, J. M. et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 136, 803–810 (2018).
48. Burlina, P. et al. Utility of deep learning methods for referability classification of age-related macular degeneration. JAMA Ophthalmol. 136, 1305–1307 (2018).
49. Burlina, P. M. et al. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 135, 1170–1176 (2017).
50. Burlina, P., Pacheco, K. D., Joshi, N., Freund, D. E. & Bressler, N. M. Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis. Comput. Biol. Med. 82, 80–86 (2017).
51. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
52. Gómez-Valverde, J. J. et al. Automatic glaucoma classification using color fundus images based on convolutional neural networks and transfer learning. Biomed. Opt. Express 10, 892–913 (2019).
53. Jammal, A. A. et al. Human versus machine: comparing a deep learning algorithm to human gradings for detecting glaucoma on fundus photographs. Am. J. Ophthalmol. 211, 123–131 (2019).
54. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e1129 (2018).
55. Li, F. et al. Deep learning-based automated detection of retinal diseases using optical coherence tomography images. Biomed. Opt. Express 10, 6204–6226 (2019).
56. Long, E. et al. An artificial intelligence platform for the multihospital collaborative management of congenital cataracts. Nat. Biomed. Eng. 1, 0024 (2017).
57. Matsuba, S. et al. Accuracy of ultra-wide-field fundus ophthalmoscopy-assisted deep learning, a machine-learning technology, for detecting age-related macular degeneration. Int. Ophthalmol. 39, 1269–1275 (2019).
58. Nagasato, D. et al. Automated detection of a nonperfusion area caused by retinal vein occlusion in optical coherence tomography angiography images using deep learning. PLoS ONE 14, e0223965 (2019).
59. Peng, Y. et al. DeepSeeNet: a deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology 126, 565–575 (2019).
60. Shibata, N. et al. Development of a deep residual learning algorithm to screen for glaucoma from fundus photography. Sci. Rep. 8, 14665 (2018).
61. Zhang, Y. et al. Development of an automated screening system for retinopathy of prematurity using a deep neural network for wide-angle retinal images. IEEE Access 7, 10232–10241 (2019).
62. Becker, A. S. et al. Classification of breast cancer in ultrasound imaging using a generic deep learning analysis software: a pilot study. Br. J. Radiol. 91, 20170576 (2018).
63. Zhang, C. et al. Toward an expert level of lung cancer detection and classification using a deep convolutional neural network. Oncologist 24, 1159–1165 (2019).
64. Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).
65. Hwang, E. J. et al. Deep learning for chest radiograph diagnosis in the emergency department. Radiology 293, 573–580 (2019).
66. Hwang, E. J. et al. Development and validation of a deep learning–based automated detection algorithm for major thoracic diseases on chest radiographs. JAMA Netw. Open 2, e191095–e191095 (2019).
67. Hwang, E. J. et al. Development and validation of a deep learning–based automatic detection algorithm for active pulmonary tuberculosis on chest radiographs. Clin. Infect. Dis. https://doi.org/10.1093/cid/ciy967 (2018).
68. Liang, C. H. et al. Identifying pulmonary nodules or masses on chest radiography using deep learning: external validation and strategies to improve clinical practice. Clin. Radiol. 75, 38–45 (2020).
69. Nam, J. G. et al. Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology 290, 218–228 (2018).
70. Qin, Z. Z. et al. Using artificial intelligence to read chest radiographs for tuberculosis detection: a multi-site evaluation of the diagnostic accuracy of three deep learning systems. Sci. Rep. 9, 15000 (2019).
71. Setio, A. A. A. et al. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35, 1160–1169 (2016).
72. Sim, Y. et al. Deep convolutional neural network–based software improves radiologist detection of malignant lung nodules on chest radiographs. Radiology 294, 199–209 (2020).
73. Taylor, A. G., Mielke, C. & Mongan, J. Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: a retrospective study. PLOS Med. 15, e1002697 (2018).
74. Uthoff, J. et al. Machine learning approach for distinguishing malignant and benign lung nodules utilizing standardized perinodular parenchymal features from CT. Med. Phys. 46, 3207–3216 (2019).
75. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLOS Med. 15, e1002683 (2018).
76. Cha, M. J., Chung, M. J., Lee, J. H. & Lee, K. S. Performance of deep learning model in detecting operable lung cancer with chest radiographs. J. Thorac. Imaging 34, 86–91 (2019).
77. Chae, K. J. et al. Deep learning for the classification of small (≤2 cm) pulmonary nodules on CT imaging: a preliminary study. Acad. Radiol. 27, E55–E63 (2020).
78. Ciompi, F. et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci. Rep. 7, 46479 (2017).
79. Dunnmon, J. A. et al. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology 290, 537–544 (2018).
80. Li, X. et al. Deep learning-enabled system for rapid pneumothorax screening on chest CT. Eur. J. Radiol. 120, 108692 (2019).
81. Li, L., Liu, Z., Huang, H., Lin, M. & Luo, D. Evaluating the performance of a deep learning-based computer-aided diagnosis (DL-CAD) system for detecting and characterizing lung nodules: comparison with the performance of double reading by radiologists. Thorac. Cancer 10, 183–192 (2019).
82. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431 (2019).
83. Park, S. et al. Deep learning-based detection system for multiclass lesions on chest radiographs: comparison with observer readings. Eur. Radiol. 30, 1359–1368 (2019).
84. Patel, B. N. et al. Human–machine partnership with artificial intelligence for chest radiograph diagnosis. npj Digital Med. 2, 111 (2019).
85. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Med. 15, e1002686 (2018).
86. Singh, R. et al. Deep learning in chest radiography: detection of findings and presence of change. PLoS ONE 13, e0204155 (2018).
87. Walsh, S. L. F., Calandriello, L., Silva, M. & Sverzellati, N. Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study. Lancet Respir. Med. 6, 837–845 (2018).
88. Wang, S. et al. 3D convolutional neural network for differentiating preinvasive lesions from invasive adenocarcinomas appearing as ground-glass nodules with diameters ≤3 cm using HRCT. Quant. Imaging Med. Surg. 8, 491–499 (2018).
89. Park, S. et al. Application of deep learning-based computer-aided detection system: detecting pneumothorax on chest radiograph after biopsy. Eur. Radiol. 29, 5341–5348 (2019).
90. Lakhani, P. & Sundaram, B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 284, 574–582 (2017).
91. Becker, A. S. et al. Deep learning in mammography: diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Investig. Radiol. 52, 434–440 (2017).
92. Ciritsis, A. et al. Automatic classification of ultrasound breast lesions using a deep convolutional neural network mimicking human decision-making. Eur. Radiol. 29, 5458–5468 (2019).
93. Cogan, T., Cogan, M. & Tamil, L. RAMS: remote and automatic mammogram screening. Comput. Biol. Med. 107, 18–29 (2019).
94. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
95. Peng, W., Mayorga, R. V. & Hussein, E. M. A. An automated confirmatory system for analysis of mammograms. Comput. Methods Prog. Biomed. 125, 134–144 (2016).
96. Ribli, D., Horváth, A., Unger, Z., Pollner, P. & Csabai, I. Detecting and classifying lesions in mammograms with deep learning. Sci. Rep. 8, 4165 (2018).
97. Rodríguez-Ruiz, A. et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology 290, 305–314 (2018).
98. Rodriguez-Ruiz, A. et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J. Natl Cancer Inst. 111, 916–922 (2019).
99. Byra, M. et al. Breast mass classification in sonography with transfer learning using a deep convolutional neural network and color conversion. Med. Phys. 46, 746–755 (2019).
100. Choi, J. S. et al. Effect of a deep learning framework-based computer-aided diagnosis system on the diagnostic performance of radiologists in differentiating between malignant and benign masses on breast ultrasonography. Korean J. Radiol. 20, 749–758 (2019).
101. Dalmis, M. U. et al. Artificial intelligence–based classification of breast lesions imaged with a multiparametric breast MRI protocol with ultrafast DCE-MRI, T2, and DWI. Investig. Radiol. 54, 325–332 (2019).
102. Fujioka, T. et al. Distinction between benign and malignant breast masses at breast ultrasound using deep learning method with convolutional neural network. Jpn J. Radiol. 37, 466–472 (2019).
103. Kim, S. M. et al. A comparison of logistic regression analysis and an artificial neural network using the BI-RADS lexicon for ultrasonography in conjunction with interobserver variability. J. Digital Imaging 25, 599–606 (2012).
104. Truhn, D. et al. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology 290, 290–297 (2019).
105. Wu, N. et al. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Trans. Med. Imaging 39, 1184–1194 (2020).
106. Yala, A., Schuster, T., Miles, R., Barzilay, R. & Lehman, C. A deep learning model to triage screening mammograms: a simulation study. Radiology 293, 38–46 (2019).
107. Zhou, J. et al. Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images. J. Magn. Reson. Imaging 50, 1144–1151 (2019).
108. Li, Z. et al. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125, 1199–1206 (2018).
109. Lin, H. et al. Diagnostic efficacy and therapeutic decision-making capacity of an artificial intelligence platform for childhood cataracts in eye clinics: a multicentre randomized controlled trial. EClinicalMedicine 9, 52–59 (2019).
110. Annarumma, M. et al. Automated triaging of adult chest radiographs with deep artificial neural networks. Radiology 291, 196–202 (2019).
111. Yala, A., Lehman, C., Schuster, T., Portnoi, T. & Barzilay, R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 292, 60–66 (2019).
112. Sedgwick, P. Meta-analyses: how to read a funnel plot. BMJ 346, f1342 (2013).
113. Herent, P. et al. Detection and characterization of MRI breast lesions using deep learning. Diagn. Interv. Imaging 100, 219–225 (2019).
114. Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
115. Sounderajah, V. et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat. Med. 26, 807–808 (2020).
116. Vollmer, S. et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 368, l6927 (2020).
117. Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
118. Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011).
119. US Food and Drug Administration. Artificial Intelligence and Machine Learning in Software as a Medical Device (US Food and Drug Administration, 2019).
120. Titano, J. J. et al. Automated deep-neural-network surveillance of cranial images for acute neurologic events. Nat. Med. 24, 1337–1341 (2018).
121. Rankin, D. et al. Reliability of supervised machine learning using synthetic data in health care: model to preserve privacy for data sharing. JMIR Med. Inform. 8, e18910 (2020).
122. Cawley, G. C. & Talbot, N. L. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).
123. Blalock, D., Ortiz, J., Frankle, J. & Guttag, J. What is the state of neural network pruning? Preprint at https://arxiv.org/abs/2003.03033 (2020).
124. Beam, A. L., Manrai, A. K. & Ghassemi, M. Challenges to the reproducibility of machine learning models in health care. JAMA 323, 305–306 (2020).
125. Celi, L. A. et al. Bridging the health data divide. J. Med. Internet Res. 18, e325 (2016).
126. Shah, P. et al. Artificial intelligence and machine learning in clinical development: a translational perspective. npj Digital Med. 2, 69 (2019).
127. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 350, g7594 (2015).
128. Schulz, K. F., Altman, D. G. & Moher, D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 340, c332 (2010).
129. Chan, A.-W. et al. SPIRIT 2013 statement: defining standard protocol items for clinical trials. Ann. Intern. Med. 158, 200–207 (2013).
130. Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26, 1351–1363 (2020).
131. Collins, G. S. & Moons, K. G. Reporting of artificial intelligence prediction models. Lancet 393, 1577–1579 (2019).
132. Ngiam, K. Y. & Khor, I. W. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 20, e262–e273 (2019).
133. World Health Organization. Big Data and Artificial Intelligence for Achieving Universal Health Coverage: an International Consultation on Ethics: Meeting Report, 12–13 October 2017 (World Health Organization, 2018).
134. Cath, C., Wachter, S., Mittelstadt, B., Taddeo, M. & Floridi, L. Artificial Intelligence and the ‘Good Society’: the US, EU, and UK approach. Sci. Eng. Ethics 24, 505–528 (2018).
135. Mittelstadt, B. The ethics of biomedical ‘Big Data’ analytics. Philos. Technol. 32, 17–21 (2019).
136. McInnes, M. D. F. et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement. JAMA 319, 388–396 (2018).
137. Reitsma, J. B. et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 58, 982–990 (2005).
138. DerSimonian, R. & Laird, N. Meta-analysis in clinical trials. Controlled Clin. Trials 7, 177–188 (1986).
139. Jones, C. M., Ashrafian, H., Darzi, A. & Athanasiou, T. Guidelines for diagnostic tests and diagnostic accuracy in surgical research. J. Investig. Surg. 23, 57–65 (2010).

ACKNOWLEDGEMENTS
Infrastructure support for this research was provided by the NIHR Imperial Biomedical Research Centre (BRC).

AUTHOR CONTRIBUTIONS
H.A. conceptualised the study. R.A., V.S., G.M. and H.A. designed the study, extracted data, conducted the analysis and wrote the manuscript. D.S.W.T., A.K., D.K. and A.D. assisted in writing and editing the manuscript. All authors approved the final version of the manuscript and take accountability for all aspects of the work.

COMPETING INTERESTS
D.K. and A.K. are employees of Google Health. A.D. is an adviser at Google Health. D.S.W.T. holds a patent on a deep learning system for the detection of retinal diseases.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-021-00438-z.

Correspondence and requests for materials should be addressed to H.A.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2021