Aggarwal Et Al. - 2021

Download as pdf or txt
Download as pdf or txt
You are on page 1of 23

www.nature.

com/npjdigitalmed

REVIEW ARTICLE OPEN

Diagnostic accuracy of deep learning in medical imaging: a


systematic review and meta-analysis
Ravi Aggarwal1, Viknesh Sounderajah1, Guy Martin 1
, Daniel S. W. Ting 2
, Alan Karthikesalingam1, Dominic King1,
Hutan Ashrafian 1 ✉ and Ara Darzi 1

Deep learning (DL) has the potential to transform medical diagnostics. However, the diagnostic accuracy of DL is uncertain. Our aim
was to evaluate the diagnostic accuracy of DL algorithms to identify pathology in medical imaging. Searches were conducted in
Medline and EMBASE up to January 2020. We identified 11,921 studies, of which 503 were included in the systematic review. Eighty-
two studies in ophthalmology, 82 in breast disease and 115 in respiratory disease were included for meta-analysis. Two hundred
twenty-four studies in other specialities were included for qualitative review. Peer-reviewed studies that reported on the diagnostic
accuracy of DL algorithms to identify pathology using medical imaging were included. Primary outcomes were measures of
diagnostic accuracy, study design and reporting standards in the literature. Estimates were pooled using random-effects meta-
analysis. In ophthalmology, AUC’s ranged between 0.933 and 1 for diagnosing diabetic retinopathy, age-related macular
degeneration and glaucoma on retinal fundus photographs and optical coherence tomography. In respiratory imaging, AUC’s
ranged between 0.864 and 0.937 for diagnosing lung nodules or lung cancer on chest X-ray or CT scan. For breast imaging, AUC’s
1234567890():,;

ranged between 0.868 and 0.909 for diagnosing breast cancer on mammogram, ultrasound, MRI and digital breast tomosynthesis.
Heterogeneity was high between studies and extensive variation in methodology, terminology and outcome measures was noted.
This can lead to an overestimation of the diagnostic accuracy of DL algorithms on medical imaging. There is an immediate need for
the development of artificial intelligence-specific EQUATOR guidelines, particularly STARD, in order to provide guidance around key
issues in this field.
npj Digital Medicine (2021)4:65 ; https://doi.org/10.1038/s41746-021-00438-z

INTRODUCTION critical appraisal and independent evaluation of these technolo-


Artificial Intelligence (AI), and its subfield of deep learning (DL)1, gies are still in their infancy12. Even within seminal studies in the
offers the prospect of descriptive, predictive and prescriptive field, there remains wide variation in design, methodology and
analysis, in order to attain insight that would otherwise be reporting that limits the generalisability and applicability of their
untenable through manual analyses2. DL-based algorithms, using findings13. Moreover, it is noted that there has been no
architectures such as convolutional neural networks (CNNs), are overarching medical specialty-specific meta-analysis assessing
distinct from traditional machine learning approaches. They are diagnostic accuracy of DL performance, particularly in ophthal-
distinguished by their ability to learn complex representations in mology, respiratory medicine and breast surgery, which have the
order to improve pattern recognition from raw data, rather than most diagnostic studies to date13.
requiring human engineering and domain expertise to structure Therefore, the aim of this review is to (1) quantify the diagnostic
accuracy of DL in speciality-specific radiological imaging mod-
data and design feature extractors3.
alities to identify or classify disease, and (2) to appraise the
Of all avenues through which DL may be applied to healthcare;
medical imaging, part of the wider remit of diagnostics, is seen as variation in methodology and reporting of DL-based radiological
diagnosis, in order to highlight the most common flaws that are
the largest and most promising field4,5. Currently, radiological
pervasive across the field.
investigations, regardless of modality, require interpretation by a
human radiologist in order to attain a diagnosis in a timely
fashion. With increasing demands upon existing radiologists
(especially in low-to-middle-income countries)6–8, there is a RESULTS
growing need for diagnosis automation. This is an issue that DL Search and study selection
is able to address9. Our search identified 11,921 abstracts, of which 9484 were
Successful integration of DL technology into routine clinical screened after duplicates were removed. Of these, 8721 did not
practice relies upon achieving diagnostic accuracy that is non- fulfil inclusion criteria based on title and abstract. Seven hundred
inferior to healthcare professionals. In addition, it must provide sixty-three full manuscripts were individually assessed and 260
other benefits, such as speed, efficiency, cost, bolstering were excluded at this step. Five hundred three papers fulfilled
accessibility and the maintenance of ethical conduct. inclusion criteria for the systematic review and contained data
Although regulatory approval has already been granted by the required for sensitivity, specificity or AUC. Two hundred seventy-
Food and Drug Administration for select DL-powered diagnostic three studies were included for meta-analysis, 82 in ophthalmol-
software to be used in clinical practice10,11, many note that the ogy, 115 in respiratory medicine and 82 in breast cancer

1
Institute of Global Health Innovation, Imperial College London, London, UK. 2Singapore Eye Research Institute, Singapore National Eye Center, Singapore, Singapore. ✉email: h.
ashrafi[email protected]

Published in partnership with Seoul National University Bundang Hospital


R. Aggarwal et al.
2
(see Fig. 1). These three fields were chosen to meta-analyse as reported on diabetic macular oedema (DME) or early DR on OCT
they had the largest numbers of studies with available data. Two scans. AUC was 0.939 (95% CI 0.920–0.958) for RFP versus 1.00
hundred twenty-four other studies were included for qualitative (95% CI 0.999–1.000) for OCT.
synthesis in other medical specialities. Summary estimates of Age-related macular degeneration: Twelve studies reported
imaging and speciality-specific diagnostic accuracy metrics are diagnostic accuracy data for features of varying severity of AMD
described in Table 1. Units of analysis for each speciality and on RFP (14 cohorts) and 11 studies in OCT (21 cohorts). AUC was
modality are indicated in Tables 2–4. 0.963 (95% CI 0.948–0.979) for RFP versus 0.969 (95% CI
0.955–0.983) for OCT.
Ophthalmology imaging Glaucoma: Seventeen studies with 30 patient cohorts reported
Eighty-two studies with 143 separate patient cohorts reported diagnostic accuracy for features of glaucomatous optic neuro-
pathy, optic discs or suspect glaucoma on RFP and five studies
diagnostic accuracy data for DL in ophthalmology (see Table 2 and
with 6 cohorts on OCT. AUC was 0.933 (95% CI 0.924–0.942) for
Supplementary References 1). Optical coherence tomography
RFP and 0.964 (95% CI 0.941–0.986) for OCT. One study34 with six
(OCT) and retinal fundus photographs (RFP) were the two imaging
cohorts on RFP provided contingency tables. When averaging
modalities performed in this speciality with four main pathologies across the cohorts, the pooled sensitivity was 0.94 (95% CI
being diagnosed—diabetic retinopathy (DR), age-related macular 0.92–0.96) and pooled specificity was 0.95 (95% CI 0.91–0.97). The
degeneration (AMD), glaucoma and retinopathy of AUC of the summary receiver-operating characteristic (SROC)
prematurity (ROP). curve was 0.98 (95% CI 0.96–0.99)—see Supplementary Fig. 1.
Only eight studies14–21 used prospectively collected data and 29 Retinopathy of prematurity: Three studies reported diagnostic
(refs. 14,15,17,18,21–45) studies validated algorithms on external accuracy for identifying plus diseases in ROP from RFP. Sensitivity
datasets. No studies provided a prespecified sample size calcula- was 0.960 (95% CI 0.913—1.008) and specificity was 0.907 (95% CI
tion. Twenty-five studies17,28,29,35,37,39,40,44–61 compared algorithm 0.907–1.066). AUC was only reported in two studies so was not
performance against healthcare professionals. Reference stan- pooled.
dards, definitions of disease and threshold for diagnosis varied Others: Eight other studies reported on diagnostic accuracy in
greatly as did the method of internal validation used. There was ophthalmology either using different imaging modalities (ocular
1234567890():,;

high heterogeneity across all studies (see Table 2). images and visual fields) or for identifying other diagnoses
Diabetic retinopathy: Twenty-five studies with 48 different (pseudopapilloedema, retinal vein occlusion and retinal detach-
patient cohorts reported diagnostic accuracy data for all, referable ment). These studies were not included in the meta-analysis.
or vision-threatening DR on RFP. Twelve studies and 16 cohorts
Respiratory imaging
11921 records iden fied One hundred and fifteen studies with 244 separate patient
4902 from MEDLINE
6952 from Embase
cohorts report on diagnostic accuracy of DL on respiratory disease
67 from other sources (see Table 3 and Supplementary References 2). Lung nodules were
largely identified on CT scans, whereas chest X-rays (CXR) were
used to diagnose a wide spectrum of conditions from simply
2437 duplicates removed
being ‘abnormal’ to more specific diagnoses, such as pneu-
mothorax, pneumonia and tuberculosis.
9484 records screened Only two studies62,63 used prospectively collected data and 13
(refs. 63–75) studies validated algorithms on external data. No
studies provided a prespecified sample size calculation. Twenty-
8721 records excluded one54,63–67,70,72,76–88 studies compared algorithm performance
against healthcare professionals. Reference standards varied
763 full-text ar cles greatly as did the method of internal validation used. There was
assessed for eligibility high heterogeneity across all studies (see Table 3).
Lung nodules: Fifty-six studies with 74 separate patient cohorts
260 excluded
114 conference papers reported diagnostic accuracy for identifying lung nodules on CT
72 segmenta on scans on a per lesion basis, compared with nine studies and 14
42 predic on
12 not imaging
patient cohorts on CXR. AUC was 0.937 (95% CI 0.924–0.949) for CT
11 no outcomes versus 0.884 (95% CI 0.842–0.925) for CXR. Seven studies reported
9 not pathology on diagnostic accuracy for identifying lung nodules on CT scans
on a per scan basis, these were not included in the meta-analysis.
503 ar cles included in Lung cancer or mass: Six studies with nine patient cohorts
qualita ve synthesis 224 studies in other medical speciali es
1 Cardiology reported diagnostic accuracy for identifying mass lesions or lung
15 Dermatology
19 Endocrine/Thyroid
cancer on CT scans compared with eight studies and ten cohorts
2 ENT on CXR. AUC was 0.887 (95% CI 0.847–0.928) for CT versus 0.864
24 Gastroenterology/Hepatology (95% CI 0.827–0.901) for CXR.
2 Haematology
279 studies included in meta-analysis 11 Maxillofacial Surgery
Abnormal Chest X-ray: Twelve studies reported diagnostic
115 in respiratory medicine 1 Metabolic Medicine accuracy for abnormal CXR with 13 different patient cohorts.
82 in ophthalmology 78 Neurology/Neurosurgery AUC was 0.917 (95% CI 0.869–0.966), sensitivity was 0.873 (95% CI
82 in breast cancer 9 Oncology
28 Orthopaedics
0.762–0.985) and specificity was 0.894 (95% CI 0.860–0.929).
5 Rheumatology Pneumothorax: Ten studies reported diagnostic accuracy for
3 GI surgery pneumothorax on CXR with 14 different patient cohorts. AUC was
25 Urology
1 Vascular Surgery 0.910 (95% CI 0.863–0.957), sensitivity was 0.718 (95% CI
0.433–1.004) and specificity was 0.918 (95% CI 0.870–0.965). Five
Fig. 1 PRISMA flow diagram of included studies. PRISMA patient cohorts from two studies73,89 provided contingency tables
(preferred reporting items for systematic reviews and meta-analyses) with raw diagnostic accuracy. When averaging across the cohorts,
flow diagram of included studies. the pooled sensitivity was 0.70 (95% CI 0.45–0.87) and pooled

npj Digital Medicine (2021) 65 Published in partnership with Seoul National University Bundang Hospital
Table 1. Summary estimates of pooled speciality and imaging modality specific diagnostic accuracy metrics.
Imaging modality Diagnosis AUC 95% CI I2 Sensitivity 95% CI I2 Specificity 95% CI I2 PPV 95% CI I2 NPV 95% CI I2 Accuracy 95% CI I2 F1 score 95% CI I2

Ophthalmology imaging
RFP DR 0.939 0.920–0.958 99.9 0.976 0.975–0.977 99.9 0.902 0.889–0.916 99.7 0.389 0.166–0.612 99.7 1 1 90.6 0.927 0.899–0.955 96.3
RFP AMD 0.963 0.948–0.979 99.3 0.973 0.971–0.974 99.9 0.924 0.896–0.952 99.6 0.797 0.719–0.875 99.9
RFP Glaucoma 0.933 0.924–0.942 99.6 0.883 0.862–0.904 99.9 0.918 0.898–0.938 99.7 0.881 0.847–0.915 97.7
RFP ROP 0.96 0.913–1.008 99.5 0.907 0.749–1.066 99.8
OCT DR 1 0.999–1.0 98.1 0.954 0.937–0.972 98.9 0.993 0.991–0.994 98.2 0.97 0.959–0.981 97.5
OCT AMD 0.969 0.955–0.983 99.4 0.997 0.996–0.997 99.7 0.932 0.914–0.950 98.9 0.936 0.906–0.965 99.6
OCT Glaucoma 0.964 0.941–0.986 77.7
Respiratory imaging
CT Lung nodules 0.937 0.924–0.949 97 0.86 0.831–0.890 99.7 0.896 0.871–0.921 99.2 0.785 0.711–0.858 99.2 0.889 0.870–0.908 98.4 0.79 0.747–0.834 97.9
CT Lung cancer 0.887 0.847–0.928 95.9 0.837 0.780–0.894 94.6 0.826 0.735–0.918 98.1 0.827 0.784–0.870 81.7
X-ray Nodules 0.884 0.842–0.925 99.6 0.75 0.634–0.866 99 0.944 0.912–0.976 98.4 0.86 0.736–0.984 99.8 0.894 0.842–0.945 81.4
X-ray Mass 0.864 0.827–0.901 99.7 0.801 0.683–0.919 99.7
X-ray Abnormal 0.917 0.869–0.966 99.9 0.873 0.762–0.985 99.9 0.894 0.860–0.929 98.7 0.85 0.567–1.133 100 0.859 0.736–0.983 99 0.76 0.558–0.962 99.7
X-ray Atelectasis 0.824 0.783–0.866 99.7
X-ray Cardiomegaly 0.905 0.871–0.938 99.7
X-ray Consolidation 0.875 0.800–0.949 99.9 0.914 0.816–1.013 99.5 0.751 0.637–0.866 98.6 0.897 0.828–0.966 96.4
X-ray Pulmonary oedema 0.893 0.843–0.944 99.9
X-ray Effusion 0.906 0.862–0.950 99.8
X-ray Emphysema 0.885 0.855–0.916 99.7
X-ray Fibrosis 0.834 0.796–0.872 99.7
X-ray Hiatus hernia 0.894 0.858–0.930 99.8

Published in partnership with Seoul National University Bundang Hospital


X-ray Infiltration 0.724 0.682–0.767 99.6
X-ray Pleural thickening 0.816 0.762–0.870 99.8
X-ray Pneumonia 0.845 0.782–0.907 99.9 0.951 0.936–0.965 96.3 0.716 0.480–0.953 100 0.681 0.367–0.995 100 0.763 0.559–0.968 100 0.889 0.838–0.941 97.6
X-ray Pneumothorax 0.91 0.863–0.957 99.9 0.718 0.433–1.004 100 0.918 0.870–0.965 99.9 0.496 0.369–0.623 100
X-ray Tuberculosis 0.979 0.978–0.981 99.6 0.998 0.997–0.999 99.6 1 0.999–1.000 95.3 0.94 0.921–0.959 84.6
Breast imaging
MMG Breast cancer 0.873 0.853–0.894 98.8 0.851 0.779–0.923 99.9 0.882 0.859–0.905 97.2 0.905 0.880–0.930 97.9
Ultrasound Breast cancer 0.909 0.881–0.936 91.7 0.853 0.815–0.891 93.9 0.901 0.870–0.931 96.6 0.804 0.727–0.880 93.7 0.922 0.851–0.992 97.2 0.873 0.841–0.906 87.5 0.855 0.803–0.906 87.9
R. Aggarwal et al.

MRI Breast cancer 0.868 0.850–0.886 27.8 0.786 0.710–0.861 80.5 0.788 0.697–0.880 86.2
DBT Breast cancer 0.908 0.880–0.937 63.2 0.831 0.675–0.988 97.6 0.918 0.905–0.930 0

npj Digital Medicine (2021) 65


3
4
Table 2. Characteristics of ophthalmic imaging studies.
Study Model Prospective? Test set Population Test datasets Type of internal External Reference AI vs Imaging modality Pathology
validation validation standard clinician?

Abramoff et al. 2016 AlexNet/VGG No 1748 Photographs Messidor-2 NR No Expert consensus No Retinal fundus Referable DR
photography
Abramoff et al. AlexNet/VGG Yes 819 Patients Prospective cohort from 10 NR Yes Expert consensus No Retinal fundus More than mild DR
201814 primary care practice photography
sites in USA
Ahn et al. 2018 (a) Inception-v3; (b) No (a) 464; (b) 464 Images Kim’s Eye Hospital, Korea Random split No Expert consensus No Retinal fundus Early and advanced
customised CNN photography glaucoma

npj Digital Medicine (2021) 65


Ahn et al. 2019 ResNet50 No 219 Photographs Kim’s Eye Hospital, Korea Random split No Expert consensus No Retinal fundus Pseudopapilloedema
photographs
Al-Aswad et al. Pegasus (ResNet50) No 110 Photographs Singapore Malay Eye Study Random split No Existing diagnosis Yes Retinal fundus Glaucoma
201946 from source data photographs
Alqudah et al. 201922 AOCT-NET No 1250 Scans Farsiu Ophthalmology 2013 Hold- Yes NR No OCT (a) AMD; (b) DME
out method
Arcadu et al. 2019 Inception-v3 No (a) 1237; Images RISE/RIDE trials Random split No Expert consensus No Retinal fundus (a) DME—central subfield
(b) 1798 photography thickness >400 µm; (b) DME
—central fovea thickness
>400 µm
Asaoka et al. 2016 Deep feed-forward No 279 Eyes University of Tokyo Random split No Other imaging No Visual Fields Preperimetric open-angle
neural network with Hospital, Tokyo technique glaucoma
stacked denoising
autoencoder
Asaoka et al. 2019 Customised CNN No 196 Images University of Tokyo Random split No Expert consensus No OCT Early open-angle glaucoma
Hospital, Tokyo
23
Asaoka et al. 2019 ResNet50 No (a) 205; (b) 171 Scans (a) Iinan Hospital; (b) NR Yes Expert consensus No OCT Glaucoma
Hiroshiuma University
15
Bellemo et al. 2019 VGG/ResNet Yes 3093 Eyes Kitwe Central Hospital Eye NA Yes Expert consensus No Retinal fundus (a) Referable DR; (b) vision-
R. Aggarwal et al.

Unit, Zambia photography threatening DR; (c) DME


24
Bhatia et al. 2019 VGG-16 No (a) 4686; (b) Scans (a) Shiley Eye Institute of the NA Yes (a) Expert No OCT (a) Abnormal scan; (b–f)
384; (c) 148; (d) UCDS; (b) Devers Eye Institute; consensus; (b) NR; AMD; (g–h) DME
100; (e) 135; (f) (c) Noor Eye Hospital; (d) (c) NR; (d) NR; (e)
135; (g) 148; Ophthalmica Ophthalmology Expert consensus
(h) 100 Greece; (e) Cardiff University; (f ) + further imaging;
Cardiff University; (g) Noor Eye (f ) expert
Hospital; (h) Ophthalmica consensus +
Ophthalmology Greece further imaging;
(g) NR; (h) NR
Brown et al. 201847 Inception-v1 and U- No 100 Photographs i-ROP Hold- No Expert consensus Yes Retinal fundus Plus disease in ROP
Net out method photography
49
Burlina et al. 2017 DCNN No 5664 Images AREDS 4 dataset NR No Expert consensus Yes Retinal fundus AMD-AREDS 4 step
photography
48
Burlina et al. 2018 ResNet50 No 5000 Images AREDS Random split No Reading No Retinal fundus Referable AMD
centre grader photographs
50
Burlina et al. 2018 AlexNet No 13,480 Photographs NIH AREDS NR No Reading Yes Retinal fundus Referable AMD
centre grader photography
Burlina et al. 2018 ResNet50 No (a) 6654; Images (a) AREDS 9 dataset; (b) AREDS NR No Reading Yes Retinal fundus (a) AMD-AREDS 4 step; (b)
(b) 58,978 4 dataset centre grader photography AMD-AREDS 9 step
25
Chan et al. 2018 AlexNet, VGGNet, No 4096 Images SERI NR Yes Reading No OCT DME
GoogleNet centre grader
Choi et al. 2017 VGG-19 No (a) 3000; Photographs STARE database Random split No Expert consensus No Retinal fundus (a) DR; (b) AMD
(b) 3000 photographs
Christopher et al. (a) VGG-16; (b) Yes 1482 Images ADAGES and DIGS Random split No Expert consensus No Retinal fundus Glaucomatous optic
201816 Inception-v3; (c) photography neuropathy
ResNet50
Das et al. 2019 VGG-16 No 1000 Images UCSD Hold- No Expert consensus No OCT DME
out method
51
De Fauw et al. 2018 (a) U-Net (b) No (a) 997; (b) 116 (a) Scans Moorfields, London Random split No Follow up Yes OCT Urgent referral eye disease
customised CNN (Topcon device);
(b) scans

Published in partnership with Seoul National University Bundang Hospital


Table 2 continued
Study Model Prospective? Test set Population Test datasets Type of internal External Reference AI vs Imaging modality Pathology
validation validation standard clinician?

(Spectralis
device)
ElTanboly et al. 2016 Deep fusion No 12 OCT scans Hold- No NR No OCT Early DR
classification out method
network (DFCN)
Gargeya et al. 201726 CNN No (a) 15,000 (b) Photographs (a) EyePACS-1; (b) Messidor-2; Random split Yes Expert consensus No Retinal fundus DR
1748; (c) 463 (c) E-Opthma photography
Gomez-Valverde VGG-19 No 494 Photographs ESPERANZA Random split No Expert consensus Yes Retinal fundus Glaucoma suspect or
et al. 201952 photographs glaucoma
Grassman et al. Ensemble: No (a) 12,019; Images (a) AREDS dataset; (b) KORA Random split Yes Reading No Retinal fundus AMD-AREDS 9 step
201827 random forest (b) 5555 dataset centre grader photography
17
Gulshan et al. 2019 Inception-v3 Yes 3049 Photographs Prospective NA Yes Expert consensus Yes Retinal fundus Referable DR
photographs
28
Gulshan et al. 2016 Inception-v3 No (a) 8788; Photographs (a) EyePACS-1; (b) Messidor-2 Random split Yes Reading Yes Retinal fundus Referable DR
(b) 1745 centre grader photography
29
Hwang et al. 2019 (a) ResNet50; (b) VGG- No (a–c) 3872; Images (a–c) Department of Random split Yes Expert consensus Yes OCT AMD-AREDS 4 step
16; (c) Inception-v3; (d–f ) 750 Ophthalmology of Taipei
(d) ResNet50; (e) VGG- Veterans General Hospital; (d–f )
16; (f ) Inception-v3 External validation
Jammal et al. 201953 ResNet34 No 490 Images Randomly drawn No Reading Yes Retinal fundus Glaucomatous optic
from test sample centre grader photographs neuropathy
Kanagasingham et al. DCNN Yes 398 Patients Primary Care Practice, Midland, NA Yes Reading No Retinal fundus Referable DR
201821 Western Australia centre grader photography
Karri et al. 2017 GoogLeNet No 21 Scans Duke University Random split No NR No OCT (a) DME; (b) dry AMD
Keel et al. 201818 Inception-v3 Yes 93 Images St Vincent’s Hospital Melbourne NA Yes Reading No Retinal fundus Referable DR
and University Hospital centre grader photography
Geelong, Barwon Health

Published in partnership with Seoul National University Bundang Hospital


Keel et al. 201930 CNN No 86,202 Photographs Melbourne Collaborative Hold- Yes Expert consensus No Retinal fundus Neovascular AMD
Cohort Study out method photographs
Kermany et al. 201854 Inception-v3 No (a) 1000; Scans Shiley Eye Institute of the Random split No Consensus Yes OCT (a) Choroidal
(b–d) 500 University of California San involving experts neovascularisation vs DME
Diego, the California Retinal and non-experts vs drusen vs normal; (b)
Research Foundation, Medical choroidal
Centre Ophthalmology neovascularisation; (c) DME;
Associates, the Shanghai First (d) AMD
People’s Hospital, and Beijing
Tongren Eye Centre
R. Aggarwal et al.

Krause et al. 201831 CNN No 1958 Images EyePACS-2 Hold- Yes Expert consensus No Retinal fundus Referable DR
out method photographs
Lee et al. 2017 VGG-16 No 2151 Scans Random split No Routine No OCT AMD
clinical notes
Lee et al. 2019 CNN No 200 Photographs Seoul National University Hold- No Other imaging No Retinal fundus Glaucoma
Hospital out method technique photographs
108
Li et al. 2018 Inception-v3 No 8000 Scans Guangdong (China) Random split No Expert graders No Retinal fundus Glaucomatous optic
photography neuropathy
55
Li et al. 2019 VGG-16 No 1000 Images Shiley Eye Institute of the Random split No Expert consensus No OCT Choroidal
University of California San neovascularisation vs DME
Diego, the California Retinal vs drusen Vs normal
Research Foundation, Medical
Centre Ophthalmology
Associates, the Shanghai First
People’s Hospital, and Beijing
Tongren Eye Centre
Li et al. 2019 OCT-NET No 859 Scans Wenzhou Medical University Random split No Expert graders No OCT Early DR
Li et al. 201933 Inception-v3 No 800 Images Messidor-2 Random split Yes Reading No Retinal fundus Referable DR
centre grader photographs
Li et al. 2019 ResNet50 No 1635 Images Shanghai Zhongshan Hospital Random split No Reading Yes OCT DME
and the Shanghai First People’s centre grader
Hospital
Lin et al. 2019109 CC-Cruiser Yes— 350 Images Multicentre RCT NA NA Expert consensus Yes Slit-lamp Childhood cataracts

npj Digital Medicine (2021) 65


multicentre RCT photography
Li F et al. 2018 VGG-15 No 300 Images NR Random split No NR No Visual Fields Glaucoma
5
Table 2 continued 6
Study Model Prospective? Test set Population Test datasets Type of internal External Reference AI vs Imaging modality Pathology
validation validation standard clinician?

Li Z et al. 201833 CNN No 35,201 Photographs NIEHS, SiMES, AusDiab Random split Yes Reading No Retinal fundus Referable DR
centre grader photographs
35
Liu et al. 2018 ResNet50 No (a) 754; (b) 30 Photographs (a) NR; (b) HRF Random split Yes Reading Yes Retinal fundus Glaucomatous optic discs
centre grader photographs
34
Liu et al. 2019 CNN No (a) 28,569; (b) Photographs (a) Local Validation (Chinese Random split Yes Consensus No Retinal fundus Glaucomatous optic
20,466; (c) Glaucoma Study Alliance); (b) involving experts photographs neuropathy
12,718; (d) Beijing Tongren Hospital; (c) and non-experts
9305; (e) Peking University Third

npj Digital Medicine (2021) 65


29,676; (f ) 7877 Hospital; (d) Harbin Medical
University First Hospital; (e)
Handan Eye Study; (f ) Hamilton
Glaucoma Centre
Long et al. 201756 DCNN No 57 Images Multihospital clinical trial Hold- No Expert consensus Yes Ocular images Congenital Cataracts
out method
MacCormick et al. DenseNet No (a) 130; (b) 159 Images (a) ORIGA; (b) RIM-ONE Random split Yes (a) NR; (b) expert No Retinal fundus Glaucomatous optic discs
201936 consensus photography
Maetshke et al. 2019 3D CNN No 110 OCT scans Fivefold cross validation Random split No Follow up No OCT Glaucomatous optic
neuropathy
Matsuba et al. 201857 DCNN No 111 Images Tsukazaki Hospital NR No Expert consensus Yes Retinal fundus Exudative AMD
+ further imaging photography
(optos)
Medeiros et al. 2019 ResNet34 No 6292 Images Duke University Random split No Follow up No Retinal fundus Glaucomatous optic
photography neuropathy
Motozawa et al. 2019 CNN No 382 Images Kobe City Medical Centre Random split No Routine No OCT AMD
clinical notes
Muhammad AlexNet No 102 Images NR NR No Expert consensus No OCT Glaucoma suspect or
et al. 2017 glaucoma
R. Aggarwal et al.

Nagasato et al. 2019 VGG-16 No 466 Images NR K-fold cross No NR No Retinal fundus Retinal vein occlusion
validation photography
(optos)
Nagasato et al. DNN No 322 Scans Tsukazaki Hospital and K-fold cross No Expert graders Yes OCT Retinal vein occlusion
201958 Tokushima University Hospital validation
Nagasawa et al. 2019 VGG-16 No 378 Images Tsukazaki Hospital and K-fold cross No Expert graders No Retinal fundus Proliferative diabetic
Tokushima University Hospital validation photography retinopathy
(optos)
Ohsugi et al. 2017 DCNN No 166 Images Tsukazaki Hospital Random split No Expert consensus No Retinal fundus Rhegmatogenous retinal
photography detachment
(optos)
Peng et al. 201959 Inception-v3 No 900 Images AREDS Random split No Reading Yes Retinal fundus Age-related macular
centre grader photography degeneration-AREDS 4 step
Perdomo et al. 2019 OCT-NET No 2816 Images SERI-CUHK data set Random split No Expert graders No OCT DME
Phan et al. 2019 DenseNet201 No 828 Images Yamanashi Koseiren Hospital No Expert consensus No Retinal fundus Glaucoma
+ further imaging photography
Phene et al. 201937 Inception-v3 No (a) 1205; (b) Images (a) EyePACS, Inoveon, the Random split Yes Reading Yes Retinal fundus Glaucomatous optic
9642; (c) 346 United Kingdom Biobank, the centre grader photographs neuropathy
Age-Related Eye Disease Study,
and Sankara Nethralaya; (b)
Atlanta Veterans Affairs (VA)
Eye Clinic; (c) Dr. Shroff’s
Charity Eye Hospital, New
Delhi, India
Prahs et al. 2017 GoogLeNet No 5358 Images Heidelberg Eye Explorer, Random split No Expert graders No OCT Injection vs No injection
Heidelberg Engineering for AMD
Raju et al. 2017 CNN No 53,126 Images EyePACS-1 Random split No NR No Retinal fundus Referable DR
photography
Ramachandran et al. Visiona intelligent No (a) 485; Photographs (a) ODEMS; (b) Messidor NA Yes Expert graders No Retinal fundus Referable DR
201838 diabetic retinopathy (b) 1200 photographs
screening platform
Raumviboonsuk et al. Inception-v4 No (a–c) 25,348; Images National screening program for NA Yes Expert consensus Yes Retinal fundus (a) Moderate non-
201939 (d) 24,332 DR in Thailand photography proliferative DR or worse;
(b) severe non-proliferative
DR or worse; (c) proliferative
DR; (d) referable DME

Published in partnership with Seoul National University Bundang Hospital


Table 2 continued
Study Model Prospective? Test set Population Test datasets Type of internal External Reference AI vs Imaging modality Pathology
validation validation standard clinician?

Redd et al. 2018 Inception-v1 and U- No 4861 Images Multicentre i-ROP study NR No Expert graders + No Retinal fundus Plus disease in ROP
Net further imaging photography
Rogers et al. 201945 Pegasus (ResNet50) No 94 Photographs EODAT NA Yes Reading Yes Retinal fundus Glaucomatous optic
centre grader photographs neuropathy
Sandhu et al. 201819 Deep fusion SNCAE Yes 160 Scans University of Waikato NA No Clinical diagnosis No Retinal fundus Non-proliferative DR
photographs
Sayres et al. 201940 Inception-v4 No 2000 Images EyePACS-2 NA Yes Expert consensus Yes Retinal fundus Referable DR
photographs
Shibata et al. 201860 (a) ResNet; (b) VGG-16 No 110 Images Matsue Red Cross Hospital Random split No Expert consensus Yes Retinal fundus Glaucoma
photography
Stevenson et al. 2019 Inception-v3 No (a) 2333; (b) Photographs Publicly available databases Random split No Existing diagnosis No Retinal fundus (a) Glaucoma; (b) DR;
2283; (c) 2105 from source data photographs (c) AMD
Ting et al. 201741 VGGNet No (a) 71,896; (b) Images (a) Singapore National Diabetic Random split Yes Expert consensus No Retinal fundus Referable DR
15,798; (c) Retinopathy Screening photography
3052; (d) 4512; Program 2014–2015; (b)
(e) 1936; (f) Guangdong (China); (c)
1052; (g) 1968; Singapore Malay Eye Study; (d)
(h) 2302; (i) Singapore Indian Eye Study; (e)
1172; (j) 1254; Singapore Chinese Eye Study;
(k) 7706; (l) (f) Beijing Eye Study; (g) African
35,948; American Eye Disease Study; (h)
(m) 35,948 Royal Victoria Eye and Ear
Hospital; (i) Mexican; (j) Chinese
University of Hong Kong, (k, l)
Singapore National Diabetic
Retinopathy Screening
Program 2014–2015
Ting et al. 201942 VGGNet No 85,902 Images Combined eight datasets NA Yes Consensus No Retinal fundus (a) Any DR; (b) referable DR;

Published in partnership with Seoul National University Bundang Hospital


involving experts photography (c) vision-threatening DR
and non-experts
Treder et al. 2017 Inception-v3 No 100 Scans NR Hold- No NR No OCT Exudative AMD
out method
van Grinsven et al. (a) Ses CNN 60; (b) No 1200 Images Messidor Random split Yes Existing diagnosis Yes Retinal fundus Retinal haemorrhage
201644 NSesCNN170 from source data photographs
43
Verbraak et al. 2019 AlexNet/VGG No 1293 Images Netherlands Star-SHL NA Yes Expert consensus No Retinal fundus (a) DR-vision-threatening;
photography (b) DR- more than mild
Xu et al. 2017 CNN No 200 Photographs Kaggle Random split No Existing diagnosis No Retinal fundus DR
from source data photographs
R. Aggarwal et al.

Yang et al. 2019 VGGNet No 500 Photographs Intelligent Ophthalmology Hold- No Expert consensus No Retinal fundus Referable DR
Database of Zhejiang Society out method photographs
for Mathematical Medicine
in China
Yoo et al. 2019 VGG-19 No 900 Scans Project Macula Random split No NR No (a) OCT; (b) retinal AMD
fundus
photographs
Zhang et al. 201961 VGG-16 No 1742 Images Telemed-R screening Random split No Expert consensus Yes Retinal fundus ROP
photographs
20
Zheng et al. 2019 Inception-v3 Yes 102 Scans Joint Shantou International Eye Hold- No NR No OCT Glaucomatous optic
Centre of Shantou University out method neuropathy
and the Chinese University of
Hong Kong (JSIEC)

npj Digital Medicine (2021) 65


7
8
Table 3. Characteristics of respiratory imaging studies.
Study Model Prospective? Test set Population Test datasets Type of internal External Reference standard AI vs Imaging Body system/disease
validation validation clinician modality

Abiyev et al. 2018 CNN No 380 Images Chest X-ray14 Random split No Routine clinical reports No X-ray Abnormal X-ray
Al-Shabi et al. 2019 Local-Global No 848 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Alakwaa et al. 2017 U-Net No 419 Scans Kaggle Data Science Bowl Random split No Expert reader, existing No CT Lung cancer
labels in dataset
Ali et al. 2018 3D CNN No 668 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Annarumma et al. CNN No 15,887 Images Kings College London Hold-out method No Routine clinical reports No X-ray (a) Critical radiographs;

npj Digital Medicine (2021) 65


2019110 (b) normal radiographs
64
Ardila et al. 2019 Inception-v1 No (a) 6716; Scans (a) National Lung Cancer Random split Yes Histopathology, Yes CT Lung cancer
(b) 1139 Screening Trial; (b) follow up
Northwestern Medicine
Baltruschat et al. 2019 ResNet50 No 22,424 X-rays Chest X-ray14 Random split No Routine clinical reports No X-ray (a) Abnormal chest X-ray;
(b) normal chest X-ray; (c)
atelectasis; (d)
cardiomegaly; (e)
effusion; (f) infiltration; (g)
mass; (h) nodule (i)
pneumonia; (j)
pneumothorax; (k)
consolidation; (l) oedema;
(m) emphysema; (n)
fibrosis; (o) pleural
thickening; (p) hernia
Bar et al. 2018 CNN No 194 Images Diagnostic Imaging Random split No Expert readers No X-ray (a) Abnormal X-ray; (b)
Department of Sheba cardiomegaly
Medical Centre, Tel
Hashomer, Israel
Becker et al. 201862 CNN Yes 21 X-rays Infectious Diseases Institute Random split No Expert consensus No X-ray Tuberculosis
R. Aggarwal et al.

in Kampala, Uganda
Behzadi-Khormouji (a) ChestNet; (b) VGG-16; No 582 X-rays Guangzhou Women and NR No Expert readers No X-ray Consolidation
et al. 2020 (c) DenseNet121 Children’s Medical Centre
Beig et al. 2019 CNN No 145 Scans Erlangen Germany, Random split No Histopathology No CT Lung cancer
Waukesha Wis, Cleveland
Ohio, Tochigi-ken Japan
Causey et al. 2018 CNN No (a) 424; (b) 213 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Cha et al. 201976 ResNet50 No (a) 1483; (b) 500 X-rays Samsung Medical Random split No Other imaging, expert Yes X-ray (a) Lung cancer; (b) T1
Centre, Seoul readers lung cancer
Chae et al. 201977 Ct-LUNGNET No 60 Nodules Chonbuk National Random split No Expert readers, Yes CT Nodules
University Hospital histopathology,
follow up
Chakravarthy Probabilistic neural No 119 Scans LIDC/IDRI NR No NR No CT Lung cancer
et al. 2019 network
Chen et al. 2019 3D CNN No 3674 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Cheng et al. 2016 Stacked denoising No 1400 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
autoencoder
Cicero et al. 2017 GoogLeNet No 2443 Images Department of Medical Random split No Expert readers, routine No X-ray (a) Effusion; (b) oedema;
Imaging, St Michael’s clinical reports (c) consolidation; (d)
Hospital, Toronto cardiomegaly; (e)
pneumothorax
Ciompi et al. 201778 ConvNet No 639 Nodules Danish Lung Cancer Random split No Non-expert readers Yes CT (a) Nodules—solid; (b)
Screening Trial (DLCST) nodules—calcified; (c)
nodules—part-solid; (d)
nodules—non-solid; (e)
nodules—perifissural; (f)
nodules—spiculated
Correa et al. 2018 CNN No 60 Images Lima, Peru NR No Expert readers No Ultrasound Paediatric pneumonia
da Silva et al. 2017 Evolutionary CNN No 200 Nodules LIDC-IDRI Hold-out method No Expert readers No CT Nodules
da Silva et al. 2018 Particle swarm No 2000 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
optimisation algorithm
within CNN
Dai et al. 2018 3D DenseNet-40 No 211 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Dou et al. 2017 3D CNN No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules

Published in partnership with Seoul National University Bundang Hospital


Table 3 continued
Study Model Prospective? Test set Population Test datasets Type of internal External Reference standard AI vs Imaging Body system/disease
validation validation clinician modality

Dunnmon et al. ResNet18 No 533 Images Stanford University Hold-out method No Expert consensus Yes X-ray Abnormal X-ray
201979
Gao et al. 2018 CNN No 20 Scans University Hospitals Random split No NR No CT Interstitial lung disease
of Geneva
Gong et al. 2019 3D SE-ResNet No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Gonzalez et al. 2018 CNN No 1000 Scans ECLIPSE study Random split No NR No CT COPD
Gruetzemacher DNN No 1186 Nodules LUNA16 Ninefold cross No NR No CT Nodules
et al. 2018 validation
Gu et al. 2018 3D CNN No 1186 Nodules LUNA16 Tenfold cross No Expert readers No CT Nodules
validation
Hamidian et al. 2017 3D CNN No 104 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Han et al. 2018 Multi-CNNs No 812 Regions of LIDC-IDRI Random split No NR No CT Ground glass opacity
interest
Heo et al. 2019 VGG-19 No 37,677 X-rays Yonsei University Hospital, Hold-out method No Expert readers No X-ray Tuberculosis
South Korea
Hua et al. 2015 (a) CNN; (b) deep belief No 2545 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
network
Huang et al. 2019 R-CNN No 176 Scans LIDC-IDRI Random split No Expert readers No CT Nodules
Huang et al. 2019 Amalgamated-CNN No 1795 Nodules LIDC/IDRI and Ali Tianchi Random split No Expert readers No CT Nodules
medical
Hussein et al. 2019 VGG No 1144 Nodules LIDC/IDRI Random split No Expert readers No CT Lung cancer
Hwang et al. 201867 DCNN No (a) 450; (b) 183; X-rays (a) Internal validation; (b) Random split Yes Expert readers Yes X-ray Tuberculosis
(c) 140; (d) 173; Seoul National University
(e) 170; (f) 132; Hospital; (c) Boromae
(g) 646 Hospital; (d) Kyunghee

Published in partnership with Seoul National University Bundang Hospital


University Hospital; (e)
Daejeon Eulji Medical
Centre; (f) Montgomery; (g)
Hwang et al. 201965 Lunit INSIGHT No 1135 X-rays Seoul National University NA Yes Expert consensus, other Yes X-ray Abnormal chest X-ray
Hospital imaging
66
Hwang et al. 2019 DCNN No (a) 1089; X-rays (a) Internal validation; (b) Random split Yes Expert reader, other Yes X-ray Neoplasm/TB/
(b) 1015 external validation imaging, histopathology pneumonia/
pneumothorax
Jiang et al. 2018 CNN No 25,723 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Jin et al. 2018 ResNet 3D No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
R. Aggarwal et al.

Jung et al. 2018 3D DCNN No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Kang et al. 2017 3D multi view-CNN No 776 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Kermany et al. 2018 Inception-v3 No 624 X-rays Guangzhou Women and Random split No Expert readers Yes X-ray Pneumonia
Children’s Medical Centre
Kim et al. 2019 MGI-CNN No 1186 Nodules LIDC/IDRI NR No Expert readers No CT Nodules
Lakhani et al. 201790 (a) AlexNet; (b) No 150 X-rays Montgomery County MD, Random split No Routine clinical reports, No X-ray Tuberculosis
GoogLeNet; (c) Ensemble Shenzhen China, Belarus TB expert reader,
(AlexNet + GoogLeNet); public Health Program, histopathology
(d) Radiologist augmented Thomas Jefferson University
Hospital
Li et al. 2016 CNN No 8937 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Li et al. 201981 DL-CAD No 812 Nodules Shenzhen Hospital NR No Expert consensus Yes CT Nodules
Li et al. 201980 CNN No 200 Scans Massachusetts General Random split No Routine clinical reports Yes CT Pneumothorax
Hospital
Liang et al. 202068 CNN No 100 Images Kaohsiung Veterans General NA Yes Other imaging No X-ray Nodules
Hospital, Taiwan
Liang et al. 2019 (a) Custom CNN; (b) VGG- No 624 X-rays Guangzhou Women and Random split No Expert readers No X-ray Pneumonia
16; (c) DenseNet121; (d) Children’s Medical Centre
Inception-v3; (e) Xception
Liu et al. 2017 3D CNN No 326 Nodules National Lung Cancer Fivefold cross No Histopathology, No CT Nodules
Screening Trial and Early validation follow up
Lung Cancer Action

npj Digital Medicine (2021) 65


Program
Liu et al. 2019 CDP-ResNet No 539 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
9
Table 3 continued 10
Study Model Prospective? Test set Population Test datasets Type of internal External Reference standard AI vs Imaging Body system/disease
validation validation clinician modality

Liu H et al. 2019 Segmentation-based deep No 112,120 X-rays Chest X-ray14 NR No Routine clinical reports No X-ray (a) Atelectasis; (b)
fusion network cardiomegaly; (c)
effusion; (d) infiltration;
(e) mass; (f) nodule; (g)
pneumonia; (h)
pneumothorax; (i)
consolidation; (j) oedema;
(k) emphysema; (l)
fibrosis; (m) fibrosis; (n)

npj Digital Medicine (2021) 65


pleural thickening;
(o) hernia
Majkowska et al. CNN No (a–d) 1818; X-rays (a–d) Hospital group in Random split No Expert consensus Yes X-ray (a) Pneumothorax (b)
201982 (e–h) 1962 India (Bangalore, nodule; (c) opacity; (d)
Bhubaneshwar, Chennai, fracture; (e)
Hyderabad, New Delhi); pneumothorax; (f )
(e–h) Chest X-ray14 nodule; (g) opacity; (h)
fracture
Monkam et al. 2018 CNN No 2600 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Nam et al. 201869 CNN No (a) 600; (b) 181; Chest (a) Internal validation; (b) Random split Yes (a) Routine clinical No X-ray Nodules
(c) 182; (d) 181; radiographs Seoul National University reports, histopathology;
(e) 149 Hospital; (c) Boromae (b–e) histopathology,
Hospital; (d) National follow up, other imaging
Cancer Centre, Korea; (e)
University of California an
Francisco Medical Centre
Naqi et al. 2018 Two-level stacked No 777 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
autoencoder + softmax
Nasrullah et al. 2019 Faster R-CNN No 2562 Nodules LIDC/IDRI NR No Expert readers No CT Nodules
R. Aggarwal et al.

Nibali et al. 2017 ResNet No 166 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Nishio et al. 2018 VGG-16 No 123 Nodules Kyoto University Hospital Random split No NR No CT Nodules
Onishi et al. 2019 AlexNet No 60 Nodules NR NR No Histopathology, No CT Nodules
follow up
Onishi et al. 2019 Wasserstein generative No 60 Nodules Fujita Health University NR No Histopathology, No CT Nodules
adversarial network Hospital follow up
Park et al. 201989 YOLO No 503 X-rays Asan Medical Centre and Hold-out method No Expert reader No X-ray Pneumothorax
Seoul National University
Bundang Hospital
Park et al. 201983 CNN No 200 Images Asan Medical Centre and Hold-out method No Expert consensus Yes X-ray (a) Nodules; (b) opacity;
Seoul National University (c) effusion; (d)
Bundang Hospital pneumothorax; (e)
abnormal chest X-ray
Pasa et al. 2019 Custom CNN No 220 X-rays NIH Tuberculosis Chest X- Random split No NR No X-ray Tuberculosis
ray dataset and Belarus
Tuberculosis Portal dataset
Patel et al. 201984 CheXMax No 50 X-rays Stanford University Hold-out method No Expert reader, other Yes X-ray Pneumonia
imaging, clinical notes
Paul et al. 2018 VGG-s CNN No 237 Nodules National Lung Cancer Hold-out method No Expert readers, No CT Nodules
Screening Trial follow up
Pesce et al. 2019 Convolution networks No 7850 X-rays Guy’s and St. Thomas’ NHS Random split No Routine clinical reports No X-ray Lung lesions
with attention feedback Foundation Trust
(CONAF)
Pezeshk et al. 2019 3D CNN No 128 Nodules LUNA16 Random split No Expert readers No CT Nodules
Qin et al. 201970 (a) Lunit; (b) qXR (Qure.ai); No 1196 X-rays Nepal and Cameroon NA Yes Expert readers Yes X-ray Tuberculosis
(c) CAD4TB
Rajpurkar et al. 201885 CNN No 420 X-rays ChestXray-14 Random split No Routine clinical reports Yes X-ray (a) Atelectasis; (b)
cardiomegaly; (c)
consolidation; (d)
oedema; (e) effusion; (f)
emphysema; (g) fibrosis;
(h) hernia; (i) infiltration;
(j) mass; (k) nodule; (l)
pleural thickening; (m)
pneumonia; (n)
pneumothorax

Published in partnership with Seoul National University Bundang Hospital


Table 3 continued
Study Model Prospective? Test set Population Test datasets Type of internal External Reference standard AI vs Imaging Body system/disease
validation validation clinician modality

Ren et al. 2019 Manifold regularized No 98 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
classification deep neural
network
Sahu et al. 2019 Multi-section CNN No 130 Nodules LIDC-IDRI Tenfold cross No Expert readers No CT Nodules
validation
Schwyzer et al. 2018 CNN No 100 Patients NR NR No NR No FDG-PET Lung cancer
Setio et al. 201671 ConvNet No (a) 1186; (b) 50; (a) Nodules; (b) LIDC-IDRI Fivefold cross Yes (a) Expert readers; No CT Nodules
(c) 898 scans; (c) nodules validation (b, c) NR
Shaffie et al. 2018 Deep autoencoder No 727 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Shen et al. 2017 Multiscale CNN No 1375 Nodules LIDC-IDRI NR No Expert readers No CT Nodules
Sim et al. 201972 ResNet50 No 800 Images Freiberg University Hospital NA Yes Other imaging, Yes X-ray Nodules
Freiburg, Massachusetts histopathology
General Hospital Boston,
Samsung Medical Centre
Seoul, Severance
Hospital Seoul
Singh et al. 201886 Qure-AI No 724 Chest Chest X-ray8 Random split No Routine clinical reports Yes X-ray (a) Lesions; (b) effusion;
radiographs (c) hilar prominence; (d)
cardiomegaly
Song et al. 2017 (a) CNN; (b) DNN; (c) No 5024 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
stacked autoencoder
Stephen et al. 2019 CNN No 2134 Images Guangzhou Women and Random split No NR No X-ray Pneumonia
Children’s Medical Centre
Sun et al. 2017 (a) CNN; (b) deep belief No 88,948 Samples LIDC-IDRI Tenfold cross No Expert readers No CT Nodules
network; (c) stacked validation
denoising autoencoder

Published in partnership with Seoul National University Bundang Hospital


Tan et al. 2019 CNN No 280 Nodules LIDC-IDRI Tenfold cross No NR No CT Nodules
validation
73
Taylor et al. 2018 (a) Inception-v3; (b) VGG- No (a, b) 1990; (c, X-rays (a,b) Internal validation (c,d) Random split Yes Expert consensus No X-ray Pneumothorax
19; (c) Inception-v3; (d) d) 112,120 Chest X-ray14
VGG-19
Teramoto et al. 2016 CNN No 104 Scans Fujita Health University NR No Expert reader No PET/CT Nodules
Hospital
Togacar et al. 2019 AlexNet + VGG-16 + VGG- No 1754 X-rays Firat University, Turkey Random split No NR No X-ray Pneumonia
19
Togacar et al. 2020 (a) LeNet; (b) AlexNet; (c) No 100 Images Cancer Imaging Archive NR No Expert readers No CT Lung cancer
R. Aggarwal et al.

VGG-16
Tran et al. 2019 LdcNet No 1186 Nodules LUNA16 Tenfold cross No Expert readers No CT Nodules
validation
Tu et al. 2017 CNN No 20 Nodules LIDC-IDRI Tenfold cross No Expert readers No CT (a) Nodules—non-solid;
validation (b) nodules—part-solid;
(c) nodules—solid
Uthoff et al. 201974 CNN No 100 Nodules INHALE STUDY NA Yes Histopathology, No CT Nodules
follow up
87
Walsh et al. 2018 Inception-ResNet-v2 No 150 Scans La Fondazione Policlinico Random split No Expert readers Yes CT Interstitial lung disease
Universitario A Gemelli
IRCCS, Rome, Italy, and
University of Parma,
Parma, Italy
Wang et al. 2017 AlexNet No 230 X-rays Japanese Society of Tenfold cross No Other imaging No X-ray Nodules
Radiological Technology validation
(JSRT) database
Wang et al. 201888 3D CNN No 200 Scans Fudan University Shanghai Random split No Expert readers, Yes HRCT Lung cancer
Cancer Centre histopathology
Wang et al. 2018 VGG-16 No 744 X-rays JSRT, OpenI, SZCX and MC Random split No Other imaging No X-ray (a) Abnormal chest X-ray;
(b) normal chest X-ray
Wang et al. 2019 ChestNet No 442 X-rays Zhejiang University School Random split No Expert readers No X-ray Pneumothorax
of Medicine (ZJU-2) and
Chest X-ray14

npj Digital Medicine (2021) 65


Wang et al. 2019 (a) AlexNet; (b) No 7580 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
GoogLeNet; (c) ResNet
11
Table 3 continued 12
Study Model Prospective? Test set Population Test datasets Type of internal External Reference standard AI vs Imaging Body system/disease
validation validation clinician modality

Wang et al. 2019 ResNet152 No 25,596 X-rays Chest X-ray14 Random split No Routine clinical reports No X-ray (a) Atelectasis; (b)
cardiomegaly; (c)
effusion; (d) infiltration;
(e) mass; (f) nodule; (g)
pneumonia; (h)
pneumothorax; (i)
consolidation; (j) oedema;
(k) emphysema; (l)
fibrosis; (m) pleural

npj Digital Medicine (2021) 65


thickening; (n) hernia; (o)
abnormal chest X-ray
Xie et al. 2018 LeNet-5 No 1972 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Xie et al. 2019 ResNet50 No 1945 Nodules LIDC-IDRI Tenfold cross No Expert readers No CT Nodules
validation
Yates et al. 2018 Inception-v3 No 5505 X-rays Chest X-ray14 + Indiana Random split No Routine clinical reports No X-ray Abnormal chest X-ray
University
Ye et al. 2019 (a) AlexNet; (b) No (a) 321; (b) 321; (a) Nodules; (b) (a, b) LIDC-IDRI; (c) private Random split No Expert readers No CT (a, b) Nodules; (c) ground
GoogLeNet; (c) Res- (c) 593 nodules; (c) glass opacity
Net150 regions of
interest
Zech et al. 201875 CNN No (a) 30,450; X-rays (a) Mount Sinai and Chest Random split Yes Expert readers No X-ray Pneumonia
(b) 3807 X-ray14; (b) Indiana
University Network for
Patient Care
Zhang et al. 2018 3D DCNN No 1186 Nodules LUNA16 NR No Expert readers No CT Nodules
Zhang et al. 2019 Voxel-level-1D CNN No 67 Nodules Stony Brook University Twofold cross No Histopathology No CT Nodules
Hospital validation
Zhang et al. 2019 3D deep dual path No 1004 Nodules LIDC/IDRI Tenfold cross No Expert readers No CT Nodules
R. Aggarwal et al.

network validation
Zhang C et al. 2019 3D CNN Yes 50 Images Guangdong Lung Cancer Random split Yes Histopathology, Yes CT Nodules
Institute follow up
Zhang et al. 201963 Mask R-CNN No 134 Slices Shenzhen Hospital Random split No Expert readers No CT/PET Lung cancer
Zhang S et al. 2019 Le-Net5 No 762 Nodules LIDC/IDRI Random split No Expert readers No CT Nodules
Zhang T et al. 2017 Deep Belief Network No 1664 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhao X et al. 2018 Agile CNN No 743 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhao X et al. 2019 (a) AlexNet; (b) No 2028 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
GoogLeNet; (c) ResNet; (d)
VifarNet
Zheng et al. 2019 CNN No 1186 Nodules LIDC-IDRI Random split No Expert readers No CT Nodules
Zhou et al. 2019 Inception-v3 and No 600 Images Chest X-ray8 Random split No Routine clinical reports No X-ray Cardiomegaly
ResNet50

Published in partnership with Seoul National University Bundang Hospital


R. Aggarwal et al.
13
specificity was 0.94 (95% CI 0.90–0.97). The AUC of the SROC curve Funnel plots were produced for the diagnostic accuracy
was 0.94 (95% CI 0.92–0.96)—see Supplementary Fig. 2. outcome measure with the largest number of patient cohorts in
Pneumonia: Ten studies reported diagnostic accuracy for each medical speciality, in order to detect bias in the studies
pneumonia on CXR with 15 different patient cohorts. AUC was included112 (see Supplementary Figs. 5–7). These demonstrate
0.845 (95% CI 0.782–0.907), sensitivity was 0.951 (95% CI that there is high risk of bias in studies detecting lung nodules on
0.936–0.965) and specificity was 0.716 (95% CI 0.480–0.953). CT scans and detecting DR on RFP, but not for detecting breast
Tuberculosis: Six studies reported diagnostic accuracy for cancer on MMG.
tuberculosis on CXR with 17 different patient cohorts. AUC was
0.979 (95% CI 0.978–0.981), sensitivity was 0.998 (95% CI Assessment of the validity and applicability of the evidenc
0.997–0.999) and specificity was 1.000 (95% CI 0.999–1.000). Four The overall risk of bias and applicability using Quality Assessment
patient cohorts from one study90 provided contingency tables of Diagnostic Accuracies Studies 2 (QUADAS-2) led to a majority of
with raw diagnostic accuracy. When averaging across the cohorts, studies in all specialities being classified as high risk, particularly
the pooled sensitivity was 0.95 (95% CI 0.91–0.97) and pooled with major deficiencies in regard to patient selection, flow and
specificity was 0.97 (95% CI 0.93–0.99). The AUC of the SROC curve timing and applicability of the reference standard (see Fig. 2). For
was 0.97 (95% CI 0.96–0.99)—see Supplementary Fig. 3. the patient selection domain, a high or unclear risk of bias was
X-ray imaging was also used to identify atelectasis, pleural seen in 59/82 (72%) of ophthalmic studies, 89/115 (77%) of
thickening, fibrosis, emphysema, consolidation, hiatus hernia, pul- respiratory studies and 62/82 (76%) or breast studies. These were
monary oedema, infiltration, effusion, mass and cardiomegaly. CT mostly related to a case–control study design and sampling issues.
imaging was also used to diagnose COPD, ground glass opacity and For the flow and timing domain, a high or unclear risk of bias was
interstitial lung disease, but these were not included in the meta- seen in 66/82 (80%) of ophthalmic studies, 93/115 (81%) of
analysis. respiratory studies and 70/82 (85%) of breast studies. This was
largely due to missing information about patients not receiving
Breast imaging the index test or whether all patients received the same reference
Eighty-two studies with 100 separate patient cohorts report on standard. For the reference standard domain, concerns regarding
diagnostic accuracy of DL on breast disease (see Table 4 and applicability was seen in 60/82 (73%) of ophthalmic studies, 104/
Supplementary References 3). The four imaging modalities of 115 (90%) of respiratory studies and 78/82 (95%) of breast studies.
mammography (MMG), digital breast tomosynthesis (DBT), ultra- This was mostly due to reference standard inconsistencies if the
sound and magnetic resonance imaging (MRI) were used to index test was validated on external datasets.
diagnose breast cancer.
No studies used prospectively collected data and eight91–98 studies DISCUSSION
validated algorithms on external data. No studies provided a
This study sought to (1) quantify the diagnostic accuracy of DL
prespecified sample size calculation. Sixteen studies62,91,92,94,97–107
algorithms to identify specific pathology across distinct radiolo-
compared algorithm performance against healthcare professionals.
gical modalities, and (2) appraise the variation in study reporting
Reference standards varied greatly as did the method of internal
of DL-based radiological diagnosis. The findings of our speciality-
validation used. There was high heterogeneity across all studies (see specific meta-analysis suggest that DL algorithms generally have a
Table 4). high and clinically acceptable diagnostic accuracy in identifying
Breast cancer: Forty-eight studies with 59 separate patient disease. High diagnostic accuracy with analogous DL approaches
cohorts reported diagnostic accuracy for identifying breast cancer was identified in all specialities despite different workflows,
on MMG (AUC 0.873 [95% CI 0.853–0.894]), 22 studies and 25 pathology and imaging modalities, suggesting that DL algorithms
patient cohorts on ultrasound (AUC 0.909 [95% CI 0.881–0.936]), can be deployed across different areas in radiology. However, due
and eight studies on MRI (AUC 0.868 [95% CI 0.850–0.886]) and to high heterogeneity and variance between studies, there is
DBT (AUC 0.908 [95% CI 0.880–0.937]). considerable uncertainty around estimates of diagnostic accuracy
in this meta-analysis.
Other specialities In ophthalmology, the findings suggest features of diseases,
Our literature search also identified 224 studies in other medical such as DR, AMD and glaucoma can be identified with a high
specialities reporting on diagnostic accuracy of DL algorithms to sensitivity, specificity and AUC, using DL on both RFP and OCT
identify disease. These included large numbers of studies in the scans. In general, we found higher sensitivity, specificity, accuracy
fields of neurology/neurosurgery (78), gastroenterology/hepatol- and AUC with DL on OCT scans over RFP for DR, AMD and
ogy (24) and urology (25). Out of the 224 studies, only 55 glaucoma. Only sensitivity was higher for DR on RFP over OCT.
compared algorithm performance against healthcare profes- In respiratory medicine, our findings suggest that DL has high
sionals, although 80% of studies in the field of dermatology did sensitivity, specificity and AUC to identify chest pathology on CT
(see Supplementary References 4, Supplementary Table 1 and scans and CXR. DL on CT had higher sensitivity and AUC for
Supplementary Fig. 4). detecting lung nodules; however, we found a higher specificity,
PPV and F1 score on CXR. For diagnosing cancer or lung mass, DL
on CT had a higher sensitivity than CXR.
Variation of reporting In breast cancer imaging, our findings suggest that DL generally
A key finding of our review was the large degree of variation in has a high diagnostic accuracy to identify breast cancer on
methodology, reference standards, terminology and reporting mammograms, ultrasound and DBT. The performance was found
among studies in all specialities. The most common variables to be very similar for these modalities. In MRI, however, the
amongst DL studies in medical imaging include issues with the diagnostic accuracy was lower; this may be due to small datasets
quality and size of datasets, metrics used to report performance and the use of 2D images. The utilisation of larger databases and
and methods used for validation (see Table 5). Only eight studies in multiparametric MRI may increase the diagnostic accuracy113.
ophthalmology imaging14,21,32,33,43,55,108,109, ten studies in respira- Extensive variation in the methodology, data interpretability,
tory imaging64,66,70,72,75,79,82,87,89,110 and six studies in breast terminology and outcome measures could be explained by a lack
imaging62,91,97,104,106,111 mentioned adherence to the STARD-2015 of consensus in how to conduct and report DL studies. The STARD-
guidelines or had a STARD flow diagram in the manuscript. 2015 checklist114, designed for reporting of diagnostic accuracy

Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2021) 65
14
Table 4. Characteristics of breast imaging studies.
Study Model Prospective? Test Set Population Test datasets Type of internal External Reference standard AI vs Imaging modality Body system/disease
validation validation clinician?

Abdelsamea CNN No 118 Images NR Tenfold cross No NR No Mammogram Breast cancer


et al. 2019 validation
Agnes et al. 2020 Multiscale all CNN No 322 Images mini-MIAS Random split No Existing labels from No Mammogram Breast cancer
dataset
Akselrod-Ballin Faster R-CNN No 170 Images Multicentre hospital data set Random split No Expert reader No Mammogram Breast cancer
et al. 2017
Al-Antari YOLO No 410 Images INbreast Random split No Expert reader, No Mammogram Breast cancer

npj Digital Medicine (2021) 65


et al. 2018 histology,
Al-Antari DBN No 150 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
et al. 2018 expert reader
Al-Masni YOLO No 120 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
et al. 2018 expert reader
Antropova VGG-19 No (a) 690; (b) 245; (a) Lesions; (b) Private Random split No Histology No (a) MRI; (b) Breast cancer
et al. 2017 (c) 1125 images; (c) mammogram; (c)
lesions ultrasound
Antropova VGGNet No 138 Lesions University of Chicago Random split No Histology No MRI Breast cancer
et al. 2018
Antropova VGGNet No 141 Lesions University of Chicago Random split No Histology No MRI Breast cancer
et al. 2018
Arevalo et al. 2016 CNN3 No 736 Images Breast Cancer Digital Stratified No Histology No Mammogram Breast cancer
Repository (BCDR), Portugal Sampling
Bandeira Diniz CNN No (a) 200; (b) 288 Images (a) DDSM Dense Breast; (b) Random split No Follow up, histology, No Mammogram Breast cancer
et al. 2018 DDSM Non Dense Breast expert reader
Becker et al. dANN No 70 Images Breast Cancer Digital Random split Yes Expert reader Yes Mammogram Breast cancer
201791 Repository (BCDR)
R. Aggarwal et al.

Becker et al. DNN No 192 Lesions Private Random split No Follow up, histology Yes Ultrasound Breast cancer
201862
Bevilacqua VGG-S No 39 Images NR NR No NR No Digital breast Breast cancer
et al. 2019 tomosynthesis
Byra et al. 201999 VGG-19 No (a) 150; (b) 163; Images (a) Moores Cancer Center, Random split No (a) Follow up, Yes Ultrasound Breast cancer
(c) 100 University of California; (b) histology; (b) expert
UDIAT (c) OASBUD reader; (c) expert
reader, histology,
follow up
Cai et al. 2019 CNN No 99 Images SYSUCC and Foshan, China Random split No Histology No Mammogram Breast cancer
Cao et al. 2019 SSD300 + ZFNet No 183 Lesions Sichuan Provincial People’s Random split No Expert consensus No Ultrasound Breast cancer
Hospital
Cao et al. 2019 NF-Net No 272 Lesions Sichuan Provincial People’s Random split No Histology No Ultrasound Breast cancer
Hospital
Cheng et al. 2016 Stacked denoising No 520 Lesions Taipei Veterans General NR No Histology No Ultrasound Breast Nodules
autoencoder Hospital
Chiao et al. 2019 Mask R-CNN No 61 Images China Medical University Random split No Histology, routine No Ultrasound Breast cancer
Hospital clinical report
Choi et al. 2019100 CNN No 253 Lesions Samsung Medical NR No Follow up, histology Yes Ultrasound Breast cancer
Centre, Seoul
Chougrad Inception-v3 No (a) 5316; (b) Images (a) DDSM; (b) Inbreast; Random split No (a) Follow up, No Mammogram Breast cancer
et al. 2018 600; (c) 200 (c) BCDR histology, expert
reader; (b) expert
reader, histology; (c)
clinical reports
Ciritsis et al. dCNN No (a) 101; (b) 43 Images (a) Internal validation; (b) Random split Yes Follow up, histology Yes Ultrasound Breast cancer
201992 external validation
Cogan et al. ResNet-101 Faster R- No 124 Images INbreast NA Yes Expert reader, No Mammogram Breast cancer
201993 CNN histology,
Dalmis et al. 2018 U-Net No 66 Images NR Random split No Follow up, histology No MRI Breast cancer
Dalmis et al. DenseNet No 576 Lesions Raboud University NR No Follow up, histology Yes MRI Breast cancer
2019101 Medical Center
Dhungel CNN No 82 Images INbreast Random split No Expert reader, No Mammogram Breast cancer
et al. 2017 histology,

Published in partnership with Seoul National University Bundang Hospital


Table 4 continued
Study Model Prospective? Test Set Population Test datasets Type of internal External Reference standard AI vs Imaging modality Body system/disease
validation validation clinician?

Duggento CNN No 378 Images Curated Breast Imaging Random split No Expert reader No Mammogram Breast cancer
et al. 2019 SubSet of DDSM (CBIS-DDSM)
Fan et al. 2019 Faster R-CNN No 182 Images Fudan University Affiliated Random split No Histology No Digital breast Breast cancer
Cancer Centre tomosynthesis
Fujioka et al. GoogleNet No 120 Lesions Private Random split No Follow up, histology Yes Ultrasound Breast cancer
2019102
Gao et al. 2018 SD-CNN No (a) 49; (b) 89 (a) Lesions; (a) Mayo Clinic Arizona; (b) NR No (a) Histology; (b) No (a) Contrast enhanced Breast cancer
(b) images Inbreast expert reader, digital mammogram;
histology (b) mammogram
Ha et al. 2019 CNN No 60 Images Columbia University Random split No Follow up, histology No Mammogram DCIS
Medical Center
Han et al. 2017 GoogleNet No 829 Lesions Samsung Medical Random split No Histology No Ultrasound Breast cancer
Centre, Seoul
Herent et al. 2019 ResNet50 No 168 Lesions Journees Francophones de Random split No NR No MRI Breast cancer
Radiologie 2018
Hizukuri CNN No 194 Images Mie University Hospital Random split No Follow up, histology No Ultrasound Breast cancer
et al. 2018
Huyng et al. 2016 AlexNet No 607 Images University of Chicago NR No Histology No Mammogram Breast cancer
Jadoon et al. 2016 CNN-DW No 2976 Images IRMA NR No Histology No Mammogram Breast cancer
Jiao et al. 2016 CNN No 300 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
expert reader
Jiao et al. 2018 (a) AlexNet; (b) No (a) 150; (b) 150 Images DDSM Random split No Follow up, histology, No Mammogram Breast cancer
parasitic metric expert reader
learning layers
Jung et al. 2018 RetinaNet No (a) 410; (b) 222 Images (a) Inbreast; (b) GURO Random split No (a) Expert reader; (b) No Mammogram Breast cancer
histology

Published in partnership with Seoul National University Bundang Hospital


103
Kim et al. 2012 ANN No 70 Lesions Kangwon National University Random split No Expert consensus Yes Ultrasound Breast cancer
College of Medicine
Kim et al. 2018 ResNet No 1238 Images Yonsei University Random split No Follow up, histology No Mammogram Breast cancer
Health System
Kim et al. 2018 VGG-16 No 340 Images DDSM Hold-out method No Follow up, histology, No Mammogram Breast cancer
expert reader
Kooi et al. 2017 CNN No 18,182 Images Netherlands screening Random split No Expert reader, No Mammogram Breast cancer
database histology,
Kooi et al. 2017 CNN No 1523 Images Netherlands screening Random split No Expert reader, No Mammogram Breast cancer
R. Aggarwal et al.

database histology,
Kooi T et al. 2017 CNN No 1804 Images Netherlands screening Hold-out method No Expert reader, No Mammogram Breast Cancer
database histology,
Li et al. 2019 DenseNet-II No 2042 Images First Hospital of Shanxi Tenfold cross No Expert reader No Mammogram Breast cancer
Medical University validation
Li et al. 2019 VGG-16 No (a) 1854; Images Nanfang Hospital Fivefold cross No Follow up, histology No (a) Digital breast Breast cancer
(b) 1854 validation tomosynthesis; (b)
mammogram
Lin et al. 2014 FCMNN No 65 Images Far Eastern Memorial Tenfold cross No Histology No Ultrasound Breast cancer
Hospital, Taiwan validation
McKinney et al. MobileNetV2 - No (a) 25,856; Images (a) UK; (b) USA Random split Yes Follow up, histology Yes Mammogram Breast cancer
202094 ResNet-v2-50, (b) 3097
ResNet-v1-50
Mendel et al. 2018 VGG-19 No (a) 78; (b) 78 Images University of Chicago Leave-one- No Follow up, histology No (a) Mammogram; (b) Breast cancer
out method digital breast
tomosynthesis
Peng et al. 201695 ANN No (a) 100; (b) 100 Images (a) MIAS; (b) BancoWeb Hold-out method Yes Expert reader No Mammogram Breast cancer
Qi et al. 2019 Inception-Resnet-v2 No 1359 Images West China Hospital, Sichuan Random split No Expert consensus No Ultrasound Breast cancer
University
Qiu et al. 2017 CNN No 140 Images Private Random split No Histology No Mammogram Breast cancer
Ragab et al. 2019 AlexNet No (a) 676; Images (a) Digital database for Random split No Follow up, histology, No Mammogram Breast cancer
(b) 1581 screening mammography expert reader

npj Digital Medicine (2021) 65


(DDSM); (b) Curated Breast
Imaging SubSet of DDSM
(CBIS-DDSM)
15
Table 4 continued 16
Study Model Prospective? Test Set Population Test datasets Type of internal External Reference standard AI vs Imaging modality Body system/disease
validation validation clinician?

Ribli et al. 201896 VGG-16 No 115 Images INbreast NA Yes Expert reader, No Mammogram Breast cancer
histology
Rodriguez-Ruiz CNN No 240 Images Two datasets combined NA Yes Expert reader, Yes Mammogram Breast cancer
et al. 201897 histology, follow up
Rodriguez-Ruiz CNN No 2642 Images Combined nine datasets NA Yes Follow up, histology Yes Mammogram Breast cancer
et al. 201998
Samala et al. 2016 DCNN No 94 Images University of Michigan Random split No Expert reader No Digital breast Breast cancer

npj Digital Medicine (2021) 65


tomosynthesis
Samala et al. 2017 DCNN No 907 Images DDSM + private Random split No Expert reader No Mammogram Breast cancer
Samala et al. 2018 DCNN No 94 Images University of Michigan Random split No Expert reader No Digital breast Breast cancer
tomosynthesis
Samala et al. 2019 AlexNet No 94 Images University of Michigan Random split No Expert reader No Digital breast Breast cancer
tomosynthesis
Shen et al. 2019 (a) VGG-16; (b) No (a) 376; (b) 376; Images (a) Curated Breast Imaging Random split No (a) Histology; (b) No Mammogram Breast cancer
ResNet; (c) ResNet- (c) 107 SubSet of DDSM (CBIS- histology; (c)
VGG DDSM); (b) Curated Breast expert reader
Imaging SubSet of DDSM
(CBIS-DDSM); (c) Inbreast
Shin et al. 2019 VGG-16 No (a) 600; (b) 40 Images (a) Seoul National University Random split No (a) NR; (b) No Ultrasound Breast cancer
Bundang Hospital; (b) UDIAT expert reader
Diagnostic Centre of the Parc
Taulí Corporation
Stoffel et al. 2018 CNN No 33 Images Private Random split No Surgical confirmation Yes Ultrasound Phyllodes tumour
Sun et al. 2017 CNN No 758 Images University of Texas at El Paso Random split No Expert reader No Mammogram Breast cancer
Tanaka et al. 2019 VGG-19, Resnet152 No 154 Lesions Japan Association of Breast Random split No Histology No Ultrasound Breast cancer
and Thyroid Sonology
R. Aggarwal et al.

Tao et al. 2019 RefineNet + No 253 Lesions Huaxi Hospital and China- Random split No Expert reader No Ultrasound Breast cancer
DenseNet121 Japan Friendship Hospital
Teare et al. 2017 Inception-v3 No 352 Images DDSM + Zebra Random split No Follow up, histology No Mammogram Breast cancer
Mammography Dataset
Truhn et al. CNN No 129 Lesions RWTH Aachen University, Random split No Follow up, histology Yes MRI Breast cancer
2018104
Wang et al. 2016 Inception-v3 No 74 Images Breast Cancer Digital Random split No Expert reader, No Mammogram Breast cancer
Repository (BCDR) histology
Wang et al. 2016 Stacked autoencoder No 204 Images Sun Yat-sen University Cancer Hold-out method No Histology No Mammogram Breast cancer
Center (Guangzhou, China)
and Nanhai Affiliated Hospital
of Southern Medical
University (Foshan, China)
Wang et al. 2017 CNN No 292 Images University of Chicago Random split No Histology No Mammogram Breast cancer
Wang et al. 2018 DNN No 292 Images University of Chicago Random split No Histology No Mammogram Breast cancer
Wu et al. 2019105 ResNet-22 No (a) 401; Images NYU Hold-out method No Histology Yes Mammogram Breast cancer
(b) 1440
Xiao et al. 2019 Inception-v3, No 206 Images Third Affiliated Hospital of Random split No Surgical confirmation, No Ultrasound Breast cancer
ResNet50, Xception Sun Yat-sen University histology
Yala et al. 2019106 ResNet18 No 26,540 Images Massachusetts General Random split No Clinical reports, follow Yes Mammogram Breast cancer
Hospital, Harvard Medical up, histology
School,
Yala et al. 2019111 ResNet18 No 8751 Images Massachusetts General Random split No Clinical reports, follow No Mammogram Breast cancer
Hospital, Harvard Medical up, histology
School,
Yap et al. 2018 FCN-AlexNet No (a) 306; (b) 163 Lesions (a) Private; (b) UDIAT NR No Expert reader No Ultrasound Breast cancer
Yap et al. 2019 FCN-8s No 94 Lesions Two datasets combined NR No Expert reader No Ultrasound Breast cancer
Yousefi et al. 2018 DCNN No 28 Images MGH Random split No Expert consensus No Digital breast Breast cancer
tomosynthesis
Zhou et al. 3D DenseNet No 307 Lesions Private Random split No Follow up, histology Yes MRI Breast cancer
2019107

Published in partnership with Seoul National University Bundang Hospital


R. Aggarwal et al.
17
Table 5. Variation in DL imaging studies.

Data
Image pre-processing, augmentation and preparation Are data augmentation techniques such as cropping, padding and flipping used?
Is there quality control of the images being used to train the algorithm? I.e., were poor quality
images excluded.
Were relevant images manually selected?
Study design Retrospective or prospective data collection.
Image eligibility How are images chosen for inclusion in the study?
Were the data from private or open-access repositories?
Training, validation, test sets Are each of the three sets independent of each other, without overlap?
Does data from the same patient appear in multiple datasets?
Datasets Are the datasets used single or multicentre?
Is a public or open-source dataset used?
Size of datasets Wide variation in size of datasets for training and testing.
Is the size of the datasets justified?
Are sample size statistical considerations applied for the test set?
Use of ‘external’ test sets for final reporting Is an independent test set used for ‘external validation’?
Is the independent test set constructed using an unenriched representative sample?
Multi-vendor images Are images from different scanners and vendors included in the datasets to enhance
generalisability?
Are imaging acquisition parameters described?
Algorithm
Index test Was sufficient detail given on the algorithm to allow replication and independent
validation?
What type of algorithm was used? E.g., CNN, Autoencoder, SVM.
Was the algorithm made publicly or commercially available?
Was the construct or architecture of the algorithm made available?
Additional AI algorithmic information Is the algorithm a static model or is it continuously evolving?
Demonstrate how algorithm makes decisions Is there a specific design for end-user interpretability, e.g., saliency or probability maps
Methods
Transfer learning Was transfer learning used for training and validation?
Cross validation Was k-fold cross validation used during training to reduce the effects of randomness in
dataset splits?
Reference standard Is the reference standard used of high quality and widely accepted in the field?
What was the rationale for choosing the reference standard?
Additional clinical information Was additional clinical information given to healthcare professionals to simulate normal
clinical process?
Performance benchmarking What was performance of algorithm benchmarked to?
What is expertise level and level of consensus of healthcare professionals if used?
Results
Raw diagnostic accuracy data Are raw diagnostic accuracy data reported in a contingency table demonstrating TP, FP,
FN, TN?
Metrics for estimating diagnostic accuracy Which diagnostic accuracy metrics reported? Sensitivity, Sensitivity, PPV, NPV, Accuracy,
performance AUROC
Unit of assessment Which unit of assessment reported, e.g., per patient, per scan or per lesion.
Rows in bold are part of STARD-2015 criteria.

studies is not fully applicable to clinical DL studies115. The variation risk of bias or concerning for applicability. Of particular concern
in reporting makes it very difficult to formally evaluate the was the applicability of reference standards and patient selection.
performance of algorithms. Furthermore, differences in reference Despite our results demonstrating that DL algorithms have a
standards, grader capabilities, disease definitions and thresholds high diagnostic accuracy in medical imaging, it is currently difficult
for diagnosis make direct comparison between studies and to determine if they are clinically acceptable or applicable. This is
algorithms very difficult. This can only be improved with well- partially due to the extensive variation and risk of bias identified in
designed and executed studies that explicitly address questions the literature to date. Furthermore, the definition of what
concerning transparency, reproducibility, ethics and effective- threshold is acceptable for clinical use and tolerance for errors
varies greatly across diseases and clinical scenarios119.
ness116 and specific reporting standards for AI studies115,117.
The QUADAS-2 (ref. 118) assessment tool was used to system-
atically evaluate the risk of bias and any applicability concerns of Limitations in the literature
the diagnostic accuracy studies. Although this tool was not Dataset. There are broad methodological deficiencies among the
designed for DL diagnostic accuracy studies, the evaluation included studies. Most studies were performed using retro-
allowed us to judge that a majority of studies in this field are at spectively collected data, using reference standards and labels

Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2021) 65
R. Aggarwal et al.
18
a) Ophthalmic Imaging

b) Respiratory Imaging

c) Breast Imaging

Fig. 2 QUADAS-2 summary plots. Risk of bias and applicability concerns summary about each QUADAS-2 domain presented as percentages
across the 82 included studies in ophthalmic imaging (a), 115 in respiratory imaging (b) and 82 in breast imaging (c).

that were not intended for the purposes of DL analysis. Minimal varying modality and quality represent important areas of
prospective studies and only two randomised studies109,120, research in DL121.
evaluating the performance of DL algorithms in clinical settings
were identified in the literature. Proper acquisition of test data is Study methodology. Many studies did not undertake external
essential to interpret model performance in a real-world clinical validation of the algorithm in a separate test set and relied upon
setting. Poor quality reference standards may result in the results from the internal validation data; the same dataset used to
decreased model performance due to suboptimal data labelling train the algorithm initially. This may lead to an overestimation of
in the validation set28, which could be a barrier to understanding the diagnostic accuracy of the algorithm. The problem of
the true capabilities of the model on the test set. This is overfitting has been well described in relation to machine
symptomatic of the larger issue that there is a paucity of gold- learning algorithms122. True demonstration of the performance
standard, prospectively collected, representative datasets for the of these algorithms can only be assumed if they are externally
purposes of DL model testing. However, as there are many validated on separate test sets with previously unseen data that
advantages to using retrospectively collected data, the resourceful are representative of the target population.
use of retrospective or synthetic data with the use of labels of Surprisingly, few studies compared the diagnostic accuracy of

npj Digital Medicine (2021) 65 Published in partnership with Seoul National University Bundang Hospital
R. Aggarwal et al.
19
DL algorithms against expert human clinicians for medical Finally, our review concentrated on DL for speciality-specific
imaging. This would provide a more objective standard that medical imaging, and therefore it may not be appropriate to
would enable better comparison of models across studies. generalise our findings to other forms of medical imaging or AI
Furthermore, application of the same test dataset for diagnostic studies.
performance assessment of DL algorithms versus healthcare
professionals was identified in only select studies13. This
Future work
methodological deficiency limits the ability to gauge the clinical
applicability of these algorithms into clinical practice. Similarly, this For the quality of DL research to flourish in the future, we believe
issue can extend to model-versus-model comparisons. Specific that the adoption of the following recommendations are required
methods of model training or model architecture may not be as a starting point.
described well enough to permit emulation for comparison123.
Thus, standards for model development and comparison against Availability of large, open-source, diverse anonymised datasets with
controls will be needed as DL architectures and techniques annotations. This can be achieved through governmental sup-
continue to develop and are applied in medical contexts. port and will enable greater reproducibility of DL models124.

Reporting. There was varying terminology and a lack of Collaboration with academic centres to utilise their expertise in
transparency used in DL studies with regards to the validation pragmatic trial design and methodology125. Rather than classical
or test sets used. The term ‘validation’ was identified as being used trials, novel experimental and quasi-experimental methods to
interchangeably to either describe an external test set for the final evaluate DL have been proposed and should be evaluated126. This
algorithm or for an internal dataset that is used to fine tune the may include ongoing evaluation of algorithms once in clinical
model prior to ‘testing’. Furthermore, the inconsistent terminology practice, as they continue to learn and adapt to the population
led to difficulties understanding whether an independent external that they are implemented in.
test set was used to test diagnostic performance13.
Crucially, we found broad variation in the metrics used as Creation of AI-specific reporting standards. A major reason for the
outcomes for the performance of the DL algorithms in the difficulties encountered in evaluating the performance of DL on
literature. Very few studies reported true positives, false positives, medical imaging are largely due to inconsistent and haphazard
true negatives and false negatives in a contingency table as reporting. Although DL is widely considered as a ‘predictive’
should be the minimum for diagnostic accuracy studies114. model (where TRIPOD may be applied) the majority of AI
Moreover, some studies only reported metrics, such as dice interventions close to translation currently published are pre-
coefficient, F1 score, competition performance metric and Top-1 dominantly in the field of diagnostics (with specifics on index
accuracy that are often used in computer science, but may be tests, reference standards and true/false positive/negatives and
unfamiliar to clinicians13. Metrics such as AUC, sensitivity, summary diagnostic scores, centred directly in the domain of
specificity, PPV and NPV should be reported, as these are more STARD). Existing reporting guidelines for diagnostic accuracy
widely understood by healthcare professionals. However, it is studies (STARD)114, prediction models (TRIPOD)127, randomised
noted that NPV and PPV are dependent on the underlying trials (CONSORT)128 and interventional trial protocols (SPIRIT)129 do
prevalence of disease and as many test sets are artificially not fully cover DL research due to specific considerations in
constructed or balanced, then reporting the NPV or PPV may not methodology, data and interpretation required for these studies.
be valid. The wide range of metrics reported also leads to difficulty As such, we applaud the recent publication of the CONSORT-AI117
in comparing the performance of algorithms on similar datasets. and SPIRIT-AI130 guidelines, and await AI-specific amendments of
the TRIPOD-AI131 and STARD-AI115 statements (which we are
convening). We trust that when these are published, studies being
Study strengths and limitations
conducted will have a framework that enables higher quality and
This systematic review and meta-analysis statistically appraises more consistent reporting.
pooled data collected from 279 studies. It is the largest study to
date examining the diagnostic accuracy of DL on medical imaging. Development of specific tools for determining the risk of study bias
However, our findings must be viewed in consideration of several and applicability. An update to the QUADAS-2 tool taking into
limitations. Firstly, as we believe that many studies have account the nuances of DL diagnostic accuracy research should be
methodological deficiencies or are poorly reported, these studies considered.
may not be a reliable source for evaluating diagnostic accuracy.
Consequently, the estimates of diagnostic performance provided Updated specific ethical and legal framework. Outdated policies
in our meta-analysis are uncertain and may represent an over- need to be updated and key questions answered in terms of
estimation of the true accuracy. Secondly, we did not conduct a liability in cases of medical error, doctor and patient under-
quality assessment for the transparency of reporting in this review. standing, control over algorithms and protection of medical
This was because current guidelines to assess diagnostic accuracy data132. The World Health Organisation133 and others have started
reporting standards (STARD-2015114) were not designed for DL to develop guidelines and principles to regulate the use of AI.
studies and are not fully applicable to the specifics and nuances of These regulations will need to be adapted by each country to fit
DL research115. Thirdly, due to the nature of DL studies, we were their own political and healthcare context134. Furthermore, these
not able to perform classical statistical comparison of measures of guidelines will need to proactively and objectively evaluate
diagnostic accuracy between different imaging modalities. technology to ensure best practices are developed and imple-
Fourthly, we were unable to separate each imaging modality into mented in an evidence-based manner135.
different subsets, to enable comparison across subsets and allow
the heterogeneity and variance to be broken down. This was
because our study aimed to provide an overview of the literature CONCLUSION
in each specific speciality, and it was beyond the scope of this DL is a rapidly developing field that has great potential in all
review to examine each modality individually. The inherent aspects of healthcare, particularly radiology. This systematic
differences in imaging technology, patient populations, patholo- review and meta-analysis appraised the quality of the literature
gies and study designs meant that attempting to derive common and provided pooled diagnostic accuracy for DL techniques in
lessons across the board did not always offer easy comparisons. three medical specialities. While the results demonstrate that DL

Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2021) 65
R. Aggarwal et al.
20
currently has a high diagnostic accuracy, it is important that these review. Data were extracted on (i) first author, (ii) year of
findings are assumed in the presence of poor design, conduct and publication, (iii) type of neural network, (iv) population, (v)
reporting of studies, which can lead to bias and overestimating dataset—split into training, validation and test sets, (vi) imaging
the power of these algorithms. The application of DL can only be modality, (vii) body system/disease, (viii) internal/external valida-
improved with standardised guidance around study design and tion methods, (ix) reference standard, (x) diagnostic accuracy raw
reporting, which could help clarify clinical utility in the future. data—true and false positives and negatives, (xi) percentages of
There is an immediate need for the development of AI-specific AUC, accuracy, sensitivity, specificity, PPV, NPV and other metrics
STARD and TRIPOD statements to provide robust guidance around reported.
key issues in this field before the potential of DL in diagnostic Three investigators (R.A., V.S. and GM) assessed study metho-
healthcare is truly realised in clinical practice. dology using the QUADAS-2 checklist to evaluate the risk of bias
and any applicability concerns of the studies118.

METHODS Data synthesis and analysis


This systematic review was conducted in accordance with the A bivariate model for diagnostic meta-analysis was used to
guidelines for the ‘Preferred Reporting Items for Systematic calculate summary estimates of sensitivity, specificity and AUC
Reviews and Meta-Analyses’ extension for diagnostic accuracy data137. Independent proportion and their differences were
studies statement (PRISMA-DTA)136. calculated and pooled through DerSimonian and Laird random-
effects modelling138. This considered both between-study and
Eligibility criteria within-study variances that contributed to study weighting.
Studies that report upon the diagnostic accuracy of DL algorithms Study-specific estimates and 95% CIs were computed and
to investigate pathology or disease on medical imaging were represented on forest plots. Heterogeneity between studies
sought. The primary outcome was various diagnostic accuracy was assessed using I2 (25–49% was considered to be low
metrics. Secondary outcomes were study design and quality of heterogeneity, 50–74% was moderate and >75% was high
reporting. heterogeneity). Where raw diagnostic accuracy data were
available, the SROC model was used to evaluate the relationship
between sensitivity and specificity139. We utilised Stata version
Data sources and searches
15 (Stata Corp LP, College Station, TX, USA) for all statistical
Electronic bibliographic searches were conducted in Medline and analyses.
EMBASE up to 3rd January 2020. MESH terms and all-field search We chose to appraise the performance of DL algorithms to
terms were searched for ‘neural networks’ (DL or convolutional or identify individual disease or pathology patterns on different
cnn) and ‘imaging’ (magnetic resonance or computed tomogra- imaging modalities in isolation, e.g., identifying lung nodules on a
phy or OCT or ultrasound or X-ray) and ‘diagnostic accuracy thoracic CT scan. We felt that combining imaging modalities and
metrics’ (sensitivity or specificity or AUC). For the full search diagnoses would add heterogeneity and variation to the analysis.
strategy, please see Supplementary Methods 1. The search Meta-analysis was only performed where there were greater than
included all study designs. Further studies were identified through or equal to three patient cohorts, reporting for each specific
manual searches of bibliographies and citations until no further pathology and imaging modality. This study is registered with
relevant studies were identified. Two investigators (R.A. and V.S.) PROSPERO, CRD42020167503.
independently screened titles and abstracts, and selected all
relevant citations for full-text review. Disagreement regarding
Reporting summary
study inclusion was resolved by discussion with a third
investigator (H.A.). Further information on research design is available in the Nature
Research Reporting Summary linked to this article.
Inclusion criteria
Studies that comprised a diagnostic accuracy assessment of a DL DATA AVAILABILITY
algorithm on medical imaging in human populations were The authors declare that all the data included in this study are available within the
eligible. Only studies that stated either diagnostic accuracy raw paper and its Supplementary Information files.
data, or sensitivity, specificity, AUC, NPV, PPV or accuracy data
were included in the meta-analysis. No limitations were placed on Received: 6 October 2020; Accepted: 25 February 2021;
the date range and the last search was performed in January 2020.

Exclusion criteria
Articles were excluded if the article was not written in English. REFERENCES
Abstracts, conference articles, pre-prints, reviews and meta-
1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
analyses were not considered because an aim of this review was 2. Obermeyer, Z. & Emanuel, E. J. Predicting the future — big data, machine
to appraise the methodology, reporting standards and quality of learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).
primary research studies being published in peer-reviewed 3. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29
journals. Studies that investigated the accuracy of image (2019).
segmentation or predicting disease rather than identification or 4. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image
classification were excluded. Anal. 42, 60–88 (2017).
5. Bluemke, D. A. et al. Assessing radiology research on artificial intelligence: a brief
guide for authors, reviewers, and readers—from the radiology editorial board.
Data extraction and quality assessment Radiology 294, 487–489 (2020).
Two investigators (R.A. and V.S.) independently extracted demo- 6. Wahl, B., Cossy-Gantner, A., Germann, S. & Schwalbe, N. R. Artificial intelligence
graphic and diagnostic accuracy data from the studies, using a (AI) and global health: how can AI contribute to health in resource-poor set-
predefined electronic data extraction spreadsheet. The data fields tings? BMJ Glob. Health 3, e000798–e000798 (2018).
7. Zhang, L., Wang, H., Li, Q., Zhao, M.-H. & Zhan, Q.-M. Big data and medical
were chosen subsequent to an initial scoping review and were, in
research in China. BMJ 360, j5910 (2018).
the opinion of the investigators, sufficient to fulfil the aims of this

npj Digital Medicine (2021) 65 Published in partnership with Seoul National University Bundang Hospital
R. Aggarwal et al.
21
8. Nakajima, Y., Yamada, K., Imamura, K. & Kobayashi, K. Radiologist supply and 34. Liu, H. et al. Development and validation of a deep learning system to detect
workload: international comparison. Radiat. Med. 26, 455–465 (2008). glaucomatous optic neuropathy using fundus photographs. JAMA Ophthalmol.
9. Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key 137, 1353–1360 (2019).
challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 35. Liu, S. et al. A deep learning-based algorithm identifies glaucomatous discs
195 (2019). using monoscopic fundus photographs. Ophthalmol. Glaucoma 1, 15–22 (2018).
10. Topol, E. J. High-performance medicine: the convergence of human and artificial 36. MacCormick, I. J. C. et al. Accurate, fast, data efficient and interpretable glau-
intelligence. Nat. Med. 25, 44–56 (2019). coma diagnosis with automated spatial analysis of the whole cup to disc profile.
11. Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based PLoS ONE 14, e0209409 (2019).
FDA-approved medical devices and algorithms: an online database. npj Digital 37. Phene, S. et al. Deep learning and glaucoma specialists: the relative importance
Med. 3, 118 (2020). of optic disc features to predict glaucoma referral in fundus photographs.
12. Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA Ophthalmology 126, 1627–1639 (2019).
319, 1317–1318 (2018). 38. Ramachandran, N., Hong, S. C., Sime, M. J. & Wilson, G. A. Diabetic retinopathy
13. Liu, X. et al. A comparison of deep learning performance against health-care screening using deep neural network. Clin. Exp. Ophthalmol. 46, 412–416 (2018).
professionals in detecting diseases from medical imaging: a systematic review 39. Raumviboonsuk, P. et al. Deep learning versus human graders for classifying
and meta-analysis. Lancet Digital Health 1, e271–e297 (2019). diabetic retinopathy severity in a nationwide screening program. npj Digital
14. Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an Med. 2, 25 (2019).
autonomous AI-based diagnostic system for detection of diabetic retinopathy in 40. Sayres, R. et al. Using a deep learning algorithm and integrated gradients
primary care offices. npj Digital Med. 1, 39 (2018). explanation to assist grading for diabetic retinopathy. Ophthalmology 126,
15. Bellemo, V. et al. Artificial intelligence using deep learning to screen for refer- 552–564 (2019).
able and vision-threatening diabetic retinopathy in Africa: a clinical validation 41. Ting, D. S. W. et al. Development and validation of a deep learning system for
study. Lancet Digital Health 1, e35–e44 (2019). diabetic retinopathy and related eye diseases using retinal images from multi-
16. Christopher, M. et al. Performance of deep learning architectures and transfer ethnic populations with diabetes. JAMA 318, 2211–2223 (2017).
learning for detecting glaucomatous optic neuropathy in fundus photographs. 42. Ting, D. S. W. et al. Deep learning in estimating prevalence and systemic risk
Sci. Rep. 8, 16685 (2018). factors for diabetic retinopathy: a multi-ethnic study. npj Digital Med. 2, 24
17. Gulshan, V. et al. Performance of a deep-learning algorithm vs manual grading (2019).
for detecting diabetic retinopathy in India. JAMA Ophthalmol 137, 987–993 43. Verbraak, F. D. et al. Diagnostic accuracy of a device for the automated detec-
(2019). tion of diabetic retinopathy in a primary care setting. Diabetes Care 42, 651
18. Keel, S., Wu, J., Lee, P. Y., Scheetz, J. & He, M. Visualizing deep learning models (2019).
for the detection of referable diabetic retinopathy and glaucoma. JAMA Oph- 44. Van Grinsven, M. J., van Ginneken, B., Hoyng, C. B., Theelen, T. & Sánchez, C. I.
thalmol. 137, 288–292 (2019). Fast convolutional neural network training using selective data sampling:
19. Sandhu, H. S. et al. Automated diagnosis and grading of diabetic retinopathy application to hemorrhage detection in color fundus images. IEEE Trans. Med.
using optical coherence tomography. Investig. Ophthalmol. Vis. Sci. 59, Imaging 35, 1273–1284 (2016).
3155–3160 (2018). 45. Rogers, T. W. et al. Evaluation of an AI system for the automated detection of
20. Zheng, C. et al. Detecting glaucoma based on spectral domain optical coher- glaucoma from stereoscopic optic disc photographs: the European Optic Disc
ence tomography imaging of peripapillary retinal nerve fiber layer: a compar- Assessment Study. Eye 33, 1791–1797 (2019).
ison study between hand-crafted features and deep learning model. Graefes 46. Al-Aswad, L. A. et al. Evaluation of a deep learning system for identifying
Arch. Clin. Exp. Ophthalmol. 258, 577–585 (2020). glaucomatous optic neuropathy based on color fundus photographs. J. Glau-
21. Kanagasingam, Y. et al. Evaluation of artificial intelligence-based grading of coma 28, 1029–1034 (2019).
diabetic retinopathy in primary care. JAMA Netw. Open 1, e182665–e182665 47. Brown, J. M. et al. Automated diagnosis of plus disease in retinopathy of pre-
(2018). maturity using deep convolutional neural networks. JAMA Ophthalmol. 136,
22. Alqudah, A. M. AOCT-NET: a convolutional network automated classification of 803–810 (2018).
multiclass retinal diseases using spectral-domain optical coherence tomography 48. Burlina, P. et al. Utility of deep learning methods for referability classification
images. Med. Biol. Eng. Comput. 58, 41–53 (2020). of age-related macular degeneration. JAMA Ophthalmol. 136, 1305–1307
23. Asaoka, R. et al. Validation of a deep learning model to screen for glaucoma (2018).
using images from different fundus cameras and data augmentation. Oph- 49. Burlina, P. M. et al. Automated grading of age-related macular degeneration
thalmol. Glaucoma 2, 224–231 (2019). from color fundus images using deep convolutional neural networks. JAMA
24. Bhatia, K. K. et al. Disease classification of macular optical coherence tomo- Ophthalmol. 135, 1170–1176 (2017).
graphy scans using deep learning software: validation on independent, multi- 50. Burlina, P., Pacheco, K. D., Joshi, N., Freund, D. E. & Bressler, N. M. Comparing
center data. Retina 40, 1549–1557 (2020). humans and deep learning performance for grading AMD: a study in using
25. Chan, G. C. Y. et al. Fusing results of several deep learning architectures for universal deep features and transfer learning for automated AMD analysis.
automatic classification of normal and diabetic macular edema in optical Computers Biol. Med. 82, 80–86 (2017).
coherence tomography. In Conference proceedings: Annual International Con- 51. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in
ference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in retinal disease. Nat. Med. 24, 1342–1350 (2018).
Medicine and Biology Society. Annual Conference, Vol. 2018, 670–673 (IEEE, 2018). 52. Gómez-Valverde, J. J. et al. Automatic glaucoma classification using color fundus
26. Gargeya, R. & Leng, T. Automated identification of diabetic retinopathy using images based on convolutional neural networks and transfer learning. Biomed.
deep learning. Ophthalmology 124, 962–969 (2017). Opt. Express 10, 892–913 (2019).
27. Grassmann, F. et al. A deep learning algorithm for prediction of age-related eye 53. Jammal, A. A. et al. Human versus machine: comparing a deep learning algo-
disease study severity scale for age-related macular degeneration from color rithm to human gradings for detecting glaucoma on fundus photographs. Am. J.
fundus photography. Ophthalmology 125, 1410–1420 (2018). Ophthalmol. 211, 123–131 (2019).
28. Gulshan, V. et al. Development and validation of a deep learning algorithm for 54. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by
detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, image-based deep learning. Cell 172, 1122–1131.e1129 (2018).
2402–2410 (2016). 55. Li, F. et al. Deep learning-based automated detection of retinal diseases using
29. Hwang, D. K. et al. Artificial intelligence-based decision-making for age-related optical coherence tomography images. Biomed. Opt. Express 10, 6204–6226
macular degeneration. Theranostics 9, 232–245 (2019). (2019).
30. Keel, S. et al. Development and validation of a deep-learning algorithm for the 56. Long, E. et al. An artificial intelligence platform for the multihospital colla-
detection of neovascular age-related macular degeneration from colour fundus borative management of congenital cataracts. Nat. Biomed. Eng. 1, 0024 (2017).
photographs. Clin. Exp. Ophthalmol. 47, 1009–1018 (2019). 57. Matsuba, S. et al. Accuracy of ultra-wide-field fundus ophthalmoscopy-assisted
31. Krause, J. et al. Grader variability and the importance of reference standards for deep learning, a machine-learning technology, for detecting age-related
evaluating machine learning models for diabetic retinopathy. Ophthalmology macular degeneration. Int. Ophthalmol. 39, 1269–1275 (2019).
125, 1264–1272 (2018). 58. Nagasato, D. et al. Automated detection of a nonperfusion area caused by
32. Li, F. et al. Automatic detection of diabetic retinopathy in retinal fundus pho- retinal vein occlusion in optical coherence tomography angiography images
tographs based on deep learning algorithm. Transl. Vis. Sci. Technol. 8, 4 (2019). using deep learning. PLoS ONE 14, e0223965 (2019).
33. Li, Z. et al. An automated grading system for detection of vision-threatening 59. Peng, Y. et al. DeepSeeNet: a deep learning model for automated classification
referable diabetic retinopathy on the basis of color fundus photographs. Dia- of patient-based age-related macular degeneration severity from color fundus
betes Care 41, 2509–2516 (2018). photographs. Ophthalmology 126, 565–575 (2019).

Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2021) 65
R. Aggarwal et al.
22
60. Shibata, N. et al. Development of a deep residual learning algorithm to screen 87. Walsh, S. L. F., Calandriello, L., Silva, M. & Sverzellati, N. Deep learning for clas-
for glaucoma from fundus photography. Sci. Rep. 8, 14665 (2018). sifying fibrotic lung disease on high-resolution computed tomography: a case-
61. Zhang, Y. et al. Development of an automated screening system for retinopathy cohort study. Lancet Respir. Med. 6, 837–845 (2018).
of prematurity using a deep neural network for wide-angle retinal images. IEEE 88. Wang, S. et al. 3D convolutional neural network for differentiating pre-
Access 7, 10232–10241 (2019). invasive lesions from invasive adenocarcinomas appearing as ground-glass
62. Becker, A. S. et al. Classification of breast cancer in ultrasound imaging using a nodules with diameters ≤3 cm using HRCT. Quant. Imaging Med. Surg.
generic deep learning analysis software: a pilot study. Br. J. Radio. 91, 20170576 8, 491–499 (2018).
(2018). 89. Park, S. et al. Application of deep learning-based computer-aided detection
63. Zhang, C. et al. Toward an expert level of lung cancer detection and classifi- system: detecting pneumothorax on chest radiograph after biopsy. Eur. Radio.
cation using a deep convolutional neural network. Oncologist 24, 1159–1165 29, 5341–5348 (2019).
(2019). 90. Lakhani, P. & Sundaram, B. Deep learning at chest radiography: automated
64. Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep classification of pulmonary tuberculosis by using convolutional neural networks.
learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 Radiology 284, 574–582 (2017).
(2019). 91. Becker, A. S. et al. Deep learning in mammography: diagnostic accuracy of a
65. Hwang, E. J. et al. Deep learning for chest radiograph diagnosis in the emer- multipurpose image analysis software in the detection of breast cancer. Investig.
gency department. Radiology 293, 573–580 (2019). Radio. 52, 434–440 (2017).
66. Hwang, E. J. et al. Development and validation of a deep learning–based 92. Ciritsis, A. et al. Automatic classification of ultrasound breast lesions using a
automated detection algorithm for major thoracic diseases on chest radio- deep convolutional neural network mimicking human decision-making. Eur.
graphs. JAMA Netw. Open 2, e191095–e191095 (2019). Radio. 29, 5458–5468 (2019).
67. Hwang, E. J. et al. Development and validation of a deep learning–based 93. Cogan, T., Cogan, M. & Tamil, L. RAMS: remote and automatic mammogram
automatic detection algorithm for active pulmonary tuberculosis on chest screening. Comput. Biol. Med. 107, 18–29 (2019).
radiographs. Clin. Infect. Dis. https://doi.org/10.1093/cid/ciy967 (2018). 94. McKinney, S. M. et al. International evaluation of an AI system for breast cancer
68. Liang, C. H. et al. Identifying pulmonary nodules or masses on chest radiography screening. Nature 577, 89–94 (2020).
using deep learning: external validation and strategies to improve clinical 95. Peng, W., Mayorga, R. V. & Hussein, E. M. A. An automated confirmatory system
practice. Clin. Radiol. 75, 38–45 (2020). for analysis of mammograms. Comput. Methods Prog. Biomed. 125, 134–144
69. Nam, J. G. et al. Development and validation of deep learning–based automatic (2016).
detection algorithm for malignant pulmonary nodules on chest radiographs. 96. Ribli, D., Horváth, A., Unger, Z., Pollner, P. & Csabai, I. Detecting and classifying
Radiology 290, 218–228 (2018). lesions in mammograms with deep learning. Sci. Rep. 8, 4165 (2018).
70. Qin, Z. Z. et al. Using artificial intelligence to read chest radiographs for 97. Rodríguez-Ruiz, A. et al. Detection of Breast cancer with mammography:
tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of effect of an artificial intelligence support system. Radiology 290, 305–314
three deep learning systems. Sci. Rep. 9, 15000 (2019). (2018).
71. Setio, A. A. A. et al. Pulmonary nodule detection in CT images: false positive 98. Rodriguez-Ruiz, A. et al. Stand-alone artificial intelligence for breast cancer
reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35, detection in mammography: comparison with 101 radiologists. J. Natl Cancer
1160–1169 (2016). Inst. 111, 916–922 (2019).
72. Sim, Y. et al. Deep convolutional neural network–based software improves 99. Byra, M. et al. Breast mass classification in sonography with transfer learning
radiologist detection of malignant lung nodules on chest radiographs. Radiology using a deep convolutional neural network and color conversion. Med. Phys. 46,
294, 199–209 (2020). 746–755 (2019).
73. Taylor, A. G., Mielke, C. & Mongan, J. Automated detection of moderate and 100. Choi, J. S. et al. Effect of a deep learning framework-based computer-aided
large pneumothorax on frontal chest X-rays using deep convolutional neural diagnosis system on the diagnostic performance of radiologists in differ-
networks: a retrospective study. PLOS Med. 15, e1002697 (2018). entiating between malignant and benign masses on breast ultrasonography.
74. Uthoff, J. et al. Machine learning approach for distinguishing malignant and Korean J. Radio. 20, 749–758 (2019).
benign lung nodules utilizing standardized perinodular parenchymal features 101. Dalmis, M. U. et al. Artificial intelligence–based classification of breast lesions
from CT. Med. Phys. 46, 3207–3216 (2019). imaged with a multiparametric breast mri protocol with ultrafast DCE-MRI, T2,
75. Zech, J. R. et al. Variable generalization performance of a deep learning model and DWI. Investig. Radiol. 54, 325–332 (2019).
to detect pneumonia in chest radiographs: a cross-sectional study. PLOS Med. 102. Fujioka, T. et al. Distinction between benign and malignant breast masses at
15, e1002683 (2018). breast ultrasound using deep learning method with convolutional neural net-
76. Cha, M. J., Chung, M. J., Lee, J. H. & Lee, K. S. Performance of deep learning work. Jpn J. Radio. 37, 466–472 (2019).
model in detecting operable lung cancer with chest radiographs. J. Thorac. 103. Kim, S. M. et al. A comparison of logistic regression analysis and an artificial
Imaging 34, 86–91 (2019). neural network using the BI-RADS Lexicon for ultrasonography in conjunction
77. Chae, K. J. et al. Deep learning for the classification of small (≤2 cm) pul- with introbserver variability. J. Digital Imaging 25, 599–606 (2012).
monary nodules on ct imaging: a preliminary study. Acad. Radiol. 27, E55–E63 104. Truhn, D. et al. Radiomic versus convolutional neural networks analysis for
(2020). classification of contrast-enhancing lesions at multiparametric breast MRI.
78. Ciompi, F. et al. Towards automatic pulmonary nodule management in lung Radiology 290, 290–297 (2019).
cancer screening with deep learning. Sci. Rep. 7, 46479 (2017). 105. Wu, N. et al. Deep neural networks improve radiologists’ performance in
79. Dunnmon, J. A. et al. Assessment of convolutional neural networks for auto- breast cancer screening. IEEE Trans. Med. Imaging 39, 1184–1194 (2020).
mated classification of chest radiographs. Radiology 290, 537–544 (2018). 106. Yala, A., Schuster, T., Miles, R., Barzilay, R. & Lehman, C. A deep learning model to
80. Li, X. et al. Deep learning-enabled system for rapid pneumothorax screening on triage screening mammograms: a simulation study. Radiology 293, 38–46
chest CT. Eur. J. Radiol. 120, 108692 (2019). (2019).
81. Li, L., Liu, Z., Huang, H., Lin, M. & Luo, D. Evaluating the performance of a deep 107. Zhou, J. et al. Weakly supervised 3D deep learning for breast cancer classifica-
learning-based computer-aided diagnosis (DL-CAD) system for detecting and tion and localization of the lesions in MR images. J. Magn. Reson. Imaging 50,
characterizing lung nodules: comparison with the performance of double 1144–1151 (2019).
reading by radiologists. Thorac. Cancer 10, 183–192 (2019). 108. Li, Z. et al. Efficacy of a deep learning system for detecting glaucomatous optic
82. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: neuropathy based on color fundus photographs. Ophthalmology 125,
assessment with radiologist-adjudicated reference standards and population- 1199–1206 (2018).
adjusted evaluation. Radiology 294, 421–431 (2019). 109. Lin, H. et al. Diagnostic efficacy and therapeutic decision-making capacity of an
83. Park, S. et al. Deep learning-based detection system for multiclass lesions on artificial intelligence platform for childhood cataracts in eye clinics: a multi-
chest radiographs: comparison with observer readings. Eur. Radiol. 30, centre randomized controlled trial. EClinicalMedicine 9, 52–59 (2019).
1359–1368 (2019). 110. Annarumma, M. et al. Automated triaging of adult chest radiographs with deep
84. Patel, B. N. et al. Human–machine partnership with artificial intelligence for artificial neural networks. Radiology 291, 196–202 (2019).
chest radiograph diagnosis. npj Digital Med. 2, 111 (2019). 111. Yala, A., Lehman, C., Schuster, T., Portnoi, T. & Barzilay, R. A deep learning
85. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective mammography-based model for improved breast cancer risk prediction. Radi-
comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Med. 15, ology 292, 60–66 (2019).
e1002686 (2018). 112. Sedgwick, P. Meta-analyses: how to read a funnel plot. BMJ 346, f1342 (2013).
86. Singh, R. et al. Deep learning in chest radiography: detection of findings and 113. Herent, P. et al. Detection and characterization of MRI breast lesions using deep
presence of change. PLoS ONE 13, e0204155 (2018). learning. Diagn. Inter. Imaging 100, 219–225 (2019).

npj Digital Medicine (2021) 65 Published in partnership with Seoul National University Bundang Hospital
R. Aggarwal et al.
23
114. Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting 137. Reitsma, J. B. et al. Bivariate analysis of sensitivity and specificity produces
diagnostic accuracy studies. BMJ 351, h5527 (2015). informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 58,
115. Sounderajah, V. et al. Developing specific reporting guidelines for diagnostic 982–990 (2005).
accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat. 138. DerSimonian, R. & Laird, N. Meta-analysis in clinical trials. Controlled Clin. Trials 7,
Med. 26, 807–808 (2020). 177–188 (1986).
116. Vollmer, S. et al. Machine learning and artificial intelligence research for patient 139. Jones, C. M., Ashrafian, H., Darzi, A. & Athanasiou, T. Guidelines for diagnostic
benefit: 20 critical questions on transparency, replicability, ethics, and effec- tests and diagnostic accuracy in surgical research. J. Investig. Surg. 23, 57–65
tiveness. BMJ 368, l6927 (2020). (2010).
117. Liu, X. et al. Reporting guidelines for clinical trial reports for interventions
involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26,
1364–1374 (2020). ACKNOWLEDGEMENTS
118. Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of Infrastructure support for this research was provided by the NIHR Imperial Biomedical
diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011). Research Centre (BRC).
119. Food, U. & Administration, D. Artificial Intelligence and Machine Learning in
Software as a Medical Device (US Food and Drug Administratio, 2019).
120. Titano, J. J. et al. Automated deep-neural-network surveillance of cranial images
AUTHOR CONTRIBUTIONS
for acute neurologic events. Nat. Med. 24, 1337–1341 (2018).
121. Rankin, D. et al. Reliability of supervised machine learning using synthetic data H.A. conceptualised the study, R.A., V.S., G.M. and H.A. designed the study, extracted
in health care: Model to preserve privacy for data sharing. JMIR Med. Inform. 8, data, conducted the analysis and wrote the manuscript. D.S.W.T., A.K., D.K. and A.D.
e18910 (2020). assisted in writing and editing the manuscript. All authors approved the final version
122. Cawley, G. C. & Talbot, N. L. On over-fitting in model selection and subsequent of the manuscript and take accountability for all aspects of the work.
selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107
(2010).
123. Blalock, D., Ortiz, J., Frankle, J. & Guttag, J. What is the state of neural network COMPETING INTERESTS
pruning? Preprint at https://arxiv.org/abs/2003.03033 (2020). D.K. and A.K. are employees of Google Health. A.D. is an adviser at Google Health. D.S.
124. Beam, A. L., Manrai, A. K. & Ghassemi, M. Challenges to the reproducibility of W.T holds a patent on a deep learning system for the detection of retinal diseases.
machine learning models in health care. JAMA 323, 305–306 (2020).
125. Celi, L. A. et al. Bridging the health data divide. J. Med. Internet Res. 18, e325
(2016). ADDITIONAL INFORMATION
126. Shah, P. et al. Artificial intelligence and machine learning in clinical develop- Supplementary information The online version contains supplementary material
ment: a translational perspective. npj Digital Med. 2, 69 (2019). available at https://doi.org/10.1038/s41746-021-00438-z.
127. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of
a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Correspondence and requests for materials should be addressed to H.A.
the TRIPOD statement. BMJ 350, g7594 (2015).
128. Schulz, K. F., Altman, D. G. & Moher, D. CONSORT 2010 Statement: updated Reprints and permission information is available at http://www.nature.com/
guidelines for reporting parallel group randomised trials. BMJ 340, c332 reprints
(2010).
129. Chan, A.-W. et al. SPIRIT 2013 statement: defining standard protocol items for Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims
clinical trials. Ann. Intern. Med. 158, 200–207 (2013). in published maps and institutional affiliations.
130. Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions
involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26, 1351–1363
(2020).
131. Collins, G. S. & Moons, K. G. Reporting of artificial intelligence prediction models.
Lancet 393, 1577–1579 (2019). Open Access This article is licensed under a Creative Commons
132. Ngiam, K. Y. & Khor, I. W. Big data and machine learning algorithms for health- Attribution 4.0 International License, which permits use, sharing,
care delivery. Lancet Oncol. 20, e262–e273 (2019). adaptation, distribution and reproduction in any medium or format, as long as you give
133. World Health Organization. Big Data and Artificial Intelligence for Achieving appropriate credit to the original author(s) and the source, provide a link to the Creative
Universal Health Coverage: an International Consultation on Ethics: Meeting Commons license, and indicate if changes were made. The images or other third party
Report, 12–13 October 2017 (World Health Organization, 2018). material in this article are included in the article’s Creative Commons license, unless
134. Cath, C., Wachter, S., Mittelstadt, B., Taddeo, M. & Floridi, L. Artificial Intelligence indicated otherwise in a credit line to the material. If material is not included in the
and the ‘Good Society’: the US, EU, and UK approach. Sci. Eng. Ethics 24, article’s Creative Commons license and your intended use is not permitted by statutory
505–528 (2018). regulation or exceeds the permitted use, you will need to obtain permission directly
135. Mittelstadt, B. The ethics of biomedical ‘Big Data’ analytics. Philos. Technol. 32, from the copyright holder. To view a copy of this license, visit http://creativecommons.
17–21 (2019). org/licenses/by/4.0/.
136. McInnes, M. D. F. et al. Preferred reporting items for a systematic review and
meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement.
JAMA 319, 388–396 (2018). © The Author(s) 2021

Published in partnership with Seoul National University Bundang Hospital npj Digital Medicine (2021) 65

You might also like