Toward Explainability of Machine Learning in Medical Imaging: Generalizability, Separability, and Learnability


Toward Explainability of Machine Learning in Medical Imaging:

Generalizability, Separability, and Learnability

by Shuyue Guan

B.S. in Physics, June 2010, Northeast Forestry University, China


M.S. in Biophysics, June 2013, Northeast Forestry University, China
M.S. in Computer Science, December 2016, The George Washington
University, USA

A Dissertation submitted to

The Faculty of
The School of Engineering and Applied Science
of The George Washington University
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy

May 15, 2022

Dissertation directed by

Murray H. Loew
Professor of Biomedical Engineering
The School of Engineering and Applied Science of The George Washington
University certifies that Shuyue Guan has passed the Final Examination
for the degree of Doctor of Philosophy as of April 07, 2022. This is the final
and approved form of the dissertation.

Toward Explainability of Machine Learning in Medical Imaging:


Generalizability, Separability, and Learnability

Shuyue Guan

Dissertation Research Committee:

Murray H. Loew, Professor of Biomedical Engineering,


Dissertation Director

Jason M. Zara, Professor of Biomedical Engineering,


Committee Member

Matthew W. Kay, Professor of Biomedical Engineering,


Committee Member

Miloš Doroslovački, Associate Professor of Electrical and


Computer Engineering, Committee Member

Robert Pless, Professor of Computer Science,


Committee Member

Ronald M. Summers, Senior Investigator, National Institutes of


Health Clinical Center, Committee Member

© Copyright 2022 by Shuyue Guan
All rights reserved

Dedication

To my parents: Guan, Ning and Fu, Dihua

“You the wise, tell me, why should our days leave us, never to return?”
– Zhu, Ziqing

Acknowledgments

First and foremost, I want to express my sincerest thanks to Prof. Murray
Loew, my Ph.D. advisor, for his excellent advice, support, and help in my
studies and life during the past years. My research achievements could not
have been made without his knowledgeable guidance and continuous support.
I met Prof. Murray Loew in his Computer Vision class; his sense of humor,
wide scope of knowledge, and endless enthusiasm for teaching and research
make him a perfect academic advisor. I feel extremely fortunate to have
become his Ph.D. student and will never forget the exciting day I learned I
would be. I deeply appreciate the experience of studying with him over these
years, and the skills, expertise, insight, wisdom, patience, and work attitude
I learned from him will permanently shape my career and life.
I am very grateful to Prof. Jason M. Zara, Prof. Matthew W. Kay, Prof.
Miloš Doroslovački, Prof. Robert Pless, and Dr. Ronald M. Summers for their
service on my dissertation committee. I also thank them for their advice on
my final dissertation and dissertation proposal. I sincerely thank Prof. Vesna
Zderic for teaching the excellent BME course I took, writing recommendation
letters for me, and helping me greatly with graduate affairs.
I had great experiences collaborating with various talented people. I thank
Ange Lou and Nada Kamona for productive and inspiring collaborations on
their projects. The project of Prof. Narine Sarvazyan, Dr. Huda Asfour, and
Dr. Narine Muselimyan offered me financial support for one year; it was a
wonderful experience collaborating with them, and I am extraordinarily
grateful for it.
I appreciate Dr. Aldo Badano and Dr. Weijie Chen for being my mentors

and inviting me to join the FDA. I feel exceptionally fortunate to have the
opportunity to work on amazing projects with their talented teams. I also
greatly thank my previous academic advisors. Prof. Claire Monteleoni
was my first advisor at GWU and supervised me on a summer research
project; Prof. Dawei Qi was my research supervisor at the Northeast Forestry
University (NEFU). They shared their rich expertise in image processing and
machine learning with me and have directly influenced my research and
career choices.
I sincerely thank Prof. Claire Monteleoni, Prof. Murray Loew, Prof. Jason
M. Zara, Dr. Amrinder Arora, Dr. Amy Lingley-Papadopoulos, and Dr.
HyungSok Choe for providing me with teaching assistant positions in their
courses. These positions offered me financial support, and I gained knowledge
and skills through working for their courses.
I am also very grateful to all past and present members of the Medical
Imaging and Image Analysis Laboratory (Loew’s Lab) at George Washington
University. I learned a lot from them during our lab meetings, and they
provided many valuable questions and comments on my research.
I would like to send my heartfelt gratitude to all my friends, colleagues, and
classmates; they make my life much more fun and beautiful.
I sincerely acknowledge everyone who has helped me during my doctoral
study in the past years, including faculty, staff, editors, and reviewers in
various academic activities, workshops, conferences, and journals.
Finally, I want to thank my beloved girlfriend from the bottom of my heart
for her companionship, her love, and our affection. My immediate family
gives me infinite support and help; it is a constant source of love,
entertainment, encouragement, assistance, and understanding. In addition,
this dissertation is dedicated to my dear parents, who are the “first cause”
of my everything!

Abstract of Dissertation

Toward Explainability of Machine Learning in Medical Imaging:


Generalizability, Separability, and Learnability

The applications of Deep Learning (DL) for medical imaging have become
increasingly popular in recent years. During my studies of applications of
Machine Learning (ML) and DL methods in medical imaging, I realized that
there is a trade-off between accuracy and explainability for these methods.
Although some DL methods perform better, they are more difficult to
understand and to explain. The lack of explainability limits the acceptance
of DL applications by clinicians. The requirement of explainability and the
DL applications for medical imaging that I have investigated have thus
stimulated my research interest in eXplainable Artificial Intelligence (XAI).
Explainability has multiple facets, and there is to date no unified definition.
For explainable ML, I have primarily addressed these aspects: the
separability of data, cluster validation, Generative Adversarial Network (GAN)
evaluation, generalizability of the Deep Neural Network (DNN), learnability
of DL models, and transparent DL. The study of explainable ML has been
motivated by several completed applications for medical-object detection and
segmentation. Medical image analysis and XAI raise many rich research
questions. My research aims to contribute to medical image analysis by
focusing on the performance (accuracy) and explainability of applications
using ML and DL. The long-term goals of these works are to help make
DL-based Computer-Aided Diagnosis (CAD) systems transparent,
understandable, and explainable and to win the trust of end users so that,
eventually, these new techniques can be widely accepted by clinicians to
improve medical diagnosis and treatment outcomes.

Table of Contents

Dedication iv

Acknowledgments v

Abstract of Dissertation viii

List of Figures xviii

List of Tables xx

List of Abbreviations xxi

List of Symbols xxiii

Chapter 1: Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Summary . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Hyper-spectral Image-based Cardiac Ablation Lesion
Detection . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Applications of Transfer Learning and the Generative
Adversarial Network (GAN) in Breast Cancer Detection 8
1.2.3 Transparent Deep Learning/Machine Learning . . . 9
1.2.4 Deep Learning-based Medical Image Segmentation . 10
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Organization . . . . . . . . . . . . . . . . . . . . 13

Chapter 2: Distance-based Intrinsic Measure of Data Separability 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Methodological Development for Distance-based Separability
Index (DSI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Intra-class and between-class distance sets . . . . . 23
2.3.2 Definition and computation of the DSI . . . . . . . . 24
2.3.3 Theorem: DSI and similarity of data distributions . 25
2.3.4 Proof of the Theorem . . . . . . . . . . . . . . . . . . . 27
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 Two-class Synthetic Data . . . . . . . . . . . . . . . . 33
2.4.2 CIFAR-10/100 Datasets . . . . . . . . . . . . . . . . . 39
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.1 Comparison of Distributions . . . . . . . . . . . . . . 43
2.5.2 Kolmogorov–Smirnov Test and Other Measures . . . 44
2.5.3 Distance Metrics . . . . . . . . . . . . . . . . . . . . . 46

2.5.4 Future Work and Limitations . . . . . . . . . . . . . . 47
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Chapter 3: Hyperspectral Images-based Cardiac Ablation Lesion Detection Using Unsupervised Learning 53
3.1 Introduction of the Autofluorescence-based Hyperspectral
Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.1 Hyperspectral Imaging Hardware . . . . . . . . . . . 56
3.1.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . 58
3.2 Ablation Lesion Detection Using Unsupervised Learning . . 59
3.2.1 K-means Clustering . . . . . . . . . . . . . . . . . . . 61
3.2.2 Evaluation and Results . . . . . . . . . . . . . . . . . 62
3.3 Optimization of Wavelength Selection . . . . . . . . . . . . . 67
3.3.1 Feature Grouping . . . . . . . . . . . . . . . . . . . . 67
3.3.2 Wavelength Bands Selection . . . . . . . . . . . . . . 68
3.3.3 Time Cost Analysis . . . . . . . . . . . . . . . . . . . . 73
3.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Introduction of Cluster Validation in Unsupervised Learning 77
3.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Experiments of Cluster validation using DSI . . . . . . . . . 79
3.5.1 Materials and Methods . . . . . . . . . . . . . . . . . 79
3.5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 83
3.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 95
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Chapter 4: Breast Cancer Detection Using Explainable Deep Learning 100
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 Breast Cancer Detection Using Transfer Learning in Convolu-
tional Neural Networks . . . . . . . . . . . . . . . . . . . . . 103
4.2.1 MIAS Mammograms and Images Pre-processing . . . 105
4.2.2 Pre-trained Model: VGG-16 . . . . . . . . . . . . . . . 106
4.2.3 Experiments and Results . . . . . . . . . . . . . . . . 106
4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 112
4.3 Breast Cancer Detection Using Synthetic Mammograms from
Generative Adversarial Networks . . . . . . . . . . . . . . . . 114
4.3.1 Introduction of the Mammogram Data: DDSM . . . 114
4.3.2 Image Augmentation by Affine Transformation . . . 116
4.3.3 Introduction of Generative Adversarial Network (GAN)
Augmentation . . . . . . . . . . . . . . . . . . . . . . . 117
4.3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . 122
4.3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 128
4.4 Evaluation of Generative Adversarial Network Performance 131

4.4.1 Introduction of GAN Evaluation Metrics . . . . . . . 132
4.4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . 133
4.4.3 Likeness Score: A Modified DSI for GANs Evaluation 139
4.4.4 Experiments and Results . . . . . . . . . . . . . . . . 141
4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 154
4.5 Generalizability of Deep Neural Networks . . . . . . . . . . . 161
4.5.1 Introduction of Generalizability of Neural Networks . 161
4.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.5.3 Experiments and Results . . . . . . . . . . . . . . . . 168
4.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 176
4.6 Estimation of Training Accuracy for Two-layer Neural Networks 178
4.6.1 Introduction and Related Work . . . . . . . . . . . . . 179
4.6.2 The Hidden Layer: Space Partitioning . . . . . . . . . 182
4.6.3 Empirical Corrections . . . . . . . . . . . . . . . . . . 191
4.6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 199
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

Chapter 5: Deep Learning-based Medical Images Segmentation 206


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.2 Segmentation of Thermal Breast Images . . . . . . . . . . . 207
5.2.1 Background and Related Work . . . . . . . . . . . . . 207
5.2.2 Breast Thermography Image Collection and Image Pre-
processing . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.2.3 Segmentation Model Architecture . . . . . . . . . . . 211
5.2.4 Experiments and Evaluation . . . . . . . . . . . . . . 213
5.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 219
5.2.7 Extended Studies . . . . . . . . . . . . . . . . . . . . . 222
5.3 Targets Segmentation by Trained Classifier . . . . . . . . . . 224
5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 225
5.3.2 The Grad-CAM Method . . . . . . . . . . . . . . . . . 229
5.3.3 Proposed Experiments . . . . . . . . . . . . . . . . . . 230
5.3.4 Image Data Pre-processing . . . . . . . . . . . . . . . 233
5.3.5 Experiments and Results . . . . . . . . . . . . . . . . 235
5.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 240
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

Chapter 6: Conclusions and Future Work 244

List of Publications 249

Bibliography 252

Appendix A: The CNN architecture for Cifar-10/100 used in Section 2.4.2 283

Appendix B: Synthetic Datasets 285

Appendix C: Simplification from Equation 4.4 to Equation 4.5 286

List of Figures

1.1 Accuracy and explainability trade-off [97] . . . . . . . . . . . . . 2

2.1 Different separability of two datasets . . . . . . . . . . . . . . . . 16


2.2 An example of two-class dataset in 2-D shows the definition and
computation of the DSI. Details about the ICD and BCD sets are
in Section 2.3.1, and Section 2.3.2 contains more details about
computation of the DSI. The proof that DSI can measure the
separability of dataset is shown in Section 2.3.3. . . . . . . . . 20
2.3 Two-class dataset with maximum entropy . . . . . . . . . . . . . 22
2.4 Two non-overlapping small cells . . . . . . . . . . . . . . . . . . . 28
2.5 Typical two-class datasets and their ICD and BCD set distributions 34
2.6 Two-class datasets with different cluster standard deviation (SD)
and trained decision boundaries. . . . . . . . . . . . . . . . . . . 36
2.7 Complexity measures for two-class datasets with different cluster
SDs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8 DSIs of CIFAR-10 subsets . . . . . . . . . . . . . . . . . . . . . . 40
2.9 Manipulation (e.g., pre-processing) of images in datasets can
change their complexities. We then simultaneously compare dif-
ferent methods of pre-processing and of complexity measures (the
y-axes) including the Training Distinctness (TD, Definition 2.3) as
ground truth, on the CIFAR-10/100 datasets. The x-axes show
pre-processing methods, from left to right: Color (factor = 2) and
Sharpness (2), Color (2), Contrast (2), Color (0.1), and Contrast
(0.5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.10 DSI calculation using different distribution measures. . . . . . 46
2.11 DSI calculation using different distance metrics. . . . . . . . . . 47
2.12 DSIs of input and output data from each layer in the FCNN model
for nine datasets from Section 2.4.1.2. The x-axis represents the
outputs from layers of the FCNN: input layer, three hidden layers,
and output layer. The y-axis represents the DSI values of output.
Plots are for the nine datasets. . . . . . . . . . . . . . . . . . . . 49
2.13 Two-class datasets with different decision boundaries. They have
the same DSI but different Training Distinctness (TD). The dataset
(b) having more complex decision boundary is more difficult to
be classified. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.1 Proposed concept of acquiring hyperspectral imaging data from


the heart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Schematic showing hypercube acquisition. CCD – charge coupled
device, LCTF – liquid tunable filter, UV LED – ultraviolet light
emitting diode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.3 Hypercube of aHSI images: images in the hypercube were ordered
by their wavelength increasingly on the Z-axis. Each pixel on the
X-Y plane thus has an associated spectrum. . . . . . . . . . . . 58
3.4 (a) Pre-processing operations and reshaping hypercube into a 2D
matrix; (b) the rule of reshaping and inverse-reshaping. . . . . 60
3.5 K-means clustering. . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Appearance of ablated tissue after: (a) linear unmixing from aHSI
system, (b) TTC staining. . . . . . . . . . . . . . . . . . . . . . . . 62
3.7 Results for porcine atria (Set-1) clustered by k-means into: (a) 5
clusters and (b) 10 clusters. Panel (c) shows an auto-fluorescence
image at 500 nm; (d) shows the lesion areas (red) detected when
k=10, superimposed on the image in (c). The corresponding
lesion component image, which is from the unmixed image that
contains lesion component and non-lesion component, is shown
in (e); followed by binary image obtained from (e) by applying
Otsu’s thresholding (f). . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Maximum, average, and minimum accuracies over 10 datasets
for each k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 One kind of 4-feature grouping. . . . . . . . . . . . . . . . . . . . 67
3.10 Accuracies of SNs for one dataset (Set-1). . . . . . . . . . . . . . 69
3.11 Feature grouping results for porcine atria (Set-1): (a) k-means
clustering (k=10) by using all 31 features; (b) k-means clustering
(k=10) by using four features from 4-feature grouping (SN=2857):
[wavelength groups: 420-510, 520-600, 610-630, 640-720 nm];
(c) k-means clustering (k=10) by using four features from different
4-feature grouping (SN=3716): [wavelength groups: 420-580,
590-600, 610-680, 690-720 nm]. . . . . . . . . . . . . . . . . . . 69
3.12 Feature grouping accuracies for 10 datasets; each row represents
a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.13 Accuracies over 10 datasets. . . . . . . . . . . . . . . . . . . . . . 71
3.14 Smoothed Minimum accuracies (scaled) with the 3 dividers. The
green area (SN: 2730-3245) includes most high-accuracy combi-
nations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.15 Evaluated accuracies over 10 datasets. . . . . . . . . . . . . . . 73
3.16 The simplified flowchart summarizes the methods in this study.
The ground truth areas of lesion were obtained from aHSI system
by linear unmixing and verified by TTC analysis. By comparing
with truth data, we found the optimal k-value for k-means algo-
rithm (green) as well as the optimal groups (blue). The procedure
on the left (red) is our proposed methods for lesion detection from
ablated tissue to lesion areas. . . . . . . . . . . . . . . . . . . . . 74
3.17 Time costs for k-means clustering. . . . . . . . . . . . . . . . . . 75
3.18 Two clusters (classes) datasets with different label assignments.
Each histogram indicates the relative frequency of the value of
each of the three distance measures (indicated by color). . . . . 81

3.19 An example of rank numbers assignment. . . . . . . . . . . . . 86
3.20 Examples for rank-differences of synthetic datasets. . . . . . . . 97
3.21 Wrongly-predicted clusters have a higher DSI score than real
clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.1 Result of the New-model. Blue curve is the accuracy after each
epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 110
4.2 Result of the Feature-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 111
4.3 Result of the Tuning-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 112
4.4 Comparing of the three CNN classification models: New-model
(yellow); Feature-model, to train a neural network-classifier (red);
Tuning-model (blue). The values are maximum smoothed accu-
racy and time cost (second) of training per epoch. . . . . . . . . 113
4.5 (A) A mammographic image from DDSM rendered in grayscale;
(B) Cropped ROI by the given truth abnormality boundary; (C)
Convert Grey to RGB image by duplication. . . . . . . . . . . . . 117
4.6 The three types of affine transformation. . . . . . . . . . . . . . 117
4.7 The principle of GAN. . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.8 Validation accuracy of CNN classifiers trained by three types of
AFF ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.9 The flowchart of our experiment plan. CNN classifiers were
trained by data including ORG, AFF and GAN ROIs. Valida-
tion data for the classifier were ORG ROIs that were never used
for training. The AFF box means to apply affine transformations. 125
4.10 (Top row) Real abnormal ROIs; (Bottom row) synthetic abnormal
ROIs generated from GAN. . . . . . . . . . . . . . . . . . . . . . . 126
4.11 Training accuracy and validation accuracy for six training datasets. 127
4.12 Histogram of mean and standard deviation. (Normalized) . . . . 130
4.13 Problems of generated images from the perspective of distribution.
The area of dotted line is the distribution of real images. The dark-
blue dots are real samples and red dots are generated images. (a)
is overfitting, lack of Creativity. (b) is lack of Inheritance. (c) is
called mode collapse for GAN and (d) is mode dropping. Both (c)
and (d) are examples of lack of Diversity. . . . . . . . . . . . . . 133
4.14 Lack of Creativity, Diversity, and Inheritance in 2D. Histograms
of (a) and (b) are zoomed to ranges near zero; (c) has the entire
histogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.15 Plots of values in Table 4.9. . . . . . . . . . . . . . . . . . . . . . 144

4.16 Column 1: samples from four types of real images; column 2-4:
samples from synthetic images of three GANs trained by the four
types of images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.17 Normalized and ranked scores. X-axis shows scores and y-axis
shows their normalized values; 0 is for the worst (model) perfor-
mance and 1 is for the best (model) performance. Colors are for
generators and shapes are for image types; see details in legend. 148
4.18 Column 1: samples from real images of CIFAR-10; column 2-6:
samples from synthetic images of five GANs: DCGAN, WGAN-GP,
SNGAN, LSGAN, and SAGAN trained by the original 2,000-image
subset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.19 Processes to build real set and generated sets including opti-
mal generated images and generated images lack creativity, lack
diversity, lack creativity & diversity, and lack inheritance. . . . 151
4.20 Column 1: samples from the real set; column 2-6: sample images
from the five virtual GAN models: Opt., LC, LD, LC&D, and LIn
trained by the real set. . . . . . . . . . . . . . . . . . . . . . . . . 153
4.21 Real and generated datasets from virtual GANs on MNIST. First
row: the 2D tSNE plots of real (blue) and generated (orange) data
points from each virtual GAN. Second row: histograms of ICDs
(blue for real data; orange for generated data) and BCD for real
and generated datasets. The histograms in (b)-(d) are zoomed to
the beginning of plots; (a) and (e) have the entire histograms. . 155
4.22 Time cost of measures running on a single core of CPU (i7-6900K).
To test time costs, we used same amount of real and generated
images (200, 500, 1000, 2000, and 5000) from CIFAR-10 dataset
and DCGAN trained on CIFAR-10. † IS only used the generated
images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.23 To generate adversarial examples of classifier f . . . . . . . . . . 164
4.24 Adversarial examples generated by pairs of data. . . . . . . . . . 165
4.25 Two kinds of round decision boundary. . . . . . . . . . . . . . . 167
4.26 Local adversarial set generated by 3-nearest neighbors of a pair. 168
4.27 Decision boundaries of two models trained by the synthetic 2-D
dataset. The FCNN:(a) has only one hidden layer with one neuron;
its number of parameters is 5 (including bias). The FCNN:(b) has
three hidden layers with 10, 32 and 16 neurons; its number of
parameters is 927 (including bias). . . . . . . . . . . . . . . . . . 170
4.28 Local DBC scores from two models trained by the breast cancer
dataset. The FCNN bC1 has three hidden layers (20 neurons in
each layer) and three Dropout layers; its number of parameters is
1,481 (including bias). The bC2 has one hidden layer with 1,000
neurons; its number of parameters is 32,001 (including bias). . 172

4.29 Training and test accuracies in training process of three models.
The CNN cC1 has three convolutional layers, three max-pooling
layers, one dense layer (64 neurons) and one Dropout layer. The
cC2 has one convolutional layer and three dense layers (256, 128,
64 neurons). The cC3 has only one dense layer (1024 neurons). 173
4.30 Means and medians of local DBC scores on model cC1, cC2 and
cC3 using different numbers of nearest neighbors. . . . . . . . 174
4.31 Increasingly sorted local DBC scores from three models. The
upper figure is the whole plot, and the lower figure zooms into the
range 2k-6k to clearly show the positions of the three curves. 175
4.32 Adversarial examples for the cC1 model. . . . . . . . . . . . . . 176
4.33 Linear adversarial set (green) on lumpy boundary (black). . . . 178
4.34 An example of the two-layer Fully-Connected Neural Network
(FCNN) with d − L − 1 architecture. This FCNN is used to classify
N random vectors in Rd belonging to two classes. Detailed settings
are stated before in Section 4.6.1.1. The training accuracy of this
classification can be estimated by our proposed method, without
applying any training process. The detailed Algorithm of our
method is shown in Section 4.6.3.3. . . . . . . . . . . . . . . . . 181
4.35 Maximum number of partitions in 2-D . . . . . . . . . . . . . . . 190
4.36 Fitting curve of b1 = f (N, L) in 2-D . . . . . . . . . . . . . . . . . . 193
4.37 Plots of d vs. xd yd from Table 4.18. The blue dotted line is fitted
linearly to the points to show the growth. 196
4.38 Estimated training accuracy results comparisons. y-axis is accu-
racy, x-axis is the dimensionality of inputs (d). . . . . . . . . . 197
4.39 Evaluation of estimated training accuracy results. y-axis is esti-
mated accuracy; x-axis is the real accuracy; each dot is for one
case; red line is y = x. R2 ≈ 0.955. . . . . . . . . . . . . . . . . . . 199

5.1 Full thermal raw images of two patients, including the neck,
shoulder, abdomen, background and chair. . . . . . . . . . . . . 208
5.2 Our breast infrared thermography system. . . . . . . . . . . . . 210
5.3 Preprocessing of the raw IR images: (a) original raw IR image, (b)
manual rectangular crop to remove shoulders and abdomen, and
(c) is the hand-trace of the breast contour to generate the manual
segmentation (ground truth). . . . . . . . . . . . . . . . . . . . . 211
5.4 Training and testing data for Experiment 1 and 2. . . . . . . . . 214
5.5 The evaluation processes. . . . . . . . . . . . . . . . . . . . . . . 215
5.6 The training curve. . . . . . . . . . . . . . . . . . . . . . . . . . . 216
5.7 Segmentation results of one patient from Experiment 1. . . . . 217
5.8 Results of Experiment 1. The blue dots are the average IoU for
each patient and bars show the range among 3 samples. . . . . 218
5.9 Segmentation results of two patients from Experiment 2. . . . . 218

5.10 Results of Experiment 2. The blue dots are the average IoU of each
patient among its 15 testing samples, the red lines are medians
and the bars show the ranges. . . . . . . . . . . . . . . . . . . . 219
5.11 Comparison of results from the two experiments (first row: Exper-
iment 2, second row: Experiment 1). The second column (Gray
seg-image) shows output of segmentation models. The third col-
umn is the ground truth breast region of the patient’s testing
samples. (Top part: p.001, bottom part: p.009). . . . . . . . . . 221
5.12 The size and object-area ratio change of images. We change image
size by down-sampling and change object-area ratio by adding
blank margin around the object and down-sampling to keep the
same size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.13 Results of Grad-CAM applied to Xception model with input of
an elephants image. (a) is the input image. (b) is original image
masked by Grad-CAM heatmap (using ‘Parula’ colormap) of the
prediction on this input. (c) is the Grad-CAM heatmap mask
using gray-scale colormap. (d) is original image filtered by the
heatmap mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
5.14 Flowchart of the Experiment #1. The true boundaries of tumor
regions in abnormal ROIs are provided by the DDSM database. 231
5.15 Flowchart of the Experiment #2. The normal and abnormal ROIs
are used twice to train the CNN classifiers and then to generate
CAMs by trained classifiers using Grad-CAM algorithm. The CNN
classifiers to be trained by CAM-filtered ROIs are the same CNN
models (same structures) as trained by the original ROIs before
but trained from scratch again. . . . . . . . . . . . . . . . . . . 232
5.16 The ROI (left) is cropped from an original image (right) from DDSM
dataset. The red boundary shows the tumor area. The ROI is
larger than the size of tumor area because of padding. . . . . . 234
5.17 The padding is added to four sides of ROIs by some randomness
and depended on the size of tumor area. . . . . . . . . . . . . . 235
5.18 Examples of ROIs. The tumor mask is binary image created from
the tumor ROI and truth boundary of the tumor area. . . . . . 235
5.19 Result of Experiment #1. The first row shows one of the abnormal
(tumor) ROIs and its truth mask. Other rows show the CAMs of
this ROI generated by using trained CNN classifiers and Grad-
CAM algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
5.20 Plots of Dice and CAM_val_acc for the six CNN classifiers in
Table 5.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
5.21 Examples of truth-mask-filtered (a) and inverse-mask-filtered (b)
ROIs from the case shown in Figure 5.19. . . . . . . . . . . . . . 241
5.22 Some tumor ROIs and their CAMs from Xception. . . . . . . . 241

List of Tables

2.1 Complexity measures reported by Lorena et al. [165] . . . . . . 19


2.2 Complexity measures results for the two-class datasets (Fig-
ure 2.5). The measures noted by “*” failed to measure separability. 35
2.3 Values of complexity measures for CIFAR-10 . . . . . . . . . . . 42
2.4 Values of complexity measures for CIFAR-100 . . . . . . . . . . 42

3.1 Accuracies of 31-feature clustering results. . . . . . . . . . . . . 66


3.2 Combinations of 4 groups. . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Accuracies of 4-feature clustering results by grouping (SN=3018):
[420-520, 530-590, 600-640, 650-720 nm]. . . . . . . . . . . . . 72
3.4 Compared CVIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.5 The description of used real datasets. . . . . . . . . . . . . . . . 83
3.6 CVI scores of clustering results on the wine recognition dataset. 84
3.7 Hit-the-best results for the wine dataset. . . . . . . . . . . . . . 85
3.8 Rank sequences of CVIs converted from the score sequences in
Table 3.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.9 Rank-difference results for the wine dataset. . . . . . . . . . . . 88
3.10 Hit-the-best results for real datasets. . . . . . . . . . . . . . . . 90
3.11 Rank-difference results for real datasets. . . . . . . . . . . . . . 90
3.12 Hit-the-best results for 97 synthetic datasets. . . . . . . . . . . 92
3.13 Rank-difference results for 97 synthetic datasets. . . . . . . . . 92
3.14 Number of clusters prediction results on the wine dataset (178
samples in 3 classes). . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.15 Number of clusters prediction results on the tae dataset (151
samples in 3 classes). . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.16 Number of clusters prediction results on the thy dataset (215
samples in 3 classes). . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.17 Number of clusters prediction results on the vehicle dataset (948
samples in 4 classes). . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.18 Rank-difference results for selected synthetic datasets. . . . . . 96

4.1 CNN architecture for training from scratch. . . . . . . . . . . . . 108


4.2 CNN architecture for transfer learning. . . . . . . . . . . . . . . 109
4.3 The architecture of generator and discriminator neural networks. 121
4.4 The architecture of CNN classifier. . . . . . . . . . . . . . . . . . 123
4.5 Notations for data. (abnorm = abnormal / norm = normal) . . . 124
4.6 Training plans. Training by using CNN classifier in Table 4.4.
Notations are described in Table 4.5. . . . . . . . . . . . . . . . . 126
4.7 Analysis of validation accuracy for CNN classifiers. . . . . . . . 129
4.8 Wasserstein distance between two histograms. . . . . . . . . . . 130
4.9 Measure values for different numbers of generated images . . . 143
4.10 Measure results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

4.11 Measure results averaged by generators . . . . . . . . . . . . . . 147
4.12 Measure results on CIFAR-10 . . . . . . . . . . . . . . . . . . . . 149
4.13 Measure results from virtual GAN models . . . . . . . . . . . . . 154
4.14 Statistical Results of local DBC scores on bC1 and bC2. . . . . 172
4.15 Statistical Results of local DBC scores on cC1, cC2 and cC3. . 176
4.16 Accuracy results comparison. The columns from left to right
are dimension, dataset size, number of neurons in hidden layer,
the real training accuracy and estimated training accuracy by
Equation (4.15) and Theorem 4.1. . . . . . . . . . . . . . . . . . 192
4.17 Estimated training accuracy results comparison in 2-D. The
columns from left to right are dataset size, number of neurons
in hidden layer, the real training accuracy, estimated/predicted
training accuracy by Equation (4.17) and Theorem 4.1, and (abso-
lute) differences based on estimations between real and estimated
accuracies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.18 Parameters {xd , yd , cd } in Equation (4.16) (Observation 4.1) for
various dimensionalities of inputs are determined by fitting. . 195

5.1 C-DCNN segmentation architecture for thermal breast images. 212


5.2 Result of Experiment #1. Descending sort by val_acc. . . . . . . 237
5.3 Result of Experiment #2. Descending sort by Dice. . . . . . . . 239

6.1 My contributions (citations in brackets) in the four summarized


projects regarding the complexity and learnability, which are the
two important components of explainable machine learning (or
XAI). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

A.1 The CNN architecture used in Section 2.4.2 . . . . . . . . . . . . 283

B.1 Names of the 97 used synthetic datasets from the Tomas Barton
repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

List of Abbreviations

ARI Adjusted Rand Index

aHSI auto-fluorescence-based Hyper-Spectral Imaging

BCD Between-Class Distance

CAD Computer-Aided Diagnosis

C-DCNN Convolutional and Deconvolutional Neural Network

CT Computed Tomography

CNN Convolutional Neural Network

CVI Cluster Validity Index

DBC Decision Boundary Complexity

DL Deep Learning

DNN Deep Neural Network

DSI Distance-based Separability Index

FC Fully-Connected

FCNN Fully-Connected Neural Network

GAN Generative Adversarial Network

ICD Intra-Class Distance

LS Likeness Score

MRI Magnetic Resonance Imaging

ML Machine Learning

NN Neural Network

PET Positron Emission Tomography

ROIs regions of interest

XAI eXplainable Artificial Intelligence

List of Symbols

IRef Binary-Unmixing-Reference

I31Rlt Binary-31(features)-Result

I4Rlt Binary-4(features)-Result

SN Serial Number

Chapter 1: Introduction

1.1 Background

Medical image analysis plays a crucial role in clinical diagnosis [157];
it is like a third eye for doctors [77]. Medical images can be obtained
from many imaging techniques including ultrasound, X-ray, Computed
Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission
Tomography (PET), multi-spectral fluorescence, and others.
In the early 1970s, as soon as processing images by computer became
possible [224, 8], Computer-Aided Diagnosis (CAD) [61] was applied to
medical image analysis. From that time to the present, CAD techniques
have kept developing along with image processing, computer vision, and
Machine Learning (ML). Medical image analysis has thus evolved from
sequential methods of pixel-level image processing [88] (e.g., filters and edge
detectors) to semantics-level Deep Learning (DL) (e.g., convolutional neural
networks [148] and generative adversarial networks [89]).
Nowadays, ML has a profound impact on almost every aspect of our
world. It has become an extremely popular topic in both academia and
industry [2, 1]. As reliance on this technology has grown, there is a greater
need to understand how and why decisions are made by an ML model. That
is, we not only need precise answers but also want to know the reasons
behind them and to have clear explanations. A remarkable trade-off shows
that as models move from linear to non-linear, the readability of the model
is reduced [175]; in other words, the price of model performance is its
explainability [95]. Usually, classification models with better performance
are more difficult to understand and explain, as illustrated in Figure 1.1.
This is known as the trade-off between accuracy and explainability [97].

Figure 1.1: Accuracy and explainability trade-off [97]
Along with the development of ML, we now have advanced ML models of
great complexity, e.g., the Deep Neural Network (DNN). These complicated
ML models can attain higher accuracy than traditional models on many
tasks such as classification, recognition, and segmentation. For image
classification and recognition, detection accuracy has been greatly improved
by new techniques. For example, in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) [227], traditional methods (SIFT features
and Fisher vectors) obtained 54.3% (top-1) accuracy [232] in 2011; by 2021,
a DL-based method (EfficientNet-L2) had achieved a state-of-the-art (top-1)
accuracy of 90.2% [206]. Finding explanations for their outcomes, however,
is a more difficult problem. Since 2016, interest in methods to explain ML
and DL models has increased, and more and more research has addressed
this problem [6], which is commonly called eXplainable Artificial Intelligence
(XAI).

There is, so far, no unified definition of XAI [64, 158]. Various definitions
and terminologies have been proposed by different studies, and many of
them overlap in concept [72]. A simple definition, proposed by D. Doran et
al. [63], is to explain the internal decisions within an ML system that lead
to its outcomes. Such explanations require insight into the rationale that
the ML system uses to obtain conclusions from the input data. But this
definition is not comprehensive, because how the “internal decisions” can
be understood by users is still a problem. For example, a trained Fully-
Connected Neural Network (FCNN) can be described as a very complicated
composition function, but that composition function conveys little meaning
to us. D. Gunning proposed a more specific definition in the XAI program of
DARPA [95]; he considers that an XAI system must answer:

• Why decisions were made by an ML system and why not something
else.

• When the system will succeed or fail.

• When we can trust the system.

• Why the system erred.

In addition, an XAI system should maintain high accuracy on its tasks. Based
on D. Gunning’s definition, F. Hohman et al. [109] summarize the definition
of an XAI system by focusing on the five W’s and How (Why, Who, What,
How, When, and Where). The authors consider that we should first clarify
the purpose of an explanation (why to explain, and for whom), then decide
which parts to explain (what) and find methods to do so (how), and finally
know the methods’ limitations and effective domains (when and where they
can be used). Extending from D. Doran et al., A. Barredo Arrieta et al. [21]
consider that the XAI system can produce detailed reasons to make internal
functions clear and understandable. These definitions emphasize the
importance of human participation and understanding. Human users’
experiences do matter; hence, the goals of XAI also include confidence,
trustworthiness, fairness, accessibility, and privacy awareness, among
others [158, 21]. To make the definition clear and neat, R. Roscher et
al. [222] summarize XAI in three core elements:

1. Transparency of design

2. Interpretability of processing

3. Explainability to users

Transparency requires that all techniques used in the XAI system (models,
methods, and algorithms) have clear reasons or purposes according to the
needs and goals. For example, an XAI system must provide the reason
why it uses a Convolutional Neural Network (CNN) model with 16 hidden
layers. Interpretability means interpreting the processing from input data
to results (output). For example, for a given result, the XAI system could
answer the question: on what does the ML algorithm base its decision? [36].
Explainability is important to users: it further explains the answers of
transparency and interpretability, integrated with domain knowledge. The
final goal of an XAI system is to provide explanations and answers to specific
users. Therefore, explainability requires that the XAI system translate its
explanations into a form understandable to its target users. In fact, the
concept of XAI is still evolving. While the definition of XAI has been discussed
for years, a concise or universal definition remains unavailable [222]. Since
explainability is related to the human mind, a deep and comprehensive
discussion of XAI must involve the disciplines of the arts and social sciences
[180], such as philosophy, linguistics, psychology, and cognitive science. In
this study, I do not pursue the definition of XAI further, because such a
discussion is beyond the technical field and unrelated to the main topic.
Instead, I focus on the approaches for XAI related to the techniques and
methods used in medical image analysis.
In medical image analysis, ML can be applied to a wide range of tasks
and provides promising approaches to make the diagnostic process more
precise and efficient. Although ML and DL have achieved noticeable results
in the laboratory, they have not been deployed significantly in clinics because
of the lack of explainability [252]. For reasons of responsibility and reliability,
explainability is required of a CAD system if it is to be trusted by physicians,
regulators, and patients. Tosun et al. [274] indicate the main requirements
for explainable CAD systems:
• Show the targets (ROIs) used to make the decision.

• Provide the confidence level (probability) of the decision.

• Display the multi-level triaging for a case (like the decision tree).

• Support a decision with similar/different examples (images).

• Support a decision with word-based descriptions (including natural
language processing techniques).

• Illustrate the differences between classes.

• Provide a user-friendly interface.

Recent explainability methods for medical imaging mostly focus on saliency/
attribution maps [252, 272, 250, 275], which show the ROIs used to make
decisions; examples include CAM [308]/Grad-CAM [240], gradient-based
methods [71], and SHAP [296].
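As one concrete illustration of such saliency-map methods, the sketch below outlines the Grad-CAM computation for a CNN classifier. It is a minimal sketch under stated assumptions, not the implementation used later in this dissertation: the model is assumed to be a Keras functional model, and the parameter last_conv_layer_name is a hypothetical argument naming its last convolutional layer.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    # Model mapping the input image to the last conv layer's activations
    # and to the classifier's predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)   # d(class score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))   # average the gradients over space
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                       # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # heatmap in [0, 1]
```

The resulting low-resolution heatmap is typically upsampled to the input-image size and overlaid on the image to show which regions drove the prediction.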

As medical image analysis techniques have developed, new methods from
XAI are continually being applied in the field of CAD. The goals are to make
the CAD system transparent, understandable, and explainable and to win
the trust of end-users so that, ultimately, it can be widely accepted in clinics
to improve medical diagnosis and treatment outcomes. Therefore, my studies
include not only several applications of ML and DL technologies in medical
imaging but also contributions to methods focusing on the explainability of
ML and DL.

1.2 Research Summary

The studies reported here include the applications of ML and DL in medical
imaging and the development of methods for XAI. The following sections
summarize the four relevant projects.

1.2.1 Hyper-spectral Image-based Cardiac Ablation Lesion Detection

Atrial Fibrillation (AF), the most common sustained arrhythmia, is commonly
treated by the radiofrequency ablation (RFA) procedure, which destroys
culprit tissue or creates scars that prevent the spread of abnormal electrical
activity. Our work is to develop imaging tools for real-time visualization of
ablated tissue. The long-term goal of our studies is to help develop an
intracardiac auto-fluorescence-based Hyper-Spectral Imaging (aHSI) catheter
that can improve the success rate of RFA treatment, reduce the incidence of
AF recurrence, and help to avoid re-ablating previously ablated tissue during
later treatments.
We have recently demonstrated the ability of autofluorescence hyperspectral
imaging to reveal ablated tissue using linear unmixing protocols. Here we
have shown that k-means, an approach that does not require a priori
knowledge of tissue spectra, can also be an effective means to detect lesions
from aHSI hypercubes. The average accuracy for detection by k-means
(k = 10) using 31 features was about 74% when compared to reference
images. Secondly, we have demonstrated that the number of spectral bands
(referred to as features) can be reduced, by grouping them, without
significantly affecting lesion detection accuracy. Specifically, we show that
by using the best four grouped features, the accuracy of lesion identification
was about 94% of that achieved by using 31 features. The time cost of
4-feature clustering was about 40% of that of 31-feature clustering,
demonstrating that 4-feature grouping can speed up acquisition and
processing. From an instrumentation point of view, using a limited number
of features allows one to combine multiple spectral bands into one spectrally
wide band, which is extremely beneficial for low-light applications such as
implementation of aHSI via catheter access.
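A minimal sketch of this clustering pipeline is shown below, assuming the hypercube has already been preprocessed as described in Chapter 3. The array sizes are placeholders, and grouping bands by averaging contiguous wavelength ranges is an illustrative assumption; the actual grouping boundaries are optimized in Section 3.3.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder hypercube: H x W pixels with 31 spectral bands (illustrative sizes).
hypercube = np.random.rand(256, 256, 31)
h, w, n_bands = hypercube.shape

# Reshape to a 2-D matrix: one row per pixel, one column per band.
pixels = hypercube.reshape(-1, n_bands)

# Cluster the pixel spectra; k = 10 gave the best lesion detection accuracy.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(pixels)
label_map = labels.reshape(h, w)   # inverse-reshape back to image coordinates

# Band grouping: merge contiguous wavelength ranges into 4 wide features
# (here by averaging; the split of the 31 bands below is illustrative only).
edges = [0, 10, 19, 22, 31]
grouped = np.stack([pixels[:, a:b].mean(axis=1)
                    for a, b in zip(edges[:-1], edges[1:])], axis=1)
labels_4 = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(grouped)
```

The clusters identified as lesions are then compared against the unmixing-based reference images described in Chapter 3 to compute detection accuracy.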
This project is important because the recurrence rate of AF after an ablation
procedure can be as high as 50%, and more than 90% of these recurrent
cases have been linked to gaps between ablation lesions [284]. Incomplete
placement of lesions that later results in AF recurrence could be curtailed if
clinicians could directly monitor lesion formation along with the degree of
tissue damage. Unfortunately, the endocardial surface of the left atrium,
where most RF ablation procedures are performed, is covered by thick layers
of collagen and elastin that prevent direct visualization of the ablated muscle
beneath. While imaging technologies such as MRI, CT, and ultrasound have
been successfully applied for lesion assessment, they have significant
limitations: CT and MRI are expensive and involve radiation and/or contrast
agents, and ultrasound imaging has poor resolution; thus, they are not
always suitable for live monitoring [198]. Therefore, our work has explored
another visualization approach, called aHSI, to address this problem.

1.2.2 Applications of Transfer Learning and the Generative Adversarial Network (GAN) in Breast Cancer Detection

In the development of breast cancer detection techniques, the CNN can
be used to extract features from images automatically and then perform
classification. Training a CNN from scratch, however, requires a large
number of labeled images, which is infeasible for some kinds of medical
image data such as mammographic tumor images. Thus, we proposed two
approaches to address the lack of training images:

1. Apply transfer learning with a CNN. We used the pre-trained VGG-16
model to extract features from input mammograms and used these
features to train a Neural Network (NN) classifier; a minimal sketch of
this approach appears after this list. The stable average validation
accuracy converged at about 91.48% for classifying abnormal vs. normal
cases in the DDSM database.

2. Use a GAN to generate synthetic mammographic images for training.
Adding GAN-generated images made it possible to train a CNN from
scratch successfully, and adding more GAN images improved the CNN’s
best validation accuracy to 98.85%.
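The sketch below illustrates the first approach, assuming a Keras implementation. The classifier head (256 units, 0.5 dropout), the 224 x 224 input size, and the placeholder training arrays are illustrative assumptions, not the exact configuration described in Chapter 4.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained VGG-16 (ImageNet weights) used as a frozen feature extractor.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Small neural-network classifier trained on top of the extracted features.
clf = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),   # abnormal vs. normal
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholders: mammographic ROIs resized to 224 x 224 and replicated to
# three channels, with labels 0 = normal and 1 = abnormal.
x_train = np.random.rand(32, 224, 224, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(32,))
clf.fit(x_train, y_train, epochs=1, batch_size=8)
```

Only the small classifier head is trained; the frozen VGG-16 backbone supplies features learned from ImageNet, which is what makes training feasible with few labeled mammograms.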

This project is important because training a CNN from scratch requires
a large number of labeled images [73]. For some kinds of medical image
data, however, such as mammographic tumor images, obtaining a sufficient
number of images to train a CNN classifier is difficult because true positives
are scarce in the datasets and expert labeling is expensive [111]. The
shortcomings of training a classifier with an insufficient number of images
are well known [146]; it is therefore very important to study methods that
address this problem and thereby improve the performance of a CNN
classifier, especially for medical image analysis.

1.2.3 Transparent Deep Learning/Machine Learning

The lack of explainability limits the acceptance of DL applications by
clinicians. Explainability has multiple facets, and there is to date no unified
definition. For explainable DL, we have created measurement tools and
analyzed several questions in explainable DL and machine learning. We
have primarily addressed these aspects: the generalizability of the DNN,
the separability of data, transparent DL, and the learnability of DL models.
Specifically, we have:

• Created the Distance-based Separability Index (DSI), which can indicate
whether the distributions of datasets are identical for any dimensionality.
DSI considers the situation in which different classes of data are mixed
in the same distribution to be the most difficult for classifiers to separate.
Based on this data separability measure, DSI can also be used as an
internal Cluster Validity Index (CVI) to evaluate clustering results, which
is a significant part of cluster analysis. In typical unsupervised learning
there are no true class labels, so internal evaluations (CVIs), which use
only the input data and predicted labels, are required. (A minimal sketch
of the DSI computation appears after this list.)

• Created measures to evaluate the performance of GANs. Recently, a
number of studies have addressed the theory and applications of the
GAN in various fields of image processing. Fewer studies, however, have
directly evaluated GAN outputs, and those that have were focused on
classification performance and statistical metrics. In this study, we
consider a fundamental way to evaluate GANs by directly analyzing the
images they generate, instead of using them as inputs to other classifiers.

• Created measures to analyze the generalizability of DNNs. For supervised
learning models, the analysis of generalization ability (generalizability) is
vital because it expresses how well a model will perform on unseen data.
In this study, we hypothesize that a DNN with a simpler decision boundary
has better generalizability, by the law of parsimony (Occam’s razor). We
created the Decision Boundary Complexity (DBC) score to define and
measure the complexity of the decision boundary of a DNN.

• Created a novel theory from scratch to estimate the training accuracy
of two-layer neural networks applied to random datasets. This study
may provide starting points for new ways for researchers to make
progress on the difficult problem of understanding the mechanisms of
DNN models.
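The sketch below illustrates the idea behind the DSI. It is a minimal sketch under stated assumptions: as developed in Chapter 2, the intra-class distance (ICD) and between-class distance (BCD) sets are computed for each class, and their distributions are compared here with the two-sample Kolmogorov-Smirnov statistic; the averaging over classes is a simplification, and the exact definition and proof are given in Sections 2.3.2-2.3.4.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import ks_2samp

def dsi(X, y):
    """Sketch of a distance-based separability index for a labeled dataset.

    For each class, the intra-class distances (ICD) are compared with the
    between-class distances (BCD) using the two-sample KS statistic, and
    the per-class scores are averaged.
    """
    scores = []
    for c in np.unique(y):
        Xc, Xo = X[y == c], X[y != c]
        icd = pdist(Xc)                  # pairwise distances within class c
        bcd = cdist(Xc, Xo).ravel()      # distances from class c to the other classes
        scores.append(ks_2samp(icd, bcd).statistic)
    return float(np.mean(scores))

# Well-separated clusters score much higher than fully mixed classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
X_separable = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
X_mixed = rng.normal(0, 1, (400, 2))
print(dsi(X_separable, y), dsi(X_mixed, y))
```

Used as an internal CVI, the same computation is applied with predicted cluster labels in place of true class labels.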

This project is important because it contributes to explainability and so
may increase the acceptance of DL applications by clinicians and patients.
This is especially important because applications of AI/ML for medical
imaging have become more and more popular.

1.2.4 Deep Learning-based Medical Image Segmentation

One of the important techniques in medical image processing is image
segmentation, which identifies a region of interest (ROI) through automatic
or semi-automatic methods. DL has become much more popular in computer
vision and has brought a breakthrough in image segmentation applications,
especially for medical images.
Autoencoder-like Convolutional and Deconvolutional Neural Networks
(C-DCNN) are a promising computational approach to automatically
segmenting breast areas in thermal images. We apply the C-DCNN to
segment breast areas in our thermal breast image database, which we are
collecting in a pilot study by imaging breast cancer patients with our infrared
camera (N2 Imager). We then examine how to segment targets using a
classifier trained on the targets, instead of training a new segmentation
model, and we evaluate this method on medical object segmentation.
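A minimal sketch of such an encoder-decoder (convolution followed by transposed convolution) segmentation network is shown below, assuming Keras. The depths, filter counts, and the 128 x 128 single-channel input are illustrative only; the actual C-DCNN architecture is given in Table 5.1.

```python
from tensorflow.keras import layers, models

# Encoder: convolution + pooling compresses the thermal image;
# decoder: transposed convolution restores resolution for a per-pixel mask.
inputs = layers.Input(shape=(128, 128, 1))              # single-channel thermal image
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2DTranspose(64, 3, strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(x)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # probability of "breast" per pixel
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The predicted mask can then be thresholded and compared with the hand-traced ground truth using IoU, as in the evaluation of Section 5.2.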
This project is important because applying image segmentation to medical
images can remove unnecessary parts and extract the key regions of interest
(ROIs) in the images; that is a crucial preprocessing step for CAD systems
in medical imaging. Automatic segmentation of the ROIs will limit the area
to be searched for tumors and reduce processing time; it will further reduce
the time and effort required for manual segmentation and potentially
minimize human error. Traditional segmentation methods, however, are not
suited to these challenging tasks, whereas recent DL-based segmentation
methods have been shown to outperform previous techniques for many
types of medical images.

1.3 Original Contributions

This dissertation combines my studies of machine learning, deep learning,
and their applications in medical imaging. The overarching goal of my
research is to develop ML/DL-based Computer-Aided Diagnosis (CAD)
techniques and to provide explanations of their processing and results in
order to earn users’ trust. My research has practical applications in
lesion/cancer detection, medical image segmentation, and eXplainable
Artificial Intelligence (XAI). My most significant research contributions are:

1. Designed processes to detect cardiac lesions in auto-fluorescence-based
hyper-spectral images by using k-means clustering and to select the
essential spectral bands to optimize the process of multi-spectral image
acquisition.

2. Improved the performance of breast cancer detection by applying transfer
learning and the Generative Adversarial Network (GAN) with Convolutional
Neural Networks (CNNs) to provide additional training data.

3. Segmented breast areas from thermal breast images by using neural
network-based segmentation models, and studied how to segment targets
using a classifier trained on the targets instead of training a new
segmentation model; the method was evaluated on medical images.

4. Created the Distance-based Separability Index (DSI), which is independent
of the classifier model, to measure the separability of datasets.

5. Showed that DSI can be effectively used as an internal Cluster Validity


Index (CVI) to evaluate the results of cluster analysis.

6. Characterized the performance of a good Generative Adversarial Net-


work (GAN) according to its Creativity, Inheritance, and Diversity; then,
created the Likeness Score (LS) (a variety of DSI) based on the three
aspects to evaluate the performances of GANs.

7. Created a score – Decision Boundary Complexity (DBC) – to define


and measure the complexity of decision boundaries of deep neural

networks; the measure can be used to analyze the generalizability of
deep learning models.

8. Created a novel theory from scratch to estimate the training accu-


racy for two-layer neural networks applied to random datasets, to
understand the mechanisms of DNN models.

1.4 Dissertation Organization

The dissertation describes the two complementary directions of my re-


search: the applications of ML and DL in medical imaging, and the developed
methods related to XAI. Since studies of explainable ML are motivated by
some applications in medical imaging, they are combined in the relevant
chapters. The Distance-based Separability Index (DSI) is firstly and individ-
ually introduced in Chapter 2 because it is an important contribution and
has been applied in later chapters. Specifically, the dissertation is organized
as follows:

• Chapter 1: Introduction contains an expanded literature review for


explainable ML in medical image analysis and summaries of research,
contributions, and dissertation.

• Chapter 2: Distance-based Intrinsic Measure of Data Separability


presents the study of Distance-based Separability Index (DSI). We
formally show that the DSI can indicate whether the distributions of
datasets are identical for any dimensionality and verify its effectiveness
by comparing with state-of-the-art separability/complexity measures
using synthetic datasets and real datasets.

• Chapter 3: Hyperspectral Images-based Cardiac Ablation Lesion


Detection Using Unsupervised Learning contains our studies of

radio-frequency ablation (RFA) lesion detection using k-means cluster-
ing and the application of DSI as a CVI to evaluate clustering results.

• Chapter 4: Breast Cancer Detection Using Explainable Deep Learn-


ing first presents the work of breast cancer detection using the DL mod-
els. The Generative Adversarial Network (GAN) and transfer learning
are applied to address the shortage of images for training CNN models.
To contribute to tools for explainable/transparent deep learning, we
then present our studies of GAN performance evaluation, generaliz-
ability of DNN model analysis, and training accuracy estimation for
two-layer neural networks.

• Chapter 5: Deep Learning-based Medical Images Segmentation


presents research of applying DL-based methods to medical image
segmentation. It includes the studies of using an autoencoder-like
Convolutional and Deconvolutional Neural Network (C-DCNN) model to
automatically segment breast areas in thermal images and to approach
the segmentation problem using trained CNN classifiers.

• Chapter 6: Conclusions and Future Work provides an overall view of


the main topic and our studies, highlights novelties and contributions
of the research, and emphasizes some unfinished studies and inspired
new problems.

Chapter 2: Distance-based Intrinsic Measure of Data Separability1

In machine learning, the performance of a classifier depends on both the


classifier model and the separability/complexity of datasets. To quantita-
tively measure the separability of datasets, we propose an intrinsic measure
– the Distance-based Separability Index (DSI), which is independent of the
classifier model. We then formally show that the DSI can indicate whether
the distributions of datasets are identical for any dimensionality. DSI can
measure separability of datasets because we consider the situation in which
different classes of data are mixed in the same distribution to be the most
difficult for classifiers to separate. And, DSI is verified to be an effective sep-
arability measure by comparing it to state-of-the-art separability/complexity
measures using synthetic datasets and real datasets (CIFAR-10/100).
Having demonstrated the DSI’s ability to compare distributions of sam-
ples, our other studies in the following chapters show that it can be used
in other separability-based applications, such as evaluating the results
of clustering methods in Section 3.5 and measuring the performance of
generative adversarial networks (GANs) in Section 4.4.3.

2.1 Introduction

Data and models are the two main foundations of machine learning
and deep learning. Models learn knowledge (patterns) from datasets. An
example is that the convolutional neural network (CNN) classifier learns how
to recognize images from different classes. There are two aspects in which
we examine the learning process: complexity of the learning model [116]
1 This work has been published in [J1].

and the separability of the dataset [47]. The learning outcomes are highly
dependent on the two aspects. For a specific model, the learning capability is
fixed, so that the training process depends on the training data. Separability
is an intrinsic characteristic of a dataset [76] to describe how data points
belonging to different classes mix with each other.


Figure 2.1: Different separability of two datasets

As reported by Zhang et al. [300], the time to convergence for the same
training loss on random labels on the CIFAR-10 dataset was greater than
when training on true labels. It is not surprising that the performance
of a given model varies between different training datasets depending on
their separability. For example, in a two-class problem, if the scattering
area for each class has no overlap, one straight line (or hyperplane) can
completely separate the data points (Figure 2.1a). For the distribution
shown in Figure 2.1b, however, a single straight line cannot separate the
data points successfully, but a combination of many lines can. In other
words, for a given classifier, it is more difficult to train on some datasets
than on others. The difficulty of training on a less-separable dataset is made
evident by the requirement for greater learning times (e.g., number of epochs
for deep learning) to reach the same accuracy (or loss) value and/or to obtain
a lower accuracy (or higher loss), compared with the more-separable dataset.
The training difficulty, however, also depends on the model employed. In

summary, the separability of a dataset can be characterized in three ways:

1. Data complexity: to describe how data points belonging to different


classes mix with each other.

2. Decision boundary: to determine the number of [hyper-] planes/linear-


dividers needed to separate different-class data points.

3. Training performance: to gain insights into the training of a specific


classifier with regard to time-cost and final accuracy.

Hence, it is significant to create model-independent methods to quan-


titatively measure the separability of datasets. In other words, it is very
useful to be able to measure the separability of a dataset without using a
classifier model. Our proposed method – the Distance-based Separability
Index (DSI), is an intrinsic separability measure, which is independent of
the classifier model. DSI is based on the first way (data complexity): it
analyzes the distances between data points in the dataset. The second way
(decision boundary) and the third way (training performance) depend on
classifiers, and we used the third way to verify our method.
We then verify DSI to be an effective separability measure by comparing it
to state-of-the-art separability/complexity measures using several synthetic
datasets and real datasets (CIFAR-10/100). And, we formally prove that DSI
can indicate whether the distributions of two sample sets are identical for any
dimensionality. Having the ability to compare distributions of samples, our
other studies have shown that DSI can be used to evaluate the performance
of generative adversarial networks (GANs) (Section 4.4.3) and can be applied
as an internal cluster validity index (CVI) to evaluate clustering results
(Section 3.5).

Besides the applications discussed in our studies, DSI has broad po-
tentialities to be applied to other applications in deep learning, machine
learning, and data science. By providing understanding of data separabil-
ity, DSI could help in choosing a proper machine learning model for data
classification [83]. By examining the similarity of the two distributions,
DSI can detect (or certify) the distribution of a sample set, i.e., distribution
estimation. DSI can also be used as a feature selection method [236, 62]
for dimensionality reduction and as an anomaly detection method in data
analysis.

2.2 Related Work

Our review of the literature indicates that there have been substantially
fewer studies on data separability per se than on classifier models. A more
general issue than that of data separability is data complexity [84, 37],
which measures not only the relationship between classes but also the
data distribution in feature space. Ho and Basu [108] conducted a ground-
breaking review of data complexity measures. They reported measures for
classification difficulty, including those associated with the geometrical
complexity of class boundaries. Recently, Lorena et al. [165] summarized
existing methods for the measurement of classification complexity. In the
survey, most complexity measures have been grouped in six categories:
feature-based, linearity, neighborhood, network, dimensionality, and class
imbalance measures (Table 2.1). Other ungrouped measures discussed in
Lorena’s paper have similar characteristics to the grouped measures or may
have large time cost. Each of these methods has possible drawbacks. In
particular, the features extracted from data for the five categories of feature-
based measures may not accurately describe some key characteristics of

the data; some linearity measures depend on the classifier used, such as
support-vector machines (SVMs); neighborhood measures [130] may show
only local information; some network measures may also be affected by local
relationships between classes depending on the computational methods
employed; dimensionality measures are not strongly related to classification
complexity; and, class imbalance measures do not take the distribution of
data into account.

Table 2.1: Complexity measures reported by Lorena et al. [165]

Category          Name                                                    Code
Feature-based     Maximum Fisher's discriminant ratio                     F1
                  Directional vector maximum Fisher's discriminant ratio  F1v
                  Volume of overlapping region                            F2
                  Maximum individual feature efficiency                   F3
                  Collective feature efficiency                           F4
Linearity         Sum of the error distance for linear programming        L1
                  Error rate of the linear classifier                     L2
                  Non-linearity of the linear classifier                  L3
Neighborhood      Fraction of borderline points                           N1
                  Ratio of intra/extra class NN distance                  N2
                  Error rate of the NN classifier                         N3
                  Non-linearity of the NN classifier                      N4
                  Fraction of hyperspheres covering the data              T1
                  Local set average cardinality                           LSC
Network           Density                                                 Density
                  Clustering coefficient                                  ClsCoef
                  Hubs                                                    Hubs
Dimensionality    Average number of features per dimension                T2
                  Average number of PCA dimensions per point              T3
                  Ratio of the PCA dimension to the original dimension    T4
Class imbalance   Entropy of class proportions                            C1
                  Imbalance ratio                                         C2

The Fisher discriminant ratio (FDR) [153], also known as Linear discrim-
inant analysis (LDA), measures the separability of data using the mean and
standard deviation (SD) of each class. FDR is a feature-based measure (F1
and F1v in Table 2.1), and it has been used in many studies. But FDR fails
in some cases (e.g., as Figure 2.5(e) shows, Class 1 data points are scattered
around Class 2 data points in a circle; their FDR ≈ 0.) The initial definition
of FDR considers the separability between two classes to be calculated from
between-class and within-class scatter matrices.
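This failure mode can be illustrated with a small numerical example. The sketch below uses a simplified per-feature Fisher ratio of our own (not the ECoL implementation) on two concentric classes: because the class means coincide, the ratio is close to zero even though the classes are clearly separable.

```python
import numpy as np

def fisher_ratio(x, y):
    """Simplified per-feature Fisher discriminant ratio:
    (difference of class means)^2 / (sum of class variances), maximized over features."""
    num = (x.mean(axis=0) - y.mean(axis=0)) ** 2
    den = x.var(axis=0) + y.var(axis=0) + 1e-12
    return (num / den).max()

rng = np.random.default_rng(0)
# Class 2: a tight blob at the origin; Class 1: a ring scattered around it.
class2 = rng.normal(scale=0.3, size=(1000, 2))
theta = rng.uniform(0, 2 * np.pi, 1000)
class1 = np.c_[3 * np.cos(theta), 3 * np.sin(theta)] + rng.normal(scale=0.1, size=(1000, 2))

print(fisher_ratio(class1, class2))  # near 0: the means coincide although the classes are separable
```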

Figure 2.2: An example of a two-class dataset in 2-D shows the definition
and computation of the DSI. The figure plots the histograms of the ICD sets
of Class X and Class Y and of the BCD set between X and Y, together with
s_x = KS(d_x, d_{x,y}), s_y = KS(d_y, d_{x,y}), and DSI(X, Y) = (s_x + s_y)/2.
Details about the ICD and BCD sets are in Section 2.3.1, and Section 2.3.2
contains more details about computation of the DSI. The proof that DSI can
measure the separability of a dataset is shown in Section 2.3.3.

Inspired by the FDR’s key idea of using between-class and intra-class


measures to define the separation, we create a novel separability measure
for multi-class datasets. Since other previous studies [269, 183, 4, 204,
179] have used the term separability index (SI), we refer to our measure

as the Distance-based Separability Index (DSI). DSI uses the distances
between data points both between-class and intra-class, and it is similar in
some respects to the network measures because it represents the universal
relations between the data points. Especially, we have formally shown that
the DSI can indicate whether the distributions of datasets are identical
for any dimensionality. Figure 2.2 shows the definition and computation
of the DSI by an example of a two-class dataset in 2-D. Similarly inspired
by the idea of FDR, the recent works of Generalized Discrimination Value
(GDV) [238] and Sequentially Forward Selection based on the Separability
(SFSS) algorithm [115] also proposed their separability-based evaluations
for data classes. Since these evaluations use only the averaged measures
like the FDR, however, they also fail in some cases. Our DSI overcomes
these drawbacks because it considers all between-class and intra-class
distance values instead of their mean values. In this paper, we verify DSI by
comparing it with other state-of-the-art separability/complexity measures
based on synthetic and real (CIFAR-10/100) datasets.
In general, the DSI has wide applicability and is not limited to simply
understanding the data; for example, it can also be applied to measure
generative adversarial network (GAN) performance (Section 4.4.3), evaluate
clustering results (Section 3.5), detect anomalies [200], and select
classifiers [193, 30, 31, 83] and features for classification [256, 48].
The novelty of this study is to examine the distributions of datasets via the
distributions of distances between data points in datasets, and the proved
Theorem connects the two kinds of distributions. That is the gist of DSI.
To the best of our knowledge, none of the existing studies uses the same
methods.

2.3 Methodological Development for Distance-based Separability In-
dex (DSI)

In a two-class dataset, we consider that the most difficult situation to


separate the data is when the two classes of data are scattered and mixed
together with the same distribution. In this situation, the proportion of each
class in every small region is equal, and the system has maximum entropy.
In extreme cases, to obtain 100% classification accuracy, the classifier must
separate each data point into an individual region (Figure 2.3).

Figure 2.3: Two-class dataset with maximum entropy

Therefore, one possible definition of data separability is the inverse of


a system’s entropy. To calculate entropy, the space could be randomly
divided into many small regions. Then, the proportions of each class in
every small region can be considered as their occurrence probabilities. The
system’s entropy can be derived from those probabilities [242]. In high-
dimensional space (e.g., image data), however, the number of small regions
grows exponentially. For example, the space for 32 × 32 pixels 8-bit RGB
images has 3,072 dimensions. If each dimension (ranging from 0 to 255,
integer) is divided into 32 intervals, the total number of small regions is
32^3072 ≈ 6.62 × 10^4623. It is thus impossible to compute the system's entropy
and analyze data separability in this way.
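The size of that count can be checked with a quick calculation (a one-line Python check of the figure quoted above):

```python
import math

dims = 32 * 32 * 3          # 3,072 dimensions of a 32x32, 8-bit RGB image
intervals = 32              # 32 intervals per dimension
log10_regions = dims * math.log10(intervals)
print(log10_regions)        # about 4623.8, i.e., roughly 6.6e4623 regions
```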
Alternatively, we can define the data separability as the similarity of
data distributions. If a dataset contains two classes X and Y with the same

distribution (distributions have the same shape, position and support, i.e.,
the same probability density function) and have sufficient data points to fill
the region, this dataset reaches the maximum entropy because within any
small regions, the occurrence probabilities of the two classes data are equal
(50%). It is also the most difficult situation for separation of the dataset.
Here, we proposed a new method – Distance-based Separability Index
(DSI) to measure the similarity of data distributions. DSI is used to analyze
how two classes of data are mixed together, as a substitute for entropy.

2.3.1 Intra-class and between-class distance sets

Before introducing the DSI, we introduce the intra-class distance (ICD)


and the between-class distance (BCD) set that are used for computation of
the DSI. The “set” in this paper means the multiset that allows duplicate
elements (distance values, in our cases), and the |A| of a “set” A is the
number of its elements.
Suppose X and Y have N_x and N_y data points, respectively; we can define:

Definition 2.1. The Intra-Class Distance (ICD) set {d_x} is a set of distances
between any two points in the same class (X): {d_x} = {‖x_i − x_j‖₂ | x_i, x_j ∈ X; x_i ≠ x_j}.

Corollary 2.1. Given |X| = N_x, then |{d_x}| = N_x(N_x − 1)/2.

Definition 2.2. The Between-Class Distance (BCD) set {d_{x,y}} is the set
of distances between any two points from different classes (X and Y):
{d_{x,y}} = {‖x_i − y_j‖₂ | x_i ∈ X; y_j ∈ Y}.

Corollary 2.2. Given |X| = N_x and |Y| = N_y, then |{d_{x,y}}| = N_x N_y.

Remark. The metric for all distances is Euclidean (ℓ² norm) in this paper.
In Section 2.5.3, we compare the Euclidean distance with some other distance
metrics including City-block, Chebyshev, Correlation, Cosine, and
Mahalanobis, and we showed that the DSI based on Euclidean distance has
the best sensitivity to complexity, and thus we selected it.

2.3.2 Definition and computation of the DSI

We firstly introduce the computation of the DSI for a dataset containing
only two classes, X and Y:

1. First, the ICD sets of X and Y, {d_x} and {d_y}, and the BCD set {d_{x,y}} are
computed by their definitions (Defs. 2.1 and 2.2).

2. Second, the similarities between the ICD and BCD sets are then computed
using the Kolmogorov–Smirnov (KS) [78] distance²:

   s_x = KS({d_x}, {d_{x,y}}), and s_y = KS({d_y}, {d_{x,y}}).

   We explain the reasons to choose the KS distance in Section 2.5.2.

3. Finally, the DSI is the average of the two KS distances:

   DSI({X, Y}) = (s_x + s_y) / 2.

Remark. We do not use the weighted average because once the distributions
of the ICD and BCD sets can be well characterized, the sizes of X and Y
will not affect the KS distances s_x and s_y. And, DSI is invariant to location
and scale transformations of the data points because such transformations
are applied equally to all distances between data points. Thus, the (normalized)
histograms of the ICD and BCD sets (as shown in Figure 2.2) will not be
changed, and the DSI keeps the same value.

2 In experiments, we used scipy.stats.ks_2samp from the SciPy package in Python to
compute the KS distance. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
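For illustration, the two-class computation above can be sketched in a few lines of Python using SciPy (scipy.stats.ks_2samp, as noted in the footnote, and scipy.spatial.distance for the pairwise distances). The function and variable names are ours and the example data are synthetic; this is a sketch, not the exact implementation used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def dsi_two_class(X, Y):
    """DSI for two classes: average KS distance between each ICD set and the BCD set."""
    icd_x = pdist(X)            # intra-class distances of X (Euclidean by default)
    icd_y = pdist(Y)            # intra-class distances of Y
    bcd = cdist(X, Y).ravel()   # between-class distances of X and Y
    s_x = ks_2samp(icd_x, bcd).statistic
    s_y = ks_2samp(icd_y, bcd).statistic
    return 0.5 * (s_x + s_y)

# Example: two well-separated Gaussian blobs give a high DSI.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, size=(500, 2))
Y = rng.normal(loc=8.0, size=(500, 2))
print(dsi_two_class(X, Y))
```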

For a multi-class dataset, the DSI can be computed by one-versus-others;
specifically, for an n-class dataset, the process to obtain its DSI is:

1. Compute n ICD sets, one for each class: {d_{C_i}}; i = 1, 2, ..., n, and compute n
BCD sets, one for each class. For the i-th class of data C_i, the BCD set is
the set of distances between any two points in C_i and C̄_i (the other classes,
not C_i): {d_{C_i, C̄_i}}.

2. Compute the n KS distances between the ICD and BCD sets for each class:

   s_i = KS({d_{C_i}}, {d_{C_i, C̄_i}}).

3. Calculate the average of the n KS distances; the DSI of this dataset is:

   DSI({C_i}) = (Σ_i s_i) / n.

Therefore, DSI is (defined as) the mean value of KS distances between


the ICD and BCD sets for each class of data in a dataset.

Remark. DSI ∈ (0, 1). A small DSI (low separability) means that the ICD and
BCD sets are very similar. In this case, by Theorem 2.1, the distributions
of datasets are similar too. Hence, these datasets are difficult to separate.
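The one-versus-others procedure can be sketched as below; again this is only an illustrative implementation with names of our choosing, not necessarily the code used in our experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def dsi(data, labels):
    """DSI of a multi-class dataset: mean KS distance between each class's
    ICD set and its one-versus-others BCD set."""
    scores = []
    for c in np.unique(labels):
        in_c = data[labels == c]
        rest = data[labels != c]
        icd = pdist(in_c)                # ICD set of class c
        bcd = cdist(in_c, rest).ravel()  # BCD set between class c and all other classes
        scores.append(ks_2samp(icd, bcd).statistic)
    return float(np.mean(scores))
```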

2.3.3 Theorem: DSI and similarity of data distributions

Theorem 2.1 below shows how the ICD and BCD sets are related to
the distributions of the two-class data; it demonstrates the core value of
this study.

Theorem 2.1. When |X| and |Y | → ∞, if and only if the two classes X and Y
have the same distribution, the distributions of the ICD and BCD sets are
identical.

The full proof of Theorem 2.1 is in Section 2.3.4. Here we provide an


informal explanation:

Data points in X and Y having the same distribution can be


considered to have been sampled from one distribution Z. Hence,
both ICDs of X and Y , and BCDs between X and Y are actually
ICDs of Z. Consequently, the distributions of ICDs and BCDs are
identical. In other words, that the distributions of the ICD and
BCD sets are identical indicates all labels are assigned randomly
and thus, the dataset has the least separability.

According to this theorem, that the distributions of the ICD and BCD
sets are identical indicates that the dataset has maximum entropy because
X and Y have the same distribution. Thus, as we discussed before, the
dataset has the lowest separability. And in this situation, the dataset’s DSI
≈ 0 by its definition.
The time costs for computing the ICD and BCD sets increase linearly with
the number of dimensions and quadratically with the number of data points.
It is much better than computing the dataset’s entropy by dividing the space
into many small regions. Our experiments (in Section 2.4.2.1) show that
the time costs could be greatly reduced using a small random subset of
the entire dataset without significantly affecting the results (Figure 2.8).
And in practice, the computation of DSI can be sped-up considerably by
using tensor-based matrix multiplications on a GPU (e.g., it takes about 2.4
seconds for 4000 images from CIFAR-10 running on a GTX 1080 Ti graphics
card) because the main time-cost is the computation of distances.
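As an illustration of such a GPU speed-up, the distance sets can be formed with batched tensor operations; the hedged PyTorch sketch below uses torch.pdist/torch.cdist and assumes a CUDA device is available. The timing quoted above may come from a different implementation.

```python
import torch

def icd_bcd_gpu(X, Y, device="cuda"):
    """Form the ICD and BCD distance sets with batched tensor ops on a GPU."""
    X = torch.as_tensor(X, dtype=torch.float32, device=device)
    Y = torch.as_tensor(Y, dtype=torch.float32, device=device)
    dx = torch.pdist(X)                  # ICDs of X
    dy = torch.pdist(Y)                  # ICDs of Y
    dxy = torch.cdist(X, Y).reshape(-1)  # BCDs between X and Y
    return dx.cpu().numpy(), dy.cpu().numpy(), dxy.cpu().numpy()
```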

2.3.4 Proof of the Theorem

Consider two classes X and Y that have the same distribution (distributions
have the same shape, position, and support, i.e., the same probability
density function) and have sufficient data points (|X| and |Y| → ∞) to fill
their support domains. Suppose X and Y have N_x and N_y data points, and
assume the sampling density ratio is N_y / N_x = α. Before providing the proof of
Theorem 2.1, we firstly prove Lemma 2.1, which will be used later.

Remark. The condition of most relevant equations in the proof is that
N_x and N_y are approaching infinity in the limit.

Lemma 2.1. If and only if two classes X and Y have the same distribution
covering region Ω and N_y / N_x = α, for any sub-region ∆ ⊆ Ω, with X and Y having
n_{x_i}, n_{y_i} points, n_{y_i} / n_{x_i} = α holds.

Proof. Assume the distributions of X and Y are f(x) and g(y). In the
union region of X and Y, arbitrarily take one tiny cell (region) ∆_i with
n_{x_i} = ∆_i f(x_i) N_x and n_{y_i} = ∆_i g(y_j) N_y; x_i = y_j. Then,

    n_{y_i} / n_{x_i} = [∆_i g(x_i) N_y] / [∆_i f(x_i) N_x] = α · g(x_i) / f(x_i)

Therefore:

    α · g(x_i) / f(x_i) = α  ⇔  g(x_i) / f(x_i) = 1  ⇔  ∀x_i : g(x_i) = f(x_i)

2.3.4.1 Sufficient condition

Sufficient condition of Theorem 2.1. If the two classes X and Y have
the same distribution and sufficient data points (|X| and |Y| → ∞), then
the distributions of the ICD and BCD sets are nearly identical.


Figure 2.4: Two non-overlapping small cells

Proof. Within the area, select two tiny non-overlapping cells (regions) ∆_i and
∆_j (Figure 2.4). Since X and Y have the same distribution but in general
different densities, the numbers of points in the two cells, n_{x_i}, n_{y_i}; n_{x_j}, n_{y_j}, fulfill:

    n_{y_i} / n_{x_i} = n_{y_j} / n_{x_j} = α

The scale of the cells is δ; the ICDs and BCDs of the X and Y data points in cell ∆_i are
approximately δ because the cell is sufficiently small. By Definitions 2.1 and 2.2:

    d_{x_i} ≈ d_{x_i, y_i} ≈ δ;  x_i, y_i ∈ ∆_i

Similarly, the ICDs and BCDs of the X and Y data points between cells ∆_i and
∆_j are approximately the distance between the two cells, D_{ij}:

    d_{x_{ij}} ≈ d_{x_i, y_j} ≈ d_{y_i, x_j} ≈ D_{ij};  x_i, y_i ∈ ∆_i;  x_j, y_j ∈ ∆_j

First, divide the whole distribution region into many non-overlapping cells.
Arbitrarily select two cells ∆_i and ∆_j to examine the ICD set for X and the
BCD set for X and Y. By Corollaries 2.1 and 2.2:

i) The ICD set for X has two distances, δ and D_{ij}, and their numbers are:

    d_{x_i} ≈ δ;  x_i ∈ ∆_i :  |{d_{x_i}}| = n_{x_i}(n_{x_i} − 1)/2

    d_{x_{ij}} ≈ D_{ij};  x_i ∈ ∆_i;  x_j ∈ ∆_j :  |{d_{x_{ij}}}| = n_{x_i} n_{x_j}

ii) The BCD set for X and Y also has two distances, δ and D_{ij}, and their
numbers are:

    d_{x_i, y_i} ≈ δ;  x_i, y_i ∈ ∆_i :  |{d_{x_i, y_i}}| = n_{x_i} n_{y_i}

    d_{x_i, y_j} ≈ d_{y_i, x_j} ≈ D_{ij};  x_i, y_i ∈ ∆_i;  x_j, y_j ∈ ∆_j :
    |{d_{x_i, y_j}}| = n_{x_i} n_{y_j};  |{d_{y_i, x_j}}| = n_{y_i} n_{x_j}

Therefore, the proportions of the number of distances with a value of D_{ij} in
the ICD and BCD sets are:

For ICDs:

    |{d_{x_{ij}}}| / |{d_x}| = 2 n_{x_i} n_{x_j} / [N_x (N_x − 1)]

For BCDs, considering the density ratio:

    (|{d_{x_i, y_j}}| + |{d_{y_i, x_j}}|) / |{d_{x,y}}| = (α n_{x_i} n_{x_j} + α n_{x_i} n_{x_j}) / (α N_x²) = 2 n_{x_i} n_{x_j} / N_x²

The ratio of the proportions of the number of distances with a value of D_{ij} in
the two sets is:

    N_x (N_x − 1) / N_x² = 1 − 1/N_x → 1   (N_x → ∞)

This means that the proportions of the number of distances with
a value of D_{ij} in the two sets are equal. We then examine the proportions of
the number of distances with a value of δ in the ICD and BCD sets.

For ICDs:

    Σ_i |{d_{x_i}}| / |{d_x}| = Σ_i [n_{x_i}(n_{x_i} − 1)] / [N_x (N_x − 1)] = [Σ_i (n_{x_i}²) − N_x] / (N_x² − N_x)

For BCDs, considering the density ratio:

    Σ_i |{d_{x_i, y_i}}| / |{d_{x,y}}| = Σ_i (n_{x_i}²) / N_x²

The ratio of the proportions of the number of distances with a value of δ in the
two sets is:

    [Σ_i (n_{x_i}²) / N_x²] · [(N_x² − N_x) / (Σ_i (n_{x_i}²) − N_x)]
        = Σ_i (n_{x_i}² / N_x²) · (1 − 1/N_x) / [Σ_i (n_{x_i}² / N_x²) − 1/N_x] → 1   (N_x → ∞)

This means that the proportions of the number of distances with
a value of δ in the two sets are equal.

In summary, the fact that the proportion of any distance value (δ or D_{ij})
in the ICD set for X and in the BCD set for X and Y is equal indicates that the
distributions of the ICD and BCD sets are identical, and a corresponding
proof applies to the ICD set for Y.

2.3.4.2 Necessary condition

Necessary condition of Theorem 2.1. If the distributions of the ICD


and BCD sets with sufficient data points (|X| and |Y | → ∞) are nearly identical,
then the two classes X and Y must have the same distribution.

Remark. We prove its contrapositive: if X and Y do not have the same
distribution, the distributions of the ICD and BCD sets are not identical.
We then apply proof by contradiction: suppose that X and Y do not have
the same distribution, but the distributions of the ICD and BCD sets are
identical.

Proof. Suppose classes X and Y have N_x and N_y data points, with N_y / N_x = α.
Divide their distribution area into many non-overlapping tiny cells (regions).
In the i-th cell ∆_i, since the distributions of X and Y are different, according to
Lemma 2.1, the numbers of points in the cell, n_{x_i} and n_{y_i}, fulfill:

    n_{y_i} / n_{x_i} = α_i;  ∃ α_i ≠ α

The scale of the cells is δ and the ICDs and BCDs of the X and Y points in cell
∆_i are approximately δ because the cell is sufficiently small:

    d_{x_i} ≈ d_{y_i} ≈ d_{x_i, y_i} ≈ δ;  x_i, y_i ∈ ∆_i

In the i-th cell ∆_i:

i) The ICD of X is δ, with a proportion of:

    Σ_i |{d_{x_i}}| / |{d_x}| = Σ_i [n_{x_i}(n_{x_i} − 1)] / [N_x (N_x − 1)] = [Σ_i (n_{x_i}²) − N_x] / (N_x² − N_x)    (2.1)

ii) The ICD of Y is δ, with a proportion of:

    Σ_i |{d_{y_i}}| / |{d_y}| = Σ_i [n_{y_i}(n_{y_i} − 1)] / [N_y (N_y − 1)] = [Σ_i (n_{y_i}²) − N_y] / (N_y² − N_y)
        = [Σ_i (α_i² n_{x_i}²) − α N_x] / (α² N_x² − α N_x)    (2.2)

    (using N_y = α N_x and n_{y_i} = α_i n_{x_i})

iii) The BCD of X and Y is δ, with a proportion of:

    Σ_i |{d_{x_i, y_i}}| / |{d_{x,y}}| = Σ_i (n_{x_i} n_{y_i}) / (N_x N_y) = Σ_i (α_i n_{x_i}²) / (α N_x²)    (2.3)

For the distributions of the two sets to be identical, the ratio of the proportions
of the number of distances with a value of δ in the two sets must be 1, that
is, (2.3)/(2.1) = (2.3)/(2.2) = 1. Therefore:

    (2.3)/(2.1) = [Σ_i (α_i n_{x_i}²) / (α N_x²)] · [(N_x² − N_x) / (Σ_i (n_{x_i}²) − N_x)]
                = (1/α) Σ_i (α_i n_{x_i}² / N_x²) · (1 − 1/N_x) / [Σ_i (n_{x_i}² / N_x²) − 1/N_x]
                → (1/α) · Σ_i (α_i n_{x_i}²) / Σ_i (n_{x_i}²) = 1   (N_x → ∞)    (2.4)

Similarly,

    (2.3)/(2.2) = [Σ_i (α_i n_{x_i}²) / (α N_x²)] · [(α² N_x² − α N_x) / (Σ_i (α_i² n_{x_i}²) − α N_x)]
                = Σ_i (α_i n_{x_i}² / N_x²) · (α − 1/N_x) / [Σ_i (α_i² n_{x_i}² / N_x²) − α/N_x]
                → α · Σ_i (α_i n_{x_i}²) / Σ_i (α_i² n_{x_i}²) = 1   (N_x → ∞)    (2.5)

To eliminate Σ_i (α_i n_{x_i}²) by considering Equations (2.4) and (2.5), we
have:

    Σ_i (n_{x_i}²) = Σ_i (α_i² n_{x_i}²) / α²

Let ρ_i = (α_i / α)²; then,

    Σ_i (n_{x_i}²) = Σ_i (ρ_i n_{x_i}²)

Since n_{x_i} could be any value, to hold the equation requires ρ_i = 1. Hence:

    ∀ρ_i = (α_i / α)² = 1  ⇒  ∀α_i = α

This contradicts ∃ α_i ≠ α. Therefore, the contrapositive proposition has been
proved.

2.4 Experiments

We test our proposed DSI measure on two-class synthetic and multi-


class real datasets and compare it with other complexity measures from
the Extended Complexity Library (ECoL) package [165] in R programming
language. Since the DSI is computed using KS distances between the ICD
and BCD sets, it ranges from 0 to 1. For separability, a higher DSI value
means the dataset is easier to separate, i.e., it has lower data complexity.
Hence, to compare it with other complexity measures, we use (1 − DSI). In
this paper, higher complexity means lower separability (i.e., Separability =
1 − Complexity).

2.4.1 Two-class Synthetic Data

2.4.1.1 Typical Datasets

In this section, we present the results of the DSI and the other com-
plexity measures (listed in Table 2.1) for several typical two-class datasets3 .
Figure 2.5 displays their plots and histograms of the ICD sets (for Class 1
and Class 2) and the BCD set (between Class 1 and Class 2). Each class
consists of 1,000 data points.
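For reference, several of these typical datasets can be generated with the sklearn.datasets samples generators cited in the footnote; the Spirals and XOR sets require small custom generators and are omitted below, and the noise parameters are illustrative choices rather than the exact settings used for Figure 2.5.

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles, make_blobs

n = 1000  # points per class (2,000 points per dataset)
rng = np.random.default_rng(0)

datasets = {
    # Random: both classes drawn from the same uniform distribution on [0, 1)^2
    "Random": (rng.uniform(size=(2 * n, 2)), np.repeat([0, 1], n)),
    "Moons": make_moons(n_samples=2 * n, noise=0.05, random_state=0),
    "Circles": make_circles(n_samples=2 * n, noise=0.03, factor=0.5, random_state=0),
    "Blobs": make_blobs(n_samples=2 * n, centers=2, random_state=0),
}
```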
Table 2.2 presents the results for these measures shown in Table 2.1 and
our proposed DSI. The measures noted by “*” are considered to have failed
in measuring separability and are not used for subsequent experiments.
3 These datasets are created by the Samples Generator in sklearn.datasets: https://scikit-learn.org/stable/modules/classes.html#samples-generator

Figure 2.5: Typical two-class datasets and their ICD and BCD set distributions:
(a) Random, (b) Spirals, (c) XOR, (d) Moons, (e) Circles, (f) Blobs. Each panel
shows the data plot and the histograms of the ICD sets for Class 1 and Class 2
and the BCD set between the classes.

In particular, the dimensionality and class-imbalance measures do not

work with separability in this situation. The feature-based and linearity


measures measured the XOR dataset as having more complexity than the
Random dataset; since the XOR has much clearer boundaries than Random
between the two classes, these measures are inappropriate for measuring
separability. N1 and N3 produce the same values for the Spiral, Moon,
Circle, and Blob datasets, even though the Spiral dataset is obviously more
difficult to separate than the Blob dataset, which is the most separable
because a single line can be used to separate the two classes. However, the
ClsCoef and Hubs measures assign the Blob dataset greater complexity than
some other cases. In this experiment, N2, N4, T1, LSC, Density, and the

proposed measure (1 − DSI) are shown to accurately reflect the separability
of these datasets.

Table 2.2: Complexity measures results for the two-class datasets (Fig-
ure 2.5). The measures noted by “*” failed to measure separability.

Category Code Random Spirals XOR Moons Circles Blobs


*F1 0.998 0.947 1.000 0.396 1.000 0.109
*F1v 0.991 0.779 0.999 0.110 1.000 0.019
Feature-based *F2 0.996 0.719 0.996 0.151 0.329 0.006
*F3 0.997 0.843 0.998 0.397 0.708 0.007
*F4 0.995 0.827 0.997 0.199 0.500 0.000
*L1 0.201 0.170 0.328 0.074 0.233 0.000
Linearity *L2 0.485 0.407 0.487 0.114 0.458 0.000
*L3 0.469 0.399 0.486 0.055 0.454 0.000
*N1 0.719 0.001 0.040 0.001 0.001 0.001
N2 0.502 0.052 0.071 0.025 0.043 0.017
*N3 0.500 0.000 0.019 0.000 0.000 0.000
Neighborhood
N4 0.450 0.359 0.152 0.099 0.162 0.000
T1 0.727 0.045 0.043 0.008 0.012 0.001
LSC 0.999 0.976 0.934 0.840 0.914 0.526
Density 0.916 0.919 0.864 0.847 0.880 0.812
Network *ClsCoef 0.352 0.343 0.267 0.225 0.253 0.332
*Hubs 0.775 0.822 0.857 0.767 0.650 0.842
*T2 0.001 0.001 0.001 0.001 0.001 0.001
Dimensionality *T3 0.001 0.001 0.001 0.001 0.001 0.001
*T4 1.000 1.000 1.000 1.000 1.000 1.000
*C1 1.000 1.000 1.000 1.000 1.000 1.000
Class imbalance
*C2 0.000 0.000 0.001 0.000 0.000 0.000
Proposed 1−DSI 0.994 0.953 0.775 0.643 0.545 0.027

2.4.1.2 Training Distinctness and the Two-cluster Dataset

Definition 2.3 (Training Distinctness). The Training Distinctness (TD) is


the average training accuracy during the training process of a classifier
model.

Remark. To quantify the difficulty of training the classifier, we define the


Training Distinctness (TD). A lower TD value means that a dataset is more
difficult to train, and this difficulty can reflect the separability of the dataset.
Hence, TD is the baseline of data separability.

In this section, we synthesize a two-class dataset that has different
separability levels. The dataset has two clusters, one for each class. The
parameter controlling the standard deviation (SD) of clusters influences
separability (Figure 2.6), and the baseline is the TD we defined.

Figure 2.6: Two-class datasets with different cluster standard deviations (SD)
and trained decision boundaries: (a) Cluster_SD = 1, (b) Cluster_SD = 2,
(c) Cluster_SD = 3, (d) Cluster_SD = 4. Legend: Class 1, Class 2, Decision
Area 1, Decision Area 2.

We created nine two-class datasets⁴, and each dataset has 2,000 data
points (1,000 per class) and two cluster centers, one for each class; the
SD parameters of the clusters are set from 1 to 9. As the cluster SD increases,
the distributions of the two classes overlap and mix together more,
thus reducing the separability of the datasets.

4 By using the sklearn.datasets.make_blobs function in Python

We use a simple fully-connected neural network (FCNN) model to classify
these two-class datasets. This FCNN model has three hidden layers with
16, 32, and 16 neurons, respectively, and ReLU activation functions in
each layer. The classifier was trained on each of the nine datasets, each time
from scratch. We set 1,000 epochs for each training session to compute the
TD of each dataset.
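The setup can be sketched as follows. This is an illustrative Keras implementation: the dataset generation and the 16-32-16 ReLU architecture follow the description above, while the optimizer, loss, and other unstated hyperparameters are our assumptions, not necessarily those used in the experiments.

```python
import numpy as np
from sklearn.datasets import make_blobs
from tensorflow import keras

def training_distinctness(cluster_sd, epochs=1000):
    """Train the small FCNN on one two-cluster dataset and return the TD
    (average training accuracy over all epochs, Definition 2.3)."""
    X, y = make_blobs(n_samples=2000, centers=2, cluster_std=cluster_sd, random_state=0)
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(2,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=epochs, verbose=0)
    return float(np.mean(history.history["accuracy"]))
```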
In this case, separability could be clearly visualized by the complexity of
the decision boundary. Figure 2.6 shows that datasets with a larger cluster
SD need more complex decision boundaries. In fact, if a classifier model
can produce decision boundaries of any complexity, it can achieve 100%
training accuracy for any dataset (provided no two data points from different classes
have identical features), but the training steps (i.e., epochs) required
to reach 100% training accuracy may vary. For a specific model, a more
complex decision boundary may need more steps to train. Therefore, the
average training accuracy throughout the training process – i.e., TD – can
indicate the complexity of the decision boundary and the separability of the
dataset.
Since the training accuracy ranges from 0.5 to 1.0 for two-class classifi-
cation, to enable a comparison with other measures that range from 0 to 1,
we normalize the accuracy by the function:

r(x) = (x − 0.5)/0.5,

and rTD = r(TD). The range of rTD is from 0 to 1, and the lowest complexity
(highest separability) is 1. We also compute N2, N4, T1, LSC, Density, and
the proposed measure (1 − DSI) for the nine datasets and present them
together with rTD as a baseline for separability in Figure 2.7.
Figure 2.7: Complexity measures for two-class datasets with different cluster
SDs. The x-axis is Cluster_SD (1 to 9); the plotted measures are N2, N4, T1,
LSC, Density, (1 − DSI), and rTD.

As shown in Figure 2.7, the rTD for datasets with larger cluster SDs

is lower. Lower rTD indicates lower separability and higher complexity.


The measures N2, N4, T1, and the proposed measure (1 − DSI) reflect the
complexity of these datasets well, but the LSC and Density measures do not
well reflect the complexity because they have relatively high values for the
linearly separable dataset (Cluster_SD = 1, see Figure 2.6a) and increase very
slightly for the Cluster_SD = 5 to Cluster_SD = 9 datasets. The measures
N2, N4, and T1 perform similarly to each other. By comparison with them,
(1 − DSI) is the most sensitive measure to the change in separability and has
the widest range.

2.4.2 CIFAR-10/100 Datasets

We next use real images from the CIFAR-10/100 database5 [145] to


examine the separability measures. A simple CNN with four convolutional
layers, two max-pooling layers, and one dense layer is trained to classify
images; Appendix A presents its detailed architecture.
The CNN classifier is trained on 50,000 images from the CIFAR-10/100
database. To change the classification performance (i.e., the TD), we apply
several image pre-processing methods to the input images before training
the CNN classifier. These pre-processing methods are supposed to change
the distribution of the training images, and thus alter the separability of the
dataset. And, the change of data separability will affect the classification
results for the given CNN in terms of the TD.

2.4.2.1 Time Complexity and Using Subsets

The main time cost of DSI is to form the ICD and BCD sets by calculating the
Euclidean (ℓ²-norm) distances between any two data points. If N corresponds
to the number of points in a dataset and d stands for its dimensionality
(number of features), the time cost of DSI is O(d · N²), which is the same
as that of the comparable complexity measures: N2, N4, T1, LSC, and Density
(referring to Table 1 in Lorena et al.'s paper [165]).
Images in the CIFAR-10 dataset are grouped into 10 classes and the
CIFAR-100 dataset consists of 20 super-classes. Both CIFAR-10 and CIFAR-100
consist of N = 50,000 images (32×32, 8-bit RGB), and each image has d =
3,072 pixels (features). Thus, applying the measures to all 50,000 images
would be very time-consuming (including the DSI, most of the measures
have a time cost of O(d · N²)).


5 The URL for downloading the dataset: https://www.cs.toronto.edu/~kriz/cifar.html

We randomly select subsets of 1/5, 1/10, 1/50, 1/100, and 1/500 of
the original training images (i.e., without pre-processing) from CIFAR-10
and compute their DSIs. For each subset, we repeat the random selection
and DSI computation eight times to calculate the mean and SD of DSIs.
Figure 2.8 shows that the subset containing 1/50 training images or more
does not significantly affect the measures. For example, the DSI for the
whole (50,000) training images is 0.0945, while the DSI for a subset of 1,000
randomly selected images is 0.1043 ± 0.0049 – the absolute difference is up to
0.015 (16%) but with an execution speed that is about 2,500 times greater:
computing the DSI for 1,000 images requires about 30 seconds; for the
whole training dataset, the DSI calculation requires about 20 hours. In
addition, because the same subset is used for all measures, the comparison
results are not affected. Therefore, we have randomly selected 1,000 training
images to compute the measures, and this subset still accurately reflects
the separability/complexity of the entire dataset.
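The subsampling procedure can be sketched as below, where dsi() denotes the multi-class DSI function sketched in Section 2.3.2; the number of repeats follows the experiment described above, and the helper name is ours.

```python
import numpy as np

def subset_dsi(data, labels, fraction, repeats=8, seed=0):
    """Mean and SD of DSI over repeated random subsets of a dataset
    (dsi() as sketched in Section 2.3.2)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    k = int(n * fraction)
    scores = []
    for _ in range(repeats):
        idx = rng.choice(n, size=k, replace=False)
        scores.append(dsi(data[idx], labels[idx]))
    return float(np.mean(scores)), float(np.std(scores))
```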

Figure 2.8: DSIs of CIFAR-10 subsets

2.4.2.2 Results

We use the functions in PIL.ImageEnhance (PIL is the Python Imaging


Library) with five pre-processing methods applied to the original training
images from CIFAR-10/100: Color (factor = 2) and Sharpness (2), Color (2),
Contrast (2), Color (0.1), and Contrast (0.5). Including the original images,
we use six image datasets to compute the DSI, TD, and other measures.
For the 10-class classification, the training accuracy ranges from 0.1 to 1.0.
The TD is not regularized in this section because it has a range close to
[0, 1].
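The pre-processing uses the standard PIL.ImageEnhance interface; a sketch of the six settings, with the factors listed above, is given below (the method labels are our own shorthand).

```python
from PIL import Image, ImageEnhance

def preprocess(img: Image.Image, method: str) -> Image.Image:
    """Apply one of the pre-processing settings described above to a PIL image."""
    if method == "Col2Sharp2":
        img = ImageEnhance.Color(img).enhance(2.0)
        return ImageEnhance.Sharpness(img).enhance(2.0)
    if method == "Col2":
        return ImageEnhance.Color(img).enhance(2.0)
    if method == "Contr2":
        return ImageEnhance.Contrast(img).enhance(2.0)
    if method == "Col0.1":
        return ImageEnhance.Color(img).enhance(0.1)
    if method == "Contr0.5":
        return ImageEnhance.Contrast(img).enhance(0.5)
    return img  # "Original": leave the image unchanged
```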

Figure 2.9: Manipulation (e.g., pre-processing) of images in datasets can
change their complexities. We then simultaneously compare different methods
of pre-processing and of complexity measures (the y-axes), including the
Training Distinctness (TD, Definition 2.3) as ground truth, on the CIFAR-10
(left panel) and CIFAR-100 (right panel) datasets. The x-axes show pre-processing
methods, from left to right: Color (factor = 2) and Sharpness (2), Color (2),
Contrast (2), Original, Color (0.1), and Contrast (0.5).

Figure 2.9 shows the results for CIFAR-10 and CIFAR-100. The x-axis

shows the pre-processing methods applied to the datasets, decreasingly
ordered from left to right by TD, which is the baseline of data separability.
Since a lower TD indicates lower separability and higher complexity, the
values of complexity measures should strictly increase from left to right.
We put specific values of measures in the Tables 2.3 and 2.4 because some
differences of complexity measures’ results are small and not obviously
shown by curves. By examining these values, we clearly find that the
measures LSC, T1 (which almost overlaps with LSC) and Density have high
values and remain nearly flat from left to right (insensitive), while N2 and
N4 decrease for the Contrast (2) pre-processing stage. Unlike the other
measures, (1−DSI) monotonically increases from left to right and correctly
reflects (and is more sensitive to) the complexity of these datasets. These
results show the advantage of DSI and indicate that image pre-processing
is useful for improving CNN performance in image classification.

Table 2.3: Values of complexity measures for CIFAR-10

Method Code Col2Sharp2 Col2 Contr2 Original Col0.1 Contr0.5


N2 0.5169 0.5174 0.5158 0.5185 0.5226 0.5219
N4 0.0640 0.0950 0.0610 0.0930 0.1340 0.1960
T1 0.9940 0.9970 0.9990 0.9990 0.9990 1.0000
LSC 0.9985 0.9985 0.9984 0.9985 0.9986 0.9986
Density 0.9682 0.9684 0.9697 0.9701 0.9722 0.9737
1−DSI 0.8739 0.8762 0.8829 0.8973 0.9147 0.9224
TD 0.8207 0.8122 0.8071 0.8051 0.7925 0.7818

Table 2.4: Values of complexity measures for CIFAR-100

Method Code Col2Sharp2 Col2 Contr2 Original Col0.1 Contr0.5


N2 0.5304 0.5324 0.5292 0.5315 0.5383 0.5386
N4 0.0960 0.1070 0.0520 0.1420 0.1530 0.2320
T1 0.9960 0.9970 0.9980 0.9960 0.9970 0.9980
LSC 0.9988 0.9988 0.9987 0.9988 0.9989 0.9988
Density 0.9849 0.9849 0.9851 0.9855 0.9867 0.9873
1−DSI 0.8916 0.8933 0.8963 0.8964 0.9005 0.9160
TD 0.6696 0.6563 0.6466 0.6403 0.6380 0.6359

2.5 Discussion

2.5.1 Comparison of Distributions

This work is motivated by the need for a new metric to measure the
difficulty of a dataset to be classified by machine learning models. This
measure of a dataset’s separability is an intrinsic characteristic of a dataset,
independent of classifier models, that describes how data points belonging
to different classes are mixed. To measure the separability of mixed data
points of two classes is essentially to evaluate whether the two datasets are
from the same distribution. According to Theorem 2.1, the DSI provides an
effective way to verify whether the distributions of two sample sets
are identical for any dimensionality.
As discussed in Section 2.3, if the DSI of sample sets is close to zero,
the very low separability means that the two classes of data are scattered
and mixed together with nearly the same distribution. The DSI transforms
the comparison of distributions problem in Rⁿ (for two sample sets) to
the comparison of distributions problem in R¹ (i.e., ICD and BCD sets) by
computing the distances between samples. For example, in Figure 2.5(a),
samples from Class 1 and 2 come from the same uniform distribution in R²
over [0, 1)². Consequently, the distributions of their ICD and BCD sets are
almost identical and the DSI is about 0.0058. In this case, each class has
1,000 data points. For twice the number of data points, the DSI decreases
to about 0.0030. When there are more data points of two classes from the
same distribution, the DSI will approach zero, which is the limit of the DSI
if the distributions of two sample sets are identical.
For another example, we equally divide 5,000 airplane-labeled images
from the CIFAR-10 dataset into two subsets: AIR1 and AIR2. We then take

a subset of 2,500 automobile-labeled images from the same dataset, named
AUTO. The DSI of the mixed set: AIR1 and AIR2 is about 0.0045. The DSI
of the mixed set: AIR1 and AUTO is about 0.1083. Since the images in AIR1
and AIR2 are from the same airplane class and could be considered having
the same distribution, the DSI of the AIR1 and AIR2 mixed set is closer to
zero.
In summary, to test whether two distributions are identical, we firstly
take labeled data as many as possible from the two distributions. We then
compute the DSI of these data and see how close the value is to zero. The
closer the DSI is to zero, the more likely the two distributions are similar.
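As a usage illustration, the AIR1/AIR2/AUTO comparison above could be reproduced along the following lines, with dsi() denoting the function sketched in Section 2.3.2; the pixel scaling to [0, 1] is our choice, and the exact values obtained will differ somewhat from those quoted.

```python
import numpy as np
from tensorflow.keras.datasets import cifar10

(x_train, y_train), _ = cifar10.load_data()
y_train = y_train.ravel()
airplanes = x_train[y_train == 0].reshape(5000, -1) / 255.0  # class 0: airplane
autos = x_train[y_train == 1].reshape(5000, -1) / 255.0      # class 1: automobile

air1, air2 = airplanes[:2500], airplanes[2500:]               # AIR1 and AIR2
auto = autos[:2500]                                           # AUTO

# dsi() as sketched in Section 2.3.2: near 0 for AIR1 vs. AIR2, larger for AIR1 vs. AUTO.
labels = np.repeat([0, 1], 2500)
print(dsi(np.vstack([air1, air2]), labels))
print(dsi(np.vstack([air1, auto]), labels))
```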

2.5.2 Kolmogorov–Smirnov Test and Other Measures

It is noteworthy that our DSI is compatible with various measures of dis-


tances and distributions. The Euclidean distance and Kolmogorov–Smirnov
(KS) distance are selected because, based on our experiments, we found
that DSI has better sensitivity to separability by using those measures than
by the other mentioned measures. The best sensitivity means the change of
separability leads to the greatest difference of DSI.
One key step in DSI computation is to examine the similarity of the
distributions of the ICD and BCD sets. We applied the KS distance in our
study. The result of a two-sample KS distance is the maximum distance
between two cumulative distribution functions (CDFs):

    KS(P, Q) = sup_x |P(x) − Q(x)|,

where P and Q are the respective CDFs of the two distributions p and q.
Although many statistical measures, such as the Bhattacharyya distance,
Kullback–Leibler divergence, and Jensen–Shannon divergence, could be
used to compare the similarity between two distributions, most of them
require the two sets to have the same number of data points. It is easy to
show that the ICD and BCD sets (|{d_x}|, |{d_y}|, and |{d_{x,y}}|) cannot be the
same size. For example, the f-divergence [196]:

    D_f(P, Q) = ∫ q(x) f(p(x) / q(x)) dx


cannot be used to compute the DSI because the ICD and BCD have different
numbers of values, thus the distributions p and q are in different domains.
Measures based on CDFs can solve this problem because CDFs exist in
the union domain of p and q. Therefore, the Wasserstein-distance [212] (W-
distance) can be applied as an alternative similarity measure. For two 1-D
distributions (e.g., ICD and BCD sets), the result of W-distance represents
the difference in the area of the two CDFs:

    W₁(P, Q) = ∫ |P(x) − Q(x)| dx

The DSI uses the KS distance rather than the W-distance because we find
that normalized W-distance is not as sensitive as the KS distance for mea-
suring separability. To illustrate this, we compute the DSI by using the two
distribution measures for the nine two-cluster datasets in Section 2.4.1.2.
The two DSIs are then compared by the baseline rTD, which is also used
in Section 2.4.1.2. Figure 2.10 shows that along with the separability of
the datasets decreasing, KS distance has a wider range of decrease than
the W-distance. Hence, the KS distance is considered a better distribution
measure for the DSI in terms of revealing differences in the separability of
datasets.
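Both statistics are available in SciPy, so swapping the distribution measure in the DSI computation amounts to changing one call. The sketch below uses our own helper names and, for brevity, omits the normalization of the W-distance mentioned above.

```python
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp, wasserstein_distance

def class_similarity(X, Y, measure="ks"):
    """Similarity of the ICD set of X to the BCD set of X and Y, using either
    the KS distance or the (unnormalized) 1-D Wasserstein distance."""
    icd_x = pdist(X)
    bcd = cdist(X, Y).ravel()
    if measure == "ks":
        return ks_2samp(icd_x, bcd).statistic
    return wasserstein_distance(icd_x, bcd)
```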

Figure 2.10: DSI calculation using different distribution measures (DSI by
KS distance, DSI by W-distance, and rTD) for the nine two-cluster datasets
(Cluster_SD = 1 to 9).

2.5.3 Distance Metrics

Since the DSI examines the distributions of distances between data


points, the used distance metric is another important factor. In this study,
the DSI uses the Euclidean distance because it has better sensitivity to
separability. We also tested several other commonly used distance metrics:
City-block, Chebyshev, Correlation, Cosine, and Mahalanobis distances.
We computed the DSIs based on these distance metrics using the nine
two-cluster datasets in Section 2.4.1.2 and the results are compared by
the baseline rTD. Figure 2.11 shows that the Euclidean distance performs
similarly as the City-block and Chebyshev distances. Such results indicate
that the Minkowski distance metric (p-norm) could be suitable for the
computation of DSI.
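Because scipy.spatial.distance.pdist and cdist accept a metric argument, the comparison above amounts to recomputing the DSI under each metric. A sketch for the two-class case follows (metric names as in SciPy; the Mahalanobis case additionally requires a covariance estimate and is omitted here).

```python
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def dsi_with_metric(X, Y, metric="euclidean"):
    """Two-class DSI computed under a chosen distance metric
    (e.g. 'cityblock', 'chebyshev', 'cosine', 'correlation')."""
    icd_x = pdist(X, metric=metric)
    icd_y = pdist(Y, metric=metric)
    bcd = cdist(X, Y, metric=metric).ravel()
    s_x = ks_2samp(icd_x, bcd).statistic
    s_y = ks_2samp(icd_y, bcd).statistic
    return 0.5 * (s_x + s_y)
```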

Figure 2.11: DSI calculation using different distance metrics (Euclidean,
City-block, Chebyshev, Correlation, Cosine, and Mahalanobis) compared with
rTD for the nine two-cluster datasets (Cluster_SD = 1 to 9).

2.5.4 Future Work and Limitations

The principal use of the proposed DSI is understanding data separability,


which could help in choosing a proper machine learning model for data
classification [83]. This will be useful to the model designer who can begin
with either a small- or large-scale classifier. For example, a simpler classifier
could be used for an easily-separable dataset and thus reduce both compu-
tational cost and overfitting. DSI could serve also as a way to benchmark
the efficiency of a classifier, given a suitable measure of classifier complexity
and computational cost.
Since DSI can evaluate whether two datasets are from the same distribu-
tion, we have applied it in Section 4.4.3 to evaluate generative adversarial
networks (GANs) [90], competing with the existing metrics, such as Incep-

tion Score (IS) and Fréchet Inception Distance (FID) [105]. As with the FID,
measuring how close the distributions of real and GAN-generated images are
to each other is an effective approach to assess GAN performance because
the goal of GAN training is to generate new images that have the same
distribution as real images. We have also applied the DSI in Section 3.5
as an internal cluster validity index (CVI) [9] to evaluate clustering results
because the goal of clustering is to separate a dataset into clusters, in
the macro-perspective, how well a dataset has been separated could be
indicated via the separability of clusters.
By examining the similarity of the two distributions, the DSI can detect
(or certify) the distribution of a sample set, i.e., distribution estimation.
Several distributions could be assumed (e.g., uniform or Gaussian) and a
test set is created with an assumed distribution. The DSI could then be
calculated using the test and sample sets. The correct assumed distribution
will have a very small DSI (i.e., close to 0) value. In addition to the men-
tioned applications, DSI can also be used as a feature selection method for
dimensionality reduction and an anomaly detection method in data analysis.
DSI has broad applications in deep learning, machine learning, and data
science beyond direct quantification of separability.
The DSI could also help to understand how data separability changes
after passing through each layer of a neural network. As an example, we
reuse the three-layer FCNN model and nine datasets from Section 2.4.1.2.
An FCNN model is trained using a single dataset. We then input the data
into the trained model and record the output from each layer. Finally, we
compute the DSI of every output and input data. As shown in Figure 2.12,
for every dataset, the DSI of the final output is always higher than that of the
input, which indicates that the classifier improves the separability of the data.

Figure 2.12: DSIs of input and output data from each layer in the FCNN
model for nine datasets from Section 2.4.1.2. The x-axis represents the
outputs from layers of the FCNN: input layer, three hidden layers, and
output layer. The y-axis represents the DSI values of the output. Plots are for
the nine datasets.

Some DSIs

of output from hidden layers, however, are even smaller than that of the
input data. This phenomenon is non-intuitive because it is assumed that
hidden layers improve separability and increase the DSI continuously. A
possible reason for this is that the dimensions of data increase in the
hidden layers. The dimension of input data is two, and it changes to
16, 32, 16, and 1 for the output because of the number of neurons in
hidden layers. In higher dimensional space, data may be coded by fewer
features or mapped closer to each other, thus, the separability decreases.
Although DSI works for any dimensionality, dimensionality can affect data
distributions and the measurement of distance, which is known as the
curse of dimensionality [228], thus affecting the DSI. More studies should

address the impact of dimensions on DSI and how to compare separability
across different numbers of dimensions.
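A sketch of the layer-wise analysis is given below, assuming a trained (built) Keras model such as the FCNN sketched earlier and the dsi() function sketched in Section 2.3.2; the exact way layer outputs are extracted may vary with the framework version.

```python
import numpy as np
from tensorflow import keras

def layerwise_dsi(model, X, y):
    """DSI of the data representation at the input and after each layer of a
    trained Keras model (dsi() as sketched in Section 2.3.2)."""
    scores = [dsi(X, y)]  # separability of the raw input
    for layer in model.layers:
        extractor = keras.Model(inputs=model.inputs, outputs=layer.output)
        activations = extractor.predict(X, verbose=0)
        scores.append(dsi(activations.reshape(len(X), -1), y))
    return scores
```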

Figure 2.13: Two-class datasets with different decision boundaries: (a) DSI ≈
0.3472, TD = 0.96; (b) DSI ≈ 0.3472, TD = 0.79. They have the same DSI but
different Training Distinctness (TD). Dataset (b), having the more complex
decision boundary, is more difficult to classify.

The separability in DSI is defined by the global distributions of data.


In some cases, such separability cannot accurately reflect the complex-
ity of the decision boundary because of local conditions. For example, in
Figure 2.13, two datasets have approximately the same DSI but the deci-
sion boundary complexity of the right dataset (b) is higher and its TD (by
reusing the previous three-layer FCNN model in Section 2.4.1.2) is smaller.
Therefore, the distribution-based separability cannot represent the gen-
eral training/classification difficulty in terms of the complexity of decision
boundaries. One of our studies addresses this problem by examining the
complexity of data points on or near the decision boundary (Section 4.5).
In addition, DSI may have other problems that require further research.
DSI can be considered a distance-based embedding method. It extracts
one index from data of any dimensionality to indicate the separability of those
data. By reducing the dimensionality, a significant volume of the original
information is lost. The situations in which DSI cannot properly measure the
separability of data must therefore be identified. For example,
when the separability changes, the change of DSI is nonlinear, and the DSI
of linearly separable data is usually not 1 (e.g., Figure 2.5f).

2.6 Conclusion

We proposed a novel and effective measure (DSI) to verify whether the


distributions of two sample sets are identical for any dimensionality. This
measure has a solid theoretical basis. The core Theorem we proved con-
nects the distributions of two-class datasets with the distributions of their
intra-class distance (ICD) sets and between-class distance (BCD) set. Usu-
ally, the datasets are in high-dimensional space and thus to compare their
distributions is very difficult. By our theorem, to show that the distribu-
tions of two-class datasets are identical is equivalent to showing that the
distributions of their ICD and BCD sets are identical. The distributions of
ICD and BCD sets are easy to compare because the distances are in R¹. The
DSI is based on the KS distance between these distance sets.
Since DSI provides an effective way to verify whether the distributions
of two sample sets are identical for any dimensionality, it has many ap-
plications. This chapter shows its core application, which is an intrinsic
separability/complexity measure of a dataset. DSI considers that different
classes of data mixing with the same distribution is the most difficult case
to separate using classifiers. The DSI indicates whether data belonging to
different classes have the same distribution, and thus provides a measure
of the separability of datasets. The quantification of the data separability
helps users choose the proper machine learning model for data classification
without excessive iterations. Comparisons using synthetic and real datasets
show that DSI outperforms many state-of-the-art separability/complexity

measures and demonstrate its competitiveness as an effective measure.
In addition to its uses as a separability measure and as an evaluator of
GANs and of clustering results (all as shown by our studies), DSI has the
clear potential to be applied in other important areas, such as distribution
estimation, feature selection, and anomaly detection.

Chapter 3: Hyperspectral Images-based Cardiac Ablation Lesion
Detection Using Unsupervised Learning1

Atrial fibrillation is the most common cardiac arrhythmia. It is being


effectively treated using the radiofrequency ablation (RFA) procedure which
destroys culprit tissue or creates scars that prevent the spread of abnormal
electrical activity. Long-term success of RFA could be improved further
if ablation lesions could be directly visualized during the surgery. We have
shown that auto-fluorescence-based Hyper-Spectral Imaging (aHSI) can help
to identify lesions based on spectral unmixing. In this chapter, we show that
k-means clustering, an unsupervised learning method, can detect
RFA lesions without a priori knowledge of the lesions’ spectral
characteristics. We also show that the number of spectral bands required
for successful lesion identification can be significantly reduced, enabling the
use of increased spectral bandwidth. Together, these findings can help with
clinical implementation of an aHSI catheter (Figure 3.1), since by reducing
the number of spectral bands one can reduce hypercube acquisition and
processing times, and by increasing the spectral width of individual bands
one can collect more photons. The latter is of critical importance in low-light
applications such as intracardiac aHSI. The ultimate goal of our studies is
to help improve clinical outcomes for atrial fibrillation patients.
The application of k-means clustering for lesion detection sparked our
interest in studying the evaluation of clustering results, which is a
significant area of cluster analysis. Since there are no true class labels for
clustering in typical unsupervised learning, many internal cluster
1 This work has been published in [J4], [J5], [C4], and [C11].


Figure 3.1: Proposed concept of acquiring hyperspectral imaging data from


the heart.

validity indices, which use predicted labels and data, have been created.
Without true labels, designing an effective Cluster Validity Index (CVI) is as
difficult as creating a clustering method. Having more CVIs is crucial
because no universal CVI can measure all datasets and there is no specific
method for selecting a proper CVI for clusters without true labels. Therefore,
applying a variety of CVIs to evaluate clustering results is necessary. In this
chapter, we apply the Distance-based Separability Index (DSI), introduced
earlier, as a novel internal CVI. We compared the DSI with eight internal
CVIs, ranging from the early Dunn index (1974) to the recent CVDD (2019),
and with an external CVI as the ground truth, using clustering results of
five clustering algorithms on 12 real and 97 synthetic datasets. The results
show that DSI is an effective, unique, and competitive CVI compared with
the other CVIs. We also summarize the general process for evaluating CVIs
and introduce the rank-difference metric for comparing CVIs’ results.

3.1 Introduction of the Autofluorescence-based Hyperspectral Imag-
ing

Atrial fibrillation (AF) is the most common cardiac arrhythmia affecting


as many as 10 million people in the United States alone [56]. Most of the
abnormal sources of bioelectrical activity causing AF have been found in
the left atrium near the entrance of the pulmonary veins [41]. AF can be
treated by destroying the culprit tissue or by creating scar tissue, which
prevents abnormal activity from spreading [34]. Radiofrequency ablation
(RFA) is a common surgical procedure used widely to ablate living
tissue, including in the atria. Testing of electrical conduction is then used to
determine whether abnormal sources of electrical activity have been isolated. Even
after passing such testing, however, patients can later experience recurrent AF,
because reversible tissue injury and temporary edema can also stop
electrical activity [13]. When tissue recovers, electrical reconnections can
lead to AF recurrence. The recurrence rate of AF after an ablation procedure
can be as high as 50%, and more than 90% of these recurrent cases have
been linked to gaps between ablation lesions [198, 197, 284]. Incomplete
lesion placement that later results in AF recurrence could be curtailed
if clinicians could directly monitor lesion formation along with the degree
of tissue damage. Unfortunately, the endocardial surface of the left atrium,
where most RF ablation procedures are performed, is covered by thick
layers of collagen and elastin, preventing direct visualization of the ablated
muscle beneath. While imaging technologies such as MRI, CT, and ultrasound have
been successfully applied for lesion testing, they have significant limitations:
CT and MRI are expensive and involve radiation and/or contrast agents, and
ultrasound imaging has poor resolution; thus, they are not always suitable
for live monitoring [198, 197]. Therefore, another visualization approach,
called autofluorescence hyperspectral imaging (aHSI) [284, 213, 140, 86],
has been explored. Previous studies have shown that hyperspectral
imaging can circumvent this limitation [86, 185, 184].

3.1.1 Hyperspectral Imaging Hardware

Figure 3.2: Schematic showing hypercube acquisition. CCD – charge-coupled device, LCTF – liquid crystal tunable filter, UV LED – ultraviolet light-emitting diode.

To implement the aHSI approach during the RFA procedure, one has to
deliver ultraviolet (UV) light (λ = 365 nm) to the heart by an optical fiber
threaded into a percutaneous catheter [178, 144]. This allows illumination
of the endocardial atrial surface, which is highly autofluorescent. The
autofluorescence signal is then detected through the image guide and the
attached HSI camera system, which forms a stack of images acquired at
individual wavelengths. Figure 3.2 shows a diagram of such a system,
while Figure 3.3 illustrates the hypercube construction. The hypercubes

contain rich spectral information about the tissue. Our previous studies
have shown that subtle changes in the tissue autofluorescence profiles
can help to identify the ablated regions in both animal and human atrial
tissue [86, 185]. In those studies, we had to pre-acquire target spectra for
lesion and non-lesion sites before applying linear unmixing [284], since it
is a supervised learning method. The first objective of this work was to
apply an unsupervised learning method, k-means clustering, to detect RFA
atrial lesions without a priori knowledge about tissue spectra. Our second
objective was to use k-means clustering to select the minimal number of
spectral bands (feature groups) without significantly reducing the accuracy
of lesion detection. This is important for future implementation of an
intracardiac aHSI catheter, since it is beneficial to decrease the number
of spectral images within the hypercube while preserving the method’s
ability to reveal the lesions. First, having fewer images will speed up both
acquisition and processing, enabling us to visualize the ablated areas in
real time. Secondly, by widening spectral bands around the most useful
wavelengths, one can collect more photons and make the output images
more robust to noise.
Atria from freshly excised porcine hearts were ablated by a non-irrigated
RF ablation catheter (Boston Scientific). Several lesions were created on
one tissue sample. Atria were illuminated with a 365nm UVA LED (Mightex,
Pleasanton, CA) placed 10 cm from the tissue surface. A CCD camera
outfitted with a Nikon AF Micro-Nikkor 60mm f/2.8D objective and a liq-
uid crystal tunable filter (LCTF, Nuance FX, PerkinElmer/CRi) was used
to acquire hypercubes of the samples. The LCTF was tuned to pass the
wavelengths from 420 to 720 nm at continuous wavelengths separated by
the filter’s band interval, 10 nm; this yielded 31 channels. As shown in


Figure 3.3: Hypercube of aHSI images: images in the hypercube were


ordered by increasing wavelength along the Z-axis. Each pixel on the
X-Y plane thus has an associated spectrum.

Figure 3.2, through the LCTF, a lens projects the collected light onto a
CCD containing 1392×1040 pixels. Finally, the hypercube for each sam-
ple was constructed from the 31 auto-fluorescence images, each of size
1392×1040. Ten samples were used in this study; therefore, we collected
310 auto-fluorescence images in total.

3.1.2 Data Preprocessing

For each sample, we combined the 31 images into a 3D hypercube


and extracted spectral profiles from each x, y pixel. Each spectrum was
then divided by the spectral sensitivity curves of the CCD camera and the

LCTF [284] (correction), followed by normalization, which converted the values
of each spectrum to the range from 0 to 1 (Figure 3.4a). Normalization is
critical because, for the classification algorithm, it is the overall shape of the
spectrum that matters, not the absolute light intensity at each wavelength.
For normalization, the maximum value was set to 1 and the minimum to 0.
A more detailed discussion of the importance of the normalization step is
given in the earlier study [86].
Then, we reshaped the 3D hypercube to a 2D matrix according to the
rule shown in Figure 3.4b: for every point on the X-Y plane (a pixel), the
data along the spectral dimension were considered as a vector in the new
2D matrix; the pixels were ordered from left to right in the first row (upper
left), then in the second row and so forth. The spectrum of each pixel in the
X-Y plane was represented as a vector in the matrix; the matrix therefore
had 31 columns corresponding to 31 spectral bands (420-430 nm, 430-440
nm, . . . , 710-720 nm). Hereafter, we refer to each pixel as a sample; each
sample is a vector of 31 features.
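As an illustration of the preprocessing just described, the following minimal NumPy sketch applies the correction and normalization and reshapes the hypercube into the pixel-by-band matrix; the array names (cube, sensitivity) are illustrative and this is not the original Matlab code:

import numpy as np

def preprocess_hypercube(cube, sensitivity, eps=1e-12):
    """Correct and normalize a (W, L, 31) hypercube, then reshape it into a
    (W*L, 31) matrix in which each row is the spectrum of one pixel."""
    # Correction: divide each spectrum by the combined CCD/LCTF sensitivity curve.
    corrected = cube / sensitivity.reshape(1, 1, -1)
    # Normalization: rescale every pixel's spectrum to [0, 1]; only the spectral
    # shape matters for classification, not the absolute intensity.
    lo = corrected.min(axis=2, keepdims=True)
    hi = corrected.max(axis=2, keepdims=True)
    normalized = (corrected - lo) / (hi - lo + eps)
    # Reshape: pixels ordered row by row (upper-left first), 31 features per sample.
    w, l, bands = normalized.shape
    return normalized.reshape(w * l, bands)

# With the dimensions used in this study (1040 x 1392 pixels, 31 bands),
# the result has 1,447,680 rows (samples) and 31 columns (features).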

3.2 Ablation Lesion Detection Using Unsupervised Learning

Unsupervised learning methods can infer the hidden structures or ex-


tract information from unlabeled data [98, 124]. Their advantage is that a priori
knowledge (e.g., labels) about the targets is not required before performing
detection. Unsupervised learning algorithms have been applied to target de-
tection in various fields, such as anomaly detection [290], road detection [96],
object recognition [214], and salience detection [306, 203, 113, 138].
Here we used k-means clustering as an unsupervised learning algorithm
to cluster samples (vectors of spectra) into k clusters, numbered by integers
from “1” to “k”. Each location (pixel) of the spectrum was labeled with

Figure 3.4: (a) Pre-processing operations and reshaping hypercube into a
2D matrix; (b) the rule of reshaping and inverse-reshaping.

its cluster number. Then, we assigned colors to those numbers to allow


visualization of the clusters. Since the spectra of lesions and non-lesions
are different, they were assigned different colors to be distinguished visually
from the other tissues.
The clustering algorithm was run on a computer with an Intel i7-6700
3.40 GHz CPU and 16.0 GB RAM, under Windows 10 with Matlab R2016b.
The built-in k-means function2 in Matlab uses the squared Euclidean
distance measure and the “k-means++” algorithm for initializing cluster
centers; its maximum number of iterations is 100.
2 https://www.mathworks.com/help/stats/kmeans.html

3.2.1 K-means Clustering

In our case, we have 1040 × 1392 = 1,447,680 samples for each dataset. We
performed k-means clustering, in which the value of k was initially unknown
and was determined by experiments. Each pixel was labeled by its cluster. Then,
we assigned colors to these numbers to allow visualization of the clusters.
The procedure is shown in Figure 3.5.


Figure 3.5: K-means clustering.

The k-means clustering method is a commonly used unsupervised ma-


chine learning algorithm. In general, given an input set {xi } having m
d-dimensional real vectors, we use k-means clustering to partition the m
vectors into k (≤ m) sets S = {s1 , s2 , . . . , sk } and minimize the within-cluster
sum of squares:

argmin_S ∑_{j=1}^{k} ∑_{x∈s_j} ‖x − µ_j‖² ,

where µ_j is the mean of the vectors in s_j. It is implemented with Lloyd’s
algorithm [144].
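A compact scikit-learn sketch of this clustering step is shown below; the original implementation used Matlab’s built-in kmeans (k-means++ initialization, 100-iteration cap), so the call here only mirrors those settings and is not the exact code used:

from sklearn.cluster import KMeans

def cluster_spectra(samples, width, length, k=10):
    """Cluster the (W*L, 31) spectra into k groups with k-means and reshape
    the per-pixel labels back into a W x L label image (inverse reshape)."""
    km = KMeans(n_clusters=k, init="k-means++", max_iter=100, n_init=10,
                random_state=0)
    labels = km.fit_predict(samples)      # integer cluster label per pixel (0..k-1)
    return labels.reshape(width, length)  # label image used for visualization

# label_image = cluster_spectra(samples, 1040, 1392, k=10)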

3.2.2 Evaluation and Results

To be able to evaluate any lesion detection method, one must have sets
of images in which the lesions are labeled. This section describes the
construction of such sets.

3.2.2.1 Creating Reference Images

A traditional way to outline the lesions is to stain tissue with 2,3,5-


Triphenyl-2H-tetrazolium chloride (TTC). The content of dehydrogenase
enzymes and NADH declines within ablated tissue. Since those compounds
turn tetrazolium salts into a formazan pigment, viable tissue turns red, while
lesion areas appear white (Figure 3.6a). TTC staining thus provides reliable
identification of lesions and their boundaries; this is TTC-Reference.


Figure 3.6: Appearance of ablated tissue after: (a) linear unmixing from
aHSI system, (b) TTC staining.

Previous studies have shown that lesions detected by the linear unmixing
algorithm based on pre-acquired spectral libraries closely match RF lesions
in the corresponding TTC image (TTC-Reference) [284, 213]. Figure 3.6b

shows one such example. Therefore, we considered the lesion component
image obtained using linear unmixing of a 31-band hypercube as a reference
image; this is called Gray-Unmixing-Reference.
The Gray-Unmixing-Reference has a continuous gray-scale, and lesions
are brighter (have larger gray values) than non-lesion areas. To create a new
image that identifies unambiguously the lesion and non-lesion pixels, we
used a gray-level threshold. The threshold was found by Otsu’s method [98],
which uses the image’s histogram to find the threshold that maximizes the
between-class variance. The pixels with intensities greater than the thresh-
old were then labeled lesion; all others were considered non-lesion.
Having this binary (two-class) image (the Bi-Unmixing-Reference) enabled
us to quantitatively evaluate the k-means approach.
The k-means clustering yielded an image in which each pixel was labeled
with an integer from ‘1’ to ‘k’. For finding the label of lesions, we recorded the
locations of all lesion pixels (whose value is ‘1’) in the Bi-Unmixing-Reference.
Then, we examined all the corresponding pixels (those having the same
locations) in the k-means image. Since every pixel has a label (cluster
number) after k-means clustering, we can calculate the modal (most-often
occurring) label of these sample pixels as the label of lesions; all other labels
represent non-lesions. Finally, in the clustering image, all pixels having the
label of lesion were set to value ‘1’; and other pixels (non-lesion) were set
to value ‘0’. So, we obtained the binary image (the Bi-31-Result) for lesion
detection by k-means clustering using 31 features.
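The construction of the binary reference and the binary k-means result can be sketched as follows, using skimage’s Otsu threshold and SciPy’s mode; the array names are illustrative and this is a reconstruction of the described steps, not the original code:

import numpy as np
from scipy import stats
from skimage.filters import threshold_otsu

def binary_unmixing_reference(gray_ref):
    """Otsu-threshold the gray-scale unmixing reference: pixels above the
    threshold become lesion (1), all others non-lesion (0)."""
    t = threshold_otsu(gray_ref)
    return (gray_ref > t).astype(np.uint8)

def binary_kmeans_result(label_image, bi_ref):
    """Take the modal k-means label among the reference lesion pixels as the
    lesion cluster, then binarize the label image accordingly."""
    lesion_cluster = stats.mode(label_image[bi_ref == 1], keepdims=False).mode
    return (label_image == lesion_cluster).astype(np.uint8)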

3.2.2.2 Evaluation by Accuracy Index

To verify whether k-means clustering using 31 features is an effective


method to detect lesion areas, we measured pixel-to-pixel matching by

comparing the Bi-31-Result (e.g., lesions colored red in Figure 3.7d) with
the outcome of linear unmixing (lesion areas in Bi-Unmixing-Reference; e.g.,
white regions in Figure 3.7f).
The Binary-Unmixing-Reference (IRef ) and Binary-31(features)-Result
(I31Rlt ) are binary images having the same size; if the value of a given pixel
was different in the two images, it was declared to be a ‘miss’. Accuracy
index (Acc) was defined as 1 minus the ratio of the number of ‘miss’ (Diff) to
the total number (N) of pixels of lesion areas in the two (detected and truth)
images:
Acc(IRef, I31Rlt) = 1 − Diff / N

If the accuracy was acceptable, we could use the lesions that were detected
by k-means using 31 features as a reference (Bi-31-Result) to evaluate the
outcomes after the next step: feature grouping (Section 3.3.1).
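A short sketch of the accuracy index follows; the text defines N as the total number of lesion pixels in the detected and truth images, which is interpreted here as the union of the two lesion masks (an assumption):

import numpy as np

def accuracy_index(bi_ref, bi_result):
    """Acc = 1 - Diff/N: Diff counts pixels whose binary values differ, and N
    counts pixels labeled lesion in either image (union; assumed reading)."""
    diff = np.count_nonzero(bi_ref != bi_result)
    n = np.count_nonzero((bi_ref == 1) | (bi_result == 1))
    return 1.0 - diff / n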

3.2.2.3 Lesion detection Results by k-means clustering

For porcine samples (we have 10 datasets of samples in total) that encom-
passed an area of 1392 by 1040 pixels, Figure 3.7a shows that k=5 is not
sufficient to distinguish ablated regions for this sample (Set-1). To find the
optimal k, we computed k ranging from 2 to 41 for all our porcine datasets.
For each k, we plotted the maximum, average, and minimum accuracies
over the 10 datasets in Figure 3.8.
Because a smaller k will make k-means run faster, we seek the smallest
k that is effective. Figure 3.8 indicates that k=10 is overall optimal: it is the
smallest k that almost reaches all the highest values for maximum, average,
and minimum accuracies. As illustrated in Figure 3.7b, k=10 is effective
for the Set-1 sample.
A set of 31 aHSI planes was required to obtain the lesion detection results shown above.


Figure 3.7: Results for porcine atria (Set-1) clustered by k-means into: (a) 5
clusters and (b) 10 clusters. Panel (c) shows an auto-fluorescence image at
500 nm; (d) shows the lesion areas (red) detected when k=10, superimposed
on the image in (c). The corresponding lesion component image, which is
from the unmixed image that contains lesion component and non-lesion
component, is shown in (e); followed by binary image obtained from (e) by
applying Otsu’s thresholding (f).

Figure 3.8: Maximum, average, and minimum accuracies over 10 datasets
for each k.

The accuracy of lesion detection by k-means clustering


using 31 aHSI planes has been measured by comparison with the images
revealed by linear unmixing (Bi-Unmixing-Reference). Table 3.1 shows their
accuracies. Set-3 has an exceptionally lower accuracy than the others because
Otsu’s algorithm cannot automatically extract a good binary reference image
for this dataset. This does not affect finding the best SN of feature grouping
in the next step, however, because all results are compared with the same
binary reference image.

Table 3.1: Accuracies of 31-feature clustering results.

Dataset# 1 2 3 4 5 6 7 8 9 10
Acc (IRef , I31Rlt ) 0.87 0.91 0.38 0.80 0.69 0.82 0.66 0.87 0.66 0.75

Through the 10 datasets, the average accuracy for detection by k-means


(k=10) using 31 features was about 74% when evaluated using the Bi-
Unmixing-Reference. We then grouped the features. The goal was to de-

crease the number of spectral bands without reducing the accuracy of lesion
detection appreciably.

3.3 Optimization of Wavelength Selection

3.3.1 Feature Grouping

We implemented a grouping procedure, which divides the 31 features


into four contiguous disjoint groups. For each group, we calculated the sum
of values as a new feature value, yielding four new features (see Figure 3.9).


Figure 3.9: One kind of 4-feature grouping.

There are 4,060 ways to divide 31 features into four separate and contigu-
ous groups. The 31 features are the intensities at each of the wavelengths
from 420 to 720 nm. The goal was to find the best 4-feature groupings
from the 4,060 possible combinations to adequately detect the lesion areas.
That number is sufficiently small that we could construct every possible
grouping and get its detection result (the Bi-4-Result).
We assigned a Serial Number (SN) to each combination. The boundaries
between groups (the dividers) were described by the last feature’s number

Table 3.2: Combinations of 4 groups.

Serial Number (SN) Three dividers for 4-feature grouping (nm)


1 420, 430, 440
2 420, 430, 450
3 420, 430, 460
··· ···
28 420, 430, 720
29 420, 440, 450
30 420, 440, 460
··· ···
4060 690, 700, 710

in the 1st, 2nd, and 3rd group (Table 3.2). “720” is not shown because it is
always the last feature’s number in the 4th group. We assess the Binary-
4(features)-Result (I4Rlt) by comparing it to the Binary-31(features)-Result
(I31Rlt) to yield the accuracy Acc(I31Rlt, I4Rlt), whose calculation method is
the same as that of Acc(IRef, I31Rlt).
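Enumerating the groupings amounts to choosing three dividers among the 30 gaps between adjacent bands, C(30, 3) = 4,060 combinations. Below is a sketch of the enumeration and of forming the four grouped features by summation (0-based band indices; the lexicographic order of dividers is assumed to reproduce the SN ordering of Table 3.2):

import numpy as np
from itertools import combinations

def four_feature_groupings(n_bands=31):
    """Yield (SN, dividers): dividers are the 0-based indices of the last band
    of groups 1-3, enumerated lexicographically (4,060 combinations)."""
    for sn, dividers in enumerate(combinations(range(n_bands - 1), 3), start=1):
        yield sn, dividers

def group_features(samples, dividers):
    """Sum the band intensities inside each of the four contiguous groups,
    turning the (n_pixels, 31) matrix into an (n_pixels, 4) matrix."""
    edges = (0, dividers[0] + 1, dividers[1] + 1, dividers[2] + 1, samples.shape[1])
    return np.column_stack([samples[:, edges[i]:edges[i + 1]].sum(axis=1)
                            for i in range(4)])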

3.3.2 Wavelength Bands Selection

Using the same sample, by comparing its Bi-4-Result with Bi-31-Result


pixel by pixel, we calculated the accuracy of the feature grouping. We
computed the accuracies for 4,060 combinations of four groups (Table 3.2)
for one sample (Set-1); the accuracy for each SN is shown3 in Figure 3.10.
By analyzing Figure 3.10, we found the feature grouping having the
highest accuracy of lesion detection for this specific sample. As shown in
Figure 3.11, the 4-feature result with higher accuracy (b) identifies more
complete lesion areas than the 4-feature result with lower accuracy (c).
We then applied the same approach to the remaining datasets. The
goal was to find 4-feature grouping combinations that have high accuracies
3 The periodicity apparent in Figures 3.10 and 3.12 to 3.15 is an artifact of the serial-numbering system, and has no significance.
Figure 3.10: Accuracies of SNs for one dataset (Set-1).


Figure 3.11: Feature grouping results for porcine atria (Set-1): (a) k-means
clustering (k=10) by using all 31 features; (b) k-means clustering (k=10)
by using four features from 4-feature grouping (SN=2857): [wavelength
groups: 420-510, 520-600, 610-630, 640-720 nm]; (c) k-means clustering
(k=10) by using four features from different 4-feature grouping (SN=3716):
[wavelength groups: 420-580, 590-600, 610-680, 690-720 nm].


Figure 3.12: Feature grouping accuracies for 10 datasets; each row repre-
sents a dataset.

of lesion detection across multiple studies. To do this we examined the


accuracies for all 10 datasets (Figure 3.12). As Figure 3.12 shows, the
accuracy of a given feature grouping varies across datasets. One group-
selection method is to record the worst performance of a feature grouping
across all datasets:
W_SN = min_i A^i_SN ,

where A^i_SN is the accuracy of the i-th dataset and SN is the serial number.
The result (Figure 3.13) shows that there are many 4-feature groupings
that perform well (peaks) across all samples we tested. Taking the worst
performance has a flaw, however: a single bad dataset that appreciably
reduces the accuracies of all feature groupings will strongly influence the
result. If we plot the smoothed (averaged) minimum accuracies, an obvious
periodicity with respect to the serial numbers appears. Figure 3.14 shows
the locations (wavelengths) of the dividers and the smoothed minimum
accuracy value (scaled) for each SN. By looking at this figure, one can
notice that such periods

Figure 3.13: Accuracies over 10 datasets.

are defined by the first divider. We also noticed that the SN range in the
green area (2730-3245), in which the divider 1 ranges from 510 to 530 nm,
includes most high-accuracy combinations.
Another method is to design an evaluation function to reflect the average
performance of a feature grouping combination through all samples:

E_SN = ∏_i 1 / (1 − A^i_SN) ,

where A^i_SN is the accuracy of the i-th dataset and SN is the serial number of the combination.
By this formula, a feature grouping combination will get a large score if
its accuracy is close to 1. Thus, this nonlinear function emphasizes high
accuracy values. Since the maximum value of accuracy in this work was
less than 0.99, ESN is bounded. By comparing the two results (Figure 3.13
and Figure 3.15), we observe that the good-performance combinations are
similar. The evaluation function could find a better grouping for all tested

Figure 3.14: Smoothed Minimum accuracies (scaled) with the 3 dividers.
The green area (SN: 2730-3245) includes most high-accuracy combinations.

datasets than the minimum accuracy method, but the grouping that we
obtained from the max-min method could be more stable for new datasets
because it provides a reliable lower bound on accuracy.
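Given an accuracy matrix A with one row per dataset and one column per grouping (SN), the two selection rules just described can be written compactly; this is a sketch of the computation, not the original code:

import numpy as np

def select_groupings(A):
    """A[i, sn] is the accuracy of grouping sn on dataset i. Return the SN
    chosen by the max-min rule (best worst-case accuracy) and the SN chosen
    by the evaluation function E_SN = prod_i 1 / (1 - A[i, sn])."""
    W = A.min(axis=0)                     # worst-case accuracy of each grouping
    E = np.prod(1.0 / (1.0 - A), axis=0)  # emphasizes accuracies close to 1
    return int(np.argmax(W)), int(np.argmax(E))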
From this evaluation, the highest value (red point in Figure 3.15) is
obtained with the 4-feature grouping [420-520, 530-590, 600-640, 650-720
nm] (SN=3018). Using this grouping, we found the accuracies of 4-feature
clustering results for all datasets (Table 3.3).

Table 3.3: Accuracies of 4-feature clustering results by grouping (SN=3018):


[420-520, 530-590, 600-640, 650-720 nm].

Dataset# 1 2 3 4 5 6 7 8 9 10
Acc (I31Rlt , I4Rlt ) 0.97 0.99 0.96 0.96 0.92 0.93 0.96 0.93 0.89 0.95

Figure 3.16 shows a flowchart summarizing this study.


Figure 3.15: Evaluated accuracies over 10 datasets.

3.3.3 Time Cost Analysis

An important practical issue is the computing time cost for k-means


clustering based on all 31 features and that based on the four grouped
features. We used 10 datasets to test the time cost of k-means clustering.
The main factors that affect the computation time are the number of sam-
ple vectors, the dimension of vectors (the number of features), maximum
iterations of k-means, and the value of k. For one dataset, the number of
sample vectors is fixed (1040×1392=1447680), and the default maximum
iterations of k-means in Matlab is 100. Therefore, the running time depends
primarily on the number of features and the value of k. The running times
of k-means may vary for each run. For a given k, the time cost of 4-feature
clustering is about 41.3% of that of the 31-feature clustering, while the
average accuracy of 4-feature clustering by grouping ([420-520, 530-590,
600-640, 650-720 nm]) is about 95% of that of the 31-feature clustering
(Figure 3.17). We conclude that the 4-feature grouping can greatly speed


Figure 3.16: The simplified flowchart summarizes the methods in this


study. The ground-truth areas of lesion were obtained from the aHSI system
by linear unmixing and verified by TTC analysis. By comparing with the truth
data, we found the optimal k value for the k-means algorithm (green) as well as
the optimal groups (blue). The procedure on the left (red) is our proposed
method for lesion detection, from ablated tissue to lesion areas.

up the processing while maintaining good accuracy of lesion detection.

Figure 3.17: Time costs for k-means clustering.

3.3.4 Discussion

We used the k-means clustering method to find the lesion sites and
compared the outcomes to those using linear unmixing. Since k-means is
an unsupervised learning algorithm, we did not require a priori knowledge of
lesion spectra. In contrast, the supervised learning methods do require such
knowledge about the lesion to construct the training set containing labeled
spectra. In practice, k-means assigns lesion and non-lesion areas to different
groups, which are then given different colors. The outcome of k-means verified our
hypothesis that the spectra of ablated tissue are different from those of
non-ablated tissue. It also confirmed that the auto-fluorescence images
contain information about the components and structure of tissues [284].
Although k-means clustering is in general repeatable, one disadvantage is
that the detected lesion regions may vary slightly across clustering runs
on a given dataset. That is a characteristic of k-means because the initial

cluster centers are selected randomly, and the clustering result may be
affected by the choice of initial points.
Alternatively, we could apply supervised learning methods. A classifier
would be trained on labeled lesion and non-lesion spectral data. One
advantage of supervised classification is that the regions of lesion detected
by a classifier model are invariant for a given dataset. Though the time for
training a classifier might be greater, the lesion detection process by using
the classifier would be faster unless the classifier model is very complicated
(non-linear and in high dimension). But its disadvantage is that one will
require a large amount of labeled lesion and non-lesion spectral data for
such training.
To evaluate the results presented in this study, we compared the out-
comes after feature grouping with the results before feature grouping (I4Rlt
vs. I31Rlt ). Additionally, the results before feature grouping were verified
by comparing them with the outcomes of linear unmixing (I31Rlt vs. IRef ).
A direct comparison between the outcomes of k-means and TTC staining
would have been ideal, but this presents practical problems. First, the
chemical reaction that occurs during TTC staining makes ablated tissues
shrink to a certain degree. Secondly, images taken after TTC staining are
not taken at exactly the same orientation, so an exact comparison is not
possible, even when image registration methods are used. But since the
main goal of this study was to find the best feature grouping, direct com-
parison with TTC was not necessary; we have previously reported a
direct comparison between the surface areas of the lesions in TTC images
and those obtained in Gray-Unmixing images [284, 124]. By computing the
difference between detected lesions before and after feature grouping, we
still were able to achieve our goal.

3.4 Introduction of Cluster Validation in Unsupervised Learning

Like the k-means clustering used for lesion detection above, cluster analysis
is an important unsupervised learning method in machine learning. The
clustering algorithms divide a dataset into clusters [123] based on the dis-
tribution structure of the data, without any prior knowledge. Clustering
is widely studied and used in many fields, such as data mining, pattern
recognition, object detection, image segmentation, bioinformatics, and data
compression [220, 282, 92, 57, 132, 173]. The shortage of labels for training
is a major problem in some machine learning applications, such as medical
image analysis and big-data applications [188], because labeling is
expensive [112]. Since unsupervised machine learning does not use labels for
training, applying cluster analysis can avoid this problem.

3.4.1 Related Work

In general, the main methods of cluster analysis can be categorized into


centroid-based (e.g., k-means), distribution-based (e.g., EM algorithm [32]),
density-based (e.g., DBSCAN [74]), hierarchical (e.g., Ward linkage [281]),
and spectral clustering [277]. None of the clustering methods, however, is
able to perform well with all kinds of datasets [141, 278]. That is, a clustering
method that performs well with some types of datasets would perform
poorly with some others. For this reason, various clustering methods have
been applied to datasets. Consequently, effective clustering validations
(measures of clustering quality) are required to evaluate which clustering
method performs well for a dataset [23, 7]. Clustering validations are
also used to tune the parameters of clustering algorithms.
There are two categories of clustering validations: internal and external

validations. External validations use the true class labels and the predicted
labels, whereas internal validations use the predicted labels and the data. Since
external validations require true labels and there are no true class labels in
unsupervised learning tasks, we can employ only internal validations in
cluster analysis [162]. In fact, evaluating clustering results with internal
validations is as difficult as clustering itself, because the measures have no
more information than the clustering methods do [205]. Therefore, designing
an internal Cluster Validity Index (CVI) is as hard as creating a clustering
algorithm. The difference is that a clustering algorithm can update its
clustering results using a value (loss) from the objective function, whereas
the CVI only provides a value for evaluating the clusters.
Various CVIs have been created for the clustering of many types of
datasets [55]. According to their method of calculation [114], internal CVIs are
based on two categories of representatives: center and non-center. Center-based
internal CVIs use descriptors of clusters. For example, the Davies–Bouldin
(DB) index [50] uses cluster diameters and the distance between cluster
centroids. Non-center internal CVIs use descriptors of data points. For
example, the Dunn index [68] considers the minimum and maximum dis-
tances between two data points.
Besides the DB and Dunn indexes, in this chapter, some other typical
internal CVIs are selected for comparison. The Calinski-Harabasz (CH)
index [33] and the Silhouette coefficient (Sil) [225] are two traditional internal
CVIs. Among recently developed internal CVIs, the I index [176], WB index [305],
Clustering Validation index based on Nearest Neighbors (CVNN) [162], and
Cluster Validity index based on Density-involved Distance (CVDD) [114] are
selected. In total, eight typical internal CVIs, ranging from early studies (Dunn,
1974) to the most recent (CVDD, 2019), are compared
with our proposed CVI.
In addition, an external CVI, the Adjusted Rand Index (ARI) [234], is
selected as the ground truth for comparison because external validations use
the true class labels and the predicted labels. Unless otherwise indicated,
“CVIs” hereafter refers to internal CVIs; the only external CVI used is
the ARI.

3.5 Experiments of Cluster validation using DSI

Since the goal of clustering is to separate a dataset into clusters, from a
macro perspective, how well a dataset has been separated can be indicated
by the separability of the clusters. In a dataset, data points are assigned class
labels by the clustering algorithm. The most difficult situation for separation
of the dataset occurs when all labels are randomly assigned, so that the data
points of different classes have the same distribution (distributions
with the same shape, position, and support, i.e., the same probability
density function). We introduced the Distance-based Separability Index (DSI)
in Chapter 2 to analyze the distributions of different-class data. In this
section, we apply DSI as an internal CVI to evaluate clustering results.

3.5.1 Materials and Methods

A small DSI (low separability) of classes X and Y means that their ICD
and BCD sets are very similar. In this case, the distributions of classes X
and Y are similar too. Hence, data of the two classes are difficult to separate.
An example of a two-class dataset is shown in Figure 3.18. Figure 3.18a
shows that, if the labels are assigned correctly by clustering, the distri-
butions of ICD sets will be different from the BCD set and the DSI will
reach the maximum value for this dataset because the two clusters are

well separated. For an incorrect clustering, in Figure 3.18b, the difference
between distributions of ICD and BCD sets becomes smaller so that the
DSI value decreases. Figure 3.18c shows an extreme situation, that is, if
all labels are randomly assigned, the distributions of the ICD and BCD sets
will be nearly identical. It is the worst case of separation for the two-class
dataset and its separability (DSI) is close to zero. Therefore, the separability
of clusters can be reflected well by the proposed DSI. The DSI lies in the
range (0, 1), and we take a greater DSI value to mean that the dataset is
clustered better.
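For reference, here is a minimal two-cluster sketch of the DSI computation from Chapter 2, assuming the index is taken as the average Kolmogorov–Smirnov distance between each cluster’s ICD set and the BCD set (the exact aggregation over clusters follows Chapter 2):

import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import ks_2samp

def dsi_two_clusters(X, Y):
    """DSI for two clusters X and Y (rows are points): compare the intra-class
    distance (ICD) sets with the between-class distance (BCD) set via the KS
    statistic; a larger value indicates higher separability."""
    icd_x = pdist(X)               # pairwise distances within X
    icd_y = pdist(Y)               # pairwise distances within Y
    bcd = cdist(X, Y).ravel()      # distances between X and Y
    ks_x = ks_2samp(icd_x, bcd).statistic
    ks_y = ks_2samp(icd_y, bcd).statistic
    return 0.5 * (ks_x + ks_y)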

3.5.1.1 Compared CVIs

CVIs are used to evaluate the clustering results. In this study, several
internal CVIs including the proposed DSI have been employed to examine
the clustering results from different clustering methods (algorithms). To use
different clustering methods on a given dataset may obtain different cluster
results and thus, CVIs are used to select the best clusters. We choose eight
commonly used (classical and recent) internal CVIs and an external CVI -
the Adjusted Rand Index (ARI) to compare with our proposed DSI (Table 3.4).
The role of ARI is the ground truth for comparison because ARI involves
true labels (clusters) of the dataset.

3.5.1.2 Synthetic and real datasets

In this study, the synthetic datasets for clustering are from the Tomas
Barton repository 4 , which contains 122 artificial datasets. Each dataset
has hundreds to thousands of objects with several to tens of classes in two
or three dimensions (features). We have selected 97 datasets for the experiments because the 25 unused datasets have too many objects to run the clustering processing in reasonable time.
4 https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/

datasets/artificial

[Figure 3.18 panels: (a) correct labeling, DSI ≈ 0.645; (b) incorrect labeling, DSI ≈ 0.191; (c) random labeling, DSI ≈ 0.008 (BCD ≈ ICD). Each panel shows the data plot of classes X and Y and histograms of the ICD set for X, the ICD set for Y, and the BCD set between classes.]
Figure 3.18: Two-cluster (two-class) datasets with different label assignments.
Each histogram indicates the relative frequency of the value of each of the
three distance measures (indicated by color).

Table 3.4: Compared CVIs.

Name Optimala Reference


ARIb MAX (Santos & Embrechts, 2009) [234]
Dunn index MAX (Dunn, J.,1973) [68]
Calinski-Harabasz Index MAX (Calinski & Harabasz, 1974) [33]
Davies–Bouldin index min (Davies & Bouldin, 1979) [50]
Silhouette Coefficient MAX (Rousseeuw, 1987) [225]
I MAX (U. Maulik, 2002) [176]
CVNN min (Yanchi L., 2013) [162]
WB min (Zhao Q., 2014) [305]
CVDD MAX (Lianyu H., 2019) [114]
DSI MAX Proposed
a. Optimal column means the CVI for best case has the minimum or

maximum value. b. The ground truth for comparison.

The names of the 97 used synthetic datasets
are shown in Appendix B. Illustrations of these datasets can be found on
Tomas Barton’s homepage 5 .
The 12 real datasets used for clustering are from three sources: the
sklearn.datasets package 6 , UC Irvine Machine Learning Repository [58]
and Tomas Barton’s repository (real world datasets) 7 . Unlike the synthetic
datasets, the dimensions (feature numbers) of most selected real datasets
are greater than three. Hence, CVIs must be used to evaluate their clustering
results rather than plotting clusters as for 2D or 3D synthetic datasets.
Details about the 12 real datasets appear in Table 3.5.
5 https://github.com/deric/clustering-benchmark
6 https://scikit-learn.org/stable/datasets
7 https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/

datasets/real-world

Table 3.5: The description of used real datasets.

Name Title Object# Feature# Class#


Iris Iris plants dataset 150 4 3
digits Optical recognition of hand- 5620 64 10
written digits dataset
wine Wine recognition dataset 178 13 3
cancer Breast cancer Wisconsin (diag- 569 30 2
nostic) dataset
faces Olivetti faces dataset 400 4096 40
vertebral Vertebral column data 310 6 3
haberman Haberman’s survival data 306 3 2
sonar Sonar, Mines vs. Rocks 208 60 2
tae Teaching Assistant evaluation 151 5 3
thy Thyroid disease data 215 5 3
vehicle Vehicle silhouettes 946 18 4
zoo Zoo data 101 16 7

3.5.2 Evaluation Metrics

In general, there are two strategies to evaluate CVIs using a dataset: 1)


to compare with ground truth (real clusters with labels); 2) to predict the
number of clusters (classes) by finding the optimal number of clusters as
identified by CVIs [42].

3.5.2.1 Using real clusters

By using datasets’ information of real clusters with labels, the steps to


evaluate CVIs are:

1. To obtain clustering results by running different clustering methods


(algorithms) on a dataset.

2. To compute CVIs of these clustering results and their ARI (ground


truth) using real labels.

3. To compare the values of CVI with ARI.

Table 3.6: CVI scores of clustering results on the wine recognition dataset.

Validity a        KMeans   Ward Linkage   Spectral Clustering   BIRCH      EM
ARI b (+)          0.913c       0.757           0.880           0.790    0.897
Dunn (+)           0.232        0.220           0.177           0.229    0.232
CH (+)            70.885       68.346          70.041          67.647   70.940
DB (-)             1.388        1.390           1.391           1.419    1.389
Silhouette (+)     0.284        0.275           0.283           0.278    0.285
WB (-)             3.700        3.841           3.748           3.880    3.700
I (+)              5.421        4.933           5.326           4.962    5.421
CVNN (-)          21.859       22.134          21.932          22.186   21.859
CVDD (+)          31.114       31.141          29.994          30.492   31.114
DSI (+)            0.635        0.606           0.629           0.609    0.634
a. CVI for best case has the minimum (-) or maximum (+) value. b. The first
row shows results of ARI as ground truth; other rows are CVIs. c. Bold
value: the best case by the measure of this row.

4. To repeat the former three steps for a new dataset.

In this study, five clustering algorithms from various categories are
used: k-means, Ward linkage, spectral clustering, BIRCH [302], and the
EM algorithm (Gaussian mixture). The CVIs used for evaluation and
comparison are shown in Table 3.4, and the datasets used are introduced
in Section 3.5.1.2. We provide two evaluation methods to compare the
values of CVIs with the ground-truth ARI; they are called Hit-the-best and
Rank-difference and are described as follows.

Evaluation metric: Hit-the-best For a dataset, clustering results obtained
by different clustering algorithms will have different CVI and ARI scores.
If a CVI gives the best score to the clustering result that also has the best
ARI score, this CVI is considered to have made a correct prediction (hit-the-best).
Table 3.6 shows CVI scores of clustering results produced by different clustering
methods on one dataset. For the wine dataset, k-means receives the best ARI
score, and Dunn, DB, WB, I, CVNN, and DSI give k-means the best score;
thus, these six CVIs are hit-the-best.

Table 3.7: Hit-the-best results for the wine dataset.

Dataset   Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
wine        1    0   1    0     1   1    1     0    1
a. Sil = Silhouette.

If we mark hit-the-best CVIs as 1 and others as 0,


CVI scores in Table 3.6 can be converted to hit-the-best results (Table 3.7)
for the wine dataset.
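In code, hit-the-best reduces to comparing arg-optima, as in the sketch below; scores of minimum-optimal CVIs such as DB and WB must be negated first, and ties are broken here by the first occurrence:

import numpy as np

def hit_the_best(cvi_scores, ari_scores, cvi_max_is_best=True):
    """Return 1 if the clustering ranked best by the CVI is also ranked best
    by ARI, otherwise 0."""
    cvi = np.asarray(cvi_scores, dtype=float)
    if not cvi_max_is_best:
        cvi = -cvi
    return int(np.argmax(cvi) == np.argmax(ari_scores))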
For the hit-the-best, however, the best score can be unstable and random
in some cases. For example, in Table 3.6, the ARI score of EM is very close
to that of k-means, and the Silhouette score of EM is also very close to that
of k-means. If these values fluctuated a little and the best cases changed,
the comparison outcome for this dataset would change. Another drawback
of hit-the-best is that it concerns only one best case and ignores others;
it does not evaluate the whole picture for one dataset. The hit-the-best
might be a stricter criterion but lacks robustness, and it is vulnerable to
extreme cases such as when scores of different clustering results are very
close to each other. Hence, we create another method to compare the score
sequences of CVIs and ARI through their ranks.

Evaluation metric: Rank-difference This comparison method fixes the
two problems of the hit-the-best: one is its instability for similar scores and
the other is its bias toward only one case.
We apply quantization to solve the problem of similar scores. Every score
in the score sequence of a CVI (i.e., a row in Table 3.6) is assigned a
rank number, and similar scores have a high probability of being assigned the
same rank number. The procedure is:

1. Find the minimum and maximum values of N scores from one sequence.

2. Uniformly divide [min,max] into N − 1 intervals.

3. Label intervals from max to min by 1, 2, . . . , N − 1.

4. If a score is in the k-th interval, its rank number is k.

5. Define the rank number of the maximum score to be 1; each interval is
open at its upper value and closed at its lower value: (upper value, lower value].

[Figure 3.19 content: the input score sequence 1, 9, 2, 8, 6 is divided into four equal intervals between its minimum and maximum, producing the output rank sequence 4, 1, 4, 1, 2.]

Figure 3.19: An example of rank-number assignment.

Figure 3.19 shows an example of converting a score sequence to a rank


sequence (rank numbers). The rank number of scores 9 and 8 is 1 because
they are in the 1st interval. For the same reason, the rank number of scores
1 and 2 is 4. Such quantization is better than assigning rank numbers by
ordering because it avoids the assignment of different rank numbers to very
close scores in most cases (it is still possible for very close scores to receive
different rank numbers; for example, in the Figure 3.19 case, if scores 8 and 6
changed to 7.1 and 6.9, their rank numbers would still be 1 and 2 even though
they are very close).
Table 3.8: Rank sequences of CVIs converted from the score sequences in Table 3.6.

Validity      KMeans   Ward Linkage   Spectral Clustering   BIRCH   EM
ARI a            1           4                1               4      1
Dunn             1           1                4               1      1
CH               1           4                2               4      1
DB               1           1                1               4      1
Silhouette       1           4                1               3      1
WB               1           4                2               4      1
I                1           4                1               4      1
CVNN             1           4                1               4      1
CVDD             1           1                4               3      1
DSI              1           4                1               4      1
a. The first row shows results of ARI as ground truth; other rows are CVIs.


Remark. The score whose rank number is 1 (the 1-rank score) should
represent the optimal performance. Assigning rank 1 to the maximum CVI
score works only for CVIs whose optimum is a maximum; it does not work
for CVIs whose optimum is a minimum, such as DB and WB, because their
1-rank score should be the minimum. A simple solution that makes the rank
numbers work for both types of CVIs is to negate all values in the score
sequences of the minimum-optimal CVIs before converting them to rank
sequences (Figure 3.19). Thus, the 1-rank score always represents the optimal
performance for all CVIs.

Table 3.8 shows rank sequences of CVIs converted from the score se-
quences in Table 3.6. For each CVI, four ranks are assigned to five scores.
Since the ARI row shows the truth rank sequence, for rank sequences in
other CVI rows, the more similar to the ARI row, the better the CVI performs.

Table 3.9: Rank-difference results for the wine dataset.

Dataset   Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
wine        9    1   3    1     1   0    0     7    0
a. Sil = Silhouette.

For two score sequences (e.g., a CVI and ARI), after quantizing them to
two rank sequences, we compute the difference between the two rank sequences
(called the rank-difference), which is simply defined as the sum of the absolute
differences between them. For example, the two rank sequences from Table 3.8 are:

ARI : {1, 4, 1, 4, 1}
CVDD : {1, 1, 4, 3, 1}

Their rank-difference, the sum of absolute differences, is:

|1 − 1| + |4 − 1| + |1 − 4| + |4 − 3| + |1 − 1| = 7

A smaller rank-difference means the two sequences are closer. CVI and ARI
sequences that are closer indicate a better prediction. It is not difficult to
show that the rank-difference between two N-length score sequences lies
in the range [0, N(N − 2)]. Table 3.9 shows the rank-differences between the
ARI and the nine CVIs from Table 3.8. The CVI with the lower rank-difference
value is better; 0 is the best because it means the CVI has the same
performance as the ground truth (ARI).
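A sketch of the quantization and the rank-difference metric is given below, reproducing the worked example above (scores 1, 9, 2, 8, 6 map to ranks 4, 1, 4, 1, 2, and the ARI/CVDD rank sequences differ by 7):

import numpy as np

def rank_sequence(scores, max_is_best=True):
    """Quantize N scores into N - 1 equal intervals between min and max; the
    interval nearest the maximum is rank 1, and the maximum itself is rank 1."""
    s = np.asarray(scores, dtype=float)
    if not max_is_best:           # negate minimum-optimal CVIs (e.g., DB, WB)
        s = -s
    n = len(s)
    width = (s.max() - s.min()) / (n - 1)
    ranks = np.ceil((s.max() - s) / width).astype(int)
    ranks[s == s.max()] = 1       # rank of the maximum is defined to be 1
    return ranks

def rank_difference(cvi_ranks, ari_ranks):
    """Sum of absolute differences between two rank sequences (0 is best)."""
    return int(np.abs(np.asarray(cvi_ranks) - np.asarray(ari_ranks)).sum())

print(rank_sequence([1, 9, 2, 8, 6]))                      # -> [4 1 4 1 2]
print(rank_difference([1, 1, 4, 3, 1], [1, 4, 1, 4, 1]))   # -> 7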

3.5.2.2 To predict the number of clusters

Some clustering methods require setting the number of clusters (classes)


in advance, such as k-means, spectral clustering, and the Gaussian mixture

(EM). Suppose we have a dataset and know its real number of clusters, c;
then the steps to evaluate CVIs through predicting the number of clusters
in this dataset are:

1. To run clustering algorithms by setting the number of clusters k =


2, 3, 4, . . . (the real number of clusters c is included) to get clusters.

2. To compute CVIs of these clusters.

3. The number of clusters predicted by the i-th CVI, k̂i, is the number of
clusters whose clustering performs best on the i-th CVI (i.e., the optimal
number of clusters recognized by this CVI).

4. The i-th CVI makes a successful prediction if its predicted number
of clusters equals the real number of clusters: k̂i = c.

For several CVIs, the number of successful predictions could be zero,


one, two, or more. Besides CVIs, the success also depends on the datasets
and clustering methods. In this study, we selected the wine, tae, thy, and
vehicle datasets (see Table 3.5) and the clustering methods k-means, spectral
clustering, and the EM algorithm.
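A sketch of this cluster-number prediction protocol with a single CVI is shown below; the silhouette coefficient is used purely as an illustration, and any of the compared CVIs (including DSI) can be substituted:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def predict_n_clusters(X, candidate_ks=(2, 3, 4, 5, 6)):
    """Run k-means for each candidate k and return the k whose clustering
    scores best on the chosen CVI (maximum silhouette here)."""
    scores = []
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    return candidate_ks[int(np.argmax(scores))]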

3.5.3 Results

3.5.3.1 Clusters of real and synthetic datasets

As discussed before, for one dataset and a CVI, an evaluation result


can be computed by using the hit-the-best or rank-difference metric. In
other words, one result is obtained by comparing one CVI row in Table 3.6
with the ground truth (ARI). The outcome of a hit-the-best comparison
is either 0 or 1; 1 means that the best clustering identified by the CVI is the
same as that identified by ARI; otherwise, the outcome is 0. Table 3.7 shows
the hit-the-best results of the nine CVIs on the wine dataset.

Table 3.10: Hit-the-best results for real datasets.

Dataset      Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
Iris 0 0 0 0 0 0 0 1 0
digits 0 0 0 1 0 0 1 0 1
wine 1 0 1 0 1 1 1 0 1
cancer 0 0 0 0 0 0 1 0 0
faces 1 1 1 1 1 1 0 1 1
vertebral 0 0 0 0 0 0 0 0 0
haberman 0 1 0 0 1 0 0 0 0
sonar 0 1 0 0 1 0 0 0 0
tae 0 0 0 0 0 0 1 1 0
thy 0 0 0 0 0 0 0 0 0
vehicle 0 0 0 0 0 0 1 0 1
zoo 1 0 1 0 0 1 0 0 1
Totalb 3 3 3 2 4 3 5 3 5
(rank) (4) (4) (4) (9) (3) (4) (1) (4) (1)
a. Sil = Silhouette. b. Larger value is better (rank number is smaller).

Table 3.11: Rank-difference results for real datasets.

Dataset      Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
Iris 8 13 15 15 13 11 15 6 15
digits 2 2 1 1 4 6 8 7 6
wine 9 1 3 1 1 0 0 7 0
cancer 8 7 6 9 7 8 2 7 9
faces 4 3 4 4 2 3 9 2 5
vertebral 6 13 14 12 15 13 15 6 13
haberman 9 7 7 7 7 9 7 7 8
sonar 7 3 3 4 3 4 11 10 3
tae 9 14 9 9 14 15 0 9 9
thy 5 2 2 2 2 6 2 3 10
vehicle 12 11 9 13 13 12 3 3 7
zoo 1 6 1 6 6 1 9 8 1
Totalb 80 82 74 83 87 88 81 75 86
(rank) (3) (5) (1) (6) (8) (9) (4) (2) (7)
a. Sil = Silhouette. b. Smaller value is better (rank number is smaller).

The outcome of the rank-difference
comparison is a value in the range [0, N(N − 2)], where N is the sequence
length. As Table 3.8 shows, the length of sequences is 5; hence, the range
of rank-difference is [0, 15]. Table 3.9 shows the rank-difference results of
nine CVIs on the wine dataset. The smaller rank-difference value means the
CVI predicts better.
We applied the evaluation method to the selected CVIs (Table 3.4) by
using real and synthetic datasets (Section 3.5.1.2) and the five clustering
methods (Table 3.6). Table 3.10 and Table 3.12 are hit-the-best comparison
results for real and synthetic datasets. Table 3.11 and Table 3.13 are rank-
difference comparison results for real and synthetic datasets. To compare
across datasets, we summed all results at the bottom of each table. For the
hit-the-best comparison, a larger total value is better because more hits
appear. For the rank-difference comparison, a smaller total value is better
because the results of the CVI are closer to those of ARI. Finally, ranks in the
last row uniformly indicate CVIs’ performances. The smaller rank number
means better performance. Since there are 97 synthetic datasets, to keep
the tables to manageable lengths, Tables 3.12 and 3.13 present illustrative
values for the datasets and most importantly, the totals and ranks for each
measure.

3.5.3.2 Prediction of number of clusters

Another strategy of CVI evaluation is to predict the number of clusters


(classes). Its detailed process is described in Section 3.5.2.2. The clustering
methods we selected require setting the number of clusters (classes) in
advance; they are k-means, spectral clustering, and the EM algorithm. The a
priori numbers of clusters we set for the three algorithms are k = 2, 3, 4, 5, 6
(the real number of clusters is included).

Table 3.12: Hit-the-best results for 97 synthetic datasets.

Dataset        Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
3-spiral 1 0 0 0 0 0 0 1 0
aggregation 0 0 0 0 0 0 1 1 1
.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
zelnik5 1 0 0 0 0 0 0 1 0
zelnik6 1 1 0 0 1 0 0 0 0
Totalb 46 30 35 35 29 31 35 50 40
(rank) (2) (8) (4) (4) (9) (7) (4) (1) (3)
a. Sil = Silhouette. b. Larger value is better (rank number is smaller).

Table 3.13: Rank-difference results for 97 synthetic datasets.

Dataset        Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
3-spiral 2 12 14 13 14 12 13 1 13
aggregation 3 3 2 2 4 5 2 5 3
.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
zelnik5 4 10 12 10 11 11 10 4 11
zelnik6 4 3 2 2 3 3 5 2 2
Totalb 406 541 547 489 583 554 504 337 415
(rank) (2) (6) (7) (4) (9) (8) (5) (1) (3)
a. Sil = Silhouette. b. Smaller value is better (rank number is smaller).

The clustering algorithms were applied to four datasets: the wine, tae, thy,
and vehicle datasets (see Table 3.5 for details).
Tables 3.14 to 3.17 show the predicted numbers of clusters for each CVI,
clustering algorithm, and dataset. The number of clusters predicted by a CVI
is the number whose clustering performs best on that CVI. The table captions
give the real number of clusters (classes) for each dataset. A CVI makes a
successful prediction when its predicted number of clusters equals the real
number of clusters. In the results, it is worth noting
that only DSI successfully predicted the number of clusters from spectral
clustering for all datasets. This implies that DSI may work well with the
spectral clustering method.

Table 3.14: Number of clusters prediction results on the wine dataset (178
samples in 3 classes).

Validity      KMeans   Spectral Clustering   EM
Dunn 3b 4 6
CH 3 3 3
DB 3 3 3
Sila 3 3 3
WB 3 3 3
I 3 2 2
CVNN 2 2 2
CVDD 2 2 2
DSI 3 3 3
a. Sil = Silhouette; the same applies to the following tables. b. Bold value:
a successful prediction, i.e., the CVI’s predicted number of clusters equals
the real number of clusters; the same applies to the following tables.

Table 3.15: Number of clusters prediction results on the tae dataset (151
samples in 3 classes).

Validity      KMeans   Spectral Clustering   EM
Dunn 2 3 2
CH 6 6 4
DB 6 6 5
Sil 6 3 2
WB 6 6 6
I 5 3 3
CVNN 2 2 2
CVDD 2 2 2
DSI 6 3 5

Table 3.16: Number of clusters prediction results on the thy dataset (215
samples in 3 classes).

Validity      KMeans   Spectral Clustering   EM
Dunn 5 2 6
CH 3 3 3
DB 5 3 4
Sil 4 3 2
WB 6 3 6
I 3 3 4
CVNN 2 2 2
CVDD 2 2 5
DSI 5 3 6

Table 3.17: Number of clusters prediction results on the vehicle dataset
(948 samples in 4 classes).

Validity      KMeans   Spectral Clustering   EM
Dunn 6 2 5
CH 2 2 2
DB 2 2 2
Sil 2 2 2
WB 3 3 3
I 5 2 5
CVNN 2 2 2
CVDD 2 2 2
DSI 5 4 5

3.5.4 Discussion

Although DSI obtains only one first rank (Table 3.10) compared with the
other CVIs in these experiments, having no last rank means that it still
performs better than some of them. It is worth emphasizing that all the
compared CVIs are excellent and widely used. The experiments therefore
show that DSI can join them as a promising new CVI. Indeed, by examining
those CVI evaluation results, we confirm that none of the CVIs performs
well for all datasets. Thus, it is better to measure clustering results using
several effective CVIs, and the DSI provides another option. DSI is also
unique: none of the other CVIs behaves the same as DSI. For example, in
Table 3.10, for the vehicle dataset, only CVNN and DSI predicted correctly,
but for the zoo dataset, CVNN was wrong and DSI was correct. For another
example, in Table 3.11, for the sonar dataset, DSI performed better than
Dunn, CVNN, and CVDD; but for the cancer dataset, Dunn, CVNN, and
CVDD performed better than DSI. More examples of the diversity of CVI are
shown in Table 3.18 and their plots with true labels are shown in Figure 3.20
(the atom dataset has three features, and the others have two features).

Table 3.18: Rank-difference results for selected synthetic datasets.

Dataset        Dunn  CH  DB  Sil a  WB   I  CVNN  CVDD  DSI
atom 0 15 15 15 15 14 4 0 0
disk-4000n 10 0 7 0 0 0 11 12 1
disk-1000n 6 12 15 12 13 14 15 8 14
D31 5 1 2 1 0 2 10 2 0
flame 10 6 11 7 7 8 12 11 7
square3 11 0 2 0 0 7 0 11 0
a. Sil = Silhouette.

The preceding examples show the need for employing multiple CVIs, because
each is different and every CVI may have its own special capability. That
capability, however, is difficult to describe clearly. The definitions of some
CVIs show them to be center- or non-center-representative [114] or
density-representative. Similarly, the DSI is a separability-representative
CVI. That is, DSI performs better for clusters that have high separability
under the true labels (like the atom dataset in Figure 3.20); conversely, if
the real clusters have low separability, incorrectly predicted clusters may
receive a higher DSI score (Figure 3.21).
Clusters in datasets are highly diverse, so diversity in clustering methods
and CVIs is necessary. Since the preferences of CVIs are difficult to analyze
precisely and quantitatively, more studies on selecting a proper CVI to
measure clusters without true labels should be performed in the future.
Having more CVIs expands the options. Until approaches are discovered for
selecting an optimal CVI to measure clusters, it is meaningful to provide more
effective CVIs and to apply more than one CVI to evaluate clustering results.
In addition, evaluating CVIs is itself an important task. Its general
process is:

Figure 3.20: Examples for rank-differences of synthetic datasets: (a) atom,
(b) disk-4000n, (c) disk-1000n, (d) D31, (e) flame, (f) square3.

1. Create different clusters from datasets.

2. Compute an external CVI, using the true labels as ground truth, and compute
the internal CVIs.

3. Compare the results of the internal CVIs with the ground truth. Results
from an effective internal CVI should be close to the results of the
external CVI.

In this study, we generated different clusters using a variety of clustering
methods. Generating different clusters can also be achieved by changing
parameters of the clustering algorithms (e.g., the number of clusters k in
k-means clustering) or by taking subsets of the datasets. The comparison step
could also use other methods besides the two evaluation metrics that we have
used: hit-the-best and rank-difference.
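
As a concrete illustration of this process, the following minimal Python sketch uses scikit-learn clustering methods, several internal CVIs, and the external ARI. It is a sketch under the assumption that a dsi_score(X, labels) implementation of our DSI is available; it is not the exact code used in our experiments.

# Minimal sketch of the CVI-evaluation process (dsi_score is assumed to exist).
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (adjusted_rand_score, silhouette_score,
                             calinski_harabasz_score, davies_bouldin_score)

def evaluate_cvis(X, true_labels, k_range=range(2, 7)):
    """Step 1: create different clusterings; Steps 2-3: score them with
    internal CVIs and with the external ARI (ground truth) for comparison."""
    records = []
    for k in k_range:
        clusterings = {
            'KMeans': KMeans(n_clusters=k, n_init=10).fit_predict(X),
            'Spectral': SpectralClustering(n_clusters=k).fit_predict(X),
            'EM': GaussianMixture(n_components=k).fit(X).predict(X),
        }
        for method, labels in clusterings.items():
            records.append({
                'method': method, 'k': k,
                'ARI': adjusted_rand_score(true_labels, labels),    # external CVI
                'Silhouette': silhouette_score(X, labels),           # internal CVIs
                'CH': calinski_harabasz_score(X, labels),
                'DB': davies_bouldin_score(X, labels),
                # 'DSI': dsi_score(X, labels),  # our separability-based CVI (assumed)
            })
    return records

# An effective internal CVI should rank the clusterings similarly to the ARI
# (e.g., under hit-the-best, its preferred k should match the k preferred by ARI).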

Figure 3.21: Wrongly-predicted clusters have a higher DSI score than the real
clusters. (a) Real clusters: DSI ≈ 0.456; (b) predicted clusters: DSI ≈ 0.664.

3.6 Conclusion

As the most common sustained arrhythmia, atrial fibrillation (AF) is
expected to affect more than 10 million people by 2050 [56]. This study
aims to develop imaging tools for real-time visualization of ablated tissue.
The long-term goal of our studies is to help develop an intracardiac auto-
fluorescence-based Hyper-Spectral Imaging (aHSI) catheter that can improve
the success rate of radiofrequency ablation (RFA) treatment, reduce the
incidence of AF recurrence, and help avoid re-treatment of previously
ablated tissue.
Here we have shown that k-means, an approach that does not require
a priori knowledge of tissue spectra, can also be an effective means of
detecting lesions from aHSI hypercubes. The average accuracy for detection
by k-means (k=10) using 31 features was about 74% when compared to
reference images. Secondly, we have also demonstrated that the number
of spectral bands (which are referred to as "features") can be reduced (by
grouping them) without significantly affecting lesion detection accuracy.
Specifically, we show that by using the best four grouped features, the
accuracy of lesion identification was about 94% of that using 31 features.
The time cost of 4-feature clustering was about 40% of the 31-feature
clustering, demonstrating that 4-feature grouping can speed up acquisition
and processing. From an instrumentation point of view, by using a limited
number of features one is able to combine multiple spectral bands into one
spectrally wide band. This is extremely beneficial for low-light applications
such as implementation of aHSI via catheter access.
Furthermore, to evaluate clustering results, such as those of k-means, it is
essential to apply various CVIs, because there is no universal CVI for all
datasets and no specific method for selecting a proper CVI to measure clusters
without true labels. In this study, we propose the DSI as a novel CVI based on
a data separability measure. Since the goal of clustering is to separate a
dataset into clusters, we hypothesize that better clustering causes these
clusters to have higher separability.
Including the proposed DSI, we applied nine internal CVIs and one external
CVI, the Adjusted Rand Index (ARI), as ground truth to the clustering results
of five clustering algorithms on various real and synthetic datasets. The
results show the DSI to be an effective and unique CVI that is competitive
with the other CVIs compared here. We also summarized the general process for
evaluating CVIs and used two methods to compare the results of the CVIs with
the ground truth. We created the rank-difference as an evaluation metric to
compare two score sequences; this metric avoids two disadvantages of the
hit-the-best measure, which is commonly used in CVI evaluation. We believe
that both the DSI and the rank-difference metric can be helpful in future
clustering analyses and CVI studies.

Chapter 4: Breast Cancer Detection Using Explainable Deep
Learning

In this chapter, we first present our work on breast cancer detection
using a deep learning model, the Convolutional Neural Network (CNN). To
address the shortcoming of having an insufficient number of images to train
CNN models, we applied the Generative Adversarial Network (GAN) to create
synthetic mammographic images for training, and we used transfer learning.
Then, toward explainable/transparent deep learning, we present our studies
on deep learning models: we evaluate the performance of GAN models with the
Distance-based Separability Index (DSI) presented earlier, analyze the
generalizability of deep neural networks, and create a method to estimate the
training accuracy of two-layer neural networks.

4.1 Introduction

Breast cancer is the second leading cause of death among U.S. women
and will be diagnosed in about 12% of them [248, 53]. Commonly used
mammographic detection based on Computer-Aided Diagnosis (CAD) methods
can improve treatment outcomes for breast cancer and increase survival times
for patients [215]. These traditional CAD tools, however, have a variety of
drawbacks because they rely on manually designed features; the process of
hand-crafted feature design can be tedious, difficult, and non-generalizable
[293]. In recent years, developments in machine learning have provided
alternative methods to CAD for feature extraction; one is to learn features
from whole images directly through a CNN [163, 125].

Usually, training the CNN from scratch requires a large number of labeled
images [73]; for example, the AlexNet (a classical CNN model) was trained
by using about 1.2 million labeled images [146]. For some kinds of medical
image data such as mammographic tumor images, to obtain a sufficient
number of images to train a CNN classifier is difficult because the true
positives are scarce in the datasets and expert labeling is expensive [111].
The shortcomings of an insufficient number of images for training a classifier
are well known [146, 207]; thus, it is worthwhile to research solutions.
One promising solution is to reuse as the feature extractor a pre-trained
CNN model that has been trained with very large image datasets from
other fields, or re-train (fine-tune) such a model using a limited number of
labeled medical images [267]. This approach is also called transfer learning,
which has been successfully applied to various computer vision problems
[244, 15, 202]. In fact, some results of transfer learning are counter-
intuitive: previous studies on pulmonary embolism and melanocytic
lesion detection [267, 75] show that the features (connection weights in the
CNN) learned from natural images could be transferred to medical images,
even if the target images greatly differ from the pre-trained source images.
Another solution is to apply image augmentation to create new training
images and thus improve the performance of a CNN classifier. Previous
approaches to image augmentation modified the original images by rotation,
shifting, scaling, shearing, and/or flipping. The potential problem with
such processing is that slightly changed images are similar to the original
ones, so they may not serve as new training images that improve the
performance of a CNN classifier. Large changes, on the other hand, may alter
the structure or pattern of objects in the training images and degrade the
performance of the classifier. An alternative image augmentation method is to

101
generate synthetic images using features extracted from the original images.
These generated images are not exactly like the original ones but can keep
the essential features, structures, or patterns of the objects in the original
images. In this respect, the Generative Adversarial Network (GAN) is an ideal
candidate for such an image augmentation method. Like the CNN, the GAN is a
neural network-based learning method, introduced by Goodfellow et al. in
2014 [90], and it is a state-of-the-art technique in the field of deep
learning [110]. GAN has many novel applications in the field of image
processing, for example, image translation [279, 294], object detection [154],
super-resolution [150], and image blending [286]. Recently, various GANs have
also been developed for medical imaging, such as GANCS [174] for MRI
reconstruction, and SegAN [287], DI2IN [291], and SCAN [49] for medical image
segmentation. In this study, synthetic mammographic images are generated by a
GAN to improve the performance of a CNN classifier.
Recently, the number of types of GANs has grown to about 500 [107], and a
substantial number of studies address the theory and applications of GANs in
various fields of image processing. Compared with the theoretical progress
and applications of GANs, however, fewer studies have focused on evaluating
or measuring GAN performance [29]. Most existing GAN measures are based on
classification performance (e.g., the Inception Score) or statistical metrics
(e.g., the Fréchet Inception Distance). A more fundamental alternative
approach to evaluating a GAN is to analyze the generated images directly,
instead of using them as inputs to other classifiers (e.g., the Inception
network) and then analyzing the outcomes. In this study, we propose a
fundamental way to analyze GAN-generated images quantitatively and
qualitatively.

In addition, we have examined two more basic questions for the CNN
and deep learning models: the generalizability of deep neural networks
and how to understand the mechanism of neural network models. For
supervised learning models, like CNN, the analysis of generalization ability
(generalizability) is vital because the generalizability expresses how well a
model will perform on new data. Traditional generalization measures, such
as the VC dimension [276], do not apply to Deep Neural Network (DNN)
models. Thus, new theories to measure the generalizability of DNNs are
required. In this study, we hypothesize that a DNN with a simpler decision
boundary has better generalizability, by the law of parsimony (Occam's
Razor) [25]. Although the DNN technique plays an important role in machine
learning, comprehensively understanding the mechanisms of DNN models and
explaining their output results still require more basic research [223].
There are mainly three ways to understand the mechanisms of DNN models, that
is, the transparency of deep learning: the training process [66],
generalizability [159], and loss or accuracy prediction [12]. Besides the
analysis of DNN generalizability, in this study we also create a novel theory
from scratch to estimate the training accuracy of two-layer neural networks
applied to random datasets. Such studies may provide starting points for new
ways for researchers to make progress on the difficult problem of
understanding deep learning.

4.2 Breast Cancer Detection Using Transfer Learning in Convolutional Neural Networks

1 This work has been published in [C12].

Currently, CNN has been applied to medical image classification in three
major ways: 1) training a CNN from scratch [285, 199, 247]; 2) using a pre-
trained CNN model to extract features from medical images [87, 20, 46]
and 3) fine-tuning pre-trained CNN model on medical images [239, 35,
155]. In this study, we compared the three main techniques to detect
breast cancer using the Mammographic Image Analysis Society (MIAS)
mammogram database [258].
Previous studies have applied various machine learning methods for
breast cancer/tumor detection using mammograms [81]. The MIAS
database is a commonly used public mammogram database. Some studies used
traditional automatic feature extraction (not manual extraction) techniques,
such as the Gabor filter, the fractional Fourier transform, and the Gray-Level
Co-occurrence Matrix (GLCM), to obtain features, and then applied a Support
Vector Machine (SVM) or another classifier for classification
[135, 210, 136, 303, 187]. Neural networks were also used as classifiers
[280, 194], and some studies applied CNNs to generate features from
mammographic images [312, 129, 59, 28]. Some of these studies used
pre-trained CNNs as applications of transfer learning. Few previous studies,
however, presented results obtained by using only a CNN for both feature
generation and classification for breast cancer detection in mammograms. In
our study, we used only one CNN: its front convolutional layers are
responsible for feature generation, and the back fully-connected (FC) layers
are the classifier. Thus, the input to our CNN is a mammographic image and its
output is the (predicted) label.
We tested three training methods on the MIAS dataset: 1) training a CNN
from scratch; 2) applying the pre-trained VGG-16 model [251] to extract
features from the input images and using these features to train a
neural-network classifier; and 3) updating the weights in the last several
layers of the VGG-16 model by back-propagation (fine-tuning) to detect
abnormal regions. By comparison, we found that method 2) is ideal for this
study.

4.2.1 MIAS Mammograms and Images Pre-processing

Mammography is the process of using low-energy X-rays to examine the human
breast for diagnosis and screening. The Mammographic Image
Analysis Society (MIAS) is an organization of UK research groups interested
in the understanding of mammograms and has generated a database of
digital mammograms. The MIAS database has 322 images including 102
abnormal and 220 normal samples. The locations and boundaries of these
abnormal regions are given.
We downloaded all mammographic images in the MIAS database from its official
website. The images in MIAS are in PGM format, which can be read and processed
directly by MATLAB. MIAS describes abnormal regions by circular boundaries;
their center locations (X, Y) and radius values are given in the
documentation.
We used regions of interest (ROIs) instead of whole images to train the
neural networks. These ROIs are cropped rectangular images obtained as
follows:

• Abnormal ROIs, from images containing abnormalities, are the minimum
rectangular areas surrounding the whole given ground-truth boundaries.

• We first obtained the abnormal ROIs. Normal ROIs are also rectangular
images, with sizes about the average size of the abnormal ROIs in the same
database; their locations were randomly selected on normal breast areas. In
this study, we cropped only one ROI from each whole normal breast image.
2 http://peipa.essex.ac.uk/info/mias.html

The sizes of the abnormal ROIs vary with the abnormality boundaries. Since the
CNN requires all input images to be one specific size, and the usual inputs
for a CNN are RGB images (images in MIAS are grayscale and the input of the
VGG-16 model must be RGB), we resized the ROIs by resampling and converted
them to RGB (3-channel cubes) by duplication.
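
For illustration only, the following is a minimal Python sketch of this pre-processing under stated assumptions: the ROI is taken as the minimum square around the given circular boundary (center and radius), the 421×421-pixel target size follows Section 4.2.3, and Pillow is assumed to be available for reading the PGM files (this is a sketch, not the exact pipeline used).

# Sketch: crop an MIAS ROI, resample it to a fixed size, and duplicate gray to RGB.
import numpy as np
from PIL import Image

def make_roi(pgm_path, cx, cy, radius, target=421):
    img = np.array(Image.open(pgm_path))                  # MIAS images are grayscale PGM
    # Minimum square region surrounding the circular ground-truth boundary
    x0, x1 = max(cx - radius, 0), min(cx + radius, img.shape[1])
    y0, y1 = max(cy - radius, 0), min(cy + radius, img.shape[0])
    roi = Image.fromarray(img[y0:y1, x0:x1]).resize((target, target))   # resample
    roi = np.array(roi)
    return np.stack([roi, roi, roi], axis=-1)              # grayscale -> RGB by duplication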

4.2.2 Pre-trained Model: VGG-16

For transfer learning, we applied the pre-trained VGG-16 model [251] in this
study. The VGG-16 network was proposed by the Oxford Visual
Geometry Group for the ImageNet Large-Scale Visual Recognition Challenge
(ILSVRC) competition. This model is known as one of the classic deep
convolutional networks; it was deeper and wider than previous architectures.
It mainly consists of five groups of convolution operations.
Adjacent convolution groups are connected through max-pooling layers.
Each group contains a series of 3×3-pixel convolutional layers. The VGG-16
model has 16 hidden layers in total, composed of 13 convolutional layers
and 3 FC layers.
This pre-trained VGG-16 network was trained with about 1.3 million images
(1000 classes) from the ImageNet database [226] (ILSVRC-2012 competition); it
achieved a 7.5% top-5 error on ILSVRC-2012-Val and a 7.4% top-5 error on
ILSVRC-2012-Test in the competition, surpassing human-level performance on
ImageNet [101].

4.2.3 Experiments and Results

We tested three training methods, including non-transfer learning and
transfer learning, with CNNs on the MIAS dataset. The three CNN
classification models are:

106
• Training a CNN from scratch (New-model)

• Feature extraction by the pre-trained VGG-16 network (Feature-model)

• Fine-tuning: updating the weights in the last layers of the pre-trained
VGG-16 model (Tuning-model)

The CNN was implemented with the Keras API on the TensorFlow backend [5];
the development environment for Python was Anaconda3. We randomly selected 95
ROI (cropped) images for each of the abnormal and normal classes and divided
them into training and validation sets at a ratio of 15:4. The labels are
binary: "0" stands for normal and "1" for abnormal. Our training method
(optimizer) was RMSprop [270] with the default parameters provided in Keras,
the loss function was binary cross-entropy, the monitored metric was accuracy,
the batch size was 15, and the total number of epochs was set to 500. For the
CNN classifiers, the input is an ROI image of size 421×421 pixels. Since the
sigmoid function is used in the output layer, the predicted outcome from a CNN
classifier is a value between 0 and 1. By default, the classification
threshold is 0.5: if the value is less than 0.5, the sample is considered "0"
(normal); otherwise it is considered "1" (abnormal).

4.2.3.1 To Train the CNN from Scratch (New-model)

We built our own CNN in this part; the details of its structure are shown in
Table 4.1. It consists of three convolutional layers with max-pooling layers
and one FC layer. The activation function for each layer is the ReLU
function [186], except the last one for output, which is the sigmoid function.
The notation Conv_3-32 means the layer has 32 convolutional neurons (units)
and the filter size in each unit is 3×3 pixels (height×width). MaxPool_2 means
a max-pooling layer whose filter is a 2×2-pixel window with stride 2, and
FC_64 means a fully-connected layer having 64 units. A Dropout layer [257]
randomly sets a fraction of its input units to 0 for the next layer at every
update during training; it helps the CNN avoid overfitting. The output layer
uses a sigmoid function, which maps the output value to the range [0, 1].

Table 4.1: CNN architecture for training from scratch.

input: RGB image


Conv_3-32 + ReLU
MaxPool_2
Conv_3-32 + ReLU
MaxPool_2
Conv_3-64 + ReLU
MaxPool_2
FC_64 + ReLU (with Dropout = 0.5)
output (sigmoid): [0, 1]
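
A minimal Keras sketch of the New-model in Table 4.1, with the training settings of Section 4.2.3 (RMSprop, binary cross-entropy, accuracy), is given below. Details not given in the table, such as the Flatten layer and the exact layer arguments, are assumptions.

# Sketch of the from-scratch CNN classifier (Table 4.1).
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(421, 421, 3)),
    layers.Conv2D(32, 3, activation='relu'), layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation='relu'), layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation='relu'), layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(64, activation='relu'), layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),          # output in [0, 1]; threshold at 0.5
])
model.compile(optimizer=optimizers.RMSprop(), loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=15, epochs=500, validation_data=(x_val, y_val))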

4.2.3.2 Transfer Learning: Feature Extraction by the Pre-trained VGG-16 Network (Feature-model)

The CNN structure for transfer learning combined the 13 convolutional layers
of the pre-trained VGG-16 model [251] with a simple FC layer. As shown in
Table 4.2, all the weights in the five convolutional blocks (Conv blocks 1-5)
were imported from the pre-trained VGG-16 model and were not changed (i.e.,
the weights were frozen) during the training of this CNN. Only the weights in
the FC layer (FC_256 + ReLU) were randomly initialized and updated by
training. Thus, this training process can be seen as the VGG-16 extracting
features from the input images, and these features then being used to train
an FC neural-network classifier.

Table 4.2: CNN architecture for transfer learning.

input: RGB image

VGG-16 convolutional blocks:
  Conv block 1: Conv_3-64 + ReLU; Conv_3-64 + ReLU; MaxPool_2
  Conv block 2: Conv_3-128 + ReLU; Conv_3-128 + ReLU; MaxPool_2
  Conv block 3: Conv_3-256 + ReLU; Conv_3-256 + ReLU; Conv_3-256 + ReLU; MaxPool_2
  Conv block 4: Conv_3-512 + ReLU; Conv_3-512 + ReLU; Conv_3-512 + ReLU; MaxPool_2
  Conv block 5: Conv_3-512 + ReLU; Conv_3-512 + ReLU; Conv_3-512 + ReLU; MaxPool_2

FC_256 + ReLU (with Dropout = 0.5)
output (sigmoid): [0, 1]
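
The following minimal Keras sketch illustrates this Feature-model: the VGG-16 convolutional base is loaded with ImageNet weights and frozen, and only the FC_256 head is trained. The exact calls are assumptions based on the standard Keras applications API.

# Sketch of the Feature-model: frozen VGG-16 convolutional base + trainable FC head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(421, 421, 3))
conv_base.trainable = False                         # freeze all five conv blocks

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),           # FC_256, randomly initialized
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])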

4.2.3.3 Transfer Learning: Fine-tuning (Tuning-model)

The CNN structure for fine-tuning is the same as that shown in Table 4.2. One
difference is in the training process: not all weights in the pre-trained
model are fixed. During fine-tuning, the weights in the first four
convolutional blocks (Conv blocks 1-4) were imported from the pre-trained
VGG-16 model and frozen, while the weights in the last convolutional block
(Conv block 5), also imported from the pre-trained VGG-16 model, were updated
by training. Another difference is that the weights in the FC layer (FC_256 +
ReLU) were imported from the previous feature-extraction training instead of
being randomly initialized; they were then further updated during fine-tuning.
Therefore, no weight was randomly initialized in fine-tuning.
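
Continuing the sketch above, fine-tuning can be expressed by unfreezing only the Conv block 5 layers of the VGG-16 base before training is resumed; the layer-name test assumes Keras' standard VGG-16 layer naming.

# Sketch of the Tuning-model: unfreeze only Conv block 5; blocks 1-4 stay frozen.
conv_base.trainable = True
for layer in conv_base.layers:
    layer.trainable = layer.name.startswith('block5')
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# Training then continues from the weights learned in the feature-extraction stage.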

4.2.3.4 Results

New-model (Non-transfer Learning) The result in Figure 4.1 shows the
classification accuracy of the New-model on the validation set. The blue curve
is the accuracy after each epoch of training; it was smoothed (the smoothing
interval is about 20 epochs) to yield the red curve, because we want to see
its tendency as the number of epochs increases. One epoch means the model has
been trained on all training data once. This result shows that the average
accuracy is low (max = 0.751) and that the accuracy curve (blue) has not
converged.

Figure 4.1: Result of the New-model (maximum smoothed accuracy = 0.751; about
9.27 s per epoch). The blue curve is the accuracy after each epoch of
training, and the red curve is the smoothed accuracy (the smoothing interval
is about 20 epochs).

Feature-model The result in Figure 4.2 shows that the average accuracy of
the Feature-model converged at about 0.906 (also max = 0.906) and that the
accuracy curve converged. The time cost per epoch is about 14% of that of the
New-model. This comparison demonstrates that the performance of the CNN with
transfer learning is much better than training from scratch for breast
cancer/tumor detection.

Figure 4.2: Result of the Feature-model (converged accuracy ≈ 0.906; about
1.33 s per epoch). The blue curve is the accuracy after each epoch of
training, and the red curve is the smoothed accuracy (the smoothing interval
is about 20 epochs).

Tuning-model The result in Figure 4.3 shows the average accuracy of the
Tuning-model can reach a maximum of 0.914 and the accuracy curve also
converged. Its performance is slightly improved (about 0.88%) compared to
the Feature-model. But the training time for each epoch is about 22 times
that of training the classifier by only feature extraction.

Figure 4.3: Result of the Tuning-model (maximum smoothed accuracy = 0.914;
about 29.03 s per epoch). The blue curve is the accuracy after each epoch of
training, and the red curve is the smoothed accuracy (the smoothing interval
is about 20 epochs).

4.2.4 Discussion

Figure 4.4 shows the classification accuracy of the three models, New-model
(yellow), Feature-model (red), and Tuning-model (blue), on the MIAS validation
set. The center line is the smoothed accuracy (the smoothing interval is about
20 epochs) and the width shows the departure from the mean. By comparison,
training the classifier on extracted features is the ideal method for this
study because its accuracy is very close to that of fine-tuning, and its time
cost is only about 5% of that of fine-tuning. For real applications, however,
fine-tuning is also feasible because there is enough time (off-line) to train
a very good model for implementation.
In future studies, we could try to recognize abnormal areas in whole
mammographic images instead of ROIs. As in object detection with region
proposals [217], we could use the CNN to recognize abnormalities in
mammographic images and draw boundaries (or rectangular region proposals)
Figure 4.4: Comparison of the three CNN classification models: New-model
(yellow; 0.751, 9.3 s/epoch); Feature-model, training a neural-network
classifier on extracted features (red; 0.906, 1.3 s/epoch); and Tuning-model
(blue; 0.914, 29.0 s/epoch). The values are the maximum smoothed accuracy and
the time cost (seconds) of training per epoch.

on such areas automatically. These regions do not have to be 100% accurate;
they simply provide another kind of reference for doctors making decisions.
We could also use other pre-trained models and compare their performances. In
the research field of deep learning, VGG-16 appeared early, and its depth
(23 layers in total) is relatively shallow compared with newer models such as
InceptionV3 (159 layers) [265], ResNet50 (168 layers) [100], and
InceptionResNetV2 (572 layers) [263]. It will be interesting to see the
performance of breast cancer detection using very deep CNNs.

4.3 Breast Cancer Detection Using Synthetic Mammograms from Generative Adversarial Networks

3 This work has been published in [J3].

In this section, we name the original images ORG images, the images augmented
by affine transformation AFF images, and the synthetic images generated by
GAN GAN images.
To compare the performance of GAN images with that of AFF images for image
augmentation, we first cropped regions of interest (ROIs) from images in the
Digital Database for Screening Mammography (DDSM) [104] as original (ORG)
ROIs. Second, using these ORG ROIs, we applied a GAN to generate the same
number of GAN ROIs; we also used the ORG ROIs to generate the same number of
AFF ROIs. Then, we used six groups of ROIs (GAN ROIs, AFF ROIs, ORG ROIs, and
three mixtures of any two of the three simple groups) to train a CNN
classifier from scratch for each group. We used the remaining ORG ROIs, which
were never used in augmentation or training, to validate the classification
outcomes.

4.3.1 Introduction of the Mammogram Data: DDSM

Mammography is the process of using low-energy X-rays to examine the human
breast for diagnosis and screening. There are two main angles at which the
X-ray images are taken: the cranio-caudal (CC) view and the
mediolateral-oblique (MLO) view. The goal of mammography is the early
detection of breast cancer [80], typically through the detection of masses or
abnormal regions in the formed X-ray images. Usually, such abnormal regions
are spotted by doctors or expert radiologists. In this study, we used
mammograms from the Digital Database for Screening Mammography (DDSM) [104].
The DDSM is a mammographic image resource widely used by the U.S.
mammographic image analysis research community. It is a collaborative effort
between Massachusetts General Hospital, Sandia National Laboratories, and the
University of South Florida Computer Science and Engineering Department. The
DDSM database contains approximately 2,620 mammograms in total: 695 normal
mammograms and 1925 abnormal mammograms (914 malignant/cancers, 870 benign,
and 141 benign without callback), with the locations and boundaries of the
abnormalities. Each case includes four images, representing the left and
right breasts in the CC and MLO views.

4.3.1.1 Images Pre-processing

We downloaded all mammographic images from DDSM's official website. Images in
DDSM are compressed in LJPEG format; to decompress and convert these images,
we used the DDSM Utility [245] and converted all images in DDSM to PNG format.
DDSM describes the location and boundary of each actual abnormality by chain
codes, which are recorded in OVERLAY files for each breast image containing
abnormalities. The DDSM Utility also provides a tool to read the boundary data
and display them for each image with abnormalities. Since the DDSM Utility
tools run on MATLAB, we implemented all pre-processing tasks in MATLAB. We
used regions of interest (ROIs) instead of entire images to train the CNN
classifiers. These ROIs are cropped rectangular images obtained as follows:

• Abnormal ROIs, from images containing abnormalities, are the minimum
rectangular areas surrounding the whole given ground-truth boundaries.

• Normal ROIs were cropped from the opposite breast of one containing an
abnormal ROI, with the same size and location as that abnormal ROI. If both
the left and right breasts had abnormal ROIs and their locations overlapped,
we discarded the sample. Because in most cases only one breast has a tumor,
and the left and right breasts are similar in area and shape, the normal ROIs
and abnormal ROIs have similar black background areas and scaling.

4 http://www.eng.usf.edu/cvprg/Mammography/Database.html

The ROIs selected for the experiments have no black background areas, shapes
close to square (width-height ratio < 1.2), and sizes larger than 320×320
pixels (to avoid up-sampling). The sizes of the abnormal ROIs vary with the
abnormality boundaries. Since the CNN requires all input images to be one
specific size, and the usual inputs for a CNN are RGB images (images in DDSM
are grayscale), we resized the ROIs by resampling and converted them to RGB
(3-channel cubes) by duplication (Figure 4.5). These images cropped from the
mammograms are the ORG ROIs.

4.3.2 Image Augmentation by Affine Transformation

The affine transformations that we applied to the ORG ROIs for image
augmentation are rotation, width shifting, height shifting, shearing, scaling,
horizontal flipping, and vertical flipping. All transformations happen
randomly, some within certain ranges: the rotation range is 0-30 degrees, and
width shifting, height shifting, shearing, and scaling range over 0%-20% of
the total image size. Since the size and position of the image content change
after an affine transformation, padding (filling) points outside the
boundaries is needed to keep the size of the output image. There are three
commonly used padding methods: set a constant value for all pixels outside
the boundaries, copy the values of the nearest pixels on the boundaries, or
reflect the image at the boundaries.


Figure 4.5: (A) A mammographic image from DDSM rendered in grayscale; (B) the
ROI cropped using the given ground-truth abnormality boundary; (C) conversion
of the grayscale ROI to an RGB image by duplication.

Figure 4.6 displays the results of the three padding methods. We chose the
padding method that yielded the best classification accuracy.

Figure 4.6: An input image and the results of affine transformation with the
three padding methods: constant, nearest, and reflect.
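
As an illustration, the affine augmentation with the ranges given above can be sketched with the Keras ImageDataGenerator, in which fill_mode selects the padding method (constant, nearest, or reflect). This is a sketch of the idea, not necessarily the exact implementation we used.

# Sketch of affine-transformation augmentation (AFF ROIs) with a selectable padding method.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aff = ImageDataGenerator(
    rotation_range=30,          # random rotation, 0-30 degrees
    width_shift_range=0.2,      # shifts/shear/scale within 0%-20% of the image size
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,             # scaling
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode='nearest',        # padding: 'constant', 'nearest', or 'reflect'
)
# aff.flow(org_rois, batch_size=32) yields randomly transformed AFF ROIs.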

4.3.3 Introduction of Generative Adversarial Network (GAN) Augmen-


tation

The GAN is a neural network-based generative model that learns the
probability distribution of real data and creates simulated data samples
with a similar distribution (Figure 4.7). Formally, in d-dimensional space,
for x ∈ R^d, y = p_data(x) is a mapping from x to real data y. We create a
neural network called the generator G to simulate this mapping. If a sample y
comes from p_data, it is a real one; if a sample z comes from G, it is a
synthetic one. Another neural network, the discriminator D, is used to detect
whether a sample is real or synthetic; ideally, D(y) = 1 and D(z) = 0. The two
neural networks G and D compose the GAN. We can find G and D by solving the
two-player minimax game [90] with value function V(G, D):

min_G max_D V(G, D) = E[ log D(p_data(x)) ] + E[ log(1 − D(G(x))) ]

This min-max problem has a global optimum (Nash equilibrium) solution at
G(x) = p_data(x); that is the goal: to find the distribution of the real data.
At equilibrium, the discriminator D can no longer distinguish real from
synthetic samples, and D(y) = D(z) = 0.5. Synthetic samples can be generated
from G by changing the input x. In this study, the input x for G was a noise
vector with 100 elements drawn from a Gaussian distribution N(0, 1). The key
point of a well-trained GAN is that it can generate seemingly realistic data
samples given noise vectors. To train a GAN, we used a limited number of real
samples; ideally, the GAN can then generate an unlimited number of different
synthetic samples.
Since GANs were introduced, they have been widely used in many
image-processing applications [110]. In medical imaging, many applications of
GANs are for segmentation [287, 49, 313, 218, 254, 142], and some studies
address medical image simulation/synthesis [117, 45, 192, 24, 93]. Image
synthesis is a specialty and advantage of GANs; hence, it is apt to apply a
GAN as an image augmentation method [216] for training classifiers and
improving
Figure 4.7: The principle of GAN. A noise vector x from the standard Gaussian
distribution N(0, 1) is mapped by the generator G to a generated sample
z = G(x), simulating the real mapping y = p_data(x); the discriminator D
compares generated samples with real samples from the real data distribution,
and its output is used to update G.

their detection performance. As far as we are aware, there was no study (as of
2017) on using a GAN as a data augmentation method for mammograms to train a
CNN classifier for breast cancer detection. Our study filled this gap.

4.3.3.1 Image Augmentation by GAN

To implement the GAN, we built the generator and discriminator neural
networks; the details of their structures are shown in Table 4.3. The
generator consists of four up-sampling layers, each doubling the size of the
image, and five convolutional layers. The activation function for each layer
is the ReLU function [186] except the last one for output, which is the tanh
function. The function of the generator is to transform a 100-length vector
into a 320×320×3 image. The input of the discriminator is a 320×320×3 image
and its output is a value between 0 and 1, where '0' stands for a synthetic
image and '1' for a real one. Like a typical CNN, the discriminator has four
convolutional layers with max-pooling layers and one FC layer. The activation
function for each convolutional layer is also the ReLU function, and the last
one for output is the sigmoid function, which maps the output value to the
range [0, 1].
The notation Conv_3-32 means the layer has 32 convolutional neurons (units)
and the filter size in each unit is 3×3 pixels (height × width). MaxPool_2
means a max-pooling layer whose filter is a 2×2-pixel window with stride 2,
and FC_n means a fully-connected layer having n units. A Dropout layer [257]
randomly sets a fraction of its input units to 0 for the next layer at every
update during training; it helps the networks avoid overfitting. Our training
optimizer is Nadam [65] with default parameters (except the learning rate,
changed to 1e-4), the loss function is binary cross-entropy, the monitored
metric is accuracy, the batch size is 30, and the total number of epochs is
set to 1e+5.
The steps of the training method for the GAN are:

1. Randomly initialize all weights for both networks.

2. Input a batch of 100-length noise vectors to the generator to obtain
synthetic images.

3. Train the discriminator with a batch of synthetic images labeled '0' and
real images labeled '1'.

4. Train the generator: input a batch of 100-length noise vectors to the
generator to obtain synthetic images and label them '1'. Then, input these
synthetic images to the discriminator to obtain the predicted labels. The
differences between the predicted labels and '1' are the loss for
Table 4.3: The architecture of generator and discriminator neural networks.

Generator
Layer Shape
input: 100-length vector 100
FC_(256x20x20) + ReLU 102400
Reshape to 20x20x256 20x20x256
Normalization + Up-sampling 40x40x256
Conv_3-256 + ReLU 40x40x256
Normalization + Up-sampling 80x80x256
Conv_3-128 + ReLU 80x80x128
Normalization + Up-sampling 160x160x128
Conv_3-64 + ReLU 160x160x64
Normalization + Up-sampling 320x320x64
Conv_3-32+ ReLU 320x320x32
Normalization + Conv_3-3+ ReLU 320x320x3
output (tanh): [−1, 1] 320x320x3

Discriminator
Layer Shape
input: RGB image 320x320x3
Conv_3-32 + ReLU 320x320x32
MaxPooling_2 + Dropout (0.25) 160x160x32
Conv_3-64 + ReLU 160x160x64
MaxPooling_2 + Dropout (0.25) 80x80x64
Conv_3-128 + ReLU 80x80x128
MaxPooling_2 + Dropout (0.25) 40x40x128
Conv_3-256 + ReLU 40x40x256
MaxPooling_2 + Dropout (0.25) 20x20x256
Flatten 102400
FC_1 1
output (sigmoid): [0, 1] 1

updating the generator. It is noteworthy that in this step, only the weights
in the generator are changed; the weights in the discriminator are fixed.

5. Repeat Steps 2 to 4 until all real images have been used once, which
counts as one epoch. When the number of epochs reaches a set value, training
stops.

In fact, for Step 5, the ideal point at which to stop training is when the
classification accuracy of the discriminator converges to 50%: the
discriminator can then no longer distinguish real images from the synthetic
images generated by a well-trained generator. The discriminator plays the role
of an assistant in the GAN. After training, we use the generator network to
generate the synthetic images used in the next stage.
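
A condensed Keras-style sketch of Steps 2-5 is given below; generator and discriminator are assumed to be built as in Table 4.3, iterate_batches is a hypothetical helper that walks through the real ROIs once per epoch, and the combined model is the usual construction in which the discriminator's weights are held fixed while the generator is updated.

# Sketch of the GAN training loop (Steps 2-5); generator/discriminator as in Table 4.3.
import numpy as np
from tensorflow.keras import models

discriminator.compile(optimizer='nadam', loss='binary_crossentropy')
discriminator.trainable = False                       # frozen only inside the combined model
combined = models.Sequential([generator, discriminator])
combined.compile(optimizer='nadam', loss='binary_crossentropy')

batch = 30
for epoch in range(num_epochs):
    for real_images in iterate_batches(real_rois, batch):            # hypothetical helper
        noise = np.random.normal(0, 1, (batch, 100))                  # Step 2
        fake_images = generator.predict(noise)
        discriminator.train_on_batch(real_images, np.ones((batch, 1)))    # Step 3: real = '1'
        discriminator.train_on_batch(fake_images, np.zeros((batch, 1)))   #         fake = '0'
        noise = np.random.normal(0, 1, (batch, 100))                  # Step 4: update G only
        combined.train_on_batch(noise, np.ones((batch, 1)))           # fakes labeled '1'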

4.3.4 Experiments

Our neural networks were implemented with the Keras API on the TensorFlow
backend [5]; the development environment for Python was Anaconda3. We applied
affine transformations and a GAN to augment the images, and compared the two
augmentation methods by training CNN classifiers and comparing their
classification accuracy.
A CNN was designed as the discriminator in the GAN; its function is to
distinguish real from synthetic mammographic ROIs. We also built a CNN to
classify abnormal and normal ROIs, called the CNN tumor classifier. As shown
in Table 4.4, this CNN classifier consists of three convolutional layers with
max-pooling layers and two FC layers. The activation function for each layer
is the ReLU function except the last one for output; the output layer uses a
sigmoid function, which maps the output value to the range [0, 1]. Its input
is an image of size 320×320 pixels. Since the sigmoid
Table 4.4: The architecture of CNN classifier.

CNN classifier
Layer Shape
input: RGB image 320x320x3
Conv_3-32 + ReLU 320x320x32
MaxPooling_2 160x160x32
Conv_3-32 + ReLU 160x160x32
MaxPooling_2 80x80x32
Conv_3-64 + ReLU 80x80x64
MaxPooling_2 40x40x64
Flatten 102400
FC_64 + ReLU + Dropout (0.5) 64
FC_1 1
output (sigmoid): [0, 1] 1

function is used in the output layer, the predicted outcome from the CNN
classifier is a value between 0 and 1. By default, the classification
threshold is 0.5: if the value is less than 0.5, the sample is considered "0"
(normal); otherwise it is considered "1" (abnormal). The optimizer for
training is Nadam with default parameters [139] (except the learning rate,
changed to 1e-4), the loss function is binary cross-entropy, the monitored
metric is accuracy, the batch size is 26, and the total number of epochs is
set to 750. To train this CNN classifier from scratch, we used labeled ROIs of
abnormal and normal mammographic images. The training data include ORG, AFF,
and GAN ROIs, but the validation data are only ORG ROIs.
For the affine transformation, we first decided on the padding method. We
collected 1300 real abnormal ROIs (Oabnorm, 'O' for original) and 1300 real
normal ROIs (Onorm) in total. After setting aside 10% for validation, there
are 1170 Oabnorm and 1170 Onorm. We first augmented the 1170 Oabnorm and 1170
Onorm by affine transformations to obtain 1170 Aabnorm ('A' for affine) and
1170 Anorm; the details are given in Section 4.3.2.
Figure 4.8: Validation accuracy of CNN classifiers trained by three types of
AFF ROIs.

Table 4.5: Notations for data. (abnorm = abnormal / norm = normal)

Set name   Notation for element   Meaning
ORG ROIs   Oabnorm / Onorm        Real abnormal/normal ROI
AFF ROIs   A^padding_class        Affine-transformed ROI from one class, using
                                  padding method constant/nearest/reflect
GAN ROIs   Gabnorm / Gnorm        Synthetic abnormal/normal ROI from GAN

For the three padding methods, we denote the augmented data A^constant,
A^nearest, and A^reflect. We then trained three CNN classifiers from scratch
on three datasets: [1170 A^constant_abnorm, 1170 A^constant_norm],
[1170 A^nearest_abnorm, 1170 A^nearest_norm], and [1170 A^reflect_abnorm,
1170 A^reflect_norm], respectively. Figure 4.8 shows the validation accuracy
of the three CNN classifiers. Clearly, the CNN classifier trained with the
nearest-padding AFF ROIs has the best overall performance; therefore, we used
the nearest-padding AFF ROIs for the rest of our experiments.
We then used the ORG ROIs to train two generators: GANabnorm and
GANnorm for generating GAN ROIs. As shown in Figure 4.9 (GAN box), during

Figure 4.9: The flowchart of our experiment plan. From DDSM, 420 abnormal and
420 normal ROIs were cropped; 20% (84 of each class) were set aside for
validation, leaving 336 ORG ROIs per class. The GAN (generator and
discriminator) and the affine transformations (AFF) each produced 336
augmented ROIs per class. CNN classifiers were trained with data including
ORG, AFF, and GAN ROIs; the validation data for the classifier were ORG ROIs
that were never used for training.

the training process, the generator G provided synthetic ROIs to the
discriminator D. D was trained on real and synthetic ROIs to distinguish the
real from the synthetic ones, and when synthetic ROIs were distinguished, D
provided a feedback loss to G for updating G; G then generates synthetic ROIs
that are more like the real ones. By inputting noise vectors to GANabnorm and
GANnorm, we obtained 336 Gabnorm and 336 Gnorm. We repeatedly trained the CNN
classifier from scratch using the different datasets of labeled ROIs shown in
Table 4.6. In each set, the numbers of abnormal ROIs and normal ROIs are equal
(Figure 4.9). We used 84 Oabnorm and 84 Onorm that were never used in the
training process as validation data to evaluate those CNN classifiers.

4.3.5 Results

For training GAN, we used 336 real abnormal ROIs to obtain the generator
GANabnorm , and used 336 real normal ROIs to obtain the generator GANnorm .
Figure 4.10 shows some synthetic abnormal ROIs (Gabnorm ) generated from
GANabnorm . Then, we generated 336 Gabnorm and 336 Gnorm by generators.

Table 4.6: Training plans. Training by using the CNN classifier in Table 4.4.
Notations are described in Table 4.5.

Set#   Dataset for training
1      336 Oabnorm labeled '1'; 336 Onorm labeled '0'
2      336 Gabnorm labeled '1'; 336 Gnorm labeled '0'
3      336 A^nearest_abnorm labeled '1'; 336 A^nearest_norm labeled '0'
4      336 Oabnorm + 336 Gabnorm labeled '1'; 336 Onorm + 336 Gnorm labeled '0'
5      336 Oabnorm + 336 A^nearest_abnorm labeled '1'; 336 Onorm + 336 A^nearest_norm labeled '0'
6      336 Gabnorm + 336 A^nearest_abnorm labeled '1'; 336 Gnorm + 336 A^nearest_norm labeled '0'

Validation (all sets): 84 Oabnorm labeled '1'; 84 Onorm labeled '0'

The results for training accuracy and validation accuracy after each training
epoch (defined in Section 4.3.3.1, training method Step 5; the total number of
epochs is 750) are shown in Figure 4.11. By inspecting the figures, Sets 1, 4,
and 5 perform well and Set 3 is the worst. To analyze those results
quantitatively, we report the stable standard deviation (SStd, the standard
deviation of the validation accuracy after 600 epochs), the maximum validation
accuracy (Best), the average validation accuracy after 600 epochs
Figure 4.10: (Top row) Real abnormal ROIs; (bottom row) synthetic abnormal
ROIs generated from the GAN.

Figure 4.11: Training accuracy and validation accuracy for the six training
datasets.

(Stable), and the time cost (in seconds) for each training epoch. The maximum
validation accuracy indicates the best performance of the classifier, but it
may be reached fortuitously. The average validation accuracy after 600 epochs
shows the stable performance of the classifier; for a good classifier, this
value will be monotonically increasing and converged. The SStd shows how the
validation accuracy varies around its average after 600 epochs. Table 4.7
shows these quantitative results.
Since the maximum validation accuracy may be fortuitous, the stable
performance is more reliable for evaluating a classifier. From Table 4.7, we
observe that:

• ORG ROIs must be added to the training set, because the stable performances
of the sets without ORG ROIs are lower than 70%.

• Comparing Set 2 with Set 3, the GAN-generated images may have features
closer to the real images than the affine-transformed images; and comparing
Set 4 with Set 5, GAN ROIs are better than AFF ROIs for image augmentation,
even though the synthetic ROIs in Figure 4.10 look somewhat artificial.

• Since the performance of GAN is better than that of affine transformation
for image augmentation, GAN could be an alternative augmentation method for
training CNN classifiers.

For training with only real ROIs, the validation accuracy is lower than for
training with added GAN ROIs; adding AFF ROIs also improves the validation
accuracy. Therefore, image augmentation is necessary to train CNN classifiers,
and since GAN performs better than affine transformation, GAN could be a good
alternative option. However, the GAN ROIs may have features that differ from
the ORG ROIs because over-fitting occurred; adding ORG ROIs to the training
set helps correct this problem. The images augmented by GAN or affine
transformation cannot substitute for real images in training CNN classifiers,
because the absence of real images in the training set causes over-fitting.

4.3.6 Discussion

The hypothesis of the GAN is that, in d-dimensional space, there exists a
mapping function p_data(x) from a vector x to real data y; a GAN can learn and
simulate this mapping with a function G(x), using samples from the
distribution of the real data. G(x) is also called the generator, and the
ideal outcome is G(x) = p_data(x). The maximum validation accuracy for
training with GAN ROIs is about 79.76%,
Table 4.7: Analysis of validation accuracy for CNN classifiers.

Set#            Best perf.(a) (%)   Stable perf. (%)   SStd (%)   Time/epoch (s)
1 (ORG)         78.75               73.48              1.29       7.01
2 (GAN)         79.76               64.52              3.81       4.00
3 (AFF)         73.21               60.36              2.45       4.04
4 (ORG + GAN)   85.12               74.96              1.65       10.15
5 (ORG + AFF)   81.55               71.32              2.12       9.82
6 (GAN + AFF)   80.95               69.31              1.93       6.79
a. perf = performance.

which shows that the generator acquired some important features from the ORG
ROIs. But the GAN ROIs may also have features that differ from the ORG ROIs;
thus, the stable accuracy is about 9% lower. Adding ORG ROIs to the training
set helps correct this problem.
Since abnormal ROIs may contain more features than normal ROIs, we take a
statistical view to compare the real abnormal ROIs with their augmented ROIs:
Oabnorm, A^nearest_abnorm, and Gabnorm. For each category, we use 336 samples
and compute their mean, standard deviation (Std), skewness, and entropy. We
then plot the normalized values as histograms to see their distributions.
Owing to limited space, we display only the Std and the mean in Figure 4.12.
In terms of the distribution of the mean, GAN is more like ORG than AFF is,
but the distribution of the Std shows the opposite. To quantitatively analyze
the difference between distributions, we calculate the Wasserstein distance
[229] between two histograms. The Wasserstein distance is smaller when the
difference between two distributions is smaller, and it equals 0 when the two
distributions are identical. Table 4.8 shows the Wasserstein distances of ORG
ROIs vs. GAN ROIs and of ORG ROIs vs. AFF ROIs for the four statistical
criteria. GAN ROIs are closer to ORG ROIs
Figure 4.12: Histograms of the mean and the standard deviation (normalized).

than AFF in mean and entropy but farther in Std and skewness. Such results may
explain why GAN ROIs provide valid image augmentation. These results also
suggest how to improve the GAN: we could modify the GAN to generate images
that have smaller Wasserstein distances to the real images in those
statistical criteria. In fact, the Wasserstein GAN [11] is designed with a
similar idea.

Table 4.8: Wasserstein distance between two histograms.

Criterion   336 Oabnorm vs. 336 Gabnorm   336 Oabnorm vs. 336 A^nearest_abnorm
Mean        0.083                         0.185
Std         0.100                         0.040
Skewness    0.101                         0.047
Entropy     0.111                         0.456
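
A minimal Python sketch of this comparison is given below; org_rois and gan_rois are assumed to be arrays of images, the per-image statistics are normalized jointly, and SciPy's one-dimensional Wasserstein distance is used between the resulting empirical distributions.

# Sketch: compare two ROI sets by the Wasserstein distance of their per-image statistics.
import numpy as np
from scipy.stats import skew, entropy, wasserstein_distance

def image_stats(images):
    """Per-image mean, std, skewness, and gray-level entropy for an array of images."""
    flat = images.reshape(len(images), -1).astype(float)
    hists = [np.histogram(x, bins=256, density=True)[0] + 1e-12 for x in flat]
    return {'mean': flat.mean(axis=1), 'std': flat.std(axis=1),
            'skewness': skew(flat, axis=1),
            'entropy': np.array([entropy(h) for h in hists])}

def normalize_pair(a, b):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    return (a - lo) / (hi - lo + 1e-12), (b - lo) / (hi - lo + 1e-12)

org, gan = image_stats(org_rois), image_stats(gan_rois)     # org_rois, gan_rois assumed given
for key in ('mean', 'std', 'skewness', 'entropy'):
    a, b = normalize_pair(org[key], gan[key])
    print(key, round(wasserstein_distance(a, b), 3))          # smaller = distributions more alike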

Theoretically, a well-trained GAN could generate images with the same
distribution as the real images; the synthetic images would then have zero
Wasserstein distance to the real images in any statistical criterion. If so,
the performance of a CNN classifier trained with GAN ROIs would be as good as
one trained with ORG ROIs. Our results, however, show that, based on the
distributions and the training performance, the GAN did not meet this
theoretical expectation. The problem can be seen by looking at the synthetic
images (Figure 4.10): they have a clearly artificial appearance. One possible
reason is that the GAN adds features or information that do not belong to the
real images; that is why the distributions of the four statistical criteria
for GAN ROIs differ from those for ORG ROIs. Those new features hinder
classifiers from detecting the abnormal features in real images and lower the
validation accuracy. A possible solution is to change the architecture of the
generator and/or discriminator in the GAN. In this study, the architecture we
used is DCGAN [209]; there are now about 500 different GAN architectures
[107], and we believe some of them can achieve better performance for image
augmentation.

4.4 Evaluation of Generative Adversarial Network Performance

5 This work has been published in [J2] and [C7].

We applied the Generative Adversarial Network (GAN) to generate synthetic
mammograms. In this study, we propose a fundamental way to analyze
GAN-generated images quantitatively and qualitatively.
We briefly introduce the two commonly used GAN evaluation methods, the
Inception Score (IS) [231] and the Fréchet Inception Distance (FID) [106], and
four additional measures: the 1-Nearest Neighbor classifier (1NNC) [164], the
Mode Score (MS) [39], the Activation Maximization (AM) score [310], and the
Sliced Wasserstein distance (SWD) [26]. We then compare those results with our
proposed
measure. In addition, we discuss how these evaluations could help us to
deepen our understanding of GANs and to improve their performance.

4.4.1 Introduction of GAN Evaluation Metrics

The optimal GAN for images can generate images that have the same
distribution as the real samples (used for training), are different from the
real ones (not duplicates), and have variety. Expectations for the generated
images can be described by three aspects: 1) non-duplication of the real
images; 2) the generated images should have the same style, which we take to
mean that their distribution is close to that of the real images; and 3) the
generated images are different from each other. Therefore, we evaluate the
performance of a GAN as an image generator according to the three aspects:

• Creativity: non-duplication of the real images. This checks for overfitting
by GANs.

• Inheritance (or visual fidelity): the generated images should have the same
style, retaining key features of the real (input) images. This is traded off
against the creativity property, because generated images should be neither
too similar nor too dissimilar to the real ones.

• Diversity: the generated images are different from each other. A GAN should
not generate a few dissimilar images repeatedly.

Figure 4.13 displays four counterexamples of ideal generated images.
Here, we use the Distance-based Separability Index (DSI) to define a measure,
the Likeness Score (LS), to evaluate GAN performance according to the three
expectations of ideal generated images. The LS offers a direct way to measure
the difference or similarity between images based on the Euclidean

Figure 4.13: Problems of generated images from the perspective of
distribution. The area within the dotted line is the distribution of the real
images; the dark-blue dots are real samples and the red dots are generated
images. (a) Overfitting: lack of Creativity. (b) Lack of Inheritance. (c) Mode
collapse and (d) mode dropping; both (c) and (d) are examples of lack of
Diversity.

distance; it has a simple and uniform framework for the three aspects of ideal
GANs and depends less on visual evaluation.
The proposed LS measure analyzes the generated images directly, without using
pre-trained classifiers. We applied the measure to the outcomes of several
typical GANs, DCGAN [209], WGAN-GP [94], SNGAN [181], LSGAN [172], and
SAGAN [301], on various image datasets. The results show that the LS reflects
GAN performance well and is very competitive with the other compared measures.
In addition, the LS is stable with respect to the number of images and can
explain the results in terms of the three aspects of ideal GANs.

4.4.2 Related Work

Recently, the two most widely applied indexes to evaluate GANs per-
formance are the Inception Score (IS) [231] and Fréchet Inception Distance
(FID) [106]. They both depend on the pre-trained Inception network [264]
that was trained on the ImageNet [51] dataset.

4.4.2.1 KL Divergence Based Evaluations

From the perspective of the three aspects of ideal GANs, the IS focuses on
measuring inheritance and diversity. Specifically, let x ∈ G be a generated
image; y = InceptionNet(x) is the label obtained from the pre-trained
Inception network by inputting image x. For all generated images, we have the
label set Y. H(Y) defines the diversity (H(·) is entropy), because the
variability of the labels reflects the variability of the images. H(Y|G) can
show the inheritance, because a good generated image can be well recognized
and classified, and thus the entropy of p(y|x) should be small. Therefore, an
ideal GAN will maximize H(Y) and minimize H(Y|G); equivalently, the goal is to
maximize:

H(Y) − H(Y|G) = E_G[ D_KL( p(y|x) ‖ p(y) ) ]

D_KL is the Kullback–Leibler (KL) divergence of two distributions [147]. The
IS index is defined as:

IS(G) = exp( E_G[ D_KL( p(y|x) ‖ p(y) ) ] )
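
Given the matrix of conditional label probabilities p(y|x) produced by the Inception network for all generated images, the IS can be computed as in the following NumPy sketch, which is a direct translation of the formula rather than a reference implementation.

# Sketch: Inception Score from the Inception network's class probabilities p(y|x).
import numpy as np

def inception_score(p_yx):
    """p_yx: array of shape (N, num_classes); each row is a probability distribution."""
    p_y = p_yx.mean(axis=0, keepdims=True)                  # marginal label distribution p(y)
    kl = np.sum(p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return float(np.exp(kl.mean()))                         # exp( E_G[ D_KL(p(y|x) || p(y)) ] )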

The IS mainly shows diversity and reflects inheritance to some extent; a
larger IS indicates better GAN performance. The substantial limitations of the
IS are:

1. It depends on the classification of images by the Inception network, which
is trained on ImageNet, and it uses the generated data without exploiting the
real data. Thus, the IS may not be appropriate for other images or for
non-classification tasks, because it cannot properly show the inheritance if
the data differ from those used in ImageNet.

2. Creativity is not considered by the IS because it ignores the real data,
and it has no ability to detect overfitting. For example, if the set of
generated images were a copy of the real images and very similar to images in
ImageNet, the IS would give a high score.

The main drawback of the IS is its disregard of the real data. Thus, to
improve on the IS, the Mode Score (MS) [39] and the Activation Maximization
(AM) score [310] include real data in their computations. Specifically, let
z ∈ R be a real image; y_r = InceptionNet(z) is the label obtained from the
pre-trained Inception network by inputting the real image z. The MS is then
defined as:

MS(R, G) = exp( E_G[ D_KL( p(y|x) ‖ p(y_r) ) ] − D_KL( p(y) ‖ p(y_r) ) )

And the AM is defined as:

AM(R, G) = E_G[ H(y|x) ] + D_KL( p(y_r) ‖ p(y) )

Like the IS, a larger value of MS is better, but a smaller value of AM is better.

4.4.2.2 Distance-based Evaluations

The FID also exploits the real data and uses the pre-trained Inception
network. Instead of the output labels, it uses feature vectors from the final
pooling layer of the Inception network; all real and generated images are
input to the network to extract their feature vectors.
Let ϕ(·) = InceptionNet_lastPooling(·) be the feature extractor, and let
F_r = ϕ(R) and F_g = ϕ(G) be the two groups of feature vectors extracted from
the real and generated image sets. Consider that the distributions of F_r and
F_g are multivariate Gaussian:

F_r ~ N(µ_r, Σ_r);  F_g ~ N(µ_g, Σ_g)

The difference between the two Gaussians is measured by the Fréchet distance:

FID(R, G) = ‖µ_r − µ_g‖₂² + Tr( Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2) )
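
Given the two groups of Inception feature vectors F_r and F_g, the FID of the formula above can be computed as in the following NumPy/SciPy sketch.

# Sketch: FID from Inception feature vectors of real (fr) and generated (fg) images.
import numpy as np
from scipy.linalg import sqrtm

def fid(fr, fg):
    """fr, fg: arrays of shape (N, d) holding Inception final-pooling features."""
    mu_r, mu_g = fr.mean(axis=0), fg.mean(axis=0)
    cov_r, cov_g = np.cov(fr, rowvar=False), np.cov(fg, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_mean):                 # discard tiny numerical imaginary parts
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean))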

In fact, FID measures the difference between distributions of real and


generated images; that agrees with the goal of GAN training – to minimize
the difference between the two distributions. The FID measure, however, depends on the multivariate Gaussian assumption for Fr and Fg: Fr ∼ N(µr, Σr); Fg ∼ N(µg, Σg). This assumption of multivariate Gaussian feature vectors cannot always be guaranteed because some features may not be Gaussian distributed. And in a high-dimensional space, because of the curse of dimensionality, the amount of data may not be large enough to form a multivariate Gaussian distribution (which requires a large amount of data, according to the Central Limit Theorem). In addition, as with the IS, the FID depends on the pre-trained Inception network.
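As a concrete illustration, the FID between two sets of feature vectors can be sketched as below; the feature matrices feat_r and feat_g are assumed to come from the last pooling layer of an Inception network (here they are random placeholders), and scipy.linalg.sqrtm provides the matrix square root. This is a minimal sketch rather than the exact code used in this work.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_r, feat_g):
    """Frechet Inception Distance between two (n_samples, n_features) arrays
    of feature vectors, under the multivariate Gaussian assumption."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    cov_r = np.cov(feat_r, rowvar=False)
    cov_g = np.cov(feat_g, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real   # matrix square root of (Sigma_r Sigma_g)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Placeholder features standing in for InceptionNet pooling outputs.
rng = np.random.default_rng(0)
feat_r = rng.normal(size=(1000, 64))
feat_g = rng.normal(loc=0.3, size=(1000, 64))
print(fid(feat_r, feat_g))
```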
To avoid the Gaussian assumption, we can directly compute the Wasser-
stein distance [230] between the real data distribution Pr ∼ R and the gener-
ated data distribution Pg ∼ G. In fact, the well-known Wasserstein GAN [10]
uses this distance to optimize the GAN models. It is very difficult to compute
the Wasserstein distance between two distributions in high dimensions by its
original definition. In practice, the Sliced Wasserstein distance (SWD) [26] is
applied to approximate the Wasserstein distance between real and generated
images. The key idea of SWD is to obtain several random radial projections
of data from high dimensions to one-dimensional spaces and compute their

136
1-D Wasserstein distances, which have simple solutions [212, 3].
Compared to the IS and FID, the SWD directly uses the real and generated images without auxiliary networks, but it requires that the two data sets have the same number of images: |R| = |G|. Usually, the amount of real data is smaller than that of generated data (generated data can be produced in arbitrarily large amounts). In addition, the result of SWD generally differs with each application of the algorithm because of its dimensionality reduction by random projections. Thus, its average value over repeated computations must be taken.
As with the FID, the Wasserstein distance measures the difference be-
tween distributions of real and generated images and a good GAN can
minimize the difference between the two distributions. Hence, for FID and
SWD, the smaller value is better.
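The sliced Wasserstein idea can be sketched as follows: project both image sets onto random unit directions and average the 1-D Wasserstein distances of the projections, which reduce to differences of sorted values when |R| = |G|. Flattening images into row vectors and the number of projections are illustrative choices here, not the exact settings of the SWD implementation compared in the experiments.

```python
import numpy as np

def sliced_wasserstein(real, gen, n_projections=128, seed=0):
    """Approximate SWD between two (n_samples, n_features) arrays with
    equal sample sizes, using random radial projections to 1-D."""
    assert real.shape == gen.shape, "this sketch requires |R| = |G|"
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_projections):
        v = rng.normal(size=real.shape[1])
        v /= np.linalg.norm(v)                   # random unit direction
        pr, pg = np.sort(real @ v), np.sort(gen @ v)
        dists.append(np.mean(np.abs(pr - pg)))   # 1-D Wasserstein-1 via sorted samples
    return float(np.mean(dists))

# Toy usage with flattened "images" as rows.
rng = np.random.default_rng(1)
real = rng.uniform(size=(200, 32 * 32))
gen = rng.uniform(size=(200, 32 * 32))
print(sliced_wasserstein(real, gen))
```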

4.4.2.3 Other Evaluations

As illustrated by the FID and SWD, comparing the distributions of real and generated data is an important idea for GAN evaluation. The Classifier Two-sample Test (C2ST) [152] examines whether two samples belong to the same distribution through a selected classification method. Specifically, any two-class classifier can be employed in the C2ST. To create a C2ST without an additional classifier, Lopez-Paz and Oquab [164] introduced the 1-Nearest Neighbor Classifier (1NNC) measure that uses a two-sample test with the 1-Nearest Neighbor (1-NN) method on the real and generated image sets. Similar to the SWD, the 1NNC examines whether the distributions of real and generated images are identical, and it also requires the numbers of real and generated images to be equal.
Suppose |R| = |G|; we apply Leave-One-Out Cross-Validation (LOOCV) to a 1-NN classifier trained on the dataset {R ∪ G} with labels “1” for R and “0” for G. For each validation result, the accuracy is either 1 or 0; the Leave-One-Out (LOO) accuracy is the final average of all validation results.

• LOO accuracy ≈ 0.5 is the optimal situation because the two distribu-
tions are very similar.

• LOO accuracy < 0.5: the GAN is overfitting to R because the generated data are very close to the real samples. In an extreme case, if the GAN memorizes every sample in R and then generates them identically, i.e., G = R, the accuracy would be 0 because every sample from R would have its nearest neighbor in G at zero distance.

• LOO accuracy > 0.5: the two distributions are different (separable). If they are completely separable, the accuracy would be 1.

Compared to the IS and FID, the 1NNC is an independent measure that needs no auxiliary pre-trained classifiers. However, the |R| = |G| requirement limits its applications, and the local conditions of the distributions greatly affect the 1-NN classifier. For the 1NNC, 0.5 is the best score. To compare it with other scores, we regularize the 1NNC by the function:

r(x) = −|2x − 1| + 1 (4.1)

Let r1NNC = r(1NNC). Therefore, for r1NNC, the best score is 1 and a larger value is better.
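A minimal sketch of the 1NNC and its regularized version r1NNC (Eq. 4.1) follows, using scikit-learn's KNeighborsClassifier with leave-one-out cross-validation; the flattened-image placeholders are illustrative, and the exact implementation used in the experiments may differ.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def r1nnc(real, gen):
    """LOO accuracy of a 1-NN classifier on {R ∪ G} (labels 1 for real,
    0 for generated), regularized by r(x) = -|2x - 1| + 1."""
    X = np.vstack([real, gen])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(gen))])
    clf = KNeighborsClassifier(n_neighbors=1)
    loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    return -abs(2.0 * loo_acc - 1.0) + 1.0, loo_acc

rng = np.random.default_rng(0)
real = rng.uniform(size=(100, 64))
gen = rng.uniform(size=(100, 64))
score, loo = r1nnc(real, gen)
print(score, loo)   # score near 1 and LOO accuracy near 0.5 when the distributions are similar
```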
As reported by Borji [29], many other GAN evaluation measures have been proposed recently. Measures like the Average Log-likelihood [268], Coverage Metric [273], and Maximum Mean Discrepancy (MMD) [91] depend on selected kernels. Measures like the Classification Performance (e.g., FCN-score) [121], Boundary Distortion [235], Generative Adversarial Metric (GAM) [119], Normalized Relative Discriminative Score (NRDS) [304], and Adversarial Accuracy and Divergence [292] use various types of auxiliary models. Some measures compare real and generated images based on image-level techniques [253, 298], such as SSIM, PSNR, and filter responses. The idea of the Geometry Score (GS) [137] is similar to our proposed LS in some aspects, but its results are unstable and rely on required parameters6. We will further discuss the GS later.
By considering the complexity of the algorithms, efficiency in high dimensions, dependency on models or parameters, the extent of use in the GAN study field, and code availability for implementation, we finally chose the IS, FID, r1NNC (C2ST), MS, AM, and SWD from the currently-used quantitative measures to compare with our proposed LS.

4.4.3 Likeness Score: A Modified DSI for GANs Evaluation

As with the FID, 1NNC, and SWD, examining how close the distributions of real and generated images are to each other is an effective way to measure GANs, because the goal of GAN training is to make the generated images have the same distribution as the real ones.
Considering a dataset that contains real and generated data, the most
difficult situation to separate the two classes (or two types: real and gen-
erated data) of data arises when the two classes are scattered and mixed
together in the same distribution. In this sense, the separability of real and
generated data could be a promising measure of the similarity of the two
distributions. As the separability increases, the two distributions have more
differences. Therefore, we propose to use the Distance-based Separability Index (DSI) to analyze how two classes of data are mixed together.
6 In practice, we used the codes provided by its author: https://github.com/KhrulkovV/geometry-score.
Since for GANs’ evaluation, there are only two classes: the real image
set R and generated image set G, we have two ICD sets and one BCD set
(see their definitions in Section 2.3.1). The DSI can be applied in a multi-
class scenario by one-versus-others but here, we focus on the computation
of DSI for GANs’ evaluation (two-class scenario).
Similar to the procedure shown in Section 2.3.2 (but the last step is
different), to compute the LS for two classes R and G:

1. First, compute the ICD sets of R and G: {dr}, {dg}, and the BCD set: {dr,g}.

2. Second, to examine the similarity of the distributions of the ICD and BCD sets, apply7 the Kolmogorov–Smirnov (KS) distance [79]:

sr = KS({dr}, {dr,g}), and sg = KS({dg}, {dr,g}).

3. Finally, the LS for GAN evaluation is calculated from the maximum of the two KS distances:

LS({R, G}) = 1 − max{sr, sg},

because the maximum value can highlight the difference between the ICD and BCD sets (a minimal implementation sketch is given after the remark below).

Remark. The similarity of the distributions of the two ICD sets, KS({dr}, {dg}), is not used because it reflects only the difference of the distribution shapes, not their locations. For example, two distributions that have the same shape but no overlap will have zero KS distance between their ICD sets: KS({dr}, {dg}) = 0.
7 In experiments, we used scipy.stats.ks_2samp from the SciPy package in Python to compute the KS distance. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
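The sketch below follows the three steps above, using Euclidean distances for the ICD and BCD sets and scipy.stats.ks_2samp for the KS distance (as noted in the footnote); the pairwise-distance helpers and the flattened-image placeholders are illustrative simplifications rather than the exact code used in this work.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def likeness_score(real, gen):
    """LS({R, G}) = 1 - max(KS(ICD_r, BCD), KS(ICD_g, BCD)) for two
    (n_samples, n_features) arrays of flattened images."""
    icd_r = pdist(real)                 # intra-class distances within R
    icd_g = pdist(gen)                  # intra-class distances within G
    bcd = cdist(real, gen).ravel()      # between-class distances
    s_r = ks_2samp(icd_r, bcd).statistic
    s_g = ks_2samp(icd_g, bcd).statistic
    return 1.0 - max(s_r, s_g)

# Toy usage: two samples from the same distribution give an LS close to 1.
rng = np.random.default_rng(0)
real = rng.uniform(size=(200, 64))
gen = rng.uniform(size=(200, 64))
print(likeness_score(real, gen))
```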

Figure 4.14 displays artificial 2D examples of generated data (orange


points; blue points are real data) that respectively lack creativity, diversity,
and inheritance. With respect to the ICD and BCD sets, if the generated data
overfit the real data (lack of creativity), peaks will appear in the distribution
of BCD near zero (see Figure 4.14a) because there are many generated
points that are close to real data points in their distribution space; hence,
many BCD values are close to zero. Similarly, lack of diversity implies that many
generated data points are close to each other; thus, many ICD values are
close to zero and peaks will appear in the distribution of ICD near zero
(see Figure 4.14b). Lack of inheritance is shown by the difference between
the distributions of ICD and BCD (see Figure 4.14c) because if and only if
the two classes (real data and generated data) have the same distribution,
the distributions of ICD and BCD sets are identical. In that case, there is
neither lack of creativity nor lack of diversity. This is because there will be
no isolated peaks of ICD or BCD near zero. Therefore, the LS evaluates the GAN's performance well by measuring creativity, diversity, and inheritance.
The LS ranges from 0 to 1; an LS close to 1 (low separability) means that the ICD and BCD sets are very similar and, by Theorem 2.1, the distributions of real and generated data are similar too. Hence, the GAN performs well: the closer the LS is to 1, the better the GAN performs.

4.4.4 Experiments and Results

The first experiment has two purposes: one is to test the stability of
the proposed measure, i.e., how little the results change when different
amounts of data are used. Another purpose is to find the minimum amount of data required for the following experiments, because a GAN could generate unlimited data and we wish to bound that amount to make computation practicable.

Figure 4.14: Lack of Creativity, Diversity, and Inheritance in 2D. Each case shows the data plot (real data in blue, generated data in orange) and histograms of the ICD set for real data, the ICD set for generated data, and the BCD set between the classes: in (a), lack of Creativity, the BCD peaks near zero; in (b), lack of Diversity, the generated ICD peaks near zero; in (c), lack of Inheritance, the BCD differs from the ICDs. Histograms of (a) and (b) are zoomed to ranges near zero; (c) has the entire histogram.
The following experiments compare our measure LS with the commonly
used measures: IS and FID, and other selected measures. The purpose is
not to show which GAN is better but to show how the results (values) of our
measure compare to those of existing measures.

4.4.4.1 One Image Type by DCGAN

Table 4.9: Measure values for different numbers of generated images

# LS IS FID r1NNC† MS AM SWD GS


120 0.613 1.435 148.527 0.850 0.791 456.660 717.471 0.311
240 0.644 1.424 134.484 0.858 0.809 456.119 673.341 0.757
480 0.636 1.409 135.317 0.821 0.834 451.786 668.462 1.074
960 0.622 1.447 145.142 0.833 0.852 451.338 667.519 0.908
1200 0.630 1.426 141.818 0.862 0.827 454.656 675.751 1.000
2400 0.628 1.431 146.109 0.850 0.844 452.077 685.621 0.454
4800 0.622 1.440 145.109 0.851 0.842 451.255 678.986 0.526
Dashed line: to the left are our proposed measures; to the right are
compared measures.
† r1NNC is the regularized 1NNC, defined by Eq. 4.1.

To test the proposed measures, in the first experiment, we used one


type of image (Plastics; 12 images) from the USPtex database [16] to train
a DCGAN. Then, the trained GAN generated several groups containing
different amounts of synthetic images. Finally, we computed the results of our
proposed measure (LS), IS, FID, r1NNC, MS, AM, SWD and GS by using these
generated images and 12 real images; the results are shown in Table 4.9.
Computations of FID, r1NNC and SWD require that the two image sets
have the same number of images. We divided the generated images into
many 12-image subsets to compute the scores with 12 real images and then
obtained their average values. Figure 4.15 shows the plots of these scores.

Figure 4.15: Plots of the values in Table 4.9 versus the number of generated images (y-axis: index value; x-axis: # of generated images; series: LS, IS, FID/100, r1NNC, MS, AM/1000, SWD/1000, GS).

To fit the axes, the values of FID, AM, and SWD are scaled by 0.01, 0.001, and 0.001, respectively. The result indicates that the scores, except the GS, are stable across different numbers of testing images, especially when the amount is greater than 1,000. We remove the GS from further comparisons because its results are highly unstable with respect to the amount of data.

4.4.4.2 Four Image Types and Three GANs

In the second experiment, four types of image (Holes, Small leaves,


Big leaves, and Plastics; 12 images for each type) are used to train three
GANs (DCGAN, WGAN-GP, and SNGAN). Then, the trained GANs generated
1,200 synthetic images for each type. Twelve sets of synthetic images
were generated; Figure 4.16 shows samples from 4 real image sets and
12 generated image sets. Visual examination of these synthetic images indicates that the DCGAN produces the images most similar to the real ones, but many of its generated images are duplications of real ones; thus, the DCGAN overfitted the training data. The SNGAN's generated images are the most dissimilar from the real images; they lack the inheritance feature. The WGAN-GP balanced the creativity and inheritance features well.
We applied these measures to the 12 generated image sets; the results are shown in Table 4.10 and plotted in Figure 4.17. To emphasize the rank of each score for the different generators and image types, values are normalized and ranked from 0 to 1 by column for plotting; 0 is for the worst (model) performance and 1 is for the best. Table 4.11 shows the scores averaged by GAN model, which summarizes the comparison of the three GANs; Figure 4.17 gives more details. In general, the absolute values of the measures are not significant, but their ranks matter, because for infinite-range measures such as IS, FID, and SWD the values depend highly on the input data. Therefore, little importance should be attached to their differences.
For the best generator, the proposed LS agrees with IS, 1NNC, SWD, and
the visual appearance of generated images. Since the DCGAN overfitted to
training data, it lacks creativity, but FID, MS, and AM rank it as the best
model. All measures, including the LS, rank SNGAN as the worst.

Figure 4.16: Column 1: samples from the four types of real images (Hole, Small leaf, Big leaf, Plastic); columns 2–4: samples from synthetic images of the three GANs (DCGAN, WGAN-GP, SNGAN) trained on the four types of images.
Table 4.10: Measure results

* LS IS FID↓ r1NNC MS AM↓ SWD↓


DC-h 0.747 1.222 102.805 0.892 0.866 407.841 862.241
DC-sl 0.611 1.171 155.973 0.858 0.934 511.218 944.228
DC-bl 0.262 1.321 172.296 1.000 0.573 509.649 1053.687
DC-pla 0.630 1.426 141.818 0.908 0.827 454.656 678.210
W-h 0.771 1.163 233.277 0.958 0.671 607.249 604.263
W-sl 0.465 1.369 400.036 0.983 0.155 726.232 702.976
W-bl 0.626 1.536 375.987 0.975 0.117 779.834 650.157
W-pla 0.441 1.555 513.268 0.792 0.026 1108.549 732.241
SN-h 0.594 1.317 252.857 1.000 0.467 570.819 778.487
SN-sl 0.025 1.105 469.795 0.133 0.158 879.136 1110.309
SN-bl 0.000 1.083 456.813 0.195 0.077 1086.094 1221.202
SN-pla 0.000 1.037 485.716 0.000 0.032 1399.649 1229.506
*Generator models: DC: DCGAN, W: WGAN-GP, SN: SNGAN. Generated
image types: h: hole, sl: small leaf, bl: big leaf, pla: plastic.
Dashed line: to the left are our proposed measures; to the right are the
compared measures.
↓ Measures with this symbol mean smaller score is better; otherwise, larger

score is better.

Table 4.11: Measure results averaged by generators

Model LS IS FID↓ r1NNC MS AM↓ SWD↓


DCGAN 0.562 1.285 143.223 0.915 0.800 470.841 884.592
WGAN-GP 0.576 1.406 380.642 0.927 0.242 805.466 672.409
SNGAN 0.155 1.135 416.295 0.332 0.184 983.924 1084.876
Bold value: the best model by the measure of this column.
Underline: the worst model by the measure of this column.
Dashed line: to the left are our proposed measures; to the right are the
compared measures.
↓ Measures with this symbol mean smaller score is better; otherwise, larger

score is better.

Figure 4.17: Normalized and ranked scores. The x-axis shows the measures (LS, IS, FID, r1NNC, MS, AM, SWD) and the y-axis shows their normalized values; 0 is for the worst (model) performance and 1 is for the best (model) performance. Colors denote generators (DCGAN, WGAN-GP, SNGAN) and shapes denote image types (hole, small leaf, big leaf, plastic).
Table 4.12: Measure results on CIFAR-10

Model LS IS FID↓ r1NNC MS AM↓ SWD↓


DCGAN 0.833 4.311 147.110 0.772 1.878 335.879 710.993
WGAN-GP 0.957 3.408 136.121 0.932 1.483 507.374 276.189
SNGAN 0.593 2.049 219.762 0.534 0.860 631.807 743.679
LSGAN 0.745 3.405 136.132 0.716 1.337 450.250 710.747
SAGAN 0.688 2.075 206.046 0.545 0.814 611.706 595.761
Bold value: the best model by the measure of this column.
Underline: the worst model by the measure of this column.
Dashed line: to the left are our proposed measures; to the right are the
compared measures.
↓ Measures with this symbol mean smaller score is better; otherwise, larger

score is better.

SNGAN ranks worst because it lacks diversity. In particular, for SNGAN-big leaf and SNGAN-plastic, whose LS values are zero (in Table 4.10), almost all generated images are the same (but different from the real ones).

4.4.4.3 Five GANs on CIFAR-10

In the third experiment, we used the CIFAR-10 dataset that is widely


used in machine learning to train more types of GANs (DCGAN, WGAN-
GP, SNGAN, LSGAN, and SAGAN). A 2,000-image subset was chosen randomly from the training set of CIFAR-10 to train the five GANs. Five sets
of synthetic images were generated; Figure 4.18 shows samples from the
original 2,000-image subset and five generated image sets.
Each trained GAN generated 2,000 synthetic images, and we applied the LS and the other six measures to the five generated image sets and the original 2,000-image subset. Results are shown in Table 4.12. The LS agrees with FID, 1NNC, and SWD that WGAN-GP is the best GAN model, but IS, MS, and AM rank DCGAN as the best model. For the worst model, the LS agrees with all the other measures except the MS; the MS shows that SAGAN performs worst, but the MS scores of SAGAN and SNGAN are small and close.

Figure 4.18: Column 1: samples from real images of CIFAR-10; columns 2–6: samples from synthetic images of the five GANs (DCGAN, WGAN-GP, SNGAN, LSGAN, and SAGAN) trained on the original 2,000-image subset.
Figure 4.19: Processes to build the real set and the generated sets, including the optimal generated set (Opt.) and the generated sets that lack creativity (LC), lack diversity (LD), lack creativity & diversity (LC&D), and lack inheritance (LIn).

4.4.4.4 Virtual GANs on MNIST

To emphasize the measurements of creativity, diversity, and inheritance,


in the fourth experiment, we created five artificial image sets to simulate
the optimal generated images and generated images that lack creativity, lack
diversity, lack both creativity and diversity, and lack inheritance. Images are
taken or modified from the MNIST database [149], which contains 28 × 28-
pixel handwritten-digit images with labels {0, 1, 2, · · · , 9}. Figure 4.19
describes how the five artificial sets were built.
Three subsets containing 2,000, 2,000, and 20 images were randomly
selected from handwritten digit “8” images in the MNIST database. There is

no common image in the three sets. One set having 2,000 images was con-
sidered as the optimal generated set (Opt.) because these images come from
the same source of real data. The lack-of-diversity set (LD) was generated
by repeatedly copying the 20 images 100 times. Another 2,000-image set
was considered as the real set and used to generate the lack-of-creativity set
(LC) by the small modification of all images with the median filter. Since
filtering could slightly change images and keep their main information,
each image after filtering is similar to its original version i.e., the modified
images lack creativity. Choosing 20 images from the lack-of-creativity set
and repeatedly copying them 100 times generates the lack-of-creativity &
diversity set (LC&D). The lack-of-inheritance set (LIn) contains 2,000 images
selected randomly from handwritten digit “7” images in MNIST because the
handwritten digit “7” is greatly different from digit “8”.
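For illustration, the construction of the five artificial sets can be sketched as below; the helper name, the use of scipy.ndimage.median_filter for the small modification, and the assumption that the digit-“8” and digit-“7” images have already been loaded as numpy arrays are illustrative choices, not necessarily the exact tools used in this work.

```python
import numpy as np
from scipy.ndimage import median_filter

def build_virtual_gan_sets(digits8, digits7, seed=0):
    """Build the Opt., LC, LD, LC&D, and LIn sets from arrays of 28x28 MNIST
    digit-'8' images (digits8) and digit-'7' images (digits7)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(digits8))
    real = digits8[idx[:2000]]                  # the 2,000-image real set
    opt = digits8[idx[2000:4000]]               # optimal generated set (disjoint from real)
    small = digits8[idx[4000:4020]]             # 20 further disjoint digit-'8' images
    ld = np.repeat(small, 100, axis=0)          # lack of diversity: each image copied 100 times
    lc = np.array([median_filter(im, size=3) for im in real])  # lack of creativity
    lcd = np.repeat(lc[:20], 100, axis=0)       # lack of creativity & diversity
    lin = digits7[rng.permutation(len(digits7))[:2000]]        # lack of inheritance
    return real, {"Opt.": opt, "LC": lc, "LD": ld, "LC&D": lcd, "LIn": lin}
```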
The five datasets: Opt., LC, LD, LC&D, and LIn mimic the datasets that
are generated from five virtual GAN models trained on the 2,000-image
real set. The optimal generated set (Opt.) is treated as if it were generated from an optimal GAN, and the other four sets as if they were generated from four different GANs having their respective drawbacks. Figure 4.20 shows samples
from these datasets. Then, we applied the LS, and other six measures to
the five “generated” image sets and the 2,000-image real set. Results are
shown in Table 4.13.
In this experiment, we know the Opt. GAN is the best one. Hence, we
could state the concrete conclusion that LS, FID, 1NNC, MS, and SWD
successfully discover the best GAN model. As we discussed in Section 4.4.2,
results of IS confirm that it is not good at evaluating the creativity and
inheritance of GANs because it gives them higher scores (2.112 and 1.941)
than the best case (1.591); that is, the IS emphasizes diversity.

Figure 4.20: Column 1: samples from the real set; columns 2–6: sample images from the five virtual GAN models (Opt., LC, LD, LC&D, and LIn) trained on the real set.
Table 4.13: Measure results from virtual GAN models

Model LS IS FID↓ r1NNC MS AM↓ SWD↓


Opt. 0.994 1.591 4.006 0.978 1.968 343.842 23.427
LC 0.820 2.112 67.310 0.039 1.007 371.322 657.527
LD 0.892 1.299 59.112 0.002 1.597 337.553 211.140
LC&D 0.775 1.418 116.656 0.775 0.789 389.437 740.512
LIn 0.526 1.941 130.827 0.462 0.605 441.292 1166.082
Bold value: the best model by the measure of this column.
Underline: the worst model by the measure of this column.
Dashed line: to the left are our proposed measures; to the right are the
compared measures.
↓ Measures with this symbol mean smaller score is better; otherwise, larger
score is better.

Other measures also show their characteristics and preferences: the LS agrees with FID, MS, AM, and SWD that the worst model is the one lacking inheritance, whereas IS and 1NNC indicate that the model lacking diversity is the worst. By contrast, AM does not weight diversity very much, because its scores for the best model and for the model lacking diversity are similar; and LS, FID, MS, AM, and SWD value creativity more than diversity.

4.4.5 Discussion

Since Geirhos et al. [85] recently reported that CNNs trained by ImageNet
have a strong bias to recognize textures rather than shapes, we chose texture
images to train GANs. From results in Table 4.11, the proposed LS agrees
with IS, 1NNC, and SWD that the WGAN-GP performs the best and SNGAN
performs the worst on selected texture images. As shown in Table 4.12, LS
makes the same evaluation on CIFAR-10 dataset. As shown in Figure 4.16,
SNGAN and WGAN-GP generate synthetic images that look different from
real samples but SNGAN tends to generate many very similar images (its
diversity is low). Hence, all measures rate SNGAN as performing worst on
texture datasets. Results on CIFAR-10 dataset (Table 4.12) show a similar

154
(a) Optimal (b) Lack of Creativity (c) Lack of Diversity (d) Lack of C & D* (e) Lack of Inheritance

Generated ICD peak ahead ICD BCD


BCD peaks ahead

ICD set for Real * C & D is creativity & diversity.


ICD set for Generated
BCD set between classes Real Generated

Figure 4.21: Real and generated datasets from virtual GANs on MNIST. First
row: the 2D tSNE plots of real (blue) and generated (orange) data points
from each virtual GAN. Second row: histograms of ICDs (blue for real data;
orange for generated data) and BCD for real and generated datasets. The
histograms in (b)-(d) are zoomed to the beginning of plots; (a) and (e) have
the entire histograms.

conclusion.

4.4.5.1 Evaluation of GAN Measures

Our results indicate that the LS is a promising measure for GANs. Without a


gold standard, however, it is difficult to compare GAN evaluation methods
and to state which method is better when they performed similarly. To
show measures’ characteristics/preferences and evaluate them in terms
of the three respects of an ideal GAN, we artificially created five datasets
(Figure 4.19) as if they were generated from five virtual GANs trained on
MNIST. In this controlled circumstance, the LS, FID, 1NNC, MS, and SWD
discerned the best GAN model (Table 4.13). In addition, by analyzing the
distributions of the ICD and BCD sets, the LS could provide evidence of the lack of creativity, diversity, and inheritance to explain its results. As with
Figure 4.14, we plot data and histograms of their ICD and BCD sets in
Figure 4.21 to show their relationships with the LS. Each image in MNIST has 28 × 28 pixels, so these data are in a 784-dimensional space. To
visually represent the data in two dimensions, we applied the t-distributed
Stochastic Neighbor Embedding (tSNE) [171] method. In contrast, the ICD
and BCD sets were computed in the 784-dimensional space directly, without
using any dimensionality reduction or embedding methods.
As shown in Figure 4.21, the ICD and BCD sets for computing the LS
offer an interpretation of how LS works and verify that LS is able to detect
the lack of creativity, diversity, and inheritance for GAN generated data, as
we discussed in Section 4.4.3. Figure 4.21(a) shows the real (training) data
and data generated by the ideal GAN. Since distributions of the three sets
are nearly the same, LS gets the highest score (close to 1, in Table 4.13).
Figure 4.21(b) shows the GAN lacks creativity. Almost every generated data
point is overlapped with (or very close to) a real data point. Hence, the BCD
set has some peaks at the beginning of plot. Lack of diversity is shown by
Figure 4.21(c). Most generated data points are not close to real data points,
but some points are very close to each other. That results in a peak at the
beginning of generated ICD plot. Any differences of the histograms of
ICD and BCD sets will decrease the LS. Therefore, LS is affected by the
isolated peaks of one distance set. Figure 4.21(d) shows the combined effect.
Generated data points are close to real data points and cluster in a few
places. Both BCD and generated ICD peaks can be found at the beginning
of plot. For the last Figure 4.21(e), lack of inheritance means generated
data are dissimilar from real data. The two kinds of data are distributed
separately so that distributions of the three sets are all different, contrary
to Figure 4.21(a); that leads to the lowest LS.

Figure 4.22: Time cost of the measures running on a single core of a CPU (i7-6900K), plotted on a logarithmic time scale (y-axis: time in seconds; x-axis: # of real or synthetic images). To test time costs, we used the same amount of real and generated images (200, 500, 1000, 2000, and 5000) from the CIFAR-10 dataset and a DCGAN trained on CIFAR-10. †IS only used the generated images.
4.4.5.2 Time Complexity

Both the LS and 1NNC use direct image comparison, i.e., the Euclidean (l2-norm) distance between two images. The main time cost of the LS is to calculate the ICD and BCD sets. The LS's time complexity for N (Class 1) and M (Class 2) data is about O(N²/2 + M²/2 + MN) (two ICD sets and one BCD set). Although the 1NNC also uses the Euclidean distance between two images, its time complexity is about O((M + N)²), which is double the cost of the LS, because it uses Leave-One-Out Cross-Validation for the 1-Nearest Neighbor classifier: for each sample from the (M + N) images, (M + N − 1) distances must be calculated to find its nearest neighbor.
The IS, FID, MS, and AM use the Inception neural network to process images, so their time costs are greater than that of the LS when running on a CPU (i7-6900K). Although running on a GPU could accelerate the processing of neural networks, for fair comparisons of time costs all measures were run on a single core of a CPU, because the 1NNC and LS do not currently run on a GPU; in the future, they could also be accelerated by moving them to a GPU. Figure 4.22 shows that, for as many as 5,000 samples, the LS has uniformly superior performance in terms of time cost. Although the growth trend shows that other measures (except the 1NNC) will run faster than the LS at some larger number of samples, we do not need such a large data set to evaluate GANs. Since GAN measures are stable as the amount of data grows (as shown in Figure 4.15), our experiments demonstrate that 2,000 samples are adequate for GAN measures.

4.4.5.3 Comparison Summary

The compared measures have various drawbacks. The IS, FID, MS, and AM depend on the Inception network pre-trained on ImageNet. In addition, the IS lacks the ability to detect overfitting (creativity) and inheritance, and the FID depends on the Gaussian distribution assumption for the feature vectors from
the network. The SWD and 1NNC require that the amount of real data be
equal to the amount of generated data. The local conditions of distribu-
tions will greatly influence results of 1NNC (e.g., it obtains extreme values
like 0 or 1 in Table 4.10) because it only considers the 1-nearest neigh-
bor. That there are several required parameters8 such as slice_size and
n_descriptors is another disadvantage of SWD; both changes of parameters
and the randomness of radial projections will influence its results.
The proposed LS is designed to avoid those disadvantages. We have
created three criteria (creativity, diversity, and inheritance) to describe ideal
GANs. And we have shown that LS evaluates a GAN by examining the three
aspects in a uniform framework. In addition, LS does not need a pre-trained
classifier, image analysis methods, nor a priori knowledge of distributions.
Ranging between 0 and 1 is another merit of LS because we could know
how close the performance of a GAN model is to the ideal situation.
We found that the idea of the GS [137] has some similarities to our LS. The GS compares the complexities of the manifold structures, which are built from pairwise distances of samples, between real and generated data, and we think the complexity of the data manifold may be connected to data separability. However, we found the results of the GS too unstable to use. For example, we computed the GS measure twice on 2,000 generated and 2,000 real images from the DCGAN and CIFAR-10 (the same test as in Section 4.4.4.3); one result was 0.0078 and the other was 0.0142 – almost double. As shown in Figure 4.15, GS results differ not only between runs but also with the amount of samples.
8 More details are in its source codes: https://github.com/koshian2/swd-pytorch.

4.4.5.4 Contributions and Future Work

LS uses a very simple process – it calculates only Euclidean distances of


data and the KS distances between distributions of data distances; those
methods are independent of image types, amounts, and sizes. LS offers
a distinctly new way to measure the separability of real and generated
data. By experiments, it has been verified to be an effective GAN evaluation
method by examining the three aspects (creativity, diversity, and inheritance)
of ideal GANs. In particular, the LS can provide evidence of the three
aspects in the histograms of ICD and BCD sets to explain its results
(e.g., Figure 4.21). In the future, individual measures (scores) for each
aspect could be developed by further analysis of the ICD and BCD sets.
Besides evaluation of GANs, LS could measure data complex-
ity/separability as well. According to Theorem 2.1, the LS provides an
effective way to verify whether the distributions of two sample sets are iden-
tical for any dimensionality. Thus, our proposed novel model-independent
measure for GAN evaluation has clear advantages in theory and has been
demonstrated to be worthwhile for future GAN studies.
Results also show that a GAN that performs well with one type of image
may not do so with other types. For example, in Table 4.10 and Figure 4.17,
we see that the SNGAN performs much better on Hole images than on other
image types. Hence, in future work, we will examine the proposed measure
on more types of images and GAN models.

4.5 Generalizability of Deep Neural Networks9

9 This work has been published in [C5].

In this study, we create the Decision Boundary Complexity (DBC) score


to define and measure the complexity of decision boundary of the Deep
Neural Network (DNN). The idea of the DBC score is to generate data
points (called adversarial examples) on or near the decision boundary. Our
new approach then measures the complexity of the boundary using the
entropy of the eigenvalues of those data. The method works equally well
for high-dimensional data. We use training data and the trained model to
compute the DBC score. And the ground truth for a model’s generalizability
is its test accuracy. Experiments based on the DBC score have verified our
hypothesis. The DBC is shown to provide an effective method to measure
the complexity of a decision boundary and gives a quantitative measure of
the generalizability of DNNs.

4.5.1 Introduction of Generalizability of Neural Networks

The generalization ability (generalizability) is an essential characteristic


of classifiers in both machine learning and deep learning. A classifier with
good generalizability performs well on new data. Classically, a small portion
of data taken from the training set as test/validation data is used to describe
the generalizability. It would be valuable to analyze the generalizability of a
classifier model directly, without test data, because it could help the model
selection, and save time and data (data are quite limited in some cases) for
training models.
A Deep Neural Network (DNN) usually contains many more parameters
than training data. Based on traditional generalization analysis such as the
VC dimension [276] or Rademacher complexity [22], DNNs tend to overfit
the training data and demonstrate poor generalization. Much empirical
evidence, however, has indicated that neural networks can exhibit a re-
markable generalizability [299]. This fact requires new theories to explain
the generalizability of neural networks. Two main approaches characterize
studies of generalizability for deep learning [128]: a generalization bound
on the test/validation error calculated from the training process [69, 160],
and a complexity measure of models [134, 189, 127], motivated by the
VC-dimension.
Classifiers that overfit the training data lead to poor generalizability. To
limit the overfitting, several regularization techniques such as dropout and
weight decay have been widely applied in training DNNs. As L1 and L2
regularization could generate sparsity for sparse coding [151], regularization
techniques simplify the model’s structure and then prevent the model from
overfitting [257, 295]. This is because the simplified model cannot fit all
training data precisely but must learn the approximate outline or distribu-
tion of the training data, which is the key information required to perform
well on test data (generalizability). On the other hand, the law of parsimony
(Occam’s Razor) [25] implies that any given simple model is a priori more
probable than any given complex model [99]. Therefore, we hypothesize
that, on a specific dataset, if two models have similar high training accuracy
(close to 1), the simpler model will have a higher test accuracy (better
generalizability).
There are two ways to measure model complexity: 1) to examine trainable
parameters and the structure of the model [189, 38]; 2) to evaluate the com-
plexity of the decision boundary [311, 14, 156], which is the consequential
representation of model complexity. Recently, several analyses of complexity

of the decision boundary investigated adversarial examples that are near
the decision boundary [103, 297, 133]. In this paper, for DNN models, we
analyze generalizability based on the complexity of the decision boundary.
Unlike other recent studies, this one proposes a novel method to charac-
terize these adversarial examples to reveal the complexity of the decision
boundary, and this method is applicable to datasets of any dimensionality.

4.5.2 Methods

4.5.2.1 Adversarial Examples

It is difficult to describe the decision boundary of a trained DNN model


directly. Using adversarial examples is the key to this problem because they
are near the boundary and could be considered as points sampled from
the boundary. The boundary is described by these examples. Specifically,
for a two-class {0, 1} classifier f , an adversarial example x is one for which:

f (x) ≈ 0.5

There are several approaches to generate the adversarial examples [103,


297, 133]; we apply a simple one [297] to linearly generate them. For
example, for the two-class classifier f , as Figure 4.23 shows, we select one
training data point a in Class 1 and another one b in Class 2. The example
x on the line segment between a and b can be defined by:

x = λ a + (1 − λ ) b, 0 ≤ λ ≤ 1

The line must cross the decision boundary because its two ends are in
different classes. Hence, the adversarial example c exists on the line.

Figure 4.23: Generating an adversarial example of classifier f: a point a in Class 1 with f(a) ≈ 0 and a point b in Class 2 with f(b) ≈ 1 are connected by a line segment, which crosses the decision boundary at a point c with f(c) ≈ 0.5; c divides the segment in proportions λ and 1 − λ.

To find the adversarial example on such a line segment, we search over


[0,1] for a value of λ that yields a point close to the boundary. The pseudo-
code shows the algorithm of this process.

Algorithm 1: To find an adversarial example

1 ∀ a ∈ Class1, b ∈ Class2 s.t. f(a) < 0.5; f(b) > 0.5
2 c∗ ← a // initialize the adversarial example
3 for λ ← 0 to 1 step ε do
4     c ← λa + (1 − λ)b // candidate point on the segment
5     if |f(c) − 0.5| < |f(c∗) − 0.5| then
6         c∗ ← c
7     end
8 end
Output: c∗.

The precision of the distance of the adversarial example from the boundary depends on the step value ε, and so does the time cost, which is about O(1/ε). This process can be sped up to O(log(1/ε)) by a divide-and-conquer approach that uses binary search. In the experiments, we set ε = 1/256 because the inputs are 8-bit images.
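A minimal sketch of the binary-search (divide-and-conquer) variant is given below; the classifier f is assumed to be a callable returning a value in [0, 1] with f(a) < 0.5 < f(b), and the parametrization follows Algorithm 1. This is an illustrative sketch rather than the exact code used in the experiments.

```python
import numpy as np

def find_adversarial_example(f, a, b, eps=1/256):
    """Bisection over lambda in c = lambda*a + (1 - lambda)*b (cf. Algorithm 1)
    to locate a point c with f(c) close to 0.5; assumes f(a) < 0.5 < f(b)."""
    lam_lo, lam_hi = 0.0, 1.0     # lambda = 1 gives a (f < 0.5); lambda = 0 gives b (f > 0.5)
    while lam_hi - lam_lo > eps:
        lam = 0.5 * (lam_lo + lam_hi)
        c = lam * a + (1.0 - lam) * b
        if f(c) < 0.5:
            lam_hi = lam          # c is on the Class-1 side; move toward b
        else:
            lam_lo = lam          # c is on the Class-2 side; move toward a
    lam = 0.5 * (lam_lo + lam_hi)
    return lam * a + (1.0 - lam) * b

# Toy usage with a logistic model standing in for a trained DNN.
f = lambda x: 1.0 / (1.0 + np.exp(-(x.sum() - 1.0)))
a, b = np.zeros(4), np.ones(4)    # f(a) < 0.5 and f(b) > 0.5
c = find_adversarial_example(f, a, b)
print(c, f(c))                    # f(c) is close to 0.5
```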

Figure 4.24: Adversarial examples generated by pairs of data points from Class 1 and Class 2, lying near the decision boundary of classifier f.

4.5.2.2 Boundary Complexity Measure

For two-class datasets, one adversarial example is generated by a pair


of data points from the two classes. Suppose Class 1 has N data and
Class 2 has N data; then randomly-selected N pairs of data can generate N
adversarial examples. As Figure 4.24 shows, these adversarial examples
(green points) are likely sampled from the decision boundary and could
describe it. These generated adversarial examples form the adversarial set.
We measure the complexity of the decision boundary by investigating the
complexity of the adversarial set.
We apply Principal Components Analysis (PCA) to analyze the complexity
of the adversarial set. In n dimensions, an adversarial set with m examples forms an n × m matrix X. Suppose n < m; by PCA, we have:

XXᵀW = λW

where W is the eigenvector matrix (s.t. WᵀW = I) and λ contains the n eigenvalues: {λ1, λ2, · · · , λn}.
These eigenvalues could show the complexity of the adversarial set. If λi / Σk λk = 1, it means all m examples lie on the line of the i-th eigenvector; this is the simplest condition for the adversarial set. If (λi + λj) / Σk λk = 1, it means all m examples are on a plane, which indicates the decision boundary most likely is a plane. In general, we could measure the Decision Boundary Complexity (DBC) of f by computing the Shannon entropy of the eigenvalues:

DBC{f} = H({λ1/Σi λi, λ2/Σi λi, · · · , λn/Σi λi}) / log n

Dividing by log n normalizes the DBC to the range [0, 1]. 0 is the simplest
condition: the decision boundary is just a line.
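A minimal sketch of the DBC computation for an adversarial set follows, using numpy to obtain the PCA eigenvalues and the normalized Shannon entropy; the toy data at the end are placeholders. When the dimensionality n is much larger than the number of examples m, the nonzero eigenvalues can equivalently be taken from the smaller m × m matrix XᵀX, which the sketch exploits; zero eigenvalues contribute nothing to the entropy.

```python
import numpy as np

def dbc_score(adv_set):
    """DBC{f} = H(lambda_1/sum, ..., lambda_n/sum) / log(n) for an adversarial
    set given as an (m_examples, n_dimensions) array."""
    X = np.asarray(adv_set, dtype=float).T            # n x m matrix, one example per column
    n = X.shape[0]
    # Eigenvalues of X X^T; when n > m, the m x m Gram matrix X^T X shares the
    # same nonzero eigenvalues and is cheaper to decompose.
    M = X @ X.T if n <= X.shape[1] else X.T @ X
    eigvals = np.clip(np.linalg.eigvalsh(M), 0.0, None)
    p = eigvals / eigvals.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))    # Shannon entropy of eigenvalues
    return float(entropy / np.log(n))

# Toy usage: points near a line (simple set) vs. isotropically scattered points.
rng = np.random.default_rng(0)
t = rng.uniform(size=(200, 1))
on_line = np.hstack([t, 2.0 * t]) + 0.01 * rng.normal(size=(200, 2))
scattered = rng.normal(size=(200, 2))
print(dbc_score(on_line), dbc_score(scattered))       # small DBC vs. DBC close to 1
```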
A problem arises if we think about the most difficult condition of the
boundary (DBC=1). For example, in 2-D, DBC=1 when the adversarial set
forms a circle, but we cannot say the round boundary is the most complex
one. For round-shape decision boundaries, some boundaries are smooth,
and some may be lumpy. As Figure 4.25 shows, the boundary (a) is more
smooth (simpler) than (b). Under our hypothesis, we consider that the
generalizability of model (a) is better than (b). DBC scores computed by
adversarial sets of the two models will, however, be similar (and close to 1).
In Figure 4.25, the boundary (a) is obviously simpler than boundary (b)
because (b) has many zigzags in every segment. But if we compute the DBC
score using the entire adversarial set, the effect (on eigenvalues’ entropy)
of zigzags is confused with the round-shape. Thus, it is not appropriate
to use the entire adversarial set in such cases. If the adversarial set is
generated by all data (in Figure 4.24), we name it the global adversarial

166
Class 2 Class 2

Class 1 Class 1

boundary (a) boundary (b)

Segment of (a) Segment of (b)

Figure 4.25: Two kinds of round decision boundary.

set. And the DBC score computed from it is called the global DBC. To solve
the round-shape problem, we turn to consider adversarial examples on a
section of the boundary, the segmental boundary. We define the adversarial
data set formed by a segmental boundary as the local adversarial set.
Adversarial examples in a local adversarial set should be close to each
other to outline the shape of the segmental boundary. As Figure 4.26 shows,
a pair of data points from two classes is randomly selected, then to find
n-nearest neighbors of one of those two data points. Finally, adversarial
examples (green points) are generated by lines between these n + 1 data
points in one class to another data point in a different class. To decide
the number of examples k for one local adversarial set is an interesting
question. It probably depends on the dimension and distances between
example points. We will further discuss this question in the experiments below (Sections 4.5.3.2 and 4.5.3.3).
Figure 4.26: A local adversarial set generated from the 3-nearest neighbors of a pair: adversarial examples lie on the lines between a data point in Class 1 and a data point in Class 2 together with its 3 nearest neighbors, near the decision boundary of classifier f.

The computation process for the complexity of a local adversarial set is the same as that for the global adversarial set; Algorithm 2 shows the process. The difference is that N pairs of data generate one global adversarial set but N local adversarial sets. Thus, one decision boundary has many local DBC scores.

Algorithm 2: To compute one local DBC score


1 Take ∀ a ∈ Class1 , b ∈ Class2 s.t. f (a) < 0.5; f (b) > 0.5
2 Set of adversarial examples {c1 , c2 , · · · , ck+1 } generated by: a to
k-nearest neighbors of b including b
3 Compute the eigenvalues of the local adversarial set by PCA
4 Compute the normalized Shannon’s entropy of eigenvalues
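Putting Algorithms 1 and 2 together, one local DBC score can be sketched as below; it reuses the find_adversarial_example and dbc_score helpers from the earlier sketches (these names are illustrative), and uses scikit-learn's NearestNeighbors to pick the k nearest neighbors, which is one convenient but not the only possible choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_dbc_score(f, class1_data, class2_data, k=15, eps=1/256, seed=0):
    """One local DBC score: adversarial examples from a random a in Class 1
    to a random b in Class 2 and the k nearest neighbors of b (k+1 points).
    Assumes a and b are correctly classified: f(a) < 0.5 and f(b) > 0.5."""
    rng = np.random.default_rng(seed)
    a = class1_data[rng.integers(len(class1_data))]
    b = class2_data[rng.integers(len(class2_data))]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(class2_data)
    _, idx = nn.kneighbors(b.reshape(1, -1))           # b itself plus its k neighbors
    neighbors = class2_data[idx[0]]
    adv = np.array([find_adversarial_example(f, a, nb, eps) for nb in neighbors])
    return dbc_score(adv)                              # entropy of the PCA eigenvalues
```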

4.5.3 Experiments and Results

We design three experiments to verify our boundary complexity mea-


sure. The dataset for the first experiment contains synthetic 2-D data with
two classes. The second experiment uses the breast cancer Wisconsin

dataset from sklearn.datasets.load_breast_cancer10 . Its dimensionality is
30. The third experiment uses real images of cats and dogs downloaded
from GitHub11 . The image size is 150x150x3; thus, these data are in very
high dimension.
The key idea of the experiments is to train DNNs with different generalizabilities and compute the DBC scores of these trained models. The ground
truth for generalizability is the test accuracy because better performance
on test data indicates greater generalizability: performance on new data.
In the experiments the generalizability of a DNN is adjusted by intentional
overfitting, such as by adding excessive trainable weights and removing
regularization layers.

4.5.3.1 Synthetic 2-D Dataset

The dataset is generated by sklearn.datasets.make_blobs. The dataset


has two well-separated clusters; each cluster has 200 data points and
belongs to one class (see data points in Figure 4.27), and thus this dataset
is linearly separable. Two fully-connected neural networks (FCNNs) have
been trained to classify this dataset. There are no test data, and both
training accuracies are 100%. Their real decision boundaries are shown in
Figure 4.27. Obviously, the decision boundary of model (a) is simpler than
that of model (b) because, by comparison to the linearly separable dataset,
any non-linear boundary is superfluous.
For the two models (a) and (b), we generate adversarial examples (green
points) by pairs of data from two classes to form the global adversarial
sets, which clearly illustrate the boundary shape.

10 https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
11 https://github.com/vyomshm/Cats-Dogs-with-keras

Figure 4.27: Decision boundaries of the two models trained by the synthetic 2-D dataset, each with its global adversarial set (green points); the global DBC of model (a) is approximately 0.0007 and that of model (b) is approximately 0.2985. The FCNN (a) has only one hidden layer with one neuron; its number of parameters is 5 (including bias). The FCNN (b) has three hidden layers with 10, 32 and 16 neurons; its number of parameters is 927 (including bias).

The global DBC scores successfully show their different complexity situations (a smaller DBC means a simpler boundary). In this case, we could assert that the local
DBC of model (a) must be smaller than the local DBC of model (b) without
quantitative comparison because any segment of boundary (a) is not more
complex than any segment of boundary (b). Hence, the overall (average)
local DBC of boundary (a) must be smaller than that of boundary (b). For
convenience, in other experiments, we compute only the local DBC.
The DBC effectively detects the model with better generalizability (simpler

decision boundary). It is not very impressive for the 2-D dataset because
the boundary is visible. We could visually identify the simpler boundary
case. But for a high-dimensional dataset, we must rely on the DBC score to
describe the complexity of decision boundary.

4.5.3.2 Breast Cancer Dataset

The dataset is imported from the breast cancer (Wisconsin) dataset and
has two classes (212 Malignant and 375 Benign cases). Each case contains
30 numerical features.
Two FCNN models (bC1 and bC2) have been trained to classify this
dataset. The training-test data ratio is 3:2 and both training accuracies are
nearly 100% (> 0.99) at the end. Then, we obtain models’ test accuracies
as the ground truth for model complexity. The greater test accuracy value
means better generalizability (simpler decision boundary).
To compute the local DBC scores requires only the training data. We
randomly select a pair of data points, of which one is a Malignant sample
and another one is a Benign sample, to compute one local DBC score on
trained models. This process is repeated 2,500 times (about 5 times of
total number of data) to obtain 2,500 local DBC scores for each model.
These local DBC scores are based on 30-nearest neighbors because the
space dimension is 30. Thus, each local DBC score is computed by 31
adversarial examples. The reason is that, in 30-D, the simplest element
(30-simplex) contains 31 vertices (e.g., a triangle in 2-D and a tetrahedron in
3-D). We consider that n-nearest neighbors could best reflect the complexity
of segmental boundary in n-D. The next experiment shows that the number
of nearest neighbors could be much smaller than the dimensionality and
not unique.

Figure 4.28: Local DBC scores from the two models trained by the breast cancer dataset, shown as histograms of the local DBC scores and as sorted local DBC scores versus the number of repetitions. The FCNN bC1 has three hidden layers (20 neurons in each layer) and three Dropout layers; its number of parameters is 1,481 (including bias). The bC2 has one hidden layer with 1,000 neurons; its number of parameters is 32,001 (including bias).

Table 4.14: Statistical results of local DBC scores on bC1 and bC2 (2,500 local DBC scores per model).

Model   Test Acc^a   Mean    Median
bC1     0.970        0.087   0.057
bC2     0.921        0.113   0.079
h0 (bC1 ≥ bC2)^b: Rejected (p ≈ 0)
a Test accuracy is the ground truth.
b By two-sample Wilcoxon signed rank test.

Figure 4.28 and Table 4.14 clearly indicate that the model bC2 generally
has larger local DBC scores than bC1. The result means bC1 has better
generalizability than bC2, which is verified by their test accuracies. We do
not calculate the standard deviation of the scores because their distributions are not Gaussian but are more like long-tailed distributions. Instead, we apply the two-sample rank test12 to estimate which scores are smaller.

4.5.3.3 Cat and Dog Dataset

This dataset contains 1,440 cat and 1,440 dog RGB photos. The image
size is 150x150x3 (67,500 8-bit integers). Three convolutional neural net-
work (CNN) models (cC1, cC2 and cC3) are trained to classify this dataset.
The training-test ratio is 32:13 and all three training accuracies are > 0.95 at
the end. Then, we obtain models’ test accuracies as the ground truth for
model complexity. Figure 4.29 shows the training process.

Figure 4.29: Training and test accuracies during the training process of the three models (final test accuracies: cC1 ≈ 0.73, cC2 ≈ 0.63, cC3 ≈ 0.58).
The CNN cC1 has three convolutional layers, three max-pooling layers,
one dense layer (64 neurons) and one Dropout layer. The cC2 has one
convolutional layer and three dense layers (256, 128, 64 neurons). The cC3
has only one dense layer (1024 neurons).

12 https://www.mathworks.com/help/stats/signrank.html

Computing the local DBC scores uses only the training set. We randomly
select a cat and a dog image from the training set to compute local DBC
scores on trained models. This process is repeated 6,000 times (about 5
times of the size of training set) to obtain 6,000 local DBC scores for each
model.
Since the space dimension (67,500) is far beyond the size of the dataset (2,880), we cannot follow the simplex idea and use the 67,500-nearest neighbors to compute local DBC scores. Even if we had enough images, that number of nearest neighbors would be too large for the process to run. Hence, to find a suitably small number, we test 3, 5, 10, 15, 20, and 30 nearest neighbors.

Figure 4.30: Means and medians of the local DBC scores for models cC1, cC2, and cC3 using different numbers of nearest neighbors (x-axis: # of nearest neighbors; y-axis: DBC scores).

Figure 4.30 shows the means and medians of 6,000 local DBC scores
based on various numbers of nearest neighbors. Since the distributions of
these scores are not Gaussian but are more like long-tailed distributions, we use their medians instead of the standard deviations.

Figure 4.31: Increasingly sorted local DBC scores (15-NN) from the three models versus the number of repetitions. The upper figure is the whole plot, and the lower figure zooms into the 2k–6k range to show the positions of the three curves clearly.

By comparing the means and


medians for the three models, we find that, regardless of the number of
nearest neighbors, for the DBC scores: cC1 < cC2 < cC3 always holds.
Such a conclusion is verified by their test accuracies (see test accuracy
in Table 4.15). The higher test accuracy suggests the model has better
generalizability and should have a simpler decision boundary and smaller
DBC scores. More strict estimates require the two-sample rank test. We
provide an example of the 15-nearest neighbors case in Table 4.15. We
reject two null hypotheses: cC1 ≥ cC2 and cC2 ≥ cC3 with p ≈ 0; it proves that
cC1 < cC2 < cC3. Also, Figure 4.31 indicates the same conclusion.

Table 4.15: Statistical results of local DBC scores on cC1, cC2 and cC3 (6,000 local DBC scores per model, 15-NN^b).

Model   Test Acc^a   Mean    Median
cC1     0.730        0.850   0.873
cC2     0.626        0.877   0.889
cC3     0.583        0.887   0.897
h0 (cC1 ≥ cC2)^c: Rejected (p ≈ 0); h0 (cC2 ≥ cC3): Rejected (p ≈ 0)
a Test accuracy is the ground truth.
b Computation is based on 15-nearest neighbors.
c By two-sample Wilcoxon signed rank test.

Figure 4.32: Adversarial examples for the cC1 model.

4.5.4 Discussion

The main idea of this study is simple and clear: using the adversarial
examples on or near the decision boundary to measure the complexity of
the boundary. It is difficult to define and measure the complexity of a
boundary surface in high dimensions, but easier to measure the complexity
of adversarial example sets. We measure the complexity via the entropy of
eigenvalues of adversarial sets. Other complexity measures for grouped data
are also worth considering [165]. Figure 4.32 shows several adversarial
examples for the cC1 model generated by training images. They look like
mixed cat and dog photos.
To generate the adversarial examples, as Figure 4.23 shows, we use a
pair of real data from different classes. At least one adversarial example is

on the line segment between two data points because the line must cross the
decision boundary at least once. If we use only real data from the training
set, we could evaluate models’ generalizability without using test sets. That
is an advantage when data are limited because we could have more data
for training. However, the disadvantage of this method is the dependence
on real data. The number of adversarial examples that could be generated
depends on the size of the real dataset. Can we generate an adversarial
example x for classifier f by randomly searching for f (x) ≈ 0.5? Maybe, but
it is very difficult in a high-dimensional space. Even to find two data points
a, b whose f (a) ≈ 1, f (b) ≈ 0 is difficult because one of the areas (say f (a) ≈ 1)
would be very small and sparse in the space. Definitely, there are some
other methods to generate adversarial examples, such as the DeepDIG [133]
and applications of the Generative Adversarial Network (GAN).
Smaller local DBC scores are necessary but insufficient conditions for
a simpler decision boundary because a lower complexity adversarial set
may be generated from a higher complexity boundary (Figure 4.33). Hence,
the density of adversarial examples is important. Denser examples have a higher probability of reflecting the real condition of the boundary. In practice,
more adversarial examples are required to be on the effective segment of
decision boundary, which is not the whole boundary but the part close to
the data. From this aspect, to generate adversarial examples on the line
segments between two data points is an appropriate way to create a dense
adversarial set on the effective segment of decision boundary.
A smaller DBC score indicates that the model has a simpler decision
boundary and better generalizability on a certain dataset. It is worth
noting that the DBC score is meaningless for a single model and cannot
be compared across different datasets. The gist of DBC score is used to

177
Figure 4.33: Linear adversarial set (green) on lumpy boundary (black).

compare various models trained on the same dataset. In this study, all
three experiments use two-class datasets. In future work, we will use
multi-class datasets. The multi-class problem could be treated as multiple
two-class problems by one class vs. others.

4.6 Estimation of Training Accuracy for Two-layer Neural Networks

In addition, for understanding the mechanism of Neural Network (NN)


models and the transparency of deep learning, we propose a novel theory
based on space partitioning to estimate the approximate training accuracy
for two-layer neural networks on random datasets without training. There
appear to be no other studies that have proposed a method to estimate
training accuracy without using input data and/or trained models. Our
method estimates the training accuracy for two-layer fully-connected neural
networks on two-class random datasets using only three arguments: the
dimensionality of inputs (d), the number of inputs (N), and the number
of neurons in the hidden layer (L). We have verified our method using
real training accuracies in our experiments. The results indicate that the
method will work for any dimension, and the proposed theory could extend
also to estimate deeper NN models, like the DNN. This study may provide a
starting point for a new way for researchers to make progress on the difficult

problem of understanding deep learning.

4.6.1 Introduction and Related Work

In recent years, the neural network (deep learning) technique has played
an increasingly important role in applications of machine learning. To
comprehensively understand the mechanisms of Neural Network (NN) mod-
els and to explain their output results, however, still require more basic
research [223]. To understand the mechanisms of NN models, that is, the
transparency of deep learning, there are mainly three ways: the training
process [66], generalizability [159], and loss or accuracy prediction [12].
In this study, we create a novel theory from scratch to estimate the
training accuracy for two-layer neural networks applied to random datasets.
Figure 4.34 illustrates this two-layer neural network and sum-
marizes the process of estimating its training accuracy using the proposed
method. Its main idea is based on the regions of linearity represented by
NN models [201], which derives from common insights of the Perceptron.
This study may raise other questions and offer the starting point of a new
way for future researchers to make progress in the understanding of deep
learning. Thus, we begin from a simple condition of two-layer NN models,
and we discuss the use for multi-layer networks in Section 4.6.4.3 as future
work. This study has two main contributions:

• We propose a novel theory to understand the mechanisms of two-layer


FCNN models.

• By applying that theory, we estimate the training accuracy for two-layer


FCNN on random datasets.

More discussion of our contributions is given in Section 4.6.4.1.
4.6.1.1 Preliminaries

Specifically, the studied subjects are:

• Classifier model: the two-layer Fully-Connected Neural Network


(FCNN) with d − L − 1 architecture, in which the length of input vectors
(∈ Rd ) is d, the hidden layer has L neurons (with ReLU activation), and
the output layer has one neuron, using the Sigmoid activation function.
This FCNN is for two-class classification, and its outputs are
values in [0, 1].

• Dataset: N random (uniformly distributed) vectors in Rd belonging to


two classes with labels ‘0’ and ‘1’, and the number of samples for each
class is the same. We consider that the uniformly-distributed dataset
is an extreme situation for classification; its predicted accuracy is thus
the lower-bound for all other situations.

• Metrics: training accuracy.

We focus on the training accuracy instead of the test accuracy because


the main idea is to understand the mechanism of NN models by the approach
of estimating training accuracy, but not to analyze their performances. The
paradigm we use to study the FCNN is the following:

1. We find a simplified system to examine.

2. We create a theory based on the Hypotheses 4.1 and 4.2 to predict or


estimate outputs of the system.

3. For the most important step, we perform experiments to test the pro-
posed theory by comparing actual outputs of the system with predicted
outputs. If the predictions are close to the real results, we could accept
the theory, or update it to make predictions/estimates more accurate
by these heuristic observations (empirical corrections). Otherwise, we
abandon this theory and seek another one.

Figure 4.34: An example of the two-layer FCNN with d − L − 1 architecture.
This FCNN is used to classify N random vectors in R^d belonging to two
classes (x_i ∈ [0, 1]^d, y_i ∈ {0, 1}, i = 1, 2, ..., N). Detailed settings are stated
in Section 4.6.1.1. The training accuracy of this classification can be estimated
by our proposed method from only {d, N, L}, without applying any training
process. The detailed algorithm of our method is shown in Section 4.6.3.3.

4.6.1.2 Related Work

To the best of our knowledge, only a few studies have discussed the
prediction/estimation of the training accuracy of NN models. None of them,
however, estimates training accuracy without using input data and/or
trained models as does our method.
The overall setting and backgrounds of the studies of over-parameterized
two-layer NNs [66, 12] are similar to ours. But the main difference is that
those studies do not estimate the value of training accuracy. One study [66] mainly
shows that the zero training loss on deep over-parametrized networks can be
obtained by using gradient descent. Another study analyzes the generaliza-
tion bound [12] between training and test performance (i.e., generalization
gap). There are other studies [127, 288] to investigate the prediction of the
generalization gap of neural networks. We do not further discuss the gener-
alization gap because we focus on only the estimation of training accuracy,
ignoring the test accuracy.
Unlike our proposed method that does not need to use input data nor
to apply any training process, in recent works related to the accuracy
estimation for neural networks [289, 82, 40, 243], the accuracy prediction
methods require pre-trained NN models or weights from the pre-trained NN
models. Through our method, to estimate the training accuracy for two-layer
FCNN on random datasets (two classes) requires only three arguments: the
dimensionality of inputs (d), the number of inputs (N), and the number of
neurons in the hidden layer (L). The Peephole [260] and TAP [122] techniques
apply Long Short Term Memory (LSTM)-based frameworks to predict a NN
model’s performance before training the original NN model. However, the
frameworks themselves still must be trained by the input data before making
predictions.

4.6.2 The Hidden Layer: Space Partitioning

In general, the output of the k-th neuron in the first hidden layer is:

sk (x) = σ (wk · x + bk ),
where input x ∈ Rd ; parameter wk is the input weight of the k-th neuron and
its bias is bk . We define σ (·) as the ReLU activation function, defined as:

σ (x) = max{0, x}.

The neuron can be considered as a hyperplane: wk · x + bk = 0 that divides


the input space Rd into two partitions [201]. If the input x is in one (lower)
partition or on the hyperplane, then wk · x + bk ≤ 0 and its output
sk (x) = 0. If x is in the other (upper) partition, its output sk (x) > 0. Specifically,
the distance from x to the hyperplane is:

d_k(x) = \frac{|w_k \cdot x + b_k|}{\lVert w_k \rVert}

If wk · x + bk > 0,

s_k(x) = \sigma(w_k \cdot x + b_k) = |w_k \cdot x + b_k| = d_k(x)\,\lVert w_k \rVert.

For a given input data point, L neurons assign it a unique code: {s1 , s2 ,
· · · , sL }; some values in the code could be zero. The L neurons divide the input
space into many partitions; input data in the same partition have codes
that are more similar because they share the same zero positions. Conversely,
it is obvious that the codes of data in different partitions have different zero
positions, and the differences (the Hamming distances) of these codes are
greater. It is apparent, therefore, that the case of input data separated into
different partitions is favorable for classification.
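The following is a small illustration (our own sketch, not code from the dissertation) of the coding described above: L ReLU neurons with random weights partition the input space, and each point's zero/non-zero activation pattern identifies its partition.

import numpy as np

rng = np.random.default_rng(0)
d, L, N = 2, 200, 100                      # input dimension, hidden neurons, data points

X = rng.uniform(0.0, 1.0, size=(N, d))     # N random inputs in [0, 1)^d
W = rng.normal(size=(L, d))                # random hyperplane normals (one per neuron)
b = rng.normal(size=L)                     # random biases

codes = np.maximum(0.0, X @ W.T + b)       # ReLU outputs s_k(x), shape (N, L)
signs = (codes > 0)                        # zero/non-zero pattern = partition identifier

# Two points share a partition iff their sign patterns are identical.
unique_partitions = {tuple(row) for row in signs}
print("occupied partitions:", len(unique_partitions))
print("completely separated:", len(unique_partitions) == N)

Here the weights are drawn at random purely for illustration; in a trained network they are determined by the training process.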
4.6.2.1 Complete separation

Given L neurons that divide the input space into S partitions, we hy-
pothesize that:

Hypothesis 4.1. In anticipation of classification, but without yet assigning


labels, we identify the best case as the separation of all N input data into
different partitions (complete separation).

Remark. For most real classification problems (i.e., in which labels have
been assigned), complete separation of all data points is a very strong
assumption because adjacent same-class samples assigned to the same par-
tition is a looser condition and will not affect the classification performance.
Because our partitions by complete separation are unlabeled, the prin-
ciple of discriminant analysis (which aims to minimize the within-class
separations and maximize the between-class separations) is not applicable.
Finally, adjacent samples assigned to different partitions could have the
same label and thus define a within-class separation. The partitions we men-
tioned are thus not necessarily the final decision regions for classification;
those will be determined when labels are assigned.

Under this hypothesis, for complete separation, each partition contains


at most one data point after space partitioning. Since the position of data
points and hyper-planes can be considered (uniformly distributed) random
(because the methods to find hyper-planes themselves contain randomness,
e.g., the Stochastic Gradient Descent (SGD) [261], during the training), the
probability of complete separation (Pc ) is:

P_c = \frac{\binom{S}{N}\, N!}{S^N} = \frac{S!}{(S-N)!\, S^N} \qquad (4.2)
In other words, Pc is the probability that each partition contains at most
one data point after randomly assigning N data points to S partitions. By
Stirling’s approximation,

P_c = \frac{S!}{(S-N)!\, S^N} \approx \frac{\sqrt{2\pi S}\,\left(\frac{S}{e}\right)^S}{\sqrt{2\pi (S-N)}\,\left(\frac{S-N}{e}\right)^{S-N} S^N}

P_c \approx \left(\frac{1}{e}\right)^N \left(\frac{S}{S-N}\right)^{S-N+0.5} \qquad (4.3)

Let S = bN^a (a and b are two coefficients to be determined); for large
N → ∞, by Equation (4.3), the limit of the complete-separation probability
is:

\lim_{N\to\infty} P_c = \lim_{N\to\infty} \left(\frac{1}{e}\right)^N \left(\frac{bN^a}{bN^a - N}\right)^{bN^a - N + 0.5} \qquad (4.4)

S > 0 requires b > 0; and for complete separation, ∀N : S ≥ N (Pigeonhole


principle) requires a ≥ 1. By simplifying the limit in Equation (4.4), we
have¹³:

\lim_{N\to\infty} P_c = \lim_{N\to\infty} e^{-\frac{(a-1)N^{2-a}}{ab}} \quad \text{when } a > 1
\lim_{N\to\infty} P_c = 0 \quad \text{when } a = 1 \qquad (4.5)

Equation (4.5) shows that for large N, the probability of complete separation
is nearly zero when 1 ≤ a < 2, and close to one when a > 2. Only for a = 2
(i.e., S = bN 2 ) is the probability controlled by the coefficient b:

\lim_{N\to\infty} P_c = e^{-\frac{1}{2b}} \qquad (4.6)

 
Although complete separation holds (\lim_{N\to\infty} P_c = 1) for a > 2, there is no need
to incur the exponential growth of S with a when compared to the linear
13 A derivation of this simplification is in the Appendix C.
growth with b. A high probability of complete separation does not
require even a large b; for example, when a = 2 and b = 10, \lim_{N\to\infty} P_c ≈ 0.95.
Therefore, we let S = bN² throughout this study.
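A quick numerical check of Equation (4.6) is straightforward (our own sketch, not part of the dissertation's experiments): with S = bN² partitions, the empirical probability that N randomly assigned points all land in distinct partitions approaches e^{−1/(2b)}.

import numpy as np

def complete_separation_prob(N, b, trials=20000, seed=0):
    rng = np.random.default_rng(seed)
    S = int(b * N * N)
    hits = 0
    for _ in range(trials):
        cells = rng.integers(0, S, size=N)        # assign N points to S partitions
        hits += (len(np.unique(cells)) == N)      # all points separated?
    return hits / trials

b = 10
print("simulated :", complete_separation_prob(200, b))
print("predicted :", np.exp(-1.0 / (2 * b)))      # ~0.951 for b = 10

The sample sizes (N = 200, 20000 trials) are illustrative choices only.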

4.6.2.2 Incomplete separation

Increasing b in Equation (4.6) improves the probability of complete
separation. Alternatively, at the cost of decreased training accuracy, we
can accept an incomplete separation; that is, some partitions have
more than one data point after space partitioning. We define the separation
ratio γ (0 ≤ γ ≤ 1) for N input data, which means at least γN data points
have been completely separated (at least γN partitions contain only one
data point). According to Equation (4.2), the probability of such incomplete
separation (Pinc ) is:

P_{inc} = \frac{\binom{S}{\gamma N}\,(\gamma N)!\,(S-\gamma N)^{(1-\gamma)N}}{S^N} = \frac{S!\,(S-\gamma N)^{(1-\gamma)N}}{(S-\gamma N)!\, S^N} \qquad (4.7)

In other words, Pinc is the probability that at least γN partitions contain only
one data point after randomly assigning N data points to S partitions. When
γ = 1, Pinc = Pc , i.e., it becomes the complete separation, and when γ = 0,
Pinc = 1. We apply Stirling’s approximation and let S = bN 2 , N → ∞, similar to
Equation (4.6), we have:
\lim_{N\to\infty} P_{inc} = e^{\frac{\gamma(\gamma-2)}{2b}} \qquad (4.8)
4.6.2.3 Expectation of separation ratio

In fact, Equation (4.8) shows the probability that at least γN data points
(when N is large enough) have been completely separated, which implies:

P_{inc}(x \ge \gamma) = e^{\frac{\gamma(\gamma-2)}{2b}} \;\Rightarrow\;
P_{inc}(x = \gamma) = \frac{dP_{inc}(x < \gamma)}{d\gamma} = \frac{d\left(1 - P_{inc}(x \ge \gamma)\right)}{d\gamma} = \frac{d\left(1 - e^{\frac{\gamma(\gamma-2)}{2b}}\right)}{d\gamma} = \frac{1-\gamma}{b}\, e^{\frac{\gamma(\gamma-2)}{2b}} = P_{inc}(\gamma)

We notice that the equation Pinc (γ) does not include the probability of com-
plete separation Pc because Pinc (1) = 0. Hence, Pinc (1) is replaced by Pc and
the comprehensive probability for the separation ratio γ is:

P(\gamma) = \begin{cases} P_c = e^{-\frac{1}{2b}} & \gamma = 1 \\ \frac{1-\gamma}{b}\, e^{\frac{\gamma(\gamma-2)}{2b}} & 0 \le \gamma < 1 \end{cases} \qquad (4.9)

Since Equation (4.9) is a function of probability, we could verify it by observ-


ing:
\int_0^1 P(\gamma)\, d\gamma = P_c + \int_0^1 \frac{1-\gamma}{b}\, e^{\frac{\gamma(\gamma-2)}{2b}}\, d\gamma = e^{-\frac{1}{2b}} + \left(1 - e^{-\frac{1}{2b}}\right) = 1

We compute the expectation of the separation ratio γ:

E[\gamma] = \int_0^1 \gamma \cdot P(\gamma)\, d\gamma = 1 \cdot P_c + \int_0^1 \gamma \cdot \frac{1-\gamma}{b}\, e^{\frac{\gamma(\gamma-2)}{2b}}\, d\gamma \;\Rightarrow\;

E[\gamma] = \frac{\sqrt{2\pi b}}{2}\, \mathrm{erfi}\!\left(\frac{1}{\sqrt{2b}}\right) e^{-\frac{1}{2b}} \qquad (4.10)
where erfi(x) is the imaginary error function:

\mathrm{erfi}(x) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{x^{2n+1}}{n!\,(2n+1)}

4.6.2.4 Expectation of training accuracy

Based on the hypothesis, a high separation ratio helps to obtain a high


training accuracy, but it is not sufficient because the training accuracy also
depends on the separating capacity of the second (output) layer. Neverthe-
less, we initially ignore this fact and reinforce our Hypothesis 4.1.

Hypothesis 4.2. The separation ratio directly determines the training ac-
curacy.

Then, we will add empirical corrections to our theory to allow it to


match the real situations. In the case of incomplete separation, 1) all
completely-separated data points can be predicted correctly, and 2) the
other data points have a 50% chance to be predicted correctly (equivalent
to a random guess, since the number of samples for each class is the same)
because each partition will be ultimately assigned one label. Specifically,
if γN data points have been completely separated, the training accuracy α
(based on our hypothesis) is:

\alpha = \frac{\gamma N + 0.5\,(1-\gamma)N}{N} = \frac{1+\gamma}{2}

To take the expectation on both sides, we have:

E[\alpha] = \frac{1 + E[\gamma]}{2} \qquad (4.11)

Equation (4.11) shows the expectation relationship between the separation
ratio and training accuracy. After replacing E [γ] in Equation (4.11) with
Equation (4.10), we obtain the formula to compute the expectation of
training accuracy:

E[\alpha] = \frac{1}{2} + \frac{\sqrt{2\pi b}}{4}\, \mathrm{erfi}\!\left(\frac{1}{\sqrt{2b}}\right) e^{-\frac{1}{2b}} \qquad (4.12)

To compute the expectation of training accuracy by Equation (4.12),


we must calculate the value of b. The expectation of training accuracy is
a monotonically increasing function of b on its domain (0, ∞) and its
range is (0.5, 1). Since the coefficient b is very important in estimation of the
training accuracy, it is also called the ensemble index for training accuracy.
This leads to the following theorem:

Theorem 4.1. The expectation of training accuracy for a d − L − 1 archi-


tecture FCNN is determined by Equation (4.12) with the ensemble index
b.

In the input space Rd , L hyperplanes (neurons) divide the space into S


partitions. By the Space Partitioning Theory [283], the maximum number
of partitions is:
S = \sum_{i=0}^{d} \binom{L}{i} \qquad (4.13)

Since

\sum_{i=0}^{d} \binom{L}{i} = O\!\left(\frac{L^d}{d!}\right),

we let

S = \frac{L^d}{d!} \qquad (4.14)

Figure 4.35 shows that the partition numbers calculated from Equa-
tions (4.13) and (4.14) are very close in 2-D. In high dimensions, both
theory and experiments show that Equation (4.14) is still an asymptotic

upper-bound of Equation (4.13). By our agreement in Equation (4.6) that
S = bN², we have:

b = \frac{L^d}{d!\, N^2} \qquad (4.15)

Figure 4.35: Maximum number of partitions in 2-D

Now, we have introduced our main theory that could estimate the train-
ing accuracy for a d − L − 1 structure FCNN and two classes of N random
(uniformly distributed) data points by using Equations (4.12) and (4.15).
For example, let a dataset have 200 two-class random data samples in R3
(100 samples for each class) and let it be used to train a 3 − 200 − 1 FCNN.
In this case,
b = \frac{200^3}{3! \cdot 200^2} \approx 33.33.

Substituting b = 33.33 into Equation (4.12) yields E [α] ≈ 0.995, i.e., the ex-
pectation of training accuracy for this case is about 99.5%.
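The worked example above can be reproduced directly (a short sketch using SciPy's erfi; the function and variable names are ours):

import math
from scipy.special import erfi

def expected_accuracy(b):
    """Equation (4.12): E[alpha] = 1/2 + sqrt(2*pi*b)/4 * erfi(1/sqrt(2b)) * exp(-1/(2b))."""
    return 0.5 + math.sqrt(2 * math.pi * b) / 4 * erfi(1 / math.sqrt(2 * b)) * math.exp(-1 / (2 * b))

d, N, L = 3, 200, 200
b = L**d / (math.factorial(d) * N**2)     # Equation (4.15): 200^3 / (3! * 200^2) ~ 33.33
print(b, expected_accuracy(b))            # ~33.33 and ~0.995, matching the example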
4.6.3 Empirical Corrections

The empirical correction uses results from real experiments to up-


date/correct the theoretical model we have proposed above. The correction
is necessary because our hypothesis ignores the separating capacity of the
second (output) layer and because the maximum number of partitions is not
guaranteed for all situations; e.g., for a large L, the real partition number
may be much smaller than S in Equation (4.13).
In the experiments, we train a d − L − 1 structure FCNN on N two-class random
(uniformly distributed) data points in [0, 1)^d with labels ‘0’ and ‘1’ (the
number of samples for each class is equal). The training process ends when
the training accuracy converges (the loss change is smaller than 10⁻⁴ in 1000
epochs). For each {d, N, L}, the training repeats several times from scratch,
and the recorded training accuracy is the average.
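A single experimental run can be sketched as follows (our own simplified Python/Keras illustration, not the exact training script; for brevity it stops after a fixed number of epochs instead of using the loss-convergence criterion described above):

import numpy as np
import tensorflow as tf

d, N, L = 2, 200, 200
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(N, d)).astype("float32")
y = np.repeat([0, 1], N // 2).astype("float32")    # equal samples per class
rng.shuffle(y)                                      # random labeling of the points

model = tf.keras.Sequential([
    tf.keras.layers.Dense(L, activation="relu", input_shape=(d,)),   # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X, y, epochs=2000, batch_size=N, verbose=0)
print("training accuracy:", history.history["accuracy"][-1])

The optimizer, epoch count, and batch size here are illustrative assumptions; the recorded accuracy should be averaged over several runs from scratch, as described above.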

4.6.3.1 Empirical Correction for 2-D

In 2-D, by Equation (4.15), we have:

b = \frac{1}{2}\left(\frac{L}{N}\right)^2

If L/N = c is constant, b does not change with N. To test this counter-intuitive
inference, we let L = N ∈ {100, 200, 500, 800, 1000, 2000, 5000}. Since
L/N = 1, b and E[α] are
unchanged. But Table 4.16 shows the real training accuracies vary with N.
The predicted training accuracy is close to the real training accuracy only
at N = 200 and the real training accuracy decreases with the growth of N.
Hence, our theory must be refined using empirical corrections.
The correction could be applied on either Equation (4.15) or Equa-
tion (4.12). We modify Equation (4.15) because the range of the function in
Table 4.16: Accuracy results comparison. The columns from left to right
are dimension, dataset size, number of neurons in hidden layer, the real
training accuracy and estimated training accuracy by Equation (4.15) and
Theorem 4.1.

d N L Real Acc Est. Acc


2 100 100 0.844 0.769
2 200 200 0.741 0.769
2 500 500 0.686 0.769
2 800 800 0.664 0.769
2 1000 1000 0.645 0.769
2 2000 2000 0.592 0.769
2 5000 5000 0.556 0.769

Equation (4.12) is (0.5, 1), which is an advantage for training accuracy


estimation. In Table 4.16, the real training accuracy decreases when N
increases; this suggests that the exponent of N in Equation (4.15) should be
larger than that of L. Therefore, following the form of Equation (4.15), we consider
a more general equation to connect the ensemble index b with parameters
d, N, and L:
b = c_d\, \frac{L^{x_d}}{N^{y_d}} \qquad (4.16)

Observation 4.1. The ensemble index b is computed by Equation (4.16)


with special parameters {xd , yd , cd }, where xd , yd are exponents of N and L,
and cd is a constant. All the parameters vary with the dimensionality of
inputs d.

In 2-D, to determine the x2 , y2 , c2 in Equation (4.16), we test 81 {N, L},


which are the combinations of: L, N ∈ {100, 200, 500, 800, 1000, 2000,
5000, 10000, 20000}. For each {N[i], L [i]}, we could obtain a real training
accuracy by experiment. Their corresponding ensemble indexes b[i] are
found using Equation (4.12). Finally, we determine x₂, y₂, c₂ by fitting
1/b as a function of N and L to Equation (4.16); this yields the expression for the ensemble

index for 2-D:

b = 8.4531\, \frac{L^{0.0744}}{N^{0.6017}} \qquad (4.17)

Figure 4.36: Fitting curve of 1/b = f(N, L) in 2-D

The fitting process uses the Curve Fitting Tool (cftool) in MATLAB. Fig-
ure 4.36 shows the 81 points of {N, L} and the fitted curve. The R² value of
the fitting is about 0.998. The reason to fit 1/b instead of b is to avoid
b = +∞, which occurs when the real accuracy is 1 (this can happen); in that
case, 1/b = 0. Conversely, b = 0 when the real accuracy is 0.5, which never
appears in our experiments: using an effective classifier rather than random
guessing makes the accuracy > 0.5 (b > 0), and thus 1/b ≠ +∞. To cover the
parameter space as completely as possible, we manually chose the 81 points
of N and L. We then verified the fitted model, Equation (4.17), on other
random values of N and L.
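The dissertation performed this fit with MATLAB's cftool; an equivalent sketch in Python uses scipy.optimize.curve_fit. The arrays below are hypothetical placeholders standing in for the 81 experimentally obtained (N, L, b) triples:

import numpy as np
from scipy.optimize import curve_fit

def inv_b_model(NL, x2, y2, c2):
    N, L = NL
    return (N ** y2) / (c2 * L ** x2)          # 1/b = N^y2 / (c2 * L^x2)

# Placeholder data; in practice these come from the experiments.
N_arr = np.array([100.0, 200.0, 500.0, 1000.0, 5000.0])
L_arr = np.array([100.0, 500.0, 200.0, 5000.0, 1000.0])
b_arr = np.array([5.0, 4.0, 2.5, 2.0, 1.0])    # hypothetical ensemble indexes

params, _ = curve_fit(inv_b_model, (N_arr, L_arr), 1.0 / b_arr, p0=(0.1, 0.6, 8.0))
x2, y2, c2 = params
print("fitted x2, y2, c2:", x2, y2, c2)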
By using Equations (4.12) and (4.17), we estimate training accuracy on
Table 4.17: Estimated training accuracy results comparison in 2-D. The
columns from left to right are dataset size, number of neurons in hidden
layer, the real training accuracy, estimated/predicted training accuracy
by Equation (4.17) and Theorem 4.1, and (absolute) differences based on
estimations between real and estimated accuracies.

d=2 N L Real Acc Est. Acc Diff


115 2483 0.910 0.846 0.064
154 1595 0.805 0.819 0.014
243 519 0.782 0.767 0.015
508 4992 0.699 0.724 0.025
689 2206 0.665 0.685 0.020
1366 4133 0.614 0.631 0.016
2139 2384 0.578 0.593 0.015
2661 890 0.566 0.573 0.007
1462 94 0.577 0.592 0.014
3681 1300 0.555 0.560 0.004
4416 4984 0.556 0.559 0.003
4498 1359 0.550 0.552 0.002

random values of N and L in 2-D. The results are shown in Table 4.17. The
differences between real and estimated training accuracies are small, except
the first row. For higher real-accuracy cases (> 0.86), the difference is larger
because 1/b < 1 there (b > 1 when the accuracy > 0.86), and such values have
a smaller effect than the cases with 1/b > 1 during the fitting to find
Equation (4.17).

4.6.3.2 Empirical Correction for 3-D and More Dimensions

We repeat the same processes as for 2-D to determine special parameters


{xd , yd , cd } in Observation 4.1 for data dimensionality from 3 to 10. Results

are shown in Table 4.18. The R2 values of fitting are high.


Such results reaffirm the necessity of correction in our theory because,
when compared to Equation (4.15), the parameters {x_d, y_d, c_d} are not
{d, 2, 1/d!}.
Table 4.18: Parameters {xd , yd , cd } in Equation (4.16) (Observation 4.1) for
various dimensionalities of inputs are determined by fitting.

d xd yd cd R2
2 0.0744 0.6017 8.4531 0.998
3 0.1269 0.6352 15.5690 0.965
4 0.2802 0.7811 47.3261 0.961
5 0.5326 0.8515 28.4495 0.996
6 0.4130 0.8686 61.0874 0.996
7 0.4348 0.8239 33.4448 0.977
8 0.5278 0.9228 61.3121 0.996
9 0.7250 1.0310 82.5083 0.995
10 0.6633 1.0160 91.4913 0.995

But the growth of x_d/y_d is preserved. From Equation (4.15),

\frac{x_d}{y_d} = \frac{d}{2},

i.e., x_d/y_d increases linearly with d. The real d vs. x_d/y_d (Figure 4.37) shows the
same tendency.
Table 4.18 indicates that xd , yd , and cd increase almost linearly with d.
Thus, we apply linear fitting on d-xd , d-yd , and d-cd to obtain these fits:

xd = 0.0758 · d − 0.0349 (R2 = 0.858)


yd = 0.0517 · d + 0.5268 (R2 = 0.902) (4.18)

cd = 9.4323 · d − 8.8558 (R2 = 0.804)

Equation (4.18) is a supplement to Observation 4.1, which employs


empirical corrections to Equation (4.15) for determining the ensemble index
b of Theorem 4.1.

Figure 4.37: Plot of d vs. x_d/y_d from Table 4.18. The blue dotted line is a
linear fit to the points, showing the growth.

4.6.3.3 Summary of the algorithm

By combining the two statements, Observation 4.1 and Theorem 4.1,
we can estimate the training accuracy for a two-layer FCNN on two-class
random datasets using only three arguments: the dimensionality of inputs
(d), the number of data points (N), and the number of neurons in the hidden
layer (L), without actual training.
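The complete procedure (summarized later as Algorithm 3) can be written compactly; the following is a short Python sketch of it, using SciPy's erfi for Equation (4.12):

import math
from scipy.special import erfi

def estimate_training_accuracy(d, N, L):
    # Equation (4.18): linear fits of the correction parameters vs. dimensionality d.
    x_d = 0.0758 * d - 0.0349
    y_d = 0.0517 * d + 0.5268
    c_d = 9.4323 * d - 8.8558
    # Equation (4.16): empirically corrected ensemble index.
    b = c_d * L**x_d / N**y_d
    # Equation (4.12): expectation of training accuracy.
    return 0.5 + math.sqrt(2 * math.pi * b) / 4 * erfi(1 / math.sqrt(2 * b)) * math.exp(-1 / (2 * b))

print(estimate_training_accuracy(d=2, N=1462, L=94))   # e.g., a case similar to Table 4.17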

4.6.3.4 Testing

To verify our theorem and observation, we first estimate training accuracy


on a larger range of dimensions (d = 2 to 24) and for three situations:

• N ≫ L: N = 10000, L = 1000

• N ≅ L: N = L = 10000

• N ≪ L: N = 1000, L = 10000

The results are shown in Figure 4.38.
Figure 4.38: Estimated training accuracy results comparisons for the three
situations (top: N ≫ L; middle: N = L; bottom: N ≪ L). The y-axis is
accuracy; the x-axis is the dimensionality of inputs (d). Each panel plots the
real accuracy, the estimated accuracy, and their difference.
Algorithm 3: To estimate the training accuracy α for two-layer neural
networks on a random dataset without training
Input: the dimensionality of inputs d, the number of data points N, the
number of neurons in the hidden layer L.
Result: the expectation of training accuracy E[α].
// Use Equation (4.18) to calculate the three parameters {x_d, y_d, c_d}.
1  x_d ← 0.0758 · d − 0.0349;
2  y_d ← 0.0517 · d + 0.5268;
3  c_d ← 9.4323 · d − 8.8558;
// Use Equation (4.16) to compute the ensemble index b.
4  b ← c_d · L^{x_d} / N^{y_d};
// Use Equation (4.12) to compute the expectation of training accuracy E[α].
5  E[α] ← 1/2 + (√(2πb)/4) · erfi(1/√(2b)) · e^{−1/(2b)};
Output: E[α].

The maximum differences between real and estimated training accuracies
are about 0.130 (N ≫ L), 0.076 (N ≅ L), and 0.104 (N ≪ L). There may be
two reasons why the differences are not small in some cases of N ≫ L: 1) in
the fitting process, we did not have enough samples for which N ≫ L, so the
corrections are not perfect; 2) the reason why the differences are greater for
higher real accuracies is similar to the 2-D situation discussed above.
In addition, we estimate training accuracy on 40 random cases. For each
case, the N, L ∈ [100, 20000] and d ∈ [2, 24], but principally we use d ∈ [2, 10]
because in high dimensions, almost all cases’ accuracies are close to 100%
(see Figure 4.38). Figure 4.39 shows the results. Each case is plotted with
its real and estimated training accuracy. The overall R2 value is about 0.955,
indicating good estimation.
Figure 4.39: Evaluation of estimated training accuracy results. y-axis is
estimated accuracy; x-axis is the real accuracy; each dot is for one case;
red line is y = x. R2 ≈ 0.955.

4.6.4 Discussion

4.6.4.1 Significance and Contributions

Our main contribution is to build a novel theory to estimate the training


accuracy for a two-layer FCNN used on two-class random datasets without
using input data or trained models (training); the theory could help to
understand the mechanisms of neural network models. The estimation
uses three arguments:

1. Dimensionality of inputs (d)

2. Number of data points (N)

3. Number of neurons in the hidden layer (L)
It is also based on the conditions on classifier models and datasets stated in
Section 4.6.1.1 and on the two hypotheses. There appear to be no other
studies that have proposed a method to estimate training accuracy in this
way.
Our theory is based on the notion that hidden layers in neural networks
perform space partitioning and the hypothesis that the data separation
ratio determines the training accuracy. Theorem 4.1 introduces a mapping
function between training accuracy and the ensemble index. The ensemble
index, by virtue of its domain (0, ∞), is better suited than accuracy (whose
range is (0.5, 1)) for the computation required by the fitting process; this
maintains parity between the domains of the input variables and the range
of the fitted quantity (the ensemble index). The extended domain is consistent
with the domains of N, L, and d, which is convenient for designing prediction
models or fitting experimental data. Observation 4.1 provides a calculation
of the ensemble index based on empirical corrections, and these corrections
substantially improve the accuracy of our model's estimates.

4.6.4.2 Improvements of the Theory

An ideal theory would precisely estimate the accuracy with no, or very


limited, empirical corrections. Our purely theoretical results (to estimate
training accuracy by Equation (4.15) and Theorem 4.1) cannot match the
real accuracies in 2-D (Table 4.16). Empirical corrections are therefore
required.
Those corrections are not perfect so far, because of the limited number of
fitting and testing samples. Although the empirically-corrected estimation
is not very accurate for some cases, the method does reflect some character-
istics of the real training accuracies. For example, the training accuracies

of the N ≫ L cases are smaller than those of N ≪ L, and for specific N and
L, the training accuracies of higher dimensionality of inputs are greater
than those of lower dimensionality. These characteristics are shown by
the estimation curves in Figure 4.38. Although there are large errors for
some cases, Figure 4.38 shows the similar tendencies of real and estimated
accuracy.
The theorem, Observation 4.1, and the empirical corrections could be
improved in the future. The improvements would be along these directions:

1. To use more data for fitting.

2. To rethink the space partitioning problem to change Observation 4.1.


We could use a different approximation formula from Equation (4.14),
or involve the probability of reaching the maximum number of parti-
tions.

3. To modify Theorem 4.1 by reconsidering the necessity of complete sep-


aration. In fact, in real classification problems, complete separation
of all data points is too strong a requirement. Instead, to require
only that no different-class samples are assigned to the same partition
would be more appropriate.

4. To involve the capacity of a separating plane [67]:

   f(N, d) = \begin{cases} 1 & N \le d + 1 \\ \frac{2}{2^N} \sum_{i=0}^{d} \binom{N-1}{i} & N > d + 1 \end{cases}

   where f(N, d) is the probability that there exists a hyperplane separating
   N two-class points in d dimensions (a small numeric sketch of this
   formula follows this list).
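As referenced in item 4, the capacity formula is easy to evaluate numerically; the following short sketch (our own illustration) computes the fraction of dichotomies of N points in general position in d dimensions that are linearly separable:

from math import comb

def separable_fraction(N, d):
    if N <= d + 1:
        return 1.0
    return 2.0 * sum(comb(N - 1, i) for i in range(d + 1)) / 2**N

print(separable_fraction(4, 2))    # 1.0  (N <= d + 1)
print(separable_fraction(10, 2))   # 2 * (1 + 9 + 36) / 1024 ~= 0.09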
4.6.4.3 For Deeper Neural Networks

Our proposed estimation theory could extend to multi-layer neural net-


works. As discussed at the beginning of Section 4.6.2, L neurons in
the first hidden layer assign every input a unique code. The first hidden
layer transforms N inputs from d-D to L-D. Usually, L > d, and the higher
dimensionality makes data separation easier for successive layers. Also,
the effective N decreases when data pass through layers if we consider
partition merging. Specifically, if a partition and its neighboring partitions
contain same-class data, they could be merged into one because these
data are locally classified well. The decrease of actual inputs makes data
separation easier for successive layers also.
Alternatively, the study of Pascanu et al. [201] provides a calculation of
the number of space partitions (S regions) created by k hidden layers:

S = \left(\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor^{n_0}\right) \sum_{i=0}^{n_0} \binom{n_k}{i}

where n_0 is the size of the input layer and n_i is the width of the i-th hidden
layer. This reduces to Equation (4.13) in the present case (k = 1, n_0 = d and n_1 = L). The
theory for multi-layer neural networks could begin by using the approaches
above.

4.6.4.4 More Conditions

In addition, this study has several aspects worth enhancing or extending,
and it raises some questions for future work. For example, the proposed
theory could be extended to data distributions other than uniform, to unequal
numbers of samples in different classes, and/or to other types of neural
networks by modifying the ways the separation probabilities are calculated,
such as in Equations (4.2) and (4.7).

4.7 Conclusion

In this chapter, we applied Deep Learning (DL)-based methods to detect


breast cancer from mammograms. Training a CNN from scratch is not feasible
with the limited number of labeled mammographic images. Our results show that
using transfer learning in CNN is a promising solution for breast cancer
detection. The pre-trained CNN model (VGG-16) can automatically extract
features from mammographic images, and a good Neural Network (NN)-
classifier can be trained by these features without providing hand-crafted
features. Combining pre-trained CNN (VGG-16) with a one-layer Fully-
Connected Neural Network (FCNN) classifier can achieve average accuracy
about 0.905 for classifying abnormal vs. normal cases in the DDSM dataset.
We also show that Generative Adversarial Network (GAN) can be used as
an image augmentation method for training and to improve the performance
of CNN classifiers. Our results indicate that, to classify the normal ROIs and
abnormal (tumor) ROIs from the DDSM dataset, adding GAN-generated ROIs
to the training data helps prevent the classifier from over-fitting. A traditional
image augmentation method, affine transformation, however, performs worse
than GAN; therefore, GAN could be an ideal augmentation option. By
comparing GAN ROIs with affine-transformed ROIs in their distributions of
mean, standard deviation, skewness, and entropy, we found that GAN ROIs
are more similar to real ROIs than affine-transformed ROIs in terms of mean
and entropy. We also find that images augmented by GAN or affine
transformation cannot substitute for real images when training CNN
classifiers, because the absence of real images in the training set causes
over-fitting as training continues.
To further evaluate the images generated from GAN models, we propose
a novel GAN measure – the Likeness Score (LS) – which directly analyzes the
generated images without using a pre-trained classifier and is stable with
respect to the number of images. Compared with other methods, such as
IS and FID, LS has fewer constraints and wider applicability. In particular,
LS can explain results in the three main respects of optimal GANs according
to our expectations of ideal generated images. Such explanations deepen our
understanding of GANs and of other GAN measures, which will help to
improve GAN performance.
In addition, we have examined two more basic questions for the CNN and
deep learning models: the generalizability of the Deep Neural Network (DNN)
and how to understand the mechanism of DNN models.
We propose the Decision Boundary Complexity (DBC) score to define
and measure the decision boundary complexity of the DNN. DBC score is
computed from the entropy of eigenvalues of adversarial examples, which
are generated on or near the decision boundary, and in a feature space of
any dimension. Training data and the trained models are used to compute
the DBC scores, and test data are used to obtain test accuracies as the
ground truth for models’ generalizability. Results verifies our hypothesis
that a DNN with a simpler decision boundary has better generalizability.
Thus, DBC provides an effective way to measure the complexity of decision
boundaries and its relationship to the generalizability of DNNs.
To understand the mechanism of DNN models, we create a novel theory
based on space partitioning to estimate the approximate training accuracy
for two-layer neural networks on random datasets without training. It
does this using only three arguments: the dimensionality of inputs (d), the
number of input data points (N), and the number of neurons in the hidden
layer (L). The theory has been verified by the computation of real training
accuracies in our experiments. Although the method requires empirical
correction factors, they are determined using a principled and repeatable
approach. The results indicate that the method will work for any dimension,
and it has the potential to estimate deeper neural network models. This
study may raise other questions and suggest a starting point for a new
way for researchers to make progress on studying the transparency of deep
learning and explainable deep learning.
Chapter 5: Deep Learning-based Medical Images Segmentation

This chapter presents our work applying Deep Learning (DL)-based
methods to medical image segmentation. First, we propose an autoencoder-
like Convolutional and Deconvolutional Neural Network (C-DCNN) model to
automatically segment breast areas in thermal images. To understand the
relationship between segmentation and classification using Deep Neural
Network (DNN) models, we then approach the segmentation problem using
trained Convolutional Neural Network (CNN) classifiers, which automatically
extract important features from images for classification, instead of using
current DL segmentation models (e.g., C-DCNN and UNet).

5.1 Introduction

Breast cancer is the second leading cause of cancer death for women in the U.S.
Early detection of breast cancer has been shown to be the key to higher
survival rates for breast cancer patients. We are investigating infrared ther-
mography as a noninvasive adjunct to mammography for breast screening.
Thermal imaging is safe, radiation-free, pain-free, and non-contact. Auto-
mated segmentation of the breast area from the acquired thermal images will
help limit the area for tumor search and reduce the time and effort needed
for manual hand segmentation. Autoencoder-like C-DCNN are promising
computational approaches to automatically segment breast areas in thermal
images. In this study, we apply the C-DCNN to segment breast areas from
our thermal breast image database, which we are collecting in our clini-
cal trials by imaging breast cancer patients with our infrared camera (N2
Imager). For training the C-DCNN, the inputs are 132 gray-value thermal
images and the corresponding manually-cropped breast area images (binary
masks to designate the breast areas). For testing, we input thermal images
to the trained C-DCNN and the output after post-processing are the binary
breast-area images.
Instead of using current DL-based segmentation models (like the UNet
and variants), we then employ a “sneak attack” on segmentation of mam-
mographic images using CNN classifiers. The CNN classifiers can automat-
ically extract important features from images for classification. Those ex-
tracted features can be visualized and formed into heatmaps using Gradient-
weighted Class Activation Mapping (Grad-CAM). This study tested whether
the heatmaps could be used to segment the classified targets. We also pro-
posed an evaluation method for the heatmaps; that is, to re-train the CNN
classifier using images filtered by heatmaps and examine its performance.
We used the mean-Dice coefficient to evaluate segmentation results.

5.2 Segmentation of Thermal Breast Images1

5.2.1 Background and Related Work

Breast cancer will be diagnosed among about 12% of U.S. women dur-
ing their lifetime, making it the second leading cause of cancer death for U.S.
women [248, 54]. Early detection of breast cancer via Computer-Aided
Diagnosis (CAD) systems has been shown to improve outcomes of breast
cancer treatment and increase patients’ survival times [215]. If the tumor
is detected and localized early, the 5-year relative survival rate is more
than 94% [52]. Although X-ray mammography is the gold standard for
breast cancer detection, it nevertheless has a substantial false-positive rate,
requires exposure to radiation, often is uncomfortable, and is less effective
1 This work has been published in [C10].
Figure 5.1: Full thermal raw images of two patients, including the neck,
shoulder, abdomen, background and chair.

for dense-breast cases. Alternatively, thermography has been shown to


be a promising noninvasive adjunct to mammography for breast cancer
screening [190]. Studies show that thermography is a reliable indicator of
increased risk of breast cancer at early stages [126]. We are conducting a
pilot study to image patients diagnosed with breast cancer with our infrared
camera.
As seen in Figure 5.1, thermal images may contain various parts in
addition to the region of interest required for automatic detection of abnor-
malities in the breast. These parts include the background, chair, neck,
shoulders and lower abdomen. Removing those regions and extracting the
breast region is a crucial preprocessing step for CAD systems for breast
cancer. Automatic segmentation of the breast region will limit the area
for tumor search and reduce processing time. Furthermore, automatic
segmentation reduces the time and effort required in manual segmentation,
and potentially minimizes human error.
The accurate segmentation of the thermal breast images is still a difficult
task [27] because the breast thermogram has inherent limitations such
as low contrast, low signal to noise ratio [309], lack of clear edges and
no definite shape. Previous studies of breast thermogram segmentation
mostly used histogram analysis, threshold-based techniques, edge-based
techniques, and region-based techniques, for example, edge detection by
Hough transform feature curve (parabola) extraction [208], edge detection
by interpolation of curves [237], snake algorithm [70], detection of edge and
boundary curves [131], anisotropic diffusion filter-based edge detection [259]
and automated segmentation algorithm based on ellipse detection (our lab).
Recently, Deep Learning (DL) has become a state-of-the-art method to
segment images. The SegNet [17], for example, was trained to segment
urban street images to parts of sky, building, road marking, pavement,
etc. For medical-image applications, deep neural networks have been applied
to segmentation of retinal vessels, brain tissues in MRI, and liver lesions
in CT [177, 182, 44]. However, we are aware of no study that has used
DL-based segmentation to segment breast thermograms. This study fills
this gap by providing a DL model to automatically segment the breast area
from the whole thermal breast image.

5.2.2 Breast Thermography Image Collection and Image Pre-


processing

Tumor growth is accompanied by angiogenesis, or the formation of new


blood vessels. This growing vascular network supplies the developing tumor
with nutrients and oxygen and removes waste products. The increased
blood flow in the tumor region results in an increased local temperature
of that region compared to the temperature of the surrounding tissues.
Previous work indicates that differences of as little as 0.1K can be clinically
important [126]. Therefore, thermography has the potential to detect those
elevated skin temperatures that arise from the increased blood flow.
We are using the N2 Imager (N2 Imaging Systems, Irvine, Calif.). The
Figure 5.2: Our breast infrared thermography system.

camera detects wavelengths in the long infrared region of the electromagnetic


spectrum (8-12 microns), and humans at normal body temperature radiate
thermal infrared at wavelengths around 10 microns. It has a 640x480 array
of 17-micrometer pixels, and a stated thermal resolution of 18.6 mK.
Preliminary pilot studies are currently being conducted in collaboration
with the Breast Clinic at the George Washington University (GW) Medical
Faculty Associates. Patients diagnosed with breast cancer are imaged with
the infrared camera for a total time of 15 minutes to observe cool-down
of the breast tissue [126]. The patient sits still with both arms raised
on arm supports, with the camera positioned approximately 25 inches
away from the patient (frontal view). Imaging starts immediately after the
patient undresses, capturing images every minute as the patient’s skin
cools. Figure 5.2 shows the environment for our breast image acquisition.
We collected data from 11 breast cancer patients, with 15 images for
each case, at a rate of one image per minute. We chose 15 minutes for our
screening time, similar to proposed IR image acquisition protocols found
in the literature, which typically take 4 to 15 minutes [249]. The 15 minutes
allow enough time to observe cooling of the breasts since we are allowing


Figure 5.3: Preprocessing of the raw IR images: (a) original raw IR image,
(b) manual rectangular crop to remove shoulders and abdomen, and (c) is
the hand-trace of the breast contour to generate the manual segmentation
(ground truth).

natural cool down of the breast tissue, and without causing discomfort for
the patient by sitting still for a long period of time. The rationale is that
the surrounding tissue cools faster than the tumor, which increases the
thermal contrast.
Initially, images were cropped manually by removing the upper and lower
regions (neck and abdomen). All breast IR images were converted to 8-bit
gray-scale. Then, a trained student manually traced the breast curvature
and cropped the breast region from the rest of the body to form the ground
truth breast region images for training and testing the segmentation model
(Figure 5.3). In practice, these truth breast region images were set to binary
values, where (in gray-scale) 0 (black) is for background and 255 (white) for
breast areas.

5.2.3 Segmentation Model Architecture

The CNN is a neural network incorporating many convolutional layers. It


is the commonly used deep-learning method for image processing because
using convolutional layers instead of fully-connected layers reduces the total
number of weights needed for input images. Also, convolutional layers
Table 5.1: C-DCNN segmentation architecture for thermal breast images.

Layer Shape
input: gray-scale image 400x200x1
Conv_3-1 + ReLU Normalization 400x200x1
Conv_3-64 + ReLU 400x200x64
MaxPool_2 Normalization 200x100x64
Conv_3-128 + ReLU 200x100x128
MaxPool_2 Normalization 100x50x128
Flatten 640000
FC 200 + ReLU 200
FC 200 + ReLU 200
FC 640000 + ReLU 640000
Reshape to 100x50x128 100x50x128
Normalization Up-sampling 200x100x128
Conv_3-128 + ReLU 200x100x128
Normalization Up-sampling 400x200x128
Conv_3-64 + ReLU Normalization 400x200x64
output: Conv_3-1 + tanh 400x200x1

could retain the local structures of images. CNN usually transforms an


image to a vector, while deconvolutional neural networks (DCNN) apply the
opposite transformation of CNN by transforming a vector to an image. In this
study, we used the deep-learning based segmentation model: C-DCNN [195],
which connects a CNN and DCNN together. Firstly, convolutional layers
transform a 2-D image to a feature vector; then deconvolutional layers
convert the vector back to an image. Overall, the model transforms images to images.
In our experiments, the segmentation model converts infrared breast images
to segmented breast region images. Such an architecture is also called an
autoencoder [19]: CNN is the encoder and the feature vectors are “codes” of
input images. For deep-learning segmentation model, the “codes” could be
a smaller image or other shapes of data besides vectors.
The details of the C-DCNN structure are shown in Table 5.1. It consists
of six convolutional layers, two max-pooling, two up-sampling layers, three
fully-connected (FC) layers and one flattening layer. The activation function
for each layer is the ReLU function [186] except the last one for the output,
which is the tanh function.
The notation Conv_3-64 means there are 64 convolutional neurons
(units), with each unit having a filter size of 3×3-pixel (height × width)
in the layer. MaxPool_2 is a max-pooling layer with the filter size 2×2-pixel
window, stride 2; up-sampling layers have the same size. FC_200 is a
fully-connected layer containing 200 units. Normalization is the batch
normalization layer, which normalizes the activations of the previous layer
at each batch and helps accelerate deep network training [120]. The output
layer uses the tanh function, which maps the output value to the range of
[-1, 1].
Data shapes from input to output are symmetric. The CNN (encoder)
transforms an image to a 200-length vector (code) and the D-CNN (decoder)
transforms the vector back to an image. The 8-bit gray-scale input images
were scaled from [0, 255] to [-1,1] to match the value range required for
the neural network input. Similarly, the neural network segmented output
image is then rescaled back to uint-8 [0, 255].
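The architecture in Table 5.1 can be expressed in the Keras API as follows (an illustrative reconstruction, not the exact training script; the MSE loss and Nadam optimizer with learning rate 1e-4 follow the training settings reported in Section 5.2.5):

import tensorflow as tf
from tensorflow.keras import layers

def build_cdcnn(height=400, width=200):
    inputs = tf.keras.Input(shape=(height, width, 1))
    # Encoder (CNN)
    x = layers.Conv2D(1, 3, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.BatchNormalization()(x)
    # Bottleneck: the 200-length "code"
    x = layers.Flatten()(x)                          # 100 * 50 * 128 = 640000
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dense(100 * 50 * 128, activation="relu")(x)
    x = layers.Reshape((height // 4, width // 4, 128))(x)
    # Decoder (DCNN)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cdcnn()
model.compile(optimizer=tf.keras.optimizers.Nadam(1e-4), loss="mse")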

5.2.4 Experiments and Evaluation

Experiment 1 Since all image samples were from 11 breast cancer pa-
tients, with 15 samples for each patient, the first experiment randomly
selected 12 samples from each patient for the training set and the remaining
3 samples for the testing set. In total, there are 132 breast infrared images
along with 132 manually segmented regions for training the segmentation
model, and 33 breast infrared images and their segmentations for testing.

Figure 5.4: Training and testing data for Experiments 1 and 2.

Experiment 2 In this leave-one-case-out experiment, one patient is left


out of the training set by taking all 15 samples of that patient as the testing
set, while 150 images from 10 patients are used for training. The goal of
this experiment is to evaluate the performance of the segmentation model
on brand new breast shapes (cases) that were not seen during training.
Figure 5.4 shows the two proposed experiments.

Testing and Evaluation Criterion After training the segmentation C-


DCNN model, we put the testing breast infrared (IR) images in the model;
as we discussed in the Methods section, the outputs are predicted breast
area images in gray-scale. Since the truth breast region images are binary,

to compare the predicted images with the truth data, we applied Otsu's
algorithm [98] to automatically convert gray-scale segmentation images to
binary segmentation images.

Figure 5.5: The evaluation process: the trained segmentation model converts
an IR breast image to a gray seg-image, Otsu's thresholding converts it to a
binary seg-image, and the binary result is compared with the ground-truth
regions using IoU.
We compared the binary segmentation images with truth region images
by computing their Intersection-over-union (IoU), also called the Jaccard
Similarity (See Figure 5.5). The IoU of two binary images is the ratio of
overlapped area divided by area of union. Therefore, for two binary images
I1 and I2 of the same size, the IoU is:

IoU(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}

If the two images have the same breast region, IoU will be 1. For all the
testing results, we computed their IoU with ground truth manual segmented
regions to evaluate the segmentation C-DCNN model.
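A minimal sketch of this evaluation step in Python (our own illustration; it uses scikit-image for Otsu's threshold, and the function names are ours):

import numpy as np
from skimage.filters import threshold_otsu

def iou(mask_a, mask_b):
    """IoU of two boolean masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union

def evaluate(gray_prediction, truth_mask):
    """gray_prediction: uint8 image in [0, 255]; truth_mask: boolean array."""
    binary_prediction = gray_prediction > threshold_otsu(gray_prediction)
    return iou(binary_prediction, truth_mask)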

5.2.5 Results

Our implementation of neural networks used the Keras API with a
TensorFlow backend [5]. The development environment for Python was Anaconda3.
We set 1000 epochs for training and the batch-size is 6. The loss-function for

training is the mean square error (MSE) between the predicted segmentation
images and the truth images, and the optimizer is Nadam [139] with default
parameters (except the learning rate, which was changed to 1e-4).

Figure 5.6: The training curve.

One training curve of MSE is shown in Figure 5.6; it shows that the MSE
decreased rapidly at the beginning and converged close to 0 after 600 epochs.
In Experiment 1, the model was trained by 132 samples and tested on 33
samples once. In Experiment 2, we trained the C-DCNN model from scratch
for 11 times by leave-one-case-out training and testing. On average, each
epoch took 7 seconds by using one NVIDIA GTX-1080Ti GPU.
Figure 5.7 shows a sample segmentation result of one patient from
Experiment 1, where 33 test samples come from 11 patients, with 3 testing
breast images for each patient. In Figure 5.8, bars show the ranges among 3
samples for the same patient because breast images may be slightly different
due to patient movement during image acquisition and breast temperature

change over time. The overall average IoU is about 0.9424 with 0.0248
standard deviation.

Figure 5.7: Segmentation results of one patient from Experiment 1 (patient
011), with IoU = 0.960.
In Experiment 1, training and testing sets contain images of the same
patient. That is, IR breast images having the same region are in both train-
ing and testing sets. A possible explanation for the good segmentation
performance of the C-DCNN might be that it memorized the breast region
for each patient but has not learned how to segment breast regions. There-
fore, Experiment 2 evaluated the trained segmentation model on patients
whose IR breast images were excluded from training, to rule out such memorization.
Figure 5.9 shows two results from Experiment 2. In the first row, the
segmentation model was trained by 10 patients’ IR breast images without
patient 002 (leave-one-case-out). All 15 test images come from the 002
patient. The predicted segmentation breast region seems to be synthesized
by several trained breast areas from other patients and the segmentation
result of patient 002 is not as accurate as the result from Experiment 1;
however, the predicted breast area still covers most of the ground truth
breast area. The second row shows a better example for patient 007.

Figure 5.8: Results of Experiment 1 (average IoU 0.9424, standard deviation
0.0248). The blue dots are the average IoU for each patient (x-axis: subject
ID) and bars show the range among 3 samples.

Figure 5.9: Segmentation results of two patients from Experiment 2 (patient
002: IoU = 0.774; patient 007: IoU = 0.864).

Figure 5.10: Results of Experiment 2 (average IoU 0.8340, standard deviation
0.0809). The blue dots are the average IoU of each patient among its 15
testing samples, the red lines are medians, and the bars show the ranges.

Figure 5.10 shows that the average IoUs in most cases are better than 0.8,
and the overall average IoU is about 0.8340 with 0.0809 standard deviation.
The standard deviation is relatively high because of the wide variety of breast
shapes and contours among different patients. A low average IoU for a case
means its breast region (shape and contour) is quite different from the other
cases in the training set; higher-IoU cases have similar breast areas in the
training set (see the Discussion section).

5.2.6 Discussion

5.2.6.1 Comparison of the Two Experiments

The average IoU of Experiment 1 is 0.9424 and Experiment 2 is 0.8340;


the results illustrate that if the segmentation model was trained on some IR
breast images from a patient, it performs better when segmenting
the breast region in other samples from the same patient (Figure 5.11).
In the top part of Figure 5.11, one IR breast image of patient 001 (p.001)
was input to two trained segmentation models: one model had been trained
without p.001’s samples (Experiment 2) and another one had been trained
with some of p.001’s samples (different from the input one) (Experiment 1).
Both the outputs and IoUs demonstrate that the outcomes of training with
or without the same patient's samples can differ substantially, with results
from Experiment 1 (with the same patient's samples) being better. For
p.001, the output from Experiment 1 looks very similar to the ground truth
region, however, the predicted segmentation area from Experiment 2 looks
like breast regions from other patients used in the training set.
On the contrary, in the bottom part of Figure 5.11, the segmentation
outputs for p.009 from the two experiments are very similar. This is because
another patient (p.010) has a similar IR breast region (breast shape and
contour) as p.009. The segmentation model was trained by similar-looking
breast regions. It is not surprising that if training samples include breast
images very similar to the test image, the segmentation outcomes become
better. Such results indicate that including more IR breast images of various
breast shapes and sizes in the training process of the C-DCNN segmentation
model will greatly improve overall performance.

5.2.6.2 Limitations and Future Work

Both Otsu’s thresholding and IoU computation used in this study have
limitations. Although Otsu’s thresholding converts gray-scale images to
binary automatically, it cannot guarantee optimal segmentation. IoU is
used to compare two binary regions but results are subject to image size.
There could exist multiple ways to segment the breast region from IR im-
ages, suggesting some limitations to the manual hand segmentation. For
instance, a predicted segmentation area with a low IoU value by the C-DCNN
segmentation model might still be a reasonable way to segment the breast
region even if it does not match the manual segmentation. Hence, a better
evaluation metric needs to be developed to assess the quality of the breast
segmentation by our developed model. In future studies, we will consider
applying other thresholding and region or contour comparison methods.

Figure 5.11: Comparison of results from the two experiments (first row of
each part: Experiment 2, trained without the patient; second row: Experi-
ment 1, trained with the patient). The second column (Gray seg-image)
shows the output of the segmentation models. The third column is the
ground truth breast region of the patient's testing samples. Top part: p.001
(IoU = 0.778 without vs. 0.924 with); bottom part: p.009 (IoU = 0.895
without vs. 0.939 with).
For future work, one approach to improve outcomes is to combine deep-
learning based segmentation with other methods for pre/post-processing,
such as contrast limited adaptive histogram equalization (CLAHE). Since
histogram equalization changes the image globally, it may not be
achievable by a CNN, because the convolutional operations are localized and
play the roles of various image filters. From Experiment 2 we know that
a greater variety and number of training images could benefit the C-DCNN seg-
mentation model; thus, we will collect more patients' and volunteers' samples
for future training. Also, we could train other deep-learning segmentation
models, such as SegNet [17] or U-Net [221].

5.2.7 Extended Studies2

Based on this study, we proposed other DL-based models with more


complicated architectures for segmentation of thermal breast images. They
are: MultiResUnet [166], DC-UNet [169], CFPNet-M [168], and CaraNet [167].
Comparing with the C-DCNN and U-Net [221], these segmentation models
achieve better performances, contain fewer parameters (light-weight), or/and
run faster. Especially, the CFPNet-M model can perform medical image
segmentation in real-time. These models are comprehensively discussed
2 This work has been published in [C1], [C3], and [C9].
in their related published/pre-printed papers that can be found in the
references. Here, we will not show more details about their architectures
because they are not highly relevant to our main topic.
In our segmentation studies, besides the IoU, we apply another measure-
ment metric – Tanimoto Similarity [219]. To evaluate the performance of
segmentation, we need a method to compare the segmented region with
ground truth. Since we applied the sigmoid function to activate the final
convolutional operator in DL-based segmentation models, the output is
a gray-level image whose values map into the range [0, 1]. Therefore, we must
threshold it before calculating accuracy. Usually, thresholding a grayscale im-
age to binary (binarization) [241], as with Otsu's method [98] used in this study,
introduces additional errors.
In previous studies, we chose IoU as the measurement metric; it com-
pares two binary images as two sets A and B, and their IoU value is:

IoU(A, B) = |A ∩ B| / |A ∪ B|

For binary images, IoU compares images by union and intersection opera-
tions. The intersection operation can be considered as a sum of products.
For two sets A and B:

|A ∩ B| = ∑_i a_i b_i    (5.1)

where a_i ∈ A, b_i ∈ B. This equation holds if a_i, b_i ∈ {0, 1}, which are binary
values. But if a_i, b_i are not binary, we use the sum of products (the right part
of Equation (5.1)) in place of the set intersection operation. Since

|A ∩ A| = ∑_i a_i²

and

|A ∪ B| = |A| + |B| − |A ∩ B| = ∑_i (a_i² + b_i² − a_i b_i),

for gray-to-gray comparison, by analogy with IoU(A, B), the value of the Tani-
moto similarity [219] is:

T(A, B) = ∑_i a_i b_i / ∑_i (a_i² + b_i² − a_i b_i)


In addition, we have shown [168] that Tanimoto similarity is stable
under changes of both image size and object-area ratio (the ratio of object
area to background area, as shown in Figure 5.12), whereas IoU is not. The
values of Tanimoto similarity are close to those of IoU in most cases. Thus,
Tanimoto similarity is a good alternative to IoU for gray-
scale image comparisons. Using Tanimoto similarity avoids binarization,
preserves more information in the segmented images, and costs less time.
Although ground-truth images are binary, it is simple to convert them
to 8-bit gray-scale by multiplying by 255. Therefore, we apply Tanimoto
similarity instead of IoU as the measurement metric in some of our other
segmentation studies [166, 169, 168].
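For illustration, a minimal NumPy sketch of the Tanimoto similarity applied directly to gray-level images is given below; pred_gray and truth_mask are placeholder names, and the scaling of inputs to [0, 1] is an assumption of this sketch:

import numpy as np

def tanimoto(a, b):
    # Tanimoto similarity of two gray-level images scaled to [0, 1];
    # for binary inputs this reduces to the IoU.
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    num = np.sum(a * b)
    den = np.sum(a**2 + b**2 - a * b)
    return num / (den + 1e-12)

# Example: compare a sigmoid segmentation output (values in [0, 1]) with a
# binary ground-truth mask mapped to the same scale.
# score = tanimoto(pred_gray, truth_mask.astype(np.float64))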

5.3 Target Segmentation by a Trained Classifier3

To train a DL-based image segmentation model, the true segmented


images are required. The CNN models for classification and the C-DCNN
models for segmentation have very similar architectures. In this study,
we examine how to segment targets using a trained classifier of those targets,
instead of training a new segmentation model, and we evaluate the segmentation
results. Specifically, we test this method for medical object segmentation.
3 This work has been published in [C2].

[Figure 5.12 panels: change of image size; change of object-area ratio.]

Figure 5.12: The size and object-area ratio change of images. We change
image size by down-sampling and change object-area ratio by adding a blank
margin around the object and down-sampling to keep the same size.

5.3.1 Introduction

For image classification, the Convolutional Neural Network (CNN) has


performed well in many tasks [262]. Traditional classification methods
have relied on manually extracted features; alternatively, the CNN automat-
ically extracts the features from images for classification [163]. To visualize
extracted features from a CNN model, recent techniques such as the Grad-
CAM [240] can weight and combine the features to display heatmaps of
targets on input images, and the targets are the basis of classification. Thus,
such techniques provide a way to find the targets of classification from a
trained classifier.
Since the CNN has been applied to many medical image classification
problems [18], it would be meaningful if we could gain more knowledge or
information about the objects of classification from their extracted features.
The development of deep learning, moreover, has made significant
contributions to medical image segmentation and has become a research focus

Figure 5.13: Results of Grad-CAM applied to the Xception model with an
input image of elephants. (a) is the input image. (b) is the original image masked
by the Grad-CAM heatmap (using the 'Parula' colormap) of the prediction on this
input. (c) is the Grad-CAM heatmap mask using a gray-scale colormap. (d) is the
original image filtered by the heatmap mask.

in this field [161]. Thus, our question is
whether it is possible to segment targets from a trained classifier of those targets.
If the features extracted from a trained classifier (CNN) model of targets
could be used to segment the targets well, it will benefit medical image
segmentation. For example, we could segment breast cancer areas by
re-using a classifier for breast cancer detection without training a new
segmentation model. A potential advantage of such segmentation is that
the objects in classification tasks are often more difficult to segment from
the background because their boundaries are not as apparent as in many
segmentation tasks.

Figure 5.13 shows an example of how the heatmap from the Grad-CAM
method segments the targets. We input an image of African elephants
(Figure 5.13a) into the Xception [43] neural network model pre-trained by
the ImageNet database [227]. The pre-trained Xception model can classify
images into 1000 target classes4. Its top-1 prediction result of the
input image (Figure 5.13a) is ‘African_elephant’. Then, we apply Grad-CAM
to show the heatmap of this prediction; results are shown in Figure 5.13.
Finally, the input image filtered by the heatmap mask can be considered as
a segmentation result of the targets segmented from the background (grass
and sky).
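The steps of this example can be sketched in Keras as follows; here img is assumed to be a 299 x 299 x 3 image array, and heatmap_resized is assumed to be a Grad-CAM heatmap already resized to the image size and scaled to [0, 1] (computed as described in Section 5.3.2). This is an illustrative sketch, not the exact code used for Figure 5.13:

import numpy as np
import tensorflow as tf

# Load the ImageNet-pretrained Xception model and get its top-1 prediction.
model = tf.keras.applications.Xception(weights='imagenet')
x = tf.keras.applications.xception.preprocess_input(img[np.newaxis, ...].astype('float32'))
preds = model.predict(x)
top1 = tf.keras.applications.xception.decode_predictions(preds, top=1)[0][0]
print(top1)  # expected to include the class name 'African_elephant'

# Mask the input image by the resized, [0, 1]-scaled heatmap, as in Figure 5.13(d).
masked = (img.astype('float32') * heatmap_resized[..., np.newaxis]).astype('uint8')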
This example shows that we can achieve a segmentation result without
training a segmentation model but by using a trained classifier. In this
study, we applied this method to mammographic images for breast tumor
segmentation. We used breast tumor images from the DDSM database and
various CNN-based classifier models (e.g., Xception). Since DDSM describes
the location and boundary of each abnormality by a chain-code, we were
able to extract the true segmentations of tumor regions. We used the regions
of interest instead of entire images to train CNN classifiers. After training
the two-class (with- or without-tumor) classifier, we applied Grad-CAM to
the classifier for tumor region segmentation. We expect that this will be a
beneficial method for general medical image segmentation; e.g., we could
segment breast cancer areas by re-using a classifier developed for breast
cancer detection without training a new segmentation model from scratch.
This study – segmenting breast tumors by re-using trained clas-
sifiers – is inspired by applications of explainability in medical imaging.
A main category of explainability methods is attribution-based methods,
4 https://image-net.org/challenges/LSVRC/2014/browse-synsets

which are widely used for interpretability of deep learning [252]. The com-
monly used algorithms of attribution-based methods for medical images are
saliency maps [170], activation maps [275], CAM [308]/Grad-CAM [240],
Gradient [71], SHAP [296], et cetera.
In this study, we examined how the attribution maps (the heatmaps)
generated from the Grad-CAM algorithm can contribute to the segmentation
of breast tumors. The visualization of the class-specific units [307, 308] for
CNN classifiers is used to locate the most discriminative components for
classification in the image. The authors of Grad-CAM also evaluated the
localization capability of Grad-CAM [240] by bounding boxes containing the
objects. But those methods provide coarse boundaries around the targets,
and these studies have not provided further quantitative analysis about
the differences between predicted and real boundaries of the targets. Thus,
they are considered to be methods for localization rather than segmentation.
For weakly-supervised image segmentation [143, 240], CAM/Grad-CAM’s
heatmaps can be computed and combined with other segmentation models,
such as UNet CNNs [191, 211], to improve segmentation performance. We
are aware of no similar study that has applied only the Grad-CAM algo-
rithm with trained CNN classifiers to a specific application of medical image
segmentation, without using any other segmentation models or approaches.
We quantitatively analyzed the differences between predicted boundaries
from Grad-CAM and real boundaries of the targets, and we discussed the
relationships between the performance of segmentation and classification
based on the CNN classifier.

5.3.2 The Grad-CAM Method

To compute the class activation maps (CAM), Zhou et al. [308] proposed
to insert a global average pooling (GAP) layer between the last convolutional
layer (feature maps) and the output layer in CNNs. The size of each 2-D
feature map is [x, y], and a value in the k-th feature map is f^k_{i,j}. Suppose the
last convolutional layer contains n feature maps; then the GAP layer will
have one node per extracted feature map (n nodes in total). By definition, the value
of the k-th node is the average value of the k-th feature map:

F^k = (1/(xy)) · ∑_{i,j} f^k_{i,j},  i ∈ {1, 2, ..., x}, j ∈ {1, 2, ..., y}    (5.2)

After inserting the GAP layer, obtaining the weights of the connections from
the GAP layer to the output layer requires re-training the whole network
with training data. Let w^c_k be the weight of the connection from the k-th node in the
GAP layer to the c-th node in the output layer (for the c-th class). Thus, the
CAM of the c-th class is:

CAM^c[i, j] = ∑_{k=1}^{n} w^c_k · f^k_{i,j}    (5.3)

The size of CAM is the same as that of the feature maps.
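In code, once the n connection weights w^c_k for class c are known, the CAM of Equation (5.3) is simply a weighted sum over feature maps; a minimal NumPy sketch (with assumed array shapes and placeholder names) is:

import numpy as np

def cam_for_class(feature_maps, weights_c):
    # feature_maps: array of shape (x, y, n) from the last convolutional layer.
    # weights_c: the n connection weights w^c_k for class c.
    # Returns the (x, y) class activation map of Equation (5.3).
    return np.tensordot(feature_maps, weights_c, axes=([2], [0]))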


The authors of Grad-CAM (Selvaraju et al. [240]) noted that re-training
is not necessary. They applied the back-propagated gradient and the chain
rule to calculate the connection weights w^c_k instead of re-training.
The value of the c-th node in the output layer, Y^c, is the classification
score for the c-th class:

Y^c = ∑_{k=1}^{n} w^c_k · F^k    (5.4)

By taking the partial derivative with respect to F^k:

∂Y^c / ∂F^k = w^c_k    (5.5)

By the chain rule:

w^c_k = ∂Y^c / ∂F^k = (∂Y^c / ∂f^k_{i,j}) · (∂f^k_{i,j} / ∂F^k)    (5.6)

From Equation (5.2), by taking the partial derivative with respect to f^k_{i,j}:

∂F^k / ∂f^k_{i,j} = 1/(xy)    (5.7)

Therefore, combining Equations (5.6) and (5.7) to eliminate F^k:

w^c_k = (∂Y^c / ∂f^k_{i,j}) · xy    (5.8)

Finally, putting Equations (5.3) and (5.8) together, they obtain a way to
compute the CAM for the c-th class without actually inserting and training a GAP
layer. Thus, it is called Grad-CAM:

Grad-CAM^c[i, j] = xy · ∑_{k=1}^{n} (∂Y^c / ∂f^k_{i,j}) · f^k_{i,j}    (5.9)

The size of CAM (Grad-CAM) equals the size of feature maps: [x, y], which
is usually smaller than the size of input images. For comparison, resizing
is commonly applied to CAMs to enlarge their size to be the same as input
images.
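A minimal TensorFlow/Keras sketch of this computation is given below. The function name and the layer-name argument are our own placeholders; following common practice (and the original Grad-CAM paper), the per-pixel gradients of Equation (5.9) are averaged over spatial positions to obtain one weight per feature map, and a ReLU keeps only positive evidence. The input image is assumed to be an already preprocessed float array:

import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    # Model mapping the input to the last conv layer's feature maps and the output scores.
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]            # Y^c
    grads = tape.gradient(class_score, conv_out)       # dY^c / df^k_{i,j}
    weights = tf.reduce_mean(grads, axis=(1, 2))       # one weight per feature map k
    cam = tf.nn.relu(tf.einsum('bk,bhwk->bhw', weights, conv_out))[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)            # normalize to [0, 1]
    return cam.numpy()                                 # resize to the input size afterwards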

5.3.3 Proposed Experiments

We have two experiments. In the first experiment, we used the regions


of interest (ROIs) of breast cancer images from the DDSM [104] to train
two-class (with/without tumors) CNN classifiers.

[Figure 5.14 elements: left breast → abnormal ROI, right breast → normal ROI, CNN classifier, Grad-CAM, CAM; the CAM of a test abnormal ROI is compared (Dice) with the tumor mask given by the true boundary.]

Figure 5.14: Flowchart of the Experiment #1. The true boundaries of tumor
regions in abnormal ROIs are provided by the DDSM database.

ROIs with tumors are called
abnormal ROIs and ROIs without tumors are called normal ROIs. After
training the two-class classifier using these normal and abnormal ROIs, we
will apply Grad-CAM and the classifier to test abnormal ROIs to segment
tumor regions. Then, we will use the true boundaries of the test
abnormal ROIs to evaluate the segmentation results. Figure 5.14
shows the flowchart of this experiment. The goal of Experiment #1 is
to verify how well the medical targets are segmented by a trained classifier
using the Grad-CAM algorithm.
By using the Grad-CAM algorithm, the trained CNN classifiers can gen-
erate CAMs from both normal and abnormal ROIs. These CAMs can be
considered as masks that indicate the areas that are important to classifi-
cation. In the second experiment, we trained CNN classifiers from scratch
using only the information in those areas: the training data are ROIs filtered
by CAMs. This is an evaluation method for the CAMs: re-train the CNN
classifiers using images filtered by CAMs and examine their performance.
Combined with Experiment #1, the steps of Experiment #2 are
(Figure 5.15):

• To train two-class classifiers with normal and abnormal ROIs.

• To generate CAMs by inputting these ROIs into the trained classifiers
using the Grad-CAM algorithm.

• To create CAM-filtered ROIs: resize the CAMs (heatmaps) to the same size
as the ROIs, convert their range of values to [0, 1], and then multiply them
by the original ROIs (a minimal sketch follows this list). The areas important
to classification are close to 1 in the CAMs; thus, they will be kept in the
CAM-filtered ROIs.

• To train the same two-class classifiers (same models) from scratch
again using the CAM-filtered ROIs.
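The CAM-filtering step in the list above can be sketched as follows (a minimal illustration with placeholder names, not our exact implementation):

import numpy as np
from skimage.transform import resize

def cam_filter_roi(roi, cam):
    # Resize the CAM to the ROI size, scale it to [0, 1], and multiply it with the ROI.
    cam_resized = resize(cam.astype(np.float64), roi.shape, preserve_range=True)
    cam_resized /= (cam_resized.max() + 1e-8)
    return (roi.astype(np.float64) * cam_resized).astype(np.uint8)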

The goals of Experiment #2 are to examine 1) whether Grad-CAM can
really recognize the areas in the images that are important to classification,
and 2) whether the predictions of CNN classifiers really depend on tumor areas.

[Figure 5.15 flowchart: ROIs → train CNN classifiers → Grad-CAM → CAMs; ROIs × CAMs → ROIs filtered by CAMs → train the same CNN classifiers again.]

Figure 5.15: Flowchart of the Experiment #2. The normal and abnormal
ROIs are used twice: to train the CNN classifiers and then to generate CAMs
from the trained classifiers using the Grad-CAM algorithm. The CNN classifiers
trained by the CAM-filtered ROIs are the same CNN models (same structures)
as those trained by the original ROIs, but trained from scratch again.

For comparison, we additionally trained these CNN classifiers using
the truth-mask-filtered ROIs. Creating truth-mask-filtered ROIs is similar
to generating the CAM-filtered ROIs. For abnormal ROIs, we multiply the ROIs
by their corresponding tumor masks so that only the tumor areas are kept
and the background (non-tumor area) is removed (pixel values = 0). For
normal ROIs, since there is no tumor area, we multiply ROIs by randomly
selected tumor masks from abnormal cases, for the purpose of making
normal/abnormal ROIs have similar shapes (outlines).
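A corresponding sketch of the truth-mask and inverse-mask filtering (assuming binary masks and placeholder names) is:

import numpy as np

def mask_filter_roi(roi, mask, inverse=False):
    # Keep only the tumor area of the ROI (or its complement when inverse=True).
    m = (mask > 0).astype(np.float64)
    if inverse:
        m = 1.0 - m            # the inverse mask excludes the tumor area (Dice = 0)
    return (roi.astype(np.float64) * m).astype(np.uint8)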

5.3.4 Image Data Pre-processing

In this study, we also use the mammographic images from the Digi-
tal Database for Screening Mammography (DDSM) [104] as introduced in
Section 4.3.1.
We first downloaded mammographic images from the DDSM database and
cropped Regions of Interest (ROIs) using the given abnormal areas as ground
truth information. Images in DDSM are compressed in LJPEG format. To
decompress and convert these images, we used the DDSM Utility [246]. We
converted all images in DDSM to PNG format. DDSM describes the location
and boundary of actual abnormality by chain-codes, which are recorded in
OVERLAY files for each breast image containing abnormalities. The DDSM
Utility also provides the tool to read boundary information and display them
for each image having abnormalities. Since the DDSM Utility tools run on
MATLAB, we implemented all pre-processing tasks using MATLAB.
We used the ROIs instead of entire images to train CNN classifiers. These
ROIs are cropped rectangular images (Figure 5.16), obtained as follows:

• Abnormal ROIs, from images containing abnormalities, are the
minimum rectangular areas surrounding the whole given ground-truth
boundaries, with padding.

• Normal ROIs were cropped from the breast on the side opposite to the one
having an abnormal ROI; each normal ROI has the same size (with
padding) and location as the corresponding abnormal ROI but on the other
breast side. If both the left and right breasts had abnormal ROIs and
their locations overlapped, we discarded the sample. Since in most cases
only one side has a tumor, and the area and shape of the left and right
breasts are similar, normal ROIs and abnormal ROIs have similar black
background areas and scaling.

• All ROIs are converted to 8-bit gray values.

• All ROIs are only from the CC views.

Figure 5.16: The ROI (left) is cropped from an original image (right) from
the DDSM dataset. The red boundary shows the tumor area. The ROI is larger
than the size of the tumor area because of padding.

The padding is added to all ROIs in order to vary the locations of tumors
in abnormal ROIs and to avoid an excessive proportion of the tumor area in
an ROI. ROIs are larger than the tumor areas because of padding.
As shown in Figure 5.17, the padding is added with some randomness and
depends on the size of the tumor (a minimal sketch follows Figure 5.17):

• Width: randomly adding 10%-30% of tumor width on left and right
sides.

• Height: similarly, randomly adding 10%-30% of the tumor height on the
top and bottom sides.

[Figure 5.17 annotations: tumor width w with 0.1w–0.3w padding on each side; tumor height h with 0.1h–0.3h padding.]

Figure 5.17: The padding is added to the four sides of ROIs with some
randomness, depending on the size of the tumor area.
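A minimal sketch of this cropping rule is given below; the bounding-box coordinates (x0, y0, x1, y1) of the tumor are assumed to be known from the DDSM boundary, and the exact padding draws used in our pre-processing (done in MATLAB) may differ:

import random

def crop_roi_with_padding(image, x0, y0, x1, y1):
    # Crop a rectangular ROI around the tumor bounding box, adding a random
    # 10%-30% of the tumor width/height as padding on each side.
    w, h = x1 - x0, y1 - y0
    left = int(random.uniform(0.1, 0.3) * w)
    right = int(random.uniform(0.1, 0.3) * w)
    top = int(random.uniform(0.1, 0.3) * h)
    bottom = int(random.uniform(0.1, 0.3) * h)
    H, W = image.shape[:2]
    return image[max(0, y0 - top):min(H, y1 + bottom),
                 max(0, x0 - left):min(W, x1 + right)]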

After collecting the ROIs, as shown in Figure 5.18, we have normal ROIs and
abnormal (tumor) ROIs for classification (using binary labels), and we have
the real tumor masks for segmentation.

[Figure 5.18 panels: normal ROI; abnormal (tumor) ROI with its true boundary; tumor mask.]

Figure 5.18: Examples of ROIs. The tumor mask is a binary image created
from the tumor ROI and the true boundary of the tumor area.

5.3.5 Experiments and Results

To train CNN classifiers on normal and abnormal ROIs and to generate
CAMs using the Grad-CAM algorithm, we used six CNN models: NASNetMobile [314],
MobileNetV2 [233], DenseNet121 [118], ResNet50V2 [102], Xception [43],
and InceptionV3 [266]. Except for cropping ROIs, the experiments were
implemented in code written in the Python language.
Our dataset has 325 abnormal (tumor) ROIs and 297 normal ROIs in
total. To train the CNN classifiers, we divide the dataset into 80% for training
and 20% for validation. The framework of deep learning models is Keras5 .
Every CNN model is trained for about 200 epochs with an EarlyStopping6 setting.
The classifier model with the best validation accuracy during each
training run was saved.
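A minimal Keras sketch of this training setup is shown below; the base model, optimizer, patience, and data handling are assumptions for illustration (x_train, y_train, x_val, y_val are placeholders for the ROI images and their binary labels), not the exact settings of our experiments:

import tensorflow as tf

base = tf.keras.applications.Xception(weights=None, include_top=False,
                                      input_shape=(224, 224, 3), pooling='avg')
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation='softmax')])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=20,
                                     restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('best_classifier.h5',
                                       monitor='val_accuracy', save_best_only=True),
]
# Train for up to 200 epochs and keep the model with the best validation accuracy.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, callbacks=callbacks)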
By inputting an abnormal ROI into the trained CNN classifier and applying
the Grad-CAM algorithm, we obtain a CAM for that ROI. Then, we resized the
CAM to the same size as the input ROI. The CAMs are gray-value images,
and the truth tumor masks are binary images. Thus, we applied
the mean-Dice metric to compare CAMs and tumor masks.
The Dice coefficient [60, 255] of two binary images A and B is:

Dice(A, B) = 2 · |A ∩ B| / (|A| + |B|)

Calculating the Dice coefficient requires both images to be binary; thus, we need
to transform the CAMs from gray-value ([0, 255]) to binary ({0, 1}). Suppose
B is the CAM; it can be binarized by setting a threshold t: B_t(B > t) = 1 and
B_t(B ≤ t) = 0, where B_t is the binarized CAM. Then, the mean-Dice metric is
defined as:

mean-Dice(A, B) = (1/256) · ∑_{t=0}^{255} Dice(A, B_t)    (5.10)
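A minimal NumPy sketch of the Dice and mean-Dice computations is given below (mask is the binary truth mask and cam the 8-bit gray-level CAM; the names are placeholders):

import numpy as np

def dice(a, b):
    # Dice coefficient of two binary arrays with values in {0, 1}.
    inter = np.sum(a * b)
    return 2.0 * inter / (np.sum(a) + np.sum(b) + 1e-8)

def mean_dice(mask, cam):
    # mean-Dice of Equation (5.10): average Dice over all 256 binarization thresholds.
    scores = [dice(mask, (cam > t).astype(np.uint8)) for t in range(256)]
    return float(np.mean(scores))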

We report the best validation accuracy (val_acc) and averaged mean-Dice


(Dice) of Experiment #1 (described in Section 5.3.3 and Figure 5.14) for each
5 https://keras.io/api/applications/
6 https://keras.io/api/callbacks/early_stopping/

Table 5.2: Result of Experiment #1. Descending sort by val_acc.

Classifier      val_acc   Dice
InceptionV3     0.872     0.256
DenseNet121 a   0.872     0.030
Xception        0.856     0.435
NASNetMobile    0.848     0.353
MobileNetV2 a   0.840     0.034
ResNet50V2      0.840     0.365

a These classifiers have very small Dice values.

CNN classifier in Table 5.2. The averaged mean-Dice is calculated using all
325 abnormal (tumor) ROIs. Figure 5.19 shows CAMs of one tumor ROI
generated by using trained CNN classifiers and Grad-CAM algorithm.
As shown in the results, the CAMs from Xception overlap the regions of
the true tumor masks the most, but the CAMs from DenseNet121 and MobileNetV2
almost do not cover the true tumor regions. As seen in Figure 5.19, the
heatmaps (CAMs) of these two classifiers highlight the corners and outer areas
of the images instead of the tumor regions. Although the CAMs from DenseNet121
and MobileNetV2 have very small Dice values against the true tumor areas,
the models still have good classification performance. Thus, the results lead to two
questions:

1. Can Grad-CAM really recognize the areas in the images that are important
to classification?

2. Do the predictions of CNN classifiers really depend on tumor areas?

Experiment #2 is proposed to examine the two questions.


In addition, we add the best validation accuracy of CNN classifiers trained
by the CAM-filtered ROIs (CAM_val_acc) and by truth-mask-filtered ROIs
(mask_val_acc), which are described in Section 5.3.3 and Figure 5.15, to
the results in Table 5.3. It is extended from Table 5.2 and sorted in descending
order by Dice. We also plot the Dice and CAM_val_acc values for the six CNN
classifiers in Figure 5.20.

[Figure 5.19 panels: tumor ROI and tumor mask (first row); CAMs from Xception, InceptionV3, DenseNet121, ResNet50V2, NASNetMobile, and MobileNetV2.]

Figure 5.19: Result of Experiment #1. The first row shows one of the
abnormal (tumor) ROIs and its truth mask. The other rows show the CAMs
of this ROI generated by the trained CNN classifiers using the Grad-CAM
algorithm.
As shown in Figure 5.20, in general, training on ROIs filtered by CAMs
covering more tumor areas (higher Dice values) leads to better classification
performance (CAM_val_acc). InceptionV3 model is an exception: its Dice is
smaller than NASNetMobile’s but it has a higher CAM_val_acc than NAS-
NetMobile. The reason may be that InceptionV3 has a better classification
capability than NASNetMobile because 1) InceptionV3 has about five times as
many parameters as NASNetMobile7, and 2) Table 5.3 shows that InceptionV3 has
higher val_acc and mask_val_acc than NASNetMobile.

Table 5.3: Result of Experiment #2. Descending sort by Dice.

Classifier     val_acc   Dice    CAM_val_acc   mask_val_acc
Xception       0.856     0.435   0.816         0.880
ResNet50V2     0.840     0.365   0.792         0.880
NASNetMobile   0.848     0.353   0.768         0.872
InceptionV3    0.872     0.256   0.776         0.904
MobileNetV2    0.840     0.034   0.736         0.872
DenseNet121    0.872     0.030   0.704         0.896
For all CNN classifiers, training on ROIs filtered by truth tumor masks
(Figure 5.21a) leads to the best classification performance (Table 5.3,
mask_val_acc), even better than training on original ROIs. Using truth-
mask-filtered ROIs is better than CAM-filtered ROIs simply because truth-
mask-filtered ROIs contain the whole tumor areas (Dice = 1). The reason
that using truth-mask-filtered ROIs is better than using original ROIs
might be that original ROIs contain some irrelevant information that interferes
with classification. This result indicates that the classification depends on
tumor areas more than on other areas. To confirm this conclusion, we trained
InceptionV3 on ROIs filtered by inverse truth tumor masks (Figure 5.21b).
The inverse-mask-filtered ROIs exactly exclude the tumor areas (Dice = 0).
The best validation accuracy of InceptionV3 trained by inverse-mask-filtered
ROIs is 0.616. Compared with the CAM_val_acc (0.776), val_acc (0.872),
and mask_val_acc (0.904) of InceptionV3, ROIs containing smaller tumor
areas lead to worse classification performance.
7 https://keras.io/api/applications/


Figure 5.20: Plots of Dice and CAM_val_acc for the six CNN classifiers in
Table 5.3.

5.3.6 Discussion

The results of Experiment #1 show that the best segmentation from
the Grad-CAM method is 0.435 (averaged mean-Dice), achieved by the Xception model.
Such results indicate that CAMs may not perform segmentation well for binary-
classification problems, because the highlighted area (hot region) on the
CAM is almost the same for all CAMs (Figure 5.22). The CNN classifiers
may select some fixed areas that are best for classification. Thus, these
classifiers achieve a good classification performance but their CAMs are not
optimal for segmentation. In addition, the requirements of segmentation
and classification are different: segmentation is based on local (pixel-wise)
decisions but classification is based on a global decision. The decision of
classification may depend on part of the target object and/or other parts
outside of the target object. Therefore, using only the CAMs from CNN classifiers
may not be an optimal approach for segmentation; instead, combining CAMs
with other segmentation methods can be a promising direction [191, 211].

Figure 5.21: Examples of truth-mask-filtered (a) and inverse-mask-filtered
(b) ROIs from the case shown in Figure 5.19.

Figure 5.22: Some tumor ROIs and their CAMs from Xception.

Experiment #2 verified that the predictions of CNN classifiers mainly


depend on tumor areas. The Grad-CAM algorithm can recognize some of the
areas (tumor regions) that are important to classification, but its performance
depends on the classifier models. As shown by DenseNet121 in Table 5.3 and Figure 5.19,
its CAMs have very small Dice values with true tumor areas but the model
still has a good classification performance. This implies that the dark
regions in CAMs also contribute to classification.

5.3.6.1 Future Work

This study may raise other questions and suggest starting points
for future studies to make progress in the understanding of deep learning.
Grad-CAM is not the only method to generate heatmaps that reflect
the basis of classification. In future work, we would test other techniques,
such as saliency maps [170], SHAP [296], and activation maps [275], to create
segmentations and compare them. We found that the dark regions in CAMs
from Grad-CAM also contribute to classification; thus, we wonder whether some
other techniques could overcome this drawback.
Since the performance of Grad-CAM depends on classifier models, we
would ask:

• What CNN architectures/types are good for segmentation by Grad-


CAM? And why?

• How do the bottom layers (fully-connected layers, layers after the last
convolutional layer) in CNN models affect the CAMs?

For weakly-supervised image segmentation [143, 240], CAMs can
be used as references and combined with other segmentation models, such as
U-Net CNNs [191, 211]. More work could be done to fuse CAMs with segmentation
models/methods to improve their performance.

5.4 Conclusion

In this chapter, we applied DL-based methods to medical image segmen-
tation. Through cross-validation and comparison with the ground-truth
images, the results demonstrate the capability of the Convolutional and
Deconvolutional Neural Network (C-DCNN) to learn essential features of
breast regions and delineate them in thermal images, showing that the
C-DCNN is a promising method to segment breast regions. Adding training
samples from the same patient can improve segmentation performance
from 0.83 to 0.94 based on our evaluation criterion (IoU). The outcomes
of training without the same patient are still acceptable. This fact reveals
that a variety of breast shapes in the training samples helps improve the
performance of the segmentation model.
In approaching the segmentation problem for mammograms using trained
Convolutional Neural Network (CNN) classifiers, we found the best segmen-
tation from the Grad-CAM algorithm to be 0.435 (averaged mean-Dice), by the
Xception model. The Grad-CAM algorithm can recognize some of the areas
(tumor regions) important to classification, but its performance depends on the
classifier models. This indicates that using only Grad-CAM applied to trained
two-class CNN classifiers may not be an optimal approach for segmentation;
instead, combining Grad-CAM with other segmentation methods could be a
promising direction. In addition, we have verified that the predictions of CNN
classifiers mainly depend on tumor areas, and that dark regions in Grad-CAM's
heatmaps also contribute to classification. We also proposed an evaluation
method for the heatmaps: re-train a CNN classifier using images filtered by
the heatmaps and examine its performance.

Chapter 6: Conclusions and Future Work

During my PhD studies and research, I have applied multiple ML and


DL methods to medical image analysis. With the development of these
applications, I paid more and more attention to the explainability
of these methods. The explainability of ML or DL not only brings many
interesting and challenging problems for research, but also has become a real
requirement of users in terms of reliability and trustworthiness. Data and
models are the two main subjects in machine learning and deep learning.
Models learn knowledge (patterns) from data. Thus, there are two aspects
in which we examine explainability: the complexity of the dataset and the
learnability of the learning model. The outcomes are highly dependent on
these two aspects. My contributions in the summarized projects regarding
complexity and learnability are shown in Table 6.1.
In the studies of hyper-spectral image-based cardiac ablation lesion
detection, I show that k-means, an approach that does not require a pri-
ori knowledge of tissue spectra, is an effective approach to detect lesions
from aHSI data. I have also demonstrated that the number of spectral
bands (which are referred to as “features”) can be reduced (by grouping
them) without significantly affecting lesion detection accuracy. To evaluate
clustering results including k-means, I create and use a novel CVI called
the Distance-based Separability Index (DSI) based on the data separability
measure. Results show DSI to be an effective, unique, and competitive CVI
to other compared CVIs.

Table 6.1: My contributions (citations in brackets) in the four summarized projects regarding the complexity and
learnability, which are the two important components of explainable machine learning (or XAI).

Project: Transparent Deep Learning/Machine Learning
  Complexity → Data: Create the Distance-based Separability Index (DSI) to measure the separability of datasets. [J1]
  Learnability → Models: Create the Decision Boundary Complexity (DBC) measure to analyze the generalizability of deep learning models [C5] and develop a theory to estimate the training accuracy for two-layer neural networks applied to random datasets, to understand the mechanisms of deep neural networks. [Section 4.6]

Project: Hyper-Spectral Image-Based Cardiac Ablation Lesion Detection
  Complexity → Data: Apply DSI as an effective internal Cluster Validity Index (CVI) to evaluate clusters. [C4]
  Learnability → Models: Apply k-means clustering to detect lesions from hyperspectral images and reduce the number of spectral bands (by grouping them) without significantly affecting detection accuracy. [J5]

Project: Applications of Transfer Learning and the Generative Adversarial Network in Breast Cancer Detection
  Complexity → Data: Create the Likeness Score (LS) (a variety of DSI) to evaluate the performances of GANs by directly analyzing their generated images without using a pre-trained classifier. [J2]
  Learnability → Models: Show that adding GAN-generated images makes the training of CNNs from scratch successful and improves CNNs' performances. [J3]

Project: Deep Learning-Based Medical Image Segmentation
  Complexity → Data: Test the reverse process to approach the segmentation problem for mammograms using pre-trained Convolutional Neural Network (CNN) classifiers because the complexity of medical images demands new approaches to segmentation. [C2]
  Learnability → Models: Demonstrate the capability of the Convolutional and Deconvolutional Neural Network (C-DCNN) to learn essential features of breast regions and delineate them in thermal images. [C10]
I have applied Deep Learning (DL)-based methods to detect breast cancer
from mammograms. Since training a Convolutional Neural Network (CNN)
from scratch is not feasible for a limited number of labeled mammographic
images, I show that using transfer learning in CNN is a promising solution for
breast cancer detection and the Generative Adversarial Network (GAN) can
be used as an image augmentation method for training and to improve the
performance of CNN classifiers. In terms of explainable DL, to further study
DL-models, I propose a novel GAN measure – Likeness Score (LS) – based
on the DSI to evaluate the images generated from GAN models, propose
the Decision Boundary Complexity (DBC) score to define and measure
the generalizability of the Deep Neural Network (DNN), and create a novel
theory based on space partitioning to estimate the approximate training
accuracy for two-layer neural networks. All were developed to reveal the
mechanism of DNN models. These studies may raise other questions and
suggest starting points for new ways for researchers to make progress on
studying the transparency of deep learning and explainable deep learning.
I have applied Deep Learning (DL)-based methods to medical image
segmentation. My studies demonstrate the capability of the Convolutional
and Deconvolutional Neural Network (C-DCNN) to learn essential features of
breast regions and delineate them in thermal images; further, the C-DCNN
can segment breast regions. Then, I test whether the heatmaps extracted
from trained classifiers (e.g., using Grad-CAM) could be applied to segment
the objects. Results indicate that the use of only Grad-CAM to train two-
class CNN classifiers may not be an optimal approach for segmentation;
instead, to combine Grad-CAM with other segmentation methods could be
a promising direction.
Based on the research presented in this dissertation, some future works

in medical image analysis could be:

• Instead of cancer and non-cancer classification, medical diagnosis


images could contain information about cancer stages; the information
could be also shown by the mapped vector. Thus, we consider that the
smoothly changing cancer stages could be shown by continuous vector
values. This is because the latent (manifold) space of a GAN implies
that we could build a mapping function from an image to a vector and
the vector could reflect some features of the image.

• More studies to evaluate the segmentation results are required. Specif-


ically, future work will review and develop methods for medical segmen-
tation evaluation and consider the problem of inter-observer variability
and fusion if there are multiple expert-segmented labels as ground
truth.

In addition, more questions related to the learnability of models must be


addressed. Some future work in eXplainable Artificial Intelligence (XAI)
could be:

• To further understand the mechanism of DNN models, the estimation


theory for two-layer neural networks could extend to multi-layer neural
networks.

• We could analyze the minimum bound of the Fully-Connected Neural


Network (FCNN) structure (the number of neurons and the number of
hidden layers) to solve a specified classification problem (e.g., the XOR
puzzles). It is a promising approach to measure the capability of one
type of neural network for a special classification requirement.

• The proposed data separability measure (DSI) can be applied to ana-


lyze and describe how the distribution of data changes after passing

through each hidden layer in neural networks. Data separability may
provide another perspective to understand how neural networks work.
In general, we want to understand what DL-models learn. For example,
for a CNN classifier, does it learn or extract the patterns for classifi-
cation from training data or just memorize the training data? For a
specific DL-model, we could try to define the learned patterns, find
evidence or ways to determine whether the DL-model learns patterns
from the training data or merely records the data, and study how the DL-model
keeps and reuses such information for recognition/classification.

As new methods from Machine Learning (ML) and Deep Learning (DL)
have been applied to medical image analysis for detection, classification, and
segmentation, there has been since 2016 a parallel interest in explainable
ML and DL (i.e., XAI): more and more research is being focused on this
issue. Although ML and DL have achieved notable results in the laboratory,
they have not been deployed significantly in the clinic because of the lack
of explainability. In addition to the technical issues in XAI being studied
by researchers and engineers, it is important, for the reasons of respon-
sibility and reliability, to involve physicians, regulators, and patients in
the principled approach to defining and realizing explainability for medical
applications.

List of Publications

Journal Articles
[J1] S. Guan and M. Loew, “A novel intrinsic measure of data separa-
bility”, Applied Intelligence, 2022, in press. doi:10.1007/s10489-
022-03395-6.

[J2] S. Guan and M. Loew, “A novel measure to evaluate generative


adversarial networks based on direct analysis of generated im-
ages”, Neural Computing and Applications, vol. 33, no. 20, pp.
13921–13936, 2021. doi:10.1007/s00521-021-06031-5.

[J3] S. Guan and M. Loew, “Breast cancer detection using synthetic


mammograms from generative adversarial networks in convolu-
tional neural networks”, Journal of Medical Imaging, vol. 6, no. 3,
pp. 031 411–031 411, Jul. 2019. doi: 10.1117/1.JMI.6.3.031411.

[J4] H. Asfour, S. Guan, N. Muselimyan, L. Swift, M. Loew, and N. Sar-


vazyan, “Optimization of wavelength selection for multispectral
image acquisition: A case study of atrial ablation lesions”, Biomed-
ical Optics Express, vol. 9, no. 5, pp. 2189–2204, May 2018. doi:
10.1364/BOE.9.002189.

[J5] S. Guan, H. Asfour, N. Sarvazyan, and M. Loew, “Application of


unsupervised learning to hyperspectral imaging of cardiac ablation
lesions”, Journal of Medical Imaging, vol. 5, no. 4, pp. 046 003–046
003, Oct. 2018. doi: 10.1117/1.JMI.5.4.046003.

Conference Papers

[C1] A. Lou, S. Guan, H. Ko, and M. Loew, “Caranet: Context axial re-
verse attention network for segmentation of small medical objects”,
in Medical Imaging 2022: Image Processing, International Society
for Optics and Photonics, vol. 12032, SPIE, 2022, pp. 81–92. doi:
10.1117/12.2611802.

[C2] S. Guan and M. Loew, “A sneak attack on segmentation of medical


images using deep neural network classifiers”, in 2021 IEEE Applied
Imagery Pattern Recognition Workshop (AIPR), 2021, in press.

[C3] A. Lou, S. Guan, and M. H. Loew, “DC-UNet: rethinking the U-Net
architecture with dual channel efficient CNN for medical image
segmentation”, in Medical Imaging 2021: Image Processing, Inter-
national Society for Optics and Photonics, vol. 11596, SPIE, 2021,
pp. 749–759. doi: 10.1117/12.2582338.

[C4] S. Guan and M. Loew, “An internal cluster validity index using a
distance-based separability measure”, in 2020 IEEE 32nd Interna-
tional Conference on Tools with Artificial Intelligence (ICTAI), 2020,
pp. 827–834. doi: 10.1109/ICTAI50040.2020.00131.

[C5] S. Guan and M. Loew, “Analysis of generalizability of deep


neural networks based on the complexity of decision bound-
ary”, in 2020 19th IEEE International Conference on Machine
Learning and Applications (ICMLA), 2020, pp. 101–106. doi:
10.1109/ICMLA51294.2020.00025.

[C6] S. Guan and M. Loew, “Understanding the ability of deep neural


networks to count connected components in images”, in 2020 IEEE
Applied Imagery Pattern Recognition Workshop (AIPR), 2020, pp.
1–7. doi: 10.1109/AIPR50011.2020.9425331.

[C7] S. Guan and M. Loew, “Evaluation of generative adversarial network


performance based on direct analysis of generated images”, in 2019
IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 2019,
pp. 1–5. doi: 10.1109/AIPR47015.2019.9174595.

[C8] S. Guan and M. Loew, “Using generative adversarial networks


and transfer learning for breast cancer detection by convolutional
neural networks”, in Medical Imaging 2019: Imaging Informatics
for Healthcare, Research, and Applications, vol. 10954, SPIE, 2019,
pp. 306–318. doi: 10.1117/12.2512671.

[C9] A. Lou, S. Guan, N. Kamona, and M. Loew, “Segmentation of


infrared breast images using MultiResUnet neural networks”, in
2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR),
2019, pp. 1–6. doi: 10.1109/AIPR47015.2019.9316541.

[C10] S. Guan, N. Kamona, and M. Loew, “Segmentation of thermal


breast images using convolutional and deconvolutional neural net-
works”, in 2018 IEEE Applied Imagery Pattern Recognition Workshop
(AIPR), 2018, pp. 1–7. doi: 10.1109/AIPR.2018.8707379.

[C11] S. Guan, M. Loew, H. Asfour, N. Sarvazyan, and N. Muselimyan,
“Lesion detection for cardiac ablation from auto-fluorescence hyper-
spectral images”, in Medical Imaging 2018: Biomedical Applications
in Molecular, Structural, and Functional Imaging, vol. 10578, SPIE,
2018, pp. 389–403. doi: 10.1117/12.2293652.

[C12] S. Guan and M. Loew, “Breast cancer detection using transfer


learning in convolutional neural networks”, in 2017 IEEE Applied
Imagery Pattern Recognition Workshop (AIPR), 2017, pp. 1–8. doi:
10.1109/AIPR.2017.8457948.

Bibliography

[1] Gartner top 10 strategic technology trends for 2020. source:


www.gartner.com.

[2] Global ai software market size 2018-2025. source: www.statista.com.

[3] scipy.stats.wasserstein_distance — SciPy v1.6.1 Reference Guide,


2021.

[4] Zighed Djamel A., Lallich Stéphane, and Muhlenbach Fabrice. Sep-
arability index in supervised learning. Lecture Notes in Computer
Science, pages 475–487. Springer Berlin Heidelberg, 2002.

[5] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon
Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,
Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiao-
qiang Zheng. Tensorflow: Large-scale machine learning on hetero-
geneous distributed systems. arXiv:1603.04467 [cs], 3 2016. arXiv:
1603.04467.

[6] Amina Adadi and Mohammed Berrada. Peeking Inside the Black-Box:
A Survey on Explainable Artificial Intelligence (XAI). IEEE Access,
6:52138–52160, 2018.

[7] Andreas Adolfsson, Margareta Ackerman, and Naomi C. Brownstein.


To cluster, or not to cluster: An analysis of clusterability methods.
Pattern Recognition, 88:13–26, 4 2019.

[8] Harry C. Andrews, William K. Pratt, and Kenneth Caspari. Computer


techniques in image processing, volume 2. Academic Press New York,
1970.

[9] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez,


and Iñigo Perona. An extensive comparative study of cluster validity
indices. Pattern Recognition, 46(1):243–256, 2013.

[10] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein


generative adversarial networks. In Doina Precup and Yee Whye Teh,

editors, Proceedings of the 34th International Conference on Machine
Learning, volume 70 of Proceedings of Machine Learning Research,
pages 214–223, International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.
[11] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein
gan. arXiv:1701.07875 [cs, stat], 1 2017. arXiv: 1701.07875.
[12] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang.
Fine-grained analysis of optimization and generalization for overpa-
rameterized two-layer neural networks. In The 36th International
Conference on Machine Learning (ICML), volume 97 of Proceedings of
Machine Learning Research, pages 322–332. PMLR, 2019.
[13] Aruna Arujuna, Rashed Karim, Dennis Caulfield, Benjamin Knowles,
Kawal Rhode, Tobias Schaeffter, Bernet Kato, C Aldo Rinaldi, Michael
Cooklin, Reza Razavi, et al. Acute pulmonary vein isolation is achieved
by a combination of reversible and irreversible atrial injury after
catheter ablation: evidence from magnetic resonance imaging. Circu-
lation: Arrhythmia and Electrophysiology, 5(4):691–700, 2012.
[14] Esmaeil Atashpaz-Gargari, Chao Sima, Ulisses M. Braga-Neto, and
Edward R. Dougherty. Relationship between the accuracy of classi-
fier error estimation and complexity of decision boundary. Pattern
Recognition, 46(5):1315–1322, 5 2013.
[15] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto
Maki, and Stefan Carlsson. From generic to specific deep representa-
tions for visual recognition. page 36–45, 2015.
[16] André Ricardo Backes, Dalcimar Casanova, and Odemir Martinez
Bruno. Color texture analysis based on fractal descriptors. Pattern
Recognition, 45(5):1984–1992, 5 2012.
[17] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep
convolutional encoder-decoder architecture for image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(12):2481–2495, 12 2017.
[18] Mihalj Bakator and Dragica Radosav. Deep Learning and Medical
Diagnosis: A Review of Literature. Multimodal Technologies and Inter-
action, 2(3):47, August 2018.
[19] Pierre Baldi. Autoencoders, unsupervised learning, and deep archi-
tectures. page 37–49, 2012.
[20] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and
Hayit Greenspan. Chest pathology detection using deep learning with
non-medical training. page 294–297. IEEE, 2015.

[21] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser,
Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia,
Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila,
and Francisco Herrera. Explainable artificial intelligence (xai): Con-
cepts, taxonomies, opportunities and challenges toward responsible
ai. Information Fusion, 58:82–115, 6 2020.

[22] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian


complexities: Risk bounds and structural results. Journal of Machine
Learning Research, 3(Nov):463–482, 2002.

[23] Shai Ben-David and Margareta Ackerman. Measures of clustering


quality: A working set of axioms for clustering. In D. Koller, D. Schu-
urmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Infor-
mation Processing Systems 21, page 121–128. Curran Associates, Inc.,
2009.

[24] Lei Bi, Jinman Kim, Ashnil Kumar, Dagan Feng, and Michael Fulham.
Synthesis of positron emission tomography (pet) images via multi-
channel generative adversarial networks (gans). In Molecular Imaging,
Reconstruction and Analysis of Moving Body Organs, and Stroke Imag-
ing and Treatment, Lecture Notes in Computer Science, pages 43–51.
Springer, Cham, 9 2017. DOI: 10.1007/978-3-319-67564-0_5.

[25] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K.


Warmuth. Occam’s razor. Information Processing Letters, 24(6):377–
380, 4 1987.

[26] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister.
Sliced and Radon Wasserstein Barycenters of Measures. Journal of
Mathematical Imaging and Vision, 51(1):22–45, January 2015.

[27] Tiago B. Borchartt, Aura Conci, Rita C. F. Lima, Roger Resmini, and
Angel Sanchez. Breast thermography from an image processing view-
point: A survey. SIGNAL PROCESSING, 93(10, SI):2785–2803, 10
2013.

[28] Wener Borges Sampaio, Edgar Moraes Diniz, Aristófanes Corrêa Silva,
Anselmo Cardoso de Paiva, and Marcelo Gattass. Detection of masses
in mammogram images using cnn, geostatistic functions and svm.
Computers in Biology and Medicine, 41(8):653–664, 8 2011.

[29] Ali Borji. Pros and cons of gan evaluation measures. Computer Vision
and Image Understanding, 179:41–65, 2 2019.

[30] Andre L. Brun, Alceu S. Britto, Luiz S. Oliveira, Fabricio Enembreck,


and Robert Sabourin. Contribution of data complexity features on dy-
namic classifier selection. pages 4396–4403, Vancouver, BC, Canada,

7 2016. 2016 International Joint Conference on Neural Networks
(IJCNN), IEEE.

[31] André L. Brun, Alceu S. Britto, Luiz S. Oliveira, Fabricio Enembreck,


and Robert Sabourin. A framework for dynamic classifier selection
oriented by the classification problem difficulty. Pattern Recognition,
76:175–190, 4 2018.

[32] Charles L. Byrne. The em algorithm: Theory, applications and related


methods. Lecture Notes, University of Massachusetts, 2017.

[33] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster
analysis. Communications in Statistics-theory and Methods, 3(1):1–27,
1974.

[34] Riccardo Cappato, Hugh Calkins, Shih-Ann A Chen, Wyn Davies,


Yoshito Iesaka, Jonathan Kalman, You-Ho H Kim, George Klein, An-
drea Natale, Douglas Packer, Allan Skanes, Federico Ambrogi, and
Elia Biganzoli. Updated worldwide survey on the methods, efficacy,
and safety of catheter ablation for human atrial fibrillation. Circulation.
Arrhythmia and electrophysiology, 3(1):32–8, 2 2010.

[35] Gustavo Carneiro and Jacinto C. Nascimento. Combining multiple


dynamic models and deep learning architectures for tracking the
left ventricle endocardium in ultrasound data. IEEE transactions on
pattern analysis and machine intelligence, 35(11):2592–2607, 2013.

[36] C. Casert, T. Vieijra, J. Nys, and J. Ryckebusch. Interpretable machine


learning for inferring the phase boundaries in a nonequilibrium sys-
tem. Physical Review E, 99(2):023304, 2 2019. publisher: American
Physical Society.

[37] David Charte, Francisco Charte, and Francisco Herrera. Reducing


Data Complexity using Autoencoders with Class-informed Loss Func-
tions. IEEE Transactions on Pattern Analysis and Machine Intelligence,
pages 1–1, 2021.

[38] Niladri S. Chatterji, Behnam Neyshabur, and Hanie Sedghi. The


intriguing role of module criticality in the generalization of deep net-
works. arXiv:1912.00528 [cs, stat], 2 2020. arXiv: 1912.00528.

[39] Tong Che, Yanran Li, Athul Jacob, Yoshua Bengio, and Wenjie Li.
Mode Regularized Generative Adversarial Networks. November 2016.

[40] Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang,
and Ming Yan. Practical accuracy estimation for efficient deep neural
network testing. ACM Trans. Softw. Eng. Methodol., 29(4), October
2020.

[41] SA Chen, MH Hsieh, CT Tai, CF Tsai, VS Prakash, WC Yu, TL Hsu,
YA Ding, and MS Chang. Initiation of atrial fibrillation by ectopic beats
originating from the pulmonary veins : Electrophysiological charac-
teristics, pharmacological responses, and effects of radiofrequency
ablation. Circulation, 100(18):1879–1886, 11 1999.

[42] Dongdong Cheng, Qingsheng Zhu, Jinlong Huang, Quanwang Wu,


and Lijun Yang. A novel cluster validity index based on local
cores. IEEE Transactions on Neural Networks and Learning Systems,
30(4):985–999, 4 2019. doi:10.1109/TNNLS.2018.2853710.

[43] Francois Chollet. Xception: Deep Learning with Depthwise Separable


Convolutions. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1800–1807, Honolulu, HI, July 2017. IEEE.

[44] Patrick Ferdinand Christ, Mohamed Ezzeldin A. Elshaer, Florian


Ettlinger, Sunil Tatavarty, Marc Bickel, Patrick Bilic, Markus Rempfler,
Marco Armbruster, Felix Hofmann, Melvin D’Anastasi, Wieland H.
Sommer, Seyed-Ahmad Ahmadi, and Bjoern H. Menze. Automatic
liver and lesion segmentation in ct using cascaded fully convolutional
neural networks and 3d conditional random fields, 2016.

[45] Maria J. M. Chuquicusma, Sarfaraz Hussein, Jeremy Burt, and Ulas


Bagci. How to fool radiologists with generative adversarial networks?
a visual turing test for lung cancer diagnosis. arXiv:1710.09762 [cs,
q-bio], 10 2017. arXiv: 1710.09762.

[46] Francesco Ciompi, Bartjan de Hoop, Sarah J. van Riel, Kaman Chung,
Ernst Th Scholten, Matthijs Oudkerk, Pim A. de Jong, Mathias Prokop,
and Bram van Ginneken. Automatic classification of pulmonary peri-
fissural nodules in computed tomography using an ensemble of 2d
views and a convolutional neural network out-of-the-box. Medical
image analysis, 26(1):195–202, 2015.

[47] Uri Cohen, SueYeon Chung, Daniel D. Lee, and Haim Sompolinsky.
Separability and geometry of object manifolds in deep neural networks.
Nature Communications, 11(1):746, December 2020.

[48] Dabboor, Stephen Howell, Shokr, and J.J. Yackel. The jef-
fries–matusita distance for the case of complex wishart distribution
as a separability criterion for fully polarimetric sar data. International
Journal of Remote Sensing, 35, 10 2014.

[49] Wei Dai, Joseph Doyle, Xiaodan Liang, Hao Zhang, Nanqing Dong,
Yuan Li, and Eric P. Xing. Scan: Structure correcting adversarial
network for organ segmentation in chest x-rays. arXiv:1703.08770
[cs], 3 2017. arXiv: 1703.08770.

[50] David L. Davies and Donald W. Bouldin. A cluster separation measure.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-
1(2):224–227, 4 1979. doi:10.1109/TPAMI.1979.4766909.

[51] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
Imagenet: A large-scale hierarchical image database. pages 248–255.
2009 IEEE Conference on Computer Vision and Pattern Recognition,
6 2009. ISSN: 1063-6919.

[52] Carol DeSantis, Rebecca Siegel, and Ahmedin Jemal. Breast cancer
facts & figures 2015-2016. page 44.

[53] Carol E. DeSantis, Stacey A. Fedewa, Ann Goding Sauer, Joan L.


Kramer, Robert A. Smith, and Ahmedin Jemal. Breast cancer statis-
tics, 2015: Convergence of incidence rates between black and white
women. CA: a cancer journal for clinicians, 66(1):31–42, February
2016.

[54] Carol E. DeSantis, Stacey A. Fedewa, Ann Goding Sauer, Joan L.


Kramer, Robert A. Smith, and Ahmedin Jemal. Breast cancer statis-
tics, 2015: Convergence of incidence rates between black and white
women. CA: a cancer journal for clinicians, 66(1):31–42, 2 2016. PMID:
26513636.

[55] Bernard Desgraupes. Clustering indices. University of Paris Ouest-Lab


Modal’X, 1:34, 2017.

[56] Sanjay Deshpande, John Catanzaro, and Samuel Wann. Atrial fibrilla-
tion: Prevalence and scope of the problem. Cardiac Electrophysiology
Clinics, 6(1):1–4, 3 2014. PMID: 27063816.

[57] Nameirakpam Dhanachandra and Yambem Jina Chanu. A survey on


image segmentation methods using clustering techniques. European
Journal of Engineering Research and Science, 2(1):15–20, 2017.

[58] Dua Dheeru and E. Karra Taniskidou. Uci machine learning repository.
2017.

[59] Neeraj Dhungel, Gustavo Carneiro, and Andrew P. Bradley. The


automated learning of deep features for breast mass classification from
mammograms. Lecture Notes in Computer Science, pages 106–114.
International Conference on Medical Image Computing and Computer-
Assisted Intervention, Springer, Cham, 10 2016.

[60] Lee R Dice. Measures of the amount of ecologic association between


species. Ecology, 26(3):297–302, 1945.

[61] F. T. de Dombal, D. J. Leaper, J. R. Staniland, A. P. McCann, and
Jane C. Horrocks. Computer-aided Diagnosis of Acute Abdominal
Pain. Br Med J, 2(5804):9–13, April 1972. Publisher: British Medical
Journal Publishing Group Section: Papers and Originals.
[62] Ngan Thi Dong and Megha Khosla. Revisiting Feature Selection with
Data Complexity. In 2020 IEEE 20th International Conference on
Bioinformatics and Bioengineering (BIBE), pages 211–216, 2020. ISSN:
2471-7819.
[63] Derek Doran, Sarah Schulz, and Tarek R. Besold. What does ex-
plainable ai really mean? a new conceptualization of perspectives.
arXiv:1710.00794 [cs], 10 2017. arXiv: 1710.00794.
[64] Finale Doshi-Velez and Been Kim. Towards a rigorous science of in-
terpretable machine learning. arXiv e-prints, 1702:arXiv:1702.08608,
2 2017.
[65] Timothy Dozat. Incorporating nesterov momentum into adam. 2
2016.
[66] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai.
Gradient descent finds global minima of deep neural networks. In The
36th International Conference on Machine Learning (ICML), volume 97
of Proceedings of Machine Learning Research, pages 1675–1685. PMLR,
2019.
[67] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Clas-
sification. John Wiley & Sons, November 2012. Google-Books-ID:
Br33IRC3PkQC.
[68] J. C. Dunn. Well-separated clusters and optimal fuzzy
partitions. Journal of Cybernetics, 4(1):95–104, 1 1974.
doi:10.1080/01969727408546059.
[69] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous
generalization bounds for deep (stochastic) neural networks with many
more parameters than training data. arXiv:1703.11008 [cs], 10 2017.
arXiv: 1703.11008.
[70] NG EDDIEY.-K. Segmentation of breast thermogram : Improved
boundary detection with modified snake algorithm. 2006.
[71] Fabian Eitel and Kerstin Ritter. Testing the Robustness of Attribution
Methods for Convolutional Neural Networks in MRI-Based Alzheimer’s
Disease Classification. In Interpretability of Machine Intelligence in
Medical Image Computing and Multimodal Learning for Clinical Decision
Support, Lecture Notes in Computer Science, pages 3–11, Cham, 2019.
Springer International Publishing.

[72] Frank Emmert-Streib, Olli Yli-Harja, and Matthias Dehmer. Explain-
able artificial intelligence and machine learning: A reality rooted
perspective. arXiv:2001.09464 [cs, stat], 1 2020. arXiv: 2001.09464.

[73] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Ben-


gio, and Pascal Vincent. The difficulty of training deep architectures
and the effect of unsupervised pre-training. 2009.

[74] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al.
A density-based algorithm for discovering clusters in large spatial
databases with noise. KDD, 96:226–231, 1996.

[75] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M.
Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level
classification of skin cancer with deep neural networks. Nature,
542(7639):115–118, 2 2017.

[76] Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati,


Bartosz Krawczyk, and Francisco Herrera. Data Intrinsic Characteris-
tics. In Learning from Imbalanced Data Sets, pages 253–277. Springer
International Publishing, Cham, 2018.

[77] A. Fourcade and R.H. Khonsari. Deep learning in medical image


analysis: A third eye for doctors. Journal of Stomatology, Oral and
Maxillofacial Surgery, 120(4):279–288, September 2019.

[78] Frank J. Massey, Jr. The Kolmogorov-Smirnov Test for Goodness of
Fit. Journal of the American Statistical Association, 46(253):68–78,
March 1951.

[79] Frank J. Massey, Jr. The Kolmogorov-Smirnov Test for Goodness of
Fit. Journal of the American Statistical Association, 46(253):68–78,
March 1951.

[80] Sarah M. Friedewald, Elizabeth A. Rafferty, Stephen L. Rose, Melissa A.


Durand, Donna M. Plecha, Julianne S. Greenberg, Mary K. Hayes,
Debra S. Copit, Kara L. Carlson, Thomas M. Cink, Lora D. Barke,
Linda N. Greer, Dave P. Miller, and Emily F. Conant. Breast cancer
screening using tomosynthesis in combination with digital mammog-
raphy. JAMA, 311(24):2499–2507, 6 2014.

[81] K. Ganesan, U. R. Acharya, C. K. Chua, L. C. Min, K. T. Abraham,


and K. H. Ng. Computer-aided breast cancer detection using mammo-
grams: A review. IEEE Reviews in Biomedical Engineering, 6:77–98,
2013.

[82] Shangqian Gao, Feihu Huang, Weidong Cai, and Heng Huang. Net-
work pruning via performance maximization. In Proceedings of the

IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 9270–9280, June 2021.
[83] Luis P. F. Garcia, Ana C. Lorena, Marcilio C. P. de Souto, and Tin Kam
Ho. Classifier recommendation using data complexity measures.
pages 874–879, Beijing, 8 2018. 2018 24th International Conference
on Pattern Recognition (ICPR), IEEE.
[84] Nathan Garcia, Frederico Tiggeman, Eduardo Borges, Giancarlo
Lucca, Helida Santos, and Graçaliz Dimuro. Exploring the Rela-
tionships between Data Complexity and Classification Diversity in
Ensembles. In Proceedings of the 23rd International Conference on En-
terprise Information Systems, pages 652–659. SCITEPRESS - Science
and Technology Publications, 2021.
[85] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge,
Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns
are biased towards texture; increasing shape bias improves accu-
racy and robustness. In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net, 2019.
[86] Daniel A Gil, Luther M Swift, Huda Asfour, Narine Muselimyan,
Marco A Mercader, and Narine A Sarvazyan. Autofluorescence hyper-
spectral imaging of radiofrequency ablation lesions in porcine cardiac
tissue. Journal of biophotonics, 10(8):1008–1017, 8 2017.
[87] B. van Ginneken, A. A. A. Setio, C. Jacobs, and F. Ciompi. Off-the-
shelf convolutional neural network features for pulmonary nodule
detection in computed tomography scans. pages 286–289. 2015 IEEE
12th International Symposium on Biomedical Imaging (ISBI), 4 2015.
[88] Rafael C Gonzalez, Richard E Woods, et al. Digital image processing,
2002.
[89] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural In-
formation Processing Systems 27, page 2672–2680. Curran Associates,
Inc., 2014.
[90] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural In-
formation Processing Systems 27, page 2672–2680. Curran Associates,
Inc., 2014.

[91] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard
Schölkopf, and Alexander Smola. A kernel two-sample test. Journal
of Machine Learning Research, 13(25):723–773, 2012.

[92] Shuyue Guan, Huda Asfour, Narine Sarvazyan, and Murray Loew.
Application of unsupervised learning to hyperspectral imaging of
cardiac ablation lesions. Journal of Medical Imaging, 5(4):046003, 12
2018. doi:10.1117/1.JMI.5.4.046003.

[93] John T. Guibas, Tejpal S. Virdi, and Peter S. Li. Synthetic medical
images from dual generative adversarial networks. arXiv:1709.01872
[cs], 9 2017. arXiv: 1709.01872.

[94] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin,


and Aaron C Courville. Improved training of wasserstein gans. In
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, page 5767–5777. Curran Associates, Inc.,
2017.

[95] David Gunning. Explainable artificial intelligence (xai). Defense


Advanced Research Projects Agency (DARPA), nd Web, 2, 2017.

[96] C. Guo, S. Mita, and D. McAllester. Robust road detection and tracking
in challenging scenarios based on markov random fields with unsu-
pervised learning. IEEE Transactions on Intelligent Transportation
Systems, 13(3):1338–1354, 9 2012.

[97] Philipp Hacker, Ralf Krestel, Stefan Grundmann, and Felix Naumann.
Explainable AI under contract and tort law: legal incentives and
technical challenges. Artificial Intelligence and Law, 28(4):415–439,
December 2020.

[98] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised


learning. In The Elements of Statistical Learning, Springer Series
in Statistics, pages 485–585. Springer, New York, NY, 2009. DOI:
10.1007/978-0-387-84858-7_14.

[99] Douglas M. Hawkins. The problem of overfitting. Journal of Chemical


Information and Computer Sciences, 44(1):1–12, 1 2004. publisher:
American Chemical Society.

[100] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. arXiv:1512.03385 [cs], 12
2015. arXiv: 1512.03385.

[101] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delv-
ing deep into rectifiers: Surpassing human-level performance on

imagenet classification. pages 1026–1034. Proceedings of the IEEE
International Conference on Computer Vision, 2015.

[102] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity
Mappings in Deep Residual Networks. arXiv:1603.05027 [cs], July
2016. ECCV 2016 camera-ready.

[103] Warren He, Bo Li, and Dawn Song. Decision boundary analysis of
adversarial examples. 2018.

[104] Michael Heath, Kevin Bowyer, Daniel Kopans, Richard Moore, and
W. Philip Kegelmeyer. The digital database for screening mammog-
raphy. In Proceedings of the 5th international workshop on digital
mammography, pages 212–218. Medical Physics Publishing, 2000.

[105] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard


Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Processing Systems 30, page
6626–6637. Curran Associates, Inc., 2017.

[106] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard


Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Processing Systems 30, page
6626–6637. Curran Associates, Inc., 2017.

[107] Avinash Hindupur. the-gan-zoo: A list of all named GANs! 3 2018.


original-date: 2017-04-14T16:45:24Z.

[108] Tin Kam Ho and M. Basu. Complexity measures of supervised classi-


fication problems. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24(3):289–300, 3 2002.

[109] Fred Hohman, Minsuk Kahng, Robert Pienta, and Duen Horng Chau.
Visual analytics in deep learning: An interrogative survey for the next
frontiers. IEEE Transactions on Visualization and Computer Graphics,
25(8):2674–2693, 8 2019. event: IEEE Transactions on Visualization
and Computer Graphics.

[110] Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. How
generative adversarial networks and its variants work: An overview of
gan. arXiv:1711.05914 [cs], 11 2017. arXiv: 1711.05914.

[111] Shin Hoo-Chang, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu,
Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Sum-
mers. Deep convolutional neural networks for computer-aided detec-
tion: Cnn architectures, dataset characteristics and transfer learn-
ing. IEEE transactions on medical imaging, 35(5):1285–1298, 5 2016.
doi:10.1109/TMI.2016.2528162.

[112] Shin Hoo-Chang, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Is-
abella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers.
Deep convolutional neural networks for computer-aided detection:
Cnn architectures, dataset characteristics and transfer learning. IEEE
transactions on medical imaging, 35(5):1285–1298, 5 2016. PMID:
26886976 PMCID: PMC4890616.

[113] X. Hou and L. Zhang. Saliency detection: A spectral residual approach.


pages 1–8. 2007 IEEE Conference on Computer Vision and Pattern
Recognition, 6 2007.

[114] Lianyu Hu and Caiming Zhong. An internal validity index based on


density-involved distance. IEEE Access, 7:40038–40051, 2019.

[115] Meng Hu, Eric C. C. Tsang, Yanting Guo, and Weihua Xu. Fast
and Robust Attribute Reduction Based on the Separability in Fuzzy
Decision Systems. IEEE Transactions on Cybernetics, pages 1–14,
2021.

[116] Xia Hu, Lingyang Chu, Jian Pei, Weiqing Liu, and Jiang Bian. Model
Complexity of Deep Learning: A Survey. August 2021. arXiv:
2103.05127.

[117] Yipeng Hu, Eli Gibson, Li-Lin Lee, Weidi Xie, Dean C. Barratt, Tom
Vercauteren, and J. Alison Noble. Freehand ultrasound image simu-
lation with spatially-conditioned generative adversarial networks. In
Molecular Imaging, Reconstruction and Analysis of Moving Body Or-
gans, and Stroke Imaging and Treatment, Lecture Notes in Computer
Science, pages 105–115. Springer, Cham, 9 2017. DOI: 10.1007/978-
3-319-67564-0_11.

[118] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Wein-
berger. Densely Connected Convolutional Networks. arXiv:1608.06993
[cs], January 2018. CVPR 2017.

[119] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memi-
sevic. Generating images with recurrent adversarial networks. arXiv
preprint arXiv:1602.05110, 2016.

[120] Sergey Ioffe and Christian Szegedy. Batch normalization: Accel-
erating deep network training by reducing internal covariate shift.
arXiv:1502.03167 [cs], 2 2015. arXiv: 1502.03167.

[121] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-
image translation with conditional adversarial networks. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition,
pages 1125–1134, 2017.

[122] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and


A. C. I. Malossi. TAPAS: Train-Less Accuracy Predictor for Architecture
Search. Proceedings of the AAAI Conference on Artificial Intelligence,
33:3927–3934, July 2019.

[123] A. K. Jain, M. N. Murty, and P. J. Flynn. Data cluster-


ing: a review. Association for Computing Machinery, 9 1999.
doi:10.1145/331499.331504.

[124] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
An introduction to statistical learning, volume 112. Springer, 2013.

[125] Andrew R. Jamieson, Karen Drukker, and Maryellen L. Giger. Breast


image feature learning with adaptive deconvolutional networks. vol-
ume 8315, page 831506. International Society for Optics and Photon-
ics, February 2012.

[126] Li Jiang, Wang Zhan, and Murray H. Loew. Modeling static and
dynamic thermography of the human breast under elastic deforma-
tion. Physics in Medicine and Biology, 56(1):187–202, 1 2011. PMID:
21149948.

[127] Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio.
Predicting the generalization gap in deep networks with margin distri-
butions. In 7th International Conference on Learning Representations.
ICLR, 2019.

[128] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan,


and Samy Bengio. Fantastic generalization measures and where to
find them. arXiv:1912.02178 [cs, stat], 12 2019. arXiv: 1912.02178.

[129] Zhicheng Jiao, Xinbo Gao, Ying Wang, and Jie Li. A deep feature
based framework for breast masses classification. Neurocomputing,
197:221–231, 7 2016.

[130] LI Jie, Xue Yaxu, and Yu Yadong. Incremental Learning Algorithm of


Data Complexity Based on KNN Classifier. In 2020 International Sym-
posium on Community-centric Systems (CcS), pages 1–4, September
2020.

[131] Pragati Kapoor, S.V.A.V. Prasad, and Seema Patni. Image segmen-
tation and asymmetry analysis of breast thermograms for tumor
detection. International Journal of Computer Applications, 50(9):40–45,
7 2012.

[132] Md Rezaul Karim, Oya Beyan, Achille Zappa, Ivan G. Costa, Diet-
rich Rebholz-Schuhmann, Michael Cochez, and Stefan Decker. Deep
learning-based clustering approaches for bioinformatics. Briefings in
Bioinformatics, 2020.

[133] Hamid Karimi, Tyler Derr, and Jiliang Tang. Characterizing the deci-
sion boundary of deep neural networks. arXiv:1912.11460 [cs, stat],
6 2020. arXiv: 1912.11460.

[134] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail


Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep
learning: Generalization gap and sharp minima. arXiv:1609.04836
[cs, math], 2 2017. arXiv: 1609.04836.

[135] Salabat Khan, Muhammad Hussain, Hatim Aboalsamh, and George


Bebis. A comparison of different gabor feature extraction approaches
for mass classification in mammography. Multimedia Tools and Appli-
cations, 76(1):33–57, 1 2017.

[136] Salabat Khan, Muhammad Hussain, Hatim Aboalsamh, Hassan Math-


kour, George Bebis, and Mohammed Zakariah. Optimized gabor fea-
tures for mass classification in mammography. Applied Soft Computing,
44:267–280, 7 2016.

[137] Valentin Khrulkov and Ivan Oseledets. Geometry score: A method for
comparing generative adversarial networks. In International Confer-
ence on Machine Learning, pages 2621–2629. PMLR, 2018.

[138] J. Kim, D. Han, Y. W. Tai, and J. Kim. Salient region detection via high-
dimensional color transform. pages 883–890. 2014 IEEE Conference
on Computer Vision and Pattern Recognition, 6 2014.

[139] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv:1412.6980 [cs], 12 2014. arXiv: 1412.6980.

[140] Peter M Kistler, Kim Rajappan, Mohammed Jahngir, Mark J Earley,


Stuart Harris, Dominic Abrams, Dhiraj Gupta, Reginald Liew, Stephen
Ellis, Simon C Sporton, and Richard J Schilling. The impact of
ct image integration into an electroanatomic mapping system on
clinical outcomes of catheter ablation of atrial fibrillation. Journal of
cardiovascular electrophysiology, 17(10):1093–101, 10 2006.

[141] Jon M. Kleinberg. An impossibility theorem for clustering. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information
Processing Systems 15, page 463–470. MIT Press, 2003.

[142] Simon Kohl, David Bonekamp, Heinz-Peter Schlemmer, Kaneschka


Yaqubi, Markus Hohenfellner, Boris Hadaschik, Jan-Philipp Radtke,
and Klaus Maier-Hein. Adversarial networks for the detection of
aggressive prostate cancer. arXiv:1702.08014 [cs], 2 2017. arXiv:
1702.08014.

[143] Alexander Kolesnikov and Christoph H. Lampert. Seed, Expand and


Constrain: Three Principles for Weakly-Supervised Image Segmenta-
tion. In Computer Vision – ECCV 2016, Lecture Notes in Computer
Science, pages 695–711, Cham, 2016. Springer International Pub-
lishing.

[144] Jacob Koruth, Shigeki Kusa, Srinivas Dukkipati, Petr Neuzil, Ter-
rance Ransbury, KC Armstrong, Larson Larson, Cinnamon Bowen,
Omar Amirana, Marco Mercader, Narine A Sarvazyan, Matthew W
Kay, and Vivek Y Reddy. Direct assessment of catheter-tissue contact
and rf lesion formation: a novel approach using endogenous nadh
fluorescence. Heart Rhythm, page S111, 2015.

[145] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of


features from tiny images. 2009.

[146] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet


classification with deep convolutional neural networks. In F. Pereira,
C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in
Neural Information Processing Systems 25, page 1097–1105. Curran
Associates, Inc., 2012.

[147] S. Kullback and R. A. Leibler. On information and sufficiency. The


Annals of Mathematical Statistics, 22(1):79–86, 1951.

[148] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images,
speech, and time series. The handbook of brain theory and neural
networks, 3361(10):1995, 1995.

[149] Yann LeCun, Corinna Cortes, and C. J. Burges. Mnist handwritten


digit database, 2010.

[150] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, An-
drew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani,
Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic sin-
gle image super-resolution using a generative adversarial network.
arXiv:1609.04802 [cs, stat], 9 2016. arXiv: 1609.04802.

[151] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient
sparse coding algorithms. In B. Schölkopf, J. C. Platt, and T. Hoffman,
editors, Advances in Neural Information Processing Systems 19, page
801–808. MIT Press, 2007.

[152] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses.


Springer Science & Business Media, 2006.

[153] Cheng Li and Bingyu Wang. Fisher linear discriminant analysis. 2014.

[154] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, and
Shuicheng Yan. Perceptual generative adversarial networks for small
object detection. pages 1951–1959, Honolulu, HI, 7 2017. 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
IEEE.

[155] Rongjian Li, Wenlu Zhang, Heung-Il Suk, Li Wang, Jiang Li, Dinggang
Shen, and Shuiwang Ji. Deep learning based imaging data completion
for improved brain disease diagnosis. page 305–312. Springer, 2014.

[156] Yu Li, Lizhong Ding, and Xin Gao. On the decision boundary of deep
neural networks. arXiv:1808.05385 [cs], 1 2019. arXiv: 1808.05385.

[157] Zhongyu Li, Xiaofan Zhang, Henning Müller, and Shaoting Zhang.
Large-scale retrieval for medical image analytics: A comprehensive
review. Medical Image Analysis, 43:66–84, January 2018.

[158] Zachary C. Lipton. The mythos of model interpretability. Communica-


tions of the ACM, 61(10):36–43, 9 2018.

[159] Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, and Huayan Wang.
Understanding Why Neural Networks Generalize Well Through GSNR
of Parameters. In International Conference on Learning Representations,
ICLR, 2020.

[160] Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, and Huayan
Wang. Understanding why neural networks generalize well through
gsnr of parameters. 2020.

[161] Xiangbin Liu, Liping Song, Shuai Liu, and Yudong Zhang. A review of
deep-learning-based medical image segmentation methods. Sustain-
ability, 13(3), 2021.

[162] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and
Sen Wu. Understanding and enhancement of internal clustering vali-
dation measures. IEEE Transactions on Cybernetics, 43(3):982–994,
6 2013. doi:10.1109/TSMCB.2012.2220543.

[163] Shih-Chung B. Lo, Heang-Ping Chan, Jyh-Shyan Lin, Huai Li,
Matthew T. Freedman, and Seong K. Mun. Artificial convolution
neural network for medical image pattern recognition. Neural Net-
works, 8(7–8):1201–1214, 1995.

[164] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample


tests. 2017.

[165] Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto,


and Tin Kam Ho. How complex is your classification problem?: A
survey on measuring classification complexity. ACM Comput. Surv.,
52(5):107:1–107:34, 9 2019.

[166] Ange Lou, Shuyue Guan, Nada Kamona, and Murray Loew. Segmen-
tation of infrared breast images using multiresunet neural networks.
In 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR),
pages 1–6, 2019.

[167] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray Loew. Caranet:
Context axial reverse attention network for segmentation of small
medical objects. arXiv preprint arXiv:2108.07368, 2021.

[168] Ange Lou, Shuyue Guan, and Murray Loew. Cfpnet-m: A light-weight
encoder-decoder based network for multimodal biomedical image real-
time segmentation. arXiv preprint arXiv:2105.04075, 2021.

[169] Ange Lou, Shuyue Guan, and Murray H Loew. Dc-unet: rethinking
the u-net architecture with dual channel efficient cnn for medical
image segmentation. In Medical Imaging 2021: Image Processing,
volume 11596, page 115962T. International Society for Optics and
Photonics, 2021.

[170] Daniel Lévy and Arzav Jain. Breast Mass Classification from Mammo-
grams using Deep Convolutional Neural Networks. arXiv:1612.00542
[cs], December 2016. arXiv: 1612.00542.

[171] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using
t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[172] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and
Stephen Paul Smolley. Least squares generative adversarial networks.
pages 2813–2821. 2017 IEEE International Conference on Computer
Vision (ICCV), 10 2017. ISSN: 2380-7504.

[173] Yuliya Marchetti, Hai Nguyen, Amy Braverman, and Noel Cressie.
Spatial data compression via adaptive dispersion clustering. Compu-
tational Statistics & Data Analysis, 117:138–153, 2018.

[174] Morteza Mardani, Enhao Gong, Joseph Y. Cheng, Shreyas Vasanawala,
Greg Zaharchuk, Marcus Alley, Neil Thakur, Song Han, William Dally,
John M. Pauly, and Lei Xing. Deep generative adversarial networks
for compressed sensing automates mri. arXiv:1706.00051 [cs, stat], 5
2017. arXiv: 1706.00051.

[175] David Martens, Jan Vanthienen, Wouter Verbeke, and Bart Baesens.
Performance of classification models from a user perspective. Decision
Support Systems, 51(4):782–793, 11 2011.

[176] U. Maulik and S. Bandyopadhyay. Performance evaluation of some


clustering algorithms and validity indices. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(12):1650–1654, 12 2002.
doi:10.1109/TPAMI.2002.1114856.

[177] Martina Melinščak, Pavle Prentašić, and Sven Lončarić. Retinal ves-
sel segmentation using deep neural networks. VISAPP 2015 (10th
International Conference on Computer Vision Theory and Applications),
Proceedings, Vol.1, page 577, 5 2015.

[178] Marco Mercader, Armstrong Kc, Terry Ransbury, Vivek Y. Reddy, Ja-
cob Koruth, Cinnamon Larsen, James Bowen, Narine A Sarvazyan,
and Omar Amirana. Optical tissue interrogation catheter that provides
real-time monitoring of catheter-tissue contact and rf lesion progres-
sion using nadh fluorescence. EP Europace, 18(suppl_1):i27–i27, 6
2016.

[179] Amit Kumar Mishra. Separability indices and their use in radar signal
based target recognition. IEICE Electronics Express, 6(14):1000–1005,
2009.

[180] Brent Mittelstadt, Chris Russell, and Sandra Wachter. Explaining


explanations in ai. FAT* ’19, page 279–288, Atlanta, GA, USA, 1 2019.
Association for Computing Machinery.

[181] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi


Yoshida. Spectral normalization for generative adversarial networks.
2018.

[182] Pim Moeskops, Jeroen de Bresser, Hugo J. Kuijf, Adriënne M. Mendrik,


Geert Jan Biessels, Josien P. W. Pluim, and Ivana Išgum. Evaluation
of a deep learning approach for the segmentation of brain tissues
and white matter hyperintensities of presumed vascular origin in mri.
NeuroImage: Clinical, 17:251–262, 1 2018.

[183] Linda Mthembu and Tshilidzi Marwala. A note on the separability


index. 1 2009.

[184] Narine Muselimyan, Al Mohammed Jishi, Huda Asfour, Luther Swift,
and Narine A. Sarvazyan. Anatomical and optical properties of atrial
tissue: Search for a suitable animal model. Cardiovascular Engineer-
ing and Technology, 8(4):505–514, 2017.

[185] Narine Muselimyan, Luther M Swift, Huda Asfour, Tigran Chahbazian,


Ramesh Mazhari, Marco Mercader, and Narine A Sarvazyan. Seeing
the invisible: Revealing atrial ablation lesions using hyperspectral
imaging approach. PloS one, 11(12):e0167760, 12 2016.

[186] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve
restricted boltzmann machines. page 807–814, 2010.

[187] Fabián Narváez, Jorge Alvarez, Juan D. Garcia-Arteaga, Jonathan


Tarquino, and Eduardo Romero. Characterizing architectural distor-
tion in mammograms by linear saliency. Journal of Medical Systems,
41(2):26, 2 2017.

[188] Olfa Nasraoui and Chiheb-Eddine Ben N’Cir. Clustering Methods for
Big Data Analytics. Springer, 2019.

[189] Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and


Nati Srebro. Exploring generalization in deep learning. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, editors, Advances in Neural Information Processing
Systems 30, page 5947–5956. Curran Associates, Inc., 2017.

[190] E. Y. K. Ng. A review of thermography as promising non-invasive


detection modality for breast tumor. International Journal of Thermal
Sciences, 48(5):849–859, May 2009.

[191] Huu-Giao Nguyen, Alessia Pica, Jan Hrbacek, Damien C. Weber,


Francesco La Rosa, Ann Schalenbourg, Raphael Sznitman, and
Meritxell Bach Cuadra. A novel segmentation framework for uveal
melanoma in magnetic resonance imaging based on class activation
maps. In Proceedings of The 2nd International Conference on Medical
Imaging with Deep Learning, pages 370–379. PMLR, May 2019. ISSN:
2640-3498.

[192] Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian
Wang, and Dinggang Shen. Medical image synthesis with context-
aware generative adversarial networks. Lecture Notes in Computer
Science, pages 417–425. International Conference on Medical Image
Computing and Computer-Assisted Intervention, Springer, Cham, 9
2017.

[193] S Nishikawa, Y Nojima, and H Ishibuchi. Appropriate granularity


specification for fuzzy classifier design by data complexity measures.

pages 691–696, Fukuoka, 12 2010. 2010 Second World Congress on
Nature and Biologically Inspired Computing (NaBIC 2010), IEEE.

[194] R. Nithya and B. Santhi. Classification of normal and abnormal


patterns in digital mammograms for diagnosis of breast cancer. Inter-
national Journal of Computer Applications, 28(6):21–25, 2011.

[195] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning de-
convolution network for semantic segmentation. pages 1520–1528.
Proceedings of the IEEE International Conference on Computer Vision,
2015.

[196] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training
generative neural samplers using variational divergence minimiza-
tion. In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pages 271–279, 2016.

[197] Hakan Oral, Bradley P. Knight, Mehmet Ozaydin, Hiroshi Tada, Aman
Chugh, Sohail Hassan, Christoph Scharf, Steve W. K. Lai, Radmira
Greenstein, Frank Pelosi, S. Adam Strickberger, and Fred Morady.
Clinical significance of early recurrences of atrial fibrillation after pul-
monary vein isolation. Journal of the American College of Cardiology,
40(1):100–104, 7 2002. PMID: 12103262.

[198] Feifan Ouyang, Roland Tilz, Julian Chun, Boris Schmidt, Erik Wissner,
Thomas Zerm, Kars Neven, Bulent Köktürk, Melanie Konstantinidou,
Andreas Metzner, Alexander Fuernkranz, and Karl-Heinz Kuck. Long-
term results of catheter ablation in paroxysmal atrial fibrillation:
lessons from a 5-year follow-up. Circulation, 122(23):2368–2377, 12
2010. PMID: 21098450.

[199] Yuehao Pan, Weimin Huang, Zhiping Lin, Wanzheng Zhu, Jiayin Zhou,
Jocelyn Wong, and Zhongxiang Ding. Brain tumor grading based on
neural networks and convolutional neural networks. page 699–702.
IEEE, 2015.

[200] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den
Hengel. Deep learning for anomaly detection: A review. ACM Comput.
Surv., 54(2), March 2021.

[201] Razvan Pascanu, Guido Montúfar, and Yoshua Bengio. On the number
of inference regions of deep feed forward networks with piece-wise
linear activations. In The 2nd International Conference on Learning
Representations (ICLR), Conference Track Proceedings, 2014.

[202] Otávio AB Penatti, Keiller Nogueira, and Jefersson A. dos Santos. Do


deep features generalize from everyday objects to remote sensing and
aerial scenes domains? page 44–51, 2015.

[203] Philip Perconti and Murray H. Loew. Salience measure for assessing
scale-based features in mammograms. Journal of the Optical Society
of America. A, Optics, Image Science, and Vision, 24(12):B81–90, 12
2007. PMID: 18059917.

[204] Anna Dagmar Peterson. A Separability Index for Clustering and Classi-
fication Problems with Applications to Cluster Merging and Systematic
Evaluation of Clustering Algorithms. PhD thesis, Ames, IA, USA, 2011.

[205] Darius Pfitzner, Richard Leibbrandt, and David M. W. Powers. Charac-


terization and evaluation of similarity measures for pairs of clusterings.
Knowledge and Information Systems, 2008. doi:10.1007/s10115-008-
0150-6.

[206] Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V.
Le. Meta Pseudo Labels. arXiv:2003.10580 [cs, stat], March 2021.
arXiv: 2003.10580.

[207] Nicolas Pinto, David D. Cox, and James J. DiCarlo. Why is real-world
visual object recognition hard? PLOS Computational Biology, 4(1):e27,
1 2008.

[208] Hairong Qi, Wesley Snyder, Jonathan F. Head, and Robert L. Elliott.
Detecting breast cancer from infrared images by asymmetry analysis.
volume 2, pages 1227–1228 vol.2, 2 2000.

[209] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised rep-
resentation learning with deep convolutional generative adversarial
networks. 2016.

[210] U. Raghavendra, U. Rajendra Acharya, Hamido Fujita, Anjan Gudigar,


Jen Hong Tan, and Shreesha Chokkadi. Application of gabor wavelet
and locality sensitive discriminant analysis for automated identifica-
tion of breast cancer using digitized mammogram images. Applied
Soft Computing, 46:151–161, 9 2016.

[211] Sajith Rajapaksa and Farzad Khalvati. Localized Perturbations


For Weakly-Supervised Segmentation of Glioma Brain Tumours.
arXiv:2111.14953 [cs, eess], November 2021. arXiv: 2111.14953.

[212] Aaditya Ramdas, Nicolas Garcia Trillos, and Marco Cuturi. On wasser-
stein two-sample testing and related families of nonparametric tests.
Entropy, 19(2):47, 2017.

[213] Ravi Ranjan, Eugene G Kholmovski, Joshua Blauer, Sathya Vijayaku-


mar, Nelly A Volland, Mohamed E Salama, Dennis L Parker, Rob
MacLeod, and Nassir F Marrouche. Identification and acute targeting

of gaps in atrial ablation lesion sets using a real-time magnetic reso-
nance imaging system. Circulation. Arrhythmia and electrophysiology,
5(6):1130–5, 12 2012.

[214] M. Ranzato, F. J. Huang, Y. L. Boureau, and Y. LeCun. Unsupervised


learning of invariant feature hierarchies with applications to object
recognition. pages 1–8. 2007 IEEE Conference on Computer Vision
and Pattern Recognition, 6 2007.

[215] Vijay M. Rao, David C. Levin, Laurence Parker, Barbara Cavanaugh,


Andrea J. Frangos, and Jonathan H. Sunshine. How widely is
computer-aided detection used in screening and diagnostic mammog-
raphy? Journal of the American College of Radiology, 7(10):802–805,
10 2010.

[216] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dun-


nmon, and Christopher Ré. Learning to compose domain-specific
transformations for data augmentation. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Processing Systems 30, page
3239–3249. Curran Associates, Inc., 2017.

[217] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-
cnn: Towards real-time object detection with region proposal networks.
arXiv:1506.01497 [cs], 6 2015. arXiv: 1506.01497.

[218] Mina Rezaei, Konstantin Harmuth, Willi Gierke, Thomas Keller-


meier, Martin Fischer, Haojin Yang, and Christoph Meinel. Condi-
tional adversarial network for semantic segmentation of brain tumor.
arXiv:1708.05227 [cs], 8 2017. arXiv: 1708.05227.

[219] David J Rogers and Taffee T Tanimoto. A computer program for


classifying plants: The computer is programmed to simulate the
taxonomic process of comparing each case with every other case.
Science, 132(3434):1115–1118, 1960.

[220] Richard J. Roiger. Data mining: a tutorial-based primer. CRC press,


2017.

[221] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. Lecture Notes in
Computer Science, pages 234–241. Springer International Publishing,
2015.

[222] Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke.
Explainable machine learning for scientific insights and discoveries.
IEEE Access, 8:42200–42216, 2020. event: IEEE Access.

[223] Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke.
Explainable Machine Learning for Scientific Insights and Discoveries.
IEEE Access, 8:42200–42216, 2020. Conference Name: IEEE Access.

[224] Azriel Rosenfeld. Picture Processing by Computer. ACM Computing


Surveys, 1(3):147–176, September 1969.

[225] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation


and validation of cluster analysis. Journal of computational and applied
mathematics, 20:53–65, 1987.

[226] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
and Michael Bernstein. Imagenet large scale visual recognition chal-
lenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[227] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.

[228] John Rust. Using randomization to break the curse of dimensionality.


Econometrica: Journal of the Econometric Society, pages 487–516,
1997.

[229] Ludger Rüschendorf. The wasserstein distance and approximation


theorems. Probability Theory and Related Fields, 70(1):117–129, 3
1985.

[230] Ludger Rüschendorf. The Wasserstein distance and approximation


theorems. Probability Theory and Related Fields, 70(1):117–129,
March 1985.

[231] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, Xi Chen, and Xi Chen. Improved techniques for training gans.
In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems 29, page
2234–2242. Curran Associates, Inc., 2016.

[232] Jorge Sanchez and Florent Perronnin. High-dimensional signature


compression for large-scale image classification. In CVPR 2011, pages
1665–1672, June 2011. ISSN: 1063-6919.

[233] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,


and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear
Bottlenecks. arXiv:1801.04381 [cs], March 2019. arXiv: 1801.04381.

[234] Jorge M. Santos and Mark Embrechts. On the use of the adjusted rand
index as a metric for evaluating supervised classification. Lecture
Notes in Computer Science, page 175–184, Berlin, Heidelberg, 2009.
Springer.
[235] Shibani Santurkar, Ludwig Schmidt, and Aleksander Madry. A
classification-based study of covariate shift in gan distributions. In In-
ternational Conference on Machine Learning, pages 4480–4489. PMLR,
2018.
[236] Saeed Sarbazi-Azad, Mohammad Saniee Abadeh, and Mohammad Er-
fan Mowlaei. Using data complexity measures and an evolutionary
cultural algorithm for gene selection in microarray data. Soft Comput-
ing Letters, 3:100007, 2021.
[237] N. Scales, C. Herry, and M. Frize. Automated image segmentation for
breast analysis using infrared images. volume 1, pages 1737–1740.
The 26th Annual International Conference of the IEEE Engineering
in Medicine and Biology Society, 9 2004.
[238] Achim Schilling, Andreas Maier, Richard Gerum, Claus Metzner, and
Patrick Krauss. Quantifying the separability of data classes in neural
networks. Neural Networks, 139:278–293, 2021.
[239] Thomas Schlegl, Joachim Ofner, and Georg Langs. Unsupervised
pre-training across image domains improves lung tissue classification.
page 82–93. Springer, 2014.
[240] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakr-
ishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual
explanations from deep networks via gradient-based localization. In
Proceedings of the IEEE international conference on computer vision,
pages 618–626, 2017.
[241] Mehmet Sezgin and Bülent Sankur. Survey over image threshold-
ing techniques and quantitative performance evaluation. Journal of
Electronic imaging, 13(1):146–165, 2004.
[242] C. E. Shannon. A mathematical theory of communication. The Bell
System Technical Journal, 27(3):379–423, 1948.
[243] Zhihui Shao, Jianyi Yang, and Shaolei Ren. Increasing the trust-
worthiness of deep neural networks via accuracy monitoring. In
Proceedings of the Workshop on Artificial Intelligence Safety, volume
2640 of CEUR Workshop Proceedings. CEUR-WS.org, 2020.
[244] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan
Carlsson. Cnn features off-the-shelf: an astounding baseline for
recognition. page 806–813, 2014.

[245] Anmol Sharma. DDSM Utility. GitHub, 2015.
[246] Anmol Sharma. Ddsm utility. https://github.com/trane293/
DDSMUtility, 2015.
[247] Wei Shen, Mu Zhou, Feng Yang, Caiyun Yang, and Jie Tian. Multi-
scale convolutional neural networks for lung nodule classification.
page 588–599. Springer, 2015.
[248] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer
statistics, 2016. CA: A Cancer Journal for Clinicians, 66(1):7–30, 1
2016.
[249] Lincoln Silva, D C. M. Saade, Giomar Sequeiros Olivera, Ari Silva,
Anselmo Paiva, Renato Bravo, and Aura Conci. A new database for
breast research with infrared image. Journal of Medical Imaging and
Health Informatics, 4:92–100, 3 2014.
[250] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside
convolutional networks: Visualising image classification models and
saliency maps. In In Workshop at International Conference on Learning
Representations. Citeseer, 2014.
[251] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv:1409.1556 [cs], 9
2014. arXiv: 1409.1556.
[252] Amitojdeep Singh, Sourya Sengupta, and Vasudevan Lakshmi-
narayanan. Explainable Deep Learning Models in Medical Image
Analysis. Journal of Imaging, 6(6):52, June 2020. Number: 6 Pub-
lisher: Multidisciplinary Digital Publishing Institute.
[253] Jake Snell, Karl Ridgeway, Renjie Liao, Brett D Roads, Michael C Mozer,
and Richard S Zemel. Learning to generate images with perceptual
similarity metrics. In 2017 IEEE International Conference on Image
Processing (ICIP), pages 4277–4281. IEEE, 2017.
[254] Jaemin Son, Sang Jun Park, and Kyu-Hwan Jung. Retinal vessel
segmentation in fundoscopic images with generative adversarial net-
works. arXiv:1706.09318 [cs], 6 2017. arXiv: 1706.09318.
[255] Th A Sorensen. A method of establishing groups of equal amplitude
in plant sociology based on similarity of species content and its appli-
cation to analyses of the vegetation on danish commons. Biol. Skar.,
5:1–34, 1948.
[256] José Sotoca, José Sánchez, and R Mollineda. A review of data complex-
ity measures and their applicability to pattern classification problems.
Actas del III Taller Nacional de Mineria de Datos y Aprendizaje, 1 2005.

[257] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. Dropout: a simple way to prevent neu-
ral networks from overfitting. Journal of machine learning research,
15(1):1929–1958, 2014.

[258] J. Suckling, C. R. M. Boggis, I. Hutt, S. Astley, D. Betal, N. Cerneaz,


D. R. Dance, S.-L. Kok, J. Parker, I. Ricketts, J. Savage, E. Stamatakis,
and P. Taylor. The mammographic image analysis society digital
mammogram database. Exerpta Medica, 1069:375–378, 1994.

[259] S. S. Suganthi and S. Ramakrishnan. Anisotropic diffusion filter


based edge enhancement for segmentation of breast thermogram
using level sets. Biomedical Signal Processing and Control, 10:128–
136, 3 2014.

[260] Yanan Sun, Xian Sun, Yuhan Fang, Gary G. Yen, and Yuqiao Liu. A
novel training protocol for performance predictors of evolutionary neu-
ral architecture search algorithms. IEEE Transactions on Evolutionary
Computation, 25(3):524–536, 2021.

[261] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.
On the importance of initialization and momentum in deep learn-
ing. In The 30th International Conference on Machine Learning (ICML),
volume 28 of Proceedings of Machine Learning Research, pages 1139–
1147. PMLR, June 2013.

[262] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. Effi-
cient Processing of Deep Neural Networks: A Tutorial and Survey.
arXiv:1703.09039 [cs], March 2017. arXiv: 1703.09039.

[263] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi.
Inception-v4, inception-resnet and the impact of residual connections
on learning. arXiv:1602.07261 [cs], 2 2016. arXiv: 1602.07261.

[264] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. Rethinking the inception architecture for computer
vision. pages 2818–2826, Las Vegas, NV, USA, 6 2016. 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), IEEE.

[265] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens,


and Zbigniew Wojna. Rethinking the inception architecture for com-
puter vision. arXiv:1512.00567 [cs], 12 2015. arXiv: 1512.00567.

[266] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens,


and Zbigniew Wojna. Rethinking the Inception Architecture for
Computer Vision. arXiv:1512.00567 [cs], December 2015. arXiv:
1512.00567.

[267] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall,
M. B. Gotway, and J. Liang. Convolutional neural networks for medical
image analysis: Full training or fine tuning? IEEE Transactions on
Medical Imaging, 35(5):1299–1312, 5 2016.

[268] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the
evaluation of generative models. In Yoshua Bengio and Yann LeCun,
editors, 4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track
Proceedings, 2016.

[269] Chris Thornton. Separability is a learner’s best friend. In John A.


Bullinaria, David W. Glasspool, and George Houghton, editors, 4th
Neural Computation and Psychology Workshop, London, 9–11 April
1997, pages 40–46. Springer London, London, 1998.

[270] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide


the gradient by a running average of its recent magnitude. COURSERA:
Neural networks for machine learning, 4(2):26–31, 2012.

[271] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide


the gradient by a running average of its recent magnitude. COURSERA:
Neural networks for machine learning, 4(2):26–31, 2012.

[272] Erico Tjoa and Cuntai Guan. A Survey on Explainable Artificial Intel-
ligence (XAI): Towards Medical XAI. IEEE Transactions on Neural Net-
works and Learning Systems, pages 1–21, 2020. arXiv: 1907.07374.

[273] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-


Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative mod-
els. In NIPS, 2017.

[274] Akif B. Tosun, Filippo Pullara, Michael J. Becich, D. Lansing Taylor,


Jeffrey L. Fine, and S. Chakra Chennubhotla. Explainable AI (xAI) for
Anatomic Pathology. Advances in Anatomic Pathology, 27(4):241–250,
July 2020.

[275] Pieter Van Molle, Miguel De Strooper, Tim Verbelen, Bert Vankeirsbilck,
Pieter Simoens, and Bart Dhoedt. Visualizing Convolutional Neural
Networks to Improve Decision Support for Skin Lesion Classification.
In Understanding and Interpreting Machine Learning in Medical Image
Computing Applications, Lecture Notes in Computer Science, pages
115–123, Cham, 2018. Springer International Publishing.

[276] Vladimir Vapnik and Alexey Chervonenkis. The necessary and suf-
ficient conditions for consistency in the empirical risk minimization
method. Pattern Recognition and Image Analysis, 1(3):283–305, 1991.

[277] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and
computing, 17(4):395–416, 2007.

[278] Ulrike Von Luxburg, Robert C. Williamson, and Isabelle Guyon. Clus-
tering: Science or art? Proceedings of ICML Workshop on Unsupervised
and Transfer Learning, page 65–79, 2012.

[279] Chaoyue Wang, Chang Xu, Chaohui Wang, and Dacheng Tao. Per-
ceptual adversarial networks for image-to-image transformation.
arXiv:1706.09138 [cs], 6 2017. arXiv: 1706.09138.

[280] Shuihua Wang, Ravipudi Venkata Rao, Peng Chen, Yudong Zhang,
Aijun Liu, and Ling Wei. Abnormal breast detection in mammogram
images by feed-forward neural network trained by jaya algorithm.
Fundamenta Informaticae, 151(1-4):191–211, 1 2017.

[281] Joe H. Ward Jr. Hierarchical grouping to optimize an objective function.


Journal of the American statistical association, 58(301):236–244, 1963.

[282] Lulu Wen, Kaile Zhou, and Shanlin Yang. A shape-based clustering
method for pattern recognition of residential electricity consumption.
Journal of cleaner production, 212:475–488, 2019.

[283] R. O. Winder. Partitions of N-Space by Hyperplanes. SIAM Journal on


Applied Mathematics, 14(4):811–818, July 1966.

[284] Anita Wokhlu, David O. Hodge, Kristi H. Monahan, Samuel J. Asir-


vatham, Paul A. Friedman, Thomas M. Munger, Yong-Mei Cha, Win-
Kuang Shen, Peter A. Brady, Christine M. Bluhm, Janis M. Haroldson,
Stephen C. Hammill, and Douglas L. Packer. Long-term outcome of
atrial fibrillation ablation: impact and predictors of very late recur-
rence. Journal of Cardiovascular Electrophysiology, 21(10):1071–1078,
10 2010. PMID: 20500237.

[285] Jelmer M. Wolterink, Tim Leiner, Max A. Viergever, and Ivana Išgum.
Automatic coronary calcium scoring in cardiac ct angiography using
convolutional neural networks. page 589–596. Springer, 2015.

[286] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. Gp-gan:
Towards realistic high-resolution image blending. arXiv:1703.07195
[cs], 3 2017. arXiv: 1703.07195.

[287] Yuan Xue, Tao Xu, Han Zhang, Rodney Long, and Xiaolei Huang.
Segan: Adversarial network with multi-scale l1 loss for medical image
segmentation. arXiv:1706.01805 [cs], 6 2017. arXiv: 1706.01805.

[288] Scott Yak, Javier Gonzalvo, and Hanna Mazzawi. Towards task and
architecture-independent generalization gap predictors. In ICML “Un-
derstanding and Improving Generalization in Deep Learning” Workshop,
2019.

[289] Yasunori Yamada and Tetsuro Morimura. Weight features for predict-
ing future model performance of deep neural networks. In The 25th
International Joint Conference on Artificial Intelligence (IJCAI), pages
2231–2237. AAAI Press, July 2016.

[290] Kenji Yamanishi, Jun-ichi Takeuchi, Graham Williams, and Peter


Milne. On-line unsupervised outlier detection using finite mixtures
with discounting learning algorithms. Data Mining and Knowledge
Discovery, 8(3):275–300, 5 2004.

[291] Dong Yang, Tao Xiong, Daguang Xu, Qiangui Huang, David Liu,
S. Kevin Zhou, Zhoubing Xu, JinHyeong Park, Mingqing Chen, Trac D.
Tran, Sang Peter Chin, Dimitris Metaxas, and Dorin Comaniciu. Au-
tomatic vertebra labeling in large-scale 3d ct using deep image-to-
image network with message passing and sparsity regularization.
arXiv:1705.05998 [cs], 5 2017. arXiv: 1705.05998.

[292] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN:
layered recursive generative adversarial networks for image generation.
In 5th International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
OpenReview.net, 2017.

[293] Darvin Yi, Rebecca Lynn Sawyer, David Cohn III, Jared Dunnmon,
Carson Lam, Xuerong Xiao, and Daniel Rubin. Optimizing and vi-
sualizing deep learning for benign/malignant classification in breast
tumors. arXiv:1705.06362 [cs], 5 2017. arXiv: 1705.06362.

[294] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsuper-
vised dual learning for image-to-image translation. arXiv:1704.02510
[cs], 4 2017. arXiv: 1704.02510.

[295] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for
improving the generalizability of deep learning. arXiv:1705.10941 [cs,
stat], 5 2017. arXiv: 1705.10941.

[296] Kyle Young, Gareth Booth, Becks Simpson, Reuben Dutton, and Sally
Shrapnel. Deep Neural Network or Dermatologist? In Interpretability
of Machine Intelligence in Medical Image Computing and Multimodal
Learning for Clinical Decision Support, Lecture Notes in Computer
Science, pages 48–55, Cham, 2019. Springer International Publishing.

[297] Roozbeh Yousefzadeh and Dianne P. O’Leary. Investigating decision
boundaries of trained neural networks. arXiv:1908.02802 [cs, stat], 8
2019. arXiv: 1908.02802.
[298] Yu Zeng, Huchuan Lu, and Ali Borji. Statistics of deep generated
images. arXiv preprint arXiv:1708.02688, 2017.
[299] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and
Oriol Vinyals. Understanding deep learning requires rethinking gener-
alization. arXiv:1611.03530 [cs], February 2017. arXiv: 1611.03530.
[300] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and
Oriol Vinyals. Understanding deep learning (still) requires rethinking
generalization. Commun. ACM, 64(3):107–115, February 2021.
[301] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena.
Self-attention generative adversarial networks. pages 7354–7363.
International Conference on Machine Learning, 5 2019. ISSN: 1938-
7228 section: Machine Learning.
[302] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an effi-
cient data clustering method for very large databases. ACM SIGMOD
Record, 25(2):103–114, 6 1996.
[303] Yu-Dong Zhang, Shui-Hua Wang, Ge Liu, and Jiquan Yang. Computer-
aided diagnosis of abnormal breasts in mammogram images by
weighted-type fractional fourier transform. Advances in Mechanical
Engineering, 8(2):1687814016634243, 2 2016.
[304] Zhifei Zhang, Yang Song, and Hairong Qi. Decoupled learning for
conditional adversarial networks. In 2018 IEEE Winter Conference on
Applications of Computer Vision (WACV), pages 700–708. IEEE, 2018.
[305] Qinpei Zhao and Pasi Fränti. Wb-index: A sum-of-squares based
index for cluster validity. Data & Knowledge Engineering, 92:77–89, 7
2014.
[306] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning
for person re-identification. pages 3586–3593. 2013 IEEE Conference
on Computer Vision and Pattern Recognition, 6 2013.
[307] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and An-
tonio Torralba. Object Detectors Emerge in Deep Scene CNNs.
arXiv:1412.6856 [cs], April 2015. ICLR 2015 conference paper.
[308] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. Learning Deep Features for Discriminative Localization. In
2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 2921–2929, Las Vegas, NV, USA, June 2016. IEEE.

[309] Quming Zhou, Zhuojing Li, and J. K. Aggarwal. Boundary extraction
in thermal images by edge map. SAC ’04, page 254–258, New York,
NY, USA, 2004. ACM.

[310] Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan
Zhang, Jun Wang, and Yong Yu. Activation maximization generative
adversarial nets. In International Conference on Learning Representa-
tions, 2018.

[311] Hui Zhu, Jianhua Huang, and Xianglong Tang. Comparing decision
boundary curvature. volume 3, pages 450–453 Vol.3. Proceedings
of the 17th International Conference on Pattern Recognition, 2004.
ICPR 2004., 8 2004. ISSN: 1051-4651.

[312] Wentao Zhu, Qi Lou, Yeeleng Scott Vang, and Xiaohui Xie. Deep
multi-instance networks with sparse label assignment for whole
mammogram classification. arXiv:1612.05968 [cs], 12 2016. arXiv:
1612.05968.

[313] Wentao Zhu, Xiang Xiang, Trac D. Tran, Gregory D. Hager, and Xiao-
hui Xie. Adversarial deep structured nets for mass segmentation from
mammograms. arXiv:1710.09288 [cs], 10 2017. arXiv: 1710.09288.

[314] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le.
Learning Transferable Architectures for Scalable Image Recognition.
arXiv:1707.07012 [cs, stat], April 2018. arXiv: 1707.07012.

Appendix A: The CNN architecture for Cifar-10/100 used in
Section 2.4.2

Table A.1: The CNN architecture used in Section 2.4.2

Layer                                   Shape
Input: RGB image                        32 × 32 × 3
Conv_3-32 + ReLU                        32 × 32 × 32
Conv_3-32 + ReLU                        32 × 32 × 32
MaxPooling_2 + Dropout (0.25)           16 × 16 × 32
Conv_3-64 + ReLU                        16 × 16 × 64
Conv_3-64 + ReLU                        16 × 16 × 64
MaxPooling_2 + Dropout (0.25)           8 × 8 × 64
Flatten                                 4096
FC_512 + Dropout (0.5)                  512
FC_10 (Cifar-10) / FC_20 (Cifar-100)    10 / 20
Output (softmax): [0, 1]                10 (Cifar-10) / 20 (Cifar-100)

The CNN architecture used in Section 2.4.2 of the main text consists of
four convolutional layers, two max-pooling layers, and two fully connected
(FC) layers. The activation function for each convolutional layer is the ReLU
function, and that for the output layer is the softmax function, which maps
the outputs to the range [0, 1] with a sum of 1. The notation Conv_3-32
indicates a convolutional layer with 32 units, each using a 3 × 3 pixel
(height × width) filter. MaxPooling_2 denotes a max-pooling layer with a
2 × 2 pixel window and stride 2, and FC_n represents an FC layer with n
units. A dropout layer randomly sets the stated fraction (rate) of its input
units to 0 at every update during training, which helps the network avoid
overfitting. Table A.1 shows the detailed architecture. Our training optimizer
is RMSprop [271] with a learning rate of 1e-4 and a decay of 1e-6; the loss
function is categorical cross-entropy, the monitored metric is accuracy, the
batch size is 32, and the total number of epochs is set to 200.
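
For concreteness, the following Python sketch shows one way to assemble the
network in Table A.1. It is a minimal illustration, not the code used in this
work: it assumes the Keras API (tensorflow.keras), which is not specified
above, and build_cnn is an illustrative name. "Same" padding is implied by
the unchanged 32 × 32 and 16 × 16 output shapes, and Table A.1 lists no
activation for FC_512, so none is set here.

    # Minimal sketch of the Table A.1 network (assumed framework: tensorflow.keras)
    from tensorflow.keras import layers, models, optimizers

    def build_cnn(num_classes=10):        # 10 for Cifar-10, 20 for Cifar-100
        model = models.Sequential([
            # Conv_3-32 + ReLU (x2); "same" padding keeps the 32 x 32 spatial size
            layers.Conv2D(32, 3, padding="same", activation="relu",
                          input_shape=(32, 32, 3)),
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.MaxPooling2D(2),        # MaxPooling_2: 2 x 2 window, stride 2
            layers.Dropout(0.25),
            # Conv_3-64 + ReLU (x2)
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.MaxPooling2D(2),
            layers.Dropout(0.25),
            layers.Flatten(),              # 8 x 8 x 64 = 4096
            layers.Dense(512),             # FC_512 (no activation listed in Table A.1)
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),  # FC_10 / FC_20
        ])
        # RMSprop with learning rate 1e-4; the decay of 1e-6 quoted above is omitted
        # because the corresponding argument name depends on the Keras version.
        model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

Training then amounts to calling model.fit with one-hot labels, batch_size=32,
and epochs=200, matching the settings above.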

Appendix B: Synthetic Datasets

Table B.1: Names of the 97 synthetic datasets used, from the Tomas Barton
repository (a).

3-spiral 2d-10c ds2c2sc13 rings square5 complex8


aggregation 2d-20c-no0 ds3c3sc6 shapes st900 complex9
2d-3c-no123 threenorm ds4c2sc8 simplex target compound
dense-disk-3000 triangle1 2d-4c sizes1 tetra donutcurves
dense-disk-5000 triangle2 2dnormals sizes2 curves1 donut1
elliptical_10_2 dartboard1 engytime sizes3 curves2 donut2
elly-2d10c13s dartboard2 flame sizes4 D31 donut3
2sp2glob 2d-4c-no4 fourty sizes5 twenty zelnik1
cure-t0-2000n-2D 2d-4c-no9 xor smile1 aml28 zelnik2
cure-t1-2000n-2D pmf hepta smile2 wingnut zelnik3
twodiamonds diamond9 hypercube smile3 xclara zelnik5
spherical_4_3 disk-1000n jain atom R15 zelnik6
spherical_5_2 disk-3000n lsun blobs pathbased
spherical_6_2 disk-4000n long1 cassini square1
chainlink disk-4500n long2 spiral square2
spiralsquare disk-4600n long3 circle square3
gaussians1 disk-5000n longsquare cuboids square4
(a) Available at https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial.

Appendix C: Simplification from Equation 4.4 to Equation 4.5

In the main text, Equation (4.4) is

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty} \left(\frac{1}{e}\right)^{N} \left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N+0.5}$$

Ignoring the constant 0.5, which is negligible compared with $N$,

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty} \left(\frac{1}{e}\right)^{N} \left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N}$$

Using the identity $x = e^{\ln x}$,

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty} e^{\ln\left[\left(\frac{1}{e}\right)^{N}\left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N}\right]} = \lim_{N\to+\infty} e^{(A)},
\qquad (A) = -N + \left(bN^{a}-N\right)\ln\frac{bN^{a}}{bN^{a}-N}$$

Let $t = \frac{1}{N} \to +0$. Then

$$(A) = -\frac{1}{t} + \left(\frac{b}{t^{a}} - \frac{1}{t}\right)\ln\frac{b/t^{a}}{b/t^{a} - 1/t}
= \frac{\left(b - t^{a-1}\right)\ln\left(\frac{b}{b - t^{a-1}}\right) - t^{a-1}}{t^{a}}$$

[i] If $a = 1$,

$$(A) = \frac{\overbrace{(b-1)\ln\frac{b}{b-1}}^{(B)} - 1}{t},
\qquad (B) = \ln\left(\frac{b}{b-1}\right)^{b-1}$$

Over the reals, it is easy to show that, for $b > 1$,

$$1 < \left(\frac{b}{b-1}\right)^{b-1} < e$$

Then $0 < (B) < 1$, and

$$\lim_{t\to+0}(A) = \lim_{t\to+0}\frac{(B)-1}{t} = -\infty$$

Therefore,

$$\lim_{N\to+\infty} P_c = e^{-\infty} = 0 \qquad \text{(Equation (4.5) in the main text, when } a = 1\text{)}$$

[ii] For $a > 1$, applying L'Hôpital's rule several times:

$$\lim_{t\to+0}(A) = \lim_{t\to+0}\frac{\left(b - t^{a-1}\right)\ln\frac{b}{b - t^{a-1}} - t^{a-1}}{t^{a}}
\overset{\text{L'Hôpital}}{=} \lim_{t\to+0}\frac{(1-a)\,t^{a-2}\ln\frac{b}{b - t^{a-1}}}{a\,t^{a-1}}$$

$$= \lim_{t\to+0}\frac{(1-a)\ln\frac{b}{b - t^{a-1}}}{a\,t}
\overset{\text{L'Hôpital}}{=} \lim_{t\to+0}\frac{(a-1)^{2}}{a\left(t - b\,t^{2-a}\right)}
= \lim_{t\to+0}\frac{(a-1)^{2}\,t^{-1}}{a\left(1 - b\,t^{1-a}\right)}$$

$$\overset{\text{L'Hôpital}}{=} \lim_{t\to+0}\frac{-(a-1)^{2}\,t^{-2}}{(a-1)\,a\,b\,t^{-a}}
= \lim_{t\to+0} -\frac{a-1}{a\,b\,t^{2-a}}$$

Substituting $N = \frac{1}{t}$,

$$\lim_{N\to+\infty} P_c = \lim_{t\to+0} e^{(A)} = \lim_{N\to+\infty} e^{-\frac{(a-1)N^{2-a}}{ab}}
\qquad \text{(Equation (4.5) in the main text, when } a > 1\text{)}$$

When $1 < a < 2$ (and when $a = 1$, shown in [i]), the exponent tends to $-\infty$, so

$$\lim_{N\to+\infty} P_c = e^{-\infty} = 0$$

When $a > 2$, the exponent tends to $0$, so

$$\lim_{N\to+\infty} P_c = e^{0} = 1$$

For $a = 2$,

$$\lim_{N\to+\infty} P_c = e^{-\frac{1}{2b}} \qquad \text{(Equation (4.6) in the main text)}$$
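
As a quick numerical sanity check of these limits (an illustrative sketch, not
part of the original derivation), $\log P_c$ can be evaluated directly at large
$N$; the values of $a$, $b$, and $N$ below are arbitrary choices.

    # Check the limiting behavior of P_c numerically (illustrative a, b, N).
    # log P_c = -N + (b*N**a - N + 0.5) * ln( b*N**a / (b*N**a - N) )
    import math

    def log_Pc(N, a, b):
        m = b * N**a
        # ln(m / (m - N)) = -log1p(-N / m), computed stably for large N
        return -N + (m - N + 0.5) * (-math.log1p(-N / m))

    b, N = 3.0, 10**6
    for a in (1.0, 1.5, 2.0, 3.0):
        # Expected: ~0 for 1 <= a < 2, exp(-1/(2b)) for a = 2, ~1 for a > 2
        print(f"a = {a}: P_c ~ {math.exp(log_Pc(N, a, b)):.6g}")
    print("Equation (4.6) limit for a = 2:", math.exp(-1.0 / (2.0 * b)))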
