Toward Explainability of Machine Learning in Medical Imaging: Generalizability, Separability, and Learnability
by Shuyue Guan
A Dissertation submitted to
The Faculty of
The School of Engineering and Applied Science
of The George Washington University
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Dissertation directed by
Murray H. Loew
Professor of Biomedical Engineering
The School of Engineering and Applied Science of The George Washington
University certifies that Shuyue Guan has passed the Final Examination
for the degree of Doctor of Philosophy as of April 07, 2022. This is the final
and approved form of the dissertation.
Shuyue Guan
© Copyright 2022 by Shuyue Guan
All rights reserved
Dedication
“You the wise, tell me, why should our days leave us, never to return?”
– Zhu, Ziqing
Acknowledgments
and inviting me to join the FDA. I feel exceptionally fortunate to have the
opportunity to work on amazing projects with their talented teams. I also
greatly thank my previous academic advisors. Prof. Claire Monteleoni
was my first advisor at GWU and supervised me on a summer research
project; Prof. Dawei Qi was my research supervisor at the Northeast Forestry
University (NEFU). They shared their rich expertise in image processing and
machine learning with me and have directly influenced my research and
career choices.
I sincerely thank Prof. Claire Monteleoni, Prof. Murray Loew, Prof. Jason
M. Zara, Dr. Amrinder Arora, Dr. Amy Lingley-Papadopoulos, and Dr.
HyungSok Choe, whose courses provided me with teaching assistant positions. These positions offered me financial support, and I gained knowledge and skills through working for their courses.
I am also very grateful to all past and present members I have met in the Medical Imaging and Image Analysis Laboratory (Loew's Lab) at George Washington University. I learned a lot from them during our lab meetings, and they also provided many valuable questions and comments on my research.
I would like to send my heartfelt gratitude to all my friends, fellows, and
classmates. They make my life a lot more fun and beautiful.
I sincerely acknowledge everyone who has helped me during my doctoral
study in the past years, including faculty, staff, editors, and reviewers in
various academic activities, workshops, conferences, and journals.
Finally, I want to thank my beloved girlfriend from the bottom of my heart for her companionship, her love, and our affection. My immediate family gives me infinite support and help and is a constant source of love, entertainment, encouragement, assistance, and understanding. In addition,
this dissertation is dedicated to my dear parents, who are the “first cause”
of my all!
Abstract of Dissertation
The applications of Deep Learning (DL) for medical imaging have become
increasingly popular in recent years. During my studies of applications of
Machine Learning (ML) and DL methods in medical imaging, I realized that
there is a trade-off between accuracy and explainability for these methods.
Although some DL methods achieve better performance, they are more difficult
to understand and explain. The lack of explainability limits the acceptance
of DL applications by clinicians. The requirement of explainability and the
DL applications for medical imaging that I have investigated thus have
stimulated my research interest in eXplainable Artificial Intelligence (XAI).
Explainability has multiple facets, and there is to date no unified defi-
nition. For explainable ML, I have primarily addressed these aspects: the
separability of data, cluster validation, Generative Adversarial Network (GAN)
evaluation, generalizability of the Deep Neural Network (DNN), learnability
of DL models, and transparent DL. The study of explainable ML has been motivated by several completed applications for medical-object detection and segmentation. Medical image analysis and XAI raise very rich questions. My research aims to contribute to medical image analysis by focusing on the performance (accuracy) and explainability of applications using ML and DL. The long-term goals of this work are to help make DL-based Computer-Aided Diagnosis (CAD) systems transparent, understandable, and explainable and to win the trust of end-users so that, eventually, these new techniques can be widely accepted by clinicians to improve medical diagnosis and treatment outcomes.
Table of Contents
Dedication iv
Acknowledgments v
List of Tables xx
Chapter 1: Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Summary . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Hyper-spectral Image-based Cardiac Ablation Lesion
Detection . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Applications of Transfer Learning and the Generative
Adversarial Network (GAN) in Breast Cancer Detection 8
1.2.3 Transparent Deep Learning/Machine Learning . . . 9
1.2.4 Deep Learning-based Medical Image Segmentation . 10
1.3 Original Contributions . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Organization . . . . . . . . . . . . . . . . . . . . 13
2.5.4 Future Work and Limitations . . . . . . . . . . . . . . 47
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Introduction of GAN Evaluation Metrics . . . . . . . 132
4.4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . 133
4.4.3 Likeness Score: A Modified DSI for GANs Evaluation 139
4.4.4 Experiments and Results . . . . . . . . . . . . . . . . 141
4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 154
4.5 Generalizability of Deep Neural Networks . . . . . . . . . . . 161
4.5.1 Introduction of Generalizability of Neural Networks . 161
4.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.5.3 Experiments and Results . . . . . . . . . . . . . . . . 168
4.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 176
4.6 Estimation of Training Accuracy for Two-layer Neural Networks 178
4.6.1 Introduction and Related Work . . . . . . . . . . . . . 179
4.6.2 The Hidden Layer: Space Partitioning . . . . . . . . . 182
4.6.3 Empirical Corrections . . . . . . . . . . . . . . . . . . 191
4.6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 199
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Bibliography 252
Appendix B: Synthetic Datasets 285
List of Figures
3.3 Hypercube of aHSI images: images in the hypercube were ordered
by their wavelength increasingly on the Z-axis. Each pixel on the
X-Y plane thus has an associated spectrum. . . . . . . . . . . . 58
3.4 (a) Pre-processing operations and reshaping hypercube into a 2D
matrix; (b) the rule of reshaping and inverse-reshaping. . . . . 60
3.5 K-means clustering. . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Appearance of ablated tissue after: (a) linear unmixing from aHSI
system, (b) TTC staining. . . . . . . . . . . . . . . . . . . . . . . . 62
3.7 Results for porcine atria (Set-1) clustered by k-means into: (a) 5
clusters and (b) 10 clusters. Panel (c) shows an auto-fluorescence
image at 500 nm; (d) shows the lesion areas (red) detected when
k=10, superimposed on the image in (c). The corresponding
lesion component image, which is from the unmixed image that
contains lesion component and non-lesion component, is shown
in (e); followed by binary image obtained from (e) by applying
Otsu’s thresholding (f). . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Maximum, average, and minimum accuracies over 10 datasets
for each k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 One kind of 4-feature grouping. . . . . . . . . . . . . . . . . . . . 67
3.10 Accuracies of SNs for one dataset (Set-1). . . . . . . . . . . . . . 69
3.11 Feature grouping results for porcine atria (Set-1): (a) k-means
clustering (k=10) by using all 31 features; (b) k-means clustering
(k=10) by using four features from 4-feature grouping (SN=2857):
[wavelength groups: 420-510, 520-600, 610-630, 640-720 nm];
(c) k-means clustering (k=10) by using four features from different
4-feature grouping (SN=3716): [wavelength groups: 420-580,
590-600, 610-680, 690-720 nm]. . . . . . . . . . . . . . . . . . . 69
3.12 Feature grouping accuracies for 10 datasets; each row represents
a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.13 Accuracies over 10 datasets. . . . . . . . . . . . . . . . . . . . . . 71
3.14 Smoothed Minimum accuracies (scaled) with the 3 dividers. The
green area (SN: 2730-3245) includes most high-accuracy combi-
nations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.15 Evaluated accuracies over 10 datasets. . . . . . . . . . . . . . . 73
3.16 The simplified flowchart summarizes the methods in this study.
The ground truth areas of lesion were obtained from aHSI system
by linear unmixing and verified by TTC analysis. By comparing
with truth data, we found the optimal k-value for k-means algo-
rithm (green) as well as the optimal groups (blue). The procedure
on the left (red) is our proposed methods for lesion detection from
ablated tissue to lesion areas. . . . . . . . . . . . . . . . . . . . . 74
3.17 Time costs for k-means clustering. . . . . . . . . . . . . . . . . . 75
3.18 Two clusters (classes) datasets with different label assignments.
Each histogram indicates the relative frequency of the value of
each of the three distance measures (indicated by color). . . . . 81
3.19 An example of rank numbers assignment. . . . . . . . . . . . . 86
3.20 Examples for rank-differences of synthetic datasets. . . . . . . . 97
3.21 Wrongly-predicted clusters have a higher DSI score than real
clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.1 Result of the New-model. Blue curve is the accuracy after each
epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 110
4.2 Result of the Feature-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 111
4.3 Result of the Tuning-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the
smoothing interval is about 20 epochs). . . . . . . . . . . . . . . 112
4.4 Comparison of the three CNN classification models: New-model
(yellow); Feature-model, to train a neural network-classifier (red);
Tuning-model (blue). The values are maximum smoothed accu-
racy and time cost (second) of training per epoch. . . . . . . . . 113
4.5 (A) A mammographic image from DDSM rendered in grayscale;
(B) Cropped ROI by the given truth abnormality boundary; (C)
Convert Grey to RGB image by duplication. . . . . . . . . . . . . 117
4.6 The three types of affine transformation. . . . . . . . . . . . . . 117
4.7 The principle of GAN. . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.8 Validation accuracy of CNN classifiers trained by three types of
AFF ROIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.9 The flowchart of our experiment plan. CNN classifiers were
trained by data including ORG, AFF and GAN ROIs. Valida-
tion data for the classifier were ORG ROIs that were never used
for training. The AFF box means to apply affine transformations. 125
4.10 (Top row) Real abnormal ROIs; (Bottom row) synthetic abnormal
ROIs generated from GAN. . . . . . . . . . . . . . . . . . . . . . . 126
4.11 Training accuracy and validation accuracy for six training datasets. 127
4.12 Histogram of mean and standard deviation. (Normalized) . . . . 130
4.13 Problems of generated images from the perspective of distribution.
The area of dotted line is the distribution of real images. The dark-
blue dots are real samples and red dots are generated images. (a)
is overfitting, lack of Creativity. (b) is lack of Inheritance. (c) is
called mode collapse for GAN and (d) is mode dropping. Both (c)
and (d) are examples of lack of Diversity. . . . . . . . . . . . . . 133
4.14 Lack of Creativity, Diversity, and Inheritance in 2D. Histograms
of (a) and (b) are zoomed to ranges near zero; (c) has the entire
histogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.15 Plots of values in Table 4.9. . . . . . . . . . . . . . . . . . . . . . 144
4.16 Column 1: samples from four types of real images; column 2-4:
samples from synthetic images of three GANs trained by the four
types of images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.17 Normalized and ranked scores. X-axis shows scores and y-axis
shows their normalized values; 0 is for the worst (model) perfor-
mance and 1 is for the best (model) performance. Colors are for
generators and shapes are for image types; see details in legend. 148
4.18 Column 1: samples from real images of CIFAR-10; column 2-6:
samples from synthetic images of five GANs: DCGAN, WGAN-GP,
SNGAN, LSGAN, and SAGAN trained by the original 2,000-image
subset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.19 Processes to build real set and generated sets including opti-
mal generated images and generated images lack creativity, lack
diversity, lack creativity & diversity, and lack inheritance. . . . 151
4.20 Column 1: samples from the real set; column 2-6: sample images
from the five virtual GAN models: Opt., LC, LD, LC&D, and LIn
trained by the real set. . . . . . . . . . . . . . . . . . . . . . . . . 153
4.21 Real and generated datasets from virtual GANs on MNIST. First
row: the 2D tSNE plots of real (blue) and generated (orange) data
points from each virtual GAN. Second row: histograms of ICDs
(blue for real data; orange for generated data) and BCD for real
and generated datasets. The histograms in (b)-(d) are zoomed to
the beginning of plots; (a) and (e) have the entire histograms. . 155
4.22 Time cost of measures running on a single core of CPU (i7-6900K).
To test time costs, we used same amount of real and generated
images (200, 500, 1000, 2000, and 5000) from CIFAR-10 dataset
and DCGAN trained on CIFAR-10. † IS only used the generated
images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.23 To generate adversarial examples of classifier f . . . . . . . . . . 164
4.24 Adversarial examples generated by pairs of data. . . . . . . . . . 165
4.25 Two kinds of round decision boundary. . . . . . . . . . . . . . . 167
4.26 Local adversarial set generated by 3-nearest neighbors of a pair. 168
4.27 Decision boundaries of two models trained by the synthetic 2-D
dataset. The FCNN:(a) has only one hidden layer with one neuron;
its number of parameters is 5 (including bias). The FCNN:(b) has
three hidden layers with 10, 32 and 16 neurons; its number of
parameters is 927 (including bias). . . . . . . . . . . . . . . . . . 170
4.28 Local DBC scores from two models trained by the breast cancer
dataset. The FCNN bC1 has three hidden layers (20 neurons in
each layer) and three Dropout layers; its number of parameters is
1,481 (including bias). The bC2 has one hidden layer with 1,000
neurons; its number of parameters is 32,001 (including bias). . 172
4.29 Training and test accuracies in training process of three models.
The CNN cC1 has three convolutional layers, three max-pooling
layers, one dense layer (64 neurons) and one Dropout layer. The
cC2 has one convolutional layer and three dense layers (256, 128,
64 neurons). The cC3 has only one dense layer (1024 neurons). 173
4.30 Means and medians of local DBC scores on model cC1, cC2 and
cC3 using different numbers of nearest neighbors. . . . . . . . 174
4.31 Increasingly sorted local DBC scores from three models. The
upper figure is the whole plot, and the lower figure is zoomed the
plot in range from 2k-6k to clearly see positions of three curves. 175
4.32 Adversarial examples for the cC1 model. . . . . . . . . . . . . . 176
4.33 Linear adversarial set (green) on lumpy boundary (black). . . . 178
4.34 An example of the two-layer Fully-Connected Neural Network
(FCNN) with d − L − 1 architecture. This FCNN is used to classify
N random vectors in Rd belonging to two classes. Detailed settings
are stated before in Section 4.6.1.1. The training accuracy of this
classification can be estimated by our proposed method, without
applying any training process. The detailed Algorithm of our
method is shown in Section 4.6.3.3. . . . . . . . . . . . . . . . . 181
4.35 Maximum number of partitions in 2-D . . . . . . . . . . . . . . . 190
4.36 Fitting curve of b1 = f (N, L) in 2-D . . . . . . . . . . . . . . . . . . 193
4.37 Plots of d vs. x_d y_d from Table 4.18. The blue dotted line is linearly fitted
by points to show the growth. . . . . . . . . . . . . . . . . . . . 196
4.38 Estimated training accuracy results comparisons. y-axis is accu-
racy, x-axis is the dimensionality of inputs (d). . . . . . . . . . 197
4.39 Evaluation of estimated training accuracy results. y-axis is esti-
mated accuracy; x-axis is the real accuracy; each dot is for one
case; red line is y = x. R2 ≈ 0.955. . . . . . . . . . . . . . . . . . . 199
5.1 Full thermal raw images of two patients, including the neck,
shoulder, abdomen, background and chair. . . . . . . . . . . . . 208
5.2 Our breast infrared thermography system. . . . . . . . . . . . . 210
5.3 Preprocessing of the raw IR images: (a) original raw IR image, (b)
manual rectangular crop to remove shoulders and abdomen, and
(c) is the hand-trace of the breast contour to generate the manual
segmentation (ground truth). . . . . . . . . . . . . . . . . . . . . 211
5.4 Training and testing data for Experiment 1 and 2. . . . . . . . . 214
5.5 The evaluation processes. . . . . . . . . . . . . . . . . . . . . . . 215
5.6 The training curve. . . . . . . . . . . . . . . . . . . . . . . . . . . 216
5.7 Segmentation results of one patient from Experiment 1. . . . . 217
5.8 Results of Experiment 1. The blue dots are the average IoU for
each patient and bars show the range among 3 samples. . . . . 218
5.9 Segmentation results of two patients from Experiment 2. . . . . 218
5.10 Results of Experiment 2. The blue dots are the average IoU of each
patient among its 15 testing samples, the red lines are medians
and the bars show the ranges. . . . . . . . . . . . . . . . . . . . 219
5.11 Comparison of results from the two experiments (first row: Exper-
iment 2, second row: Experiment 1). The second column (Gray
seg-image) shows output of segmentation models. The third col-
umn is the ground truth breast region of the patient’s testing
samples. (Top part: p.001, bottom part: p.009). . . . . . . . . . 221
5.12 The size and object-area ratio change of images. We change image
size by down-sampling and change object-area ratio by adding
blank margin around the object and down-sampling to keep the
same size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.13 Results of Grad-CAM applied to Xception model with input of
an elephants image. (a) is the input image. (b) is original image
masked by Grad-CAM heatmap (using ‘Parula’ colormap) of the
prediction on this input. (c) is the Grad-CAM heatmap mask
using gray-scale colormap. (d) is original image filtered by the
heatmap mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
5.14 Flowchart of the Experiment #1. The true boundaries of tumor
regions in abnormal ROIs are provided by the DDSM database. 231
5.15 Flowchart of the Experiment #2. The normal and abnormal ROIs
are used twice to train the CNN classifiers and then to generate
CAMs by trained classifiers using Grad-CAM algorithm. The CNN
classifiers to be trained by CAM-filtered ROIs are the same CNN
models (same structures) as trained by the original ROIs before
but trained from scratch again. . . . . . . . . . . . . . . . . . . 232
5.16 The ROI (left) is cropped from an original image (right) from DDSM
dataset. The red boundary shows the tumor area. The ROI is
larger than the size of tumor area because of padding. . . . . . 234
5.17 The padding is added to four sides of ROIs by some randomness
and depended on the size of tumor area. . . . . . . . . . . . . . 235
5.18 Examples of ROIs. The tumor mask is binary image created from
the tumor ROI and truth boundary of the tumor area. . . . . . 235
5.19 Result of Experiment #1. The first row shows one of the abnormal
(tumor) ROIs and its truth mask. Other rows show the CAMs of
this ROI generated by using trained CNN classifiers and Grad-
CAM algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
5.20 Plots of Dice and CAM_val_acc for the six CNN classifiers in
Table 5.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
5.21 Examples of truth-mask-filtered (a) and inverse-mask-filtered (b)
ROIs from the case shown in Figure 5.19. . . . . . . . . . . . . . 241
5.22 Some tumor ROIs and their CAMs from Xception. . . . . . . . 241
List of Tables
4.11 Measure results averaged by generators . . . . . . . . . . . . . . 147
4.12 Measure results on CIFAR-10 . . . . . . . . . . . . . . . . . . . . 149
4.13 Measure results from virtual GAN models . . . . . . . . . . . . . 154
4.14 Statistical Results of local DBC scores on bC1 and bC2. . . . . 172
4.15 Statistical Results of local DBC scores on cC1, cC2 and cC3. . 176
4.16 Accuracy results comparison. The columns from left to right
are dimension, dataset size, number of neurons in hidden layer,
the real training accuracy and estimated training accuracy by
Equation (4.15) and Theorem 4.1. . . . . . . . . . . . . . . . . . 192
4.17 Estimated training accuracy results comparison in 2-D. The
columns from left to right are dataset size, number of neurons
in hidden layer, the real training accuracy, estimated/predicted
training accuracy by Equation (4.17) and Theorem 4.1, and (abso-
lute) differences based on estimations between real and estimated
accuracies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.18 Parameters {xd , yd , cd } in Equation (4.16) (Observation 4.1) for
various dimensionalities of inputs are determined by fitting. . 195
B.1 Names of the 97 used synthetic datasets from the Tomas Barton
repositorya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
List of Abbreviations
CT Computed Tomography
DL Deep Learning
FC Fully-Connected
LS Likeness Score
ML Machine Learning
NN Neural Network
List of Symbols
IRef Binary-Unmixing-Reference
I31Rlt Binary-31(features)-Result
I4Rlt Binary-4(features)-Result
SN Serial Number
Chapter 1: Introduction
1.1 Background
Figure 1.1: Accuracy and explainability trade-off [97]
There is no one unified definition for XAI so far [64, 158]. Various
definitions and terminologies are proposed by different studies, while many
definitions overlap in concept [72]. A simple definition, proposed by D. Doran et al. [63], is to explain the internal decisions within an ML system that lead to its outcomes. Such explanations require insight into the rationale that the ML system uses to obtain conclusions from the input data. This definition is not comprehensive, however, because how users can understand those "internal decisions" remains a problem. For example, a trained Fully-
Connected Neural Network (FCNN) can be described as a very complicated
composition function, but such a composition function conveys little meaning to humans. D. Gunning proposed a more specific definition in DARPA's XAI program [95]; he considers that an XAI system must answer:
• Why the decisions were made by an ML system, and why not something else.
In addition, an XAI system should maintain high accuracy on its tasks. Based on D. Gunning's definition, F. Hohman et al. [109] summarize the definition of an XAI system by focusing on the five W's and How (Why, Who, What, When, Where, and How). The authors consider that we should first clarify the purpose of the explanation (why to explain and for whom), then decide which parts to explain (what) and find methods (how), and finally know the methods' limitations and effective domains (when and where they can be used). Extending from D. Doran et al., A. Barredo Arrieta et al. [21] consider
that the XAI system can produce detailed reasons to make its internal functions clear and understandable. These definitions emphasize the importance of human participation and understanding. Human users' experiences matter; hence, the goals of XAI also include confidence, trustworthiness, fairness, accessibility, and privacy awareness, among others [158, 21]. To make
the definition clear and neat, R. Roscher et al. [222] summarize the XAI in
three core elements:
1. Transparency of design
2. Interpretability of processing
3. Explainability to users
Transparency requires that all techniques used in the XAI system – models, methods, and algorithms – have clear reasons or purposes according to the needs and goals. For example, an XAI system must provide the reason why it uses a Convolutional Neural Network (CNN) model with 16 hidden layers. Interpretability means interpreting the processing from input data to results (output); for example, for a given result, the XAI system could answer the question: on what does the ML algorithm base its decision? [36]. Explainability is important to users. It further explains the answers provided by transparency and interpretability, integrated with domain knowledge. The final goal of an XAI system is to provide explanations and answers to specific users. Therefore, explainability requires that the XAI system translate its explanations into forms understandable to its target users. In fact, the concept of XAI is still
evolving. While the definition of XAI has been discussed for years, a concise
or universal definition is unavailable [222]. Since explainability is related
to the human mind, a deep and comprehensive discussion of XAI must
involve the disciplines of arts and social sciences [180] such as philosophy,
linguistics, psychology, and cognitive science. In this study, I do not pursue the definition of XAI further because such a discussion is beyond the technical field and peripheral to the main topic. Instead, I focus on XAI approaches related to the techniques and methods used in medical image analysis.
There, ML can be applied to a wide range of tasks and provides promis-
ing approaches to make the diagnostic process more precise and efficient.
Although ML and DL have achieved notable results in the laboratory, they have not been deployed widely in clinics because of the lack of explainability [252]. For reasons of responsibility and reliability, explainability is required of a CAD system to earn the trust of physicians, regulators, and patients. Tosun et al. [274] indicate the main requirements for explainable
CAD systems:
• Display the multi-level triaging for a case (like the decision tree).
Recent explainability methods for medical imaging mostly focus on saliency/attribution maps [252, 272, 250, 275], which show the regions of interest (ROIs) used to make decisions, such as CAM [308]/Grad-CAM [240], Gradient [71], and SHAP [296].
As medical image analysis techniques develop, new XAI methods are continually being applied to the field of CAD. The goals are to make CAD systems transparent, understandable, and explainable and to win the trust of end-users so that, ultimately, they can be widely accepted in clinics to improve medical diagnosis and treatment outcomes. Therefore, my studies include not only several applications of ML and DL technologies in medical imaging but also aim to contribute methods focusing on the explainability of ML and DL.
priori knowledge of tissue spectra, can also be an effective means to detect lesions from aHSI hypercubes. The average accuracy for detection by
k-means (k = 10) using 31 features was about 74% when compared to refer-
ence images. Secondly, we have demonstrated that the number of spectral
bands (which are referred to as features) can be reduced (by grouping them)
without significantly affecting lesion detection accuracy. Specifically, we
show that by using the best four grouped features, the accuracy of lesion
identification was about 94% of that achieved by using 31 features. The
time cost of 4-feature clustering was about 40% of the 31-feature clustering,
demonstrating that 4-feature grouping can speed up acquisition and pro-
cessing. From an instrumentation point of view, by using a limited number
of features one is able to combine multiple spectral bands into one spectrally
wide band. This is extremely beneficial for low-light applications such as
implementation of aHSI via catheter access.
This project is important because the recurrence rate of AF after an
ablation procedure can be as high as 50% and more than 90% of these
recurrent cases have been linked to gaps between ablation lesions [284].
Incomplete placement of lesions that later results in AF recurrence could be curtailed if clinicians could directly monitor lesion formation along with the degree of tissue damage. Unfortunately, the endocardial surface of the left atrium, where most RF ablation procedures are performed, is covered by thick layers of collagen and elastin that prevent direct visualization of the ablated muscle beneath. While imaging technologies such as MRI, CT, and ultrasound have been successfully applied for lesion assessment, they have significant limitations: CT and MRI are expensive and involve radiation and/or contrast agents, and ultrasound imaging has poor resolution; thus, they are not well suited to live monitoring [198]. Therefore, our work
has been exploring another visualization approach called aHSI to solve the
problem.
This project is important because training a CNN from scratch requires a large number of labeled images [73]. For some kinds of medical image data, such as mammographic tumor images, however, it is difficult to obtain a sufficient number of images to train a CNN classifier because true positives are scarce in the datasets and expert labeling is expensive [111].
The shortcomings of an insufficient number of images to train a classifier are
well-known [146]; thus, it is important to study methods that address this problem and thereby improve the performance of a CNN classifier, especially for medical image analysis.
focused on using classification performance and statistical metrics.
In this study, we consider a fundamental way to evaluate GANs by
directly analyzing the images they generate, instead of using them as
inputs to other classifiers.
in computer vision and has brought a breakthrough in image segmentation
applications, especially for medical images.
Autoencoder-like convolutional and deconvolutional neural networks (C-
DCNN) are promising computational approaches to automatically segment
breast areas in thermal images. We apply the C-DCNN to segment breast
areas from our thermal breast images database, which we are collecting in
our pilot study by imaging breast cancer patients with our infrared camera
(N2 Imager). We then examine how to segment targets using a classifier trained on those targets, instead of training a new segmentation model, and we evaluate the segmentation results. Specifically, we test this method for medical object segmentation.
This project is important because applying image segmentation to medical images can remove unnecessary parts and extract the key regions of interest (ROIs); this is a crucial preprocessing step for medical-imaging CAD systems. Automatic segmentation of the ROIs will limit the area for tumor search and reduce processing time. It will further reduce the time and effort required for manual segmentation and potentially minimize human errors.
to these challenging tasks. Recent DL-based segmentation methods have
been shown to outperform previous techniques for many types of medical
images.
achieve users’ trust. My research has practical applications in lesion/cancer
detection, medical image segmentation, and explainable Artificial Intelli-
gence (XAI). My most significant research contributions are:
networks; the measure can be used to analyze the generalizability of
deep learning models.
radio-frequency ablation (RFA) lesion detection using k-means cluster-
ing and the application of DSI as a CVI to evaluate clustering results.
Chapter 2: Distance-based Intrinsic Measure of Data Separability1
2.1 Introduction
Data and models are the two main foundations of machine learning
and deep learning. Models learn knowledge (patterns) from datasets. An
example is that the convolutional neural network (CNN) classifier learns how
to recognize images from different classes. There are two aspects in which
we examine the learning process: complexity of the learning model [116]
1 This work has been published in [J1].
and the separability of the dataset [47]. The learning outcomes are highly dependent on these two aspects. For a specific model, the learning capability is fixed, so the training process depends on the training data. Separability is an intrinsic characteristic of a dataset [76] that describes how data points belonging to different classes are mixed with each other.
[Figure 2.1: Two-class datasets in which (a) a single straight line can separate the classes and (b) it cannot.]
As reported by Zhang et al. [300], the time to convergence to the same training loss was greater when training on random labels on the CIFAR-10 dataset than when training on true labels. It is not surprising that the performance
of a given model varies between different training datasets depending on
their separability. For example, in a two-class problem, if the scattering
area for each class has no overlap, one straight line (or hyperplane) can
completely separate the data points (Figure 2.1a). For the distribution
shown in Figure 2.1b, however, a single straight line cannot separate the
data points successfully, but a combination of many lines can. In other
words, for a given classifier, it is more difficult to train on some datasets
than on others. The difficulty of training on a less-separable dataset is made
evident by the requirement for greater learning times (e.g., number of epochs
for deep learning) to reach the same accuracy (or loss) value and/or to obtain
a lower accuracy (or higher loss), compared with the more-separable dataset.
The training difficulty, however, also depends on the model employed. In
summary, the separability of a dataset can be characterized in three ways:
Besides the applications discussed in our studies, DSI has broad potential for other applications in deep learning, machine learning, and data science. By providing an understanding of data separabil-
ity, DSI could help in choosing a proper machine learning model for data
classification [83]. By examining the similarity of the two distributions,
DSI can detect (or certify) the distribution of a sample set, i.e., distribution
estimation. DSI can also be used as a feature selection method [236, 62]
for dimensionality reduction and as an anomaly detection method in data
analysis.
Our review of the literature indicates that there have been substantially
fewer studies on data separability per se than on classifier models. A more
general issue than that of data separability is data complexity [84, 37],
which measures not only the relationship between classes but also the
data distribution in feature space. Ho and Basu [108] conducted a ground-
breaking review of data complexity measures. They reported measures for
classification difficulty, including those associated with the geometrical
complexity of class boundaries. Recently, Lorena et al. [165] summarized
existing methods for the measurement of classification complexity. In the
survey, most complexity measures have been grouped in six categories:
feature-based, linearity, neighborhood, network, dimensionality, and class
imbalance measures (Table 2.1). Other ungrouped measures discussed in
Lorena’s paper have similar characteristics to the grouped measures or may
have large time cost. Each of these methods has possible drawbacks. In
particular, the features extracted from data for the five categories of feature-
based measures may not accurately describe some key characteristics of
the data; some linearity measures depend on the classifier used, such as
support-vector machines (SVMs); neighborhood measures [130] may show
only local information; some network measures may also be affected by local
relationships between classes depending on the computational methods
employed; dimensionality measures are not strongly related to classification
complexity; and, class imbalance measures do not take the distribution of
data into account.
The Fisher discriminant ratio (FDR) [153], which is closely related to Linear Discriminant Analysis (LDA), measures the separability of data using the mean and standard deviation (SD) of each class. FDR is a feature-based measure (F1
and F1v in Table 2.1), and it has been used in many studies. But FDR fails
in some cases (e.g., as Figure 2.5(e) shows, Class 1 data points are scattered
around Class 2 data points in a circle; their FDR ≈ 0.) The initial definition
of FDR considers the separability between two classes to be calculated from
between-class and within-class scatter matrices.
[Figure 2.2: Computation of the DSI for a two-class dataset: histograms of the ICD and BCD distance sets and their Kolmogorov–Smirnov (KS) similarity, e.g., s_y = KS({d_y}, {d_{x,y}}).]
as the Distance-based Separability Index (DSI). DSI uses the distances
between data points both between-class and intra-class, and it is similar in
some respects to the network measures because it represents the universal
relations between the data points. In particular, we have formally shown that
the DSI can indicate whether the distributions of datasets are identical
for any dimensionality. Figure 2.2 shows the definition and computation
of the DSI by an example of a two-class dataset in 2-D. Similarly inspired
by the idea of FDR, the recent works of Generalized Discrimination Value
(GDV) [238] and Sequentially Forward Selection based on the Separability
(SFSS) algorithm [115] also proposed their separability-based evaluations
for data classes. Since these evaluations, like the FDR, use only averaged measures, however, they also fail in some cases. Our DSI overcomes
these drawbacks because it considers all between-class and intra-class
distance values instead of their mean values. In this paper, we verify DSI by
comparing it with other state-of-the-art separability/complexity measures
based on synthetic and real (CIFAR-10/100) datasets.
In general, the DSI has wide applicability and is not limited simply to understanding the data; for example, it can also be applied to measure generative adversarial network (GAN) performance (Section 4.4.3), to evaluate clustering results (Section 3.5), to detect anomalies [200], and to select classifiers [193, 30, 31, 83] and features for classification [256, 48].
The novelty of this study is to examine the distributions of datasets via the distributions of distances between their data points; the proved theorem (Theorem 2.1) connects the two kinds of distributions. That is the gist of the DSI. To the best of our knowledge, no existing study uses the same method.
2.3 Methodological Development for Distance-based Separability In-
dex (DSI)
distribution (distributions have the same shape, position and support, i.e.,
the same probability density function) and have sufficient data points to fill
the region, this dataset reaches the maximum entropy because, within any small region, the occurrence probabilities of the two classes' data are equal (50%). This is also the most difficult situation in which to separate the dataset.
Here, we propose a new method, the Distance-based Separability Index (DSI), to measure the similarity of data distributions. The DSI analyzes how two classes of data are mixed together and serves as a substitute for entropy.
Definition 2.1. The Intra-Class Distance (ICD) set {d_x} is the set of distances between any two points in the same class (X): {d_x} = { ‖x_i − x_j‖₂ | x_i, x_j ∈ X; x_i ≠ x_j }.
Definition 2.2. The Between-Class Distance (BCD) set {d_{x,y}} is the set of distances between any two points from different classes (X and Y): {d_{x,y}} = { ‖x_i − y_j‖₂ | x_i ∈ X; y_j ∈ Y }.
Remark. The metric for all distances is Euclidean (l 2 norm) in this paper.
In Section 2.5.3, we compare the Euclidean distance with some other dis-
tance metrics including City-block, Chebyshev, Correlation, Cosine, and
Mahalanobis, and we showed that the DSI based on Euclidean distance has
the best sensitivity to complexity, and thus we selected it.
1. First, the ICD sets of X and Y, {d_x} and {d_y}, and the BCD set {d_{x,y}} are computed by their definitions (Defs. 2.1 and 2.2).
2. Second, the similarities between the ICD and BCD sets are computed using the Kolmogorov–Smirnov (KS) [78] distance²: s_x = KS({d_x}, {d_{x,y}}) and s_y = KS({d_y}, {d_{x,y}}). The DSI of the two-class dataset is their average:
DSI({X, Y}) = (s_x + s_y) / 2
Remark. We do not use a weighted average because, once the distributions of the ICD and BCD sets are well characterized, the sizes of X and Y do not affect the KS distances s_x and s_y. In addition, the DSI is invariant to location and scale transformations of the data points because such transformations are applied equally to all distances between data
2 Inexperiments, we used the scipy.stats.ks_2samp from the SciPy package in Python to
compute the KS distance. https://docs.scipy.org/doc/scipy/reference/generated/scipy.
stats.ks_2samp.html
points. Thus, the (normalized) histograms of ICD and BCD sets (as shown
in Figure 2.2) will not be changed, and the DSI keeps the same value.
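To make the procedure concrete, the following is a minimal Python sketch of the two-class DSI, following Definitions 2.1 and 2.2 and using SciPy's ks_2samp (the function named in the footnote) for the KS distance; the exact implementation used in this work may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def dsi_two_class(X, Y):
    """Distance-based Separability Index for a two-class dataset.
    X, Y: arrays of shape (n_samples, n_features), one per class."""
    d_x = pdist(X)               # ICD set {d_x} (Definition 2.1)
    d_y = pdist(Y)               # ICD set {d_y}
    d_xy = cdist(X, Y).ravel()   # BCD set {d_{x,y}} (Definition 2.2)
    s_x = ks_2samp(d_x, d_xy).statistic   # KS distance between {d_x} and {d_{x,y}}
    s_y = ks_2samp(d_y, d_xy).statistic   # KS distance between {d_y} and {d_{x,y}}
    return 0.5 * (s_x + s_y)

# Example: two classes drawn from the same uniform distribution give a DSI
# near zero; well-separated classes give a DSI near one.
rng = np.random.default_rng(0)
A = rng.uniform(size=(1000, 2))
B = rng.uniform(size=(1000, 2))
C = rng.uniform(size=(1000, 2)) + 5.0
print(dsi_two_class(A, B), dsi_two_class(A, C))
```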
2. Compute the n KS distances s_i between the ICD set of each class and its corresponding BCD set.
3. Calculate the average of the n KS distances; the DSI of this dataset is:
DSI({C_i}) = (∑_i s_i) / n
Remark. DSI ∈ (0, 1). A small DSI (low separability) means that the ICD and
BCD sets are very similar. In this case, by Theorem 2.1, the distributions
of datasets are similar too. Hence, these datasets are difficult to separate.
Theorem 2.1 below shows how the ICD and BCD sets are related to the distributions of the two-class data; it demonstrates the core value of this study.
Theorem 2.1. When |X| and |Y | → ∞, if and only if the two classes X and Y
have the same distribution, the distributions of the ICD and BCD sets are
identical.
According to this theorem, identical distributions of the ICD and BCD sets indicate that the dataset has maximum entropy because X and Y have the same distribution. Thus, as discussed before, the dataset has the lowest separability, and in this situation the dataset's DSI ≈ 0 by definition.
The time cost of computing the ICD and BCD sets increases linearly with the number of dimensions and quadratically with the number of data points. This is much cheaper than computing the dataset's entropy by dividing the space into many small regions. Our experiments (in Section 2.4.2.1) show that the time cost can be greatly reduced by using a small random subset of the entire dataset without significantly affecting the results (Figure 2.8). In practice, the computation of the DSI can be sped up considerably by using tensor-based matrix multiplications on a GPU (e.g., it takes about 2.4 seconds for 4,000 images from CIFAR-10 on a GTX 1080 Ti graphics card) because the main time cost is the computation of distances.
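As an illustration of that tensor-based speed-up, the sketch below assumes PyTorch (which is not named in the text) and computes the full Euclidean distance matrices on a GPU when one is available; the ICD and BCD sets are then read off the matrices.

```python
import torch

def distance_sets_gpu(X, Y):
    """Compute ICD and BCD sets with tensor operations (GPU if available).
    X, Y: float tensors of shape (n, d) and (m, d)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    X, Y = X.to(device), Y.to(device)
    # Full pairwise Euclidean distance matrices.
    d_xx = torch.cdist(X, X)                 # (n, n)
    d_yy = torch.cdist(Y, Y)                 # (m, m)
    d_xy = torch.cdist(X, Y).reshape(-1)     # BCD set {d_{x,y}}
    # ICD sets: strict upper triangles (each pair counted once, zeros excluded).
    iu_x = torch.triu_indices(len(X), len(X), offset=1, device=device)
    iu_y = torch.triu_indices(len(Y), len(Y), offset=1, device=device)
    return d_xx[iu_x[0], iu_x[1]], d_yy[iu_y[0], iu_y[1]], d_xy
```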
2.3.4 Proof of the Theorem
Consider two classes X and Y that have the same distribution (distribu-
tions have the same shape, position, and support, i.e., the same probability
density function) and have sufficient data points (|X| and |Y | → ∞) to fill
their support domains. Suppose X and Y have N_x and N_y data points, and assume the sampling density ratio is N_y/N_x = α. Before providing the proof of Theorem 2.1, we firstly prove Lemma 2.1, which will be used later.
Remark. The condition for most of the relevant equations in the proof is that N_x and N_y approach infinity in the limit.
Lemma 2.1. If and only if two classes X and Y have the same distribution covering region Ω and N_y/N_x = α, then for any sub-region ∆ ⊆ Ω, with X and Y having n_{x_i}, n_{y_i} points in it, n_{y_i}/n_{x_i} = α holds.
Proof. Assume the distributions of X and Y are f(x) and g(y). In the union region of X and Y, arbitrarily take one tiny cell (region) ∆_i with n_{x_i} = ∆_i f(x_i) N_x and n_{y_i} = ∆_i g(y_j) N_y, where x_i = y_j. Then,
n_{y_i}/n_{x_i} = α g(x_i)/f(x_i)
Therefore:
α g(x_i)/f(x_i) = α ⇔ g(x_i)/f(x_i) = 1 ⇔ ∀x_i: g(x_i) = f(x_i)
[Figure 2.4: Two non-overlapping tiny cells ∆_i and ∆_j of scale δ, containing points x_i, y_i and x_j, y_j and separated by the distance D_ij.]
Proof. Within the area, select two tiny non-overlapping cells (regions) ∆_i and ∆_j (Figure 2.4). Since X and Y have the same distribution but, in general, different densities, the numbers of points in the two cells, n_{x_i}, n_{y_i} and n_{x_j}, n_{y_j}, fulfill:
n_{y_i}/n_{x_i} = n_{y_j}/n_{x_j} = α
The scale of the cells is δ; the ICDs and BCDs of the X and Y data points within cell ∆_i are approximately δ because the cell is sufficiently small. By Definitions 2.1 and 2.2:
d_{x_i} ≈ d_{x_i,y_i} ≈ δ; x_i, y_i ∈ ∆_i
Similarly, the ICDs and BCDs of the X and Y data points between cells ∆_i and ∆_j are approximately the distance D_ij between the two cells:
d_{x_i,x_j} ≈ d_{x_i,y_j} ≈ d_{y_i,y_j} ≈ D_ij; x_i, y_i ∈ ∆_i; x_j, y_j ∈ ∆_j
First, divide the whole distribution region into many non-overlapping cells.
Arbitrarily select two cells ∆i and ∆ j to examine the ICD set for X and the
BCD set for X and Y . By Corollaries 2.1 and 2.2:
i) The ICD set for X has two distances, δ and D_ij, and their numbers are:
d_{x_i} ≈ δ; x_i ∈ ∆_i: |{d_{x_i}}| = (1/2) n_{x_i}(n_{x_i} − 1)
ii) The BCD set for X and Y also has two distances, δ and D_ij, and their numbers are:
d_{x_i,y_i} ≈ δ; x_i, y_i ∈ ∆_i: |{d_{x_i,y_i}}| = n_{x_i} n_{y_i}
d_{x_i,y_j} ≈ d_{y_i,x_j} ≈ D_ij; x_i, y_i ∈ ∆_i; x_j, y_j ∈ ∆_j
This means that the proportions of distances with value D_ij in the two sets are equal. We then examine the proportions of distances with value δ in the ICD and BCD sets.
For the proportions of distances with value δ, the ratio between the BCD and ICD sets is:
(∑_i n_{x_i}² / N_x²) · (N_x² − N_x)/(∑_i n_{x_i}² − N_x) = ∑_i [ (n_{x_i}²/N_x²) · (1 − 1/N_x)/(∑_i n_{x_i}²/N_x² − 1/N_x) ] → 1  (N_x → ∞)
This means that the proportions of distances with value δ in the two sets are also equal.
In summary, the fact that the proportion of any distance value (δ or Di j )
in the ICD set for X and in the BCD set for X and Y is equal indicates that the
distributions of the ICD and BCD sets are identical, and a corresponding
proof applies to the ICD set for Y .
Conversely, suppose that the two classes X and Y do not have the same distribution, but the distributions of the ICD and BCD sets are identical.
Proof. Suppose classes X and Y have N_x and N_y data points, with N_y/N_x = α. Divide their distribution area into many non-overlapping tiny cells (regions). In the i-th cell ∆_i, since the distributions of X and Y are different, according to Lemma 2.1, the numbers of points in the cell, n_{x_i} and n_{y_i}, fulfill:
n_{y_i}/n_{x_i} = α_i; ∃ α_i ≠ α
The scale of cells is δ and the ICDs and BCDs of the X and Y points in cell
∆i are approximately δ because the cell is sufficiently small.
For the distributions of the two sets to be identical, the ratio of the proportions of distances with value δ in the two sets must be 1; that is, (2.3)/(2.1) = (2.3)/(2.2) = 1. Therefore:
Similarly,
To eliminate ∑_i(α_i n_{x_i}²) by considering Equations (2.4) and (2.5), we have:
∑_i(n_{x_i}²) = ∑_i(α_i² n_{x_i}²) / α²
Let ρ_i = (α_i/α)²; then,
∑_i(n_{x_i}²) = ∑_i(ρ_i n_{x_i}²)
Since n_{x_i} could be any value, to hold the equation requires ρ_i = 1. Hence:
∀ρ_i = (α_i/α)² = 1 ⇒ ∀α_i = α
This contradicts ∃ α_i ≠ α. Therefore, the contrapositive proposition has been
proved.
2.4 Experiments
In this section, we present the results of the DSI and the other com-
plexity measures (listed in Table 2.1) for several typical two-class datasets3 .
Figure 2.5 displays their plots and histograms of the ICD sets (for Class 1
and Class 2) and the BCD set (between Class 1 and Class 2). Each class
consists of 1,000 data points.
Table 2.2 presents the results for these measures shown in Table 2.1 and
our proposed DSI. The measures noted by “*” are considered to have failed
in measuring separability and are not used for subsequent experiments.
In particular, the dimensionality and class-imbalance measures do not
3 These datasets are created by the Samples Generator in sklearn.datasets: https:
//scikit-learn.org/stable/modules/classes.html#samples-generator
Figure 2.5: Typical two-class datasets and their ICD and BCD set distributions (data plots with histograms of the ICD set for Class 1, the ICD set for Class 2, and the BCD set between the classes).
proposed measure (1 − DSI) are shown to accurately reflect the separability
of these datasets.
Table 2.2: Complexity measures results for the two-class datasets (Fig-
ure 2.5). The measures noted by “*” failed to measure separability.
In this section, we synthesize two-class datasets with different separability levels. Each dataset has two clusters, one for each class. The parameter controlling the standard deviation (SD) of the clusters influences separability (Figure 2.6), and the baseline is the TD we defined.
We created nine two-class datasets⁴; each dataset has 2,000 data points (1,000 per class) and two cluster centers, one for each class, and the SD parameter of the clusters is set from 1 to 9. As the cluster SD increases, the distributions of the two classes overlap and mix more, thus reducing the separability of the datasets.
We use a simple fully-connected neural network (FCNN) model to classify
4 By using the sklearn.datasets.make_blobs function in Python
these two-class datasets. This FCNN model has three hidden layers with 16, 32, and 16 neurons, respectively, and ReLU activation functions in each layer. The classifier was trained on each of the nine datasets, each time from scratch, and we set 1,000 epochs for each training session to compute the TD of each dataset.
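A minimal sketch of this experiment is shown below, assuming scikit-learn for the blob datasets (as in the footnote) and Keras for the FCNN; the cluster centers, optimizer, and batch size are illustrative assumptions rather than the exact settings used.

```python
import numpy as np
from sklearn.datasets import make_blobs
import tensorflow as tf

def training_difficulty(cluster_sd, epochs=1000):
    """Train the 16-32-16 FCNN on a two-class blob dataset and return TD,
    the training accuracy averaged over all epochs."""
    X, y = make_blobs(n_samples=2000, centers=[(0.0, 0.0), (10.0, 10.0)],
                      cluster_std=cluster_sd, random_state=0)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X, y, epochs=epochs, batch_size=64, verbose=0)
    return float(np.mean(history.history["accuracy"]))

# TD for the nine datasets with cluster SD from 1 to 9.
tds = [training_difficulty(sd) for sd in range(1, 10)]
```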
In this case, separability could be clearly visualized by the complexity of
the decision boundary. Figure 2.6 shows that datasets with a larger cluster
SD need more complex decision boundaries. In fact, if a classifier model can produce decision boundaries of any complexity, it can achieve 100% training accuracy for any dataset in which no two data points from different classes have identical features, but the number of training steps (i.e., epochs) required to reach 100% training accuracy may vary.
complex decision boundary may need more steps to train. Therefore, the
average training accuracy throughout the training process – i.e., TD – can
indicate the complexity of the decision boundary and the separability of the
dataset.
Since the training accuracy ranges from 0.5 to 1.0 for two-class classifi-
cation, to enable a comparison with other measures that range from 0 to 1,
we normalize the accuracy by the function:
r(x) = (x − 0.5)/0.5
and define rTD = r(TD). The range of rTD is from 0 to 1, and the lowest complexity
(highest separability) is 1. We also compute N2, N4, T1, LSC, Density, and
the proposed measure (1 − DSI) for the nine datasets and present them
together with rTD as a baseline for separability in Figure 2.7.
As shown in Figure 2.7, the rTD for datasets with larger cluster SDs
Figure 2.7: Complexity measures (N2, N4, T1, LSC, Density, and 1−DSI) and rTD for two-class datasets with different cluster SDs.
2.4.2 CIFAR-10/100 Datasets
The main time cost of DSI is to form ICD and BCD sets by calculating the
Euclidean (l 2 -norm) distances between any two data points. If N corresponds
to the number of points in a dataset and d stands for its dimensionality
(number of features), the time cost of DSI is O(d · N²), which is the same
as the comparable complexity measures: N2, N4, T1, LSC, and Density
(referring to the Table 1 in Lorena et al.’s paper [165]).
Images in the CIFAR-10 dataset are grouped into 10 classes and the
CIFAR-100 dataset consists of 20 super-classes. Both CIFAR-10 and CIFAR-
100 consist of N = 50,000 images (32x32, 8-bit RGB), and each image has d =
3,072 pixels (features). Thus, to apply the measures using all 50,000 images
would be very time-consuming (including the DSI, most of the measures have a time cost of O(d · N²)).
We randomly select subsets of 1/5, 1/10, 1/50, 1/100, and 1/500 of
the original training images (i.e., without pre-processing) from CIFAR-10
and compute their DSIs. For each subset, we repeat the random selection
and DSI computation eight times to calculate the mean and SD of DSIs.
Figure 2.8 shows that using a subset containing 1/50 of the training images or more does not significantly affect the measures. For example, the DSI for the
whole (50,000) training images is 0.0945, while the DSI for a subset of 1,000
randomly selected images is 0.1043 ± 0.0049 – the absolute difference is up to
0.015 (16%) but with an execution speed that is about 2,500 times greater:
computing the DSI for 1,000 images requires about 30 seconds; for the
whole training dataset, the DSI calculation requires about 20 hours. In
addition, because the same subset is used for all measures, the comparison
results are not affected. Therefore, we have randomly selected 1,000 training
images to compute the measures, and this subset still accurately reflects
the separability/complexity of the entire dataset.
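The repeated random-subsampling step can be sketched as follows; it assumes the dsi_two_class function from the earlier sketch is in scope (the CIFAR-10 experiment itself uses the n-class DSI over all ten classes, so this is only an illustration of the subsampling procedure).

```python
import numpy as np

def dsi_subsampled(X, Y, fraction, repeats=8, seed=0):
    """Mean and SD of the DSI over repeated random subsets of two classes.
    Assumes dsi_two_class (see the Section 2.3 sketch) is defined."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        ix = rng.choice(len(X), size=max(2, int(len(X) * fraction)), replace=False)
        iy = rng.choice(len(Y), size=max(2, int(len(Y) * fraction)), replace=False)
        scores.append(dsi_two_class(X[ix], Y[iy]))
    return float(np.mean(scores)), float(np.std(scores))
```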
2.4.2.2 Results
[Figure 2.9: Complexity measure results for CIFAR-10 (left) and CIFAR-100 (right); the x-axis lists the pre-processing methods.]
Figure 2.9 shows the results for CIFAR-10 and CIFAR-100. The x-axis
shows the pre-processing methods applied to the datasets, decreasingly
ordered from left to right by TD, which is the baseline of data separability.
Since a lower TD indicates lower separability and higher complexity, the
values of complexity measures should strictly increase from left to right.
We list the specific values of the measures in Tables 2.3 and 2.4 because some differences among the complexity measures' results are small and not obvious from the curves. By examining these values, we clearly find that the
measures LSC, T1 (which almost overlaps with LSC) and Density have high
values and remain nearly flat from left to right (insensitive), while N2 and
N4 decrease for the Contrast (2) pre-processing stage. Unlike the other
measures, (1−DSI) monotonically increases from left to right and correctly
reflects (and is more sensitive to) the complexity of these datasets. These
results show the advantage of DSI and indicate that image pre-processing
is useful for improving CNN performance in image classification.
2.5 Discussion
This work is motivated by the need for a new metric to measure the
difficulty of a dataset to be classified by machine learning models. This
measure of a dataset’s separability is an intrinsic characteristic of a dataset,
independent of classifier models, that describes how data points belonging
to different classes are mixed. To measure the separability of mixed data
points of two classes is essentially to evaluate whether the two datasets are
from the same distribution. According to Theorem 2.1, the DSI provides an
effective way to verify whether the distributions of two sample sets
are identical for any dimensionality.
As discussed in Section 2.3, if the DSI of sample sets is close to zero,
the very low separability means that the two classes of data are scattered
and mixed together with nearly the same distribution. The DSI transforms
the comparison of distributions problem in Rn (for two sample sets) to
the comparison of distributions problem in R1 (i.e., ICD and BCD sets) by
computing the distances between samples. For example, in Figure 2.5(a),
samples from Class 1 and 2 come from the same uniform distribution in R2
over [0, 1)2 . Consequently, the distributions of their ICD and BCD sets are
almost identical and the DSI is about 0.0058. In this case, each class has
1,000 data points. For twice the number of data points, the DSI decreases
to about 0.0030. When there are more data points of two classes from the
same distribution, the DSI will approach zero, which is the limit of the DSI
if the distributions of two sample sets are identical.
For another example, we equally divide 5,000 airplane-labeled images
from the CIFAR-10 dataset into two subsets: AIR1 and AIR2. We then take
a subset of 2,500 automobile-labeled images from the same dataset, named
AUTO. The DSI of the mixed set: AIR1 and AIR2 is about 0.0045. The DSI
of the mixed set: AIR1 and AUTO is about 0.1083. Since the images in AIR1
and AIR2 are from the same airplane class and could be considered having
the same distribution, the DSI of the AIR1 and AIR2 mixed set is closer to
zero.
In summary, to test whether two distributions are identical, we firstly
take labeled data as many as possible from the two distributions. We then
compute the DSI of these data and see how close the value is to zero. The
closer the DSI is to zero, the more likely the two distributions are similar.
Recall that the DSI is based on the Kolmogorov–Smirnov (KS) distance, $\mathrm{KS}(P, Q) = \sup_x |P(x) - Q(x)|$, where P and Q are the respective CDFs of the two distributions p and q.
Although many statistical measures, such as the Bhattacharyya distance,
Kullback–Leibler divergence, and Jensen–Shannon divergence, could be
used to compare the similarity between two distributions, most of them
require the two sets to have the same number of data points. It is easy to
show that the ICD and BCD sets (|{dx }|, |{dy }|, and |{dx,y }|) cannot be the
same size. For example, the f-divergence [196],
$$D_f(P, Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,$$
cannot be used to compute the DSI because the ICD and BCD have different
numbers of values, thus the distributions p and q are in different domains.
Measures based on CDFs can solve this problem because CDFs exist in
the union domain of p and q. Therefore, the Wasserstein-distance [212] (W-
distance) can be applied as an alternative similarity measure. For two 1-D
distributions (e.g., ICD and BCD sets), the result of W-distance represents
the difference in the area of the two CDFs:
$$W_1(P, Q) = \int |P(x) - Q(x)|\, dx$$
The DSI uses the KS distance rather than the W-distance because we find
that normalized W-distance is not as sensitive as the KS distance for mea-
suring separability. To illustrate this, we compute the DSI by using the two
distribution measures for the nine two-cluster datasets in Section 2.4.1.2.
The two DSIs are then compared by the baseline rTD, which is also used
in Section 2.4.1.2. Figure 2.10 shows that along with the separability of
the datasets decreasing, KS distance has a wider range of decrease than
the W-distance. Hence, the KS distance is considered a better distribution
measure for the DSI in terms of revealing differences in the separability of
datasets.
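The sensitivity comparison can be illustrated with SciPy's implementations of the two measures; the samples below are synthetic stand-ins for an ICD set and a BCD set, and the example is not the dissertation's experiment.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(1)
# Two 1-D samples standing in for an ICD set and a BCD set (sizes may differ)
icd = rng.normal(loc=1.0, scale=0.3, size=5000)
bcd = rng.normal(loc=1.2, scale=0.3, size=8000)

ks = ks_2samp(icd, bcd).statistic      # sup_x |P(x) - Q(x)|
w1 = wasserstein_distance(icd, bcd)    # area between the two CDFs
print(f"KS = {ks:.3f}, W1 = {w1:.3f}")
```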
[Figure 2.10: DSI computed with the KS distance and with the W-distance, together with the rTD baseline, for the nine two-cluster datasets (x-axis: Cluster_SD, 1–9).]
[Figure: DSI computed using different distance metrics (Euclidean, City-block, Chebyshev, Correlation, Cosine, Mahalanobis), together with the rTD baseline, for the nine two-cluster datasets (x-axis: Cluster_SD, 1–9).]
The DSI can also be used to evaluate GANs, complementing measures such as the Inception Score (IS) and Fréchet Inception Distance (FID) [105]. As with the FID,
measuring how close the distributions of real and GAN-generated images are
to each other is an effective approach to assess GAN performance because
the goal of GAN training is to generate new images that have the same
distribution as real images. We have also applied the DSI in Section 3.5
as an internal cluster validity index (CVI) [9] to evaluate clustering results:
because the goal of clustering is to separate a dataset into clusters, from
a macro perspective how well a dataset has been separated can be
indicated by the separability of the resulting clusters.
By examining the similarity of the two distributions, the DSI can detect
(or certify) the distribution of a sample set, i.e., distribution estimation.
Several distributions could be assumed (e.g., uniform or Gaussian) and a
test set is created with an assumed distribution. The DSI could then be
calculated using the test and sample sets. The correct assumed distribution
will have a very small DSI (i.e., close to 0) value. In addition to the men-
tioned applications, DSI can also be used as a feature selection method for
dimensionality reduction and an anomaly detection method in data analysis.
DSI has broad applications in deep learning, machine learning, and data
science beyond direct quantification of separability.
The DSI could also help to understand how data separability changes
after passing through each layer of a neural network. As an example, we
reuse the three-layer FCNN model and nine datasets from Section 2.4.1.2.
An FCNN model is trained using a single dataset. We then input the data
into the trained model and record the output from each layer. Finally, we
compute the DSI of every layer's output and of the input data. As shown in
Figure 2.12, for every dataset the DSI of the final output is higher than that of
the input, which indicates that the classifier improves the separability of the
data. Some DSIs
Figure 2.12: DSIs of input and output data from each layer in the FCNN
model for nine datasets from Section 2.4.1.2. The x-axis represents the
outputs from layers of the FCNN: input layer, three hidden layers, and
output layer. The y-axis represents the DSI values of output. Plots are for
the nine datasets.
of output from hidden layers, however, are even smaller than that of the
input data. This phenomenon is non-intuitive because it is assumed that
hidden layers improve separability and increase the DSI continuously. A
possible reason for this is that the dimensionality of the data increases in the
hidden layers. The dimension of the input data is two, and it changes to
16, 32, 16, and 1 at the outputs of successive layers, following the numbers of
neurons in those layers. In higher-dimensional space, data may be coded by
fewer features or mapped closer to each other, and thus the separability decreases.
Although DSI works for any dimensionality, dimensionality can affect data
distributions and the measurement of distance, which is known as the
curse of dimensionality [228], thus affecting the DSI. More studies should
address the impact of dimensions on DSI and how to compare separability
across different numbers of dimensions.
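The per-layer experiment can be sketched as follows, assuming a Keras FCNN with hidden sizes 16, 32, and 16 as described, and the dsi_two_class helper from the earlier sketch; the toy dataset and training settings are illustrative, not the dissertation's setup.

```python
import numpy as np
import tensorflow as tf

# Toy two-class 2-D dataset standing in for one of the nine datasets
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(2, 1, (500, 2))]).astype("float32")
y = np.array([0] * 500 + [1] * 500)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y.astype("float32"), epochs=20, verbose=0)

# Record the output of every layer and compute the DSI of each representation
# (dsi_two_class is the helper from the earlier sketch).
for i, layer in enumerate(model.layers):
    sub = tf.keras.Model(model.input, layer.output)
    feat = sub.predict(x, verbose=0)
    print(f"layer {i + 1}: DSI = {dsi_two_class(feat[y == 0], feat[y == 1]):.3f}")
```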
The relationship between the DSI value and the actual
separability of data in some situations also remains to be characterized. For example,
when the separability changes, the change in DSI is nonlinear, and the DSI
of linearly separable data is usually not 1 (e.g., Figure 2.5f).
2.6 Conclusion
In this chapter, we proposed the DSI, compared it with existing complexity
measures, and demonstrated its competitiveness as an effective measure.
In addition to its uses as a separability measure and as an evaluator of
GANs and of clustering results (all as shown by our studies), DSI has the
clear potential to be applied in other important areas, such as distribution
estimation, feature selection, and anomaly detection.
Chapter 3: Hyperspectral Images-based Cardiac Ablation Lesion
Detection Using Unsupervised Learning1
[Figure: Schematic of the catheter-based visualization system: a steerable handle, inflatable balloon, 355 nm illumination through a light guide, an image guide, and a camera with a bandpass filter viewing the ablated tissue.]
To evaluate clustering results without true labels, many internal cluster
validity indices, which use predicted labels and data, have been created.
Without true labels, designing an effective Cluster Validity Index (CVI) is as
difficult as creating a clustering method, and it is crucial to have more CVIs
because there is no universal CVI that can measure all datasets and no
specific method for selecting a proper CVI for clusters without true labels.
Therefore, applying a variety of CVIs to evaluate clustering results is
necessary. In this chapter, we apply the Distance-based Separability
Index (DSI), introduced previously, as a novel internal CVI. We compared
the DSI with eight internal CVIs, from the early Dunn index (1974) to the
most recent CVDD (2019), and with an external CVI used as ground truth,
using the clustering results of five clustering algorithms on 12 real and 97
synthetic datasets. The results show that the DSI is an effective, unique, and
competitive CVI compared with the other CVIs. We also summarize the general
process for evaluating CVIs and create the rank-difference metric for comparing
CVIs' results.
3.1 Introduction of the Autofluorescence-based Hyperspectral Imag-
ing
for live monitoring [198, 197]. Therefore, another visualization approach
called autofluorescence hyperspectral imaging (aHSI) [284, 213, 140, 86]
has been explored. The previous studies have shown that hyperspectral
imaging can circumvent this limitation [86, 185, 184].
[Figure 3.2 (hardware): UV LED illumination, lens, liquid crystal tunable filter (LCTF) for wavelength (λ) scanning, and a CCD detector.]
To implement the aHSI approach during the RFA procedure, one has to
deliver ultraviolet (UV) light (λ = 365 nm) to the heart by an optical fiber
threaded into a percutaneous catheter [178, 144]. This allows illumination
of the endocardial atrial surface, which is highly autofluorescent. The
autofluorescence signal is then detected through the image guide and the
attached HSI camera system, which forms a stack of images acquired at
individual wavelengths. Figure 3.2 shows a diagram of such a system,
while Figure 3.3 illustrates the hypercube construction. The hypercubes
contain rich spectral information about the tissue. Our previous studies
have shown that subtle changes in the tissue autofluorescence profiles
can help to identify the ablated regions in both animal and human atrial
tissue [86, 185]. In those studies, we had to pre-acquire target spectra for
lesion and non-lesion sites before applying linear unmixing [284], since it
is a supervised learning method. The first objective of this work was to
apply an unsupervised learning method, k-means clustering, to detect RFA
atrial lesions without a priori knowledge about tissue spectra. Our second
objective was to use k-means clustering to select the minimal number of
spectral bands (feature groups) without significantly reducing the accuracy
of lesion detection. This is important for future implementation of an
intracardiac aHSI catheter, since it is beneficial to decrease the number
of spectral images within the hypercube while preserving the method’s
ability to reveal the lesions. First, having fewer images will speed-up both
acquisition and processing, enabling us to visualize the ablated areas in
real time. Secondly, by widening spectral bands around the most useful
wavelengths, one can collect more photons and make the output images
more robust to noise.
Atria from freshly excised porcine hearts were ablated by a non-irrigated
RF ablation catheter (Boston Scientific). Several lesions were created on
one tissue sample. Atria were illuminated with a 365nm UVA LED (Mightex,
Pleasanton, CA) placed 10 cm from the tissue surface. A CCD camera
outfitted with a Nikon AF Micro-Nikkor 60mm f/2.8D objective and a liq-
uid crystal tunable filter (LCTF, Nuance FX, PerkinElmer/CRi) was used
to acquire hypercubes of the samples. The LCTF was tuned to pass the
wavelengths from 420 to 720 nm at continuous wavelengths separated by
the filter’s band interval, 10 nm; this yielded 31 channels. As shown in
57
Spectral trace from λi
pixel with xi, yi
coordinates
xi
yi
Y: spatial dimension
X: spatial dimension
Figure 3.2, through the LCTF, a lens projects the collected light onto a
CCD containing 1392×1040 pixels. Finally, the hypercube for each sam-
ple was constructed from the 31 auto-fluorescence images, each of size
1392×1040. 10 samples were used in this study; therefore, we collected
310 auto-fluorescence images in total.
Pre-processing of each hypercube included a correction step for the
LCTF [284] (correction), followed by normalization, which converted the values
of each spectrum to the range from 0 to 1 (Figure 3.4a). Normalization is
critical because, for the classification algorithm, it is the overall shape of the
spectrum that matters rather than the absolute light intensity at each wavelength.
For normalization, the maximum value was set to 1 and the minimum value to 0.
More details on the importance of the normalization step are given in the
earlier study [86].
Then, we reshaped the 3D hypercube to a 2D matrix according to the
rule shown in Figure 3.4b: for every point on the X-Y plane (a pixel), the
data along the spectral dimension were considered as a vector in the new
2D matrix; the pixels were ordered from left to right in the first row (upper
left), then in the second row and so forth. The spectrum of each pixel in the
X-Y plane was represented as a vector in the matrix; the matrix therefore
had 31 columns corresponding to 31 spectral bands (420-430 nm, 430-440
nm, . . . , 710-720 nm). Hereafter, we refer to each pixel as a sample; each
sample is a vector of 31 features.
[Figure 3.4: (a) the 31 W×L raw TIFF slices; (b) the pixel-ordering rule (row by row) used to reshape the hypercube into a 2D matrix.]
3.2.1 K-means Clustering
In our case, we have 1040 × 1392 = 1447680 samples for each dataset. We
performed k-means clustering, in which the value of k is unknown initially
and determined by experiments. Each pixel was labeled by its cluster. Then,
we assigned colors to these numbers to allow visualization of the clusters.
The procedure is shown in Figure 3.5.
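A minimal sketch of this pixel-wise clustering step is shown below, assuming the corrected and normalized hypercube is available as a NumPy array of shape (1040, 1392, 31); the names and the use of scikit-learn's KMeans are illustrative, and clustering all 1,447,680 pixels is slow in practice.

```python
import numpy as np
from sklearn.cluster import KMeans

H, W, B = 1040, 1392, 31                 # spatial size and number of bands
hypercube = np.random.rand(H, W, B)      # placeholder for a corrected, normalized hypercube

# Reshape: each pixel's 31-band spectrum becomes one row (one sample)
samples = hypercube.reshape(-1, B)       # shape (1447680, 31)

k = 10                                   # chosen from the accuracy-vs-k experiment
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(samples)
label_image = labels.reshape(H, W)       # back to an image of cluster labels
```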
[Figure 3.5: Each pixel's feature vector Vi (its 31-band spectrum) is assigned a label li ∈ {1, 2, ..., k} by k-means, and the label vector is reshaped back into a W×L label image.]
3.2.2 Evaluation and Results
To be able to evaluate any lesion detection method, one must have sets
of images in which the lesions are labeled. This section describes the
construction of such sets.
(a) (b)
Figure 3.6: Appearance of ablated tissue after: (a) linear unmixing from
aHSI system, (b) TTC staining.
Previous studies have shown that lesions detected by the linear unmixing
algorithm based on pre-acquired spectral libraries closely match RF lesions
in the corresponding TTC image (TTC-Reference) [284, 213]. Figure 3.6b
shows one such example. Therefore, we considered the lesion component
image obtained using linear unmixing of a 31-band hypercube as a reference
image; this is called Gray-Unmixing-Reference.
The Gray-Unmixing-Reference has a continuous gray-scale, and lesions
are brighter (have larger gray values) than non-lesion areas. To create a new
image that identifies unambiguously the lesion and non-lesion pixels, we
used a gray-level threshold. The threshold was found by Otsu’s method [98],
which uses the image’s histogram to find the threshold that maximizes the
between-class variance. The pixels with intensities greater than the threshold
were then labeled lesion; all others were considered non-lesion.
Having this binary (two-class) image (the Bi-Unmixing-Reference) enabled
us to quantitatively evaluate the k-means approach.
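As a brief illustration, assuming the Gray-Unmixing-Reference is available as a NumPy array named gray_unmixing_ref, the thresholding step might be sketched with scikit-image's Otsu implementation; the names are illustrative.

```python
import numpy as np
from skimage.filters import threshold_otsu

gray_unmixing_ref = np.random.rand(1040, 1392)        # placeholder reference image

t = threshold_otsu(gray_unmixing_ref)                  # maximizes between-class variance
bi_unmixing_ref = (gray_unmixing_ref > t).astype(np.uint8)  # 1 = lesion, 0 = non-lesion
```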
The k-means clustering yielded an image in which each pixel was labeled
with an integer from ‘1’ to ‘k’. For finding the label of lesions, we recorded the
locations of all lesion pixels (whose value is ‘1’) in the Bi-Unmixing-Reference.
Then, we examined all the corresponding pixels (those having the same
locations) in the k-means image. Since every pixel has a label (cluster
number) after k-means clustering, we can calculate the modal (most-often
occurring) label of these sample pixels as the label of lesions; all other labels
represent non-lesions. Finally, in the clustering image, all pixels having the
label of lesion were set to value ‘1’; and other pixels (non-lesion) were set
to value ‘0’. So, we obtained the binary image (the Bi-31-Result) for lesion
detection by k-means clustering using 31 features.
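A sketch of this label-mapping step is given below, assuming label_image from the k-means sketch and bi_unmixing_ref from the thresholding sketch; the names are illustrative.

```python
import numpy as np

# Cluster labels of the pixels that the reference marks as lesion
lesion_clusters = label_image[bi_unmixing_ref == 1]

# Modal (most frequently occurring) cluster label is taken as the lesion label
lesion_label = np.bincount(lesion_clusters.ravel()).argmax()

# Binary detection result: 1 where k-means assigned the lesion cluster
bi_31_result = (label_image == lesion_label).astype(np.uint8)
```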
The accuracy of the k-means detection was then evaluated by
comparing the Bi-31-Result (e.g., lesions colored red in Figure 3.7d) with
the outcome of linear unmixing (lesion areas in Bi-Unmixing-Reference; e.g.,
white regions in Figure 3.7f).
The Binary-Unmixing-Reference (IRef ) and Binary-31(features)-Result
(I31Rlt ) are binary images having the same size; if the value of a given pixel
was different in the two images, it was declared to be a ‘miss’. Accuracy
index (Acc) was defined as 1 minus the ratio of the number of ‘misses’ (Diff) to
the total number (N) of pixels of lesion areas in the two (detected and truth)
images:
$$\mathrm{Acc}(I_{Ref}, I_{31Rlt}) = 1 - \frac{\mathrm{Diff}}{N}$$
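Under one reading of this definition (Diff counted over all differing pixels, N the number of pixels that are lesion in either image), the accuracy index could be sketched as follows; this interpretation is an assumption, and the names are illustrative.

```python
import numpy as np

def acc_index(i_ref, i_det):
    """Accuracy index between two binary lesion images (one reading of Acc).

    diff: pixels whose values differ between the two images.
    n   : pixels labeled lesion in either image (assumed interpretation of N).
    """
    diff = np.count_nonzero(i_ref != i_det)
    n = np.count_nonzero((i_ref == 1) | (i_det == 1))
    return 1.0 - diff / n

print(acc_index(bi_unmixing_ref, bi_31_result))   # arrays from the earlier sketches
```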
If the accuracy was acceptable, we could use the lesions that were detected
by k-means using 31 features as a reference (Bi-31-Result) to evaluate the
outcomes after the next step: feature grouping (Section 3.3.1).
For porcine samples (we have 10 datasets of samples in total) that encom-
passed an area of 1392 by 1040 pixels, Figure 3.7a shows that k=5 is not
sufficient to distinguish ablated regions for this sample (Set-1). To find the
optimal k, we computed k ranging from 2 to 41 for all our porcine datasets.
For each k, we plotted the maximum, average, and minimum accuracies
over the 10 datasets in Figure 3.8.
Because a smaller k will make k-means run faster, we seek the smallest
k that is effective. Figure 3.8 indicates that k=10 is overall optimal: it is the
smallest k that almost reaches all the highest values for maximum, average,
and minimum accuracies. As illustrated in Figure 3.7b, k=10 is effective
for the Set-1 sample.
A set of 31 aHSI planes was required to obtain the lesion detection
Figure 3.7: Results for porcine atria (Set-1) clustered by k-means into: (a) 5
clusters and (b) 10 clusters. Panel (c) shows an auto-fluorescence image at
500 nm; (d) shows the lesion areas (red) detected when k=10, superimposed
on the image in (c). The corresponding lesion component image, which is
from the unmixed image that contains lesion component and non-lesion
component, is shown in (e); followed by binary image obtained from (e) by
applying Otsu’s thresholding (f).
Figure 3.8: Maximum, average, and minimum accuracies over 10 datasets
for each k.
Dataset# 1 2 3 4 5 6 7 8 9 10
Acc (IRef , I31Rlt ) 0.87 0.91 0.38 0.80 0.69 0.82 0.66 0.87 0.66 0.75
The goal of feature grouping was to decrease the number of spectral bands
without reducing the accuracy of lesion detection appreciably.
[Figure: The 31 spectral features (bands from 420 to 720 nm) divided into four contiguous groups.]
There are 4,060 ways to divide 31 features into four separate and contigu-
ous groups. The 31 features are the intensities at each of the wavelengths
from 420 to 720 nm. The goal was to find the best 4-feature groupings
from the 4,060 possible combinations to adequately detect the lesion areas.
That number is sufficiently small that we could construct every possible
grouping and get its detection result (the Bi-4-Result).
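The 4,060 groupings correspond to choosing 3 divider positions among the 30 gaps between adjacent bands (C(30, 3) = 4060). A sketch of enumerating them and averaging the bands within each group is shown below; the helper names are illustrative, and the example dividers reproduce the grouping listed for Figure 3.11(b).

```python
import numpy as np
from itertools import combinations

wavelengths = list(range(420, 721, 10))          # 31 band centers, 420-720 nm
n_bands = len(wavelengths)                       # 31

# Every way to place 3 dividers in the 30 gaps -> 4 contiguous groups
groupings = list(combinations(range(1, n_bands), 3))
print(len(groupings))                            # 4060

def group_features(samples, dividers):
    """Average the 31-band spectra within each of the 4 contiguous groups."""
    edges = [0, *dividers, n_bands]
    return np.stack([samples[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

# Example: dividers after 510, 600, and 630 nm -> groups 420-510, 520-600, 610-630, 640-720
samples = np.random.rand(1000, n_bands)          # placeholder pixel spectra
four_feats = group_features(samples, (10, 19, 22))   # shape (1000, 4)
```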
We assigned a Serial Number (SN) to each combination. The boundaries
between groups (the dividers) were described by the last feature's number
in the 1st, 2nd, and 3rd group (Table 3.2). “720” is not shown because it is
always the last feature's number in the 4th group.
Table 3.2: Combinations of 4 groups.
We assess the Binary-
4(features)-Result (I4Rlt ) by comparing them to the Binary-31(features)-Result
(I31Rlt ) to yield the accuracy: Acc (I31Rlt , I4Rlt ) whose calculation method was
the same as Acc (IRef , I31Rlt ).
Figure 3.10: Accuracies of SNs for one dataset (Set-1).
Figure 3.11: Feature grouping results for porcine atria (Set-1): (a) k-means
clustering (k=10) by using all 31 features; (b) k-means clustering (k=10)
by using four features from 4-feature grouping (SN=2857): [wavelength
groups: 420-510, 520-600, 610-630, 640-720 nm]; (c) k-means clustering
(k=10) by using four features from different 4-feature grouping (SN=3716):
[wavelength groups: 420-580, 590-600, 610-680, 690-720 nm].
Figure 3.12: Feature grouping accuracies for 10 datasets; each row repre-
sents a dataset.
One method takes, for each grouping, the worst (minimum) accuracy over all datasets, $\min_i A^i_{SN}$, where $A^i_{SN}$ is the accuracy of the i-th dataset and SN is the serial number.
The result (Figure 3.13) shows that there are many 4-feature groupings
that perform well (peaks) across all samples we tested. But taking the worst
performance has a flaw: if there exists one bad dataset, reducing accuracies
of all feature groupings appreciably, it will influence the result greatly. If
we plot the smoothed (moving-average) minimum accuracies, an obvious periodic
dependence on the serial numbers appears. Figure 3.14 shows the locations
(wavelength) of dividers and the smoothed minimum accuracy value (scaled)
to each SN. And, by looking at this figure, one can notice that such periods
are defined by the first divider.
Figure 3.13: Accuracies over 10 datasets.
We also noticed that the SN range in the
green area (2730-3245), in which the divider 1 ranges from 510 to 530 nm,
includes most high-accuracy combinations.
Another method is to design an evaluation function to reflect the average
performance of a feature grouping combination through all samples:
$$E_{SN} = \prod_i \frac{1}{1 - A^i_{SN}}$$
where AiSN is the accuracy of i-th dataset and SN is the serial number of
combination.
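A small sketch of this evaluation function, assuming accuracies is an array of shape (number of datasets, 4060) holding the values A^i_SN; the names are illustrative.

```python
import numpy as np

accuracies = np.random.uniform(0.5, 0.99, size=(10, 4060))  # placeholder A^i_SN values

# E_SN = prod_i 1 / (1 - A^i_SN); large whenever the accuracies are close to 1
e_sn = np.prod(1.0 / (1.0 - accuracies), axis=0)

# Worst-case (minimum-accuracy) alternative discussed in the text
min_acc = accuracies.min(axis=0)

best_sn = int(np.argmax(e_sn))     # serial number with the highest E_SN
```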
By this formula, a feature grouping combination will get a large score if
its accuracy is close to 1. Thus, this nonlinear function emphasizes high
accuracy values. Since the maximum value of accuracy in this work was
less than 0.99, ESN is bounded. By comparing the two results (Figure 3.13
and Figure 3.15), we observe that the good-performance combinations are
similar. The evaluation function could find a better grouping for all tested
datasets than the minimum accuracy method, but the grouping that we
obtained from the max-min method could be more stable for new datasets
because it provides a reliable lower bound on accuracy.
Figure 3.14: Smoothed Minimum accuracies (scaled) with the 3 dividers.
The green area (SN: 2730-3245) includes most high-accuracy combinations.
From this evaluation, the highest value (red point in Figure 3.15) is
obtained with the 4-feature grouping [420-520, 530-590, 600-640, 650-720
nm] (SN=3018). Using this grouping, we found the accuracies of 4-feature
clustering results for all datasets (Table 3.3).
Dataset# 1 2 3 4 5 6 7 8 9 10
Acc (I31Rlt , I4Rlt ) 0.97 0.99 0.96 0.96 0.92 0.93 0.96 0.93 0.89 0.95
[Figure 3.15: E_SN for all feature-grouping combinations; the highest value marks the selected grouping.]
[Figure: Study workflow: ablated tissue is imaged by the aHSI system and analyzed by TTC staining (truth data); linear unmixing results are compared with k-means clustering to decide k and the feature groups, yielding the detected lesion areas.]
In summary, feature grouping can speed
up the processing while maintaining good accuracy of lesion detection.
3.3.4 Discussion
We used the k-means clustering method to find the lesion sites and
compared the outcomes to those using linear unmixing. Since k-means is
an unsupervised learning algorithm, we did not require a priori knowledge of
lesion spectra. In contrast, the supervised learning methods do require such
knowledge about the lesion to construct the training set containing labeled
spectra. In practice, k-means assigns lesion and non-lesion areas to different
clusters, which are then displayed in different colors. The outcome of k-means verified our
hypothesis that the spectra of ablated tissue are different from those of
non-ablated tissue. Also, it confirmed that the auto-fluorescence images
contain information about the components and structure of tissues [284].
Although k-means clustering is in general repeatable, one disadvantage is
that the detected lesion regions may vary slightly between clustering runs
on a given dataset. That is a characteristic of k-means: the initial cluster
centers are selected randomly, and the clustering result may be
affected by the choice of initial points.
Alternatively, we could apply supervised learning methods. A classifier
would be trained through labeled lesion and non-lesion spectral data. One
advantage of supervised classification is that the regions of lesion detected
by a classifier model are invariant for a given dataset. Though the time for
training a classifier might be greater, the lesion detection process by using
the classifier would be faster unless the classifier model is very complicated
(non-linear and in high dimension). But its disadvantage is that one will
require a large amount of labeled lesion and non-lesion spectral data for
such training.
To evaluate the results presented in this study, we compared the out-
comes after feature grouping with the results before feature grouping (I4Rlt
vs. I31Rlt ). Additionally, the results before feature grouping were verified
by comparing them with the outcomes of linear unmixing (I31Rlt vs. IRef ).
A direct comparison between the outcomes of k-means and TTC staining
would have been ideal, but this presents practical problems. First, the
chemical reaction that occurs during TTC staining makes ablated tissues
shrink to a certain degree. Secondly, images taken after TTC staining are
not taken at exactly the same orientation, so an exact comparison is not
possible, even when image registration methods are used. But since the
main goal of this study was to find the best feature grouping, direct com-
parison with TTC was not necessary; and we have previously reported a
direct comparison between lesion surface areas of the lesions in TTC images
and those obtained in Gray Unmixing Images [284, 124]. By computing the
difference between detected lesions before and after feature grouping, we
still were able to achieve our goal.
3.4 Introduction of Cluster Validation in Unsupervised Learning
Like the k-means we used for previous lesion detection, cluster analysis
is an important unsupervised learning method in machine learning. The
clustering algorithms divide a dataset into clusters [123] based on the dis-
tribution structure of the data, without any prior knowledge. Clustering
is widely studied and used in many fields, such as data mining, pattern
recognition, object detection, image segmentation, bioinformatics, and data
compression [220, 282, 92, 57, 132, 173]. The shortage of labels for training
is a big problem in some machine learning applications, such as medical
image analysis and big-data applications [188], because labeling is ex-
pensive [112]. Since unsupervised machine learning does not use labels for
training, applying cluster analysis can avoid this problem.
Clustering results are evaluated by external and internal
validations. External validations use the true class labels and predicted
labels, and internal validations use predicted labels and data. Since
external validations require true labels and there are no true class labels in
unsupervised learning tasks, we can employ only the internal validations in
cluster analysis [162]. In fact, evaluating clustering results by internal vali-
dation is as difficult as clustering itself, because the measures have no more
information than the clustering methods do [205]. Therefore, designing an
internal Cluster Validity Index (CVI) is as difficult as creating a clustering
algorithm. The difference is that a clustering algorithm can update its
clustering results using a value (loss) from an objective function, whereas a
CVI provides only a single value for evaluating the clusters.
Various CVIs have been created for the clustering of many types of
datasets [55]. By methods of calculation [114], the internal CVIs are based
on two categories of representatives: center and non-center. Center-based
internal CVIs use descriptors of clusters. For example, the Davies–Bouldin
(DB) index [50] uses cluster diameters and the distance between cluster
centroids. Non-center internal CVIs use descriptors of data points. For
example, the Dunn index [68] considers the minimum and maximum dis-
tances between two data points.
Besides the DB and Dunn indexes, in this paper, some other typical
internal CVIs are selected for comparison. The Calinski-Harabasz (CH)
index [33] and Silhouette coefficient (Sil) [225] are two traditional internal
CVIs. In recently developed internal CVIs, the I index [176], WB index [305],
Clustering Validation index based on Nearest Neighbors (CVNN) [162], and
Cluster Validity index based on Density-involved Distance (CVDD) [114] are
selected. Eight typical internal CVIs, which range from early studies (Dunn,
1974) to the most recent studies (CVDD, 2019), are selected to compare
with our proposed CVI.
In addition, an external CVI, the Adjusted Rand Index (ARI) [234], is se-
lected as the ground truth for comparison because external validations use
the true class labels and predicted labels. Unless otherwise indicated, the
"CVIs" that appear hereafter refer to internal CVIs, and the only external
CVI used is the ARI.
A small DSI (low separability) of classes X and Y means that their ICD
and BCD sets are very similar. In this case, the distributions of classes X
and Y are similar too. Hence, data of the two classes are difficult to separate.
An example of a two-class dataset is shown in Figure 3.18. Figure 3.18a
shows that, if the labels are assigned correctly by clustering, the distri-
butions of ICD sets will be different from the BCD set and the DSI will
reach the maximum value for this dataset because the two clusters are
well separated. For an incorrect clustering, in Figure 3.18b, the difference
between distributions of ICD and BCD sets becomes smaller so that the
DSI value decreases. Figure 3.18c shows an extreme situation, that is, if
all labels are randomly assigned, the distributions of the ICD and BCD sets
will be nearly identical. It is the worst case of separation for the two-class
dataset and its separability (DSI) is close to zero. Therefore, the separability
of clusters can be reflected well by the proposed DSI. The DSI ranges from
0 to 1, DSI ∈ (0, 1), and we suppose that a greater DSI value means the
dataset is clustered better.
CVIs are used to evaluate the clustering results. In this study, several
internal CVIs including the proposed DSI have been employed to examine
the clustering results from different clustering methods (algorithms). Using
different clustering methods on a given dataset may yield different clustering
results, and thus CVIs are used to select the best clusters. We choose eight
commonly used (classical and recent) internal CVIs and an external CVI -
the Adjusted Rand Index (ARI) to compare with our proposed DSI (Table 3.4).
The ARI serves as the ground truth for comparison because it involves the
true labels (clusters) of the dataset.
In this study, the synthetic datasets for clustering are from the Tomas
Barton repository 4 , which contains 122 artificial datasets. Each dataset
has hundreds to thousands of objects with several to tens of classes in two
or three dimensions (features). We have selected 97 datasets for experiment
4 https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/
datasets/artificial
because the 25 unused datasets have too many objects to run the clustering
processing in reasonable time. The names of the 97 used synthetic datasets
are shown in Appendix B. Illustrations of these datasets can be found in
Tomas Barton's homepage 5 .
[Figure 3.18: A two-class example dataset under (a) correct labeling (DSI ≈ 0.645), (b) an incorrect clustering, and (c) randomly assigned labels.]
Table 3.4: Compared CVIs.
The 12 real datasets used for clustering are from three sources: the
sklearn.datasets package 6 , UC Irvine Machine Learning Repository [58]
and Tomas Barton’s repository (real world datasets) 7 . Unlike the synthetic
datasets, the dimensions (feature numbers) of most selected real datasets
are greater than three. Hence, CVIs must be used to evaluate their clustering
results rather than plotting clusters as for 2D or 3D synthetic datasets.
Details about the 12 real datasets appear in Table 3.5.
5 https://github.com/deric/clustering-benchmark
6 https://scikit-learn.org/stable/datasets
7 https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/
datasets/real-world
Table 3.5: The description of used real datasets.
Table 3.6: CVI scores of clustering results on the wine recognition dataset.
Validity a \ Clustering method:   KMeans   Ward Linkage   BIRCH   Spectral Clustering   EM
ARIb + 0.913c 0.757 0.880 0.790 0.897
Dunn + 0.232 0.220 0.177 0.229 0.232
CH + 70.885 68.346 70.041 67.647 70.940
DB - 1.388 1.390 1.391 1.419 1.389
Silhouette + 0.284 0.275 0.283 0.278 0.285
WB - 3.700 3.841 3.748 3.880 3.700
I+ 5.421 4.933 5.326 4.962 5.421
CVNN - 21.859 22.134 21.932 22.186 21.859
CVDD + 31.114 31.141 29.994 30.492 31.114
DSI + 0.635 0.606 0.629 0.609 0.634
a. CVI for best case has the minimum (-) or maximum (+) value. b. The first
row shows results of ARI as ground truth; other rows are CVIs. c. Bold
value: the best case by the measure of this row.
Table 3.7: Hit-the-best results for the wine dataset.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
wine 1 0 1 0 1 1 1 0 1
a. Sil = Silhouette.
1. Find the minimum and maximum values of N scores from one se-
quence.
5. Define the rank number of the maximum as 1; intervals are left-open and
right-closed: (upper value, lower value].
Table 3.8: Rank sequences of CVIs converted from the score sequences in
Table 3.6.
Validity \ Clustering method:   KMeans   Ward Linkage   BIRCH   Spectral Clustering   EM
ARIa 1 4 1 4 1
Dunn 1 1 4 1 1
CH 1 4 2 4 1
DB 1 1 1 4 1
Silhouette 1 4 1 3 1
WB 1 4 2 4 1
I 1 4 1 4 1
CVNN 1 4 1 4 1
CVDD 1 1 4 3 1
DSI 1 4 1 4 1
a. The first row shows results of ARI as ground truth; other rows are CVIs.
changed to 7.1 and 6.9, their rank numbers will still be 1 and 2 even though
they are very close).
Remark. If the score whose rank number is 1 (the 1-rank score) is to represent the
optimal performance, then assigning rank 1 to the maximum CVI score works
only for CVIs whose optimum is a maximum; it does not work for CVIs whose
optimum is a minimum, such as DB and WB, because their 1-rank
score should be the minimum. A simple solution that makes the rank numbers
work for both types of CVIs is to negate all values in the score sequences of
the CVIs whose optimum is a minimum before converting them to rank sequences
(Figure 3.19). Thus, the 1-rank score always represents the optimal perfor-
mance for all CVIs.
Table 3.8 shows rank sequences of CVIs converted from the score se-
quences in Table 3.6. For each CVI, four ranks are assigned to five scores.
Since the ARI row shows the truth rank sequence, for rank sequences in
other CVI rows, the more similar to the ARI row, the better the CVI performs.
Table 3.9: Rank-difference results for the wine dataset.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
wine 9 1 3 1 1 0 0 7 0
a. Sil = Silhouette.
For two score sequences (e.g., CVI and ARI), after quantizing them to
two rank sequences, we will compute the difference of two rank sequences
(called rank-difference), which is simply defined as the summation of ab-
solute difference between two rank sequences. For example, the two rank
sequences from Table 3.8 are:
ARI : {1, 4, 1, 4, 1}
CV DD : {1, 1, 4, 3, 1}
|1 − 1| + |4 − 1| + |1 − 4| + |4 − 3| + |1 − 1| = 7
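The rank-difference itself is straightforward to compute; a short sketch using the example sequences above, with illustrative names:

```python
import numpy as np

def rank_difference(rank_a, rank_b):
    """Sum of absolute differences between two rank sequences."""
    return int(np.abs(np.asarray(rank_a) - np.asarray(rank_b)).sum())

ari  = [1, 4, 1, 4, 1]
cvdd = [1, 1, 4, 3, 1]
print(rank_difference(ari, cvdd))   # 7, matching the worked example
```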
To evaluate CVIs by predicting the number of clusters, we used three clustering
algorithms: k-means, spectral clustering, and expectation–maximization
(EM). Suppose we have a dataset and know its real number of clusters, c;
then the steps to evaluate CVIs through predicting the number of clusters
in this dataset are:
3. The predicted number of clusters by the i-th CVI: k̂i , is the number of
clusters that perform best on the i-th CVI. (i.e., the optimal number of
clusters recognized by this CVI)
4. The successful prediction of the i-th CVI is that its predicted number
of clusters equals the real number of clusters: k̂i = c.
3.5.3 Results
Table 3.10: Hit-the-best results for real datasets.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
Iris 0 0 0 0 0 0 0 1 0
digits 0 0 0 1 0 0 1 0 1
wine 1 0 1 0 1 1 1 0 1
cancer 0 0 0 0 0 0 1 0 0
faces 1 1 1 1 1 1 0 1 1
vertebral 0 0 0 0 0 0 0 0 0
haberman 0 1 0 0 1 0 0 0 0
sonar 0 1 0 0 1 0 0 0 0
tae 0 0 0 0 0 0 1 1 0
thy 0 0 0 0 0 0 0 0 0
vehicle 0 0 0 0 0 0 1 0 1
zoo 1 0 1 0 0 1 0 0 1
Totalb 3 3 3 2 4 3 5 3 5
(rank) (4) (4) (4) (9) (3) (4) (1) (4) (1)
a. Sil = Silhouette. b. Larger value is better (rank number is smaller).
Table 3.11: Rank-difference results for real datasets.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
Iris 8 13 15 15 13 11 15 6 15
digits 2 2 1 1 4 6 8 7 6
wine 9 1 3 1 1 0 0 7 0
cancer 8 7 6 9 7 8 2 7 9
faces 4 3 4 4 2 3 9 2 5
vertebral 6 13 14 12 15 13 15 6 13
haberman 9 7 7 7 7 9 7 7 8
sonar 7 3 3 4 3 4 11 10 3
tae 9 14 9 9 14 15 0 9 9
thy 5 2 2 2 2 6 2 3 10
vehicle 12 11 9 13 13 12 3 3 7
zoo 1 6 1 6 6 1 9 8 1
Totalb 80 82 74 83 87 88 81 75 86
(rank) (3) (5) (1) (6) (8) (9) (4) (2) (7)
a. Sil = Silhouette. b. Smaller value is better (rank number is smaller).
results of nine CVIs on the wine dataset. The outcome of the rank-difference
comparison is a value in the range [0, N(N − 2)], where N is the sequence
length. As Table 3.8 shows, the length of sequences is 5; hence, the range
of rank-difference is [0, 15]. Table 3.9 shows the rank-difference results of
nine CVIs on the wine dataset. The smaller rank-difference value means the
CVI predicts better.
We applied the evaluation method to the selected CVIs (Table 3.4) by
using real and synthetic datasets (Section 3.5.1.2) and the five clustering
methods (Table 3.6). Table 3.10 and Table 3.12 are hit-the-best comparison
results for real and synthetic datasets. Table 3.11 and Table 3.13 are rank-
difference comparison results for real and synthetic datasets. To compare
across data sets, we summed all results at the bottom of each table. For the
hit-the-best comparison, the larger total value is better because more hits
appear. For the rank-difference comparison, the smaller total value is better
because results of the CVI are closer to that of ARI. Finally, ranks in the
last row uniformly indicate CVIs’ performances. The smaller rank number
means better performance. Since there are 97 synthetic datasets, to keep
the tables to manageable lengths, Tables 3.12 and 3.13 present illustrative
values for the datasets and most importantly, the totals and ranks for each
measure.
Table 3.12: Hit-the-best results for 97 synthetic datasets.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
3-spiral 1 0 0 0 0 0 0 1 0
aggregation 0 0 0 0 0 0 1 1 1
.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
zelnik5 1 0 0 0 0 0 0 1 0
zelnik6 1 1 0 0 1 0 0 0 0
Totalb 46 30 35 35 29 31 35 50 40
(rank) (2) (8) (4) (4) (9) (7) (4) (1) (3)
a. Sil = Silhouette. b. Larger value is better (rank number is smaller).
Table 3.13: Rank-difference results for 97 synthetic datasets.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
3-spiral 2 12 14 13 14 12 13 1 13
aggregation 3 3 2 2 4 5 2 5 3
.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
zelnik5 4 10 12 10 11 11 10 4 11
zelnik6 4 3 2 2 3 3 5 2 2
Totalb 406 541 547 489 583 554 504 337 415
(rank) (2) (6) (7) (4) (9) (8) (5) (1) (3)
a. Sil = Silhouette. b. Smaller value is better (rank number is smaller).
We ran each clustering algorithm with several candidate numbers of clusters
(the real number of clusters is included). Clustering algorithms have been
applied on four datasets: the wine, tae, thy, and vehicle datasets (see
Table 3.5 for details).
Tables 3.14 to 3.17 show prediction of the number of clusters based
on CVIs, clustering algorithms and datasets. The predicted number of
clusters by a CVI is the number of clusters that perform best on this CVI.
Captions of sub-tables contain the real number of clusters (classes) for each
dataset. A successful prediction of the CVI is that its predicted number of
clusters equals the real number of clusters. In the results, it is worth noting
that only DSI successfully predicted the number of clusters from spectral
clustering for all datasets. This implies that DSI may work well with the
spectral clustering method.
Table 3.14: Number of clusters prediction results on the wine dataset (178
samples in 3 classes).
Validity \ Clustering method:   KMeans   Spectral Clustering   EM
Dunn 3b 4 6
CH 3 3 3
DB 3 3 3
Sila 3 3 3
WB 3 3 3
I 3 2 2
CVNN 2 2 2
CVDD 2 2 2
DSI 3 3 3
a. Sil = Silhouette; same for the following tables. b. Bold value: a successful
prediction, i.e., the CVI's predicted number of clusters equals
the real number of clusters; same for the following tables.
Table 3.15: Number of clusters prediction results on the tae dataset (151
samples in 3 classes).
Validity \ Clustering method:   KMeans   Spectral Clustering   EM
Dunn 2 3 2
CH 6 6 4
DB 6 6 5
Sil 6 3 2
WB 6 6 6
I 5 3 3
CVNN 2 2 2
CVDD 2 2 2
DSI 6 3 5
Table 3.16: Number of clusters prediction results on the thy dataset (215
samples in 3 classes).
Validity \ Clustering method:   KMeans   Spectral Clustering   EM
Dunn 5 2 6
CH 3 3 3
DB 5 3 4
Sil 4 3 2
WB 6 3 6
I 3 3 4
CVNN 2 2 2
CVDD 2 2 5
DSI 5 3 6
Table 3.17: Number of clusters prediction results on the vehicle dataset
(948 samples in 4 classes).
Validity \ Clustering method:   KMeans   Spectral Clustering   EM
Dunn 6 2 5
CH 2 2 2
DB 2 2 2
Sil 2 2 2
WB 3 3 3
I 5 2 5
CVNN 2 2 2
CVDD 2 2 2
DSI 5 4 5
3.5.4 Discussion
Although DSI obtains only one first-rank (Table 3.10) compared with
other CVIs in experiments, having no last rank means that it still performs
better than some other CVIs. It is worth emphasizing that all compared
CVIs are excellent and widely used. Therefore, experiments show that DSI
can join them as a new promising CVI. Actually, by examining those CVI
evaluation results, we confirm that none of the CVIs performs well for
all datasets. And thus, it would be better to measure clustering results by
using several effective CVIs. The DSI provides another CVI option. Also, DSI
is unique: none of the other CVIs performs the same as DSI. For example, in
Table 3.10, for the vehicle dataset, only CVNN and DSI predicted correctly.
But for zoo dataset, CVNN was wrong and DSI was correct. For another
example, in Table 3.11, for the sonar dataset, DSI performed better than
Dunn, CVNN, and CVDD; but for the cancer dataset, Dunn, CVNN, and
CVDD performed better than DSI. More examples of the diversity of CVI are
shown in Table 3.18 and their plots with true labels are shown in Figure 3.20
(the atom dataset has three features, and the others have two features).
Table 3.18: Rank-difference results for selected synthetic datasets.
CVI
Dunn CH DB Sila WB I CVNN CVDD DSI
Dataset
atom 0 15 15 15 15 14 4 0 0
disk-4000n 10 0 7 0 0 0 11 12 1
disk-1000n 6 12 15 12 13 14 15 8 14
D31 5 1 2 1 0 2 10 2 0
flame 10 6 11 7 7 8 12 11 7
square3 11 0 2 0 0 7 0 11 0
a. Sil = Silhouette.
The former examples show the need for employing more CVIs because
each is different and every CVI may have its special capability. That ca-
pability, however, is difficult to describe clearly. Some CVIs’ definitions
show them to be categorized into center/non-center representative [114] or
density-representative. Similarly, the DSI is a separability-representative
CVI. That is, DSI performs better for clusters having high separability with
true labels (like the atom dataset in Figure 3.20); otherwise, if real clusters
have low separability, the incorrectly predicted clusters may have a higher
DSI score (Figure 3.21).
Clusters in datasets are highly diverse, so diversity in clustering
methods and CVIs is necessary. Since the preferences of CVIs are difficult to
analyze precisely and quantitatively, more studies on selecting a proper CVI
to measure clusters without true labels should be performed in the future.
Having more CVIs expands the options. And until approaches are discovered
for selecting an optimal CVI to measure clusters, it is
meaningful to provide more effective CVIs and to apply more than one CVI to
evaluate clustering results.
In addition, evaluating CVIs is itself an important task. Its general
process is:
2. Compute an external CVI with true labels as ground truth, and the internal
CVIs.
[Figure 3.20: Selected synthetic datasets plotted with true labels, including (a) atom, (b) disk-4000n, and (c) disk-1000n.]
(a) Real clusters: DSI ≈ 0.456. (b) Predicted clusters: DSI ≈ 0.664.
Figure 3.21: Wrongly-predicted clusters have a higher DSI score than real
clusters.
3.6 Conclusion
Using the best 4-feature grouping, the
accuracy of lesion identification was about 94% of that using 31 features.
The time cost of 4-feature clustering was about 40% of the 31-feature
clustering, demonstrating that 4-feature grouping can speed up acquisition
and processing. From an instrumentation point of view, by using a limited
number of features one is able to combine multiple spectral bands into one
spectrally wide band. This is extremely beneficial for low-light applications
such as implementation of aHSI via catheter access.
Furthermore, to evaluate clustering results such as those of k-means, it is essential
to apply various CVIs because there is no universal CVI for all datasets and
no specific method for selecting a proper CVI to measure clusters without
true labels. In this study, we propose the DSI as a novel CVI based on a
data separability measure. Since the goal of clustering is to separate a
dataset into clusters, we hypothesize that better clustering could cause
these clusters to have a higher separability.
Including the proposed DSI, we applied nine internal CVIs and one exter-
nal CVI, the Adjusted Rand Index (ARI), as ground truth to the clustering results
of five clustering algorithms on various real and synthetic datasets. The re-
sults show DSI to be an effective, unique, and competitive CVI relative to the other CVIs
compared here. And we summarized the general process to evaluate CVIs
and used two methods to compare the results of CVIs with ground truth.
We created the rank-difference as an evaluation metric to compare two
score sequences. This metric avoids two disadvantages of the hit-the-best
measure, which is commonly used in CVI evaluation. We believe that both
the DSI and rank-difference metric can be helpful in clustering analysis
and CVI studies in the future.
Chapter 4: Breast Cancer Detection Using Explainable Deep
Learning
4.1 Introduction
Breast cancer is the second leading cause of death among U.S. women,
and will be diagnosed in about 12% of them [248, 53]. The commonly
used mammographic detection based on Computer-Aided Diagnosis (CAD)
methods can improve treatment outcomes for breast cancer and increase
survival times for the patients [215]. These traditional CAD tools, however,
have a variety of drawbacks because they rely on manually designed features.
The process of hand-crafted features design can be tedious, difficult, and
non-generalizable [293]. In recent years, developments in machine learning
have provided alternative methods to CAD for feature extraction; one is
to learn features from whole images directly through a CNN [163, 125].
Usually, training the CNN from scratch requires a large number of labeled
images [73]; for example, the AlexNet (a classical CNN model) was trained
by using about 1.2 million labeled images [146]. For some kinds of medical
image data such as mammographic tumor images, to obtain a sufficient
number of images to train a CNN classifier is difficult because the true
positives are scarce in the datasets and expert labeling is expensive [111].
The shortcomings of an insufficient number of images to train a classifier are
well-known [146, 207]; thus, it is worthwhile to research into its solutions.
One promising solution is to reuse as the feature extractor a pre-trained
CNN model that has been trained with very large image datasets from
other fields, or re-train (fine-tune) such a model using a limited number of
labeled medical images [267]. This approach is also called transfer learn-
ing, which has been successfully applied to various computer vision
tasks [244, 15, 202]. In fact, some results of transfer learning are counter-
intuitive: previous studies for the pulmonary embolism and melanocytic
lesion detection [267, 75] show that the features (connection weights in the
CNN) learned from natural images could be transferred to medical images,
even if the target images greatly differ from the pre-trained source images.
Another solution is applying image augmentation to create new training
images and thus to improve the performance of a CNN classifier. Previous
approaches to image augmentation used original images modified by rota-
tion, shifting, scaling, shearing and/or flipping. The potential problem with
such processing is that slightly changed images are so similar to the original
ones that they add little new information and thus may not improve the performance
of a CNN classifier. Large changes, on the other hand, may change the
structure or pattern of objects in training images and degrade the perfor-
mance of the classifier. An alternative image augmentation method is to
generate synthetic images using the features extracted from original im-
ages. These generated images are not exactly like the original ones but
could keep the essential features, structures or patterns of the objects in
original images. In this respect, the Generative Adversarial Network (GAN)
is an ideal candidate for such an image augmentation method for augment-
ing the training dataset.
learning method introduced by Goodfellow et al. in 2014 [90], and it is
a state-of-the-art technique in the field of deep learning [110]. GAN has
many novel applications in the field of image processing, for example, image
translation [279, 294], object detection [154], super-resolution [150] and
image blending [286]. Recently, various GANs have also been developed for
medical imaging, such as GANCS [174] for MRI reconstruction, SegAN [287],
DI2IN [291] and SCAN [49] for medical image segmentation. In this study,
synthetic mammographic images are generated from GAN to improve the
performance of a CNN classifier.
Recently, the number of types of GANs has grown to about 500 [107] and a
substantial number of studies are about the theory and applications of GANs
in various fields of image processing. Compared to the theoretical progress
and applications of GANs, however, fewer studies have focused on evaluating
or measuring GANs’ performance [29]. Most existing GANs’ measures have
been conducted using classification performance (e.g., Inception Score) and
statistical metrics (e.g., Fréchet Inception Distance). A more fundamental
alternative approach to evaluate a GAN is to directly analyze the images it
generated, instead of using them as inputs to other classifiers (e.g., Inception
network) and then analyzing the outcomes. In this study, we propose
a fundamental way to analyze GAN-generated images quantitatively and
qualitatively.
In addition, we have examined two more basic questions for the CNN
and deep learning models: the generalizability of deep neural networks
and how to understand the mechanism of neural network models. For
supervised learning models, like CNN, the analysis of generalization ability
(generalizability) is vital because the generalizability expresses how well a
model will perform on new data. Traditional generalization measures, such
as the VC dimension [276], do not apply to Deep Neural Network (DNN)
models. Thus, new theories to measure the generalizability of DNNs are
required. In this study, we hypothesize that the DNN with a simpler decision
boundary has better generalizability by the law of parsimony (Occam’s
Razor) [25]. Although the DNN technique plays an important role in
machine learning, comprehensively understanding the mechanisms of DNN
models and explaining their output results still requires more basic
research [223].
transparency of deep learning, there are mainly three ways: the training
process [66], generalizability [159], and loss or accuracy prediction [12].
Besides the analysis of generalizability of DNN, in this study, we also create
a novel theory from scratch to estimate the training accuracy for two-layer
neural networks applied to random datasets. Such studies may provide
starting points of some new ways for researchers to make progress on the
difficult problem of understanding deep learning.
There are three main techniques: 1) training a CNN from scratch, 2) using a pre-
trained CNN model to extract features from medical images [87, 20, 46]
and 3) fine-tuning pre-trained CNN model on medical images [239, 35,
155]. In this study, we compared the three main techniques to detect
breast cancer using the Mammographic Image Analysis Society (MIAS)
mammogram database [258].
Previous studies have applied various machine learning methods for
breast cancer/tumor detection using mammograms [81]. The MIAS
database is a commonly used public mammogram database. Some stud-
ies used the traditional automatic feature extraction (not manual extrac-
tion) techniques, such as Gabor filter, fractional Fourier transform and
Gray Level Co-Occurrence Matrix (GLCM), to obtain features and then
applied Support Vector Machine (SVM) or other classifier to do classifica-
tion [135, 210, 136, 303, 187]. Neural networks were also used as classi-
fiers [280, 194]. And some studies applied CNN to generate features from
mammographic images [312, 129, 59, 28]. Some of these studies used
pre-trained CNN as applications of transfer learning. Few previous studies,
however, presented results obtained by using only CNN for both feature
generation and classification for breast cancer detection in mammograms.
In our study, we used only one CNN; its front convolutional layers are re-
sponsible for feature generation and the back Fully-Connected (FC) layers
are the classifier. Thus, the inputs to our CNN are mammographic images
and its outputs are the (predicted) labels.
We tested three training methods on the MIAS dataset: 1) training a CNN
from scratch, 2) applying the pre-trained VGG-16 model [251] to extract
features from input images and using these features to train a neural-network
classifier, and 3) updating the weights in the last several layers of the VGG-16 model by
back-propagation (fine-tuning) to detect abnormal regions. By comparison,
we found that method 2) is ideal for this study.
• Abnormal ROIs, taken from images containing abnormalities, are the
minimum rectangular areas surrounding the whole given ground-
truth boundaries.
• We first obtained the abnormal ROIs. Normal ROIs are
also rectangular images, and their size is about the average size
of the abnormal ROIs in the same database. Their locations are randomly
selected on normal breast areas. In this study, we cropped only one
ROI from each whole normal breast image.
2 http://peipa.essex.ac.uk/info/mias.html
The sizes of abnormal ROIs vary with the abnormality boundaries. Since the
CNN requires all input images to be one specific size and the usual inputs for a
CNN are RGB images (images in MIAS are gray-scale, and the input
of the VGG-16 model requires RGB images), we resized the ROIs by resampling
and converted them to RGB (3-channel cubes) by duplication.
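As a small illustration of this step, the sketch below resizes a gray-scale ROI and duplicates it into three channels; the 224×224 target size (the common VGG-16 input size) and the names are assumptions.

```python
import numpy as np
from PIL import Image

def prepare_roi(roi, size=(224, 224)):
    """Resize a gray-scale ROI by resampling and duplicate it into 3 channels."""
    resized = np.array(Image.fromarray(roi).resize(size))   # default resampling
    return np.stack([resized] * 3, axis=-1)                  # shape (H, W, 3)

roi = (np.random.rand(180, 150) * 255).astype(np.uint8)      # placeholder gray-scale ROI
rgb_roi = prepare_roi(roi)
```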
• To train a CNN from scratch (New-model)
We built our own CNN in this part. The details of this CNN structure
are shown in Table 4.1. It consists of three convolutional layers with max-pooling
layers and one FC layer. The activation function for each layer is the ReLU
function [186], except the last one for output, which is a sigmoid function.
The notation Conv_3-32 means there are 32 convolutional neurons (units)
and the filter size in each unit is 3×3 pixels (height×width) in this layer.
MaxPool_2 means a max-pooling layer with 2×2-pixel filters and stride 2.
And FC_64 means a fully-connected layer having 64 units. A Dropout
layer [257] randomly sets a fraction of input units to 0 for the next layer
at every update during training; it helps the CNN avoid overfitting.
The output layer uses a sigmoid function, which maps the output value to
the range [0, 1].
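Table 4.1 is not reproduced here; based on the description above, a minimal Keras sketch of such a model might look as follows. The filter counts of the later convolutional layers, the dropout rate, and the 128×128 input size are assumptions, not the dissertation's actual settings.

```python
import tensorflow as tf

def build_new_model(input_shape=(128, 128, 3)):
    """Illustrative CNN in the spirit of the New-model: Conv+MaxPool blocks, FC, sigmoid."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),  # Conv_3-32
        tf.keras.layers.MaxPooling2D(2),                                            # MaxPool_2
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),                               # FC_64
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),                             # binary output
    ])

model = build_new_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```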
Table 4.2: CNN architecture for transfer learning.
FC layer (FC_256 + ReLU) were still randomly initialized and updated by
training.
4.2.3.4 Results
Figure 4.1: Result of the New-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the smoothing
interval is about 20 epochs).
Feature-model The result in Figure 4.2 shows the average accuracy of
the Feature-model converged at about 0.906 (also Max = 0.906) and the
accuracy curve converged. The time cost for each epoch is about 14% of
that of the New-model. Therefore, such a comparison demonstrates that the
performance of the CNN with transfer learning is much better than training from
scratch for breast cancer/tumor detection.
Figure 4.2: Result of the Feature-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the smoothing
interval is about 20 epochs).
Tuning-model The result in Figure 4.3 shows the average accuracy of the
Tuning-model can reach a maximum of 0.914 and the accuracy curve also
converged. Its performance is slightly improved (about 0.88%) compared to
the Feature-model. But the training time for each epoch is about 22 times
that of training the classifier by only feature extraction.
Figure 4.3: Result of the Tuning-model. Blue curve is the accuracy after
each epoch of training, and red curve is smoothed accuracy (the smoothing
interval is about 20 epochs).
4.2.4 Discussion
[Summary of the three models: Tuning-model, accuracy 0.914 at 29.0 s/epoch; Feature-model, accuracy 0.906 at 1.3 s/epoch; New-model, accuracy 0.751 at 9.3 s/epoch.]
4.3 Breast Cancer Detection Using Synthetic Mammograms from
Generative Adversarial Networks3
In this section, we name the original images ORG images, the augmented
images by affine transformation AFF images, and the synthetic images
generated from GAN GAN images.
To compare the performances of GAN images with AFF images to image
augmentation, we firstly cropped the regions of interest (ROIs) from images
in the Digital Database for Screening Mammography (DDSM) [104] database
as original (ORG) ROIs. Second, by using these ORG ROIs, we applied GAN
to generate the same number of GAN ROIs. We also used ORG ROIs to
generate the same number of AFF ROIs. Then, we used six groups of ROIs:
GAN ROIs, AFF ROIs, ORG ROIs and three mixture groups of any two of the
three simple ROI types, and trained a CNN classifier from scratch for each group. We
used the remaining ORG ROIs, which were never used in augmentation or training,
to validate the classification outcomes.
a widely used mammographic image resource of the U.S. Mammographic Image Analysis Research Community. It is a collaborative effort among Massachusetts General Hospital, Sandia National Laboratories, and the University of South Florida Computer Science and Engineering Department. The DDSM database contains approximately 2,620 cases in total: 695 normal cases and 1,925 abnormal cases (914 malignant/cancer, 870 benign, and 141 benign without callback), with locations and boundaries of the abnormalities. Each case includes four images representing the left and right breasts in CC and MLO views.
• Abnormal ROIs, from images containing abnormalities, are the minimum rectangular areas surrounding the whole given ground-truth boundaries.
• Normal ROIs were cropped from the breast on the opposite side of a case containing an abnormal ROI, at the same size and location as the abnormal ROI but on the other breast side. If both the left and right breasts had abnormal ROIs and their locations overlapped, we discarded the sample. Since in most cases only one side of the breast has a tumor, and the left and right breasts have similar area and shape, the normal ROIs and abnormal ROIs have similar black background areas and scaling.
4 http://www.eng.usf.edu/cvprg/Mammography/Database.html
reflect the image at the boundaries. Figure 4.6 displays the results of the three padding methods. We will choose the padding method that obtains the best classification accuracy.
Affine Transformation
with a similar distribution (Figure 4.7). Formally, in a d-dimensional space, for x ∈ Rd, y = pdata(x) is a mapping from x to real data y. We create a neural network called the generator G to simulate this mapping. If sample y comes from pdata, it is a real sample; if sample z comes from G, it is a synthetic one. Another neural network, the discriminator D, is used to detect whether a sample is real or synthetic. Ideally, D(y) = 1 and D(z) = 0. The two neural networks G and D compose the GAN. We can find G and D by solving the two-player minimax game [90], with the standard value function V(G, D) from [90]:
min_G max_D V(G, D) = E_{y∼pdata}[log D(y)] + E_{x∼N(0,1)}[log(1 − D(G(x)))]
Figure 4.7: The GAN framework. The real mapping y = pdata(x) produces real samples from the real data distribution; the generator G simulates this mapping by producing generated samples z = G(x) from x ∼ N(0, 1) (a standard Gaussian distribution); the discriminator D distinguishes real samples y from generated samples z, and its feedback updates G.
the real one. Like a typical CNN, the discriminator has four convolutional layers with max-pooling layers and one FC layer. The activation function for each convolutional layer is also the ReLU function, and the last one for output is a sigmoid function, which maps the output value to the range [0, 1].
The notation Conv_3-32 means there are 32 convolutional neurons (units) in the layer and the filter size in each unit is 3×3 pixels (height × width). MaxPool_2 means a max-pooling layer with a 2×2-pixel window and stride 2. And FC_n means a fully-connected layer having n units. A Dropout layer [257] randomly sets a fraction of its input units to 0 for the next layer at every update during training; it helps the networks avoid overfitting. The training optimizer is Nadam [65] with default parameters (except the learning rate, which is changed to 1e-4), the loss function is binary cross-entropy, the monitored metric is accuracy, the batch size is 30, and the total number of epochs is set to 1e+5.
The steps of the GAN training procedure are:
Table 4.3: The architecture of generator and discriminator neural networks.
Generator
Layer Shape
input: 100-length vector 100
FC_(256x20x20) + ReLU 102400
Reshape to 20x20x256 20x20x256
Normalization + Up-sampling 40x40x256
Conv_3-256 + ReLU 40x40x256
Normalization + Up-sampling 80x80x256
Conv_3-128 + ReLU 80x80x128
Normalization + Up-sampling 160x160x128
Conv_3-64 + ReLU 160x160x64
Normalization + Up-sampling 320x320x64
Conv_3-32+ ReLU 320x320x32
Normalization + Conv_3-3+ ReLU 320x320x3
output (tanh): [−1, 1] 320x320x3
Discriminator
Layer Shape
input: RGB image 320x320x3
Conv_3-32 + ReLU 320x320x32
MaxPooling_2 + Dropout (0.25) 160x160x32
Conv_3-64 + ReLU 160x160x64
MaxPooling_2 + Dropout (0.25) 80x80x64
Conv_3-128 + ReLU 80x80x128
MaxPooling_2 + Dropout (0.25) 40x40x128
Conv_3-256 + ReLU 40x40x256
MaxPooling_2 + Dropout (0.25) 20x20x256
Flatten 102400
FC_1 1
output (sigmoid): [0, 1] 1
updating the generator. It is noteworthy that in this step, only the weights in the generator are changed; the weights in the discriminator are fixed.
5. Repeat Step 2 to Step 4 until all real images have been used once, which counts as one epoch. When the number of epochs reaches a set value, training stops.
For Step 5, the ideal point to stop training is when the classification accuracy of the discriminator converges to 50%. This means the discriminator can no longer distinguish the real images from the synthetic images generated by a well-trained generator. The discriminator plays the role of an assistant in the GAN. After training, we use the generator network to produce the synthetic images used in what follows.
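The following is a hedged Keras-style sketch of this alternating training loop, not the exact training script; the model names (generator, discriminator, combined) are assumptions, and the models are assumed to be built from Table 4.3 and compiled elsewhere.

```python
# Sketch of the alternating GAN training loop described above.
import numpy as np

def train_gan(generator, discriminator, combined, real_images,
              epochs=100, batch_size=30, latent_dim=100):
    # `combined` stacks the generator and the discriminator with discriminator.trainable = False,
    # so Step 4 updates only the generator's weights.
    n_batches = len(real_images) // batch_size
    for _ in range(epochs):
        np.random.shuffle(real_images)
        for i in range(n_batches):
            real = real_images[i * batch_size:(i + 1) * batch_size]
            noise = np.random.normal(0, 1, (batch_size, latent_dim))
            fake = generator.predict(noise)
            # Steps 2-3: update the discriminator on real (label 1) and synthetic (label 0) batches
            discriminator.train_on_batch(real, np.ones((batch_size, 1)))
            discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
            # Step 4: update the generator so the discriminator labels its outputs as real
            combined.train_on_batch(noise, np.ones((batch_size, 1)))
        # Step 5: one pass over the real images is one epoch; ideally stop once the
        # discriminator's accuracy stays near 0.5.
```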
4.3.4 Experiments
Table 4.4: The architecture of CNN classifier.
CNN classifier
Layer Shape
input: RGB image 320x320x3
Conv_3-32 + ReLU 320x320x32
MaxPooling_2 160x160x32
Conv_3-32 + ReLU 160x160x32
MaxPooling_2 80x80x32
Conv_3-64 + ReLU 80x80x64
MaxPooling_2 40x40x64
Flatten 102400
FC_64 + ReLU + Dropout (0.5) 64
FC_1 1
output (sigmoid): [0, 1] 1
function was used in the output layer, the predicted outcome from the CNN classifier is a value between 0 and 1. By default, the classification threshold is 0.5: if the value is less than 0.5, the ROI is classified as "0" (normal); otherwise it is classified as "1" (abnormal). The optimizer for training is Nadam with default parameters [139] (except the learning rate, which is changed to 1e-4), the loss function is binary cross-entropy, the monitored metric is accuracy, the batch size is 26, and the total number of epochs is set to 750. To train this CNN classifier from scratch, we used the labeled ROIs of abnormal and normal mammographic images. The training data include ORG ROIs, AFF ROIs, and GAN ROIs, but the validation data are only ORG ROIs.
For the affine transformation, we first decided the padding method. We collected 1300 real abnormal ROIs (Oabnorm, 'O' for original) and 1300 real normal ROIs (Onorm) in total. After setting aside 10% for validation, there are 1170 Oabnorm and 1170 Onorm. We first augmented the 1170 Oabnorm and 1170 Onorm by affine transformations to obtain 1170 Aabnorm ('A' for affine) and 1170 Anorm; the details are shown in Section 4.3.2. For each of the three padding methods, a CNN classifier was trained on the corresponding AFF ROIs (e.g., [1170 A^reflect_abnorm, 1170 A^reflect_norm]), respectively. Figure 4.8 shows the validation accuracy of the three CNN classifiers. Clearly, the CNN classifier trained by nearest-padding AFF ROIs has the best overall performance. Therefore, we used the nearest-padding AFF ROIs for the rest of our experiments.
Figure 4.8: Validation accuracy of CNN classifiers trained by three types of AFF ROIs.
We then used the ORG ROIs to train two generators, GANabnorm and GANnorm, to generate GAN ROIs, as shown in Figure 4.9 (GAN box).
(Flowchart summary: 420 abnormal and 420 normal ROIs were cropped from DDSM; 20% (84 of each class) were reserved for validation, and the remaining 336 ROIs of each class were used as ORG ROIs for training, for training the GAN generator, and for affine augmentation, each producing 336 ROIs.)
Figure 4.9: The flowchart of our experiment plan. CNN classifiers were
trained by data including ORG, AFF and GAN ROIs. Validation data for the
classifier were ORG ROIs that were never used for training. The AFF box
means to apply affine transformations.
4.3.5 Results
For training GAN, we used 336 real abnormal ROIs to obtain the generator
GANabnorm , and used 336 real normal ROIs to obtain the generator GANnorm .
Figure 4.10 shows some synthetic abnormal ROIs (Gabnorm ) generated from
GANabnorm . Then, we generated 336 Gabnorm and 336 Gnorm by generators.
Table 4.6: Training plans. Training by using CNN classifier in Table 4.4.
Notations are described in Table 4.5.
Figure 4.10: (Top row) Real abnormal ROIs; (Bottom row) synthetic abnormal
ROIs generated from GAN.
Figure 4.11: Training accuracy and validation accuracy for six training
datasets.
(Stable), and the time cost (in seconds) for each training epoch. The maximum validation accuracy can indicate the best performance of the classifier, but it may be reached fortuitously. The average validation accuracy after 600 epochs shows the stable performance of the classifier; for a good classifier, this value will increase monotonically and converge. And SStd shows how the validation accuracy varies around its average after 600 epochs. Table 4.7 shows these quantitative results.
Since the maximum validation accuracy may be fortuitous, the stable performance is more reliable for evaluating a classifier. Table 4.7 demonstrates that:
• ORG ROIs must be added to the training set, because the stable performance of any set without ORG ROIs is lower than 70%.
When training with only real ROIs, the validation accuracy is lower than when GAN ROIs are added; adding AFF ROIs can also improve the validation accuracy. Therefore, image augmentation is necessary for training CNN classifiers, and since the GAN performs better than affine transformation, the GAN could be a good alternative option. However, GAN ROIs may have features that differ from those of ORG ROIs, because overfitting occurred; adding ORG ROIs to the training set helps correct this problem. The images augmented by the GAN or by affine transformation cannot substitute for real images in training CNN classifiers, because the absence of real images in the training set causes overfitting.
4.3.6 Discussion
Table 4.7: Analysis of validation accuracy for CNN classifiers.
Set#    Best perf.a (%)    Stable perf. (%)    SStd (%)    Time/epoch (s)
1 (ORG) 78.75 73.48 1.29 7.01
2 (GAN) 79.76 64.52 3.81 4.00
3 (AFF) 73.21 60.36 2.45 4.04
4 (ORG + GAN) 85.12 74.96 1.65 10.15
5 (ORG + AFF) 81.55 71.32 2.12 9.82
6 (GAN + AFF) 80.95 69.31 1.93 6.79
a. perf = performance.
which shows that the generator acquired some important features from the ORG ROIs. But the GAN ROIs may also have features that differ from the ORG ROIs; thus, the stable accuracy is about 9% lower. Adding ORG ROIs to the training set can help correct this problem.
Since abnormal ROIs may contain more features than normal ROIs, we take a statistical view for comparing the real abnormal ROIs and their augmented ROIs: Oabnorm, A^nearest_abnorm, and Gabnorm. For each category, we use 336 samples and compute their mean, standard deviation (Std), skewness, and entropy. Then we plot the normalized values as histograms to see their distributions. Owing to limited space, we display only the Std and mean in Figure 4.12.
From the view of the mean's distribution, GAN is more like ORG than AFF is; but the view of the Std's distribution shows the opposite. To quantitatively analyze the difference between distributions, we calculate the Wasserstein distance [229] between two histograms. The Wasserstein distance is smaller when the difference between two distributions is smaller, and it equals 0 when the two distributions are identical. Table 4.8 shows the Wasserstein distances of ORG ROIs vs. GAN ROIs and ORG ROIs vs. AFF ROIs for the four statistical criteria. GAN ROIs are closer to ORG ROIs
Figure 4.12: Histogram of mean and standard deviation. (Normalized)
than AFF ROIs are in mean and entropy, but farther in Std and skewness. Such results may explain why GAN ROIs are a valid form of image augmentation. These results also suggest how to improve the GAN: we could modify the GAN to generate images with smaller Wasserstein distances to real images in these statistical criteria. In fact, the Wasserstein GAN [11] is designed with a similar idea.
Table 4.8: Wasserstein distances between the statistical distributions of ORG ROIs and the two types of augmented ROIs.
Criterion    336 Oabnorm vs. 336 Gabnorm    336 Oabnorm vs. 336 A^nearest_abnorm
Mean    0.083    0.185
Std    0.100    0.040
Skewness    0.101    0.047
Entropy    0.111    0.456
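A hedged sketch of this comparison is shown below (illustrative only; the helper names are assumptions). It computes the four per-image statistics for two ROI sets and the Wasserstein distance between the resulting value distributions using SciPy.

```python
# Sketch: per-image statistics for two ROI sets and Wasserstein distances between them.
import numpy as np
from scipy.stats import skew, wasserstein_distance

def image_stats(images):
    """images: array of shape (n, H, W) or (n, H, W, C) with values in [0, 1]."""
    flat = images.reshape(len(images), -1)
    ents = []
    for x in flat:
        p, _ = np.histogram(x, bins=256, range=(0, 1))
        p = p / p.sum()
        p = p[p > 0]
        ents.append(-(p * np.log2(p)).sum())          # Shannon entropy of the gray-level histogram
    return {"mean": flat.mean(axis=1), "std": flat.std(axis=1),
            "skewness": skew(flat, axis=1), "entropy": np.array(ents)}

def compare(org_rois, aug_rois):
    s_org, s_aug = image_stats(org_rois), image_stats(aug_rois)
    return {k: wasserstein_distance(s_org[k], s_aug[k]) for k in s_org}
```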
Theoretically, a well-trained GAN could generate images having the same distribution as the real images; the synthetic images would then have zero Wasserstein distance to the real images in any statistical criterion. If so, the performance of a CNN classifier trained by GAN ROIs would be as good as one trained by ORG ROIs. Our results, however, show that, based on the distributions and the training performance, the GAN did not meet this theoretical expectation. The problem can be seen by looking at the synthetic images (Figure 4.10): they have a clearly artificial appearance. One possible reason is that the GAN adds some features or information not belonging to the real images; that is why the distributions of the four statistical criteria for GAN ROIs differ from those of ORG ROIs. Those new features prevent classifiers from detecting abnormal features in real images and lower the validation accuracy. A possible solution is to change the architecture of the generator and/or the discriminator in the GAN. In this study, the architecture we used is DCGAN [209]. About 500 different GAN architectures have been proposed recently [107]; we believe that some of them can achieve better performance for image augmentation.
measure. In addition, we discuss how these evaluations could help us to
deepen our understanding of GANs and to improve their performance.
The optimal GAN for images can generate images that have the same
distribution as real samples (used for training), are different from real ones
(not duplication), and have variety. Expectations of generated images could
be described by three aspects: 1) non-duplication of the real images, 2)
generated images should have the same style, which we take to mean that
their distribution is close to that of the real images, and 3) generated images
are different from each other. Therefore, we evaluate the performance of a
GAN as an image generator according to the three aspects:
distance and has a simple and uniform framework for the three aspects of
ideal GANs and depends less on visual evaluation.
The proposed LS measure is applied to analyze the generated images directly, without using pre-trained classifiers. We applied the measure to the outcomes of several typical GANs: DCGAN [209], WGAN-GP [94], SNGAN [181], LSGAN [172], and SAGAN [301] on various image datasets. Results show that the LS reflects the performance of GANs well and is very competitive with the other compared measures. In addition, the LS is stable with respect to the number of images and can provide an explanation of results in terms of the three aspects of ideal GANs.
Recently, the two most widely applied indexes for evaluating GAN performance are the Inception Score (IS) [231] and the Fréchet Inception Distance (FID) [106]. Both depend on the Inception network [264] pre-trained on the ImageNet [51] dataset.
4.4.2.1 KL Divergence Based Evaluations
From the perspective of the three aspects for ideal GANs, the IS focuses
on measuring the inheritance and diversity. Specifically, we let x ∈ G be
a generated image; y = InceptionNet(x) is the label obtained from the pre-
trained Inception network by inputting image x. For all generated images,
we have the label set Y . H(Y ) defines the diversity (H(·) is entropy) because
the variability of labels reflects the variability of images. H(Y |G) could show
the inheritance because a good generated image can be well recognized and
classified, and thus the entropy of p(y|x) should be small. Therefore, an
ideal GAN will maximize H(Y ) and minimize H(Y |G). Equivalently, the goal
is to maximize:
H(Y) − H(Y|G) = E_G[D_KL(p(y|x) ‖ p(y))]
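As an illustrative (not authoritative) sketch, the IS is the exponential of the mean KL divergence above; the function below assumes that the class probabilities p(y|x) have already been obtained from the pre-trained Inception network.

```python
# Sketch of the IS computation from a matrix of softmax outputs p(y|x).
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: array of shape (num_images, num_classes); each row sums to 1."""
    p_y = p_yx.mean(axis=0, keepdims=True)                                  # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)      # D_KL(p(y|x) || p(y))
    return float(np.exp(kl.mean()))                                         # IS = exp(E[D_KL])
```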
data. And it has no ability to detect overfitting. For example, if the set
of generated images was a copy of the real images and very similar to
images of ImageNet, IS will give a high score.
Like the IS, larger value of MS is better; but smaller value of AM is better.
The FID also exploits real data and uses the pre-trained Inception net-
work. Instead of output labels it uses feature vectors from the final pooling
layers of the InceptionNet. All real and generated images are input to the
network to extract their feature vectors.
Let ϕ(·) = InceptionNet_lastPooling(·) be the feature extractor and let
Fr = ϕ(R), Fg = ϕ(G) be two groups of feature vectors extracted from real
and generated image sets. Consider that the distributions of Fr , Fg are
multivariate Gaussian:
Fr ∼ N (µr , Σr ) ; Fg ∼ N (µg , Σg )
FID(R, G) = ‖µr − µg‖2^2 + Tr(Σr + Σg − 2(ΣrΣg)^(1/2))
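A hedged NumPy/SciPy sketch of this formula, applied to two sets of extracted feature vectors (rows are images), is shown below; it is an illustration, not a reference implementation.

```python
# Sketch of the FID formula applied to feature matrices Fr (real) and Fg (generated).
import numpy as np
from scipy import linalg

def fid(Fr, Fg):
    mu_r, mu_g = Fr.mean(axis=0), Fg.mean(axis=0)
    cov_r, cov_g = np.cov(Fr, rowvar=False), np.cov(Fg, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)           # matrix square root of (Sigma_r Sigma_g)
    if np.iscomplexobj(covmean):                    # drop tiny imaginary parts from round-off
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```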
1-D Wasserstein distances, which have simple solutions [212, 3].
Compared to IS and FID, SWD directly uses the real and generated
images without auxiliary networks but it requires that the two data sets
have the same number of images: |R| = |G|. Usually, the amount of real data
is smaller than that of generated data (generated data can be an arbitrarily
large amount). And the result of SWD is in general different with each
application of the algorithm because of its dimensionality reduction by
random projections. Thus, we have to take its average values by computing
repeatedly.
As with the FID, the Wasserstein distance measures the difference be-
tween distributions of real and generated images and a good GAN can
minimize the difference between the two distributions. Hence, for FID and
SWD, the smaller value is better.
to a 1-NN classifier trained on dataset: {R ∪ G} with labels “1” for R and
“0” for G. For each validation result, the accuracy is either 1 or 0; and the
Leave-one-out (LOO) accuracy is the final average of all validation results.
• LOO accuracy ≈ 0.5 is the optimal situation because the two distribu-
tions are very similar.
• LOO accuracy < 0.5, the GAN is overfitting to R because the generated
data are very close to the real samples. In an extreme case, if the GAN
memorizes every sample in R and then generates them identically, i.e.,
G = R, the accuracy would be = 0 because every sample from R would
have its nearest neighbor from G with zero distance.
• LOO accuracy > 0.5 means the two distributions are different (separa-
ble). If they are completely separable, the accuracy would be = 1.
Let r1NNC = r(1NNC). Therefore, for r1NNC, the best score is 1 and the
larger value is better.
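A hedged scikit-learn sketch of this leave-one-out 1-NN test is shown below (illustrative; the exact implementation used in this work may differ, and the test is slow for large image sets).

```python
# Sketch: leave-one-out 1-NN accuracy on the combined real/generated dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_1nn_accuracy(real, generated):
    """real, generated: arrays of flattened images with the same number of samples."""
    X = np.vstack([real, generated])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
    # ~0.5 is ideal; near 0 indicates memorization of R; near 1 indicates separable distributions
    return scores.mean()
```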
As reported by Borji [29], many other GAN evaluation measures have
been proposed recently. Measures like the Average Log-likelihood [268],
Coverage Metric [273], and Maximum Mean Discrepancy (MMD) [91] depend
on selected kernels. And measures like the Classification Performance (e.g.,
FCN-score) [121], Boundary Distortion [235], Generative Adversarial Metric
(GAM) [119], Normalized Relative Discriminative Score (NRDS) [304], and
Adversarial Accuracy and Divergence [292] use various types of auxiliary
models. Some measures compare real and generated images based on image-
level techniques [253, 298], such as SSIM, PSNR, and filter responses. The
idea of the Geometry Score (GS) [137] is similar to our proposed LS in some
aspects but its results are unstable and rely on required parameters6 .
We will further discuss the GS later.
By considering the complexity of algorithm, efficiency in high dimensions,
dependency on models or parameters, the extent of use in GAN study field,
and (codes) availability for implementation, we finally chose the IS, FID,
r1NNC(C2ST), MS, AM, and SWD from the currently-used quantitative
measures to compare with our proposed LS.
As with the FID, 1NNC, and SWD, examining how close the distributions of real and generated images are to each other is an effective way to measure GANs, because the goal of GAN training is to make the generated images have the same distribution as the real ones.
Considering a dataset that contains real and generated data, the most
difficult situation to separate the two classes (or two types: real and gen-
erated data) of data arises when the two classes are scattered and mixed
together in the same distribution. In this sense, the separability of real and
generated data could be a promising measure of the similarity of the two
distributions. As the separability increases, the two distributions have more
differences. Therefore, we propose to use the Distance-based Separability Index (DSI) to analyze how two classes of data are mixed together.
6 In practice, we used the code provided by its author: https://github.com/KhrulkovV/geometry-score.
Since for GANs’ evaluation, there are only two classes: the real image
set R and generated image set G, we have two ICD sets and one BCD set
(see their definitions in Section 2.3.1). The DSI can be applied in a multi-
class scenario by one-versus-others but here, we focus on the computation
of DSI for GANs’ evaluation (two-class scenario).
Similar to the procedure shown in Section 2.3.2 (but the last step is
different), to compute the LS for two classes R and G:
1. First, to compute the ICD sets of R and G: {dr }, {dg } and the BCD set:
{dr,g }.
because the maximum value can highlight the difference between ICD
and BCD sets.
Remark. The similarity of the distributions of the ICD sets: KS({dr }, {dg })
is not used because it shows only the difference of distribution shapes, not
their location information. For example, two distributions that have the
same shape but no overlap will have zero KS distance between their ICD sets: KS({dr}, {dg}) = 0.
7 In experiments, we used scipy.stats.ks_2samp from the SciPy package in Python to compute the KS distance: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
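The sketch below illustrates the first step and the KS comparison against the BCD set; it is an illustration under the definitions above, and the remaining steps of the LS computation (Section 2.3.2) are not reproduced here.

```python
# Sketch: compute the two ICD sets, the BCD set, and their KS distances to the BCD set.
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def icd_bcd_sets(R, G):
    """R, G: arrays of flattened real and generated images (rows are samples)."""
    d_r = pdist(R)               # ICD set of R: within-class pairwise Euclidean distances
    d_g = pdist(G)               # ICD set of G
    d_rg = cdist(R, G).ravel()   # BCD set: between-class distances
    return d_r, d_g, d_rg

def ks_to_bcd(d_r, d_g, d_rg):
    # KS distances between each ICD set and the BCD set, used in the LS computation
    return ks_2samp(d_r, d_rg).statistic, ks_2samp(d_g, d_rg).statistic
```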
The first experiment has two purposes: one is to test the stability of
the proposed measure, i.e., how little the results change when different
amounts of data are used. Another purpose is to find the minimum amount
Figure 4.14: Lack of Creativity, Diversity, and Inheritance in 2D. Histograms
of (a) and (b) are zoomed to ranges near zero; (c) has the entire histogram.
of data required for the following experiments because a GAN could generate
unlimited data and we wish to bound it to make computation practicable.
The following experiments compare our measure LS with the commonly
used measures: IS and FID, and other selected measures. The purpose is
not to show which GAN is better but to show how the results (values) of our
measure compare to those of existing measures.
Figure 4.15: DCGAN-Plastics: index values of LS, IS, FID/100, r1NNC, MS, AM/1000, SWD/1000, and GS versus the number of generated images (0 to 5000).
To fit the axes, the values of FID, AM, and SWD are scaled by 0.01, 0.001,
and 0.001, respectively. The result indicates that the scores, except the GS, are stable with respect to different numbers of testing images, especially when the
amount is greater than 1000. We remove the GS from further comparisons
because its results are highly unstable with the amount of data.
4.4.4.2 Four Image Types and Three GANs
Figure 4.16: Column 1: samples from the four types of real images (hole, small leaf, big leaf, and plastic); columns 2-4: samples from synthetic images of the three GANs (DCGAN, WGAN-GP, and SNGAN) trained on the four types of images.
Table 4.10: Measure results.
Figure 4.17: Normalized and ranked scores. The x-axis shows the scores (LS, IS, FID, r1NNC, MS, AM, and SWD) and the y-axis shows their normalized values; 0 is for the worst (model) performance and 1 is for the best (model) performance. Colors are for generators (DCGAN, WGAN-GP, and SNGAN) and shapes are for image types (hole, small_leaf, big_leaf, and plastic); see details in the legend.
Table 4.12: Measure results on CIFAR-10.
Figure 4.18: Column 1: samples from real images of CIFAR-10; column 2-6:
samples from synthetic images of five GANs: DCGAN, WGAN-GP, SNGAN,
LSGAN, and SAGAN trained by the original 2,000-image subset.
Figure 4.19: Processes to build real set and generated sets including optimal
generated images and generated images lack creativity, lack diversity, lack
creativity & diversity, and lack inheritance.
no common image in the three sets. One set having 2,000 images was con-
sidered as the optimal generated set (Opt.) because these images come from
the same source of real data. The lack-of-diversity set (LD) was generated
by repeatedly copying the 20 images 100 times. Another 2,000-image set
was considered as the real set and used to generate the lack-of-creativity set
(LC) by the small modification of all images with the median filter. Since
filtering could slightly change images and keep their main information,
each image after filtering is similar to its original version i.e., the modified
images lack creativity. Choosing 20 images from the lack-of-creativity set
and repeatedly copying them 100 times generates the lack-of-creativity &
diversity set (LC&D). The lack-of-inheritance set (LIn) contains 2,000 images
selected randomly from handwritten digit “7” images in MNIST because the
handwritten digit “7” is greatly different from digit “8”.
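For illustration, a hedged sketch of constructing these sets is shown below; the array shapes, the 3×3 median filter size, and the exact image sourcing (which follows Figure 4.19) are assumptions.

```python
# Sketch: build the virtual "generated" sets from MNIST digit-8 images.
import numpy as np
from scipy.ndimage import median_filter

def build_virtual_sets(real_set, opt_source):
    """real_set, opt_source: two disjoint 2,000-image arrays (n, 28, 28) of digit '8'."""
    opt = opt_source                                              # Opt.: drawn from the same source as the real data
    lc = np.array([median_filter(x, size=3) for x in real_set])   # LC: slightly modified copies of the real images
    ld = np.tile(opt_source[:20], (100, 1, 1))                    # LD: 20 images copied 100 times
    lcd = np.tile(lc[:20], (100, 1, 1))                           # LC&D: 20 filtered images copied 100 times
    return opt, lc, ld, lcd                                       # LIn would be 2,000 digit-7 images
```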
The five datasets: Opt., LC, LD, LC&D, and LIn mimic the datasets that
are generated from five virtual GAN models trained on the 2,000-image
real set. The optimal generated set (Opt.) is treated as if it were generated from an optimal GAN, and the other four sets as if they were generated from four different GANs with the respective drawbacks. Figure 4.20 shows samples from these datasets. Then, we applied the LS and the other six measures to
the five “generated” image sets and the 2,000-image real set. Results are
shown in Table 4.13.
In this experiment, we know the Opt. GAN is the best one. Hence, we
could state the concrete conclusion that LS, FID, 1NNC, MS, and SWD
successfully discover the best GAN model. As we discussed in Section 4.4.2,
results of IS confirm that it is not good at evaluating the creativity and
inheritance of GANs because it gives them higher scores (2.112 and 1.941)
than the best case (1.591) and the IS emphasizes the diversity. Other
Figure 4.20: Column 1: samples from the real set; column 2-6: sample
images from the five virtual GAN models: Opt., LC, LD, LC&D, and LIn
trained by the real set.
Table 4.13: Measure results from virtual GAN models
4.4.5 Discussion
Since Geirhos et al. [85] recently reported that CNNs trained by ImageNet
have a strong bias to recognize textures rather than shapes, we chose texture
images to train GANs. From results in Table 4.11, the proposed LS agrees
with IS, 1NNC, and SWD that WGAN-GP performs the best and SNGAN
performs the worst on selected texture images. As shown in Table 4.12, LS
makes the same evaluation on CIFAR-10 dataset. As shown in Figure 4.16,
SNGAN and WGAN-GP generate synthetic images that look different from
real samples but SNGAN tends to generate many very similar images (its
diversity is low). Hence, all measures rate SNGAN as performing worst on
texture datasets. Results on CIFAR-10 dataset (Table 4.12) show a similar
conclusion.
Figure 4.21: Real and generated datasets from virtual GANs on MNIST. Panels: (a) Optimal; (b) Lack of Creativity; (c) Lack of Diversity; (d) Lack of Creativity & Diversity; (e) Lack of Inheritance. First row: the 2D tSNE plots of real (blue) and generated (orange) data points from each virtual GAN. Second row: histograms of ICDs (blue for real data; orange for generated data) and the BCD for the real and generated datasets. The histograms in (b)-(d) are zoomed to the beginning of the plots; (a) and (e) show the entire histograms.
has 28 × 28 pixels so that these data are in a 784-dimensional space. To
visually represent the data in two dimensions, we applied the t-distributed
Stochastic Neighbor Embedding (tSNE) [171] method. In contrast, the ICD
and BCD sets were computed in the 784-dimensional space directly, without
using any dimensionality reduction or embedding methods.
As shown in Figure 4.21, the ICD and BCD sets for computing the LS
offer an interpretation of how LS works and verify that LS is able to detect
the lack of creativity, diversity, and inheritance for GAN generated data, as
we discussed in Section 4.4.3. Figure 4.21(a) shows the real (training) data
and data generated by the ideal GAN. Since distributions of the three sets
are nearly the same, LS gets the highest score (close to 1, in Table 4.13).
Figure 4.21(b) shows the GAN lacks creativity. Almost every generated data
point is overlapped with (or very close to) a real data point. Hence, the BCD
set has some peaks at the beginning of plot. Lack of diversity is shown by
Figure 4.21(c). Most generated data points are not close to real data points,
but some points are very close to each other. That results in a peak at the
beginning of generated ICD plot. Any differences of the histograms of
ICD and BCD sets will decrease the LS. Therefore, LS is affected by the
isolated peaks of one distance set. Figure 4.21(d) shows the combined effect.
Generated data points are close to real data points and cluster in a few
places. Both BCD and generated ICD peaks can be found at the beginning
of plot. For the last Figure 4.21(e), lack of inheritance means generated
data are dissimilar from real data. The two kinds of data are distributed
separately so that distributions of the three sets are all different, contrary
to Figure 4.21(a); that leads to the lowest LS.
Figure 4.22: Time cost of measures running on a single core of CPU (i7-
6900K). To test time costs, we used same amount of real and generated
images (200, 500, 1000, 2000, and 5000) from CIFAR-10 dataset and
DCGAN trained on CIFAR-10. † IS only used the generated images.
4.4.5.2 Time Complexity
Both LS and 1NNC use direct image comparison, i.e., the Euclidean (l2-norm) distance between two images. The main time cost of LS is to calculate the ICD and BCD sets. LS's time complexity for N (Class 1) and M (Class 2) data points is about O(N²/2 + M²/2 + MN) (two ICD sets and one BCD set). Although 1NNC also uses the Euclidean distance between two images, its time complexity is about O((M + N)²), which is about double the cost of LS.
4.4.5.3 Comparison Summary
The compared measures have various drawbacks. The IS, FID, MS, and AM depend on the Inception network pre-trained on ImageNet. In addition,
IS lacks the ability to detect overfitting (creativity) and inheritance and FID
depends on the Gaussian distribution assumption of feature vectors from
the network. The SWD and 1NNC require that the amount of real data be
equal to the amount of generated data. The local conditions of distribu-
tions will greatly influence results of 1NNC (e.g., it obtains extreme values
like 0 or 1 in Table 4.10) because it only considers the 1-nearest neigh-
bor. That there are several required parameters8 such as slice_size and
n_descriptors is another disadvantage of SWD; both changes of parameters
and the randomness of radial projections will influence its results.
The proposed LS is designed to avoid those disadvantages. We have
created three criteria (creativity, diversity, and inheritance) to describe ideal
GANs. And we have shown that LS evaluates a GAN by examining the three
aspects in a uniform framework. In addition, LS does not need a pre-trained
classifier, image analysis methods, nor a priori knowledge of distributions.
Ranging between 0 and 1 is another merit of LS because we could know
how close the performance of a GAN model is to the ideal situation.
We found that the idea of GS [137] has some similar points to our LS. The
GS compares the complexities of the manifold structures, which are built
by pairwise distances of samples, between real and generated data. And we
think the complexity of data manifold may have some connections to data
separability. However, we found the results of GS too unstable to use. For example, we computed the GS measure twice on 2,000 generated and 2,000
real images from DCGAN and CIFAR-10 (the same test in Section 4.4.4.3);
one result is 0.0078 and the other is 0.0142, almost double. As Figure 4.15 shows, GS results differ not only between runs but also with the number of samples.
8 More details are in its source code: https://github.com/koshian2/swd-pytorch.
4.5 Generalizability of Deep Neural Networks9
VC dimension [276] or Rademacher complexity [22], DNNs tend to overfit
the training data and demonstrate poor generalization. Much empirical
evidence, however, has indicated that neural networks can exhibit a re-
markable generalizability [299]. This fact requires new theories to explain
the generalizability of neural networks. Two main approaches characterize
studies of generalizability for deep learning [128]: a generalization bound
on the test/validation error calculated from the training process [69, 160],
and a complexity measure of models [134, 189, 127], motivated by the
VC-dimension.
Classifiers that overfit the training data lead to poor generalizability. To
limit the overfitting, several regularization techniques such as dropout and
weight decay have been widely applied in training DNNs. As L1 and L2
regularization could generate sparsity for sparse coding [151], regularization
techniques simplify the model’s structure and then prevent the model from
overfitting [257, 295]. This is because the simplified model cannot fit all
training data precisely but must learn the approximate outline or distribu-
tion of the training data, which is the key information required to perform
well on test data (generalizability). On the other hand, the law of parsimony
(Occam’s Razor) [25] implies that any given simple model is a priori more
probable than any given complex model [99]. Therefore, we hypothesize
that, on a specific dataset, if two models have similar high training accuracy
(close to 1), the simpler model will have a higher test accuracy (better
generalizability).
There are two ways to measure model complexity: 1) to examine trainable
parameters and the structure of the model [189, 38]; 2) to evaluate the com-
plexity of the decision boundary [311, 14, 156], which is the consequential
representation of model complexity. Recently, several analyses of complexity
of the decision boundary investigated adversarial examples that are near
the decision boundary [103, 297, 133]. In this paper, for DNN models, we
analyze generalizability based on the complexity of the decision boundary.
Unlike other recent studies, this one proposes a novel method to charac-
terize these adversarial examples to reveal the complexity of the decision
boundary, and this method is applicable to datasets of any dimensionality.
4.5.2 Methods
Given two training data points a and b from different classes, with f(a) ≈ 0 and f(b) ≈ 1, we search along the line segment between them,
x = λa + (1 − λ)b, 0 ≤ λ ≤ 1,
for an adversarial example c with f(c) ≈ 0.5. The line must cross the decision boundary because its two ends are in different classes. Hence, the adversarial example c exists on the line.
Figure 4.23: An adversarial example c on the line segment between a (Class 1, f(a) ≈ 0) and b (Class 2, f(b) ≈ 1), with f(c) ≈ 0.5; λ and 1 − λ are the fractions of the segment on the two sides of the decision boundary of classifier f.
The precision of the distance of the adversarial example from the boundary depends on the step size (ε), and so does the time cost, which is about O(1/ε). This process can be sped up to O(log(1/ε)) by a divide-and-conquer algorithm, which uses binary search. In experiments, we set ε = 1/256 because the inputs are 8-bit images.
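A hedged sketch of this binary search is shown below; it assumes a single boundary crossing along the segment and a classifier f that returns a scalar in [0, 1], and the function name is illustrative.

```python
# Sketch: binary search along the segment between a (f ~ 0) and b (f ~ 1) for f(c) ~ 0.5.
def find_adversarial(f, a, b, eps=1/256):
    """f: trained classifier returning a value in [0, 1]; a, b: inputs from the two classes."""
    lo, hi = 0.0, 1.0                      # interpolation weight on b: 0 -> a, 1 -> b
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        c = (1 - mid) * a + mid * b
        if f(c) < 0.5:
            lo = mid                       # still on the class of a; move toward b
        else:
            hi = mid                       # crossed the boundary; move toward a
    t = (lo + hi) / 2.0
    return (1 - t) * a + t * b             # point with f approximately 0.5
```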
Figure 4.24: The global adversarial set: adversarial examples generated from many pairs of data points from Class 1 and Class 2 outline the decision boundary of classifier f.
XXᵀW = λW,
where W is the eigenvector matrix, s.t. WᵀW = I, and λ contains the n eigenvalues {λ1, λ2, ..., λn}.
These eigenvalues could show the complexity of the adversarial set. If λi / Σk λk = 1, all m examples lie on the line of the i-th eigenvector; this is the simplest condition for the adversarial set. If (λi + λj) / Σk λk = 1, all m examples are on a plane, which indicates that the decision boundary most likely is a plane. In general, we could measure the Decision Boundary Complexity (DBC) of f by computing the Shannon entropy of the normalized eigenvalues:
DBC{f} = H(λ1/Σλi, λ2/Σλi, ..., λn/Σλi) / log n
Dividing by log n normalizes the DBC to the range [0, 1]. 0 is the simplest
condition: the decision boundary is just a line.
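A hedged NumPy sketch of this computation is shown below; whether the adversarial set is centered before the eigen-decomposition is an assumption left as an option, since the text uses XXᵀ directly.

```python
# Sketch: DBC score from the eigenvalue entropy of an adversarial set.
import numpy as np

def dbc_score(adv_examples, center=False):
    """adv_examples: array of shape (m, n) -- m adversarial examples in n dimensions."""
    X = adv_examples.astype(float)
    if center:
        X = X - X.mean(axis=0)
    # The nonzero eigenvalues of the n x n matrix XX^T equal those of this m x m Gram matrix.
    gram = X @ X.T
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    n = X.shape[1]
    return float(-(p * np.log(p)).sum() / np.log(n))   # normalized entropy in [0, 1]
```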
A problem arises if we think about the most difficult condition of the
boundary (DBC=1). For example, in 2-D, DBC=1 when the adversarial set
forms a circle, but we cannot say the round boundary is the most complex
one. For round-shape decision boundaries, some boundaries are smooth,
and some may be lumpy. As Figure 4.25 shows, the boundary (a) is more
smooth (simpler) than (b). Under our hypothesis, we consider that the
generalizability of model (a) is better than (b). DBC scores computed by
adversarial sets of the two models will, however, be similar (and close to 1).
In Figure 4.25, the boundary (a) is obviously simpler than boundary (b)
because (b) has many zigzags in every segment. But if we compute the DBC
score using the entire adversarial set, the effect (on eigenvalues’ entropy)
of zigzags is confused with the round-shape. Thus, it is not appropriate
to use the entire adversarial set in such cases. If the adversarial set is
generated by all data (in Figure 4.24), we name it the global adversarial
set. And the DBC score computed from it is called the global DBC.
Figure 4.25: Two round-shaped decision boundaries between Class 1 and Class 2: (a) smooth and (b) lumpy.
To solve the round-shape problem, we turn to considering adversarial examples on a section of the boundary, the segmental boundary. We define the adversarial data set formed by a segmental boundary as the local adversarial set.
Adversarial examples in a local adversarial set should be close to each other so that they outline the shape of the segmental boundary. As Figure 4.26 shows, a pair of data points from the two classes is randomly selected, and then the n-nearest neighbors of one of those two data points are found. Finally, adversarial examples (green points) are generated on the lines between these n + 1 data points in one class and the data point in the other class. Deciding the number of examples k for one local adversarial set is an interesting question; it probably depends on the dimension and the distances between example points. We discuss this question further in the experiments below.
Figure 4.26: A local adversarial set (green points) generated between a randomly selected pair of points from the two classes and the n-nearest neighbors on one side, near the decision boundary of classifier f.
The computation process for the complexity of a local adversarial set is the same as that for the global adversarial set. The steps show the process. The
difference is that N pairs of data generate one global adversarial set but N
local adversarial sets. Thus, one decision boundary has many local DBC
scores.
dataset from sklearn.datasets.load_breast_cancer10 . Its dimensionality is
30. The third experiment uses real images of cats and dogs downloaded
from GitHub11 . The image size is 150x150x3; thus, these data are in very
high dimension.
The key ideas of experiments are to train DNNs with different general-
izabilities and compute DBC scores of these trained models. The ground
truth for generalizability is the test accuracy because better performance
on test data indicates greater generalizability: performance on new data.
In the experiments the generalizability of a DNN is adjusted by intentional
overfitting, such as by adding excessive trainable weights and removing
regularization layers.
11 https://github.com/vyomshm/Cats-Dogs-with-keras
decision boundary). It is not very impressive for the 2-D dataset because
the boundary is visible. We could visually identify the simpler boundary
case. But for a high-dimensional dataset, we must rely on the DBC score to
describe the complexity of decision boundary.
The dataset is imported from the breast cancer (Wisconsin) dataset and
has two classes (212 Malignant and 375 Benign cases). Each case contains
30 numerical features.
Two FCNN models (bC1 and bC2) have been trained to classify this
dataset. The training-test data ratio is 3:2 and both training accuracies are
nearly 100% (> 0.99) at the end. Then, we obtain models’ test accuracies
as the ground truth for model complexity. The greater test accuracy value
means better generalizability (simpler decision boundary).
To compute the local DBC scores requires only the training data. We
randomly select a pair of data points, of which one is a Malignant sample
and another one is a Benign sample, to compute one local DBC score on
trained models. This process is repeated 2,500 times (about 5 times of
total number of data) to obtain 2,500 local DBC scores for each model.
These local DBC scores are based on 30-nearest neighbors because the
space dimension is 30. Thus, each local DBC score is computed by 31
adversarial examples. The reason is that, in 30-D, the simplest element
(30-simplex) contains 31 vertices (e.g., as triangle in 2-D and tetrahedron in
3-D). We consider that n-nearest neighbors could best reflect the complexity
of segmental boundary in n-D. The next experiment shows that the number
of nearest neighbors could be much smaller than the dimensionality and
not unique.
Figure 4.28: Local DBC scores from two models trained by the breast cancer
dataset. The FCNN bC1 has three hidden layers (20 neurons in each layer)
and three Dropout layers; its number of parameters is 1,481 (including
bias). The bC2 has one hidden layer with 1,000 neurons; its number of
parameters is 32,001 (including bias).
Table 4.14: Statistical Results of local DBC scores on bC1 and bC2.
Figure 4.28 and Table 4.14 clearly indicate that the model bC2 generally
has larger local DBC scores than bC1. The result means bC1 has better
generalizability than bC2, which is verified by their test accuracies. We do
not calculate the standard deviation of scores because their distributions
are not Gaussian but more like the long-tailed distribution. Instead, we
apply the two-sample rank test12 to estimate which scores are smaller.
This dataset contains 1,440 cat and 1,440 dog RGB photos. The image
size is 150x150x3 (67,500 8-bit integers). Three convolutional neural net-
work (CNN) models (cC1, cC2 and cC3) are trained to classify this dataset.
The training-test ratio is 32:13 and both training accuracies are > 0.95 at
the end. Then, we obtain models’ test accuracies as the ground truth for
model complexity. Figure 4.29 shows the training process.
Figure 4.29: Training and test accuracies during the training of the three models (final test accuracies of 0.73, 0.63, and 0.58).
The CNN cC1 has three convolutional layers, three max-pooling layers,
one dense layer (64 neurons) and one Dropout layer. The cC2 has one
convolutional layer and three dense layers (256, 128, 64 neurons). The cC3
has only one dense layer (1024 neurons).
12 https://www.mathworks.com/help/stats/signrank.html
Computing the local DBC scores uses only the training set. We randomly select a cat and a dog image from the training set to compute local DBC scores on the trained models. This process is repeated 6,000 times (about 5 times the size of the training set) to obtain 6,000 local DBC scores for each model.
Since the space dimension (67,500) is far beyond the size of dataset
(2,880), we cannot choose based on the idea of a simplex and use 67,500-
nearest neighbors to compute local DBC scores. Even if we have enough
images to use, the number of nearest neighbors is too large to run the
process. Hence, to find a properly small number, we test the 3, 5, 10, 15,
20, 30-nearest neighbors.
Figure 4.30: Means and medians of local DBC scores on model cC1, cC2
and cC3 using different numbers of nearest neighbors.
Figure 4.30 shows the means and medians of 6,000 local DBC scores
based on various numbers of nearest neighbors. Since the distributions of
these scores are not Gaussian but more like the long-tailed, we use their
Figure 4.31: Increasingly sorted local DBC scores (15-nearest neighbors) from the three models. The upper plot shows the whole range, and the lower plot is zoomed to the range 2k-6k to show the positions of the three curves clearly.
Table 4.15: Statistical Results of local DBC scores on cC1, cC2 and cC3.
4.5.4 Discussion
The main idea of this study is simple and clear: using the adversarial
examples on or near the decision boundary to measure the complexity of
the boundary. It is difficult to define and measure the complexity of a
boundary surface in high dimensions, but easier to measure the complexity
of adversarial example sets. We measure the complexity via the entropy of
eigenvalues of adversarial sets. Other complexity measures for grouped data
are also worth considering [165]. Figure 4.32 shows several adversarial
examples for the cC1 model generated by training images. They look like
mixed cat and dog photos.
To generate the adversarial examples, as Figure 4.23 shows, we use a
pair of real data from different classes. At least one adversarial example is
on the line segment between two data points because the line must cross the
decision boundary at least once. If we use only real data from the training
set, we could evaluate models’ generalizability without using test sets. That
is an advantage when data are limited because we could have more data
for training. However, the disadvantage of this method is the dependence
on real data. The number of adversarial examples that could be generated
depends on the size of the real dataset. Can we generate an adversarial
example x for classifier f by randomly searching for f (x) ≈ 0.5? Maybe, but
it is very difficult in a high-dimensional space. Even to find two data points
a, b whose f (a) ≈ 1, f (b) ≈ 0 is difficult because one of the areas (say f (a) ≈ 1)
would be very small and sparse in the space. Definitely, there are some
other methods to generate adversarial examples, such as the DeepDIG [133]
and applications of the Generative Adversarial Network (GAN).
Smaller local DBC scores are necessary but insufficient conditions for
a simpler decision boundary because a lower complexity adversarial set
may be generated from a higher complexity boundary (Figure 4.33). Hence,
the density of adversarial examples is important. Denser examples have
higher probability to reflect the real condition of boundary. In practice,
more adversarial examples are required to be on the effective segment of
decision boundary, which is not the whole boundary but the part close to
the data. From this aspect, to generate adversarial examples on the line
segments between two data points is an appropriate way to create a dense
adversarial set on the effective segment of decision boundary.
Figure 4.33: Linear adversarial set (green) on a lumpy boundary (black).
A smaller DBC score indicates that the model has a simpler decision boundary and better generalizability on a certain dataset. It is worth noting that the DBC score is meaningless for a single model and cannot be compared across different datasets. The purpose of the DBC score is to compare various models trained on the same dataset. In this study, all
three experiments use two-class datasets. In future work, we will use
multi-class datasets. The multi-class problem could be treated as multiple
two-class problems by one class vs. others.
problem of understanding deep learning.
In recent years, the neural network (deep learning) technique has played
a more and more important role in applications of machine learning. To
comprehensively understand the mechanisms of Neural Network (NN) mod-
els and to explain their output results, however, still require more basic
research [223]. To understand the mechanisms of NN models, that is, the
transparency of deep learning, there are mainly three ways: the training
process [66], generalizability [159], and loss or accuracy prediction [12].
In this study, we create a novel theory from scratch to estimate the
training accuracy for two-layer neural networks applied to random datasets.
Figure 4.34 demonstrates the mentioned two-layer neural network and sum-
marizes the processes to estimate its training accuracy using the proposed
method. Its main idea is based on the regions of linearity represented by
NN models [201], which derives from common insights of the Perceptron.
This study may raise other questions and offer the starting point of a new
way for future researchers to make progress in the understanding of deep
learning. Thus, we begin from a simple condition of two-layer NN models,
and we discuss the use for multi-layer networks in Section 4.6.4.3 as future
works. This study has two main contributions:
4.6.1.1 Preliminaries
3. For the most important step, we perform experiments to test the pro-
posed theory by comparing actual outputs of the system with predicted
outputs. If the predictions are close to the real results, we could accept
(Diagram: inputs xi ∈ [0, 1]^d with labels yi ∈ {0, 1}, i = 1, 2, ..., N, are classified by a d − L − 1 FCNN; given only d, L, and N, our method outputs the estimated training accuracy.)
Figure 4.34: An example of the two-layer FCNN with d − L − 1 architecture.
This FCNN is used to classify N random vectors in Rd belonging to two
classes. Detailed settings are stated before in Section 4.6.1.1. The training
accuracy of this classification can be estimated by our proposed method,
without applying any training process. The detailed Algorithm of our method
is shown in Section 4.6.3.3.
To the best of our knowledge, only a few studies have discussed the
prediction/estimation of the training accuracy of NN models. None of them,
however, estimates training accuracy without using input data and/or
trained models as does our method.
The overall setting and backgrounds of the studies of over-parameterized
two-layer NNs [66, 12] are similar to ours. But the main difference is that they do not estimate the value of training accuracy. One study [66] mainly
shows that the zero training loss on deep over-parametrized networks can be
obtained by using gradient descent. Another study analyzes the generaliza-
tion bound [12] between training and test performance (i.e., generalization
gap). There are other studies [127, 288] to investigate the prediction of the
generalization gap of neural networks. We do not further discuss the gener-
alization gap because we focus on only the estimation of training accuracy,
ignoring the test accuracy.
Unlike our proposed method that does not need to use input data nor
to apply any training process, in recent works related to the accuracy
estimation for neural networks [289, 82, 40, 243], the accuracy prediction
methods require pre-trained NN models or weights from the pre-trained NN
models. Through our method, to estimate the training accuracy for two-layer
FCNN on random datasets (two classes) requires only three arguments: the
dimensionality of inputs (d), the number of inputs (N), and the number of
neurons in the hidden layer (L). The Peephole [260] and TAP [122] techniques
apply Long Short Term Memory (LSTM)-based frameworks to predict a NN
model’s performance before training the original NN model. However, the
frameworks themselves still must be trained by the input data before making
predictions.
In general, the output of the k-th neuron in the first hidden layer is:
sk (x) = σ (wk · x + bk ),
where input x ∈ Rd; parameter wk is the input weight of the k-th neuron and bk is its bias. We define σ(·) as the ReLU activation function, σ(z) = max(0, z). The distance from an input x to the hyperplane defined by the k-th neuron is
dk(x) = |wk · x + bk| / ‖wk‖
If wk · x + bk > 0, the neuron is active and sk(x) = wk · x + bk; otherwise, sk(x) = 0.
For a given input data point, L neurons assign it a unique code: {s1 , s2 ,
· · · , sL }; some values in the code could be zero. L neurons divide the input
space into many partitions, input data in the same partition will have codes
that are more similar because of having the same zero positions. Conversely,
it is obvious that the codes of data in different partitions have different zero
positions, and the differences (the Hamming distances) of these codes are
greater. It is apparent, therefore, that the case of input data separated into
different partitions is favorable for classification.
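A small illustrative sketch of these activation-pattern codes is shown below (weights drawn at random; all names and numbers here are assumptions). It counts how many of N random inputs fall in distinct partitions.

```python
# Sketch: binary activation codes assigned by L hidden ReLU neurons identify partitions.
import numpy as np

def activation_codes(X, W, b):
    """X: (N, d) inputs; W: (L, d) neuron weights; b: (L,) biases. Returns N binary codes."""
    pre = X @ W.T + b                  # pre-activations w_k . x + b_k
    return (pre > 0).astype(int)       # 1 where the ReLU is active, 0 where s_k(x) = 0

rng = np.random.default_rng(0)
d, L, N = 3, 200, 200
X = rng.uniform(0, 1, (N, d))          # N random data points in [0, 1]^d
W = rng.normal(0, 1, (L, d))
b = rng.normal(0, 1, (L,))
codes = activation_codes(X, W, b)
n_distinct = len(np.unique(codes, axis=0))   # points sharing a code share a partition
print(f"{n_distinct} of {N} points have distinct codes (partitions)")
```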
4.6.2.1 Complete separation
Given L neurons that divide the input space into S partitions, we hy-
pothesize that:
Remark. For most real classification problems (i.e., in which labels have
been assigned), complete separation of all data points is a very strong
assumption because adjacent same-class samples assigned to the same par-
tition is a looser condition and will not affect the classification performance.
Because our partitions by complete separation are unlabeled, the prin-
ciple of discriminant analysis (which aims to minimize the within-class
separations and maximize the between-class separations) is not applicable.
Finally, adjacent samples assigned to different partitions could have the
same label and thus define a within-class separation. The partitions we men-
tioned are thus not necessarily the final decision regions for classification;
those will be determined when labels are assigned.
Pc = C(S, N) · N! / S^N = S! / ((S − N)! · S^N)    (4.2)
where C(S, N) is the binomial coefficient.
In other words, Pc is the probability that each partition contains at most
one data point after randomly assigning N data points to S partitions. By
Stirling’s approximation,
S! ≈ √(2πS) (S/e)^S, so
Pc = S! / ((S − N)! S^N) ≈ √(2πS) (S/e)^S / (√(2π(S − N)) ((S − N)/e)^(S−N) S^N),
which simplifies to
Pc ≈ (1/e)^N (S / (S − N))^(S−N+0.5)    (4.3)
lim_{N→∞} Pc = 0 when a = 1
Equation (4.5) shows that for large N, the probability of complete separation is nearly zero when 1 ≤ a < 2, and close to one when a > 2. Only for a = 2 (i.e., S = bN²) is the probability controlled by the coefficient b:
lim_{N→∞} Pc = e^(−1/(2b))    (4.6)
Although complete separation gives lim_{N→∞} Pc = 1 for a > 2, there is no need to incur the exponential growth of S with a when compared to the linear growth with b. And a high probability of complete separation does not require even a large b; for example, when a = 2 and b = 10, lim_{N→∞} Pc ≈ 0.95. Therefore, we let S = bN² throughout this study.
13 A derivation of this simplification is in Appendix C.
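As a quick illustrative check (not part of the original derivation), the probability of complete separation can be estimated by Monte Carlo simulation and compared with the limit e^(−1/(2b)):

```python
# Sketch: Monte Carlo estimate of the complete-separation probability for S = b*N^2 partitions.
import numpy as np

def complete_separation_prob(N, b, trials=20000, rng=np.random.default_rng(0)):
    S = b * N * N
    hits = 0
    for _ in range(trials):
        parts = rng.integers(0, S, size=N)      # assign each of N points to a random partition
        hits += len(np.unique(parts)) == N      # complete separation: all partitions distinct
    return hits / trials

# Example: N = 100, b = 10; the limit e^(-1/(2b)) predicts about 0.95.
print(complete_separation_prob(100, 10), np.exp(-1 / (2 * 10)))
```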
Pinc = S! (S − γN)^((1−γ)N) / ((S − γN)! · S^N)    (4.7)
In other words, Pinc is the probability that at least γN partitions contain only
one data point after randomly assigning N data points to S partitions. When
γ = 1, Pinc = Pc , i.e., it becomes the complete separation, and when γ = 0,
Pinc = 1. Applying Stirling's approximation with S = bN² and N → ∞, similar to Equation (4.6), we have:
lim_{N→∞} Pinc = e^(γ(γ−2)/(2b))    (4.8)
4.6.2.3 Expectation of separation ratio
In fact, Equation (4.8) shows the probability that at least γN data points (when N is large enough) have been completely separated, which implies:
Pinc(x ≥ γ) = e^(γ(γ−2)/(2b))  ⇒
Pinc(x = γ) = dPinc(x < γ)/dγ = d(1 − Pinc(x ≥ γ))/dγ = d(1 − e^(γ(γ−2)/(2b)))/dγ = ((1 − γ)/b) · e^(γ(γ−2)/(2b)) = Pinc(γ)
We notice that the equation Pinc(γ) does not include the probability of complete separation Pc, because Pinc(1) = 0. Hence, Pinc(1) is replaced by Pc, and the comprehensive probability for the separation ratio γ is:
P(γ) = Pc = e^(−1/(2b))  for γ = 1;    P(γ) = ((1 − γ)/b) · e^(γ(γ−2)/(2b))  for 0 ≤ γ < 1    (4.9)
E[γ] = ∫₀¹ γ · P(γ) dγ = 1 · Pc + ∫₀¹ γ · ((1 − γ)/b) · e^(γ(γ−2)/(2b)) dγ  ⇒
E[γ] = (√(2πb)/2) · erfi(1/√(2b)) · e^(−1/(2b))    (4.10)
where erfi(x) is the imaginary error function:
erfi(x) = (2/√π) · Σ_{n=0}^{∞} x^(2n+1) / (n!(2n+1))
Hypothesis 4.2. The separation ratio directly determines the training accuracy.
That is, the γN completely separated data points can be fitted exactly, and the remaining (1 − γ)N points are classified correctly with probability 0.5:
α = (γN + 0.5(1 − γ)N) / N = (1 + γ)/2
E[α] = (1 + E[γ]) / 2    (4.11)
ratio and training accuracy. After replacing E[γ] in Equation (4.11) with Equation (4.10), we obtain the formula to compute the expectation of training accuracy:
E[α] = 1/2 + (√(2πb)/4) · erfi(1/√(2b)) · e^(−1/(2b))    (4.12)
Since
Σ_{i=0}^{d} C(L, i) = O(L^d / d!),
we let:
S = L^d / d!    (4.14)
Figure 4.35 shows that the partition numbers calculated from Equa-
tions (4.13) and (4.14) are very close in 2-D. In high dimensions, both
theory and experiments show that Equation (4.14) is still an asymptotic approximation.
Figure 4.35: Maximum number of partitions in 2-D.
Now, we have introduced our main theory that could estimate the train-
ing accuracy for a d − L − 1 structure FCNN and two classes of N random
(uniformly distributed) data points by using Equations (4.12) and (4.15).
For example, let a dataset have 200 two-class random data samples in R3
(100 samples for each class) and let it be used to train a 3 − 200 − 1 FCNN.
In this case,

$$b = \frac{200^3}{3! \cdot 200^2} \approx 33.33.$$
Substituting b = 33.33 into Equation (4.12) yields E [α] ≈ 0.995, i.e., the ex-
pectation of training accuracy for this case is about 99.5%.
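For readers who want to reproduce this number, a minimal Python sketch of this computation is given below. It uses the ensemble index b = L^d/(d!·N²), as in the example above, together with Equation (4.12); it assumes SciPy's erfi for the imaginary error function, and the function name is ours, not from the original text.

```python
from math import factorial

import numpy as np
from scipy.special import erfi

def expected_training_accuracy(d, N, L):
    """E[alpha] from Eq. (4.12), with the ensemble index b = L^d / (d! * N^2)."""
    b = L**d / (factorial(d) * N**2)
    return 0.5 + np.sqrt(2 * np.pi * b) / 4 * erfi(1 / np.sqrt(2 * b)) * np.exp(-1 / (2 * b))

# The example above: d = 3, N = 200, L = 200 gives about 0.995.
print(expected_training_accuracy(3, 200, 200))
```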
4.6.3 Empirical Corrections
In 2-D, Equation (4.15) gives

$$b = \frac{1}{2}\left(\frac{L}{N}\right)^{2}$$

If L/N = c, b is not changed by N. To test this counter-intuitive inference, we let L = N = {100, 200, 500, 800, 1000, 2000, 5000}. Since L/N = 1, b and E[α] are
unchanged. But Table 4.16 shows the real training accuracies vary with N.
The predicted training accuracy is close to the real training accuracy only
at N = 200 and the real training accuracy decreases with the growth of N.
Hence, our theory must be refined using empirical corrections.
The correction could be applied to either Equation (4.15) or Equation (4.12). We modify Equation (4.15) because the ensemble index it produces has an unbounded range, which is better suited to the fitting process than the bounded range of the accuracy (see Section 4.6.4).
Table 4.16: Accuracy results comparison. The columns from left to right are dimension, dataset size, number of neurons in the hidden layer, the real training accuracy, and the training accuracy estimated by Equation (4.15) and Theorem 4.1.
Figure 4.36: Fitting curve of 1/b = f(N, L) in 2-D.
The fitting process uses the Curve Fitting Tool (cftool) in MATLAB. Figure 4.36 shows the 81 points of {N, L} and the fitted curve. The R² value of the fitting is about 0.998. The reason to fit 1/b instead of b is to avoid b = +∞ when the real accuracy is 1 (which can occur); in that case, 1/b = 0. Conversely, b = 0 when the real accuracy is 0.5, which never appears in our experiments. Using an effective classifier rather than random guessing makes the accuracy > 0.5 (b > 0); thus, 1/b ≠ +∞. To cover the parameter space as completely as possible, we manually chose the 81 points of N and L. We then verified the fitted model, Equation (4.17), on other random values of N and L.
By using Equations (4.12) and (4.17), we estimate training accuracy on
Table 4.17: Estimated training accuracy results comparison in 2-D. The columns from left to right are dataset size, number of neurons in the hidden layer, the real training accuracy, the training accuracy estimated/predicted by Equation (4.17) and Theorem 4.1, and the (absolute) differences between the real and estimated accuracies.
random values of N and L in 2-D. The results are shown in Table 4.17. The differences between real and estimated training accuracies are small, except for the first row. For higher real-accuracy cases (> 0.86), the difference is larger because 1/b < 1 (b > 1 when the accuracy > 0.86), while the effect is smaller in cases with 1/b > 1 during the fitting used to find Equation (4.17).
Table 4.18: Parameters {xd , yd , cd } in Equation (4.16) (Observation 4.1) for
various dimensionalities of inputs are determined by fitting.
d xd yd cd R2
2 0.0744 0.6017 8.4531 0.998
3 0.1269 0.6352 15.5690 0.965
4 0.2802 0.7811 47.3261 0.961
5 0.5326 0.8515 28.4495 0.996
6 0.4130 0.8686 61.0874 0.996
7 0.4348 0.8239 33.4448 0.977
8 0.5278 0.9228 61.3121 0.996
9 0.7250 1.0310 82.5083 0.995
10 0.6633 1.0160 91.4913 0.995
But the growth of x_d/y_d is preserved. From Equation (4.15),

$$\frac{x_d}{y_d} = \frac{d}{2}$$

The ratio x_d/y_d increases linearly with d. The real d vs. x_d/y_d (Figure 4.37) shows the same tendency.
Table 4.18 indicates that x_d, y_d, and c_d increase almost linearly with d. Thus, we apply linear fitting on d–x_d, d–y_d, and d–c_d to obtain these fits:

$$x_d = 0.0758\,d - 0.0349, \quad y_d = 0.0517\,d + 0.5268, \quad c_d = 9.4323\,d - 8.8558 \tag{4.18}$$
Figure 4.37: Plot of d vs. x_d/y_d from Table 4.18. The blue dotted line is a linear fit to the points, showing the growth.
4.6.3.4 Testing
• N ≫ L: N = 10000, L = 1000
• N ≈ L: N = L = 10000
• N ≪ L: N = 1000, L = 10000
The results are shown in Figure 4.38.

[Figure 4.38: three panels ("N ≫ L", "N ≈ L", "N ≪ L"), each plotting Real Acc, Est. Acc, and Diff against the dimensionality of inputs d.]
Algorithm 3: To estimate the training accuracy α for two-layer neural networks on a random dataset without training.
  Input: the dimensionality of inputs d, the number of data points N, the number of neurons in the hidden layer L.
  Result: the expectation of training accuracy E[α].
  // Use Equation (4.18) to calculate the three parameters {x_d, y_d, c_d}.
  1: x_d ← 0.0758 · d − 0.0349;
  2: y_d ← 0.0517 · d + 0.5268;
  3: c_d ← 9.4323 · d − 8.8558;
  // Use Equation (4.16) to compute the ensemble index b.
  4: b ← c_d · L^{x_d} / N^{y_d};
  // Use Equation (4.12) to compute the expectation of training accuracy E[α].
  5: E[α] ← 1/2 + (√(2πb)/4) · erfi(1/√(2b)) · e^{−1/(2b)};
  Output: E[α].
The maximum differences between real and estimated training accuracies are about 0.130 (N ≫ L), 0.076 (N ≈ L), and 0.104 (N ≪ L). There may be two reasons that the differences are not small in some cases of N ≫ L: 1) in the fitting process, we do not have enough samples for which N ≫ L, so the corrections are not perfect; 2) the reason why the differences are greater for higher real accuracies is similar to the 2-D situation discussed above.
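A compact Python sketch of Algorithm 3 (with the empirically corrected ensemble index) is shown below for clarity; it again assumes SciPy's erfi, and the function name is an illustrative choice of ours rather than the original code.

```python
import numpy as np
from scipy.special import erfi

def estimate_training_accuracy(d, N, L):
    """Algorithm 3: estimate E[alpha] for a d-L-1 FCNN on N random two-class points."""
    # Eq. (4.18): linear fits of the correction parameters.
    x_d = 0.0758 * d - 0.0349
    y_d = 0.0517 * d + 0.5268
    c_d = 9.4323 * d - 8.8558
    # Eq. (4.16): empirically corrected ensemble index.
    b = c_d * L**x_d / N**y_d
    # Eq. (4.12): expectation of training accuracy.
    return 0.5 + np.sqrt(2 * np.pi * b) / 4 * erfi(1 / np.sqrt(2 * b)) * np.exp(-1 / (2 * b))
```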
In addition, we estimate training accuracy on 40 random cases. For each
case, the N, L ∈ [100, 20000] and d ∈ [2, 24], but principally we use d ∈ [2, 10]
because in high dimensions, almost all cases’ accuracies are close to 100%
(see Figure 4.38). Figure 4.39 shows the results. Each case is plotted with
its real and estimated training accuracy. The overall R2 value is about 0.955,
indicating good estimation.
Figure 4.39: Evaluation of estimated training accuracy results. The y-axis is the estimated accuracy; the x-axis is the real accuracy; each dot is one case; the red line is y = x. R² ≈ 0.955.
4.6.4 Discussion
It also is based on the conditions of classifier models and datasets stated in Section 4.6.1.1 and on the two hypotheses. There appear to be no other studies that have proposed a method to estimate training accuracy in this way.
Our theory is based on the notion that hidden layers in neural networks perform space partitioning and on the hypothesis that the data separation ratio determines the training accuracy. Theorem 4.1 introduces a mapping function between training accuracy and the ensemble index. The ensemble index, by virtue of its domain (0, ∞), is better suited than accuracy (domain [0.5, 1]) for the computation required by the fitting process; this maintains parity between the input variables' domains and the range of the fitted quantity (the ensemble index). The extended domain is consistent with the domains of N, L, and d, which is convenient for designing prediction models or fitting experimental data. Observation 4.1 provides a calculation of the ensemble index based on empirical corrections, and these corrections successfully improve our model's estimates.
of the N ≫ L cases are smaller than those of the N ≪ L cases, and for specific N and L, the training accuracies for higher dimensionality of inputs are greater than those for lower dimensionality. These characteristics are shown by the estimation curves in Figure 4.38. Although there are large errors for some cases, Figure 4.38 shows similar tendencies for the real and estimated accuracies.
The theorem, Observation 4.1, and the empirical corrections could be improved in the future, along these directions:
4.6.4.3 For Deeper Neural Networks
$$S = \left(\prod_{i=1}^{k-1}\left(\frac{n_i}{n_0}\right)^{n_0}\right)\sum_{i=0}^{n_0}\binom{n_k}{i}$$

where n_0 is the size of the input layer and n_i is the size of the i-th hidden layer. This reduces to Equation (4.13) in the present case (k = 1, n_0 = d, and n_1 = L). The theory for multi-layer neural networks could begin by using the approaches above.
In addition, this study still has several aspects that are worth enhancing or extending, and it raises some questions for future work. For example, the proposed theory could be extended to data distributions other than uniform, to unequal numbers of samples in different classes, and/or to other types of neural networks by modifying the ways the separation probabilities are calculated, such as in Equations (4.2) and (4.7).
4.7 Conclusion
To further evaluate the images generated from GAN models, we propose a novel GAN measure – the Likeness Score (LS) – which can directly analyze the generated images without using a pre-trained classifier and is stable with respect to the number of images. Compared with other methods, such as IS and FID, LS has fewer constraints and wider applications. In particular, LS can explain results in the three main respects of optimal GANs according to our expectations of ideal generated images. Such explanations deepen our understanding of GANs and of other GAN measures, which will help to improve GAN performance.
In addition, we have examined two more basic questions for the CNN and
deep learning models: the generalizability of the Deep Neural Network (DNN)
and how to understand the mechanism of DNN models.
We propose the Decision Boundary Complexity (DBC) score to define and measure the decision boundary complexity of a DNN. The DBC score is computed from the entropy of the eigenvalues of adversarial examples, which are generated on or near the decision boundary, in a feature space of any dimension. Training data and the trained models are used to compute the DBC scores, and test data are used to obtain test accuracies as the ground truth for the models' generalizability. The results verify our hypothesis that a DNN with a simpler decision boundary has better generalizability.
Thus, DBC provides an effective way to measure the complexity of decision
boundaries and its relationship to the generalizability of DNNs.
To understand the mechanism of DNN models, we create a novel theory
based on space partitioning to estimate the approximate training accuracy
for two-layer neural networks on random datasets without training. It
does this using only three arguments: the dimensionality of inputs (d), the
number of input data points (N), and the number of neurons in the hidden
layer (L). The theory has been verified by the computation of real training
accuracies in our experiments. Although the method requires empirical
correction factors, they are determined using a principled and repeatable
approach. The results indicate that the method will work for any dimension,
and it has the potential to estimate deeper neural network models. This
study may raise other questions and suggest a starting point for a new
way for researchers to make progress on studying the transparency of deep
learning and explainable deep learning.
Chapter 5: Deep Learning-Based Medical Image Segmentation
5.1 Introduction
Breast cancer is the second leading cause of cancer death for women in the U.S.
Early detection of breast cancer has been shown to be the key to higher
survival rates for breast cancer patients. We are investigating infrared ther-
mography as a noninvasive adjunct to mammography for breast screening.
Thermal imaging is safe, radiation-free, pain-free, and non-contact. Auto-
mated segmentation of the breast area from the acquired thermal images will
help limit the area for tumor search and reduce the time and effort needed
for manual hand segmentation. Autoencoder-like Convolutional and Deconvolutional Neural Networks (C-DCNN) are a promising computational approach to automatically segmenting breast areas in thermal images. In this study, we apply the C-DCNN to segment breast areas from
our thermal breast image database, which we are collecting in our clini-
cal trials by imaging breast cancer patients with our infrared camera (N2
Imager). For training the C-DCNN, the inputs are 132 gray-value thermal
images and the corresponding manually-cropped breast area images (binary
masks to designate the breast areas). For testing, we input thermal images
to the trained C-DCNN and the output after post-processing are the binary
breast-area images.
Instead of using current DL-based segmentation models (like the UNet
and variants), we then employ a “sneak attack” on segmentation of mam-
mographic images using CNN classifiers. The CNN classifiers can automat-
ically extract important features from images for classification. Those ex-
tracted features can be visualized and formed into heatmaps using Gradient-
weighted Class Activation Mapping (Grad-CAM). This study tested whether
the heatmaps could be used to segment the classified targets. We also pro-
posed an evaluation method for the heatmaps; that is, to re-train the CNN
classifier using images filtered by heatmaps and examine its performance.
We used the mean-Dice coefficient to evaluate segmentation results.
Breast cancer will be diagnosed in about 12% of U.S. women during their lifetime, and it is the second leading cause of cancer death for U.S. women [248, 54]. Early detection of breast cancer via Computer-Aided
Diagnosis (CAD) systems has been shown to improve outcomes of breast
cancer treatment and increase patients’ survival times [215]. If the tumor
is detected and localized early, the 5-year relative survival rate is more
than 94% [52]. Although X-ray mammography is the gold standard for
breast cancer detection, it nevertheless has a substantial false-positive rate,
requires exposure to radiation, often is uncomfortable, and is less effective
1 This work has been published in [C10].
Figure 5.1: Full thermal raw images of two patients, including the neck,
shoulder, abdomen, background and chair.
mostly used histogram analysis, threshold-based techniques, edge-based
techniques, and region-based techniques, for example, edge detection by
Hough transform feature curve (parabola) extraction [208], edge detection by interpolation of curves [237], the snake algorithm [70], detection of edge and boundary curves [131], anisotropic diffusion filter-based edge detection [259], and an automated segmentation algorithm based on ellipse detection (our lab).
Recently, Deep Learning (DL) has become a state-of-the-art method to
segment images. The SegNet [17], for example, was trained to segment
urban street images into parts such as sky, building, road marking, and pavement. For medical-image applications, deep neural networks have been applied to the segmentation of retinal vessels, brain tissues in MRI, and liver lesions
in CT [177, 182, 44]. However, we are aware of no study that has used
DL-based segmentation to segment breast thermograms. This study fills
this gap by providing a DL model to automatically segment the breast area
from the whole thermal breast image.
Figure 5.2: Our breast infrared thermography system.
Figure 5.3: Preprocessing of the raw IR images: (a) original raw IR image, (b) manual rectangular crop to remove shoulders and abdomen, and (c) hand-traced breast contour used to generate the manual segmentation (ground truth).
natural cool down of the breast tissue, and without causing discomfort for
the patient by sitting still for a long period of time. The rationale is that
the surrounding tissue cools faster than the tumor, which increases the
thermal contrast.
Initially, images were cropped manually by removing the upper and lower
regions (neck and abdomen). All breast IR images were converted to 8-bit
gray-scale. Then, a trained student manually traced the breast curvature
and cropped the breast region from the rest of the body to form the ground
truth breast region images for training and testing the segmentation model
(Figure 5.3). In practice, these truth breast region images were set to binary
values, where (in gray-scale) 0 (black) is for background and 255 (white) for
breast areas.
Table 5.1: C-DCNN segmentation architecture for thermal breast images.
Layer Shape
input: gray-scale image 400x200x1
Conv_3-1 + ReLU Normalization 400x200x1
Conv_3-64 + ReLU 400x200x64
MaxPool_2 Normalization 200x100x64
Conv_3-128 + ReLU 200x100x128
MaxPool_2 Normalization 100x50x128
Flatten 640000
FC 200 + ReLU 200
FC 200 + ReLU 200
FC 640000 + ReLU 640000
Reshape to 100x50x128 100x50x128
Normalization Up-sampling 200x100x128
Conv_3-128 + ReLU 200x100x128
Normalization Up-sampling 400x200x128
Conv_3-64 + ReLU Normalization 400x200x64
output: Conv_3-1 + tanh 400x200x1
fully-connected (FC) layers and one flattening layer. The activation function
for each layer is the ReLU function [186] except the last one for the output,
which is the tanh function.
The notation Conv_3-64 means there are 64 convolutional neurons
(units), with each unit having a filter size of 3×3-pixel (height × width)
in the layer. MaxPool_2 is a max-pooling layer with the filter size 2×2-pixel
window, stride 2; up-sampling layers have the same size. FC_200 is a
fully-connected layer containing 200 units. Normalization is the batch
normalization layer, which normalizes the activations of the previous layer
at each batch and helps accelerate deep network training [120]. The output
layer uses the tanh function, which maps the output value to the range of
[-1, 1].
Data shapes from input to output are symmetric. The CNN (encoder)
transforms an image to a 200-length vector (code) and the D-CNN (decoder)
transforms the vector back to an image. The 8-bit gray-scale input images
were scaled from [0, 255] to [-1,1] to match the value range required for
the neural network input. Similarly, the segmented output image from the network is then rescaled back to uint8 [0, 255].
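For concreteness, a minimal Keras sketch that follows the layer sequence of Table 5.1 is given below. It is an illustrative reconstruction: the padding choices, layer names, and any training hyperparameters are our assumptions, not the exact original code.

```python
from tensorflow.keras import layers, models

def build_cdcnn(height=400, width=200):
    inp = layers.Input(shape=(height, width, 1))                 # gray-scale image in [-1, 1]
    x = layers.Conv2D(1, 3, padding="same", activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)                                      # 100 x 50 x 128 = 640000
    x = layers.Dense(200, activation="relu")(x)                  # the 200-length code
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dense(100 * 50 * 128, activation="relu")(x)
    x = layers.Reshape((100, 50, 128))(x)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    out = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)  # output in [-1, 1]
    return models.Model(inp, out)
```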
Experiment 1 Since all image samples were from 11 breast cancer pa-
tients, with 15 samples for each patient, the first experiment randomly
selected 12 samples from each patient for the training set and the remaining
3 samples for the testing set. In total, there are 132 breast infrared images
along with 132 manually segmented regions for training the segmentation
model, and 33 breast infrared images and their segmentations for testing.
[Table: training/testing splits. Experiment 1 — for each of patients 001–011, 12 images for training and 3 for testing. Experiment 2 — training patients (e.g., 001, 002, ...) contribute all 15 of their images to the training set, and held-out patients (e.g., 006, 010, 011) contribute all 15 of their images to the testing set.]
[Figure 5.5: evaluation pipeline — IR breast image → trained seg-model → gray seg-image → Otsu's thresholding → binary seg-image → IoU against the truth regions.]
To compare the predicted images with the truth data, we applied Otsu's algorithm [98] to automatically convert the gray-scale segmentation images to binary segmentation images.
We compared the binary segmentation images with truth region images
by computing their Intersection-over-Union (IoU), also called the Jaccard similarity (see Figure 5.5). The IoU of two binary images is the ratio of the overlapped area to the area of the union. Therefore, for two binary images I₁ and I₂ of the same size, the IoU is:

$$\mathrm{IoU}(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$$
If the two images have the same breast region, IoU will be 1. For all the
testing results, we computed the IoU with the manually segmented ground truth regions to evaluate the C-DCNN segmentation model.
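A short Python sketch of this evaluation step is shown below for clarity; it uses scikit-image's Otsu thresholding, and the helper names are ours rather than the original code.

```python
import numpy as np
from skimage.filters import threshold_otsu

def binarize_otsu(gray_seg):
    """gray_seg: 2-D array of gray values; returns a boolean breast-area mask."""
    return gray_seg > threshold_otsu(gray_seg)

def iou(mask_a, mask_b):
    """IoU (Jaccard similarity) of two boolean masks of the same size."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 1.0
```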
5.2.5 Results
Figure 5.6: The training curve.
[Figure: example test result for patient 011 — IR breast image → trained seg-model → gray seg-image → Otsu's → binary seg-image, compared with the truth regions; IoU = 0.960.]
change over time. The overall average IoU is about 0.9424 with 0.0248
standard deviation.
In Experiment 1, the training and testing sets contain images of the same patients. That is, IR breast images of the same region are in both the training and testing sets. A possible explanation for the good segmentation performance of the C-DCNN might be that it memorized the breast region of each patient but did not learn how to segment breast regions. Therefore, Experiment 2 evaluated the trained segmentation model on IR breast images from patients excluded from training, to avoid such memorization.
Figure 5.9 shows two results from Experiment 2. In the first row, the segmentation model was trained on 10 patients' IR breast images, excluding patient 002 (leave-one-case-out). All 15 test images come from patient 002. The predicted breast region appears to be synthesized from several breast areas learned from other patients, and the segmentation result for patient 002 is not as accurate as the result from Experiment 1; however, the predicted breast area still covers most of the ground truth breast area. The second row shows a better example, for patient 007.
Figure 5.8: Results of Experiment 1 (overall average IoU 0.9424, standard deviation 0.0248). The blue dots are the average IoU for each patient (x-axis: subject ID), and the bars show the range among the 3 testing samples.
[Figure 5.9: two example results from Experiment 2 — patient 002, IoU = 0.774; patient 007, IoU = 0.864.]
Figure 5.10: Results of Experiment 2 (overall average IoU 0.8340, standard deviation 0.0809). The blue dots are the average IoU of each patient among its 15 testing samples, the red lines are the medians, and the bars show the ranges.
Figure 5.10 shows that the average IoUs in most cases are better than 0.8, and the overall average IoU is about 0.8340 with a 0.0809 standard deviation. This is relatively high considering the wide variety of breast shapes and contours among different patients. A low average IoU for a case means its breast region (shape and contour) is quite different from the other cases in the training set, whereas higher-IoU cases have similar breast areas in the training set (see the Discussion section).
5.2.6 Discussion
the breast region in other samples from the same patient (Figure 5.11).
In the top part of Figure 5.11, one IR breast image of patient 001 (p.001)
was input to two trained segmentation models: one model had been trained
without p.001’s samples (Experiment 2) and another one had been trained
with some of p.001’s samples (different from the input one) (Experiment 1).
Both the outputs and the IoUs demonstrate that the outcomes of training with or without the same patient's samples can differ greatly, with the results from Experiment 1 (with the same patient's samples) being better. For p.001, the output from Experiment 1 looks very similar to the ground truth region; however, the predicted segmentation area from Experiment 2 looks like the breast regions of other patients used in the training set.
In contrast, in the bottom part of Figure 5.11, the segmentation outputs for p.009 from the two experiments are very similar. This is because another patient (p.010) has an IR breast region (breast shape and contour) similar to p.009's, so the segmentation model was trained on similar-looking breast regions. It is not surprising that if the training samples include breast
images very similar to the test image, the segmentation outcomes become
better. Such results indicate that including more IR breast images of various
breast shapes and sizes in the training process of the C-DCNN segmentation
model will greatly improve overall performance.
Both Otsu’s thresholding and IoU computation used in this study have
limitations. Although Otsu’s thresholding converts gray-scale images to
binary automatically, it cannot guarantee optimal segmentation. IoU is
used to compare two binary regions, but the results are affected by image size.
There could exist multiple ways to segment the breast region from IR im-
Figure 5.11: Comparison of results from the two experiments (first row of each part: Experiment 2, second row: Experiment 1). The columns show the IR breast image, the Gray seg-image (model output), and the ground truth breast region of the patient's testing sample. Top part: p.001 (IoU = 0.778 for the model trained without p.001); bottom part: p.009 (IoU = 0.895 for the model trained without p.009).
ages, suggesting some limitations to the manual hand segmentation. For
instance, a predicted segmentation area with a low IoU value from the C-DCNN segmentation model might still be a reasonable way to segment the breast
region even if it does not match the manual segmentation. Hence, a better
evaluation metric needs to be developed to assess the quality of the breast
segmentation by our developed model. In future studies, we will consider
applying other thresholding and region or contour comparison methods.
For future work, one approach to improve outcomes is to combine deep-learning-based segmentation with other methods for pre- and post-processing, such as contrast-limited adaptive histogram equalization (CLAHE). Since histogram equalization changes the image globally, its effect may not be achievable by a CNN alone, because convolutional operations are localized and act as various local image filters. From Experiment 2 we know that
more varieties and number of training images could benefit the C-DCNN seg-
mentation model, thus we will collect more patients and volunteers’ samples
for future training. Also, we could train other deep-learning segmentation
models, such as SegNet [17] or U-Net [221].
in their related published/pre-printed papers that can be found in the
references. Here, we will not show more details about their architectures
because they are not highly relevant to our main topic.
In our segmentation studies, besides the IoU, we apply another measurement metric – the Tanimoto similarity [219]. To evaluate the performance of segmentation, we need a method to compare the segmented region with the ground truth. Since we applied the sigmoid function to activate the final convolutional operator in the DL-based segmentation models, the output is a gray-level image whose values lie in the range [0, 1]. Therefore, we must threshold it before calculating accuracy. Usually, thresholding a grayscale image to binary (binarization) [241], as with Otsu's method [98] used in this study, introduces additional errors.
In previous studies, we chose IoU as the measurement metric; it compares two binary images as two sets A and B, and their IoU value is:

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
For binary images, IoU compares images by the union and intersection operations. The intersection operation can be considered a sum of products. For two sets A and B:

$$|A \cap B| = \sum_i a_i b_i \tag{5.1}$$

$$|A \cap A| = \sum_i a_i^2$$

And,

$$|A \cup B| = |A| + |B| - |A \cap B| = \sum_i \left(a_i^2 + b_i^2 - a_i b_i\right)$$

For gray-to-gray comparison, following IoU(A, B), the value of the Tanimoto similarity [219] is:

$$T(A, B) = \frac{\sum_i a_i b_i}{\sum_i \left(a_i^2 + b_i^2 - a_i b_i\right)}$$
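As an illustration (not the original code), the gray-to-gray Tanimoto similarity can be computed directly from the model output and the truth mask without binarization; a minimal Python sketch is:

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity of two gray-value arrays in [0, 1] with the same shape."""
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    num = np.sum(a * b)
    den = np.sum(a * a + b * b - a * b)
    return num / den if den > 0 else 1.0
```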
Figure 5.12: The size and object-area-ratio changes of images. We change image size by down-sampling, and we change the object-area ratio by adding a blank margin around the object and then down-sampling to keep the same size.
5.3.1 Introduction
Figure 5.13 shows an example of how the heatmap from the Grad-CAM
method segments the targets. We input an image of African elephants
(Figure 5.13a) into the Xception [43] neural network model pre-trained on the ImageNet database [227]. The pre-trained Xception model can classify images into 1000 target classes4. Its top-1 prediction result of the
input image (Figure 5.13a) is ‘African_elephant’. Then, we apply Grad-CAM
to show the heatmap of this prediction; results are shown in Figure 5.13.
Finally, the input image filtered by the heatmap mask can be considered a segmentation of the targets from the background (grass and sky).
This example shows that we can achieve a segmentation result without
training a segmentation model but by using a trained classifier. In this
study, we applied this method to mammographic images for breast tumor
segmentation. We used breast tumor images from the DDSM database and
various CNN-based classifier models (e.g., Xception). Since DDSM describes
the location and boundary of each abnormality by a chain-code, we were
able to extract the true segmentations of tumor regions. We used the regions
of interest instead of entire images to train CNN classifiers. After training
the two-class (with- or without-tumor) classifier, we applied Grad-CAM to
the classifier for tumor region segmentation. We expect that this will be a
beneficial method for general medical image segmentation; e.g., we could
segment breast cancer areas by re-using a classifier developed for breast
cancer detection without training a new segmentation model from scratch.
This study – segmenting breast tumors by re-using trained classifiers – is inspired by applications of explainability in medical imaging. A main category of explainability methods is attribution-based methods,
4 https://image-net.org/challenges/LSVRC/2014/browse-synsets
which are widely used for interpretability of deep learning [252]. The com-
monly used algorithms of attribution-based methods for medical images are
saliency maps [170], activation maps [275], CAM [308]/Grad-CAM [240],
Gradient [71], SHAP [296], et cetera.
In this study, we examined how the attribution maps (the heatmaps)
generated from the Grad-CAM algorithm can contribute to the segmentation
of breast tumors. The visualization of the class-specific units [307, 308] for
CNN classifiers is used to locate the most discriminative components for
classification in the image. The authors of Grad-CAM also evaluated the
localization capability of Grad-CAM [240] by bounding boxes containing the
objects. But those methods provide coarse boundaries around the targets,
and these studies have not provided further quantitative analysis about
the differences between predicted and real boundaries of the targets. Thus,
they are considered to be methods for localization rather than segmentation.
For weakly-supervised image segmentation [143, 240], CAM/Grad-CAM’s
heatmaps can be computed and combined with other segmentation models,
such as UNet CNNs [191, 211], to improve segmentation performance. We
are aware of no similar study that has applied only the Grad-CAM algorithm, with trained CNN classifiers, to a specific application of medical image segmentation, without using any other segmentation models or approaches.
We quantitatively analyzed the differences between predicted boundaries
from Grad-CAM and real boundaries of the targets, and we discussed the
relationships between the performance of segmentation and classification
based on the CNN classifier.
5.3.2 The Grad-CAM Method
To compute the class activation maps (CAM), Zhou et al. [308] proposed
to insert a global average pooling (GAP) layer between the last convolutional
layer (feature maps) and the output layer in CNNs. The size of each 2-D
feature map is [x, y], and a value in the k-th feature map is fi,k j . Suppose the
last convolutional layer contains n feature maps, then the GAP layer will
have n nodes for each extracted feature map. From the definition, the value
of the k-th node is the average value of the k-th feature map:
1
Fk = ∑ fi,k j (5.2)
xy i∈{1,2,··· ,x}
j∈{1,2,··· ,y}
After inserting the GAP layer, obtaining the weights of the connections from the GAP layer to the output layer requires re-training the whole network with the training data. $w^c_k$ is the weight of the connection from the k-th node in the GAP layer to the c-th node in the output layer (for the c-th class). Thus, the CAM of the c-th class is:

$$\mathrm{CAM}_c[i, j] = \sum_{k=1}^{n} w^c_k \cdot f^k_{i,j} \tag{5.3}$$

and the class score is

$$Y^c = \sum_{k=1}^{n} w^c_k \cdot F^k \tag{5.4}$$
229
By taking the partial derivative of $Y^c$ with respect to $F^k$:

$$\frac{\partial Y^c}{\partial F^k} = w^c_k \tag{5.5}$$

$$\frac{\partial Y^c}{\partial f^k_{i,j}} = \frac{\partial Y^c}{\partial F^k} \cdot \frac{\partial F^k}{\partial f^k_{i,j}} \tag{5.6}$$

$$\frac{\partial F^k}{\partial f^k_{i,j}} = \frac{1}{xy} \tag{5.7}$$

$$w^c_k = \frac{\partial Y^c}{\partial f^k_{i,j}} \cdot xy \tag{5.8}$$
Finally, putting Equations (5.3) and (5.8) together, they find a way to compute the CAM for the c-th class without actually inserting and training a GAP layer; thus, it is called Grad-CAM:

$$\text{Grad-CAM}_c[i, j] = xy \cdot \sum_{k=1}^{n} \frac{\partial Y^c}{\partial f^k_{i,j}} \cdot f^k_{i,j} \tag{5.9}$$
The size of the CAM (Grad-CAM) equals the size of the feature maps, [x, y], which is usually smaller than the size of the input images. For comparison, CAMs are commonly resized (enlarged) to the same size as the input images.
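For reference, a common Keras/TensorFlow-style sketch of Grad-CAM is shown below. It uses the gradient of the class score with respect to the last convolutional feature maps, averaged over the spatial dimensions, which is proportional to Equation (5.9) when a GAP layer follows the last convolutional layer. The layer name, the ReLU step, and the normalization are assumptions of this sketch, not taken from the text above.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index):
    """model: trained CNN classifier; image: preprocessed input of shape (1, H, W, C)."""
    # Model that returns both the last conv feature maps and the class scores.
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        feature_maps, preds = grad_model(image)
        class_score = preds[:, class_index]               # Y^c
    grads = tape.gradient(class_score, feature_maps)      # dY^c / df^k_ij
    weights = tf.reduce_mean(grads, axis=(1, 2))          # spatially averaged gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * feature_maps, axis=-1)
    cam = tf.nn.relu(cam)[0].numpy()                      # keep positive evidence only
    return cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```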
Figure 5.14: Flowchart of Experiment #1 (components include the left-breast image, the tumor mask, and the true boundary). The true boundaries of the tumor regions in abnormal ROIs are provided by the DDSM database.
In Experiment #1, we will use ROIs to train two-class (with/without tumors) CNN classifiers. ROIs with tumors are called abnormal ROIs and ROIs without tumors are called normal ROIs. After training the two-class classifier using these normal and abnormal ROIs, we will apply Grad-CAM with the classifier to the test abnormal ROIs to segment tumor regions. Then, we will use the true boundaries of the test abnormal ROIs to evaluate the segmentation results. Figure 5.14 shows the flowchart of this experiment. The goal of Experiment #1 is to verify how well the medical targets are segmented by a trained classifier using the Grad-CAM algorithm.
By using the Grad-CAM algorithm, the trained CNN classifiers can gen-
erate CAMs from both normal and abnormal ROIs. These CAMs can be considered masks that indicate the areas that are important to classification. In the second experiment, we trained CNN classifiers from scratch using only the information in those areas; the training data are ROIs filtered by CAMs. This is an evaluation method for the CAMs: re-train the CNN classifiers using images filtered by CAMs and examine their performance. Combined with Experiment #1, the steps of Experiment #2 are (Figure 5.15):
• To train two-class classifiers with normal and abnormal ROIs.
[Figure 5.15 diagram: ROIs → CNN classifiers → CAMs; ROIs × CAMs → CAM-filtered ROIs.]
Figure 5.15: Flowchart of Experiment #2. The normal and abnormal ROIs are used twice: to train the CNN classifiers and then to generate CAMs from the trained classifiers using the Grad-CAM algorithm. The CNN classifiers to be trained on the CAM-filtered ROIs are the same CNN models (same structures) as those trained on the original ROIs, but they are trained from scratch again.
We also create tumor-mask-filtered ROIs, in a way similar to generating the CAM-filtered ROIs. For abnormal ROIs, we multiply the ROIs by their corresponding tumor masks so that only the tumor areas are kept and the background (non-tumor area) is removed (pixel values = 0). For normal ROIs, since there is no tumor area, we multiply the ROIs by randomly selected tumor masks from abnormal cases, so that normal and abnormal ROIs have similar shapes (outlines).
In this study, we also use the mammographic images from the Digi-
tal Database for Screening Mammography (DDSM) [104] as introduced in
Section 4.3.1.
We first downloaded mammographic images from the DDSM database and cropped the Regions of Interest (ROIs) using the given abnormal areas as ground truth information. Images in DDSM are compressed in LJPEG format. To decompress and convert these images, we used the DDSM Utility [246], and we converted all images in DDSM to PNG format. DDSM describes the location and boundary of each actual abnormality by chain codes, which are recorded in OVERLAY files for each breast image containing abnormalities. The DDSM Utility also provides a tool to read the boundary information and display it for each image having abnormalities. Since the DDSM Utility tools run on MATLAB, we implemented all pre-processing tasks in MATLAB.
We used the ROIs instead of entire images to train CNN classifiers. These
ROIs are cropped rectangle-shape images (Figure 5.16) and obtained by:
• For abnormal ROIs from images containing abnormalities, they are the
minimum rectangle-shape areas surrounding the whole given ground
truth boundaries with padding.
• For normal ROIs, they were cropped on the other side of a breast
Figure 5.16: An ROI (left) cropped from an original DDSM image (right). The red boundary shows the tumor area. The ROI is larger than the tumor area because of padding.
having an abnormal ROI; the normal ROI had the same size (with padding) and location as the abnormal ROI but on the other breast side. If both the left and right breasts had abnormal ROIs and their locations overlapped, we discarded the sample. In most cases, only one side of the breast has a tumor, and the area and shape of the left and right breasts are similar; thus, normal ROIs and abnormal ROIs have similar black background areas and scaling.
The padding is added to all ROIs in order to vary the locations of tumors
in abnormal ROIs and to avoid an excessive proportion of the tumor area in an ROI. ROIs are larger than the tumor areas because of the padding. As shown in Figure 5.17, the padding is added with some randomness and depends on the size of the tumor:
• Width: randomly adding 10%–30% of the tumor width on the left and right sides.
• Height: randomly adding 10%–30% of the tumor height on the top and bottom sides.
Figure 5.17: The padding added to the four sides of an ROI is random (0.1w–0.3w horizontally and 0.1h–0.3h vertically) and depends on the size (w × h) of the tumor area.
After collecting the ROIs, as shown in Figure 5.18, we have normal ROIs and abnormal (tumor) ROIs for classification (using binary labels), and we also have the real tumor masks for segmentation.
Figure 5.18: Examples of ROIs: a normal ROI, an abnormal (tumor) ROI with its true boundary, and the corresponding tumor mask. The tumor mask is a binary image created from the tumor ROI and the true boundary of the tumor area.
We tested six CNN classifiers: NASNetMobile [314], MobileNetV2 [233], DenseNet121 [118], ResNet50V2 [102], Xception [43], and InceptionV3 [266]. Except for the ROI cropping, the experiments are implemented in Python.
Our dataset has 325 abnormal (tumor) ROIs and 297 normal ROIs in total. To train the CNN classifiers, we divide the dataset into 80% for training and 20% for validation. The deep-learning framework is Keras5. Every CNN model is trained for about 200 epochs with an EarlyStopping6 setting. The classifier model having the best validation accuracy during each training run was saved.
By inputting an abnormal ROI into a trained CNN classifier and applying the Grad-CAM algorithm, we obtain a CAM for that ROI. Then, we resized the CAM to the same size as the input ROI. The CAMs are gray-value images, and the truth tumor masks are binary images. Thus, we applied the mean-Dice metric to compare CAMs and tumor masks.
The Dice coefficient [60, 255] of two binary images A and B is:

$$\mathrm{Dice}(A, B) = 2 \times \frac{|A \cap B|}{|A| + |B|}$$
Calculating the Dice coefficient requires both images to be binary; thus, we need to transform the CAMs from gray values ([0, 255]) to binary ({0, 1}). Suppose B is the CAM; it can be binarized by setting a threshold t: B_t(B > t) = 1 and B_t(B ≤ t) = 0, where B_t is the binarized CAM. Then, the mean-Dice metric is defined as:
$$\text{mean-Dice}(A, B) = \frac{1}{256} \sum_{t=0}^{255} \mathrm{Dice}(A, B_t) \tag{5.10}$$
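A small Python sketch of Equation (5.10) is shown below for clarity; the helper names are ours, and the CAM is assumed to be an 8-bit array after resizing.

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two boolean masks with the same shape."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total > 0 else 1.0

def mean_dice(truth_mask, cam_gray):
    """truth_mask: boolean array; cam_gray: uint8 CAM with values in [0, 255]."""
    return np.mean([dice(truth_mask, cam_gray > t) for t in range(256)])
```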
Table 5.2: Result of Experiment #1. Descending sort by val_acc.
CNN classifier in Table 5.2. The averaged mean-Dice is calculated using all 325 abnormal (tumor) ROIs. Figure 5.19 shows the CAMs of one tumor ROI generated by the trained CNN classifiers using the Grad-CAM algorithm.
As shown in the results, the CAMs from Xception overlap the largest portion of the true tumor masks, but the CAMs from DenseNet121 and MobileNetV2 almost do not cover the true tumor regions. As Figure 5.19 shows, the heatmaps (CAMs) of these two classifiers highlight the corners and outer areas of the images instead of the tumor regions. Although the CAMs from DenseNet121 and MobileNetV2 have very small Dice values with the true tumor areas, the models still have good classification performance. Thus, the result leads to two questions:
by Dice. We plot the Dice and CAM_val_acc for the six CNN classifiers in Figure 5.20.
As shown in Figure 5.20, in general, training on ROIs filtered by CAMs that cover more of the tumor areas (higher Dice values) leads to better classification performance (CAM_val_acc). The InceptionV3 model is an exception: its Dice is smaller than NASNetMobile's, but it has a higher CAM_val_acc than NASNetMobile. The reason may be that InceptionV3 has a better classification capability than NASNetMobile because 1) the parameters in InceptionV3 are
Table 5.3: Result of Experiment #2. Descending sort by Dice.
Figure 5.20: Plots of Dice and CAM_val_acc for the six CNN classifiers in Table 5.3.
5.3.6 Discussion
Figure 5.22: Some tumor ROIs and their CAMs from Xception.
classifier models. As shown by DenseNet121 in Table 5.3 and Figure 5.19, its CAMs have very small Dice values with the true tumor areas, but the model still has good classification performance. This implies that the dark regions in the CAMs also contribute to classification.
This study may raise other questions and suggest starting points for future studies to make progress in the understanding of deep learning. Grad-CAM is not the only method to generate a heatmap that reflects the basis of classification. In future work, we will test other techniques, such as the saliency map [170], SHAP [296], and the activation map [275], to create segmentations and compare them. We found that the dark regions in CAMs from Grad-CAM also contribute to classification; thus, we wonder whether other techniques could avoid this drawback.
Since the performance of Grad-CAM depends on classifier models, we
would ask:
• How do the bottom layers (fully-connected layers, layers after the last
convolutional layer) in CNN models affect the CAMs?
5.4 Conclusion
Chapter 6: Conclusions and Future Work
Table 6.1: My contributions (citations in brackets) in the four summarized projects regarding complexity and learnability, the two important components of explainable machine learning (XAI).

Transparent Deep Learning/Machine Learning — Complexity → Data: Create the Distance-based Separability Index (DSI) to measure the separability of datasets. [J1] Learnability → Models: Create the Decision Boundary Complexity (DBC) measure to analyze the generalizability of deep learning models [C5] and develop a theory to estimate the training accuracy for two-layer neural networks applied to random datasets, to understand the mechanisms of deep neural networks. [Section 4.6]

Hyper-Spectral Image-Based Cardiac Ablation Lesion Detection — Complexity → Data: Apply DSI as an effective internal Cluster Validity Index (CVI) to evaluate clusters. [C4] Learnability → Models: Apply k-means clustering to detect lesions from hyperspectral images and reduce the number of spectral bands (by grouping them) without significantly affecting detection accuracy. [J5]

Applications of Transfer Learning and the Generative Adversarial Network in Breast Cancer Detection — Complexity → Data: Create the Likeness Score (LS) (a variety of DSI) to evaluate the performances of GANs by directly analyzing their generated images without using a pre-trained classifier. [J2] Learnability → Models: Show that adding GAN-generated images makes the training of CNNs from scratch successful and improves CNNs' performances. [J3]

Deep Learning-Based Medical Image Segmentation — Complexity → Data: Test the reverse process to approach the segmentation problem for mammograms using pre-trained Convolutional Neural Network (CNN) classifiers, because the complexity of medical images demands new approaches to segmentation. [C2] Learnability → Models: Demonstrate the capability of the Convolutional and Deconvolutional Neural Network (C-DCNN) to learn essential features of breast regions and delineate them in thermal images. [C10]
I have applied Deep Learning (DL)-based methods to detect breast cancer
from mammograms. Since training a Convolutional Neural Network (CNN)
from scratch is not feasible for a limited number of labeled mammographic
images, I show that using transfer learning in CNN is a promising solution for
breast cancer detection and the Generative Adversarial Network (GAN) can
be used as an image augmentation method for training and to improve the
performance of CNN classifiers. In terms of explainable DL, to further study
DL-models, I propose a novel GAN measure – Likeness Score (LS) – based
on the DSI to evaluate the images generated from GAN models, propose
the Decision Boundary Complexity (DBC) score to define and measure
the generalizability of the Deep Neural Network (DNN), and create a novel
theory based on space partitioning to estimate the approximate training
accuracy for two-layer neural networks. All were developed to reveal the
mechanism of DNN models. These studies may raise other questions and
suggest starting points for new ways for researchers to make progress on
studying the transparency of deep learning and explainable deep learning.
I have applied Deep Learning (DL)-based methods to medical image
segmentation. My studies demonstrate the capability of the Convolutional
and Deconvolutional Neural Network (C-DCNN) to learn essential features of
breast regions and delineate them in thermal images; further, the C-DCNN
can segment breast regions. Then, I test whether the heatmaps extracted
from trained classifiers (e.g., using Grad-CAM) could be applied to segment
the objects. Results indicate that using only the Grad-CAM heatmaps of trained two-class CNN classifiers may not be an optimal approach to segmentation; instead, combining Grad-CAM with other segmentation methods could be a promising direction.
Based on the research presented in this dissertation, some future works
in medical image analysis could be:
through each hidden layer in neural networks. Data separability may
provide another perspective to understand how neural networks work.
In general, we want to understand what DL-models learn. For example,
for a CNN classifier, does it learn or extract the patterns for classifi-
cation from training data or just memorize the training data? For a
specific DL-model, we could try to define the learned patterns, find
evidence or ways to determine either that the DL-model learns patterns
from training data or records the data, and study how the DL-model
keeps and reuses such information for recognition/classification.
As new methods from Machine Learning (ML) and Deep Learning (DL)
have been applied to medical image analysis for detection, classification, and
segmentation, there has been, since 2016, a parallel interest in explainable ML and DL (i.e., XAI), and more and more research is being focused on this issue. Although ML and DL have achieved notable results in the laboratory,
they have not been deployed significantly in the clinic because of the lack
of explainability. In addition to the technical issues in XAI being studied
by researchers and engineers, it is important, for the reasons of respon-
sibility and reliability, to involve physicians, regulators, and patients in
the principled approach to defining and realizing explainability for medical
applications.
List of Publications
Journal Articles
[J1] S. Guan and M. Loew, “A novel intrinsic measure of data separa-
bility”, Applied Intelligence, 2022, in press. doi:10.1007/s10489-
022-03395-6.
Conference Papers
[C1] A. Lou, S. Guan, H. Ko, and M. Loew, “Caranet: Context axial re-
verse attention network for segmentation of small medical objects”,
in Medical Imaging 2022: Image Processing, International Society
for Optics and Photonics, vol. 12032, SPIE, 2022, pp. 81–92. doi:
10.1117/12.2611802.
[C3] A. Lou, S. Guan, and M. H. Loew, “DC-UNet: rethinking the U-Net
architecture with dual channel efficient CNN for medical image
segmentation”, in Medical Imaging 2021: Image Processing, Inter-
national Society for Optics and Photonics, vol. 11596, SPIE, 2021,
pp. 749–759. doi: 10.1117/12.2582338.
[C4] S. Guan and M. Loew, “An internal cluster validity index using a
distance-based separability measure”, in 2020 IEEE 32nd Interna-
tional Conference on Tools with Artificial Intelligence (ICTAI), 2020,
pp. 827–834. doi: 10.1109/ICTAI50040.2020.00131.
[C11] S. Guan, M. Loew, H. Asfour, N. Sarvazyan, and N. Muselimyan,
“Lesion detection for cardiac ablation from auto-fluorescence hyper-
spectral images”, in Medical Imaging 2018: Biomedical Applications
in Molecular, Structural, and Functional Imaging, vol. 10578, SPIE,
2018, pp. 389–403. doi: 10.1117/12.2293652.
Bibliography
[4] Zighed Djamel A., Lallich Stéphane, and Muhlenbach Fabrice. Sep-
arability index in supervised learning. Lecture Notes in Computer
Science, pages 475–487. Springer Berlin Heidelberg, 2002.
[5] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon
Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,
Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiao-
qiang Zheng. Tensorflow: Large-scale machine learning on hetero-
geneous distributed systems. arXiv:1603.04467 [cs], 3 2016. arXiv:
1603.04467.
[6] Amina Adadi and Mohammed Berrada. Peeking Inside the Black-Box:
A Survey on Explainable Artificial Intelligence (XAI). IEEE Access,
6:52138–52160, 2018.
editors, Proceedings of the 34th International Conference on Machine
Learning, volume 70 of Proceedings of Machine Learning Research,
pages 214–223, International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.
[11] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein
gan. arXiv:1701.07875 [cs, stat], 1 2017. arXiv: 1701.07875.
[12] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang.
Fine-grained analysis of optimization and generalization for overpa-
rameterized two-layer neural networks. In The 36th International
Conference on Machine Learning (ICML), volume 97 of Proceedings of
Machine Learning Research, pages 322–332. PMLR, 2019.
[13] Aruna Arujuna, Rashed Karim, Dennis Caulfield, Benjamin Knowles,
Kawal Rhode, Tobias Schaeffter, Bernet Kato, C Aldo Rinaldi, Michael
Cooklin, Reza Razavi, et al. Acute pulmonary vein isolation is achieved
by a combination of reversible and irreversible atrial injury after
catheter ablation: evidence from magnetic resonance imaging. Circu-
lation: Arrhythmia and Electrophysiology, 5(4):691–700, 2012.
[14] Esmaeil Atashpaz-Gargari, Chao Sima, Ulisses M. Braga-Neto, and
Edward R. Dougherty. Relationship between the accuracy of classi-
fier error estimation and complexity of decision boundary. Pattern
Recognition, 46(5):1315–1322, 5 2013.
[15] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto
Maki, and Stefan Carlsson. From generic to specific deep representa-
tions for visual recognition. page 36–45, 2015.
[16] André Ricardo Backes, Dalcimar Casanova, and Odemir Martinez
Bruno. Color texture analysis based on fractal descriptors. Pattern
Recognition, 45(5):1984–1992, 5 2012.
[17] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep
convolutional encoder-decoder architecture for image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(12):2481–2495, 12 2017.
[18] Mihalj Bakator and Dragica Radosav. Deep Learning and Medical
Diagnosis: A Review of Literature. Multimodal Technologies and Inter-
action, 2(3):47, August 2018.
[19] Pierre Baldi. Autoencoders, unsupervised learning, and deep archi-
tectures. page 37–49, 2012.
[20] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and
Hayit Greenspan. Chest pathology detection using deep learning with
non-medical training. page 294–297. IEEE, 2015.
[21] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser,
Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia,
Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila,
and Francisco Herrera. Explainable artificial intelligence (xai): Con-
cepts, taxonomies, opportunities and challenges toward responsible
ai. Information Fusion, 58:82–115, 6 2020.
[24] Lei Bi, Jinman Kim, Ashnil Kumar, Dagan Feng, and Michael Fulham.
Synthesis of positron emission tomography (pet) images via multi-
channel generative adversarial networks (gans). In Molecular Imaging,
Reconstruction and Analysis of Moving Body Organs, and Stroke Imag-
ing and Treatment, Lecture Notes in Computer Science, pages 43–51.
Springer, Cham, 9 2017. DOI: 10.1007/978-3-319-67564-0_5.
[26] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister.
Sliced and Radon Wasserstein Barycenters of Measures. Journal of
Mathematical Imaging and Vision, 51(1):22–45, January 2015.
[27] Tiago B. Borchartt, Aura Conci, Rita C. F. Lima, Roger Resmini, and
Angel Sanchez. Breast thermography from an image processing view-
point: A survey. SIGNAL PROCESSING, 93(10, SI):2785–2803, 10
2013.
[28] Wener Borges Sampaio, Edgar Moraes Diniz, Aristófanes Corrêa Silva,
Anselmo Cardoso de Paiva, and Marcelo Gattass. Detection of masses
in mammogram images using cnn, geostatistic functions and svm.
Computers in Biology and Medicine, 41(8):653–664, 8 2011.
[29] Ali Borji. Pros and cons of gan evaluation measures. Computer Vision
and Image Understanding, 179:41–65, 2 2019.
7 2016. 2016 International Joint Conference on Neural Networks
(IJCNN), IEEE.
[33] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster
analysis. Communications in Statistics-theory and Methods, 3(1):1–27,
1974.
[39] Tong Che, Yanran Li, Athul Jacob, Yoshua Bengio, and Wenjie Li.
Mode Regularized Generative Adversarial Networks. November 2016.
[40] Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang,
and Ming Yan. Practical accuracy estimation for efficient deep neural
network testing. ACM Trans. Softw. Eng. Methodol., 29(4), October
2020.
[41] SA Chen, MH Hsieh, CT Tai, CF Tsai, VS Prakash, WC Yu, TL Hsu,
YA Ding, and MS Chang. Initiation of atrial fibrillation by ectopic beats
originating from the pulmonary veins : Electrophysiological charac-
teristics, pharmacological responses, and effects of radiofrequency
ablation. Circulation, 100(18):1879–1886, 11 1999.
[46] Francesco Ciompi, Bartjan de Hoop, Sarah J. van Riel, Kaman Chung,
Ernst Th Scholten, Matthijs Oudkerk, Pim A. de Jong, Mathias Prokop,
and Bram van Ginneken. Automatic classification of pulmonary peri-
fissural nodules in computed tomography using an ensemble of 2d
views and a convolutional neural network out-of-the-box. Medical
image analysis, 26(1):195–202, 2015.
[47] Uri Cohen, SueYeon Chung, Daniel D. Lee, and Haim Sompolinsky.
Separability and geometry of object manifolds in deep neural networks.
Nature Communications, 11(1):746, December 2020.
[48] Dabboor, Stephen Howell, Shokr, and J.J. Yackel. The jef-
fries–matusita distance for the case of complex wishart distribution
as a separability criterion for fully polarimetric sar data. International
Journal of Remote Sensing, 35, 10 2014.
[49] Wei Dai, Joseph Doyle, Xiaodan Liang, Hao Zhang, Nanqing Dong,
Yuan Li, and Eric P. Xing. Scan: Structure correcting adversarial
network for organ segmentation in chest x-rays. arXiv:1703.08770
[cs], 3 2017. arXiv: 1703.08770.
[50] David L. Davies and Donald W. Bouldin. A cluster separation measure.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-
1(2):224–227, 4 1979. doi:10.1109/TPAMI.1979.4766909.
[51] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
Imagenet: A large-scale hierarchical image database. pages 248–255.
2009 IEEE Conference on Computer Vision and Pattern Recognition,
6 2009. ISSN: 1063-6919.
[52] Carol DeSantis, Rebecca Siegel, and Ahmedin Jemal. Breast cancer
facts & figures 2015-2016. page 44.
[56] Sanjay Deshpande, John Catanzaro, and Samuel Wann. Atrial fibrilla-
tion: Prevalence and scope of the problem. Cardiac Electrophysiology
Clinics, 6(1):1–4, 3 2014. PMID: 27063816.
[58] Dua Dheeru and E. Karra Taniskidou. Uci machine learning repository.
2017.
[61] F. T. de Dombal, D. J. Leaper, J. R. Staniland, A. P. McCann, and
Jane C. Horrocks. Computer-aided Diagnosis of Acute Abdominal
Pain. Br Med J, 2(5804):9–13, April 1972. Publisher: British Medical
Journal Publishing Group Section: Papers and Originals.
[62] Ngan Thi Dong and Megha Khosla. Revisiting Feature Selection with
Data Complexity. In 2020 IEEE 20th International Conference on
Bioinformatics and Bioengineering (BIBE), pages 211–216, 2020. ISSN:
2471-7819.
[63] Derek Doran, Sarah Schulz, and Tarek R. Besold. What does ex-
plainable ai really mean? a new conceptualization of perspectives.
arXiv:1710.00794 [cs], 10 2017. arXiv: 1710.00794.
[64] Finale Doshi-Velez and Been Kim. Towards a rigorous science of in-
terpretable machine learning. arXiv e-prints, 1702:arXiv:1702.08608,
2 2017.
[65] Timothy Dozat. Incorporating nesterov momentum into adam. 2
2016.
[66] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai.
Gradient descent finds global minima of deep neural networks. In The
36th International Conference on Machine Learning (ICML), volume 97
of Proceedings of Machine Learning Research, pages 1675–1685. PMLR,
2019.
[67] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Clas-
sification. John Wiley & Sons, November 2012. Google-Books-ID:
Br33IRC3PkQC.
[68] J. C. Dunn. Well-separated clusters and optimal fuzzy
partitions. Journal of Cybernetics, 4(1):95–104, 1 1974.
doi:10.1080/01969727408546059.
[69] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous
generalization bounds for deep (stochastic) neural networks with many
more parameters than training data. arXiv:1703.11008 [cs], 10 2017.
arXiv: 1703.11008.
[70] Eddie Y.-K. Ng. Segmentation of breast thermogram: Improved boundary detection with modified snake algorithm. 2006.
[71] Fabian Eitel and Kerstin Ritter. Testing the Robustness of Attribution
Methods for Convolutional Neural Networks in MRI-Based Alzheimer’s
Disease Classification. In Interpretability of Machine Intelligence in
Medical Image Computing and Multimodal Learning for Clinical Decision
Support, Lecture Notes in Computer Science, pages 3–11, Cham, 2019.
Springer International Publishing.
[72] Frank Emmert-Streib, Olli Yli-Harja, and Matthias Dehmer. Explain-
able artificial intelligence and machine learning: A reality rooted
perspective. arXiv:2001.09464 [cs, stat], 1 2020. arXiv: 2001.09464.
[74] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al.
A density-based algorithm for discovering clusters in large spatial
databases with noise. KDD, 96:226–231, 1996.
[75] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M.
Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level
classification of skin cancer with deep neural networks. Nature,
542(7639):115–118, 2 2017.
[82] Shangqian Gao, Feihu Huang, Weidong Cai, and Heng Huang. Net-
work pruning via performance maximization. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 9270–9280, June 2021.
[83] Luis P. F. Garcia, Ana C. Lorena, Marcilio C. P. de Souto, and Tin Kam
Ho. Classifier recommendation using data complexity measures.
pages 874–879, Beijing, 8 2018. 2018 24th International Conference
on Pattern Recognition (ICPR), IEEE.
[84] Nathan Garcia, Frederico Tiggeman, Eduardo Borges, Giancarlo
Lucca, Helida Santos, and Graçaliz Dimuro. Exploring the Rela-
tionships between Data Complexity and Classification Diversity in
Ensembles. In Proceedings of the 23rd International Conference on En-
terprise Information Systems, pages 652–659. SCITEPRESS - Science
and Technology Publications, 2021.
[85] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge,
Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns
are biased towards texture; increasing shape bias improves accu-
racy and robustness. In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net, 2019.
[86] Daniel A Gil, Luther M Swift, Huda Asfour, Narine Muselimyan,
Marco A Mercader, and Narine A Sarvazyan. Autofluorescence hyper-
spectral imaging of radiofrequency ablation lesions in porcine cardiac
tissue. Journal of biophotonics, 10(8):1008–1017, 8 2017.
[87] B. van Ginneken, A. A. A. Setio, C. Jacobs, and F. Ciompi. Off-the-
shelf convolutional neural network features for pulmonary nodule
detection in computed tomography scans. pages 286–289. 2015 IEEE
12th International Symposium on Biomedical Imaging (ISBI), 4 2015.
[88] Rafael C Gonzalez, Richard E Woods, et al. Digital image processing,
2002.
[89] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural In-
formation Processing Systems 27, page 2672–2680. Curran Associates,
Inc., 2014.
[90] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural In-
formation Processing Systems 27, page 2672–2680. Curran Associates,
Inc., 2014.
[91] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard
Schölkopf, and Alexander Smola. A kernel two-sample test. Journal
of Machine Learning Research, 13(25):723–773, 2012.
[92] Shuyue Guan, Huda Asfour, Narine Sarvazyan, and Murray Loew.
Application of unsupervised learning to hyperspectral imaging of
cardiac ablation lesions. Journal of Medical Imaging, 5(4):046003, 12
2018. doi:10.1117/1.JMI.5.4.046003.
[93] John T. Guibas, Tejpal S. Virdi, and Peter S. Li. Synthetic medical
images from dual generative adversarial networks. arXiv:1709.01872
[cs], 9 2017. arXiv: 1709.01872.
[96] C. Guo, S. Mita, and D. McAllester. Robust road detection and tracking
in challenging scenarios based on markov random fields with unsu-
pervised learning. IEEE Transactions on Intelligent Transportation
Systems, 13(3):1338–1354, 9 2012.
[97] Philipp Hacker, Ralf Krestel, Stefan Grundmann, and Felix Naumann.
Explainable AI under contract and tort law: legal incentives and
technical challenges. Artificial Intelligence and Law, 28(4):415–439,
December 2020.
[100] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. arXiv:1512.03385 [cs], 12
2015. arXiv: 1512.03385.
[101] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delv-
ing deep into rectifiers: Surpassing human-level performance on
imagenet classification. pages 1026–1034. Proceedings of the IEEE
International Conference on Computer Vision, 2015.
[102] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity
Mappings in Deep Residual Networks. arXiv:1603.05027 [cs], July
2016. ECCV 2016 camera-ready.
[103] Warren He, Bo Li, and Dawn Song. Decision boundary analysis of
adversarial examples. 2018.
[104] Michael Heath, Kevin Bowyer, Daniel Kopans, Richard Moore, and
W. Philip Kegelmeyer. The digital database for screening mammog-
raphy. In Proceedings of the 5th international workshop on digital
mammography, pages 212–218. Medical Physics Publishing, 2000.
[109] Fred Hohman, Minsuk Kahng, Robert Pienta, and Duen Horng Chau.
Visual analytics in deep learning: An interrogative survey for the next
frontiers. IEEE Transactions on Visualization and Computer Graphics,
25(8):2674–2693, 8 2019. event: IEEE Transactions on Visualization
and Computer Graphics.
[110] Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. How
generative adversarial networks and its variants work: An overview of
gan. arXiv:1711.05914 [cs], 11 2017. arXiv: 1711.05914.
[111] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu,
Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Sum-
mers. Deep convolutional neural networks for computer-aided detec-
tion: Cnn architectures, dataset characteristics and transfer learn-
ing. IEEE transactions on medical imaging, 35(5):1285–1298, 5 2016.
doi:10.1109/TMI.2016.2528162.
[112] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Is-
abella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers.
Deep convolutional neural networks for computer-aided detection:
Cnn architectures, dataset characteristics and transfer learning. IEEE
transactions on medical imaging, 35(5):1285–1298, 5 2016. PMID:
26886976 PMCID: PMC4890616.
[115] Meng Hu, Eric C. C. Tsang, Yanting Guo, and Weihua Xu. Fast
and Robust Attribute Reduction Based on the Separability in Fuzzy
Decision Systems. IEEE Transactions on Cybernetics, pages 1–14,
2021.
[116] Xia Hu, Lingyang Chu, Jian Pei, Weiqing Liu, and Jiang Bian. Model
Complexity of Deep Learning: A Survey. August 2021. arXiv:
2103.05127.
[117] Yipeng Hu, Eli Gibson, Li-Lin Lee, Weidi Xie, Dean C. Barratt, Tom
Vercauteren, and J. Alison Noble. Freehand ultrasound image simu-
lation with spatially-conditioned generative adversarial networks. In
Molecular Imaging, Reconstruction and Analysis of Moving Body Or-
gans, and Stroke Imaging and Treatment, Lecture Notes in Computer
Science, pages 105–115. Springer, Cham, 9 2017. DOI: 10.1007/978-
3-319-67564-0_11.
[118] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Wein-
berger. Densely Connected Convolutional Networks. arXiv:1608.06993
[cs], January 2018. CVPR 2017.
[119] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memi-
sevic. Generating images with recurrent adversarial networks. arXiv
preprint arXiv:1602.05110, 2016.
[120] Sergey Ioffe and Christian Szegedy. Batch normalization: Accel-
erating deep network training by reducing internal covariate shift.
arXiv:1502.03167 [cs], 2 2015. arXiv: 1502.03167.
[121] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-
image translation with conditional adversarial networks. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition,
pages 1125–1134, 2017.
[124] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
An introduction to statistical learning, volume 112. Springer, 2013.
[126] Li Jiang, Wang Zhan, and Murray H. Loew. Modeling static and
dynamic thermography of the human breast under elastic deforma-
tion. Physics in Medicine and Biology, 56(1):187–202, 1 2011. PMID:
21149948.
[127] Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio.
Predicting the generalization gap in deep networks with margin distri-
butions. In 7th International Conference on Learning Representations.
ICLR, 2019.
[129] Zhicheng Jiao, Xinbo Gao, Ying Wang, and Jie Li. A deep feature
based framework for breast masses classification. Neurocomputing,
197:221–231, 7 2016.
[131] Pragati Kapoor, S.V.A.V. Prasad, and Seema Patni. Image segmen-
tation and asymmetry analysis of breast thermograms for tumor
detection. International Journal of Computer Applications, 50(9):40–45,
7 2012.
[132] Md Rezaul Karim, Oya Beyan, Achille Zappa, Ivan G. Costa, Diet-
rich Rebholz-Schuhmann, Michael Cochez, and Stefan Decker. Deep
learning-based clustering approaches for bioinformatics. Briefings in
Bioinformatics, 2020.
[133] Hamid Karimi, Tyler Derr, and Jiliang Tang. Characterizing the deci-
sion boundary of deep neural networks. arXiv:1912.11460 [cs, stat],
6 2020. arXiv: 1912.11460.
[137] Valentin Khrulkov and Ivan Oseledets. Geometry score: A method for
comparing generative adversarial networks. In International Confer-
ence on Machine Learning, pages 2621–2629. PMLR, 2018.
[138] J. Kim, D. Han, Y. W. Tai, and J. Kim. Salient region detection via high-
dimensional color transform. pages 883–890. 2014 IEEE Conference
on Computer Vision and Pattern Recognition, 6 2014.
[139] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv:1412.6980 [cs], 12 2014. arXiv: 1412.6980.
[141] Jon M. Kleinberg. An impossibility theorem for clustering. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information
Processing Systems 15, page 463–470. MIT Press, 2003.
[144] Jacob Koruth, Shigeki Kusa, Srinivas Dukkipati, Petr Neuzil, Ter-
rance Ransbury, KC Armstrong, Larson Larson, Cinnamon Bowen,
Omar Amirana, Marco Mercader, Narine A Sarvazyan, Matthew W
Kay, and Vivek Y Reddy. Direct assessment of catheter-tissue contact
and rf lesion formation: a novel approach using endogenous nadh
fluorescence. Heart Rhythm, page S111, 2015.
[148] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images,
speech, and time series. The handbook of brain theory and neural
networks, 3361(10):1995, 1995.
[150] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, An-
drew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani,
Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic sin-
gle image super-resolution using a generative adversarial network.
arXiv:1609.04802 [cs, stat], 9 2016. arXiv: 1609.04802.
[151] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient
sparse coding algorithms. In B. Schölkopf, J. C. Platt, and T. Hoffman,
editors, Advances in Neural Information Processing Systems 19, page
801–808. MIT Press, 2007.
[153] Cheng Li and Bingyu Wang. Fisher linear discriminant analysis. 2014.
[154] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, and
Shuicheng Yan. Perceptual generative adversarial networks for small
object detection. pages 1951–1959, Honolulu, HI, 7 2017. 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
IEEE.
[155] Rongjian Li, Wenlu Zhang, Heung-Il Suk, Li Wang, Jiang Li, Dinggang
Shen, and Shuiwang Ji. Deep learning based imaging data completion
for improved brain disease diagnosis. page 305–312. Springer, 2014.
[156] Yu Li, Lizhong Ding, and Xin Gao. On the decision boundary of deep
neural networks. arXiv:1808.05385 [cs], 1 2019. arXiv: 1808.05385.
[157] Zhongyu Li, Xiaofan Zhang, Henning Müller, and Shaoting Zhang.
Large-scale retrieval for medical image analytics: A comprehensive
review. Medical Image Analysis, 43:66–84, January 2018.
[159] Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, and Huayan Wang.
Understanding Why Neural Networks Generalize Well Through GSNR
of Parameters. In International Conference on Learning Representations,
ICLR, 2020.
[160] Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, and Huayan
Wang. Understanding why neural networks generalize well through
gsnr of parameters. 2020.
[161] Xiangbin Liu, Liping Song, Shuai Liu, and Yudong Zhang. A review of
deep-learning-based medical image segmentation methods. Sustain-
ability, 13(3), 2021.
[162] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and
Sen Wu. Understanding and enhancement of internal clustering vali-
dation measures. IEEE Transactions on Cybernetics, 43(3):982–994,
6 2013. doi:10.1109/TSMCB.2012.2220543.
[163] Shih-Chung B. Lo, Heang-Ping Chan, Jyh-Shyan Lin, Huai Li,
Matthew T. Freedman, and Seong K. Mun. Artificial convolution
neural network for medical image pattern recognition. Neural Net-
works, 8(7–8):1201–1214, 1995.
[166] Ange Lou, Shuyue Guan, Nada Kamona, and Murray Loew. Segmen-
tation of infrared breast images using multiresunet neural networks.
In 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR),
pages 1–6, 2019.
[167] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray Loew. Caranet:
Context axial reverse attention network for segmentation of small
medical objects. arXiv preprint arXiv:2108.07368, 2021.
[168] Ange Lou, Shuyue Guan, and Murray Loew. Cfpnet-m: A light-weight
encoder-decoder based network for multimodal biomedical image real-
time segmentation. arXiv preprint arXiv:2105.04075, 2021.
[169] Ange Lou, Shuyue Guan, and Murray H Loew. Dc-unet: rethinking
the u-net architecture with dual channel efficient cnn for medical
image segmentation. In Medical Imaging 2021: Image Processing,
volume 11596, page 115962T. International Society for Optics and
Photonics, 2021.
[170] Daniel Lévy and Arzav Jain. Breast Mass Classification from Mammo-
grams using Deep Convolutional Neural Networks. arXiv:1612.00542
[cs], December 2016. arXiv: 1612.00542.
[171] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using
t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[172] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and
Stephen Paul Smolley. Least squares generative adversarial networks.
pages 2813–2821. 2017 IEEE International Conference on Computer
Vision (ICCV), 10 2017. ISSN: 2380-7504.
[173] Yuliya Marchetti, Hai Nguyen, Amy Braverman, and Noel Cressie.
Spatial data compression via adaptive dispersion clustering. Compu-
tational Statistics & Data Analysis, 117:138–153, 2018.
[174] Morteza Mardani, Enhao Gong, Joseph Y. Cheng, Shreyas Vasanawala,
Greg Zaharchuk, Marcus Alley, Neil Thakur, Song Han, William Dally,
John M. Pauly, and Lei Xing. Deep generative adversarial networks
for compressed sensing automates mri. arXiv:1706.00051 [cs, stat], 5
2017. arXiv: 1706.00051.
[175] David Martens, Jan Vanthienen, Wouter Verbeke, and Bart Baesens.
Performance of classification models from a user perspective. Decision
Support Systems, 51(4):782–793, 11 2011.
[177] Martina Melinščak, Pavle Prentašić, and Sven Lončarić. Retinal ves-
sel segmentation using deep neural networks. VISAPP 2015 (10th
International Conference on Computer Vision Theory and Applications),
Proceedings, Vol.1, page 577, 5 2015.
[178] Marco Mercader, KC Armstrong, Terry Ransbury, Vivek Y. Reddy, Ja-
cob Koruth, Cinnamon Larsen, James Bowen, Narine A Sarvazyan,
and Omar Amirana. Optical tissue interrogation catheter that provides
real-time monitoring of catheter-tissue contact and rf lesion progres-
sion using nadh fluorescence. EP Europace, 18(suppl_1):i27–i27, 6
2016.
[179] Amit Kumar Mishra. Separability indices and their use in radar signal
based target recognition. IEICE Electronics Express, 6(14):1000–1005,
2009.
[184] Narine Muselimyan, Al Mohammed Jishi, Huda Asfour, Luther Swift,
and Narine A. Sarvazyan. Anatomical and optical properties of atrial
tissue: Search for a suitable animal model. Cardiovascular Engineer-
ing and Technology, 8(4):505–514, 2017.
[186] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve
restricted boltzmann machines. page 807–814, 2010.
[188] Olfa Nasraoui and Chiheb-Eddine Ben N’Cir. Clustering Methods for
Big Data Analytics. Springer, 2019.
[192] Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian
Wang, and Dinggang Shen. Medical image synthesis with context-
aware generative adversarial networks. Lecture Notes in Computer
Science, pages 417–425. International Conference on Medical Image
Computing and Computer-Assisted Intervention, Springer, Cham, 9
2017.
pages 691–696, Fukuoka, 12 2010. 2010 Second World Congress on
Nature and Biologically Inspired Computing (NaBIC 2010), IEEE.
[195] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning de-
convolution network for semantic segmentation. pages 1520–1528.
Proceedings of the IEEE International Conference on Computer Vision,
2015.
[196] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training
generative neural samplers using variational divergence minimiza-
tion. In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pages 271–279, 2016.
[197] Hakan Oral, Bradley P. Knight, Mehmet Ozaydin, Hiroshi Tada, Aman
Chugh, Sohail Hassan, Christoph Scharf, Steve W. K. Lai, Radmira
Greenstein, Frank Pelosi, S. Adam Strickberger, and Fred Morady.
Clinical significance of early recurrences of atrial fibrillation after pul-
monary vein isolation. Journal of the American College of Cardiology,
40(1):100–104, 7 2002. PMID: 12103262.
[198] Feifan Ouyang, Roland Tilz, Julian Chun, Boris Schmidt, Erik Wissner,
Thomas Zerm, Kars Neven, Bulent Köktürk, Melanie Konstantinidou,
Andreas Metzner, Alexander Fuernkranz, and Karl-Heinz Kuck. Long-
term results of catheter ablation in paroxysmal atrial fibrillation:
lessons from a 5-year follow-up. Circulation, 122(23):2368–2377, 12
2010. PMID: 21098450.
[199] Yuehao Pan, Weimin Huang, Zhiping Lin, Wanzheng Zhu, Jiayin Zhou,
Jocelyn Wong, and Zhongxiang Ding. Brain tumor grading based on
neural networks and convolutional neural networks. page 699–702.
IEEE, 2015.
[200] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den
Hengel. Deep learning for anomaly detection: A review. ACM Comput.
Surv., 54(2), March 2021.
[201] Razvan Pascanu, Guido Montúfar, and Yoshua Bengio. On the number
of inference regions of deep feed forward networks with piece-wise
linear activations. In The 2nd International Conference on Learning
Representations (ICLR), Conference Track Proceedings, 2014.
[203] Philip Perconti and Murray H. Loew. Salience measure for assessing
scale-based features in mammograms. Journal of the Optical Society
of America. A, Optics, Image Science, and Vision, 24(12):B81–90, 12
2007. PMID: 18059917.
[204] Anna Dagmar Peterson. A Separability Index for Clustering and Classi-
fication Problems with Applications to Cluster Merging and Systematic
Evaluation of Clustering Algorithms. PhD thesis, Ames, IA, USA, 2011.
[206] Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V.
Le. Meta Pseudo Labels. arXiv:2003.10580 [cs, stat], March 2021.
arXiv: 2003.10580.
[207] Nicolas Pinto, David D. Cox, and James J. DiCarlo. Why is real-world
visual object recognition hard? PLOS Computational Biology, 4(1):e27,
1 2008.
[208] Hairong Qi, Wesley Snyder, Jonathan F. Head, and Robert L. Elliott.
Detecting breast cancer from infrared images by asymmetry analysis.
volume 2, pages 1227–1228 vol.2, 2 2000.
[209] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised rep-
resentation learning with deep convolutional generative adversarial
networks. 2016.
[212] Aaditya Ramdas, Nicolas Garcia Trillos, and Marco Cuturi. On wasser-
stein two-sample testing and related families of nonparametric tests.
Entropy, 19(2):47, 2017.
of gaps in atrial ablation lesion sets using a real-time magnetic reso-
nance imaging system. Circulation. Arrhythmia and electrophysiology,
5(6):1130–5, 12 2012.
[217] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-
cnn: Towards real-time object detection with region proposal networks.
arXiv:1506.01497 [cs], 6 2015. arXiv: 1506.01497.
[221] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. Lecture Notes in
Computer Science, pages 234–241. Springer International Publishing,
2015.
[222] Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke.
Explainable machine learning for scientific insights and discoveries.
IEEE Access, 8:42200–42216, 2020. event: IEEE Access.
[223] Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke.
Explainable Machine Learning for Scientific Insights and Discoveries.
IEEE Access, 8:42200–42216, 2020. Conference Name: IEEE Access.
[226] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
and Michael Bernstein. Imagenet large scale visual recognition chal-
lenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[227] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[231] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, Xi Chen, and Xi Chen. Improved techniques for training gans.
In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, Advances in Neural Information Processing Systems 29, page
2234–2242. Curran Associates, Inc., 2016.
[234] Jorge M. Santos and Mark Embrechts. On the use of the adjusted rand
index as a metric for evaluating supervised classification. Lecture
Notes in Computer Science, page 175–184, Berlin, Heidelberg, 2009.
Springer.
[235] Shibani Santurkar, Ludwig Schmidt, and Aleksander Madry. A
classification-based study of covariate shift in gan distributions. In In-
ternational Conference on Machine Learning, pages 4480–4489. PMLR,
2018.
[236] Saeed Sarbazi-Azad, Mohammad Saniee Abadeh, and Mohammad Er-
fan Mowlaei. Using data complexity measures and an evolutionary
cultural algorithm for gene selection in microarray data. Soft Comput-
ing Letters, 3:100007, 2021.
[237] N. Scales, C. Herry, and M. Frize. Automated image segmentation for
breast analysis using infrared images. volume 1, pages 1737–1740.
The 26th Annual International Conference of the IEEE Engineering
in Medicine and Biology Society, 9 2004.
[238] Achim Schilling, Andreas Maier, Richard Gerum, Claus Metzner, and
Patrick Krauss. Quantifying the separability of data classes in neural
networks. Neural Networks, 139:278–293, 2021.
[239] Thomas Schlegl, Joachim Ofner, and Georg Langs. Unsupervised
pre-training across image domains improves lung tissue classification.
page 82–93. Springer, 2014.
[240] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakr-
ishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual
explanations from deep networks via gradient-based localization. In
Proceedings of the IEEE international conference on computer vision,
pages 618–626, 2017.
[241] Mehmet Sezgin and Bülent Sankur. Survey over image threshold-
ing techniques and quantitative performance evaluation. Journal of
Electronic imaging, 13(1):146–165, 2004.
[242] C. E. Shannon. A mathematical theory of communication. The Bell
System Technical Journal, 27(3):379–423, 1948.
[243] Zhihui Shao, Jianyi Yang, and Shaolei Ren. Increasing the trust-
worthiness of deep neural networks via accuracy monitoring. In
Proceedings of the Workshop on Artificial Intelligence Safety, volume
2640 of CEUR Workshop Proceedings. CEUR-WS.org, 2020.
[244] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan
Carlsson. Cnn features off-the-shelf: an astounding baseline for
recognition. page 806–813, 2014.
[245] Anmol Sharma. DDSM Utility. GitHub, 2015.
[246] Anmol Sharma. Ddsm utility. https://github.com/trane293/
DDSMUtility, 2015.
[247] Wei Shen, Mu Zhou, Feng Yang, Caiyun Yang, and Jie Tian. Multi-
scale convolutional neural networks for lung nodule classification.
page 588–599. Springer, 2015.
[248] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer
statistics, 2016. CA: A Cancer Journal for Clinicians, 66(1):7–30, 1
2016.
[249] Lincoln Silva, D C. M. Saade, Giomar Sequeiros Olivera, Ari Silva,
Anselmo Paiva, Renato Bravo, and Aura Conci. A new database for
breast research with infrared image. Journal of Medical Imaging and
Health Informatics, 4:92–100, 3 2014.
[250] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside
convolutional networks: Visualising image classification models and
saliency maps. In In Workshop at International Conference on Learning
Representations. Citeseer, 2014.
[251] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv:1409.1556 [cs], 9
2014. arXiv: 1409.1556.
[252] Amitojdeep Singh, Sourya Sengupta, and Vasudevan Lakshmi-
narayanan. Explainable Deep Learning Models in Medical Image
Analysis. Journal of Imaging, 6(6):52, June 2020. Number: 6 Pub-
lisher: Multidisciplinary Digital Publishing Institute.
[253] Jake Snell, Karl Ridgeway, Renjie Liao, Brett D Roads, Michael C Mozer,
and Richard S Zemel. Learning to generate images with perceptual
similarity metrics. In 2017 IEEE International Conference on Image
Processing (ICIP), pages 4277–4281. IEEE, 2017.
[254] Jaemin Son, Sang Jun Park, and Kyu-Hwan Jung. Retinal vessel
segmentation in fundoscopic images with generative adversarial net-
works. arXiv:1706.09318 [cs], 6 2017. arXiv: 1706.09318.
[255] Th A Sorensen. A method of establishing groups of equal amplitude
in plant sociology based on similarity of species content and its appli-
cation to analyses of the vegetation on danish commons. Biol. Skr.,
5:1–34, 1948.
[256] José Sotoca, José Sánchez, and R Mollineda. A review of data complex-
ity measures and their applicability to pattern classification problems.
Actas del III Taller Nacional de Mineria de Datos y Aprendizaje, 1 2005.
[257] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. Dropout: a simple way to prevent neu-
ral networks from overfitting. Journal of machine learning research,
15(1):1929–1958, 2014.
[260] Yanan Sun, Xian Sun, Yuhan Fang, Gary G. Yen, and Yuqiao Liu. A
novel training protocol for performance predictors of evolutionary neu-
ral architecture search algorithms. IEEE Transactions on Evolutionary
Computation, 25(3):524–536, 2021.
[261] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.
On the importance of initialization and momentum in deep learn-
ing. In The 30th International Conference on Machine Learning (ICML),
volume 28 of Proceedings of Machine Learning Research, pages 1139–
1147. PMLR, June 2013.
[262] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. Effi-
cient Processing of Deep Neural Networks: A Tutorial and Survey.
arXiv:1703.09039 [cs], March 2017. arXiv: 1703.09039.
[263] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi.
Inception-v4, inception-resnet and the impact of residual connections
on learning. arXiv:1602.07261 [cs], 2 2016. arXiv: 1602.07261.
[264] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. Rethinking the inception architecture for computer
vision. pages 2818–2826, Las Vegas, NV, USA, 6 2016. 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), IEEE.
[267] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall,
M. B. Gotway, and J. Liang. Convolutional neural networks for medical
image analysis: Full training or fine tuning? IEEE Transactions on
Medical Imaging, 35(5):1299–1312, 5 2016.
[268] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the
evaluation of generative models. In Yoshua Bengio and Yann LeCun,
editors, 4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track
Proceedings, 2016.
[272] Erico Tjoa and Cuntai Guan. A Survey on Explainable Artificial Intel-
ligence (XAI): Towards Medical XAI. IEEE Transactions on Neural Net-
works and Learning Systems, pages 1–21, 2020. arXiv: 1907.07374.
[275] Pieter Van Molle, Miguel De Strooper, Tim Verbelen, Bert Vankeirsbilck,
Pieter Simoens, and Bart Dhoedt. Visualizing Convolutional Neural
Networks to Improve Decision Support for Skin Lesion Classification.
In Understanding and Interpreting Machine Learning in Medical Image
Computing Applications, Lecture Notes in Computer Science, pages
115–123, Cham, 2018. Springer International Publishing.
[276] Vladimir Vapnik and Alexey Chervonenkis. The necessary and suf-
ficient conditions for consistency in the empirical risk minimization
method. Pattern Recognition and Image Analysis, 1(3):283–305, 1991.
[277] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and
computing, 17(4):395–416, 2007.
[278] Ulrike Von Luxburg, Robert C. Williamson, and Isabelle Guyon. Clus-
tering: Science or art? Proceedings of ICML Workshop on Unsupervised
and Transfer Learning, page 65–79, 2012.
[279] Chaoyue Wang, Chang Xu, Chaohui Wang, and Dacheng Tao. Per-
ceptual adversarial networks for image-to-image transformation.
arXiv:1706.09138 [cs], 6 2017. arXiv: 1706.09138.
[280] Shuihua Wang, Ravipudi Venkata Rao, Peng Chen, Yudong Zhang,
Aijun Liu, and Ling Wei. Abnormal breast detection in mammogram
images by feed-forward neural network trained by jaya algorithm.
Fundamenta Informaticae, 151(1-4):191–211, 1 2017.
[282] Lulu Wen, Kaile Zhou, and Shanlin Yang. A shape-based clustering
method for pattern recognition of residential electricity consumption.
Journal of cleaner production, 212:475–488, 2019.
[285] Jelmer M. Wolterink, Tim Leiner, Max A. Viergever, and Ivana Išgum.
Automatic coronary calcium scoring in cardiac ct angiography using
convolutional neural networks. page 589–596. Springer, 2015.
[286] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. Gp-gan:
Towards realistic high-resolution image blending. arXiv:1703.07195
[cs], 3 2017. arXiv: 1703.07195.
[287] Yuan Xue, Tao Xu, Han Zhang, Rodney Long, and Xiaolei Huang.
Segan: Adversarial network with multi-scale l1 loss for medical image
segmentation. arXiv:1706.01805 [cs], 6 2017. arXiv: 1706.01805.
[288] Scott Yak, Javier Gonzalvo, and Hanna Mazzawi. Towards task and
architecture-independent generalization gap predictors. In ICML “Un-
derstanding and Improving Generalization in Deep Learning” Workshop,
2019.
[289] Yasunori Yamada and Tetsuro Morimura. Weight features for predict-
ing future model performance of deep neural networks. In The 25th
International Joint Conference on Artificial Intelligence (IJCAI), pages
2231–2237. AAAI Press, July 2016.
[291] Dong Yang, Tao Xiong, Daguang Xu, Qiangui Huang, David Liu,
S. Kevin Zhou, Zhoubing Xu, JinHyeong Park, Mingqing Chen, Trac D.
Tran, Sang Peter Chin, Dimitris Metaxas, and Dorin Comaniciu. Au-
tomatic vertebra labeling in large-scale 3d ct using deep image-to-
image network with message passing and sparsity regularization.
arXiv:1705.05998 [cs], 5 2017. arXiv: 1705.05998.
[292] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN:
layered recursive generative adversarial networks for image generation.
In 5th International Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
OpenReview.net, 2017.
[293] Darvin Yi, Rebecca Lynn Sawyer, David Cohn III, Jared Dunnmon,
Carson Lam, Xuerong Xiao, and Daniel Rubin. Optimizing and vi-
sualizing deep learning for benign/malignant classification in breast
tumors. arXiv:1705.06362 [cs], 5 2017. arXiv: 1705.06362.
[294] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsuper-
vised dual learning for image-to-image translation. arXiv:1704.02510
[cs], 4 2017. arXiv: 1704.02510.
[295] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for
improving the generalizability of deep learning. arXiv:1705.10941 [cs,
stat], 5 2017. arXiv: 1705.10941.
[296] Kyle Young, Gareth Booth, Becks Simpson, Reuben Dutton, and Sally
Shrapnel. Deep Neural Network or Dermatologist? In Interpretability
of Machine Intelligence in Medical Image Computing and Multimodal
Learning for Clinical Decision Support, Lecture Notes in Computer
Science, pages 48–55, Cham, 2019. Springer International Publishing.
[297] Roozbeh Yousefzadeh and Dianne P. O’Leary. Investigating decision
boundaries of trained neural networks. arXiv:1908.02802 [cs, stat], 8
2019. arXiv: 1908.02802.
[298] Yu Zeng, Huchuan Lu, and Ali Borji. Statistics of deep generated
images. arXiv preprint arXiv:1708.02688, 2017.
[299] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and
Oriol Vinyals. Understanding deep learning requires rethinking gener-
alization. arXiv:1611.03530 [cs], February 2017. arXiv: 1611.03530.
[300] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and
Oriol Vinyals. Understanding deep learning (still) requires rethinking
generalization. Commun. ACM, 64(3):107–115, February 2021.
[301] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena.
Self-attention generative adversarial networks. pages 7354–7363.
International Conference on Machine Learning, 5 2019. ISSN: 1938-
7228 section: Machine Learning.
[302] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an effi-
cient data clustering method for very large databases. ACM SIGMOD
Record, 25(2):103–114, 6 1996.
[303] Yu-Dong Zhang, Shui-Hua Wang, Ge Liu, and Jiquan Yang. Computer-
aided diagnosis of abnormal breasts in mammogram images by
weighted-type fractional fourier transform. Advances in Mechanical
Engineering, 8(2):1687814016634243, 2 2016.
[304] Zhifei Zhang, Yang Song, and Hairong Qi. Decoupled learning for
conditional adversarial networks. In 2018 IEEE Winter Conference on
Applications of Computer Vision (WACV), pages 700–708. IEEE, 2018.
[305] Qinpei Zhao and Pasi Fränti. Wb-index: A sum-of-squares based
index for cluster validity. Data & Knowledge Engineering, 92:77–89, 7
2014.
[306] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning
for person re-identification. pages 3586–3593. 2013 IEEE Conference
on Computer Vision and Pattern Recognition, 6 2013.
[307] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and An-
tonio Torralba. Object Detectors Emerge in Deep Scene CNNs.
arXiv:1412.6856 [cs], April 2015. ICLR 2015 conference paper.
[308] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. Learning Deep Features for Discriminative Localization. In
2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 2921–2929, Las Vegas, NV, USA, June 2016. IEEE.
[309] Quming Zhou, Zhuojing Li, and J. K. Aggarwal. Boundary extraction
in thermal images by edge map. SAC ’04, page 254–258, New York,
NY, USA, 2004. ACM.
[310] Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan
Zhang, Jun Wang, and Yong Yu. Activation maximization generative
adversarial nets. In International Conference on Learning Representa-
tions, 2018.
[311] Hui Zhu, Jianhua Huang, and Xianglong Tang. Comparing decision
boundary curvature. volume 3, pages 450–453 Vol.3. Proceedings
of the 17th International Conference on Pattern Recognition, 2004.
ICPR 2004., 8 2004. ISSN: 1051-4651.
[312] Wentao Zhu, Qi Lou, Yeeleng Scott Vang, and Xiaohui Xie. Deep
multi-instance networks with sparse label assignment for whole
mammogram classification. arXiv:1612.05968 [cs], 12 2016. arXiv:
1612.05968.
[313] Wentao Zhu, Xiang Xiang, Trac D. Tran, Gregory D. Hager, and Xiao-
hui Xie. Adversarial deep structured nets for mass segmentation from
mammograms. arXiv:1710.09288 [cs], 10 2017. arXiv: 1710.09288.
[314] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le.
Learning Transferable Architectures for Scalable Image Recognition.
arXiv:1707.07012 [cs, stat], April 2018. arXiv: 1707.07012.
Appendix A: The CNN architecture for Cifar-10/100 used in
Section 2.4.2
Table A.1: The CNN architecture for Cifar-10/100 used in Section 2.4.2.

Layer                                     Shape
Input: RGB image                          32 × 32 × 3
Conv_3-32 + ReLU                          32 × 32 × 32
Conv_3-32 + ReLU                          32 × 32 × 32
MaxPooling_2 + Dropout (0.25)             16 × 16 × 32
Conv_3-64 + ReLU                          16 × 16 × 64
Conv_3-64 + ReLU                          16 × 16 × 64
MaxPooling_2 + Dropout (0.25)             8 × 8 × 64
Flatten                                   4096
FC_512 + Dropout (0.5)                    512
FC_10 (Cifar-10) / FC_20 (Cifar-100)      10 / 20
Output (softmax): [0, 1]                  10 (Cifar-10) / 20 (Cifar-100)
The CNN architecture used in Section 2.4.2 consists of four convolutional layers, two max-pooling layers, and two fully connected (FC) layers. The activation function for each convolutional layer is ReLU, and the activation for the output layer is softmax, which maps the output values to the range [0, 1] with a sum of 1. The notation Conv_3-32 indicates a convolutional layer with 32 units, each using a filter of 3 × 3 pixels (height × width). MaxPooling_2 denotes a max-pooling layer with a 2 × 2 pixel window and a stride of 2. FC_n represents an FC layer with n units. A dropout layer randomly sets the given fraction (rate) of its input units to 0 at every update during training, which helps the network avoid overfitting. Table A.1 shows the detailed architecture. Our training optimizer is RMSprop [271] with a learning rate of 1e-4 and a decay of 1e-6; the loss function is categorical cross-entropy; the monitored metric is accuracy; the batch size is 32; and the total number of epochs is set to 200.
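For concreteness, the following is a minimal Keras/TensorFlow 2.x sketch of the Table A.1 architecture, written for illustration here rather than taken from the original experiment code. The "same" padding, the ReLU activation on the FC_512 layer, and the omission of the optimizer's decay argument are assumptions of this sketch.

# Minimal sketch of the Table A.1 architecture (illustration only).
# Assumptions: "same" padding keeps the spatial sizes in Table A.1; FC_512 uses ReLU;
# the decay of 1e-6 mentioned in the text is omitted (its argument is deprecated in
# recent Keras versions).
from tensorflow.keras import layers, models, optimizers

def build_cnn(num_classes=10):  # 10 for Cifar-10, 20 for Cifar-100 (coarse labels)
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),                           # RGB input: 32 x 32 x 3
        layers.Conv2D(32, 3, padding="same", activation="relu"),   # Conv_3-32 + ReLU
        layers.Conv2D(32, 3, padding="same", activation="relu"),   # Conv_3-32 + ReLU
        layers.MaxPooling2D(2),                                    # MaxPooling_2 -> 16 x 16 x 32
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, padding="same", activation="relu"),   # Conv_3-64 + ReLU
        layers.Conv2D(64, 3, padding="same", activation="relu"),   # Conv_3-64 + ReLU
        layers.MaxPooling2D(2),                                    # MaxPooling_2 -> 8 x 8 x 64
        layers.Dropout(0.25),
        layers.Flatten(),                                          # 4096
        layers.Dense(512, activation="relu"),                      # FC_512 (ReLU assumed)
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),           # FC_10 / FC_20 + softmax
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn(num_classes=10)
model.summary()  # training in the text: batch size 32, 200 epochs

Calling model.summary() prints the layer output shapes, which should match the Shape column of Table A.1 under the padding assumption above.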
Appendix B: Synthetic Datasets
Table B.1: Names of the 97 synthetic datasets used, from the Tomas Barton repository.
Appendix C: Simplification from Equation 4.4 to Equation 4.5
Starting from Equation 4.4,

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty}\left(\frac{1}{e}\right)^{N}\left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N+0.5}$$

The constant $+0.5$ in the exponent does not affect the limit, so

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty}\left(\frac{1}{e}\right)^{N}\left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N}$$

Writing the right-hand side in exponential form,

$$\lim_{N\to+\infty} P_c = \lim_{N\to+\infty} e^{\ln\left[\left(\frac{1}{e}\right)^{N}\left(\frac{bN^{a}}{bN^{a}-N}\right)^{bN^{a}-N}\right]} = \lim_{N\to+\infty} e^{\underbrace{-N+\left(bN^{a}-N\right)\ln\frac{bN^{a}}{bN^{a}-N}}_{(\mathcal{A})}}$$

Let $t = \frac{1}{N} \to 0^{+}$. Then

$$(\mathcal{A}) = -\frac{1}{t}+\left(\frac{b}{t^{a}}-\frac{1}{t}\right)\ln\frac{b}{b-t^{a-1}} = \frac{\left(b-t^{a-1}\right)\ln\frac{b}{b-t^{a-1}}-t^{a-1}}{t^{a}}$$

[i] If $a = 1$,

$$(\mathcal{A}) = \frac{\overbrace{(b-1)\ln\frac{b}{b-1}}^{(\mathcal{B})}-1}{t}, \qquad (\mathcal{B}) = \ln\left(\frac{b}{b-1}\right)^{b-1}$$

In $\mathbb{R}$, it is easy to show that, for $b > 0$,

$$1 < \left(\frac{b}{b-1}\right)^{b-1} < e$$

Then $0 < (\mathcal{B}) < 1$, so

$$\lim_{t\to 0^{+}}(\mathcal{A}) = \lim_{t\to 0^{+}}\frac{(\mathcal{B})-1}{t} = -\infty$$

Therefore, when $a = 1$, $\lim_{N\to+\infty} P_c = e^{-\infty} = 0$.

[ii] If $a > 1$, the numerator and the denominator of $(\mathcal{A})$ both tend to 0 as $t \to 0^{+}$ (a $\frac{0}{0}$ form), so L'Hôpital's rule can be applied repeatedly:

$$\lim_{t\to 0^{+}}(\mathcal{A}) = \lim_{t\to 0^{+}}\frac{\left(b-t^{a-1}\right)\ln\frac{b}{b-t^{a-1}}-t^{a-1}}{t^{a}} \overset{\text{L'H\^{o}pital}}{=} \lim_{t\to 0^{+}}\frac{(1-a)\,t^{a-2}\ln\frac{b}{b-t^{a-1}}}{a\,t^{a-1}} = \lim_{t\to 0^{+}}\frac{(1-a)\ln\frac{b}{b-t^{a-1}}}{a\,t}$$

$$\overset{\text{L'H\^{o}pital}}{=} \lim_{t\to 0^{+}}\frac{(a-1)^{2}}{a\left(t-b\,t^{2-a}\right)} = \lim_{t\to 0^{+}}\frac{(a-1)^{2}\,t^{-1}}{a\left(1-b\,t^{1-a}\right)} \overset{\text{L'H\^{o}pital}}{=} \lim_{t\to 0^{+}}\frac{-(a-1)^{2}\,t^{-2}}{(a-1)\,a\,b\,t^{-a}} = \lim_{t\to 0^{+}}\left(-\frac{a-1}{a\,b\,t^{2-a}}\right)$$

Substituting $N = \frac{1}{t}$,

$$\lim_{N\to+\infty} P_c = \lim_{t\to 0^{+}} e^{(\mathcal{A})} = \lim_{N\to+\infty} e^{-\frac{(a-1)N^{2-a}}{ab}} \qquad \text{(Equation (4.5) in the paper, for } a > 1\text{)}$$

When $1 < a < 2$, $N^{2-a} \to +\infty$, so

$$\lim_{N\to+\infty} P_c = e^{-\frac{+\infty}{ab}} = 0$$

When $a > 2$, $N^{2-a} \to 0$, so

$$\lim_{N\to+\infty} P_c = e^{-\frac{0}{ab}} = 1$$

For $a = 2$,

$$\lim_{N\to+\infty} P_c = e^{-\frac{1}{2b}} \qquad \text{(Equation (4.6) in the paper)}$$
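As an informal numerical check of the $a = 2$ case (not part of the original derivation), the finite-$N$ value can be evaluated directly and compared with the limit $e^{-1/(2b)}$. The sketch below assumes Equation 4.4 has the closed form shown in the first display of this appendix; the value $b = 3$ is an arbitrary illustrative choice.

# Informal numerical check of the a = 2 limit (illustration only).
# Assumes Equation 4.4 has the form shown above:
# P_c = (1/e)^N * (b*N**a / (b*N**a - N))**(b*N**a - N + 0.5)
import math

def p_c(N, a, b):
    m = b * N ** a
    # evaluate via the exponent to avoid overflow for large N
    return math.exp(-N + (m - N + 0.5) * math.log(m / (m - N)))

b = 3.0
for N in (10, 100, 1000, 10000):
    print(f"N = {N:>5d}:  P_c = {p_c(N, a=2, b=b):.6f}")
print(f"limit e^(-1/(2b)) = {math.exp(-1.0 / (2.0 * b)):.6f}")

For increasing N the printed values approach the limit, consistent with Equation 4.6.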