MasterThesis Final
Agapi Davradou
Examination Committee
January 2021
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the require-
ments of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
This thesis would not exist without the help of many. I would like to thank everyone, who in their own
special way contributed to the final outcome of this thesis.
First of all, I would like to thank my supervisors. Prof. Mário Figueiredo made this beautiful journey possible by accepting to be my supervisor and providing me the opportunity to come and study at IST. Apart from his truly insightful advice, availability, and experience, for which I am very grateful, he is one of the most inspiring and respected professors I have ever met. I also thank Dr. Nuno Silva for making the collaboration with Hospital da Luz possible, as well as for his constant eagerness to help, his enthusiasm, and his valuable advice from the very beginning.
In addition, I am very thankful to André Gomes, Dr. Konstantinos Makantasis and Maria Mousouraki.
André, not only for his patience and great willingness to share the beauty of math with me, but also for all the great moments we shared in Lisbon. Dr. Konstantinos Makantasis, for always being there to answer my questions and to advise and guide me. Maria, for her great help in creating and optimizing several figures, as well as for the beautiful time we spent together during the last difficult months.
I feel incredibly fortunate to be surrounded by my friends Giannis, Christina, Annoula, Vaso, Stefania,
Konstantina, Penni and Anna and I am very grateful for their love, wonderful moments and psychological
support.
Finally, I cannot thank enough my family for their constant love, support and care all these years and
for always being there for me, no matter the choices I make.
Abstract
Pancreatic cancer is one of the deadliest types of cancer, with 25% of the diagnosed patients surviving
for only one year and 6% of them for five. Computed tomography (CT) screening trials have played
a key role in improving early detection of pancreatic cancer, which has shown significant improvement
in patient survival rates. However, advanced analysis of such images often requires manual segmen-
tation of the pancreas, which is a time-consuming task. Moreover, the pancreas presents high variability in shape, while occupying only a very small area of the entire abdominal CT scan, which increases the complexity of the problem. The rapid development of deep learning can contribute to the development of robust algorithms that provide inexpensive, accurate, and user-independent segmentation results to guide the domain experts. This dissertation addresses this task by investigating a two-step approach for pancreas segmentation, in which the segmentation is assisted by a prior rough localization or detection of the pancreas.
This rough localization of the pancreas is provided by an estimated probability map and the detection
task is achieved by using the YOLOv4 deep learning algorithm. The segmentation task is tackled by
a modified U-Net model applied on cropped data, as well as by using a morphological active contours
algorithm. For comparison, the U-Net model was also applied on the full CT images, which provides a coarse pancreas segmentation to serve as a reference. Experimental results of the detection network on
the National Institutes of Health (NIH) dataset and the pancreas tumour task dataset within the Med-
ical Segmentation Decathlon show 50.67% mean Average Precision. The best segmentation network
achieved good segmentation results on the NIH dataset, reaching 67.67% Dice score.
Keywords
Pancreatic cancer; morphological snakes; deep learning; segmentation; convolutional neural networks
Περίληψη
Ο καρκίνος του παγκρέατος αποτελεί έναν από τους πιο θανάσιμους τύπους καρκίνου, με το 25% των δια-
γνωσμένων ασθενών να επιβιώνουν μόνο για ένα χρόνο και το 6% αυτών για πέντε. Οι δοκιμές διαλογής
με αξονική τομογραφία έχουν αποτελέσει βασικό παράγοντα στη βελτίωση της έγκαιρης ανίχνευσης του
καρκίνου του παγκρέατος, η οποία έχει δείξει σημαντική βελτίωση στα ποσοστά επιβίωσης των ασθενών. Ω-
στόσο, η προηγμένη ανάλυση τέτοιων εικόνων απαιτεί συχνά χειροκίνητη κατάτμηση του παγκρέατος, η οποία
αποτελεί μία χρονοβόρα διαδικασία. Επιπλέον, το πάγκρεας παρουσιάζει υψηλή μεταβλητότητα σε σχήμα,
ενώ καταλαμβάνει μόνο μια πολύ μικρή περιοχή ολόκληρης της κοιλιακής αξονικής τομογραφίας, γεγονός
που αυξάνει την πολυπλοκότητα του προβλήματος. Η ταχεία ανάπτυξη της βαθιάς μηχανικής μάθησης μπορεί
να συμβάλει στην ανάπτυξη ισχυρών αλγορίθμων που παρέχουν οικονομικά, ακριβή, και ανεξάρτητα από το
χρήστη αποτελέσματα κατάτμησης, τα οποία μπορούν να καθοδηγήσουν τους ειδικούς του τομέα. Η παρο-
ύσα διατριβή αντιμετωπίζει το προαναφερθέν πρόβλημα διερευνώντας μία προσέγγιση δύο βημάτων για την
κατάτμηση του παγκρέατος, βοηθώντας αρχικά τη διαδικασία, είτε με έναν κατά προσέγγιση χωρικό προσδιο-
ρισμό, είτε με εντοπισμό του παγκρέατος. Αυτός ο κατά προσέγγιση χωρικός προσδιορισμός επιτυγχάνεται
μέσω ενός εκτιμώμενου πιθανοτικού χάρτη και ο εντοπισμός κάνοντας χρήση του YOLOv4 αλγορίθμου βα-
θιάς μηχανικής μάθησης. Το πρόβλημα της κατάτμησης αντιμετωπίζεται μέσω ενός τροποποιημένου U-Net
μοντέλου που εφαρμόζεται σε περικομμένα δεδομένα, καθώς και χρησιμοποιώντας τον αλγόριθμο morpho-
logical active contours. Για λόγους σύγκρισης, το U-Net μοντέλο εφαρμόστηκε επίσης στις πλήρεις εικόνες
αξονικής τομογραφίας, οι οποίες παρέχουν μία κατά προσέγγιση κατάτμηση του παγκρέατος που χρησιμεύει
ως αναφορά. Τα πειραματικά αποτελέσματα του δικτύου εντοπισμού στο σύνολο δεδομένων του National
Institutes of Health (NIH) και στο σύνολο δεδομένων για τον όγκο του παγκρέατος στο Medical Segmen-
tation Decathlon δείχνουν 50,67% mean Average Precision. Το καλύτερο δίκτυο κατάτμησης πέτυχε καλά
αποτελέσματα κατάτμησης στο σύνολο δεδομένων του NIH, επιτυγχάνοντας 67.67% Dice score.
Λέξεις − κλειδιά
Καρκίνος του παγκρέατος· κατάτμηση· βαθιά μηχανική μάθηση· συνελικτικά νευρωνικά δίκτυα·
Contents
1 Introduction
1.1 Motivation and Objectives
1.2 Methodology
1.3 Thesis Outline
3.3 Segmentation Network Architecture
3.3.1 Pre-processing
3.3.2 U-Net model
3.3.3 Cropped U-Net model
3.3.4 YOLO + MorphGAC model
3.3.5 YOLO + U-Net model
3.3.6 Post-processing
3.3.7 Evaluation Metrics
3.4 Hardware Details
List of Figures
2.1 Pancreas location and anatomy. Adapted from [1] and [2].
2.2 Basic principles of CT scanning. Adapted from [3].
2.3 Abdominal CT scans in the transversal (a), coronal (b) and sagittal (c) plane.
2.4 The single-layer perceptron model. Adapted from [4].
2.5 A feedforward neural network with two hidden layers.
2.6 Momentum-based gradient descent (a) and Nesterov accelerated gradient descent (b) [5].
2.7 The dropout regularization technique. It keeps a neuron active with a probability p or sets it to zero otherwise [6].
2.8 Early-stopping during the training of a neural model [7].
2.9 Typical architecture of a convolutional neural network. Adapted from [8].
2.10 A 3x3 convolution operation with stride size 1. Adapted from [9].
2.11 Output volume of the first convolutional layer with depth size of 5. Each neuron in the convolutional layer computes the weighted sum followed by an activation function. Adapted from [10].
2.12 The maxpooling downsampling operation, with a stride of 2 and a 2x2 filter. Adapted from [10].
2.13 Basic architecture of region proposal based or two-stage detectors (a) and regression/classification or one-stage detectors (b) [11].
2.14 The flowchart of the R-CNN detector. Adapted from [12].
2.15 The YOLOv1 model [13].
2.16 Architecture of a modern one-stage detector, such as YOLOv4, consisting of one additional part, the neck. Adapted from [14].
2.17 The original PAN implementation (a) and the modified one (b) used in YOLOv4 [14].
2.18 The U-Net model architecture [15].
3.1 Calculation of the IoU metric. The predicted bounding box is depicted in purple and the ground truth in green.
3.2 The 5-level U-Net segmentation network for an input image of size 256x256.
3.3 Default cropping position of a 512x512 CT scan into a 224x224 scan.
3.4 Training procedure of the cropped U-Net model, using a default cropping position with respect to the probability map of the pancreas.
3.5 Testing procedure of the cropped U-Net model, using a default cropping position with respect to the probability map of the pancreas.
3.6 The YOLO+MorphGAC algorithm.
3.7 Training procedure of the YOLO+U-Net model, using the centroid of the ground-truth images to crop the scans and train the model.
3.8 Testing procedure of the YOLO+U-Net model, using a default cropping position when no pancreas is predicted and the centroid of the prediction when the pancreas is found.
3.9 Example of the application of the optimization step on a slice. The optimization is applied only on the middle slice of the three used.
3.10 Calculation of the DSC metric.
4.1 The Hounsfield unit values of the pancreas for the Decathlon (a) and NIH (b) dataset, respectively. In the Decathlon dataset the intensities range from -998.0 HU to +3071.0 HU, while in NIH from -875.0 HU to +458.0 HU.
4.2 Probability of pancreas in a 512x512 image for the Decathlon (a) and NIH (b) datasets, respectively.
4.3 Detection of the pancreas on an NIH (a) and a Decathlon (b) slice, respectively. The green color represents the ground truth and the purple the YOLO prediction.
4.4 Detection of the pancreas on a multilabel slice. The green color represents the ground truth and the purple the YOLO prediction. The two bigger green bounding boxes represent the two classes.
4.5 The CIoU training loss, with a minimum value of 0.18.
4.6 The validation mAP during training, reaching a maximum value of 90.9%.
4.7 The loss metric for the YOLO+U-Net during training, with a minimum value of 0.28.
4.8 Some of the best segmentations on the test set of the YOLO+MorphGAC (a), U-Net (b), cropped U-Net (c) and YOLO+U-Net (d) model, respectively. The blue line indicates the model's predicted segmentation and the red line represents the ground truth.
4.9 Some of the worst segmentations on the test set of the YOLO+MorphGAC (a), U-Net (b), cropped U-Net (c) and YOLO+U-Net (d) model, respectively. The blue line indicates the model's predicted segmentation and the red line represents the ground truth.
A.1 Examples of the SId operator on individual pixels of binary images. In the cases where the central pixel belongs to a straight line of active pixels (marked in red), the central pixel remains active, as shown in examples (a) and (b). On the contrary, if a straight line is not found, the central pixel becomes inactive, as shown in examples (c) and (d) [16].
A.2 Examples of the effect of the ISd operator on individual pixels of binary images [16].
List of Tables
Acronyms
AG attention gates
AP average precision
CT computed tomography
HU Hounsfield unit
MorphGAC morphological geodesic active contours
1 Introduction
Contents
1.1 Motivation and Objectives
1.2 Methodology
1.1 Motivation and Objectives
Cancer is the second leading cause of death worldwide, accounting for an estimated 9.6 million deaths
in 2018 [20]. Globally, about 1 in 6 deaths is caused by cancer, and pancreatic cancer is the seventh leading cause of cancer death [20]. In most cases, the symptoms are not visible until the disease has reached an advanced stage. Due to its very poor prognosis, only 25% of patients survive one year after diagnosis and the 5-year survival rate is only 6% [21].
This highlights the need for improved screening modalities and early detection. In this context, ra-
diomics, the process of applying machine learning algorithms to extract meaningful information and
automate image analysis processes from computed tomography (CT) scans or magnetic resonance
imaging (MRI), has become a very active research area. This allows the noninvasive characterization
of lesions and the assessment of their progression and possible response to therapies. However, the application of such techniques often requires manual segmentation of the structures of interest, which is a time-consuming task. Moreover, manual segmentation is also user-dependent. Consequently, the field benefits greatly from automatic segmentation techniques.
Automatic pancreas segmentation still remains a very challenging task, as the target often occupies
a very small fraction (e.g., < 0.5%) of the entire volume, has poorly-defined boundaries with respect to
other tissues and suffers from high variability in shape and size.
The goal of this dissertation is to develop a model to automate pancreas segmentation in CT scans,
in order to help the medical community to produce faster and more consistent results.
1.2 Methodology
A preliminary thorough review was conducted on the areas of deep learning and image segmentation,
focusing on medical applications, such as pancreas segmentation on CT scans and MRI. Through anal-
ysis of the state-of-the-art deep learning methods, it was concluded that a coarse-to-fine approach leads
to better segmentation performance of the pancreas. Therefore, the core idea of this work was to add
an intermediate step, as visualized in Figure 1.1, in order to reduce the area of search for pancreas and
assist the segmentation task.
Figure 1.1: Addition of an intermediate step to assist segmentation.
Following this methodology, the current dissertation follows two approaches: a) the development of
a coarse segmentation network, which serves as a reference, and b) the development and evaluation of
two-step methods, which lead to cropped segmentation networks that first localize roughly the pancreas
and then segment it. The rough localization of the pancreas is achieved either by using an estimated
probability map of the pancreas, or by using a pancreas detection model.
At first, the research focused on the selection of the best detection model, and various tests were made in order to increase its performance. In parallel, two approaches were selected for segmentation: curve evolution and deep learning.
In summary, four segmentation models were developed and compared:
• A coarse deep learning segmentation model used for both choosing the model’s architecture, as
well as for comparison afterwards.
• A two-step model, which first uses a probability map to crop the data containing the pancreas and
then applies a deep network to segment it.
• A two-step model, which first uses a detection model to localize the pancreas and then applies a
curve evolution method to segment it.
• A two-step model, which first uses a detection model to localize the pancreas and then applies a
deep network to segment it.
All models are then evaluated and a review of their limitations is provided, alongside possible future improvements. All the tools and technologies used for this work are open-source and the source code will be made available at https://github.com/adavradou.
1.3 Thesis Outline
The remainder of this dissertation is organized as follows:
• Chapter 1: The motivation and the objectives of this dissertation are presented and the methodology followed is briefly described.
• Chapter 2: Some of the fundamental concepts required for the reader to fully comprehend the
dissertation are presented and an overview of related work is given. More specifically, first the
necessary medical background is introduced, followed by the basic concepts of Active Contours
and Morphological Snakes, as well as Neural Networks and Deep Learning. This chapter ends
with an overview of the state-of-the-art deep learning algorithms used for the task of pancreas
segmentation.
• Chapter 3: The details of the datasets and implementation are described and the proposed al-
gorithms are presented. The architecture of both the detection and segmentation networks are
defined and details about the training procedure are given. An overview of the metrics used for the
evaluation of each task is also included.
• Chapter 4: The experimental results for the pancreas detection and segmentation are presented
and discussed. A comparison with the state-of-the-art is also provided.
• Chapter 5: The main conclusions of this work are summarized and possible future work is recom-
mended.
2 Basic Concepts and Related Work
Contents
2.1 Medical Background
2.1 Medical Background
The pancreas is an organ located in the upper-left part of the abdomen, surrounded by other organs
including the small intestine, stomach, liver, and spleen (Figure 2.1).
Figure 2.1: Pancreas location and anatomy. Adapted from [1] and [2].
Anatomically, the pancreas is about 12-15 cm long, shaped like a flat pear, and is divided into head, neck, body, and tail. The head is the widest part of the organ; it is positioned towards the center of the
abdomen, in the curve of the duodenum, which is the first division of the small intestine. The longest
part of the pancreas, called the body, extends slightly upwards, behind the stomach. The neck of the
pancreas separates the head from the body. Finally, the pancreas narrows towards the tail, which is
adjacent to the spleen.
The pancreas is an organ of both the digestive and the endocrine system and functions as a heterocrine
gland. As an endocrine gland, its main function is blood sugar regulation, since the endocrine
cells of the pancreas, called islets of Langerhans, produce and secrete hormones directly into the blood-
stream. Two of the main pancreatic hormones are insulin, which acts to lower blood sugar, and glucagon,
which acts to raise blood sugar.
As an exocrine gland, the pancreas produces enzymes important to digestion. These enzymes
include trypsin and chymotrypsin to digest proteins, amylase for the digestion of carbohydrates, and
lipase to break down fats. These enzymes are released into a system of small ducts that lead to the main
pancreatic duct, which runs through the body of the pancreas carrying pancreatic enzymes and other
secretions, collectively called pancreatic juice. The main pancreatic duct connects with the common
bile duct, which carries bile from the gallbladder, in the head of the pancreas, and together they form the
ampulla of Vater, which is located at the duodenum. The released enzymes travel down the pancreatic
duct into the bile duct in an inactive form. When they enter the duodenum, they are activated.
A computed tomography (CT) scan is a diagnostic imaging procedure that uses X-rays to build detailed
cross-sectional images of the interior of the body in three dimensions.
More specifically, during a CT scan, the patient lies on a bed that slowly moves through the gantry,
while the X-ray tube rotates around the patient emitting a fan-shaped X-ray beam through the patient’s
body (Figure 2.2). X-ray detectors are positioned directly opposite the X-ray source and measure the
radiation attenuated through the scanned object and transmit it to a computer.
CT is based on the fundamental principle that the density of the tissue passed by the X-ray beam
can be measured from the calculation of the attenuation coefficient. Using this principle, each time
the X-ray source completes one full rotation, the computer combines the X-ray measurements from the
various angles to reconstruct the density of the body and create the cross-sectional image or slice. The
thickness of the tissue represented in each image slice can vary depending on the CT machine used;
usually it is less than 3mm [22].
This technique acquires the data volumetrically, reconstructing the images from sinograms. The spatial resolution is related to the detector resolution and the spectral resolution to the X-ray energies. The advantage of a volumetric acquisition is the ability to reconstruct the images in three different planes: the axial, sagittal, and coronal planes (Figure 2.3).
Radiodensity refers to the relative inability of electromagnetic radiation, particularly X-rays and radio
waves, to pass through a particular material. Its quantitative measurement is given by the Hounsfield
unit (HU), used by radiologists in the interpretation of CT images. The HU, also referred to as the CT
unit, is a linear transformation of the original linear attenuation coefficient measurement, resulting in a scale in which the radiodensity at standard temperature and pressure (STP) of distilled water is defined as 0 HU
Figure 2.2: Basic principles of CT scanning. Adapted from [3].
and of air as -1000 HU respectively. Let µwater and µair be the attenuation coefficients for water and air,
respectively. In a voxel with average linear attenuation coefficient µ, we define its HU as
µ − µwater
HU = . (2.1)
µwater − µair
Table 2.1 lists the HU values for several body tissues.

Table 2.1: Hounsfield unit values of several body tissues.
Substance               HU
Bone                    1000
Liver                   40 to 60
White Matter            46
Gray Matter             43
Blood                   40
Muscle                  10 to 40
Kidney                  30
Pancreas                49 to 60
Cerebrospinal Fluid     15
Water                   0
Fat                     −50 to −100
Air                     −1000
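As a small illustration of equation (2.1), the following Python/NumPy sketch converts linear attenuation coefficients to HU and clips a CT volume to an intensity window; the attenuation value used for water and the window bounds are illustrative placeholders, not values taken from this work.

```python
import numpy as np

def to_hounsfield(mu, mu_water, mu_air):
    """Convert linear attenuation coefficients to Hounsfield units (Eq. 2.1)."""
    return 1000.0 * (mu - mu_water) / (mu_water - mu_air)

def window(hu_volume, low=-100.0, high=240.0):
    """Clip a CT volume to an intensity window (bounds here are illustrative)."""
    return np.clip(hu_volume, low, high)

# By construction, water maps to 0 HU and air to -1000 HU.
print(to_hounsfield(np.array([0.19, 0.0]), mu_water=0.19, mu_air=0.0))  # [0., -1000.]
```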
Figure 2.3: Abdominal CT scans in the transversal (a), coronal (b) and sagittal (c) plane.
Neural networks (NNs), also known as artificial neural networks (ANNs), are biologically inspired com-
putational models aiming to mirror the brain function. The predecessor of modern neural networks is the
perceptron, proposed by Rosenblatt in 1958 [23]. It is probably the oldest algorithm for training a linear
classifier and can be expressed as
\[
\hat{y} = g\!\left(b + \sum_{i=1}^{n} x_i w_i\right) = g\,(b + \mathbf{x} \cdot \mathbf{w}), \tag{2.2}
\]
where ŷ denotes the predicted output, g(.) is the activation function, b is a bias term to shift the activation
function to better fit the data and x and w are the vectors of input features and weights, respectively. The
original perceptron implementation used the Heaviside step function as an activation function; however,
nowadays, non-linear activation functions, such as the sigmoid, the hyperbolic tangent and the rectified
linear unit (ReLU) are preferred, especially for more complex networks.
As illustrated in Figure 2.4, each input component has its corresponding weight, which, if the weight
is positive, represents an excitatory synapse, and if it is negative, an inhibitory one. At each iteration
the perceptron takes a set of inputs and uses the current model to make a prediction. In the case of
binary classification, the prediction is ”1” if the weighted sum of the inputs is above a threshold, and ”0”
otherwise. If the predicted output is correct, nothing happens. Otherwise, if the model misclassifies, the
model’s weights are modified, in order to train it.
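The following minimal NumPy sketch illustrates this prediction and error-driven weight update; the learning rate and the toy AND problem are arbitrary choices made for the example, not part of the original formulation.

```python
import numpy as np

def perceptron_step(x, y, w, b, lr=0.1):
    """One perceptron update: predict with a step activation and correct on mistakes."""
    y_hat = 1 if np.dot(w, x) + b > 0 else 0   # Eq. (2.2) with the Heaviside step
    if y_hat != y:                             # weights change only on misclassification
        w = w + lr * (y - y_hat) * x
        b = b + lr * (y - y_hat)
    return w, b

# Toy usage: learn the logical AND of two binary inputs (a linearly separable problem).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 0, 0, 1])
w, b = np.zeros(2), 0.0
for _ in range(20):
    for x, y in zip(X, Y):
        w, b = perceptron_step(x, y, w, b)
```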
Since single layer perceptrons are only capable of learning linearly separable patterns, multiple layers
Figure 2.4: The single-layer perceptron model. Adapted from: [4].
of perceptrons are combined in order to create more complex architectures, called multilayer perceptrons (MLPs), which are capable of distinguishing data that are not linearly separable [24]. The value of the j-th element of layer k + 1 is computed according to
\[
z_j^{(k+1)} =
\begin{cases}
g\!\left(b_j^{(0)} + \displaystyle\sum_{i=1}^{n_0} x_i\, w_{i,j}^{(0)}\right), & \text{if } k = 0 \\[8pt]
g\!\left(b_j^{(k)} + \displaystyle\sum_{i=1}^{n_k} g(z_i^{(k)})\, w_{i,j}^{(k)}\right), & \text{if } k \geq 1
\end{cases}
\tag{2.3}
\]
\[
\mathbf{z}^{(k+1)} =
\begin{cases}
g\!\left(\mathbf{b}^{(0)} + \mathbf{x} \cdot \mathbf{w}^{(0)}\right), & \text{if } k = 0 \\[4pt]
g\!\left(\mathbf{b}^{(k)} + g(\mathbf{z}^{(k)}) \cdot \mathbf{w}^{(k)}\right), & \text{if } k \geq 1
\end{cases}
\tag{2.4}
\]
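As a loose illustration of the layer-by-layer computation in equations (2.3)-(2.4), the following NumPy sketch performs the forward pass of a small fully-connected network; the layer sizes, random parameters, and ReLU activation are arbitrary choices made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases, g=relu):
    """Forward pass: at each layer, a weighted sum plus bias followed by the activation."""
    a = x
    for W, b in zip(weights, biases):
        a = g(b + a @ W)
    return a

# Hypothetical 2-16-1 network with random parameters, just to show the shapes involved.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 16)), rng.normal(size=(16, 1))]
biases = [np.zeros(16), np.zeros(1)]
print(mlp_forward(np.array([0.5, -1.0]), weights, biases).shape)  # (1,)
```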
Multi-layer perceptrons are a class of feedforward neural networks, because the information flows
only in the forward direction: from the input layer, through the hidden layers, and finally reaches the
output layer. In contrast to other types of network architectures, such as recurrent neural networks
(RNNs) [25], there are no loops or cycles in the network [26, p. 73]. A simple feedforward neural network
with two hidden layers is exemplified in Figure 2.5.
One main advantage of these feedforward neural networks is that they are universal approximators
and just one hidden layer is enough to approximate arbitrarily closely every continuous function that
maps intervals of real numbers to some output interval of real numbers [27].
The training of the network involves adjusting the weights of the network, in order to improve the
accuracy of the result. This can be achieved by defining a loss function (also known as cost function
or objective function) and minimize it with respect to the model’s weights, usually by using iterative
gradient-based optimization algorithms. The minimization task of a loss function J can be expressed as
Figure 2.5: A feedforward neural network with two hidden layers.
\[
\mathbf{w}^{*} = \underset{\mathbf{W}}{\operatorname{argmin}}\; J(\mathbf{W})
= \underset{\mathbf{W}}{\operatorname{argmin}}\; \frac{1}{n} \sum_{i=1}^{n} L\big(f(\mathbf{x}^{(i)}; \mathbf{W}),\, y^{(i)}\big), \tag{2.5}
\]
where w = {W (0), W (1), ...} are the model’s weights, L is the loss function, f (x (i) ; w) is the predicted output
of example i with respect to x and w, n is the number of examples, and y is the target output.
Gradient descent [28] is a first-order iterative optimization algorithm for minimizing the loss function
during neural network training. More specifically, the gradient of the function at the current point is
leveraged to choose a weight update that will comprise a small step in the opposite direction of the
gradient. Therefore, the weight update using gradient descent can be done in the following way
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\, \nabla J(\mathbf{w}^{(t)}), \tag{2.6}
\]
where ∇J(w^{(t)}) is the gradient of the loss function and η > 0 is a hyperparameter called the learning rate.
The learning rate should be chosen carefully, since it determines the step size in the descent direction.
Consequently, setting the learning rate too low results in a slow convergence and setting it too high can
cause the algorithm to diverge away from the minimum [6, p. 21].
This method is also known as batch gradient descent, because each step requires the processing
of the entire training dataset, in order to evaluate ∇J (w (t) ) and update the parameters. This makes it
inapplicable for large datasets, since it is computationally too expensive. However, subject to relatively
mild assumptions, and for particular choices of η, gradient descent is guaranteed to converge to a local
minimum if J is non-convex and to a global minimum if it is.
By using the stochastic gradient descent (SGD) method, which is an on-line version of the gradient
descent, the convergence can be accelerated considerably when training on large datasets [29], as
it uses only a single example (a batch size of 1) to approximate the gradient of J (w (t) ). The term
“stochastic” indicates that the one example comprising each batch is chosen randomly. In this case, the
weight update, when the gradient is approximated at a single point x_i, is achieved using the following rule
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\, \nabla J(\mathbf{w}^{(t)}; \mathbf{x}_i, y_i). \tag{2.7}
\]
In order to achieve smoother convergence, another method, called “mini-batch” stochastic gradient de-
scent, takes the average gradient on a minibatch of N training examples at each iteration. In this way,
a compromise is made between the effectiveness of computing the true gradient and speed of approxi-
mating the gradient at a single example.
The stochastic gradient descent method has trouble navigating surfaces containing ravines, i.e.,
areas where the level curves of J are much steeper in one direction than another, which are common
around local minima [30]. In these regions, a large value of the learning rate parameter will cause diver-
gent oscillations across the ravine and a small one, will slow down convergence, since both the gradient
and η are small [31]. For this reason, momentum techniques [32], [33] are used, which accelerate con-
vergence and mitigate oscillations, by combining the current gradient with a history of the previous steps.
The techniques update w^{(t+1)} using the filtered gradient ν^{(t+1)} via the following equations [34]
\[
\boldsymbol{\nu}^{(t+1)} = \alpha \boldsymbol{\nu}^{(t)} - \eta\, \nabla J(\mathbf{w}^{(t)}), \tag{2.8}
\]
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \boldsymbol{\nu}^{(t+1)}, \tag{2.9}
\]
where α typically ranges from 0.5 to 0.95. The first term of equation 2.8 is called the momentum term and has an accumulation effect if successive gradients are in similar directions. If not, the memory of the momentum term is set to zero.
Nesterov accelerated gradient (NAG) [35] (also called Nesterov momentum) is a slightly different
version of the momentum method. The main difference between the two methods is that NAG computes
the gradient at a different position. In practice, since the parameter w^{(t+1)} is a result of the momentum term αν^{(t)}, computing w^{(t)} + αν^{(t)} gives an approximation of the position of the parameters in the next iteration, and the gradient is computed in the following way
\[
\boldsymbol{\nu}^{(t+1)} = \alpha \boldsymbol{\nu}^{(t)} - \eta\, \nabla J\big(\mathbf{w}^{(t)} + \alpha \boldsymbol{\nu}^{(t)}\big). \tag{2.10}
\]
This "look ahead" acts like a correction factor to the momentum term and performs better in many
problems [36]. The two methods are illustrated in Figure 2.6.
Figure 2.6: Momentum-based gradient descent (a) and Nesterov accelerated gradient descent (b) [5].
In recent works, adaptive gradient descent algorithms are preferred, because they tackle the prob-
lem of manually choosing the learning rate hyperparameter. The most popular adaptive optimization
algorithm is Adam [37], which stores an exponentially decaying average of both past gradients m^{(t)} (like momentum) and squared gradients ν^{(t)}. Adam's moment estimates are updated as
\[
\mathbf{m}^{(t)} = \beta_1 \mathbf{m}^{(t-1)} + (1 - \beta_1)\, \nabla J(\mathbf{w}^{(t)}),
\qquad
\boldsymbol{\nu}^{(t)} = \beta_2 \boldsymbol{\nu}^{(t-1)} + (1 - \beta_2)\, \big(\nabla J(\mathbf{w}^{(t)})\big)^{2},
\]
where m^{(t)} and ν^{(t)} are the estimates of the first moment (mean) and the second moment (the uncentered variance) of the gradients, respectively, and β1, β2 are hyperparameters acting as their respective forgetting factors. Adam's update rule is then given by
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \frac{\eta}{\sqrt{\hat{\boldsymbol{\nu}}^{(t)}} + \epsilon}\, \hat{\mathbf{m}}^{(t)}, \tag{2.14}
\]
where ν̂ and m̂ are bias-corrected first and second moment estimates.
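The following NumPy sketch illustrates a single Adam step consistent with the moment estimates and equation (2.14); the hyperparameter values and the toy quadratic loss are illustrative defaults, not settings used in this dissertation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying averages of the gradient and its square, then Eq. (2.14)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage on a toy quadratic loss J(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(w)  # w moves toward the minimum at the origin
```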
Other very popular adaptive learning rate techniques are RMSProp [38], AdaGrad [39], and Adadelta
[40].
Although in recent works adaptive gradient descent methods have gained enormous popularity, there
are some cases, such as object recognition [41] or machine translation [42], in which SGD with momen-
tum outperforms them, since they are unable to converge to an optimal solution. The exponential moving average of past squared gradients is regarded as the cause of their generalization error [43].
Finally, an efficient method to calculate ∇J(w^{(t)}; x, y) is needed, and back-propagation is a widely
used algorithm in supervised learning for that purpose [44, p. 241]. The name of the method comes
from the fact that it passes the error from the output layer back through the network, in order to compute
the gradients and update the weights, thus it ”back-propagates” the error. In practice, the algorithm
computes the derivative of the loss function with respect to each weight using the chain rule.
Although the back-propagation is often misunderstood to be the whole learning technique, it is actu-
ally only the method for computing the gradient and then another algorithm, such as the SGD mentioned
above, is used for performing the actual learning, using the gradient information [31, p. 203].
The ability of a machine learning model to perform well on previously unseen data is called generalization
and the expected value of the error on a new input is called generalization error or test error [31].
Typically, a model is initially trained on a training set, which is a set of data used to fit the model’s
parameters, namely the weights, and successively used on a validation dataset, in order to tune its
hyperparameters, such as the number of hidden units [45, p. 354].
During this training procedure, the training error is calculated and minimized. If the training error is not
sufficiently low, then the model is said to underfit. However, if the training error is low, but the test error is
high, then the model is said to overfit, which essentially means that the network ”memorizes” properties
of the training data and cannot generalize well on new examples [31, pp. 110–112]. The generalization
error can be minimized by preventing overfitting and several techniques have been proposed to achieve
that.
One method to tackle overfitting is called regularization, which adds a term to the loss function in
order to penalize large coefficients. The most common type of regularization is ridge regression [46],
also called l2 regularization, or weight decay, which adds a penalty equivalent to the squared magnitude
of all the weights to the loss function
\[
\mathbf{W}^{*} = \underset{\mathbf{W}}{\operatorname{argmin}}\; \frac{1}{n} \sum_{i=1}^{n} L\big(f(\mathbf{x}^{(i)}; \mathbf{w}),\, y^{(i)}\big) + \frac{\lambda}{n} \|\mathbf{w}\|_2^2, \tag{2.15}
\]
Another very common regularization technique is the Lasso [47] or l1 regularization, which adds the
sum of the magnitude of all the weights to the loss function:
\[
\mathbf{W}^{*} = \underset{\mathbf{W}}{\operatorname{argmin}}\; \frac{1}{n} \sum_{i=1}^{n} L\big(f(\mathbf{x}^{(i)}; \mathbf{w}),\, y^{(i)}\big) + \lambda \|\mathbf{w}\|_1, \tag{2.16}
\]
where \(\|\mathbf{w}\|_1 = \sum_i |w_i|\).
The l1 regularization leads to sparse solutions, meaning that some parameters have an optimal value
of zero. This sparsity property has been extensively used as a feature selection mechanism. However, if
feature selection is not necessary, l2 regularization is preferred, because it empirically performs better.
Another method to reduce overfitting is called dropout [48], which consists in randomly ”dropping
out” units (both hidden and visible) during the training process, by setting each neuron as inactive with
some probability p, as illustrated in Figure 2.7. An intuitive interpretation of dropout is that it prevents
the network from becoming too dependent on neuron combinations. Mathematically expressed, dropout
combines exponentially many different network architectures that can be formed by removing non-output
units from the base network.
Figure 2.7: The dropout regularization technique. It keeps a neuron active with a probability p or sets it to zero
otherwise [6].
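A minimal sketch of the dropout operation is given below, using the common "inverted dropout" formulation in which the kept activations are rescaled by 1/p so that no change is needed at test time; the keep probability shown is an arbitrary example value.

```python
import numpy as np

def dropout(a, p_keep=0.8, training=True, rng=None):
    """Inverted dropout: keep each activation with probability p_keep and rescale by
    1/p_keep, so the expected activation is unchanged and no test-time scaling is needed."""
    if not training:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) < p_keep   # 1 with probability p_keep, else 0
    return a * mask / p_keep
```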
An additional regularization technique is early-stopping, which helps determine the best number of
iterations to train a neural network. More specifically, during training, it is often observed that the training
error decreases steadily over time, but the validation error starts increasing, which is a sign that the
model starts overfitting and training should be stopped, as exemplified in Figure 2.8. In this way, a model
with the lowest validation error is obtained.
Finally, transfer learning is a popular method in deep learning, which consists of taking features
learned on one problem and leveraging them to solve new similar problems. Therefore, instead of
training a model from scratch, it is common to use a pre-trained model on a large and general enough
dataset, such as the ImageNet [49], either as a fixed feature extractor or as an initialization for the task
of interest. In the former case, the layers from the previously trained model are frozen and a classifier
Figure 2.8: Early-stopping during the training of a neural model [7].
is added on top of them and trained from scratch on the new dataset. In the latter case, some layers
of the frozen model, or all of them, are jointly trained alongside the newly-added classifier, in order to
”fine-tune” the base model for the specific task. Recent research has shown that transfer learning for
medical imaging applications offers limited gains in performance and much smaller, lightweight networks
can perform comparably to ImageNet architectures [50].
Convolutional neural networks (CNNs) are a form of feedforward neural networks specialized in processing 1D time-series data, such as speech or text, and 2D data, such as images. They have achieved enormous success in recent practical applications, since their architecture allows the utilization of spatial information, as well as the reduction of the number of parameters.
Convolutional neural networks are biologically inspired by the animal visual cortex, which, as shown
in the 1950s and 1960s by Hubel and Wiesel [51], [52], contains neurons that individually respond to
stimuli only in small regions of the visual field, known as the receptive field. That work inspired the
development of LeNet-5 [53], one of the earliest CNNs, proposed for handwritten character recognition.
Similarly to LeNet-5, every typical CNN (Figure 2.9) includes three main operations: convolution, non-
linear activation function, and pooling. The output of all these convolutional and pooling layers is then
passed to a fully-connected network, to actually perform the desired task, e.g. classification.
In the case that a two-dimensional image I is used as input and a two-dimensional kernel K, which essentially contains the model's weights, their convolution is expressed in the following way:
\[
S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\, K(i - m,\, j - n) = \sum_{m}\sum_{n} I(i - m,\, j - n)\, K(m, n). \tag{2.17}
\]
In Figure 2.10, a 3x3 convolution is illustrated and it is visible how its connectivity resembles the visual
cortex, since each neuron ”sees” only a specific region of the whole image, termed as its receptive field.
Figure 2.9: Typical architecture of a convolutional neural network. Adapted from [8].
This is called a 3x3 convolution due to the shape of the filter. The filter slides over the image and
aggregates the convolution results in the feature map. The size of the step the kernel moves every
time is called the stride and a stride size of 1 means the filter slides pixel by pixel. Bigger strides lead
to less overlap between the receptive fields and simultaneously to smaller feature maps. Since the
resulting feature map is always smaller than the input, padding can be used to prevent the feature map from shrinking. Padding surrounds the input with a layer of zero-value pixels and thus the dimensionality is
maintained.
Figure 2.10: A 3x3 convolution operation with stride size 1. Adapted from [9].
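The following naive NumPy sketch makes the stride and zero-padding behaviour concrete; it implements the sliding-window operation as a cross-correlation (as most deep learning frameworks do), and the input and filter sizes are arbitrary.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Naive sliding-window operation (cross-correlation); zero padding keeps the
    output feature map from shrinking."""
    if pad:
        image = np.pad(image, pad)
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # weighted sum over the receptive field
    return out

# A 3x3 filter over a 5x5 image with stride 1 yields a 3x3 feature map;
# with pad=1 the 5x5 spatial size is preserved.
img = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d(img, np.ones((3, 3))).shape, conv2d(img, np.ones((3, 3)), pad=1).shape)
```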
By using several kernels, distinct feature maps can be obtained, in order to extract different features.
Therefore, the output of the convolutional layer will correspond to a volume, where the height and width
are the spatial dimensions of the input layer, and the depth is given by the number of filters used, as
exemplified in Figure 2.11.
Similar to the feedforward neural networks mentioned in Section 2.2.1, the neurons in a convolutional
layer still compute a weighted-sum of the input followed by a non-linearity. The main difference is that
their connectivity is now restricted to be spatially local.
Figure 2.11: Output volume of the first convolutional layer with depth size of 5. Each neuron in the convolutional
layer computes the weighted sum followed by an activation function. Adapted from [10].
A convolutional layer is often followed by a pooling layer. Pooling is an operation that is used to reduce
dimensionality, by downsampling each feature map independently. The width and height dimensions are
reduced, while the depth size is kept intact. One of the most common pooling techniques is called max
pooling [54], which computes the maximum value in the pooling window.
Figure 2.12: The maxpooling downsampling operation, with a stride of 2 and a 2x2 filter. Adapted from [10].
Object detection is a challenging task in computer vision, which involves detecting instances of semantic
objects of a certain class (such as organs, lesions, etc.) and drawing a bounding box around them; it has therefore been used extensively in medical imaging applications. The development of convolutional
neural networks has played a vital role in the rapid evolution of object detection techniques and their
performance improvement.
A CNN for object detection can be divided into two main parts: feature extraction and detection. The
feature extraction pipeline, also known as backbone network, is the stage where distinct high-level fea-
tures are extracted from the input image through the convolutional and pooling layers. These features
are then passed to a fully connected network, also known as detection network, to actually perform the
object detection task.
The state-of-the-art deep learning object-detection algorithms can be divided into two main cat-
egories: the region-proposal-based models (two-stage detectors), and the regression/classification-
based models (one-stage detectors), illustrated in Figures 2.13 (a) and 2.13 (b) respectively. The former
ones follow a two-step approach, first they propose regions of interest (RoIs), where objects may be
present (region proposal) and then these RoIs are classified into different object categories. The lat-
ter ones follow a unified approach and manage to perform categorical prediction of objects in the input
images directly, without the region proposal step, by approaching the object detection task as a classi-
fication or regression problem. Therefore, the region-proposal-based models achieve higher accuracy
in object localization and recognition tasks, whereas the regression/classification-based models achieve
higher inference speed, making them more efficient for real-time applications.
Figure 2.13: Basic architecture of region proposal based or two-stage detectors (a) and regression/classification or
one-stage detectors (b) [11].
The R-CNN [12] detector was proposed in 2014 by Girshick et al. and it is one of the earliest
applications of CNNs to object detection. It is also the first one that showed the effectiveness of CNNs
in object detection, by achieving a much higher detection accuracy on PASCAL VOC datasets [55]
compared to the rest of the systems at that time.
As visualized in Figure 2.14, the R-CNN detector consists of three main stages: region proposal
generation, CNN-based deep feature extraction and, finally, classification and localization.
The region proposals are just smaller parts of the input image, which are likely to contain the objects
of interest. The original implementation adopts the selective search algorithm [56] to create nearly 2000
region proposals for each input image. Then, each one of these region proposals is warped to a
specific size and passed to a CNN [57], which is used to extract a 4096-dimensional feature vector from
each region proposal. Finally, the extracted features from the CNN are fed into pre-trained linear SVMs
for multiple classes, one for each object class.
The R-CNN detector introduced the fundamental concept of combining region proposals with CNNs
and many more region-proposal-based techniques followed, such as SPP-net [58], Fast R-CNN [59],
Faster R-CNN [60], R-FCN [61], FPN [62], Mask R-CNN [63], and Libra R-CNN [64].
The you only look once (YOLO) algorithm [13] was first introduced in 2016 by Redmon et al. and
it was a milestone in object detection due to its capability of detecting objects in real time with good accuracy.
The YOLO model uses a single neural network to predict bounding boxes and class probabilities
directly from full images in one evaluation. More specifically, as exemplified in Figure 2.15, YOLO divides
the input image into an S × S grid and then for each grid cell predicts B bounding boxes, confidence for
those boxes, and C conditional class probabilities, one per class, conditioned on the grid cell containing
an object. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor. The class confidence score
for each bounding box at test time is given by multiplying the box confidence score by the conditional class probability.
Figure 2.15: The YOLOv1 model [13].
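The following NumPy sketch only illustrates the shape of the S × S × (B·5 + C) output tensor and the multiplication that yields class-specific confidence scores; the grid size, number of boxes, the single "pancreas" class, and the per-box layout (x, y, w, h, confidence) are assumptions made for the example.

```python
import numpy as np

# Illustrative YOLOv1-style grid for a single-class (pancreas) detector.
S, B, C = 7, 2, 1
pred = np.random.rand(S, S, B * 5 + C)   # each cell: B boxes * (x, y, w, h, conf) + C class probs

box_conf = pred[..., [4, 9]]              # confidence of the two boxes in every cell
class_prob = pred[..., 10:]               # conditional class probabilities P(class | object)
# Class-specific confidence per box: P(class | object) * box confidence.
class_conf = class_prob[..., None, :] * box_conf[..., :, None]
print(pred.shape, class_conf.shape)       # (7, 7, 11) (7, 7, 2, 1)
```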
Many improved versions of the YOLO algorithm followed, namely YOLOv2 [65], YOLOv3 [19] and YOLOv4 [14], as well as the development of more regression/classification-based models, including SSD [66], AttentionNet [67], G-CNN [68], DSSD [69], DSOD [70], RetinaNet [71], PP-YOLO [72], and EfficientDet [73].
2.2.3.A YOLOv4
YOLOv4 is an improved version of the previous YOLO networks [13], [65], [19] and, as illustrated in
Figure 2.16, it consists of 3 main parts: the backbone, the neck, and the head, which probe feature
maps at different spatial resolutions. In order to enrich the information that reaches the head, the most
recent object detectors often include additional layers between the backbone and the head, in order to add
together element-wise or concatenate neighboring feature maps coming from different stages, before
feeding them into the head. This part of the system is called a neck.
Figure 2.16: Architecture of a modern one-stage detector, such as YOLOv4, consisting of one addi-
neck. Adapted from [14].
As a backbone feature extractor, the CSPDarknet53 is utilized, which is a combination of cross stage
partial (CSP) connections [74] with the Darknet-53 architecture, which contains 53 convolutional layers,
shown in Table 2.2. In addition to CSPDarknet53, 109 more layers have been added, leading to a
network of 162 layers in total.
For the network’s neck, spatial pyramid pooling (SPP) [58] and a modified path aggregation network
(PAN) [75] are implemented. The former increases the receptive field and the latter aggregates the
parameters from different backbone levels for different detector levels. In contrast to the original PAN
implementation, instead of adding neighbor layers together, the modified PAN of YOLOv4 concatenates
together the feature maps, as exemplified in Figure 2.17.
Figure 2.17: The original PAN implementation (a) and the modified one (b) used in YOLOv4 [14].
Finally, the head of YOLOv3 [19] is used for the actual bounding box prediction. Multi-input weighted
residual connections (MiWRC) [73] and Mish activation functions [76] are also included in the network.
As described in Section 2.2.3, YOLO approaches the detection problem as a regression problem, by
splitting the input image into a grid of S×S non-overlapping cells and predicting, for each cell, B bounding boxes, the confidence for those boxes, and C conditional class probabilities, conditioned on an object existing in the cell. The detailed YOLO loss function is the following
\[
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^{2} + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^{2} \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^{2} \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^{2} \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^{2}.
\end{aligned}
\tag{2.18}
\]
In more detail, the confidence score expresses the existence or absence of any object (i.e., the pancreas) inside the bounding box and is given by the following equation
\[
C = \Pr(\text{object}) \times \mathrm{CIoU}, \tag{2.19}
\]
where the confidence score is 0 when no pancreas exists and is equal to the complete Intersection over Union (CIoU) [77] when it does.
The CIoU is a recent improvement of the Intersection over Union (IoU), also known as the Jaccard
index. The IoU is computed in the following way
\[
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \tag{2.20}
\]
where A and B are the prediction and ground-truth bounding boxes, respectively, and the CIoU loss is expressed as
\[
L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha \nu, \tag{2.21}
\]
where b and b^{gt} are the central points of the predicted and ground-truth bounding boxes, respectively, ρ is the Euclidean distance, c is the diagonal length of the smallest enclosing box covering the two boxes, and α is a trade-off parameter defined as
\[
\alpha = \frac{\nu}{(1 - \mathrm{IoU}) + \nu}, \tag{2.22}
\]
\[
\nu = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}, \tag{2.23}
\]
where w, h, w^{gt} and h^{gt} are the width and height values for the predicted and ground-truth bounding boxes, respectively.
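A small Python sketch of the IoU of equation (2.20) and the CIoU penalty terms of equations (2.21)-(2.23) is given below; the box format (x1, y1, x2, y2) and the numerical example are assumptions made for illustration only.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), as in Eq. (2.20)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def ciou_loss(pred, gt):
    """CIoU loss of Eqs. (2.21)-(2.23): IoU term, center-distance term, aspect-ratio term."""
    i = iou(pred, gt)
    # Central points and the diagonal of the smallest enclosing box.
    pc = ((pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2)
    gc = ((gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2)
    rho2 = (pc[0] - gc[0]) ** 2 + (pc[1] - gc[1]) ** 2
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    # Aspect-ratio consistency term.
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = 4.0 / np.pi ** 2 * (np.arctan(wg / hg) - np.arctan(w / h)) ** 2
    alpha = v / ((1.0 - i) + v + 1e-9)
    return 1.0 - i + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # > 0 for partially overlapping boxes
```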
Finally, YOLO predicts class-specific confidence scores for each box, by multiplying the conditional class probabilities and the individual box confidence predictions, in the following way
\[
\Pr(\text{class}_i \mid \text{object}) \times \Pr(\text{object}) \times \mathrm{CIoU} = \Pr(\text{class}_i) \times \mathrm{CIoU}. \tag{2.24}
\]
The YOLOv4 model implements some additional new features during the training process, such as
cross mini-batch normalization (CmBN) [78] and DropBlock regularization [79]. In addition, two newly-introduced data augmentation techniques are implemented in YOLOv4: Mosaic and self-adversarial training (SAT).
Image segmentation is an important task in digital image processing, which is used to cluster pixels
into salient image regions, i.e., regions corresponding to individual objects. More specifically, image
segmentation indicates instances of recognized objects by highlighting their specific pixels and outputs
a pixel-wise mask of the image, instead of providing a coarse bounding box, as in the case of object
detection. This method gives valuable information about the shape of the object and therefore it has been
used extensively for medical imaging applications, where, e.g., the shape of a cell plays a crucial role in
determining whether it is cancerous or not.
One of the first deep learning algorithms for semantic image segmentation was proposed in 2015 by
Long et al. [80], using a fully convolutional network (FCN). The authors modified classification-purposed
CNNs [57], [81], [82] in order to perform semantic image segmentation, by replacing fully connected
layers with convolutional ones to output spatial maps instead of classification scores. In this way, they
introduced a network which includes only convolutional layers and is able to take an image of arbitrary
size as input and produce a segmentation mask of the same size. The fully convolutional network is
considered a milestone in image segmentation, demonstrating that deep networks can be trained end-
to-end for semantic segmentation on images with variable size.
Based on the FCN, the U-Net architecture was proposed by Ronneberger et al. [15] for biomedical
image segmentation. As illustrated in Figure 2.18, the model comprises two parts: a contracting, FCN-
like, path with a series of convolution and max-pooling layers and a symmetric expanding path containing
a sequence of up-convolution (or deconvolution) layers, resulting in this u-shaped architecture. The
down-sampling or contracting path is used to capture the context in the image and the up-sampling
or expanding path enables precise localization. Many extensions of the U-Net followed, including 3D
U-Net, a U-Net architecture for 3D volumes proposed by Çiçek et al. [83], U-Net++, a nested U-Net
architecture proposed by Zhou et al. [84], and TernausNet, a U-Net model with VGG11 [81] weights on
the contracting path [85].
Finally, V-Net [86] is another popular FCN-based model, proposed by Milletari et al. for 3D medical
image segmentation.
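To make the contracting/expanding structure concrete, the sketch below builds a single-level U-Net-style network with one skip connection using the Keras API; it is a generic illustration, under the assumption that TensorFlow/Keras is available, and not the architecture used in this dissertation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 1)):
    """A minimal one-level U-Net-style sketch: one contracting step, one expanding step,
    and a skip connection concatenating the matching resolutions."""
    inputs = tf.keras.Input(shape=input_shape)
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)   # contracting path
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)       # bottleneck
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c2)      # expanding path
    u1 = layers.concatenate([u1, c1])                                      # skip connection
    c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c3)                # pixel-wise mask
    return tf.keras.Model(inputs, outputs)

model = tiny_unet()
model.summary()
```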
Early work on pancreas segmentation from abdominal CT used statistical shape models or multi-atlas
techniques. In these approaches, the Dice similarity coefficient (DSC), or Dice score, on the public National Institutes of Health (NIH) dataset would not exceed 75%. Consequently, convolutional neural networks have rapidly become the mainstream methodology for medical image segmentation. Despite their good rep-
resentational power, it was observed that such deep segmentation networks are easily disrupted by the
varying contents in the background regions, when detecting small organs such as the pancreas, and
as a result produce less satisfactory results. Taking that into consideration, a coarse-to-fine approach is commonly adopted. These cascaded frameworks extract regions of interest (RoIs) and make dense predictions on those particular RoIs. More specifically, the state-of-the-art methods primarily fall into two
categories.
The first category is based on segmentation networks originally designed for 2D images, such as
the fully convolutional networks and the U-Net [15]. The U-Net architecture utilizes up-convolutions to
make use of low-level convolutional feature maps by projecting them back to the original image size,
which delineates object boundaries with details. Many frameworks have used a variant of the U-Net
architecture to segment the pancreas [87], [88] and others have made use of FCN and U-Net in order to
build more complex models. For example, Zhou et al. [89] find the rough pancreas region and then a trained FCN-based fixed-point model refines the pancreas region iteratively. Roth et al. [90] first segment pancreas regions by holistically-nested networks and then refine them using the boundary maps obtained
by robust spatial aggregation using random forests. In addition, the TernaryNet proposed by Heinrich
et al. [91] applies sparse convolutions to the CT pancreas segmentation problem, which reduce the
computational complexity by requiring a lower number of non-zero parameters. Combinations of CNNs
with recurrent neural networks [92], [93], [94] have also been applied, as well as long short-term memory
(LSTM) networks [95]. Recently, Zheng et al. [96] proposed a model, which can involve uncertainties in
the process of segmentation iteratively, by utilizing the shadowed sets theory. In some cases, in order to
incorporate spatial 3D contextual information through the 2D segmentation, slices along different views
(axial, sagittal, and coronal) are used, by fusing the results of all 2D networks, e.g. through majority
voting.
In the second category, the methods are based on 3D convolutional layers and therefore operations
such as 2D convolution, 2D max-pooling, and 2D up-convolution are replaced by their 3D counterparts.
Such networks are the 3D U-Net [83] (which is a 3D extension of the U-Net), the Dense V-Net [97],
ResDSN [98], 3DFCN [99], and more [100]. In addition, OBELISK-Net proposed by Heinrich et al. [101]
apply sparse convolutions to the 3D U-Net, Oktay et al. [102] apply attention gates (AG), and Khosravan et al. [103] utilize a projective adversarial network to perform pancreas segmentation. Finally, Zhu et al. [104] apply neural architecture search to automatically find optimal network architectures among 2D, 3D, and pseudo-3D convolutions.
2.3 Image Segmentation and Curve Evolution
Active contours, also known as ”snakes”, were first introduced in 1988 by Kass et al. [105]. These are
energy-minimizing methods, which are used extensively in medical image processing, as a segmentation
technique. The idea is to initialize a position for the contour, and then define image forces that act on the
contour, making it change its position and adapt to the image’s features. For instance, Kass et al. [105]
define three energies: the internal energy (Eint ), the image energy (Eimage ) and the constraint energy
(Econ ), which represent the internal energy of the snake due to bending, the image’s energy (which
takes into consideration, for example, edges and lines) and the energy of external constraint forces
(which takes into consideration constraints created by the user), respectively.
In their paper, the authors propose different expressions for each energy, which are beyond the scope of this work and will not be presented. As an example, given a parametrization of a snake's position of the form v(s) = (x(s), y(s)), s ∈ [0, 1], the internal energy proposed by Kass et al. [105] is defined as
\[
E_{int} := \int_{0}^{1} \alpha(s)\,|v'(s)|^{2} + \beta(s)\,|v''(s)|^{2}\; ds. \tag{2.25}
\]
The weights α(s) and β(s) control the relative importance of the first and second-order terms, respec-
tively. According to Kass et al. [105], the first-order term is responsible for making the snake act more
like a membrane, whereas the second-order term is responsible for making the snake act more like a
thin plate. Minimization of the first term leads to the minimization of the distance between the snake’s
points, causing the contour to shrink. The second term enforces the smoothness of the curve and avoids
oscillations. In practice, a large value of the first one penalizes changes in distances between points in
the contour and a large value of the latter penalizes high contour curvatures. Thus, setting β(s) = 0
allows the snake to become second-order discontinuous, which means that the ‘optimal snake’ (i.e., the
snake that minimizes Eint ) may have one or more corners.
After defining the remaining energies, Kass et al. [105] present their active contour algorithm as the
solution of the following energy minimization problem:
E*_snake := ∫_0^1 E_snake(v(s)) ds := ∫_0^1 [ E_int(v(s)) + E_image(v(s)) + E_con(v(s)) ] ds, (2.26)

where ∫_C f(r) ds := ∫_a^b f(r(t)) |r'(t)| dt is the line integral of f along a piecewise smooth curve C,
with r : [a, b] → C an arbitrary bijective parametrization of the curve C such that r(a) and r(b) are the
endpoints of C, with a < b.
Since the minimization of energy leads to dynamic behavior in the segmentation and because of
the way the contours slither while minimizing energy, Kass et al. [105] called them snakes. It should
also be noted that this minimization problem is generally not convex, which means that different snake
initializations may lead to different segmentations. In order to overcome the problems of bad initialization
and local minima, many variants of this method have been proposed, such as using a balloon force to
encourage the contour expansion [106] or incorporating gradient flows [107]. Geometric models for
active contours were also introduced, by Caselles et al. [108], Yezzi et al. [109], and more [110], [111],
as well as a geodesic active contour (GAC) model by Caselles et al. [112]. Cremers et al. [113] incorporated
statistical shape knowledge in a single energy functional, by modifying the Mumford-Shah functional
[114] and its cartoon limit, and Chan and Vese [115] introduced a region-based method that does not
use an edge function but instead minimizes an energy that can be seen as a particular case of the
minimal partition problem. Additional region-based methods were proposed, such as those by Li et
al. [116], Zhu and Yuille [117], and Tsai et al. [118].
Inspired by the active contour evolution, a new framework for image segmentation was proposed in [16],
which the authors called Morphological Snakes. Instead of computing the snake that minimizes E*_snake,
Alvarez et al. [16] proposed a new approach, which focuses on finding the solution of a certain set of
partial differential equations (PDEs). As will be shown below, this approach also yields a snake-like
curve and, because it approximates the numerical solution of the standard PDE snake model by the
successive application of a set of morphological operators (such as dilation or erosion) defined on a
binary level set, it results in a much simpler, faster, and more stable curve evolution.
The authors of [16], [119] have introduced morphological versions of two of the most popular curve evo-
lution algorithms: morphological active contours without edges (MorphACWE) [115] and morphological
geodesic active contours (MorphGAC) [112].
MorphACWE works well when pixel values of the inside and the outside regions of the object to
segment have different average. It does not require that the contours of the object are well defined, and
it can work over the original image without any preprocessing.
MorphGAC is suitable for images with visible contours, even when these contours are noisy, cluttered,
or partially unclear. It requires, however, that the image is preprocessed to highlight the contours and
the quality of the MorphGAC segmentation depends greatly on this preprocessing step.
Considering the application of this work, the MorphGAC algorithm was adopted and will be further
explained. An inverse Gaussian gradient function, proposed by the authors, is used to highlight the
contours; it computes the magnitude of the gradients in the image and then inverts the result into the
range [0, 1].
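For illustration, the sketch below shows how such a MorphGAC segmentation could be run with the
scikit-image library, whose inverse_gaussian_gradient and morphological_geodesic_active_contour
functions provide the preprocessing and the curve evolution described here; the seed point, number of
iterations, and parameter values are illustrative assumptions rather than the exact settings of this work.

```python
import numpy as np
from skimage.segmentation import (inverse_gaussian_gradient,
                                  morphological_geodesic_active_contour)

def morphgac_segment(ct_slice, seed, num_iter=200, radius=10):
    """Segment one CT slice with MorphGAC, starting from a small disk
    centred on `seed` = (row, col). Parameter values are illustrative."""
    # Preprocessing: invert the gradient magnitude into the range [0, 1],
    # so that strong edges become valleys that stop the contour.
    gimage = inverse_gaussian_gradient(ct_slice.astype(float),
                                       alpha=100.0, sigma=2.0)

    # Initial level set: a binary disk around the seed point.
    rows, cols = np.ogrid[:ct_slice.shape[0], :ct_slice.shape[1]]
    init_ls = ((rows - seed[0]) ** 2 + (cols - seed[1]) ** 2
               <= radius ** 2).astype(np.int8)

    # A positive balloon force makes the contour expand from the seed.
    return morphological_geodesic_active_contour(
        gimage, num_iter, init_level_set=init_ls,
        smoothing=2, threshold=0.7, balloon=1)
```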
Since the full mathematical details are not essential to the overall system, they are presented in
Appendix A.
3 Proposed Algorithms and Implementation
3.1 Datasets
This work relied on two datasets, the pancreas tumour task dataset within the Medical Segmentation
Decathlon [120] and the NIH (National Institutes of Health) dataset [121], [122], [123].
The Decathlon dataset is a recent multi-institutional effort to generate a large, open-source collection
of annotated medical image datasets of various clinically relevant anatomies. The pancreas dataset
was collected by the Memorial Sloan Kettering Cancer Center (New York, NY, USA) and contains 420
3D CT scans, 281 for training and 139 for testing. It consists of patients undergoing resection of pan-
creatic masses (intraductal mucinous neoplasms, pancreatic neuroendocrine tumours, or pancreatic
ductal adenocarcinoma). The CT scans have slice thickness of 2.5 mm and a resolution of 512x512
pixels. An expert abdominal radiologist performed manual slice-by-slice segmentation of the pancreatic
parenchyma and pancreatic mass (cyst or tumour), using the Scout application [124].
The NIH dataset contains 80 abdominal contrast-enhanced 3D CT scans from 53 male and 27 female
subjects. It consists of healthy patients, since 17 of them are healthy kidney donors scanned prior to
nephrectomy and the remaining 65 subjects were selected by a radiologist from patients who neither
had major abdominal pathologies nor pancreatic cancer lesions. The CT scans have slice thickness
between 1.5 and 2.5 mm and a resolution of 512x512 pixels, with varying pixel sizes. The pancreas was
manually segmented in each slice by a medical student and the segmentations were verified/modified
by an experienced radiologist.
Due to limited computational power, only a subset of the data was used. More specifically, the de-
tection model was trained and evaluated using both datasets and the segmentation model was trained
and evaluated using only the NIH dataset. More details are given in Sections 3.2.2 and 3.3.2, respec-
tively. In addition, only the information from the transversal plane was taken into consideration. In order
to evaluate the variability in the data, a probability map was recreated to define the most likely position
of the pancreas. Moreover, the average Hounsfield range and percentage of the volume occupied by
the pancreas were also evaluated. The results are presented in Section 4.1.
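As an illustration of how such a probability map can be obtained, the following minimal sketch averages
aligned binary ground-truth masks; the helper name and the assumption that the masks are already
resampled to a common 512x512 grid are hypothetical.

```python
import numpy as np

def pancreas_probability_map(masks):
    """Estimate a per-pixel probability of pancreas occurrence from a list
    of binary ground-truth masks that share the same 2D grid."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks], axis=0)
    # Fraction of the masks in which each pixel is labelled as pancreas.
    return stack.mean(axis=0)
```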
3.2.1 Pre-processing
Pre-processing is a crucial step in machine learning in order to improve the performance of the models.
For the detection task, the intensity of all data was clipped between -200 and +300 HU, in order to capture
the pancreas intensity range and intensify its boundaries. More specifically, this range was carefully
chosen considering the data exploration results that will be later presented in Section 4.1. The -125 to
+225 intensity range was also tested for clipping, as was the use of 16-bit input images, in order to provide
the algorithm with more information. However, both proved to be ineffective and were not adopted.
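A minimal sketch of this clipping step is given below; the 8-bit rescaling used when exporting slices for
the detection network is an assumption about the export format rather than a detail stated here.

```python
import numpy as np

def clip_hounsfield(volume, low=-200, high=300):
    """Clip CT intensities to the pancreas-relevant Hounsfield range
    used for the detection task (-200 to +300 HU)."""
    return np.clip(volume, low, high)

def to_uint8(volume, low=-200, high=300):
    """Rescale the clipped range linearly to 8-bit images (assumed export
    format for the detection network)."""
    clipped = clip_hounsfield(volume, low, high)
    return ((clipped - low) / (high - low) * 255).astype(np.uint8)
```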
3.2.2 YOLOv4
In this work, the publicly available YOLOv4 [14] described in Section 2.2.3.A was deployed as a detection
network. The detection network was trained with both the Decathlon and the NIH datasets. These
datasets were shuffled and split into training and validation sets. More specifically, 66 scans from NIH
and 34 scans from Decathlon (100 in total) were used for training with 20% validation. The hold-out
method was used to evaluate the performance of the model, by using 10% of the scans (5 of each
dataset) for testing. The full YOLOv4 network architecture used for pancreas detection is presented in
Appendix B. A visualization is also provided using the netron app [125].
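The hold-out split described above could be reproduced along the following lines; the scan identifiers
and the random seed are placeholders, and the exact split used in this work may differ.

```python
import random

def holdout_split(nih_ids, decathlon_ids, n_test_each=5, val_frac=0.2, seed=0):
    """Hold out 5 scans per dataset for testing and split the remaining
    scans 80/20 into training and validation sets."""
    rng = random.Random(seed)
    nih, dec = list(nih_ids), list(decathlon_ids)
    rng.shuffle(nih)
    rng.shuffle(dec)
    test = nih[:n_test_each] + dec[:n_test_each]
    train_val = nih[n_test_each:] + dec[n_test_each:]
    rng.shuffle(train_val)
    n_val = int(round(val_frac * len(train_val)))
    return train_val[n_val:], train_val[:n_val], test
```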
Relying on studies regarding the effectiveness of transfer learning in deep networks [126], [127],
weights pre-trained on the ImageNet dataset [57] were used. Afterwards, YOLOv4 was fine-tuned and
re-trained with the pancreas images. The model was trained for 20000 iterations with a batch size of 64
and a learning rate of 0.0013.
The performance of all models was evaluated through the hold-out methodology, using the correspond-
ing test datasets. For the quality assessment of the predictions, various metrics were used, i.e., preci-
sion, recall, IoU, and mean average precision (mAP) for the detection task. All of the metrics are briefly
introduced in the following paragraphs.
Precision, also known as positive predictive value (PPV), measures how accurate the model’s predic-
tions are, i.e., the percentage of the predictions that are correct. It is given as the ratio of true positives
and the total number of predicted positives, as seen in the following expression:
PPV = TP / (TP + FP), (3.1)

where TP denotes the true positives (predicted correctly as positive) and FP the false positives (predicted
incorrectly as positive).
Similarly, recall, also known as sensitivity or true positive rate (TPR), measures how well all the
positives are predicted. It is given as the ratio of the true positives and the total of ground truth positives,
as seen in the following equation:
TPR = TP / (TP + FN), (3.2)

where FN denotes the false negatives (positives predicted incorrectly as negative).
The IoU metric, also known as Jaccard index, is a popular similarity measure for object detection
problems using the predicted and ground-truth bounding boxes. Evidently, as exemplified in Figure 3.1,
a bigger overlap between the two bounding boxes results in a higher IoU score and, therefore, higher
detection accuracy.
Figure 3.1: Calculation of the IoU metric. The predicted bounding box is depicted in purple and the ground truth
in green.
For object detection tasks, it is common to calculate precision and recall for a given IoU threshold
value, which in the current work was set to 0.5. Therefore, a prediction is regarded as correct if IoU ≥
0.5.
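For reference, a small sketch of these quantities is given below, computing the IoU of two axis-aligned
boxes and the precision and recall from the TP/FP/FN counts obtained with the IoU ≥ 0.5 rule; the
(x1, y1, x2, y2) box convention is an assumption.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """PPV and TPR as defined in (3.1) and (3.2)."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return ppv, tpr
```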
The average precision (AP) is a default evaluation metric in the PascalVOC competition [55, p. 313],
which derives from the area under the precision/recall curve. The precision/recall curve is computed
from a model’s ranked output and the predictions are ranked with respect to the confidence score of
each bounding box. In this work, the detection model is set to keep only predictions with a confidence
score higher than 25%. The AP provides an indication of the shape of the precision/recall curve, and is
defined as the interpolated average precision at a set of eleven equally spaced recall levels [0, 0.1,...,
1], expressed as:
AP = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} p_interp(r). (3.3)

At each recall level r, the interpolated precision p_interp is calculated by taking the maximum precision
measured at any recall level greater than or equal to r, given by the following formula:

p_interp(r) = max_{r̃ : r̃ ≥ r} p(r̃), (3.4)

where p(r̃) is the measured precision at recall r̃.
Since mAP is calculated by taking the average of the AP over all classes, and only one class (pancreas)
is considered in this work, the two terms will be used interchangeably in the current context.
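A minimal sketch of this 11-point interpolation is shown below, assuming the precision/recall curve has
already been computed from the ranked detections.

```python
import numpy as np

def average_precision_11pt(precisions, recalls):
    """11-point interpolated AP as in (3.3)-(3.4). `precisions` and
    `recalls` are the points of the precision/recall curve obtained by
    ranking detections by confidence score."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        # Interpolated precision: maximum precision at any recall >= r.
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
```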
3.3 Segmentation Network Architecture
3.3.1 Pre-processing
Intensity clipping was not adopted for the segmentation task, since it did not show any improvement
in the performance of the models. For the segmentation network, curvature driven image denoising
is applied on each slice, in order to make the pixel distribution more uniform. Finally, contrast limited
adaptive histogram equalization (CLAHE) and normalization were also investigated for the segmentation
task, but they also proved to be ineffective, in terms of Dice score.
An adaptation of the original U-Net architecture was chosen as a segmentation model. More specifically,
following the original implementation, 3×3 convolutional kernels are used in the contracting path with a
stride of 1, each followed by a 2x2 max-pooling operation with stride 2, and 2×2 up-convolutional kernels
with a stride of 2 are implemented in the expansive path. Each convolution layer, except for the last
one, is followed by a ReLU, and dropout of 0.5 is selected. However, the number of levels in both the
descending and ascending paths of the network was increased from four blocks to five blocks and the
number of filters per layer was reduced compared to the original implementation, as visualized in Figure
3.2. In addition, residual connections were included in each convolutional block, padding was added and
the sigmoid function was used as an activation function for the last layer. Due to memory limitations, all
scans were resized from 512x512 to 256x256.
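The following sketch illustrates one such residual convolutional block in Keras-style code; the exact layer
ordering, the 1x1 projection used to match channel counts, and the filter counts are assumptions based
on the description above, not the exact implementation used in this work.

```python
from tensorflow.keras import layers

def residual_conv_block(x, filters, dropout=0.5):
    """One convolutional block of the modified U-Net: two padded 3x3
    convolutions with ReLU, dropout, and a residual (skip) connection."""
    # 1x1 projection so that the shortcut matches the channel count
    # (an assumption about how the residual connection is realised).
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Dropout(dropout)(y)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    return layers.Add()([shortcut, y])
```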
Figure 3.2: The 5-level U-Net segmentation network for an input image of size 256x256.
As mentioned in Section 3.1, the U-Net model was trained on 50 volumes from the NIH dataset; 45
of them were used for training and 5 for validation.
Since the pancreas occupies only a very small region of a CT-Scan, the work [86] is followed, and a
DSC-loss layer is used to prevent the model from being heavily biased towards the background class.
In more detail, the DSC between two voxel sets, A and B, can be expressed as
DSC(A, B) = 2|A ∩ B| / (|A| + |B|), (3.5)

and this is slightly modified into a loss function between the ground-truth mask Y and the predicted mask
Ŷ, in the following way:

L(Ŷ, Y) = 1 − (2 Σ_i ŷ_i y_i + ε) / (Σ_i ŷ_i + Σ_i y_i + ε), (3.6)

where the sums run over all pixels i and ε is a small smoothing constant that avoids division by zero.
The model is trained with batches of 128 instances and optimized using Adam with a learning rate of
0.0001. Early-stopping is also used, in order to avoid overfitting of the network, and the model with the
lowest validation loss is chosen.
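A possible TensorFlow/Keras formulation of this loss and training setup is sketched below; the smoothing
constant eps and the Keras-based implementation are assumptions, and `unet` is a placeholder for the
segmentation model defined above.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1.0):
    """Soft Dice loss of (3.6); `eps` is a small smoothing constant."""
    y_true = tf.reshape(y_true, [tf.shape(y_true)[0], -1])
    y_pred = tf.reshape(y_pred, [tf.shape(y_pred)[0], -1])
    intersection = tf.reduce_sum(y_true * y_pred, axis=1)
    denom = tf.reduce_sum(y_true, axis=1) + tf.reduce_sum(y_pred, axis=1)
    return 1.0 - tf.reduce_mean((2.0 * intersection + eps) / (denom + eps))

# Training configuration described above (illustrative model object `unet`):
# unet.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
#              loss=dice_loss)
# early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
#                                               restore_best_weights=True)
```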
In order to reduce the irrelevant information in a CT-scan and train the U-Net on scans with more useful
information, the scans were cropped to a smaller size. The rest of the training procedure described in
Section 3.3.2 was kept intact.
The image size of the cropped scans was carefully decided, as on the one hand it should be as
small as possible, but also contain the whole pancreas. Taking into account the information retrieved
from Figure 4.2, which shows the probability of pancreas existing in a 512x512 CT slice, as well as
the maximum size of pancreas in both datasets, a default cropping position was decided. The default
cropping position has its centroid at position (x = 287, y = 250) and image size of 224x224, as visualized
in Figure 3.3.
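The default crop can be expressed as a simple slicing operation, sketched below; the assumption that x
indexes columns and y indexes rows is one interpretation of the coordinates given above.

```python
def crop_around(image, centroid=(287, 250), size=224):
    """Crop a 512x512 slice to size x size around the default cropping
    centroid (x=287, y=250); x is assumed to index columns and y rows."""
    x, y = centroid
    half = size // 2
    r0, c0 = y - half, x - half
    return image[r0:r0 + size, c0:c0 + size]
```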
Figure 3.3: Default cropping position of a 512x512 CT-scan into a 224x224 scan.
Therefore, the U-Net is trained and tested on 224x224 images, as shown in Figures 3.4 and 3.5
respectively.
Figure 3.4: Training procedure of the cropped U-Net model, using a default cropping position with respect to the
probability map of pancreas.
Figure 3.5: Testing procedure of the cropped U-Net model, using a default cropping position with respect to the
probability map of pancreas.
In order to overcome the problem of initialization, the MorphGAC algorithm was combined with the YOLO
detection model. The MorphGAC segmentation was applied on cropped bounding box predictions using
their centroid as an initialization point, as exemplified in Figure 3.6. The segmentation results are then
repositioned back to 512x512 images, in order to form the final 3D segmented pancreas.
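The repositioning step can be sketched as pasting the cropped prediction back at the crop's origin; the
(row, col) origin convention is an assumption.

```python
import numpy as np

def paste_back(mask_crop, box_origin, full_shape=(512, 512)):
    """Reposition a cropped 2D segmentation into the full 512x512 slice.
    `box_origin` is the (row, col) of the crop's upper-left corner."""
    full = np.zeros(full_shape, dtype=mask_crop.dtype)
    r0, c0 = box_origin
    full[r0:r0 + mask_crop.shape[0], c0:c0 + mask_crop.shape[1]] = mask_crop
    return full
```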
3.3.5 YOLO + U-Net model
An additional two-step approach with cropped CT-scans is proposed, which combines the YOLO de-
tection model to crop the images containing the pancreas, as well as the default cropping position
mentioned in Section 3.3.3, to crop the images without pancreas. The U-Net architecture visualized in
Figure 3.2 and the training procedure described in 3.3.2 were used. The training and testing procedure
of YOLO + U-Net model are presented in Figures 3.7 and 3.8, respectively.
Figure 3.7: Training procedure of the YOLO+U-Net model, using the centroid of the ground-truth images to crop
the scans and train the model.
Figure 3.8: Testing procedure of the YOLO+U-Net model, using a default cropping position when no pancreas is
predicted and the centroid of the prediction, when pancreas is found.
3.3.6 Post-processing
A post-processing step is adopted for all predicted segmentations, aiming for optimization and a smoother
result, by leveraging spatial information. Since the pancreas has a uniform and undivided shape, it is un-
likely that a pixel in a 2D slice has a different value from both the previous and the next slice. Taking that
into consideration, all slices in a predicted segmentation are compared in groups of three and the values
of the middle ones are changed, when different from the other two. In Figure 3.9, the segmentation
improvement on the middle slice is depicted, after post-processing it.
Figure 3.9: Example of the application of the optimization step on a slice. The optimization is applied only on the
middle slice of the three used.
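A minimal sketch of this three-slice consistency rule is given below, assuming the predicted segmentation
is a binary volume ordered along the slice axis.

```python
import numpy as np

def smooth_slices(seg):
    """Post-processing sketch: for every triplet of consecutive slices,
    set a voxel of the middle slice to the value shared by its two
    neighbours whenever it disagrees with both of them.
    `seg` is a binary volume of shape (slices, rows, cols)."""
    out = seg.copy()
    prev_, next_ = seg[:-2], seg[2:]
    neighbours_agree = prev_ == next_
    middle_disagrees = seg[1:-1] != prev_
    out[1:-1] = np.where(neighbours_agree & middle_disagrees, prev_, seg[1:-1])
    return out
```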
The performance of all models was evaluated through the hold-out methodology, using the correspond-
ing test datasets. Since the MorphGAC algorithm needs no training, the YOLO+MorphGAC approach was
evaluated on the detection model's test dataset. For the quality assessment of the segmentation task,
the DSC or Dice score was used. Because of the similarity between the IoU metric and the Dice score,
the former was not used for the evaluation of the segmentation models.
The Dice score is the most common evaluation metric for the segmentation of medical images and
it is calculated as twice the area of overlap divided by the total number of pixels in both images, as
illustrated in Figure 3.10.
Figure 3.10: Calculation of the DSC metric.
Evidently, the DSC is equivalent to the IoU, using the following expression:
DSC = 2·IoU / (1 + IoU). (3.7)
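For completeness, a small sketch of the Dice computation on binary masks and of the conversion in
(3.7) is given below.

```python
import numpy as np

def dice_score(pred, target):
    """DSC between two binary masks, as in (3.5)."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    return 2.0 * np.logical_and(pred, target).sum() / denom if denom else 1.0

def iou_to_dice(iou):
    """Equivalent conversion of (3.7)."""
    return 2.0 * iou / (1.0 + iou)
```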
Due to the Covid-19 situation, several different computers were used in order to complete
this work.
More specifically, the development and the first runs of the U-Net and YOLO networks took place on
a laptop with an Intel Core i7-8750H CPU, NVIDIA GeForce GTX 1080 GPU, 16 GB of RAM, and the
Ubuntu 18.04 LTS operating system.
The training of the U-Net model was then continued on a desktop with an Intel Core i7-4930 CPU,
NVIDIA GeForce RTX 2080 Ti GPU, 32 GB of RAM, and the Ubuntu 18.04 LTS operating system.
The testing of the YOLO+U-Net and YOLO+MorphGAC models took place on Google Colab.
The final YOLO detection model was trained on a desktop with an Intel Core i7-6700K CPU, NVIDIA
GeForce GTX 1070 GPU, 16 GB of RAM, and the Ubuntu 20.04 LTS operating system.
Finally, the YOLO+MorphGAC model was developed on a laptop with an Intel Core i5-4210U CPU,
NVIDIA GeForce 820M GPU, 8 GB of RAM, and the Ubuntu 18.04 LTS operating system.
4 Results and Discussion
This chapter presents the results obtained in this dissertation. First, pancreas details concerning its
average Hounsfield range, its size and its location in both Decathlon and NIH datasets are presented.
Then, experimental results of the detection model tested on volumes from both datasets are shown and
discussed. Finally, all four segmentation models are evaluated on volumes from the NIH dataset and
afterwards, the best two are compared with the state-of-the-art.
4.1 Data Exploration
The average Hounsfield range and percentage of the volume occupied by the pancreas were evaluated.
In Figure 4.1, the pancreas intensity values for both datasets are illustrated; the Decathlon dataset has
a mean of +80.63 ± 57.91 HU and the NIH dataset a mean of +86.71 ± 32.18 HU.
Figure 4.1: The Hounsfield unit values of the pancreas for the Decathlon (a) and NIH (b) dataset, respectively.
In the Decathlon dataset, the pancreas intensities range from -998.0 HU to +3071.0 HU, while in the NIH
dataset they range from -875.0 HU to +458.0 HU.
In addition, the mean percentage of the pancreas in the abdominal CT scans is 0.46% and 0.49% for
Decathlon and NIH, respectively, taking into consideration only values above -800 HU, in order to exclude
the air. Moreover, a probability map was recreated to define the most likely position of the pancreas.
In Figure 4.2, the probability map of the pancreas in a 2D slice is visualized for both datasets. In the
Decathlon dataset, the pancreas location ranges from pixel 150 to 434 in the x-axis and from pixel 139
to 348 in the y-axis. Similarly, in NIH dataset, the pancreas location ranges from 167 to 405 in the x-axis
and from 143 to 360 in the y-axis.
Figure 4.2: Probability of pancreas occurrence in a 512x512 image for the Decathlon (a) and NIH (b) datasets, respectively.
4.2 Detection Task
The detection model was evaluated on the holdout test set containing volumes from both the NIH and
the Decathlon datasets and it can predict pancreas successfully in both of them, as visualized in Figure
4.3.
Figure 4.3: Detection of pancreas on a NIH (a) and Decathlon (b) slice, respectively. The green color represents
the ground-truth and the purple the YOLO prediction.
The results of the detection model evaluated by the PPV, TPR, IoU, and mAP metrics described in
Section 3.2.3 are presented in Table 4.1. The detection model shows a mAP of 50.67% on the test set,
with the IoU being 47.70%, the precision 0.63, and the recall 0.52. However, the performance of the
model varies significantly between the two datasets, with the mAP being 71.43% for the NIH dataset
and 29.92% for the Decathlon dataset. For that reason, the performance of the model in Table 4.1 is
also evaluated separately for each dataset. More specifically, the model shows an IoU of 57.43% with a
precision and recall of 0.63 and 0.52 respectively, on the NIH dataset. On the Decathlon dataset, the
model has 37.96% IoU, the precision is 0.5 and the recall 0.36.
The low detection accuracy of the YOLO model on the Decathlon dataset derives from the multi-label
nature of some ground-truth files. More specifically, it was later realized that the YOLO model was configured
to be trained on one class and, therefore, since the pancreatic parenchyma and the pancreatic mass (cyst
or tumour) are annotated separately as different classes in the Decathlon dataset, the second class was
ignored, as shown in Figure 4.4. In addition, during the conversion from NIfTI to TIFF format (which is
required by YOLO), all the individual blobs of healthy pancreas surrounding the pancreatic mass are
annotated as separate tiny bounding boxes. Nevertheless, as visualized in this figure, it is interesting
to notice that the model outputs a fairly good prediction for the class that it was trained on (upper-right
green bounding box).
Figure 4.4: Detection of pancreas on a multilabel slice. The green color represents the ground-truth and the purple
the YOLO prediction. The two bigger green bounding boxes represent the two classes.
Unfortunately, the problem was not discovered in time, since the YOLO model was at first trained only on
the NIH dataset, which contains only healthy pancreases. Nevertheless, this could be overcome by
merging the annotations into a single class, since the goal of the work was to segment the healthy
pancreas. Due to the subsequent unavailability of the computer on which YOLO was trained, this could
not be realized in the current dissertation.
In Figures 4.5 and 4.6, the metrics used for the assessment of the detection model during training
are visualized.
Figure 4.5: The CIoU training loss with a minimum value of 0.18.
Figure 4.6: The validation mAP during training, reaching a maximum value of 90.9%.
4.3 Segmentation Task
The segmentation models were evaluated on the holdout set containing volumes from the NIH dataset
and the pancreas can be successfully segmented, as visualized in Figure 4.8.
The segmentation performance of all models on the NIH dataset is presented in Table 4.2. The
YOLO+MorphGAC model has the best performance achieving a Dice score of 67.67 ± 8.62%, followed
by the YOLO+U-Net model with a Dice score of 64.87 ± 4.79%. The U-Net and cropped U-Net show
a lower performance, with Dice scores of 59.91 ± 13.69% and 59.08 ± 12.76%, respectively. The
DSC obtained by the cropped U-Net was lower than that of the YOLO+U-Net and showed a
higher standard deviation. These preliminary results suggest that YOLO was able to improve the segmentation
when compared to using a probability map. A similar behavior was observed for the U-Net model
with improved results when using the YOLO model. Regarding segmentation performance, the highest
average Dice score was obtained with the YOLO+MorphGAC model and the lowest standard deviation
with the YOLO+U-Net model.
The loss function used for the assessment of the segmentation model during training is visualized in
Figure 4.7.
Figure 4.7: The metric of loss for the YOLO+U-Net during training with a minimum value of 0.28.
In Figures 4.8 and 4.9 representative slices of the best and worst results of the proposed segmenta-
tion models are presented, respectively.
4.3.1 Comparison with the state-of-the-art
In Table 4.3, a comparison of the two best proposed models with state-of-the-art deep learning
models is presented. Only models that implement a 2D approach and are also trained on the NIH
dataset were taken into consideration. The proposed models display a low performance compared to the
other networks. However, when examining the training methodology of all the state-of-the-art methods,
it is observed that they adopt cross-validation and leverage the whole NIH dataset for training, which was
not possible in the current work due to computational and time limitations. These two reasons could be
the main causes of the low segmentation performance of the proposed YOLO+U-Net model, and following
the state-of-the-art training methodology could result in a higher Dice score.
Figure 4.8: Some of the best segmentations on the test set of the YOLO+MorphGAC (a), U-Net (b), cropped U-Net
(c) and YOLO+U-Net (d) model, respectively. The blue line indicates the model’s predicted segmenta-
tion and the red line represents the ground-truth.
Figure 4.9: Some of the worst segmentations on the test set of the YOLO+MorphGAC (a), U-Net (b), cropped
U-Net (c) and YOLO+U-Net (d) model, respectively. The blue line indicates the model’s predicted
segmentation and the red line represents the ground-truth.
5 Conclusion and Future Work
5.1 Conclusions
Taking into account the state-of-the-art cascaded segmentation models, the main goal of the current
work was to investigate and present simpler two-step approaches for the segmentation of the pancreas
in CT. In the end, three models were investigated and compared to the U-Net implementation. The
two-step approach was achieved by using a detection network for the pancreas localization prior to
the segmentation. This approach, which was combined with both morphological snakes and a U-Net
segmentation network, proved to be the most efficient, showing that the organ detection benefits the
segmentation task. Another approach was to reduce the background information by cropping the data
using a pancreas probability map and then using a U-Net network to segment this smaller area. How-
ever, this showed no improvement in segmentation performance, when compared to U-Net segmentation
results on whole CT-scans. Due to the Covid-19 situation, many computational limitations existed, which
resulted in insufficient training of both the detection and the segmentation network, which is the main
reason of the low performance of the proposed models. Since the segmentation performance lies below
70%, it is concluded that further investigation and model improvement is needed, in order to tackle the
challenging nature of the pancreas segmentation problem efficiently.
5.2 Future Work
Regarding the proposed models, there are changes that could be investigated to further improve the
performance, focusing on YOLO+MorphGAC and YOLO+U-Net architectures. First and foremost, since
the detection network plays a vital role in both of them, its improvement would also lead to an overall
improved performance of the models. Considering that YOLOv4 is one of the most modern and efficient
state-of-the-art detection models, experimenting with different detectors would not be the main priority. A slight
improvement could come from testing different backbone models for feature extraction, such as ResNet
[129] or EfficientNet [130]. More importantly, leveraging the whole Decathlon dataset and training on
a larger dataset could also lead to a better detection performance. Regarding the proposed U-Net
architecture, similarly to the detection task, the need to train on a larger dataset and to adopt the training
methodology of the state-of-the-art models is undeniable. In addition, a YOLO+FCN approach would also
be interesting to investigate. Finally, exploring 3D versions of the architectures could also improve
the results. The present work focuses on the segmentation of the healthy pancreas; nevertheless, since
the Decathlon dataset also provides tumour masks, tumour segmentation could also be explored.
Bibliography
[3] D. J. Brenner and E. J. Hall, “Computed tomography—an increasing source of radiation exposure,”
New England Journal of Medicine, vol. 357, no. 22, pp. 2277–2284, 2007.
[4] https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53.
[5] https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12.
[6] N. Buduma and N. Locascio, Fundamentals of deep learning: Designing next-generation machine
intelligence algorithms. O’Reilly Media, Inc., 2017.
[8] M. A. Mazurowski, M. Buda, A. Saha, and M. R. Bashir, “Deep learning in radiology: An overview
of the concepts and a survey of the state of the art with focus on mri,” Journal of magnetic reso-
nance imaging, vol. 49, no. 4, pp. 939–954, 2019.
[9] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” ArXiv e-prints, mar
2016.
[10] “Convolutional neural networks (cnns / convnets),” University Course, 2020. [Online]. Available:
https://cs231n.github.io/convolutional-networks/
[11] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu, “A survey of deep learning-based
object detection,” IEEE Access, vol. 7, pp. 128 837–128 868, 2019.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object
detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2014, pp. 580–587.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object
detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2016, pp. 779–788.
[14] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object
detection,” arXiv preprint arXiv:2004.10934, 2020.
[15] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image
segmentation,” in International Conference on Medical image computing and computer-assisted
intervention. Springer, 2015, pp. 234–241.
[17] I. Backer, I. Vos, O. Vanderveken, D. Devolder, M. Braem, D. van dyck, and W. Backer, “Com-
bining mimics and computational fluid dynamics (cfd) to assess the efficiency of a mandibular
advancement device (mad) to treat obstructive sleep apnea (osa),” 03 2020.
[18] S. Lim, J. H. Bae, E. J. Chun, H. Kim, S. Y. Kim, K. M. Kim, S. H. Choi, K. S. Park, J. C. Flo-
rez, and H. C. Jang, “Differences in pancreatic volume, fat content, and fat density measured by
multidetector-row computed tomography according to the duration of diabetes,” Acta diabetolog-
ica, vol. 51, no. 5, pp. 739–748, 2014.
[21] A. Acs, “Cancer facts and figures 2010,” American Cancer Society, National Home Office, Atlanta,
pp. 1–44, 2010.
[22] L. W. Goldman, “Principles of ct: multislice ct,” Journal of nuclear medicine technology, vol. 36,
no. 2, pp. 57–68, 2008.
[23] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in
the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958.
[24] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of control,
signals and systems, vol. 2, no. 4, pp. 303–314, 1989.
[26] A. Zell, Simulation neuronaler netze. Addison-Wesley Bonn, 1994, vol. 1, no. 5.3.
[27] K. Hornik, M. Stinchcombe, H. White et al., “Multilayer feedforward networks are universal approx-
imators.” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
[28] C. Lemaréchal, “Cauchy and the gradient method,” Doc Math Extra, vol. 251, p. 254, 2012.
[30] R. Sutton, “Two problems with back propagation and other steepest descent learning procedures
for networks,” in Proceedings of the Eighth Annual Conference of the Cognitive Science Society,
1986, 1986, pp. 823–832.
[31] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016,
vol. 1.
[32] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks,
vol. 12, no. 1, pp. 145–151, 1999.
[33] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Com-
putational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[34] L. B. Almeida, “C1. 2 multilayer perceptrons,” Handbook of Neural Computation C, vol. 1, 1997.
[35] Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of conver-
gence o (1/kˆ 2),” in Doklady an ussr, vol. 269, 1983, pp. 543–547.
[36] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momen-
tum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147.
[37] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.
[38] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning lecture 6a
overview of mini-batch gradient descent,” Cited on, vol. 14, no. 8, 2012.
[39] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochas-
tic optimization.” Journal of machine learning research, vol. 12, no. 7, 2011.
[40] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[41] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional
networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), July 2017.
[42] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg,
G. Corrado et al., “Google’s multilingual neural machine translation system: Enabling zero-shot
translation,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351,
2017.
[43] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint
arXiv:1904.09237, 2019.
[45] B. D. Ripley, Pattern recognition and neural networks. Cambridge university press, 2007.
[46] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,”
Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[47] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical
Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical
image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee,
2009, pp. 248–255.
[50] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, “Transfusion: Understanding transfer learning
for medical imaging,” in Advances in neural information processing systems, 2019, pp. 3347–3357.
[51] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The
Journal of physiology, vol. 148, no. 3, p. 574, 1959.
[52] ——, “Receptive fields and functional architecture of monkey striate cortex,” The Journal of physi-
ology, vol. 195, no. 1, pp. 215–243, 1968.
[53] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[54] Y.-T. Zhou and R. Chellappa, “Computation of optical flow using a neural network.” in ICNN, 1988,
pp. 71–78.
[55] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object
classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338,
2010.
[56] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object
recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
[57] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[58] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for
visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9,
pp. 1904–1916, 2015.
[59] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision,
2015, pp. 1440–1448.
[60] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with
region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–
99.
[61] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional
networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
[62] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks
for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recog-
nition, 2017, pp. 2117–2125.
[63] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE interna-
tional conference on computer vision, 2017, pp. 2961–2969.
[64] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra r-cnn: Towards balanced learn-
ing for object detection,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2019, pp. 821–830.
[65] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, 2017, pp. 7263–7271.
[66] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot
multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[67] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, “Attentionnet: Aggregating weak directions
for accurate object detection,” in Proceedings of the IEEE International Conference on Computer
Vision (ICCV), December 2015.
[68] M. Najibi, M. Rastegari, and L. S. Davis, “G-cnn: an iterative grid based object detector,” in Pro-
ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2369–
2377.
[69] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,”
arXiv preprint arXiv:1701.06659, 2017.
[70] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object
detectors from scratch,” in Proceedings of the IEEE international conference on computer vision,
2017, pp. 1919–1927.
[71] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in
Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[72] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren, S. Han, E. Ding
et al., “Pp-yolo: An effective and efficient implementation of object detector,” arXiv preprint
arXiv:2007.12099, 2020.
[73] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2020.
[74] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “Cspnet: A new
backbone that can enhance learning capability of cnn,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[75] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.
8759–8768.
[76] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint
arXiv:1908.08681, 2019.
[77] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss: Faster and better learning
for bounding box regression.” in AAAI, 2020, pp. 12 993–13 000.
[78] Z. Yao, Y. Cao, S. Zheng, G. Huang, and S. Lin, “Cross-iteration batch normalization,” arXiv
preprint arXiv:2002.05712, 2020.
[79] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,”
in Advances in Neural Information Processing Systems, 2018, pp. 10 727–10 737.
[80] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.
3431–3440.
[81] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[82] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Ra-
binovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 1–9.
[83] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense
volumetric segmentation from sparse annotation,” in International conference on medical image
computing and computer-assisted intervention. Springer, 2016, pp. 424–432.
[84] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture
for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal
Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[85] V. Iglovikov and A. Shvets, “Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for
image segmentation,” arXiv preprint arXiv:1801.05746, 2018.
[86] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric
medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV). IEEE,
2016, pp. 565–571.
[87] L. Lu, L. Jian, J. Luo, and B. Xiao, “Pancreatic segmentation via ringed residual u-net,” IEEE
Access, vol. 7, pp. 172 871–172 878, 2019.
[88] Y. Man, Y. Huang, J. Feng, X. Li, and F. Wu, “Deep q learning driven ct pancreas segmentation
with geometry-aware u-net,” IEEE transactions on medical imaging, vol. 38, no. 8, pp. 1971–1980,
2019.
[89] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille, “A fixed-point model for
pancreas segmentation in abdominal ct scans,” in International conference on medical image
computing and computer-assisted intervention. Springer, 2017, pp. 693–701.
[90] H. R. Roth, L. Lu, N. Lay, A. P. Harrison, A. Farag, A. Sohn, and R. M. Summers, “Spatial ag-
gregation of holistically-nested convolutional neural networks for automated pancreas localization
and segmentation,” Medical image analysis, vol. 45, pp. 94–107, 2018.
[91] M. P. Heinrich, M. Blendowski, and O. Oktay, “Ternarynet: faster deep model inference without
gpus for medical 3d segmentation using sparse and binary convolutions,” International journal of
computer assisted radiology and surgery, vol. 13, no. 9, pp. 1311–1320, 2018.
[92] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille, “Recurrent saliency transformation
network: Incorporating multi-stage visual cues for small organ segmentation,” in Proceedings of
the IEEE conference on computer vision and pattern recognition, 2018, pp. 8280–8289.
[93] Z. Yang, L. Zhang, M. Zhang, J. Feng, Z. Wu, F. Ren, and Y. Lv, “Pancreas segmentation in
abdominal ct scans using inter-/intra-slice contextual information with a cascade neural network,”
in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology
Society (EMBC). IEEE, 2019, pp. 5937–5940.
[94] J. Cai, L. Lu, F. Xing, and L. Yang, “Pancreas segmentation in ct and mri images via domain spe-
cific network designing and recurrent neural contextual learning,” arXiv preprint arXiv:1803.11303,
2018.
[95] H. Li, J. Li, X. Lin, and X. Qian, “Pancreas segmentation via spatial context based u-net and
bidirectional lstm,” arXiv preprint arXiv: 1903.00832, 2019.
[96] H. Zheng, Y. Chen, X. Yue, C. Ma, X. Liu, P. Yang, and J. Lu, “Deep pancreas segmentation with
uncertain regions of shadowed sets,” Magnetic Resonance Imaging, vol. 68, pp. 45–52, 2020.
[98] Z. Zhu, Y. Xia, W. Shen, E. Fishman, and A. Yuille, “A 3d coarse-to-fine framework for volumetric
medical image segmentation,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018,
pp. 682–690.
[99] H. R. Roth, H. Oda, Y. Hayashi, M. Oda, N. Shimizu, M. Fujiwara, K. Misawa, and K. Mori,
“Hierarchical 3d fully convolutional networks for multi-organ segmentation,” arXiv preprint
arXiv:1704.06382, 2017.
[100] N. Zhao, N. Tong, D. Ruan, and K. Sheng, “Fully automated pancreas segmentation with two-stage
3d convolutional neural networks,” in International Conference on Medical Image Computing and
Computer-Assisted Intervention. Springer, 2019, pp. 201–209.
[101] M. P. Heinrich, O. Oktay, and N. Bouteldja, “Obelisk-net: Fewer layers to solve 3d multi-organ
segmentation with sparse deformable convolutions,” Medical image analysis, vol. 54, pp. 1–9,
2019.
[103] N. Khosravan, A. Mortazi, M. Wallace, and U. Bagci, “Pan: Projective adversarial network for
medical image segmentation,” in International Conference on Medical Image Computing and
Computer-Assisted Intervention. Springer, 2019, pp. 68–76.
[104] Z. Zhu, C. Liu, D. Yang, A. Yuille, and D. Xu, “V-nas: Neural architecture search for volumetric
medical image segmentation,” in 2019 International Conference on 3D Vision (3DV). IEEE, 2019,
pp. 240–248.
[105] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour models,” International journal of
computer vision, vol. 1, no. 4, pp. 321–331, 1988.
[106] L. D. Cohen, “On active contour models and balloons,” CVGIP: Image understanding, vol. 53,
no. 2, pp. 211–218, 1991.
[107] S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi, “Gradient flows and geometric
active contour models,” in Proceedings of IEEE International Conference on Computer Vision.
IEEE, 1995, pp. 810–815.
[108] V. Caselles, F. Catté, T. Coll, and F. Dibos, “A geometric model for active contours in image pro-
cessing,” Numerische mathematik, vol. 66, no. 1, pp. 1–31, 1993.
[109] A. Yezzi, S. Kichenassamy, A. Kumar, P. Olver, and A. Tannenbaum, “A geometric snake model
for segmentation of medical imagery,” IEEE Transactions on medical imaging, vol. 16, no. 2, pp.
199–209, 1997.
[110] G. Sundaramoorthi, A. Yezzi, and A. C. Mennucci, “Sobolev active contours,” International Journal
of Computer Vision, vol. 73, no. 3, pp. 345–366, 2007.
[111] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling with front propagation: A level set
approach,” IEEE transactions on pattern analysis and machine intelligence, vol. 17, no. 2, pp.
158–175, 1995.
[112] V. Caselles, R. Kimmel, and G. Sapiro, “Geodesic active contours,” in Proceedings of IEEE inter-
national conference on computer vision. IEEE, 1995, pp. 694–699.
[113] D. Cremers, C. Schnorr, and J. Weickert, “Diffusion-snakes: combining statistical shape knowl-
edge and image information in a variational framework,” in Proceedings IEEE Workshop on Vari-
ational and Level Set Methods in Computer Vision. IEEE, 2001, pp. 137–144.
[114] D. B. Mumford and J. Shah, “Optimal approximations by piecewise smooth functions and associ-
ated variational problems,” Communications on pure and applied mathematics, 1989.
[115] T. F. Chan and L. A. Vese, “Active contours without edges,” IEEE Transactions on image process-
ing, vol. 10, no. 2, pp. 266–277, 2001.
[116] C. Li, C.-Y. Kao, J. C. Gore, and Z. Ding, “Minimization of region-scalable fitting energy for image
segmentation,” IEEE transactions on image processing, vol. 17, no. 10, pp. 1940–1949, 2008.
[117] S. C. Zhu and A. Yuille, “Region competition: Unifying snakes, region growing, and bayes/mdl for
multiband image segmentation,” IEEE transactions on pattern analysis and machine intelligence,
vol. 18, no. 9, pp. 884–900, 1996.
[118] A. Tsai, A. Yezzi, and A. S. Willsky, “Curve evolution implementation of the mumford-shah func-
tional for image segmentation, denoising, interpolation, and magnification,” IEEE transactions on
Image Processing, vol. 10, no. 8, pp. 1169–1186, 2001.
[121] H. R. Roth, A. Farag, E. B. Turkbey, L. Lu, J. Liu, and R. M. Summers, “Data from pancreas-ct.”
The Cancer Imaging Archive, 2016. [Online]. Available: https://doi.org/10.7937/K9/TCIA.2016.
tNB1kqBU
[122] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “Deeporgan:
Multi-level deep convolutional networks for automated pancreas segmentation,” in International
conference on medical image computing and computer-assisted intervention. Springer, 2015,
pp. 556–564.
[123] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt,
M. Pringle et al., “The cancer imaging archive (tcia): maintaining and operating a public information
repository,” Journal of digital imaging, vol. 26, no. 6, pp. 1045–1057, 2013.
[124] B. M. Dawant, R. Li, B. Lennon, and S. Li, “Semi-automatic segmentation of the liver and its
evaluation on the miccai 2007 grand challenge data set,” 3D Segmentation in The Clinic: A Grand
Challenge, pp. 215–221, 2007.
[125] https://netron.app/.
[126] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data
engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[127] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers,
“Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset
characteristics and transfer learning,” IEEE transactions on medical imaging, vol. 35, no. 5, pp.
1285–1298, 2016.
[128] P. Y. Simard, D. Steinkraus, J. C. Platt et al., “Best practices for convolutional neural networks
applied to visual document analysis.” in Icdar, vol. 3, no. 2003, 2003.
[129] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceed-
ings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[130] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,”
arXiv preprint arXiv:1905.11946, 2019.
[131] S. Osher and J. A. Sethian, “Fronts propagating with curvature-dependent speed: algorithms
based on hamilton-jacobi formulations,” Journal of computational physics, vol. 79, no. 1, pp. 12–
49, 1988.
[132] R. Kimmel, Numerical Geometry of Images: Theory, Algorithms, and Applications. Springer
Science & Business Media, 2004.
[133] F. Guichard, J.-M. Morel, and R. Ryan, “Contrast invariant image analysis and pde’s,” 2004.
A Mathematical details of morphological snakes
Let u : R⁺ × R² → R be an implicit representation of C, such that C(t) = {(x, y); u(t, (x, y)) = 0}. Then, if
the curve evolution has the form C_t = F · N, where N is the normal to the curve and F is a scalar field
which determines the evolution velocity of each point in the curve, the evolution of any function u(x, y)
which embeds the curve as one of its level sets is [131], [132]

∂u/∂t = F · |∇u|. (A.1)

In particular, for a constant velocity F = ±1 this becomes

∂u/∂t = ±|∇u|, (A.2)

and if F = K, where K is the Euclidean curvature of C, the intrinsic heat equation [133] results and can
be written as

∂u/∂t = div(∇u/|∇u|) · |∇u|, (A.3)
since the divergence of the normalized gradient is the curvature of the implicit curve at each point.
In order to approximate these PDEs by morphological operators, the two most common such operators,
the dilation and the erosion, are first defined. A dilation D_h with radius h of a function u is expressed as

(D_h u)(x) = sup_{y ∈ hB} u(x + y), (A.4)

and an erosion E_h as

(E_h u)(x) = inf_{y ∈ hB} u(x + y), (A.5)

where in both definitions, B = B(0, 1) is the ball of radius 1, centered at 0, and the term hB is the set B scaled
by h. The same authors [119] show that the successive application of the morphological operators
D_h and E_h yields the level-set evolution PDE of (A.2). In addition, they have introduced a curvature
operator, defined as SI_d ◦ IS_d, which has an infinitesimal behavior like the PDE of (A.3). Assume two
morphological operators T_h^1 and T_h^2. As shown in [16], their composition can be approximated as

(T^2_{h/2} ◦ T^1_{h/2}) u ≈ (T^2_{h/2} u + T^1_{h/2} u) / 2, (A.6)
where h should be an adequately small radius. The effect of the SId and ISd operators is depicted
through some examples in Figures A.1 and A.2, respectively.
Figure A.1: Examples of the SId operator on individual pixels of binary images. In the cases where the central pixel
belongs to a straight line of active pixels (marked in red), the central pixel remains active, as shown in
examples (a) and (b). On the contrary, if a straight line is not found, the central pixel becomes inactive,
as shown in the examples (c) and (d) [16].
Figure A.2: Examples of the effect of the ISd operator on individual pixels of binary images [16].
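For illustration, discrete dilations and erosions of a binary level set, the elementary operations composed
by the morphological snakes, can be realized with SciPy as sketched below; the 3x3 structuring element
stands in for a "small radius" h and is an illustrative choice.

```python
import numpy as np
from scipy import ndimage

# One discrete dilation / erosion step of a binary level set; the 3x3
# structuring element is an illustrative choice for the small radius h.
STRUCTURE = np.ones((3, 3), dtype=bool)

def dilate(u):
    return ndimage.binary_dilation(u, structure=STRUCTURE).astype(np.int8)

def erode(u):
    return ndimage.binary_erosion(u, structure=STRUCTURE).astype(np.int8)
```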
A morphological approach to curve evolution for surfaces of any dimension was later also introduced
by Marquez-Neila et al. [119].
A.0.1 Morphological Geodesic Active Contours
In more detail, the active contour equation of the GAC framework, in terms of a level-set implementation,
can be rewritten as

∂u/∂t = g(I)|∇u| div(∇u/|∇u|) + g(I)|∇u|ν + ∇g(I)·∇u, (A.7)

where ν ∈ R is the balloon force parameter, I is the image intensity and, typically, g(I) could be

g(I) = 1 / √(1 + α|∇(G_σ ∗ I)|). (A.8)

The first (smoothing) term of (A.7) is handled by successive applications of the curvature operator
SI_d ◦ IS_d introduced above:

u^{n+1}(x) = ((SI_d ◦ IS_d)^µ u^n)(x), (A.9)

where µ ∈ N is the number of successive applications of the smoothing operator, which controls the
strength of the smoothing step.
The second term of equation (A.7), the balloon term, is controlled by the factor g(I), which becomes
lower when the curve is approaching its objective, and thus the balloon force is not needed, and becomes
higher when a corresponding segment is located far from the target region, and hence the balloon force
must be stronger. The effect of g(I) can be discretized by using a single threshold value θ; if g(I) is
greater than θ, the corresponding point is updated according to the balloon force, otherwise, the point is
left unchanged. The sign of ν defines whether the remaining factors (ν|∇u|) lead to dilation or erosion of
the PDEs in (A.2). Given the curve status at iteration n, u^n : Z^d → {0, 1}, the balloon force applied to u^n
can be approximated using the following morphological expression:

u^{n+1}(x) = (D_d u^n)(x), if g(I)(x) > θ and ν > 0,
u^{n+1}(x) = (E_d u^n)(x), if g(I)(x) > θ and ν < 0,
u^{n+1}(x) = u^n(x), otherwise, (A.10)
where Dd and Ed refer to the discrete version of dilation and erosion, as shown in (A.4) and (A.5),
respectively.
The third component of equation (A.7), the attraction force, has an immediate discrete version, as will
be shown below.
Contrary to the PDEs, which combine the three components through addition, the morphological
solution combines them by composition. In each iteration, the morphological balloon is first applied,
then the morphological smoothing, and finally the discretized attraction force over the embedding level
set function u. Given the snake evolution at iteration n, un , the un+1 can be defined using the following
steps
(Dd un )(x), i f g(I)(x) > θ and ν > 0,
1
un+ 3 (x) = (Ed un )(x), i f g(I)(x) > θ and ν < 0,
n
u (x), otherwise,
1
1, i f ∇un+ 3 ∇g(I)(x) > 0,
(A.11)
2 1
un+ 3 (x) = 0,
i f ∇un+ 3 ∇g(I)(x) < 0,
un+ 31 , i f ∇un+ 31 ∇g(I)(x) = 0,
µ
n+ 32
u (x) = SI ◦ IS u
n+1
(x),
d d
B Architecture of YOLOv4 network for pancreas detection
Table B.1: Architecture details of the YOLOv4 detection model.