Generalizability of Semantic Segmentation Techniques: Keshav Bhandari Texas State University, San Marcos, TX


Generalizability of Semantic Segmentation Techniques: A Comparative Case Study using LiTS Data Sets

Keshav Bhandari
Texas State University, San Marcos, TX
k [email protected]

Abstract

In recent years, the popularity of semantic segmentation in computer vision has increased massively. The proposed deep learning architectures each have their own pros and cons. Some architectures require a huge amount of training data, while others rely on heavy use of data augmentation to address this. These models are, however, trained and evaluated on baseline datasets, and there is ample room to discuss their generalizability to rare data sets. In this work we try to show how naïve architectures in this domain fail and require modification. We also discuss the intuition behind a modification and experiment with our own custom architecture based on the work of [7]. The major objective of this work is to diagnose the weak spots in the generalizability of the previous work [2, 7] and to define a new research goal for future work.

1. Introduction

The semantic segmentation, or scene labeling, task in computer vision is to identify a class for each pixel. In contrast with object detection, semantic segmentation is more precise and more complex. Semantic segmentation tasks open up a huge range of applications in different domains. However, even after decades of research it is still a tough task.

1.1. Summary

Convolutional neural networks provide weight sharing, an important scheme for the localization tasks that we often see in semantic segmentation. However, a naïve implementation of a convolutional neural network does not work well for such tasks, as we can see from the work of [7] outperforming [2]. Semantic segmentation has another major challenge: it requires a lot of training data. Data augmentation is therefore an equally important aspect, as we can see in the work of [7]. However, as these experiments are performed on baseline data sets and are not discussed much under adverse conditions where data sets are challenging, there is room to doubt the generalizability of these algorithms.

In this paper, we implement a naïve convnet and a custom U-Net-based convnet for semantic segmentation. The baseline dataset we use is the LiTS data set, from a challenge organised in conjunction with ISBI 2017 and MICCAI 2017. The objective is to segment liver lesions in contrast-enhanced abdominal CT scans. Due to their heterogeneous and diffusive shape, automatic segmentation of tumor lesions is very challenging. Until now, only interactive methods have achieved acceptable results in segmenting liver lesions. Moreover, the nature of these scans is sequential. For the majority of scans, lesions appear gradually in the sequence, so there are no semantic labels in such cases. This opens another research opportunity to consider sequence modeling as well. We will discuss this in future work.

This work tries to investigate why naïve convnets failed in this task and how a simple modification of the naïve implementation performs well.

1.2. Previous Work

The ISBI challenge launched in the context of the ISBI 2012 conference (Barcelona, Spain, 2-5 May 2012) opened a new line of contributions in the field of medical image segmentation. The best method at that time (a sliding-window convolutional network) was outperformed in 2015 by the work of [7], which won the ISBI cell tracking challenge. These works drew a great deal of attention from researchers working in computer vision. Work like [5] uses semi-supervised approaches, while other works like [9] try to combine these ideas and address the need for sequential modeling in the work of [7]. There are many other impressive works in this field.

FRRN (Full-Resolution Residual Networks) [6, 8] is one of the state-of-the-art models. It uses multi-scale processing through two separate streams, the residual stream and the pooling stream. This helps it process semantic features for higher classification accuracy. It progressively down-samples the feature maps in the pooling stream while processing the feature maps at full resolution in the residual stream. Together, the two streams provide semantic features for high classification accuracy and low-level pixel information for high localization accuracy.

FRRN does an excellent job, but with heavy processing overhead at every scale. PSPNet [12, 8] is another state-of-the-art model that gets around this overhead. It uses ResNet- and DenseNet-like architectures to extract features. This architecture combines multi-scale feature maps without applying many convolutions.

The One Hundred Layers Tiramisu (FCDenseNet) [4, 8] is another model, which uses the U-Net architecture. The main contribution of this architecture is the clever use of dense connections similar to those of the DenseNet classification model.

The state-of-the-art models discussed so far have a huge parameter overhead. DeepLabV3 [1, 8] tries to address this overhead by using a feature-extraction frontend. This is a very lightweight model. It down-samples the input images by a factor of 16, so although there are high odds of getting good localization, it can lead to poor pixel accuracy. The main contribution of this architecture is the clever use of atrous convolutions. However, it still uses the same upscaling techniques as PSPNet.

One attempt to review these techniques was made in [3], which reviewed existing methods, highlighting the contributions and significance of those methods in the field. The recent work of [10] is more closely related to ours. Its contribution is the development and evaluation of recent advances in uncertainty estimation and model interpretability in the context of semantic segmentation, using several enhanced Fully Convolutional Network architectures, and it is an impressive piece of work.

Figure 1: Sample image scans (left) and segmentation (right)

1.3. Methods and Results

In this paper, we implement two different architectures: a naïve symmetric convolutional neural network-based model and a modified version of this naïve implementation. The modified model is inspired by the work of [7].

The first method is a symmetric convolutional neural network with the same down-sampling and up-sampling architecture. The 512 x 512 x 1 input is down-sampled to 16 x 16 x 1024 and up-sampled back to 512 x 512 x 1. We modified the network to store the indices of the max values during down-sampling. These indices are used in the corresponding stages of up-sampling to obtain an approximate inverse by recording the locations of the maxima within each pooling region. This helps preserve the structure of the activations [11].

The second method is a modification of the first. This approach is inspired by the U-Net architecture [7]; however, it is tailored to use a different skip-connection scheme. We exploit the symmetry of our model to pass the outputs of each stage of the down-sampling part to the corresponding stage of the up-sampling part and add them together. Moreover, as discussed, we also preserve the structure of the activations by approximating the true inverse during max-unpooling.

The former method tries to predict the pixel values to find the segmentation region corresponding to specific scans, whereas the latter tries to predict a class for every pixel within the mask. The two tasks differ in their objectives: the former is regression and the latter is classification.

From the experimental results we can see that the latter method works better than the former. We discuss this further in the results section.

2. Problem Description

Manually identifying liver lesions is a cumbersome task. Computer-aided segmentation will improve efficiency and save valuable time in the medical industry and in medical research, where researchers have to deal with thousands of such tasks in their usual work routine. Creating a deep learning model for automatic segmentation is a challenging task. Here we discuss two methods that gained popularity a few years ago, with a few modifications. However, the major focus here is to demonstrate the failure and generalizability of deep learning models and to identify why a few modifications are needed for improvement, which also opens up further research directions for the future.

2.1. Methodology

We will demonstrate how a naïve implementation of a convolutional neural network is not enough for the segmentation task, and how a simple modification of this implementation drastically changes the results and learning speed. We perform our experiments on the LiTS data set. This data set contains 125 CT scan files. Each scan contains a varying number of sequential images. A detailed description of the data set is given below.

Info              Description
Source            MICCAI 2017
Number of Scans   125 scans, 108890 images, ~871 images per scan on average
Train Scans       117 scans, 101966 images
Test Scans        8 scans, 6924 images
Data Format       NIFTI file format

Figure 2: Convnet based model. Corresponding layers from downsampling and upsampling are stacked together. Layers shown: Conv2d (k = 3, s = 1, p = 1), Max Pool 2d (k = 3, s = 2, p = 1), Max Unpool 2d (k = 4, s = 2, p = 1), ConvTranspose2d (k = 3, s = 1, p = 1), Softmax. Feature map sizes shown: 1 x 512 x 512, 64 x 512 x 512, 64 x 256 x 256, 128 x 256 x 256, 128 x 128 x 128, 256 x 128 x 128, 256 x 64 x 64, 512 x 64 x 64, 512 x 32 x 32, 1024 x 32 x 32, 1024 x 16 x 16.

Figure 3: U-net based model. Corresponding layers from downsampling and upsampling are stacked together. Same layers and feature map sizes as in Figure 2, with skip connections that add (+) the down-sampling outputs to the corresponding up-sampling stages.

Sample data with label segmentation is shown in figure 1. Our first model architecture is defined in figure 2. This convolutional neural network-based architecture has two main parts, down-sampling and up-sampling. Dimensions are

equal within levels in corresponding parts. The most important thing to notice about this architecture is how max-unpooling is done: we pass the indices of the max values at corresponding stages so that we can construct an approximate inverse. Corresponding stages are the max-pool and max-unpool steps with the same dimensions, as shown in figures 2 and 3. Consider the following example. Assume max-pool with kernel = 2 and stride = 1, max-unpool with kernel = 3 and stride = 1, and A as a latent feature matrix. The example implicitly assumes that the entries grow with their index, so that the maximum of each pooling window is its bottom-right entry.

\[
A = \begin{bmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{bmatrix},
\qquad
A' = \mathrm{maxpool}(A) = \begin{bmatrix} a_5 & a_6 \\ a_8 & a_9 \end{bmatrix}
\]

Max-unpooling in general from A', when we don't store the indices:

\[
A'' = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & a_5 & 0 & a_6 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & a_8 & 0 & a_9 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]

\[
A^{-1} = \mathrm{maxUnpool}(A') = \mathrm{maxPool}(A'')
= \begin{bmatrix} a_5 & a_6 & a_6 \\ a_8 & a_9 & a_9 \\ a_8 & a_9 & a_9 \end{bmatrix}
\]

Max-unpooling as we used it, from A', when we store the indices:

\[
A^{-1} = \mathrm{maxUnpool}(A')
= \begin{bmatrix} 0 & 0 & 0 \\ 0 & a_5 & a_6 \\ 0 & a_8 & a_9 \end{bmatrix}
\]

We can see that the inverse calculated in the latter way preserves the original matrix better.

We define the first model as a pixel prediction problem, minimizing mean squared error. We trained the models for 100 epochs.

Our second model, shown in figure 3, is introduced as an improvement over the first one. We believe our first model suffers a huge loss of features during up-sampling. Inspired by the U-Net architecture [7], we implement a U-Net-like architecture with a minor difference. Here we pass the output of each down-sampling layer to the corresponding up-sampling layer and add them together, whereas the original U-Net architecture crops and concatenates. Using image processing on the training data, we create a mask from the given ground truth and run a classification problem for each pixel, deciding whether it is within the mask, by minimizing cross-entropy loss. We trained the models for 100 epochs.

The following is a summary of the two models.

Info/Model   Conv-based            Unet-based
Loss         MSE                   Binary Cross Entropy
Optimizer    Adam                  Adam
#epochs      100                   100
Tr.Time      8                     6
GPU          NVIDIA 1080Ti, 16GB   NVIDIA 1080 Ti, 16GB
Library      Pytorch               Pytorch

Code can be found at https://github.com/keshavsbhandari/Image_Segmentation

3. Results

Both models show a significant improvement in validation loss over training loss, as we can see in figures 5 and 6. However, the convolution-based model did a worse job from a generalization perspective. We analyze our results by simulating a real-time scanning and segmentation task on the test scans. The convnet-based model averaged out all the features and gave the same segmentation result throughout the simulation, as we can see in figure 4.

Figure 4: Experimental results from both the convnet-based and U-Net-based models (columns: scans, ground truth, convnet model, U-Net-based model). The convnet-based model does not show any generalization in localization accuracy. The U-Net-based model performs comparatively well.

Figure 5: Convnet based model training and validation loss over number of epochs

Figure 6: U-net based model training and validation loss over number of epochs

To perform a quantitative analysis we used a metric called the Jaccard index. The Jaccard index, also known as Intersection over Union (IOU), is a statistic used for comparing the similarity and dissimilarity of sample sets. The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets:

\[
IOU = J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\]

(If A and B are both empty, we define J(A, B) = 1.)

\[
0 \le J(A, B) \le 1
\]

The Jaccard distance is a measure of dissimilarity between samples. Jaccard similarity compares two binary vectors, so we can compute it easily for the U-Net-based model, but for our convnet-based model we can use the generalized version of the Jaccard index, given by

\[
IOU = J(L, E) = \frac{\sum_i \min(L_i, E_i)}{\sum_i \max(L_i, E_i)}
\]

Moreover, we also calculate the pixel accuracy for the U-Net-based model. The pixel accuracy is defined as

\[
\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
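The two Jaccard formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the code from the authors' repository: the function names `jaccard_binary` and `jaccard_generalized` and the toy vectors are hypothetical, masks are assumed to be flattened into equal-length sequences, the generalized form assumes non-negative values, and returning 1 when the denominator of the generalized form is zero is an assumption that mirrors the empty-set convention stated for J(A, B).

```python
from typing import Sequence


def jaccard_binary(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard index |A n B| / |A u B| for two binary masks of equal length."""
    if len(a) != len(b):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Convention from the text: J(A, B) = 1 when both sets are empty.
    return 1.0 if union == 0 else intersection / union


def jaccard_generalized(labels: Sequence[float], estimates: Sequence[float]) -> float:
    """Generalized Jaccard index sum_i min(L_i, E_i) / sum_i max(L_i, E_i).

    Assumes non-negative entries (for example pixel intensities in [0, 1]).
    """
    if len(labels) != len(estimates):
        raise ValueError("inputs must have the same length")
    numerator = sum(min(x, y) for x, y in zip(labels, estimates))
    denominator = sum(max(x, y) for x, y in zip(labels, estimates))
    # Assumption: mirror the empty-set convention when everything is zero.
    return 1.0 if denominator == 0 else numerator / denominator


# Hypothetical flattened masks: 2 shared positives, 4 positives in the union.
label_mask = [0, 1, 1, 0, 1, 0]
predicted_mask = [0, 1, 0, 0, 1, 1]
print(jaccard_binary(label_mask, predicted_mask))       # 2 / 4 = 0.5

# Hypothetical real-valued prediction, as a regression model would produce.
soft_prediction = [0.2, 0.8, 0.5, 0.0]
print(jaccard_generalized([0, 1, 1, 0], soft_prediction))
```

On binary inputs the generalized form reduces to the set-based one, since min and max of two 0/1 values are their logical AND and OR.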

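The pixel accuracy formula above can be computed from the four confusion counts. The sketch below is illustrative only: the helper names and the 100-pixel toy mask are hypothetical and the values are not results from this work. It also shows how accuracy behaves when positive pixels are rare, by scoring a prediction that marks every pixel as background.

```python
from typing import Sequence, Tuple


def confusion_counts(label: Sequence[int], pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for two binary masks of equal length."""
    if len(label) != len(pred):
        raise ValueError("masks must have the same length")
    tp = tn = fp = fn = 0
    for y, p in zip(label, pred):
        if y and p:
            tp += 1
        elif not y and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def pixel_accuracy(label: Sequence[int], pred: Sequence[int]) -> float:
    """accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(label, pred)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("masks must not be empty")
    return (tp + tn) / total


def iou_from_counts(label: Sequence[int], pred: Sequence[int]) -> float:
    """Binary Jaccard index written with confusion counts: TP / (TP + FP + FN)."""
    tp, _, fp, fn = confusion_counts(label, pred)
    union = tp + fp + fn
    return 1.0 if union == 0 else tp / union


# Hypothetical imbalanced mask: 2 lesion pixels out of 100.
toy_label = [1, 1] + [0] * 98
all_background = [0] * 100  # a prediction that never marks a lesion

print(pixel_accuracy(toy_label, all_background))   # 98 / 100 = 0.98
print(iou_from_counts(toy_label, all_background))  # 0 / 2 = 0.0
```

In this toy case the accuracy is dominated by true negatives while the Jaccard index, which ignores true negatives, drops to zero.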
Here, in the pixel accuracy formula,
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

Figure 7: Unet Based Model : Pixel accuracy calculate per

Since our class representation is too small, this metric can be misleading as well, as the measure will be biased toward mainly reporting how well the model identified the negative case. The calculated metrics for both models are given below.

Info/Model       Conv-based   Unet-based
IOU              0.19         0.53
Pixel Accuracy   NA           ~100

The above results clearly show, quantitatively, the better performance of the U-Net-based model over the convnet-based model.

4. Conclusion and Future Work

The datasets we used are rarely mentioned in the literature. We showed how the generalizability of semantic segmentation models can be questioned. The U-Net-based model achieves very good performance compared to the naïve convnet-based model. With little modification of the convnet-based model, we are able to show how we can significantly improve the model.

We established an understanding of the skip connections we saw in the U-Net-based model and of the importance of data augmentation. These techniques turned out to be important for the model to learn more features.

We did our best in this work to compare and contrast the naïve model with the U-Net-based model. However, we did not consider two important aspects here: first, smaller datasets, and second, the sequential nature of the data. We would like to continue our future research on how we can tackle these problems with other data augmentation techniques and combine sequence modelling aspects as well.

References

[1] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. ArXiv e-prints, page arXiv:1706.05587, June 2017.
[2] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pages 2843-2851, 2012.
[3] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A Review on Deep Learning Techniques Applied to Semantic Segmentation. ArXiv e-prints, page arXiv:1704.06857, Apr. 2017.
[4] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. ArXiv e-prints, page arXiv:1611.09326, Nov. 2016.
[5] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic Segmentation using Adversarial Networks. ArXiv e-prints, page arXiv:1611.08408, Nov. 2016.
[6] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. ArXiv e-prints, page arXiv:1611.08323, Nov. 2016.
[7] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. ArXiv e-prints, page arXiv:1505.04597, May 2015.
[8] G. Seif. Semantic segmentation with deep learning. Towards Data Science, Sep 2018.
[9] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation. ArXiv e-prints, page arXiv:1511.07053, Nov. 2015.
[10] K. Wickstrøm, M. Kampffmeyer, and R. Jenssen. Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps. ArXiv e-prints, page arXiv:1807.10584, July 2018.
[11] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ArXiv e-prints, page arXiv:1311.2901, Nov. 2013.
[12] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. ArXiv e-prints, page arXiv:1612.01105, Dec. 2016.
