Generalizability of Semantic Segmentation Techniques: Keshav Bhandari Texas State University, San Marcos, TX
Generalizability of Semantic Segmentation Techniques: A Comparative Case Study using LiTS Data Sets
Keshav Bhandari
Texas State University, San Marcos, TX
k [email protected]
FRRN (Full-Resolution Residual Networks) [6, 8] is one of the state-of-the-art models. It performs multi-scale processing using two separate streams, a residual stream and a pooling stream. This helps it process semantic features for higher classification accuracy. It progressively downsamples the feature maps in the pooling stream while processing the feature maps at full resolution in the residual stream. Together, the two streams provide semantic information for high classification accuracy and low-level pixel information for high localization accuracy.
FRRN does an excellent job, but at the cost of heavy processing overhead at every scale. PSPNet [12, 8] is another state-of-the-art model that gets around this overhead. It uses a ResNet- and DenseNet-like architecture to extract features, and it combines multi-scale feature maps without applying many convolutions.

The One Hundred Layers Tiramisu (FCDenseNet) [4, 8] is another model of this kind, built on the U-Net architecture. The main contribution of this architecture is the clever use of dense connections, similar to those of the DenseNet classification model.

The state-of-the-art models discussed so far carry a large parameter overhead. DeepLabV3 [1, 8] tries to address this by using a feature-extraction frontend, which makes it a very lightweight model. It downsamples the input image by a factor of 16; this is likely to give good localization but can lead to poor pixel accuracy. The main contribution of this architecture is the clever use of atrous convolutions. However, it still uses the same upscaling techniques as PSPNet.

One attempt to review these techniques is the work of [3], which surveys existing methods and highlights their contributions and significance in the field. The recent work of [10] is more closely related to ours: it develops and evaluates recent advances in uncertainty estimation and model interpretability in the context of semantic segmentation, using several enhanced Fully Convolutional Network architectures.

1.3. Methods and Results

In this paper, we implement two architectures: a naïve symmetric convolutional neural network-based model, and a modified version of that naïve implementation inspired by the work of [7].

The first method is a symmetric convolutional neural network with matching down-sampling and up-sampling architectures. An input of 512 x 512 x 1 pixels is down-sampled to 16 x 16 x 1024 and then up-sampled back to 512 x 512 x 1. We modified the network to store the indices of the maximum values during down-sampling. These indices record the locations of the maxima within each pooling region, and they are used in the corresponding up-sampling stages to obtain an approximate inverse of the pooling step. This helps preserve the structure of the activations [11].

The second method is a modification of the first. It is inspired by the U-Net architecture [7], but it is tailored to use a different skip-connection scheme. We exploit the symmetry of our model to pass the outputs of the down-sampling stages to the corresponding up-sampling stages and add them together. Moreover, as discussed above, we also preserve the structure of the activations by approximating the true inverse during max-unpooling.

The former method tries to predict pixel values in order to find the segmentation region for a given scan, whereas the latter tries to predict a class for every pixel within the mask. The two tasks therefore differ in their objectives: the former is a regression task and the latter is a classification task.

The experimental results show that the latter method works better than the former. We discuss this further in the results section.

2. Problem Description

Manually identifying liver lesions is a cumbersome task. Computer-aided segmentation can improve efficiency and save valuable time in the medical industry and in medical research, where researchers have to deal with thousands of such tasks in their usual work routine. Creating a deep learning model for automatic segmentation is a challenging task. Here we discuss two methods that became popular a few years ago, with a few modifications. However, the major focus is to demonstrate the failures and the generalizability of deep learning models and to identify why a few modifications are needed for advancement, which in turn opens up directions for future research.

2.1. Methodology

We demonstrate that a naïve implementation of a convolutional neural network is not enough for the segmentation task, and that a simple modification of this implementation drastically changes the results and the learning speed. We perform our experiments on the LiTS data sets. These data sets contain 125 CT scan files, and each scan contains a varying number of sequential images. A detailed description of the data sets is given below.

Info             Description
Source           MICCAI 2017
Number of Scans  125 scans, 108890 images, ~871 images per scan on average
Train Scans      117 scans, 101966 images
Test Scans       8 scans, 6924 images
Data Format      NIFTI file format

Figure 1: Sample image scans (left) and segmentation (right)

[Figure 2: diagram of the first model, with a down-sampling path from INPUT and an up-sampling path to OUTPUT. Blocks in the legend: Conv2d (k = 3, s = 1, p = 1), Max Pool 2d (k = 3, s = 2, p = 1), Max Unpool 2d (k = 4, s = 2, p = 1), ConvTranspose2d (k = 3, s = 1, p = 1), Softmax, Label. Feature-map sizes labelled in each path: 64 x 512 x 512, 64 x 256 x 256, 128 x 256 x 256, 256 x 64 x 64, 512 x 64 x 64, 512 x 32 x 32, 1024 x 32 x 32, 1024 x 16 x 16.]

[Figure 3: diagram of the second model. Same blocks and feature-map sizes as figure 2, with skip connections that add (+) the output of each down-sampling stage to the corresponding up-sampling stage.]

A sample of the data with its label segmentation is shown in figure 1. Our first model architecture is shown in figure 2. This convolutional neural network-based architecture has two main parts, down-sampling and up-sampling. Dimensions are
equal within levels in corresponding parts. The most important thing to notice about this architecture is how max-unpooling is done: we pass the indices of the maximum values between corresponding stages so that we can construct an approximate inverse. Corresponding stages are the max-pool and max-unpool steps with the same dimensions, as shown in figures 2 and 3. As an example, assume max-pool with kernel = 2 and stride = 1, max-unpool with kernel = 3 and stride = 1, and let A be a latent feature matrix:

$$A = \begin{bmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{bmatrix}, \qquad A' = \mathrm{maxpool}(A) = \begin{bmatrix} a_5 & a_6 \\ a_8 & a_9 \end{bmatrix}$$

This result corresponds to the case in which the maximum of each 2 x 2 window is its bottom-right entry.

Our second model, shown in figure 3, is introduced as an improvement over the first. We believe our first model suffers a large loss of features during up-sampling. Inspired by the U-Net architecture [2], we implement a U-Net-like architecture with a minor difference: we pass the outputs of the down-sampling layers to the corresponding up-sampling layers and add them together, whereas the original U-Net crops and concatenates them. Using image processing on the training data, we create a mask from the given ground truth and solve a classification problem for each pixel (whether or not it lies within the mask), minimizing the cross-entropy loss. We trained the models for 100 epochs.

The following is a summary of the two models.
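Returning to the max-pooling example in this section, the following is a minimal sketch of pooling with recorded indices, the approximate inverse built from those indices, and an additive skip connection. It is an illustration only, not the authors' implementation: it uses plain Python lists rather than a deep learning framework, the function names are hypothetical, the entries of A are assumed to be a_k = k, and the max-unpool kernel setting from the text is replaced by an explicit output shape.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Indices = List[List[Tuple[int, int]]]


def max_pool_with_indices(x: Matrix, kernel: int, stride: int) -> Tuple[Matrix, Indices]:
    """Max-pool a 2-D matrix and record where each maximum came from."""
    out_rows = (len(x) - kernel) // stride + 1
    out_cols = (len(x[0]) - kernel) // stride + 1
    pooled: Matrix = []
    indices: Indices = []
    for i in range(out_rows):
        pooled_row, index_row = [], []
        for j in range(out_cols):
            # All (row, col) positions covered by this pooling window.
            window = [(r, c)
                      for r in range(i * stride, i * stride + kernel)
                      for c in range(j * stride, j * stride + kernel)]
            best = max(window, key=lambda rc: x[rc[0]][rc[1]])
            pooled_row.append(x[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled: Matrix, indices: Indices, out_shape: Tuple[int, int]) -> Matrix:
    """Approximate inverse: put each pooled value back at its recorded
    location and leave every other position at zero."""
    rows, cols = out_shape
    restored = [[0.0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            restored[r][c] = value
    return restored


def add_skip(up: Matrix, down: Matrix) -> Matrix:
    """Additive skip connection: element-wise sum of same-shaped maps."""
    return [[u + d for u, d in zip(u_row, d_row)]
            for u_row, d_row in zip(up, down)]


# Assumed values a_k = k for the 3 x 3 latent feature matrix A.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

pooled, indices = max_pool_with_indices(A, kernel=2, stride=1)
restored = max_unpool(pooled, indices, out_shape=(3, 3))
merged = add_skip(restored, A)
```

With these assumed values, `pooled` is `[[5.0, 6.0], [8.0, 9.0]]`, which matches the positions a5, a6, a8, a9 in the example, and `restored` holds those four values at their original locations with zeros elsewhere. The zeros are why the inverse is only approximate, and adding the down-sampling output back in `add_skip` is one simple way to reintroduce the information that pooling discarded.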
Figure 4: Experimental results from the convnet-based and U-Net-based models (columns: scans, ground truth, convnet model, U-Net-based model). The convnet-based model does not show any generalization in localization accuracy; the U-Net-based model performs well by comparison.

Figure 5: Convnet-based model training and validation loss over number of epochs

Figure 6: U-Net-based model training and validation loss over number of epochs
model, but for our convnet-based model we can use the generalized version of the Jaccard index, given by

$$\mathrm{IOU} = J(L, E) = \frac{\sum_i \min(L_i, E_i)}{\sum_i \max(L_i, E_i)}$$

Moreover, we also calculate the pixel accuracy for the U-Net-based model. The pixel accuracy is defined as

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
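The two metrics defined above can be illustrated with a short, self-contained sketch. It is not the evaluation code behind the reported numbers. It assumes that L and E are flattened label and estimate masks of equal, non-zero length, that class 1 is the foreground, and that two empty masks count as a perfect match; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def generalized_iou(label: Sequence[float], estimate: Sequence[float]) -> float:
    """Generalized Jaccard index: sum of element-wise minima divided by
    the sum of element-wise maxima."""
    numerator = sum(min(l, e) for l, e in zip(label, estimate))
    denominator = sum(max(l, e) for l, e in zip(label, estimate))
    # Assumption: two all-zero (or empty) masks are treated as a perfect match.
    return numerator / denominator if denominator > 0 else 1.0


def pixel_accuracy(label: Sequence[int], prediction: Sequence[int]) -> float:
    """(TP + TN) / (TP + TN + FP + FN) for binary masks, foreground = 1.
    Assumes the masks are non-empty."""
    tp = sum(1 for l, p in zip(label, prediction) if l == 1 and p == 1)
    tn = sum(1 for l, p in zip(label, prediction) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(label, prediction) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(label, prediction) if l == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)


# Toy mask with 2 foreground pixels out of 100 (made-up data, not LiTS).
label = [1, 1] + [0] * 98
all_background = [0] * 100          # predicts no foreground at all
half_right = [1, 0] + [0] * 98      # finds one of the two foreground pixels

acc_background = pixel_accuracy(label, all_background)
iou_background = generalized_iou(label, all_background)
acc_half = pixel_accuracy(label, half_right)
iou_half = generalized_iou(label, half_right)
```

In this toy example the all-background prediction reaches a pixel accuracy of 0.98 with an IOU of 0, and finding one of the two foreground pixels gives 0.99 and 0.5 respectively, so the two metrics can diverge sharply when the foreground class is small.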
In the pixel accuracy definition:
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

Figure 7: Unet-based model: pixel accuracy calculated per

Since our class representation is too small, this metric can also be misleading: the measure is biased towards reporting how well the model identifies the negative case. The calculated metrics for both models are given below.

Info/Model      Conv-based  Unet-based
IOU             0.19        0.53
Pixel Accuracy  NA          ~100

The above results show quantitatively that the U-Net-based model outperforms the convnet-based model.

4. Conclusion and Future Work

The data sets we used are rarely mentioned in the literature. We showed how the generalizability of semantic segmentation models can be called into question. The U-Net-based model achieves very good performance compared to the naïve convnet-based model. With a small modification to the convnet-based model, we are able to show how the model can be improved significantly.

We established an understanding of the skip connections used in the U-Net-based model and of the importance of data augmentation. These techniques turned out to be important for the model to learn more features.

We did our best in this work to compare and contrast the naïve model with the U-Net-based model. However, we did not consider two important aspects: first, smaller data sets, and second, the sequential nature of the data. We would like to continue our research on how to tackle these problems with other data augmentation techniques and by incorporating sequence modelling.

References

[1] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. ArXiv e-prints, page arXiv:1706.05587, June 2017.
[2] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pages 2843–2851, 2012.
[3] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A Review on Deep Learning Techniques Applied to Semantic Segmentation. ArXiv e-prints, page arXiv:1704.06857, Apr. 2017.
[4] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. ArXiv e-prints, page arXiv:1611.09326, Nov. 2016.
[5] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic Segmentation using Adversarial Networks. ArXiv e-prints, page arXiv:1611.08408, Nov. 2016.
[6] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. ArXiv e-prints, page arXiv:1611.08323, Nov. 2016.
[7] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. ArXiv e-prints, page arXiv:1505.04597, May 2015.
[8] G. Seif. Semantic segmentation with deep learning – towards data science, Sep 2018.
[9] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation. ArXiv e-prints, page arXiv:1511.07053, Nov. 2015.
[10] K. Wickstrøm, M. Kampffmeyer, and R. Jenssen. Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps. ArXiv e-prints, page arXiv:1807.10584, July 2018.
[11] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ArXiv e-prints, page arXiv:1311.2901, Nov. 2013.
[12] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. ArXiv e-prints, page arXiv:1612.01105, Dec. 2016.