2019 IEEE International Symposium on Multimedia (ISM)
DOI: 10.1109/ISM46123.2019.00049

ResUNet++: An Advanced Architecture for Medical Image Segmentation

Debesh Jha∗‡, Pia H. Smedsrud∗†§, Michael A. Riegler∗§, Dag Johansen‡, Thomas de Lange†§, Pål Halvorsen∗¶, Håvard D. Johansen‡
∗SimulaMet, Norway  †Augere Medical AS, Norway  ‡UiT The Arctic University of Norway, Norway  §University of Oslo, Norway  ¶Oslo Metropolitan University, Norway
Email: [email protected]

Abstract—Accurate computer-aided polyp detection and segmentation during colonoscopy examinations can help endoscopists resect abnormal tissue and thereby decrease the chances of polyps growing into cancer. Towards developing a fully automated model for pixel-wise polyp segmentation, we propose ResUNet++, an improved ResUNet architecture for colonoscopic image segmentation. Our experimental evaluations show that the suggested architecture produces good segmentation results on publicly available datasets. Furthermore, ResUNet++ significantly outperforms U-Net and ResUNet, two key state-of-the-art deep learning architectures, by achieving high evaluation scores: a dice coefficient of 81.33% and a mean Intersection over Union (mIoU) of 79.27% on the Kvasir-SEG dataset, and a dice coefficient of 79.55% and an mIoU of 79.62% on the CVC-612 dataset.

Index Terms—Medical image analysis, semantic segmentation, colonoscopy, polyp segmentation, deep learning, health informatics.

I. INTRODUCTION

Colorectal Cancer (CRC) is one of the leading causes of cancer-related deaths worldwide. Polyps are precursors to this type of cancer and are therefore important for clinicians to discover early through colonoscopy examinations. To reduce the occurrence of CRC, it is routine to resect neoplastic lesions (for example, adenomatous polyps) [1]. Unfortunately, many adenomatous polyps are missed during endoscopic examinations [2]. A Computer-Aided Detection (CAD) system that, in real time, can highlight the locations of polyps in the video stream from the endoscope can act as a second observer, potentially drawing the endoscopist's attention to the polyps displayed on the monitor. This can reduce the chance that some polyps are overlooked [3]. As an important improvement over pure anomaly detection approaches, which only identify whether or not there is something abnormal in an image, we also want our CAD system to have pixel-wise segmentation capability, so that the specific regions of interest within each abnormal image can be identified.

A key challenge in designing a precise CAD system for polyps is the high cost of collecting and labeling proper medical datasets for training and testing. Polyps come in a wide variety of shapes, sizes, colors, and appearances, as shown in Figure 1. For the four main classes of polyps, i.e., adenoma, serrated, hyperplastic, and mixed (rare), there is high inter-class similarity and intra-class variation. There can also be high background-object similarity, for instance, where parts of a polyp are covered with stool or blend into the background mucosa. Although these factors make our task challenging, we conjecture that there is still a high potential for designing a system with a performance acceptable for clinical use.

Fig. 1. Examples of polyp images and their corresponding masks from the Kvasir-SEG dataset. The first and third columns show the original images, and the second and fourth columns show the corresponding ground-truth masks.
Motivated by the recent success of semantic segmentation based approaches for medical image analysis [4]–[6], we explore how these methods can be used to improve the performance of automatic polyp segmentation and detection. A popular deep learning architecture in the field of semantic segmentation for biomedical applications is U-Net [5], which has shown state-of-the-art performance at the 2015 ISBI cell tracking challenge¹. The ResUNet [6] architecture is a variant of U-Net that has provided state-of-the-art results for road image extraction. We therefore adapt this architecture as the basis for our work.

In this paper, we propose the ResUNet++ architecture for medical image segmentation. We have evaluated our model on two publicly available datasets. Our experimental results reveal that the improved model is efficient and achieves a performance boost compared to the popular U-Net [5] and ResUNet [6] architectures.

¹ http://brainiac2.mit.edu/isbi_challenge/

In summary, the contributions of this paper are as follows:

1) We propose the novel ResUNet++ architecture, a semantic segmentation neural network that takes advantage of residual blocks, squeeze-and-excitation blocks, Atrous Spatial Pyramid Pooling (ASPP), and attention blocks. ResUNet++ significantly improves the segmentation results for colorectal polyps compared to other state-of-the-art methods. The proposed architecture also works well with a smaller number of images.

2) We annotated the polyp class from the Kvasir dataset [7] with the help of an expert gastroenterologist to create the new Kvasir-SEG dataset [8]. We make this polyp segmentation dataset available to the research community to foster development of new methods and reproducible research.

II. RELATED WORK

Automatic gastrointestinal (GI) tract disease detection and classification in colonoscopic videos has been an active area of research for the past two decades, with polyp detection in particular receiving attention. The performance of machine learning software has come close to the level of expert endoscopists [9]–[12]. Apart from work on algorithm development, researchers have also investigated complete CAD systems, from data annotation, analysis, and evaluation to visualization for the medical experts [13]–[15].
Thambawita et al. [16] explored various methods, ranging from classical Machine Learning (ML) to deep Convolutional Neural Networks (CNNs), and suggested five novel models as potential solutions for classifying GI tract findings into sixteen classes. Guo et al. [17] presented two variants of fully convolutional neural networks, which secured first position at the 2017 Gastrointestinal Image ANAlysis (GIANA) challenge and second position at the 2018 GIANA challenge. Long et al. [18] proposed a state-of-the-art semantic segmentation approach known as the Fully Convolutional Network (FCN). FCNs are trained end-to-end, pixels-to-pixels, and output segmentation results without any additional post-processing steps. Ronneberger et al. [5] modified and extended the FCN architecture into the U-Net architecture. There are various modifications and extensions of the U-Net architecture [4], [6], [17], [19]–[22] that achieve better segmentation results on both natural and biomedical images.

Most of the published work in the field of polyp detection performs well on specific datasets, and test scenarios often use small training and validation datasets [11], [23]. A model evaluated on a small dataset is neither generalizable nor robust. Moreover, some of the research focuses only on a specific type of polyp. Some of the current work also uses non-public datasets, which makes it difficult to compare and reproduce results. Therefore, the goal of ML models reaching a performance level similar to, or better than, colonoscopists has not yet been achieved, and there remains potential for boosting the performance of such systems.

III. RESUNET++

The ResUNet++ architecture is based on the Deep Residual U-Net (ResUNet) [6], an architecture that combines the strengths of deep residual learning [24] and U-Net [5]. The proposed ResUNet++ architecture takes advantage of residual blocks, squeeze-and-excitation blocks, ASPP, and attention blocks. The residual block propagates information over layers, allowing a deeper neural network to be built that addresses the degradation problem in each of the encoders. This also improves channel inter-dependencies while reducing computational cost.

Fig. 2. Block diagram of the proposed ResUNet++ architecture.

The proposed ResUNet++ architecture contains one stem block followed by three encoder blocks, ASPP, and three decoder blocks, as shown in the block diagram in Figure 2. In the block diagram, we can see that the residual unit is a combination of batch normalization, Rectified Linear Unit (ReLU) activation, and convolutional layers. Each encoder block consists of two successive 3×3 convolution blocks and an identity mapping. Each convolution block includes a batch normalization layer, a ReLU activation layer, and a convolutional layer. The identity mapping connects the input and output of the encoder block. A strided convolution layer is applied at the first convolutional layer of each encoder block to reduce the spatial dimension of the feature maps by half. The output of the encoder block is passed through a squeeze-and-excitation block. The ASPP acts as a bridge, enlarging the field-of-view of the filters to include a broader context.

Correspondingly, the decoding path also consists of residual units. Before each unit, an attention block increases the effectiveness of the feature maps. This is followed by nearest-neighbor up-sampling of the feature maps from the lower level and concatenation with the feature maps from the corresponding encoding path. The output of the last decoder block is passed through ASPP, and finally we apply a 1×1 convolution with sigmoid activation, which provides the segmentation map. The extensions in ResUNet++ are the squeeze-and-excitation blocks (marked in light blue in Figure 2), the ASPP blocks (dark red), and the attention blocks (light green).
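To make the encoder data flow concrete, the following is a minimal sketch of one such encoder unit in TensorFlow/Keras (the framework used in this paper). The filter count, the squeeze ratio, and the 1×1 projection on the shortcut are illustrative assumptions; this is a sketch consistent with the description above, not the authors' reference implementation.

```python
# Minimal sketch of one ResUNet++ encoder unit: full pre-activation
# residual block with strided downsampling, followed by squeeze-and-excite.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, strides=1):
    # BN -> ReLU -> Conv, i.e. the "full pre-activation" ordering.
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Conv2D(filters, 3, strides=strides, padding="same")(x)

def squeeze_excite(x, ratio=8):
    # Squeeze: global average pooling per channel.
    # Excite: two dense layers ending in a sigmoid gate.
    filters = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(filters // ratio, activation="relu")(s)
    s = layers.Dense(filters, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, filters))(s)
    return layers.Multiply()([x, s])  # re-weight channels

def encoder_block(x, filters):
    # Two successive 3x3 conv blocks; the first halves the spatial dims.
    y = conv_block(x, filters, strides=2)
    y = conv_block(y, filters)
    # Identity mapping: a strided 1x1 conv (an assumption here) makes the
    # shortcut's shape match before the addition.
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    y = layers.Add()([y, shortcut])
    return squeeze_excite(y)

inputs = layers.Input((256, 256, 3))
outputs = encoder_block(inputs, 32)
tf.keras.Model(inputs, outputs).summary()
```

Stacking three such blocks with increasing filter counts would yield the encoding path described above.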
A brief explanation of each of these parts is given in the following subsections.

A. Residual Units

Deeper neural networks are comparatively challenging to train. Increasing network depth can improve accuracy, but it can also hamper the training process and cause a degradation problem [6], [24]. He et al. [24] proposed a deep residual learning framework to facilitate the training process and address the problem of degradation. ResUNet [6] uses full pre-activation residual units. The deep residual unit makes the network easier to train, and the skip connections within the network help propagate information without degradation, improving the design of the neural network by decreasing the number of parameters while delivering comparable or better performance on semantic segmentation tasks [6], [24]. Because of these advantages, we use ResUNet as the backbone architecture.

B. Squeeze and Excitation Units

The squeeze-and-excitation network [25] boosts the representative power of the network by re-calibrating feature responses, explicitly modeling the inter-dependencies between channels. The goal of the squeeze-and-excitation block is to ensure that the network can increase its sensitivity to relevant features and suppress unnecessary ones. This goal is achieved in two steps. The first is squeeze (global information embedding), where each channel is squeezed using global average pooling to generate channel-wise statistics. The second is excitation (adaptive recalibration), which aims to fully capture the channel-wise dependencies [25]. In the proposed architecture, the squeeze-and-excitation block is stacked together with the residual block to increase effective generalization across different datasets and improve the performance of the network.

C. Atrous Spatial Pyramid Pooling

The idea of ASPP comes from spatial pyramid pooling [26], which was successful in re-sampling features at multiple scales. In ASPP, contextual information is captured at various scales [27], [28], and several parallel atrous convolutions [29] with different rates applied to the input feature map are fused. Atrous convolution allows the field-of-view to be controlled for precisely capturing multi-scale information. In the proposed architecture, ASPP acts as a bridge between the encoder and the decoder, as shown in Figure 2. The ASPP module has shown promising results on various segmentation tasks by providing multi-scale information, and we therefore use it to capture useful multi-scale context for the semantic segmentation task.

D. Attention Units

The attention mechanism is widely popular in Natural Language Processing (NLP) [30]. It gives weight to a subset of its input, and it has also been employed in semantic segmentation tasks for pixel-wise prediction, for example in [31]. The attention mechanism determines which parts of the feature maps require more attention. It also relieves the encoder from having to encode all the information of the polyp image into a vector of fixed dimension. The main advantages of the attention mechanism are that it is simple, can be applied to any input size, and enhances the quality of the features, which boosts the results. In the two previous approaches, U-Net [5] and ResUNet [6], the encoder feature maps are directly concatenated with the decoder feature maps. Inspired by the success of the attention mechanism in both NLP and computer vision tasks, we implemented an attention block in the decoder part of our architecture to focus on the essential areas of the feature maps.
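The bridge and the gating described in subsections C and D might look as follows. This is a hedged Keras sketch: the dilation rates, the additive form of the attention gate, and the assumption that the skip and decoder features share a spatial size are all illustrative choices, not the paper's exact design.

```python
# Sketches of an ASPP bridge and an additive attention gate (assumed forms).
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x, filters, rates=(1, 6, 12, 18)):
    # Parallel atrous (dilated) 3x3 convolutions with different rates,
    # fused by summation and projected with a 1x1 convolution.
    branches = []
    for rate in rates:
        b = layers.Conv2D(filters, 3, dilation_rate=rate, padding="same")(x)
        b = layers.BatchNormalization()(b)
        branches.append(b)
    fused = layers.Add()(branches)
    return layers.Conv2D(filters, 1, padding="same")(fused)

def attention_block(skip_feat, dec_feat, filters):
    # Additive attention gate: project both feature maps, combine them,
    # squash to a single-channel sigmoid gate, re-weight decoder features.
    a = layers.Conv2D(filters, 1, padding="same")(skip_feat)
    b = layers.Conv2D(filters, 1, padding="same")(dec_feat)
    gate = layers.Activation("relu")(layers.Add()([a, b]))
    gate = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(gate)
    return layers.Multiply()([dec_feat, gate])

# Tiny usage example: gate decoder features with an encoder skip connection
# (assumed to be at the same resolution), then pass through the ASPP bridge.
skip = layers.Input((64, 64, 32))
dec = layers.Input((64, 64, 32))
y = aspp_block(attention_block(skip, dec, 32), 64)
tf.keras.Model([skip, dec], y).summary()
```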
IV. EXPERIMENTS

To evaluate the ResUNet++ architecture, we train, validate, and test models on two publicly available datasets. We compare the performance of our ResUNet++ models with models trained using U-Net and ResUNet.

A. Datasets

For the task of polyp image segmentation, each pixel in the training images must be labeled as belonging to either the polyp class or the non-polyp class. For the evaluation of ResUNet++, we use the Kvasir-SEG dataset [8], which consists of 1,000 polyp images and their corresponding ground-truth masks annotated by expert endoscopists from Oslo University Hospital (Norway). Example images and their corresponding masks from the Kvasir-SEG dataset are shown in Figure 1. The second dataset we use is the CVC-ClinicDB (CVC-612) database [32], an open-access dataset of 612 images with a resolution of 384×288 from 31 colonoscopy sequences.

B. Implementation details

All architectures were implemented using the Keras framework [33] with TensorFlow [34] as the backend. We performed our experiments on a single Volta 100 GPU of an Nvidia DGX-2 AI system capable of 2 petaFLOPS of tensor performance. The system is part of Simula Research Laboratory's heterogeneous cluster and has dual Intel Xeon Platinum 8168 CPUs, 1.5 TB of DDR4-2667MHz DRAM, 32 TB of NVMe scratch space, and 16 Nvidia Volta 100 GPGPUs interconnected using Nvidia's NVLink fully non-blocking crossbar switch capable of 2.4 TB/s of bisectional bandwidth. The system was running Ubuntu 18.04.3 LTS with CUDA 10.1.243 installed.

We start training with a batch size of 16, and the proposed architecture is optimized with the Adam optimizer. The learning rate is set to 1e-4. A lower learning rate is preferred: although it slows down convergence, a larger learning rate often causes convergence failures. Both datasets used in this study contain images of varying resolution. For efficient GPU utilization and to reduce training time, we crop the images with a crop margin of 320×320 to increase the size of the training dataset, and then resize the images to 256×256 pixels before feeding them to the model. We use data augmentation techniques such as center crop, random crop, horizontal flip, vertical flip, scale augmentation, random rotation, cutout, and brightness augmentation to further increase the number of training samples. The rotation angle is randomly chosen from 0 to 90°.
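As a concrete illustration of this preprocessing, here is a minimal TensorFlow sketch. The joint image/mask cropping, the flip probability, and the handling of the crop size are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch of the crop/resize/augment preprocessing described above.
import tensorflow as tf

def preprocess(image, mask, training=True):
    # image: float32 RGB tensor (H, W, 3); mask: float32 binary mask (H, W, 1).
    # Assumes H and W are at least 320 so the random crop is valid.
    if training:
        # Crop image and mask jointly so they stay aligned.
        stacked = tf.concat([image, mask], axis=-1)
        stacked = tf.image.random_crop(stacked, size=(320, 320, 4))
        image, mask = stacked[..., :3], stacked[..., 3:]
        if tf.random.uniform(()) > 0.5:  # random horizontal flip
            image = tf.image.flip_left_right(image)
            mask = tf.image.flip_left_right(mask)
        # Vertical flips, rotation, cutout, and brightness changes would be
        # applied here in the same joint fashion.
    image = tf.image.resize(image, (256, 256))
    # Nearest-neighbor resizing keeps the mask binary.
    mask = tf.image.resize(mask, (256, 256), method="nearest")
    return image, mask
```

A function like this could be mapped over a tf.data pipeline of image/mask pairs.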
We utilized 80% of the dataset for training, 10% for validation, and 10% for testing. We trained all models for 120 epochs with a low learning rate so that a more generalized model could be built. The batch size, number of epochs, and learning rate were adjusted as needed. There was an accuracy trade-off when decreasing the batch size; however, we preferred a larger batch size because a smaller batch size can lead to over-fitting. We also used Stochastic Gradient Descent with Restarts (SGDR) to improve the performance of the model.

V. RESULTS

To show the effectiveness of ResUNet++, we conducted two sets of experiments, on the Kvasir-SEG and CVC-612 datasets. For the model comparison, we compared the results of the proposed ResUNet++ with the original U-Net and the original ResUNet architectures, as both are common choices for semantic segmentation tasks. The original implementation of ResUNet, which uses Mean Square Error (MSE) as the loss function, did not produce satisfactory results on the Kvasir-SEG and CVC-612 datasets. Therefore, we replaced the MSE loss function with the dice coefficient loss and performed hyperparameter optimization to improve the results, naming this architecture ResUNet-mod. With this modification, we achieved a performance boost with the ResUNet-mod architecture on both datasets.

A. Results on the Kvasir-SEG dataset

We tried different sets of hyperparameters (i.e., learning rate, number of epochs, optimizer, batch size, and filter size) for the optimization of the ResUNet++ architecture. Hyperparameter tuning was done manually by training the models with different sets of hyperparameters and evaluating their results. The results of ResUNet++, ResUNet-mod, ResUNet [6], and U-Net [5] are presented in Table I.

TABLE I
EVALUATION RESULTS OF ALL THE MODELS ON THE KVASIR-SEG DATASET.

Method       | Dice   | mIoU   | Recall | Precision
ResUNet++    | 0.8133 | 0.7927 | 0.7064 | 0.8774
ResUNet-mod  | 0.7909 | 0.4287 | 0.6909 | 0.8713
ResUNet      | 0.5144 | 0.4364 | 0.5041 | 0.7292
U-Net        | 0.7147 | 0.4334 | 0.6306 | 0.9222

Table I shows that the proposed model achieved the highest dice coefficient, mIoU, and recall, and competitive precision, on the Kvasir-SEG dataset. U-Net achieved the highest precision; however, its dice coefficient and mIoU scores, which are important metrics for the semantic segmentation task, are not competitive. The proposed architecture outperformed the baseline architectures by a significant margin in terms of mIoU.
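For reference, the dice coefficient and mIoU reported in Table I (and in Table II below) can be computed from binarized masks as follows. This is a minimal NumPy sketch, assuming thresholded predictions; it is not the authors' evaluation code.

```python
# Standard overlap metrics for binary segmentation masks.
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice = 2|A ∩ B| / (|A| + |B|), for binary arrays pred and target.
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def iou(pred, target, eps=1e-7):
    # IoU = |A ∩ B| / |A ∪ B|; mIoU averages this over images or classes.
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return (intersection + eps) / (union + eps)
```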
B. Results on the CVC-612 dataset

We performed additional experiments for an in-depth performance analysis of automatic polyp segmentation. In particular, we checked the generalizability of the proposed architecture on a different dataset, as generalizability is a further step toward building a clinically acceptable model. Table II shows the results for all the architectures on the CVC-612 dataset. The proposed model obtained the highest dice coefficient, mIoU, and recall, and competitive precision.

TABLE II
EVALUATION RESULTS OF ALL THE MODELS ON THE CVC-612 DATASET.

Method       | Dice   | mIoU   | Recall | Precision
ResUNet++    | 0.7955 | 0.7962 | 0.7022 | 0.8785
ResUNet-mod  | 0.7788 | 0.4545 | 0.6683 | 0.8877
ResUNet      | 0.4510 | 0.4570 | 0.5775 | 0.5614
U-Net        | 0.6419 | 0.4711 | 0.6756 | 0.6868

Figure 3 shows the qualitative results for all the models. From Table I, Table II, and Figure 3, we demonstrate the superiority of ResUNet++ over the baseline architectures. The quantitative and qualitative results show that the ResUNet++ models trained on the Kvasir-SEG and CVC-612 datasets perform well and outperform all other models in terms of dice coefficient, mIoU, and recall. Therefore, the ResUNet++ architecture should be considered over these baseline architectures for medical image segmentation tasks.

Fig. 3. Qualitative results comparison on the Kvasir-SEG dataset. From the left: (1) image, (2) ground truth, (3) U-Net, (4) ResUNet, (5) ResUNet-mod, and (6) ResUNet++. The experimental results show that ResUNet++ produces better segmentation masks than the other competitors.

VI. DISCUSSION

The proposed ResUNet++ architecture produces satisfactory results on both the Kvasir-SEG and CVC-612 datasets. From Figure 3, it is evident that the segmentation maps produced by ResUNet++ capture shape information better than the other architectures on the Kvasir-SEG dataset. This means that the segmentation masks generated by ResUNet++ are more similar to the ground truth than those of the presented state-of-the-art models, although ResUNet-mod and U-Net also produced competitive segmentation masks.

We have optimized the code as much as possible based on our knowledge and experience. However, further optimization may exist, which may also influence the results of the architectures. We ran the code only on an Nvidia DGX-2 machine, and the images were resized, which may have led to the loss of some useful information. Additionally, ResUNet++ uses more parameters, which increases training time.

We trained the model using different available loss functions, for example, binary cross-entropy, the combination of binary cross-entropy and dice loss, and mean square loss. We observed that the model achieved a high dice coefficient value with all the loss functions; however, mIoU was significantly lower with all of them except the dice coefficient loss. We therefore selected the dice coefficient loss function based on our empirical evaluation. Moreover, we also observed that the number of filters, batch size, optimizer, and loss function can influence the results.
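The dice coefficient loss selected above can be written directly in Keras. This is a minimal sketch; the smoothing constant is an assumption, not a value taken from the paper.

```python
# Soft dice loss for binary segmentation: 1 - 2|A.B| / (|A| + |B|).
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    # Flatten both masks and compute the soft dice over all pixels.
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

# A loss like this can be passed straight to Keras:
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=dice_loss)
```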
VII. CONCLUSION

In this paper, we presented ResUNet++, an architecture that addresses the need for more accurate segmentation of colorectal polyps found in colonoscopy examinations. The suggested architecture takes advantage of residual units, squeeze-and-excitation units, ASPP, and attention units. A comprehensive evaluation using the available datasets demonstrates that the proposed ResUNet++ architecture outperforms the state-of-the-art U-Net and ResUNet architectures in terms of producing semantically accurate predictions. Towards achieving the generalizability goal, the proposed architecture can be a strong baseline for further investigation in the direction of developing a clinically useful method. Post-processing techniques can potentially be applied to our model to achieve even better segmentation results. We believe that the performance of the model can be further improved by increasing the dataset size, applying more augmentation techniques, and applying some post-processing steps. Despite the increased number of parameters in the proposed architecture, we trained the model to achieve higher performance. We conclude that the application of ResUNet++ should not be limited to biomedical image segmentation but could also be expanded to natural image segmentation and other pixel-wise classification tasks, which need further detailed validation.

ACKNOWLEDGEMENT

This work is funded in part by the Research Council of Norway, project number 263248. The computations in this paper were performed on equipment provided by the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.

REFERENCES

[1] A. G. Zauber, S. J. Winawer, M. J. O'Brien, I. Lansdorp-Vogelaar, M. van Ballegooijen, B. F. Hankey, W. Shi, J. H. Bond, M. Schapiro, J. F. Panish et al., "Colonoscopic polypectomy and long-term prevention of colorectal-cancer deaths," New England Journal of Medicine, vol. 366, no. 8, pp. 687–696, 2012.
[2] J. C. Van Rijn, J. B. Reitsma, J. Stoker, P. M. Bossuyt, S. J. Van Deventer, and E. Dekker, "Polyp miss rate determined by tandem colonoscopy: a systematic review," The American Journal of Gastroenterology, vol. 101, no. 2, p. 343, 2006.
[3] Y. Mori and S.-e. Kudo, "Detecting colorectal polyps via machine learning," Nature Biomedical Engineering, vol. 2, no. 10, p. 713, 2018.
[4] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in Proceedings of the International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[5] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[6] Z. Zhang, Q. Liu, and Y. Wang, "Road extraction by deep residual u-net," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[7] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt, M. Riegler, and P. Halvorsen, "Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection," in Proceedings of MMSys, June 2017, pp. 164–169.
[8] D. Jha, P. H. Smedsrud, M. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. Johansen, "Kvasir-seg: A segmented polyp dataset," in International Conference on Multimedia Modeling. Springer, 2020. [Online]. Available: https://datasets.simula.no/kvasir-seg/
[9] Y. Wang, W. Tavanapong, J. Wong, J. Oh, and P. C. De Groen, "Part-based multiderivative edge cross-sectional profiles for polyp detection in colonoscopy," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1379–1389, 2014.
[10] Y. Mori, S.-e. Kudo, T. M. Berzin, M. Misawa, and K. Takeda, "Computer-aided diagnosis for colonoscopy," Endoscopy, vol. 49, no. 8, pp. 813–819, 2017.
[11] P. Brandao, O. Zisimopoulos, E. Mazomenos, G. Ciuti, J. Bernal, M. Visentini-Scarzanella, A. Menciassi, P. Dario, A. Koulaouzidis, A. Arezzo et al., "Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks," Journal of Medical Robotics Research, vol. 3, no. 2, p. 1840002, 2018.
[12] P. Wang, X. Xiao, J. R. G. Brown, T. M. Berzin, M. Tu, F. Xiong, X. Hu, P. Liu, Y. Song, D. Zhang et al., "Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy," Nature Biomedical Engineering, vol. 2, no. 10, pp. 741–748, 2018.
[13] Y. Wang, W. Tavanapong, J. Wong, J. H. Oh, and P. C. De Groen, "Polyp-alert: Near real-time feedback during colonoscopy," Computer Methods and Programs in Biomedicine, vol. 120, no. 3, pp. 164–179, 2015.
[14] M. Riegler, K. Pogorelov, S. L. Eskeland, P. T. Schmidt, Z. Albisser, D. Johansen, C. Griwodz, P. Halvorsen, and T. D. Lange, "From annotation to computer-aided diagnosis: Detailed evaluation of a medical multimedia system," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 13, no. 3, p. 26, 2017.
[15] S. A. Hicks, S. Eskeland, M. Lux, T. de Lange, K. R. Randel, M. Jeppsson, K. Pogorelov, P. Halvorsen, and M. Riegler, "Mimir: an automatic reporting and reasoning system for deep learning based analysis in the medical domain," in Proceedings of the ACM Multimedia Systems Conference. ACM, 2018, pp. 369–374.
[16] V. Thambawita, D. Jha, M. Riegler, P. Halvorsen, H. L. Hammer, H. D. Johansen, and D. Johansen, "The medico-task 2018: Disease detection in the gastrointestinal tract using global features and deep learning," in Working Notes Proceedings of the MediaEval Workshop. CEUR Workshop Proceedings, 2018.
[17] Y. B. Guo and B. Matuszewski, "Giana polyp segmentation with fully convolutional dilation neural networks," in Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. SCITEPRESS - Science and Technology Publications, 2019, pp. 632–641.
[18] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[19] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3d u-net: learning dense volumetric segmentation from sparse annotation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
[20] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, "The importance of skip connections in biomedical image segmentation," in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 179–187.
[21] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, "Resunet-a: a deep learning framework for semantic segmentation of remotely sensed data," arXiv preprint arXiv:1904.00592, 2019.
[22] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "Unet++: A nested u-net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[23] Y. Wang, W. Tavanapong, J. Wong, J. Oh, and P. C. De Groen, "Part-based multiderivative edge cross-sectional profiles for polyp detection in colonoscopy," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1379–1389, 2014.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[25] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[27] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[28] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[29] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[31] H. Li, P. Xiong, J. An, and L. Wang, "Pyramid attention network for semantic segmentation," arXiv preprint arXiv:1805.10180, 2018.
[32] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, "Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.
[33] F. Chollet et al., "Keras," 2015.
[34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.