Bleeding Detection in WCE


Project Report

Entitled

Bleeding Detection in WCE


Submitted to the Department of Electronics Engineering in Partial Fulfilment of the
Requirements for the Degree of

Bachelor of Technology
(Electronics and Communication)

: Presented & Submitted By :

Anand Gupta, Bhavay Savaliya, Aishwary Mehta, Harshit Bhardwaj


Roll No. (U20EC001, U20EC096, U20EC105, U20EC107)
B. TECH. IV(EC), 7th Semester

: Guided By :

Dr. Kishor P. Upla


Assistant Professor, SVNIT

(Year: 2023-24)

DEPARTMENT OF ELECTRONICS ENGINEERING


SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY
Surat-395007, Gujarat, INDIA.
Sardar Vallabhbhai National Institute Of Technology
Surat - 395 007, Gujarat, India

DEPARTMENT OF ELECTRONICS ENGINEERING

CERTIFICATE
This is to certify that the Project Report entitled “Bleeding Detection in WCE”
is presented & submitted by Anand Gupta, Bhavay Savaliya, Aishwary Mehta, Harshit
Bhardwaj, bearing Roll Nos. U20EC001, U20EC096, U20EC105, U20EC107, of B.Tech.
IV, 7th Semester, in partial fulfillment of the requirements for the award of the B.Tech.
Degree in Electronics & Communication Engineering for the academic year 2023-24.
They have successfully and satisfactorily completed their Project Exam in all respects.
We certify that the work is comprehensive, complete and fit for evaluation.

Dr. Kishor P. Upla


Assistant Professor & Project Guide

PROJECT EXAMINERS:

Name of Examiners Signature with Date


1.
2.
3.
4.
5.

Dr. Jignesh N Sarvaiya Seal of The Department


Head, DoECE, SVNIT (December-2023)
Acknowledgements
We would like to express our profound gratitude and deep regards to our guide Dr.
Kishor P. Upla for his guidance. We are heartily thankful for his suggestions and the
clarity he brought to the concepts of the topic, which helped us a lot with this work. We
would also like to thank Dr. Jignesh N Sarvaiya, Head of the Electronics Engineering
Department, SVNIT, and all the faculty members of ECED for their cooperation and
suggestions. We are very much grateful to all our classmates for their support.

Anand Gupta
Bhavay Savaliya
Aishwary Mehta
Harshit Bhardwaj
Sardar Vallabhbhai National Institute of Technology
Surat

December 07, 2023

Abstract
Wireless capsule endoscopy (WCE) is an effective technology that can be used to make
a gastrointestinal (GI) tract diagnosis of various lesions and abnormalities. Due to the
long time required to pass through the GI tract, the resulting WCE data stream contains
a large number of frames which leads to a tedious job for clinical experts to perform a
visual check of each and every frame of a complete patient’s video footage. In this
work, an automated technique for bleeding detection based on color and texture features
is proposed. The approach combines the color information which is an essential feature
for initial detection of frames with bleeding. Additionally, it uses the texture which
plays an important role in extracting more information from the lesion captured in the
frames and allows the system to distinguish finely between border-line cases. The de-
tection algorithm utilizes machine-learning-based classification methods, and it can ef-
ficiently distinguish between bleeding and nonbleeding frames and perform pixel-level
segmentation of bleeding areas in WCE frames. Clinical studies have been conducted
to establish the effectiveness of the bleeding detection method in terms of detection
accuracy, which is at least comparable to that of state-of-the-art methods. We also make a
comprehensive comparison of various state-of-the-art features and classifiers that
can be used to create a more efficient and flexible WCE video processing system.

Contents
Page
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapters
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Capsule Endoscopy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Bleeding Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Report outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Window Sliding Method . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 One Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 RCNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Fast RCNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5.1 YOLOv4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.2 YOLOv5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5.3 YOLOv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5.4 YOLOv7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.5 YOLOv8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Why Yolo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Architecture of Bleeding detection using YOLO . . . . . . . . . . . . . 20
4.4 Image Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Features Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 You Only Look Once (YOLO) . . . . . . . . . . . . . . . . . . . . . . 21
4.6.1 YOLO Architecture . . . . . . . . . . . . . . . . . . . . . . . . 22
4.6.2 How Does YOLO Object Detection Work? . . . . . . . . . . . 22
4.7 YOLOv8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.7.1 Why should you use YOLOv8? . . . . . . . . . . . . . . . . . 24


4.7.2 YOLOv8 Architecture . . . . . . . . . . . . . . . . . . . . . . 24
5 Result analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Experimental analysis on YOLO models . . . . . . . . . . . . . . . . . 33
6 Conclusion and Future scope . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Future scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.3 Certificate of Participation . . . . . . . . . . . . . . . . . . . . . . . . 38

List of Figures

1.1 The capsule movement in the gastro-intestinal tract and also provides
the glimpse of images captured by the capsule camera during its journey
in the gastro-intestinal tract . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 (a) Polyp, (b) Tumour (c) Hookworm (d) Celiac (e) Bubbles (f) Bleeding 2
1.3 Wireless capsule endoscopy frame samples [1]. . . . . . . . . . . . . . 3

2.1 Comparison of different object detection models . . . . . . . . . . . . . 6


2.2 An Overview of Restricted Boltzmann Machines . . . . . . . . . . . . 7
2.3 The basic structure of restricted Boltzmann machine (RBM) . . . . . . 7
2.4 Convolutional Neural Networks (CNN) . . . . . . . . . . . . . . . . . 8
2.5 Deep Neural Network (DNN classifier) architecture . . . . . . . . . . . 9

3.1 RCNN Features [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


3.2 Fast RCNN Features [3]. . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 YOLO [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 YOLOv5 Architecture [5]. . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 YOLOv6 Architecture [6]. . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 YOLOv7 Architecture [7]. . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1 Architecture of Bleeding detection using YOLO . . . . . . . . . . . . . 20


4.2 YOLO Detection System . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 YOLO Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 Output from input using yolo . . . . . . . . . . . . . . . . . . . . . . . 22
4.5 YOLOv8 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.6 C2f convolutional block structure of YOLOv8 . . . . . . . . . . . . . . 27
4.7 RepVGG layer used as stage . . . . . . . . . . . . . . . . . . . . . . . 28
4.8 The neck of YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.9 Structure of sandwich-fusion (SF) . . . . . . . . . . . . . . . . . . . . 30
4.10 Detect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.1 Bleeding images from training dataset . . . . . . . . . . . . . . . . . . 33


5.2 Comparison of PR and F1 curves of various models . . . . . . . . . . . 35
5.3 Comparison of results on some test images among various models [1] . 36

6.1 Certificate of Participation . . . . . . . . . . . . . . . . . . . . . . . . 38

List of Tables

5.1 Table of training dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 32


5.2 Table of testing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Comparison of accuracy metrics among various models . . . . . . . . . 33
5.4 Comparison of confidence scores of the best detected segments among
various models from Figure 5.3 . . . . . . . . . . . . . . . . . . . . . . 34

List of Abbreviations
UC Ulcerative Colitis
GI Gastrointestinal Tract
WCE Wireless Capsule Endoscopy
VCE Video Capsule Endoscopy
CAD Computer-Aided Detection
DBM Deep Boltzmann Machine
RBM Restricted Boltzmann Machine
CNN Convolutional Neural Networks
DNN Deep Neural Networks
R-CNN Region-based Convolutional Neural Network
YOLO You Only Look Once
IOU Intersection over Union
NMS Non-Max Suppression
SF Sandwich-Fusion

Chapter 1
Introduction
Wireless Capsule Endoscopy (WCE) is widely used for the inspection of the small intes-
tine, reaching regions that are very hard to reach with traditional endoscopy techniques
such as colonoscopy, push enteroscopy and intraoperative enteroscopy. In WCE, the
doctors and clinicians do not have to follow the examination process continuously. In-
stead, they deal with the data recorded by the capsule after it has passed through
the entire gastrointestinal tract. During the 6-8 hours of the WCE procedure, a video of
the GI tract trajectory is recorded on a device attached to the patient’s belt, which pro-
duces about 57,000-100,000 frames that are analyzed later by experienced health experts.
Presently, an experienced gastroenterologist takes approximately 2-3 hours to inspect
the captured video of one patient through a frame-by-frame analysis, which is not only
time-consuming but also susceptible to human error. In view of the poor patient-to-
doctor ratio across developing countries like India, there arises a need for the investigation
and state-of-the-art (SOTA) development of robust, interpretable and generalized AI models
that can aid in reducing the burden on gastroenterologists and save their valuable time
by computer-aided classification between bleeding and non-bleeding frames and further
detection of the bleeding region in those frames.
During the last two decades, several approaches have been proposed to develop auto-
mated systems for the detection of abnormalities (such as polyps, ulcers and bleeding) in
WCE images. Bleeding detection is one of the important tasks in this area. In this report,
we have developed and implemented automatic bleeding detection and classification
models which tell whether there is bleeding in the image or not and, if so, generate a set
of four numbers which represent the bounding-box coordinates of the most probable
location where bleeding has occurred.

1.1 Capsule Endoscopy


Intestinal bleeding can be a symptom of many different conditions, including ulcera-
tive colitis (UC), vascular tumors, and inflammatory diseases. When diagnosing disorders
of the digestive system, knowing where in the digestive tract the bleeding originates gives
doctors one of the strongest indications of the underlying disease. Conventional
endoscopic techniques such as colonoscopy and push enteroscopy are painful and can be
dangerous procedures for patients because, in severe cases, the intestine may rupture.
They are also limited in their ability to reach and see the small intestine.


1. Wireless Capsule Endoscopy (WCE)


Wireless Capsule Endoscopy (WCE) technology emerged around the year 2000, using
wireless electronic devices to capture images or videos of the entire colon. The
capsules look like ordinary tablets and can be swallowed by the patient without
discomfort under the supervision of a doctor. Unlike traditional endoscopy
techniques, it allows examination of the patient’s entire gastrointestinal system
without pain, sedation or insufflation. The U.S. Food and Drug Administration
(FDA) approved the use of WCE in 2001 as a medical tool to examine the mucosa
of the stomach and small intestine to detect various abnormalities and diseases.
To date, WCE technology has helped more than 1.6 million patients worldwide.

Figure 1.1: The capsule movement in the gastro-intestinal tract,
which also provides a glimpse of the images captured by the cap-
sule camera during its journey in the gastro-intestinal tract [8]

Figure 1.2: (a) Polyp, (b) Tumour, (c) Hookworm,
(d) Celiac, (e) Bubbles, (f) Bleeding [8].


The small bowel is located in the middle of the gastrointestinal (GI) tract, between
the stomach and the large intestine. It is three to four metres long and has a
surface area of roughly 30 m², including the villi’s surface area, and plays an
important role in nutrient absorption. As a result, small bowel problems can cause
significant developmental retardation in children as well as nutritional deficiencies
in both children and adults. Chronic illnesses such as Crohn’s disease, celiac
disease, and angiectasis, as well as malignant diseases such as lymphoma and
adenocarcinoma, can damage this organ.
These illnesses can pose a significant health risk to individuals and society, and
diagnosing and treating them often necessitates a comprehensive inspection of
the lumen. The small intestine, on the other hand, is less accessible for inspection
by flexible endoscopes frequently employed for the upper GI tract and the large
bowel due to its anatomical position. Video Capsule Endoscopy (VCE) has
been utilised as a supplementary diagnostic for patients with GI bleeding since
the early 2000s. A VCE is a tiny capsule that houses a wide-angle camera, as
well as light sources, batteries, and other electronics. The patient swallows the capsule,
which records video as it travels passively through the GI tract. A glimpse of the
images taken by the capsule is provided in Fig. 1.2.

1.2 Bleeding Detection


Bleeding is a very common abnormality found in the GI tract. It is crucial to detect
bleeding at an early stage since it is a precursor of inflammatory bowel diseases such as
Crohn’s disease and UC.

(a) Non-Bleeding GI tract (b) Bleeding GI tract

Figure 1.3: Wireless capsule endoscopy frame samples [1].

Thus there is a huge requirement for the deployment of automated systems whose ob-
jective is to take in the video/images generated by the WCE procedure and predict


which frames of the video have bleeding present in them and where (in this report we
have predicted the location of the bleeding area by enclosing it in a bounding box). This
helps in easing the pressure on medical professionals by allowing them to focus on
the selected frames of the videos suggested by the system.

1.3 Motivation
The visual interpretation and automated analysis of endoscopic recordings are ham-
pered by artefacts such as motion blur, bubbles, specular reflections, floating objects,
and pixel saturation. Given the increasing use of endoscopy in many clinical applica-
tions, a key medical imaging challenge is identifying such artefacts and automating the
repair of damaged video frames. The doctor is likely to miss a key frame that reveals the
irregularity because the video is the same length as the time it takes for the capsule to
pass through the intestines. To address the prevailing procedures, Computer-Aided De-
tection (CAD) approaches were proposed. The use of computers to detect abnormalities
in capsule endoscopy images has opened up new research opportunities in the fields of
medical image processing, machine learning and deep learning. Deep learning has
been proven to be one of the most efficient methods for dealing with images, whether it be
image classification, image segmentation or object detection. Thus the use of
deep learning can significantly alleviate the load on a medical professional by
reducing the vast set of images to a smaller set with a high chance of bleeding, so that
they can focus more on those images, resulting in deeper analysis and faster diagnosis.
So we decided to build a model and also train some existing implementations on the
bleeding/non-bleeding capsule endoscopy dataset.

1.4 Objective
Wireless Capsule Endoscopy is gaining wide acceptance, and there is a severe
need for computer-aided tools to process the data obtained from the images. Our main
objective is to build and test some of the existing models on the images generated
by wireless capsule endoscopy so that the workload of a medical professional can be
reduced and they can focus their efforts on the frames flagged as suspicious.

1.5 Report outline


The report is organized as follows: Chapter 2 presents an extensive literature review which
covers traditional object detection methods such as the Deep Boltzmann Machine, Re-
stricted Boltzmann Machine, Convolutional Neural Networks, etc. Chapter 3 describes


various methodologies used in object detection and their advantages and drawbacks,
while Chapter 4 demonstrates the proposed work. Chapter 5 shows the comparative
study of the results obtained during the experimental work, and the report ends with a
summary and the future scope.

Chapter 2
Literature review
This chapter contains a brief overview of conventional Object Detection methods and
Image Classification approaches using Deep learning. Object recognition is a general
term to describe a collection of related computer vision tasks that involve identifying
objects in digital photographs. Image classification involves predicting the class of one
object in an image. Object localization refers to identifying the location of one or more
objects in an image and drawing a bounding box around their extent. Object detection
combines these two tasks and localizes and classifies one or more objects in an image.
Different deep-learning-based techniques that are used for object detection in images
are mentioned in Fig. 2.1.

Figure 2.1: Comparison of different object detection models [9].

2.1 Object Detection


As mentioned above, Object detection combines object recognition and image classifi-
cation for a particular image. Some of the deep-learning based methods used in object
detection are:-

1. Deep Boltzmann Machine


A Deep Boltzmann Machine (DBM) is a type of generative feature learning model,
or network. A DBM is made up of multiple layers that contain hidden variables,
and there are no connections between variables belonging to the same layer. The
normal Boltzmann Machine (BM) is a network of binary stochastic units that are
connected symmetrically. In a Deep Boltzmann Machine, however, every layer captures
high-order correlations between the activities of the hidden features in the layer
beneath.


These DBMs can learn internal features that gradually become more complex, which
is a good characteristic for solving object detection challenges as well as other
computer vision problems. A DBM can be configured as depicted in figure 2.2
below.

Figure 2.2: An Overview of Restricted Boltzmann Machines [10].

2. Restricted Boltzmann Machine


A Restricted Boltzmann Machine is a special case in which the number of hidden
layers of a DBM has been limited to one. The similarity between RBMs and
DBMs is that neither has hidden-to-hidden or visible-to-visible con-
nections in its model. When stacking RBMs, the feature activations of one
RBM are employed as the training data of the next RBM, which ensures that the
hidden layers are learned efficiently. This is the most important characteristic
of the RBM [39]. The architecture of a typical RBM is illustrated in figure 2.3
below.

Figure 2.3: The basic structure of restricted Boltzmann machine (RBM) [11].


3. Convolutional Neural Networks


Convolutional neural networks are specially designed types of neural networks
for handling data that possesses a known, grid-like topology. One such example
is image data, which is usually presented in the form of a 2D grid of pixels. Con-
volutional neural networks have accomplished enormously outstanding results in
practical applications. The phrase ”convolutional neural network” indicates that the
network executes a mathematical operation termed convolution, a specially-designed type
of linear operation. CNNs use convolution in at least one of their layers, instead of
traditional matrix multiplication. The operation of the discrete-time convolution
is defined in the following way:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{+∞} x(a) w(t − a)
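
As an illustration of this definition, the following minimal NumPy sketch computes the discrete convolution directly from the sum above and compares it with NumPy's built-in np.convolve; the input signal and kernel are arbitrary toy values.

```python
import numpy as np

def discrete_convolution(x, w):
    """Compute s(t) = sum_a x(a) * w(t - a) for finite 1-D signals."""
    s = np.zeros(len(x) + len(w) - 1)            # length of the full convolution
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):              # w(t - a) is defined only inside the kernel
                s[t] += x[a] * w[t - a]
    return s

x = np.array([1.0, 2.0, 3.0])                    # input signal
w = np.array([0.5, 0.5])                         # kernel / filter
print(discrete_convolution(x, w))                # [0.5 1.5 2.5 1.5]
print(np.convolve(x, w))                         # the built-in gives the same result
```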

The architecture of a typical convolutional neural network is illustrated in figure


2.4 below

Figure 2.4: Convolutional Neural Networks (CNN) [12].


4. Deep Neural Networks


A Deep Neural Network is a type of discriminative feature learning technique: a
neural network that contains multiple hidden layers. This is a simple conceptual
extension of neural networks; however, it provides valuable advances with regard
to the capability of these models and new challenges in training them. The
structure of deep neural networks causes them to be more sophisticated in design,
and yet more complex in their elements. Deep neural networks can be very beneficial:
a deep neural network can fit the data more accurately with fewer parameters
than a normal neural network, because more layers can be used for a more
efficient and accurate representation. The architecture of a typical deep neural
network is illustrated in figure 2.5 below.

Figure 2.5: Deep Neural Network (DNN classifier) architecture [13].

Chapter 3
Methodologies
Object detection has been witnessing a rapid revolutionary change in the field of com-
puter vision. Its involvement in the combination of object classification as well as object
localisation makes it one of the most challenging topics in the domain of computer vi-
sion. In simple words, the goal of this detection technique is to determine where objects
are located in a given image called object localisation and which category each object
belongs to, which is called object classification.
In this chapter, we list down the best algorithms for object detection.

3.1 Window Sliding Method


A labelled training set is necessary before using a sliding window approach, just like
with any other machine learning algorithm. Imagine that we wish to create a sliding
window-based car detection method [14]. We will use both images with and without an
automobile object for our training images (X). We’ll carefully crop automobiles out of
photos of cars and label the resulting images with y=1. We also need negative instances
in our training set, where the only thing in the picture is the background and no car.
We’ll set y=0 for examples that are negative.
Now that we have labelled training data, we can teach the convolutional network
(convnet) how to recognise cars in images.
In order to identify an automobile in a test input image, we first choose a sliding
window of size (x), after which we feed the input area (x) to the trained convnet by
sliding the window over the entire input image. Convnet outputs whether or not an
automobile is present in each input location. In an effort to find a window size that will
fit the car and enable convnet to recognise it, we repeatedly run a sliding window over
the image with varying window sizes, from smaller to larger.
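
The sliding-window procedure described above can be sketched in Python as follows; classify_crop is a placeholder for the trained convnet, and the window sizes, stride and threshold are illustrative assumptions.

```python
import numpy as np

def sliding_window_detect(image, classify_crop, window_sizes=(64, 96, 128),
                          stride=16, threshold=0.5):
    """Slide square windows over the image and keep crops the classifier accepts.

    classify_crop is assumed to take an HxWx3 crop and return the probability
    that it contains the object. Returns a list of (x, y, size, score) tuples.
    """
    detections = []
    h, w = image.shape[:2]
    for size in window_sizes:                         # try smaller to larger windows
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                crop = image[y:y + size, x:x + size]
                score = classify_crop(crop)           # y=1: object, y=0: background
                if score >= threshold:
                    detections.append((x, y, size, score))
    return detections

# Toy usage with a dummy classifier standing in for the convnet.
img = np.zeros((224, 224, 3), dtype=np.uint8)
dummy_clf = lambda crop: float(crop.mean() > 10)
print(len(sliding_window_detect(img, dummy_clf)))     # 0 detections on a blank image
```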

Problems with the sliding window approach:


1. Computational cost is a considerable disadvantage of the sliding window algo-
rithm. We have to crop many regions and run the convnet on each of them individually.
Increasing the window and stride size makes it faster, but at the cost of decreased accuracy.
2. Modern convnets are computationally very demanding, and that makes the sliding
window algorithm very slow.
3. The sliding window does not localize objects accurately unless the window and
stride sizes are tiny.


3.2 One Shot Learning


The phrase ”one-shot learning” refers to the ongoing problem in computer vision of
extracting a lot of knowledge about an object category from a single image. Although
deep neural networks have demonstrated excellent performance in the huge data do-
main, they typically struggle on few-shot learning problems, where a classifier must
quickly generalise after seeing only a small number of samples from each class.
The goal of one-shot object detection (OSOD) is to identify an object from just one
example in each category. The One-Shot Object Detector only needs a very limited
number of canonical examples of the object, sometimes even only one, as opposed to
the Object Detector, which needs many different samples of objects in the actual world.
Two-dimensional objects with some regularity in the natural world are best suited
for the One-shot Object Detector. Road signs, logos, playing cards, and clapperboards
are a few examples. Three-dimensional things like faces, animals, and cars are better
suited for the conventional Object Detector than the One-Shot Object Detector, which
is not appropriate for them.

3.3 RCNN
Object detection consists of two separate tasks that are classification and localization.
R-CNN stands for Region-based Convolutional Neural Network. The key concept be-
hind the R-CNN series is region proposals. Region proposals are used to localize objects
within an image, as we see in figure 3.1.

Figure 3.1: RCNN Features [2].


Working: As can be seen in figure 3.1, before passing an image through the
network, we need to extract region proposals or regions of interest using an algorithm
such as selective search. Then, we need to resize (warp) all the extracted crops and pass
them through the network.
Finally, the network assigns a category out of C + 1 categories (including the ‘background’
label) to a given crop. Additionally, it predicts delta x and y values to refine the bounding
box of a given crop.

Problems with R-CNN:


• It still takes a huge amount of time to train the network, as you would have to
classify 2000 region proposals per image.
• It cannot be implemented in real time as it takes around 47 seconds for each test
image.
• The selective search algorithm is a fixed algorithm. Therefore, no learning is
happening at that stage. This could lead to the generation of bad candidate region
proposals.

3.4 Fast RCNN


The author of the previous paper (R-CNN) solved some of its drawbacks to build
a faster object detection algorithm, called Fast R-CNN [3].
The approach is similar to the R-CNN algorithm. But, instead of feeding the region
proposals to the CNN, we feed the input image to the CNN to generate a convolutional
feature map. From the convolutional feature map, we identify the region of proposals
and warp them into squares and by using a RoI pooling layer we reshape them into a
fixed size so that it can be fed into a fully connected layer. From the RoI feature vector,
we use a softmax layer to predict the class of the proposed region and also the offset


values for the bounding box. We can see the working of Fast R-CNN in figure 3.2.

Figure 3.2: Fast RCNN Features [3].
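
As a small illustration of the RoI pooling step described above, the following hedged sketch uses torchvision's roi_pool on a synthetic feature map; the channel count, feature-map size and proposal coordinates are arbitrary assumptions rather than values from the Fast R-CNN paper.

```python
import torch
from torchvision.ops import roi_pool

# A synthetic convolutional feature map: batch of 1, 256 channels, 50x50 spatial grid.
features = torch.randn(1, 256, 50, 50)

# Two region proposals in feature-map coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0, 4.0, 4.0, 20.0, 28.0],
                          [0, 10.0, 6.0, 44.0, 30.0]])

# RoI pooling reshapes each arbitrarily sized proposal into a fixed 7x7 grid so that
# it can be fed to the fully connected classification and regression layers.
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)   # torch.Size([2, 256, 7, 7])
```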

3.5 YOLO
All of the previous object detection algorithms use regions to localize the object within
the image. The network does not look at the complete image. Instead, it looks only at
parts of the image which have high probabilities of containing the object.
YOLO or You Only Look Once [4] is an object detection algorithm much different
from the region based algorithms seen above. In YOLO [4] a single convolutional
network predicts the bounding boxes and the class probabilities for these boxes. We can
see the working of YOLO in figure 3.3.

Working: In order for YOLO [4] to function, we first divide a picture into a SxS grid
and then take m bounding boxes for each grid. The network outputs a class probability
and offset values for the bounding box for each bounding box. The object is located
within the image by selecting the bounding boxes with the class probability over a
threshold value.
YOLO [4] is considerably faster than other object detection methods (45 frames
per second). The YOLO [4] algorithm’s weakness is that it has trouble detecting little
things in the image; for instance, it might have trouble spotting a flock of birds. The
algorithm’s spatial limitations are to blame for this.


Figure 3.3: YOLO [4].

3.5.1 YOLOv4
YOLOv4 [15] is a real-time object detection model that is SOTA (state-of-the-art). It is
the fourth part of the YOLO series, and Alexey Bochkovsky released it in April 2020.
On the COCO dataset, which contains 80 different object classes, it attained SOTA
performance.
YOLO [4] is a one-stage detector. The one-stage method is one of the two primary
cutting-edge approaches to the task of object detection and prioritises inference speed.
In one-stage detector models, the classes and bounding boxes for the entire image are
predicted at once; a region of interest (ROI) is not selected first. As a result, they are
quicker than two-stage detectors. FCOS, RetinaNet, and SSD are other examples.
YOLOv4 [15] itself is a one-stage detector built from a number of components.

3.5.2 YOLOv5
YOLOv5 [5] was released only a month after its predecessor (i.e. YOLOv4 [15]) by a
company called Ultralytics, and claimed to have several significant improvements over
existing YOLO detectors. Since the YOLOv5 model was only released as a GitHub
repository, and the model was not published as a peer-reviewed research, there were
doubts/apprehensions about the authenticity and the effectiveness of the proposal.


The model was put to detailed scrutiny by another company called Roboflow, and
it was found that the only significant modification that YOLOv5 [5] included (over
YOLOv4) was integrating the anchor box selection process into the model. As a result,
YOLOv5 does not need to consider any of the datasets to be used as input, and possesses
the capability to automatically learn the most appropriate anchor boxes for the particular
dataset under consideration, and use them during training.
Despite the non-availability of a formal paper, the fact that the YOLOv5 model has
subsequently been utilized in several applications with effective results has started to
generate credibility for the model. Lastly, it needs to be pointed out that the latest version of
the YOLOv5 [5] model is the YOLOv5-v6.0 release, which is claimed to be a (further)
lightweight model with an improved inference speed of 1666 fps (the original YOLOv5
claimed 140 fps).
Another improved version of the YOLOv5 model [5] has recently been presented, which
integrates the salient features of the Transformer model into the YOLOv5 model [5] and
demonstrates improvements in the context of object detection in drone-captured videos.

Figure 3.4: YOLOv5 Architecture [5].

Model Backbone: The backbone is a pre-trained network used to extract rich feature
representation of images. This helps reduce the spatial resolution of the image while
increasing its feature (channel) resolution.

Model Neck: The model neck is used to extract feature pyramids. This helps the
model to generalize well to objects of different sizes and scales.

Model Head: The model head is used to perform the final stage operations. It applies
anchor boxes on feature maps and renders the final output: classes, objectness scores
and bounding boxes.


3.5.3 YOLOv6
YOLOv6 [6] is not officially a part of the YOLO family of object detector models,
but the original YOLO detector significantly inspires its implementation. It is mainly
focused on delivering industrial applications.
In this section, we discuss the components of the YOLOv6 object detection model
and compare it to other YOLO versions, including YOLOX and YOLOv5.
A single-stage object detection model performs object localization and image clas-
sification within the same network. Object localization involves identifying the position
of a single object (or multiple objects) within an image frame, while object classifica-
tion involves assigning labels to the localized objects in an image using bounding boxes.
In comparison, a two-stage object detection model performs the two tasks using
separate networks. The single-stage method is simpler and faster.

Figure 3.5: YOLOv6 Architecture [6].

The YOLO model family performs single-stage predictions using a grid. It contains
a pre-trained backbone architecture to learn feature maps of the image, a neck to ag-
gregate those feature maps, and a head to make the final predictions. YOLOv6 uses
an EfficientRep backbone with a Rep-PAN neck built similarly to the original YOLO
model. Moreover, YOLOv6 uses a decoupled head.
In terms of YOLO architectures, YOLOv6 offers several model sizes, including
YOLOv6-n (4.3M parameters), YOLOv6-tiny (15M parameters), and YOLOv6-s (17.2M
parameters). Even larger sizes are expected.
The YOLOv6 backbone and neck are known for working well with GPUs, reducing
hardware delays and allowing more use cases for industrial applications.
The EfficientRep backbone used in YOLOv6 favors GPU processing using the Rep
operator, which allows structural reparameterization. This is a technique that restruc-


tures trained convolutional neural network layers to achieve better inference speed.
EfficientRep uses RepConv layers followed by RepBlocks, as shown in the image
below. Each RepConv layer consists of various RepVGG layers, a batch normalization
layer, and a ReLU activation unit.
The stride parameter determines how the convolution filter is applied to the input
image. With a stride of 2, each convolution operation shifts the filter by two pixels,
reducing the output dimensions and making the detection network faster.

3.5.4 YOLOv7
The YOLO (You Only Look Once) v7 model was, at its release, the latest in the family of
YOLO models. YOLO models are single-stage object detectors. In a YOLO model, image
frames are featurized through a backbone. These features are combined and mixed in the
neck, and then they are passed along to the head of the network, where YOLO predicts the
locations and classes of objects around which bounding boxes should be drawn.

Figure 3.6: YOLOv7 Architecture [7].

YOLOv7 [7] modifies the ELAN architecture, called E-ELAN (extended efficient
layer aggregation network). ELAN uses expand, shuffle, and merge cardinality to im-
prove the model learning ability without destroying gradient flow paths. As for the
architecture, it only modifies the architecture in the computational block, while the
architecture remains the same as ELAN in the transition layer. E-ELAN uses group
convolution to expand the channels and cardinality of the computational block. It ap-
plies the same channel multiplier and group parameter to all the computational blocks
in a computation layer. The feature map from each computational block will be shuffled
into group size g and then concatenated together. Finally, a shuffled group feature map
will be added to perform merge cardinality.


What’s new with YOLOv7? YOLOv7 [7] has several improvements over the previous
versions. One of the main improvements is the use of anchor boxes. Anchor boxes are a
set of predefined boxes with different aspect ratios that are used to detect objects of
different shapes.
The YOLOv7 model [7] has over 37 million parameters, and it outperforms models with
more parameters, such as YOLOv4. The YOLOv7 model has the highest mAP among
detectors in the range of 5 to 160 FPS. YOLOv7 also has a higher input resolution than
the previous versions: it processes images at a resolution of 608 by 608 pixels, which is
higher than the 416 by 416 resolution used in YOLOv3. This higher resolution allows
YOLOv7 [7] to detect smaller objects and to have a higher accuracy overall.

3.5.5 YOLOv8
YOLOv8 [16] is a new state-of-the-art computer vision model built by Ultralytics, the
creators of YOLOv5. The YOLOv8 model contains out-of-the-box support for object
detection, classification, and segmentation tasks, accessible through a Python package
as well as a command line interface.
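
A minimal sketch of that Python interface is shown below, assuming the ultralytics package is installed; the weight file is the standard small pretrained model, while the dataset YAML and image file names are placeholders rather than files from this project.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 detection model (the "n" variant is the smallest).
model = YOLO("yolov8n.pt")

# Fine-tune on a custom dataset described by a YAML file (placeholder path).
model.train(data="wce_bleeding.yaml", epochs=100, imgsz=224)

# Run inference on a single image and read back boxes, confidences and classes.
results = model.predict("frame_0001.jpg", conf=0.25)
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)
```
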
The architecture of YOLOv8 [16] builds upon the previous versions of YOLO algo-
rithms.
YOLOv8 [16] utilizes a convolutional neural network that can be divided into two
main parts: the backbone and the head.
A modified version of the CSPDarknet53 architecture forms the backbone of YOLOv8.
This architecture consists of 53 convolutional layers and employs cross-stage partial
connections to improve information flow between the different layers.
The head of YOLOv8 consists of multiple convolutional layers followed by a series
of fully connected layers.
Another important feature of YOLOv8 [16] is its ability to perform multi-scaled
object detection. The model utilizes a feature pyramid network to detect objects of
different sizes and scales within an image.
This feature pyramid network consists of multiple layers that detect objects at dif-
ferent scales, allowing the model to detect large and small objects within an image.
Additional details about YOLOv8 [16] are provided in the next chapter.

Chapter 4
Proposed Work
As the existing state-of-the-art models were designed for natural images, we trained the
existing models on the wireless capsule endoscopy dataset to check which kind of network
is best suitable for WCE images. After carefully studying the outputs of the YOLO models,
we reached the conclusion that a network having dense connections learns these images
very well. Meanwhile, we came across a research paper based on YOLOv8 for the task of
object detection in images, so we took YOLOv8 as the base network and started
the experimental work.

4.1 Overview
We classify the Wireless Capsule Endoscopy (WCE) frames using YOLO. After suc-
cessful training of the YOLO model, the corresponding bleeding regions are predicted. We
evaluate the classification performance of our model using the non-normalized and nor-
malized confusion matrices. Finally, we obtain the classification accuracy score of
the YOLO model on this task.
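
A minimal sketch of how these evaluation quantities could be computed with scikit-learn is given below; the ground-truth and predicted labels are purely illustrative and are not results from our experiments.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative frame-level labels (1 = bleeding, 0 = non-bleeding).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)                          # non-normalized counts
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")   # row-normalized
acc = accuracy_score(y_true, y_pred)

print("Confusion matrix:\n", cm)
print("Normalized confusion matrix:\n", cm_norm)
print("Classification accuracy:", acc)
```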

4.2 Why Yolo


• YOLO object detection stands for ”You Only Look Once”. It is a real-time way to
find and identify objects at up to 155 frames per second. In YOLO, the architecture
divides the input image into an m x m grid, predicts 2 bounding boxes for each grid
cell and assigns class probabilities to those boxes. The important thing to remember
here is that a bounding box can be larger than the grid cell itself.

• The YOLO architecture is usually an FCNN (fully convolutional neural network):

an image of size n x n is passed through the FCNN and the output we get
is a grid of size m x m.

• The system takes the image and divides it into an S x S grid; if the centre of an
object falls into a grid cell, that grid cell alone is responsible for detecting that
object.


4.3 Architecture of Bleeding detection using YOLO


The workflow of bleeding detection in Wireless Capsule Endoscopy (WCE) can be catego-
rized into four stages: image acquisition, feature extraction, modelling and detection, as
shown in Fig. 4.1.

Figure 4.1: Architecture of Bleeding detection using YOLO [17]

4.4 Image Acquisition


It is the action of extracting an image from a source, typically a hardware-based source,
for the process of image processing. Web-Camera is the hardware-based source in our
project [17]. It is the first step in the workflow sequence because no processing can be
done without an image. The picture that is obtained has not been processed in any way.

4.5 Features Extraction


Predefined properties such as shape, outline, geometric element (position, angle, dis-
tance, etc.), color element, histogram and others are extracted from the preprocessed
images and later are used for character classification or recognition [17]. Feature ex-
traction is a step in the dimensionality reduction process that partitions and organizes a
large collection of raw data into smaller, more manageable groups, which results in
simpler processing. The most important characteristic of these massive data sets is that
they have a large number of variables, and a large amount of computing power is re-
quired to process them. As a result, feature extraction helps in extracting
the best features from large data sets by selecting and combining variables into features,
thereby reducing the data size. These features are easy to use while still accurately and
uniquely describing the actual data collection.
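
As one hedged example of such a colour feature, the sketch below extracts a normalized per-channel histogram with OpenCV; the function name, bin count and file name are illustrative assumptions and not necessarily the exact features used in this work.

```python
import cv2
import numpy as np

def color_histogram_features(image_path, bins=16):
    """Return a normalized per-channel colour histogram as a 3*bins feature vector."""
    img = cv2.imread(image_path)                      # OpenCV loads images in BGR order
    if img is None:
        raise FileNotFoundError(image_path)
    feats = []
    for channel in range(3):                          # B, G and R channels
        hist = cv2.calcHist([img], [channel], None, [bins], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()    # scale so features are comparable
        feats.append(hist)
    return np.concatenate(feats)

# Example usage (the file name is a placeholder):
# features = color_histogram_features("wce_frame.png")
# print(features.shape)   # (48,) for 16 bins per channel
```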


4.6 You Only Look Once (YOLO)


Object detection is one of the classic problems in computer vision: one needs to determine
which objects are present in a particular image and where they are located in the image.
The object detection problem is more difficult than classification; classification can also
identify objects but cannot tell the location of the objects in the image. Additionally,
classification does not work well for images containing multiple objects.
YOLO is popular because it achieves high accuracy while still running in real time. The
algorithm only ”looks once” at the image, in the sense that it requires only one forward pass
through the neural network to make predictions. After non-maximum suppression (which
ensures that the object detection algorithm detects each object only once), it outputs the
accepted objects together with their bounding boxes [18].
Here we develop a multiple bleeding spot detection system for Wireless Capsule
Endoscopy (WCE) videos. More specifically, we developed a recognition model based
on deep learning that can identify bleeding points in the gastrointestinal tract.
Deep learning models are used especially for pattern recognition. This model is de-
signed to localize and classify the detected bleeding regions. For this purpose, we use
the You Only Look Once (YOLO) deep learning method.

Figure 4.2: YOLO Detection System [18]

With YOLO, a single CNN simultaneously predicts multiple bounding boxes and
class probabilities for those boxes. YOLO trains on full images and directly optimizes
detection performance.


4.6.1 YOLO Architecture


The YOLO architecture is similar to GoogLeNet. It has a total of 24 convolutional layers,
four max-pooling layers and two fully connected layers, as shown in the figure below.

Figure 4.3: YOLO Architecture [4].

The architecture works as follows: the input image is resized to 448x448 before
processing. First, 1x1 convolutions are used to reduce the number of channels,
and then 3x3 convolutions are used to produce the output feature maps. Except for
the last layer, which uses a linear activation function, the activation function used
is ReLU [4]. Some additional techniques, such as batch normalization and dropout,
ensure that overfitting is avoided.

4.6.2 How Does YOLO Object Detection Work?


Now that you understand the architecture, let’s have a high-level overview of how the
YOLO algorithm performs object detection using a simple use case [4].

Figure 4.4: Output from input using YOLO [4].


1. Residual blocks

The first step starts by dividing the original image (A) into equal parts, forming an NxN grid,
where N is 4 in our example, as shown in figure 4.4. Each cell of the grid is responsible
for finding and predicting the category and probability/belief value of the objects it
covers.

2. Bounding box regression

Bounding box regression loss measures the closeness of the detected bounding box to
the ground truth bounding box. The bounding box is estimated using a loss function,
resulting in an error between the predicted and actual bounding box. YOLO determines
the properties of these bounding boxes using a variation of the following formula; where
Y is the final vector representation of each bounding box.
Y = [pc, bx, by, bh, bw, c1, c2], where pc is the objectness (confidence) score, (bx, by) is
the centre of the box, (bh, bw) are its height and width, and c1 and c2 are the class probabilities.

3. Intersection over Union (IOU)

The intersection over union is a popular metric used to measure localization accuracy and
to calculate localization error in object detection. It computes the amount of overlap between
two bounding boxes (the predicted bounding box and the ground truth box). IoU is the ratio
of the area of intersection of the two boxes to the area of their union; the combined area of
the ground-truth box and the predicted box forms the denominator.
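
A minimal Python implementation of this IoU computation, assuming boxes given in (x1, y1, x2, y2) corner format, is sketched below.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)        # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                          # combined area (denominator)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```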

4. Non-Max Suppression or NMS

Setting a threshold for the IOU is not always sufficient, as an object may have many boxes
with IOUs exceeding the threshold, and keeping all those boxes would be noisy. Here we
can use NMS to keep only the boxes with the highest confidence score.
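
A greedy NMS sketch that reuses the iou() helper defined above is shown below; the boxes, scores and threshold are illustrative.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop boxes that overlap it too strongly."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # [0, 2]: the near-duplicate box 1 is suppressed
```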

There are different versions of YOLO: YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOv5,
YOLOv6, YOLOv7 and YOLOv8. Out of these, we will use the YOLOv8
model.


4.7 YOLOv8
YOLOv8 is the newest and most advanced YOLO model and can be used for object de-
tection, image classification and segmentation tasks. YOLOv8 was developed by Ultralyt-
ics, the same company that also produced the effective and industry-leading YOLOv5
model. YOLOv8 includes many architectural and design changes and improvements
compared to YOLOv5.

4.7.1 Why should you use YOLOv8?


Here are a few main reasons why you should consider using YOLOv8 for your next
computer vision project:

• YOLOv8 has better accuracy than previous YOLO models.

• The latest YOLOv8 implementation comes with many new features.

• It supports object detection, instance segmentation and image classification. The
community around YOLO is incredible: search for any YOLO model and you will
find hundreds of tips, videos, and articles, and you can find the help you need in the
MLOps community, DCAI and more.

• Training YOLOv8 is faster than training two-stage detectors.

4.7.2 YOLOv8 Architecture


The architecture of YOLOv8 is based on previous versions of the YOLO algorithm [19].
YOLOv8 uses convolutional neural networks, which can be divided into three main
parts: backbone, neck and head.
Figure 4.5 shows the architecture of our proposed YOLO (large-scale) network
model. This network model is an improvement of the YOLOv8 model. In the back-
bone part of the network, we use the RepVGG re-parameterized convolution module as
the subsampling layer. During training, this convolution module applies 3×3 and 1×1
convolutions simultaneously. During inference, the two convolutional kernels are combined
into a single 3×3 convolutional layer [19]. This technique allows the network to learn more
powerful features without sacrificing inference speed or increasing the model size. In the
neck, we attach an additional layer to the PAFPN structure and fuse it so that small
objects remain visible. By integrating the sandwich-fusion module, spatial and channel
features are extracted from the three-level feature maps of the network backbone. This
optimization improves the ability of the detection heads to collect the spatial positioning
information of the objects to be detected.


Figure 4.5: YOLOv8 Architecture [19].


Backbone

The backbone of our proposed network model includes a stem layer and four stage
blocks. As shown in Figure 4.5, the stem layer is a convolutional layer with a stride
of 2. Each stage block has a RepVGG block with a stride of 2, which is used as a down-
sampling module, and a C2f convolutional block component, as proposed in YOLOv8.
In the proposed network, the feature map extraction backbone’s essential multi-stage
components are the C2f modules, which were modified from the C3 modules. In the
C3 module of CSPDarkNet, the backbone of YOLOv5, the cross-stage partial structure
was implemented and obtained good results. Its design goal is to reduce the network’s
computational complexity and enhance gradient performance. The design concept of
DenseNet [19] inspired the C3 module’s structure. The C3 module is composed of
three convolutional layers with a 1 × 1 kernel and a DarkNet bottleneck, which has a
sequence of convolutional layers with a 3 × 3 kernel. The module’s input feature map is
divided into two parts: One does not pass through the bottleneck, and the other passes
through the bottleneck. In the part of the feature map used to compute through the bot-
tleneck, the first layer’s output result is concatenated with its input as the input of the
second convolutional layer. The feature map passed through the bottleneck concate-
nates the other part feature map, which only computes through one 1 × 1 convolution
layer and passes through a final 1 × 1 convolutional layer to be the output feature of the
C3 module.
In the backbone of this proposed network, we use the C2f structure from the YOLOv8
network as each stage unit of the backbone. As shown in Figure 4.6, compared with the
C3 structure, the C2f module sufficiently meets the idea proposed by the ELAN module,
which optimizes the network structure from the perspective of controlling the shortest
and longest gradient paths, thereby making the network more trainable. In addition,
without changing the network’s shortest and longest gradient paths, the C2f module can
effectively learn multi-scale features and expand the range of receptive fields through
feature vector diversion and multi-level-nested convolution within its module.
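
The following simplified PyTorch sketch illustrates a C2f-style block: a 1 × 1 convolution, a split of the feature map, n bottlenecks whose intermediate outputs are all kept, concatenation of every branch and a final 1 × 1 fusion. The layer sizes are illustrative and this is not the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution -> batch normalization -> SiLU, the basic unit used in YOLOv8."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 ConvBNSiLU layers with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=3)
        self.cv2 = ConvBNSiLU(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """Split the features, run n bottlenecks, concatenate every branch, then fuse."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, k=1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))        # feature-vector diversion
        for block in self.blocks:
            y.append(block(y[-1]))                   # keep every nested bottleneck output
        return self.cv2(torch.cat(y, dim=1))         # concatenate all branches and fuse

x = torch.randn(1, 64, 56, 56)
print(C2f(64, 128, n=2)(x).shape)                    # torch.Size([1, 128, 56, 56])
```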


Figure 4.6: C2f convolutional block structure of YOLOv8 [19].

We deployed RepVGG [19] blocks as the downsampling structure between stages


in the backbone network. The RepVGG component is improved from the VGG, which
is an early proposed single path network characterized by the extensive use of 3 × 3
convolutions as its essential units. The VGG is a memory-saving model because it does
not require memory for identity structures like residual blocks in ResNet. The 3 × 3
convolutional layer is a high-efficiency network structure. Machine learning hardware
acceleration libraries, such as cudnn and MKL, have significantly optimized the com-
putational efficiency of 3 × 3 convolution, making the calculation density (theoretical
FLOPs/time usage) [10] of 3 × 3 convolution reach four times that of 1 × 1 or 5 × 5
convolution. Consequently, extensive utilization of this type of convolutional neural
network offers efficiency advantages in practical computing tasks. As shown in Fig-
ure 4.7, in the RepVGG module, there are two different convolutional kernels, 3 × 3
and 1 × 1, during the training phase. These branches can later be merged because convolutional
kernels of different sizes move over the feature maps in the same way. When the
model is used for inference, the 1 × 1 and 3 × 3 convolution kernels can be combined
into a single 3 × 3 kernel through structural re-parameterization. The specific approach


involves padding the surrounding part of the 1 × 1 kernel into a 3 × 3 form. Based on
the additivity principle of convolution kernels of the same size, the padded kernel is
added to the original 3 × 3 convolution kernel to form a 3 × 3 convolution kernel for
inference.
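
This re-parameterization can be illustrated numerically: the 1 × 1 kernel is zero-padded to 3 × 3 and added to the 3 × 3 kernel, and the fused convolution reproduces the two-branch output. Bias terms and BatchNorm folding are omitted for brevity, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Training-time branches of one RepVGG layer (bias and BatchNorm folding omitted).
w3x3 = torch.randn(64, 64, 3, 3)   # 3x3 convolution kernel
w1x1 = torch.randn(64, 64, 1, 1)   # parallel 1x1 convolution kernel

# Re-parameterization: zero-pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel.
w_fused = w3x3 + F.pad(w1x1, [1, 1, 1, 1])

# The single fused 3x3 convolution matches the sum of the two training-time branches.
x = torch.randn(1, 64, 32, 32)
y_two_branch = F.conv2d(x, w3x3, padding=1) + F.conv2d(x, w1x1, padding=0)
y_fused = F.conv2d(x, w_fused, padding=1)
print(torch.allclose(y_two_branch, y_fused, atol=1e-3))   # True
```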

Figure 4.7: (a) RepVGG layer in training (b) RepVGG layer in inference [19].

Neck

In the proposed network, as shown in Figure 4.8, the neck’s structure is a PAFPN struc-
ture network. The neck consists of top-down and bottom-up branches. In the top-down
branch, we have top-down layer 1, layer 2, and layer 3. The different layers in the top-
down branch receive feature maps P1, P2, P3, P4, and P5 from the backbone’s stem
layer, and stage layer 1, stage layer 2, stage layer 3, and stage layer 4 through the SPPF
modules, respectively. The bottom-up branch is composed of three parts: layer 0, layer
1, and layer 2. The input of the bottom-up comes from the output of the top-down
branch, as well as the feature map of the backbone’s stage layer 4 through SPPF. Their
output consists of four feature maps of different sizes, C2, C3, C4, and C5, correspond-
ing to the detection heads of targets of different sizes.


Figure 4.8: The neck of YOLO [19].

Given that the image data size we need to analyze is relatively large and the ob-
jects within these images are comparatively small and dense, following the YOLOv8
strategy of using the three-level PAFPN, suitable for detecting small objects in images,
would limit the effectiveness of detecting a large number of tiny and densely overlap-
ping objects within large-sized images. Therefore, we expanded the PAFPN module
to a four-layer structure, taking feature maps from the backbone’s stem layer and stage
layers and outputting four feature maps of the same size to four different detecting heads


to detect tiny, small, medium, and large sizes of targets.


As shown in Figure 4.9, we propose sandwich-fusion (SF), a novel fusion module
of a three-size feature map, which optimizes the target’s spatial and semantic informa-
tion for detection heads. The module is applied to the neck’s top-down layers. The
inspiration for this module comes from the BiC model proposed in YOLOv6 [20]. The
input of sandwich-fusion (SF) is shown in the figure, including feature maps from the
backbone’s lower stage, corresponding stage, and higher stage. The goal is to balance
spatial information of low-level features and semantic information of high-level features
to optimize the network head’s recognition of the target position and classification.

Figure 4.9: Structure of sandwich-fusion (SF) [20].

The sandwich-fusion (SF) module differs from the BiC model of YOLOv6 3.0 in
that it implements a depthwise separable convolution layer to extract the feature map
from the lower stage layer, which contains the target’s most detailed and precise posi-
tion information. The module concatenates the lower, corresponding, and upsampled
high-level feature maps as its output. The subsequent C2f module in the top-down
layer extracts this combined information, which is rich in detail regarding the target’s
classification and spatial positioning.
Depthwise separable convolution is a two-stage convolution: a depthwise convolution that filters each channel independently, followed by a pointwise (1 × 1) convolution that mixes information across channels. Compared to a standard convolutional layer with the same kernel size, it has far fewer parameters. RepLKNet achieved good results with this approach by enlarging the receptive field without adding a large number of parameters.
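
To make the parameter saving concrete, the following sketch compares a standard convolution with a depthwise separable one (the channel counts 64 and 128 and the kernel size 7 are arbitrary choices for illustration, not the network's actual values):

import torch.nn as nn

cin, cout, k = 64, 128, 7

standard = nn.Conv2d(cin, cout, k, padding=k // 2, bias=False)
separable = nn.Sequential(
    nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin, bias=False),  # depthwise: one k x k filter per channel
    nn.Conv2d(cin, cout, 1, bias=False),                             # pointwise: 1 x 1 conv mixes channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 128 * 64 * 7 * 7 = 401408
print(count(separable))  # 64 * 7 * 7 + 128 * 64 = 11328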
The PAFPN module is deployed in the improved YOLOv8 as its neck. Its classification information is captured well through the SPPF layer, but the spatial information may be inaccurate. The drawback of the early-stage backbone features is that the receptive field of the network is still small at that point, so the background and the location information cannot always be separated correctly from complex texture features. Therefore, in the sandwich-fusion (SF) module, a large-kernel depthwise separable convolution is deployed to extract more accurate position information from the early-stage features and to filter out information irrelevant to the detected object.

Head

In the context of YOLOv8, “detect” refers to the process of identifying and locating objects in images or videos. It is a core functionality of YOLOv8, enabling it to effectively perform object detection tasks [20].
The detect operation in YOLOv8 utilizes a specialized architecture that incorporates
several modules designed to extract features, classify objects, and generate bounding
boxes. These modules work together to analyze the input image or video, identifying
objects of interest and determining their locations within the scene.

Figure 4.10: Detect [20]

The overall detect operation in YOLOv8 results in a set of bounding boxes and
corresponding confidence scores, representing the detected objects and their likelihood
of being correctly classified. This information can be used for various applications,
such as object tracking, scene understanding, and anomaly detection.
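
As a usage illustration only (it assumes the ultralytics Python package and hypothetical file names best.pt and wce_frame.jpg, none of which are prescribed by this report), the detect output can be read roughly as follows:

from ultralytics import YOLO

model = YOLO("best.pt")            # hypothetical trained YOLOv8 weights
results = model("wce_frame.jpg")   # hypothetical WCE frame

for r in results:
    for box, conf, cls in zip(r.boxes.xyxy, r.boxes.conf, r.boxes.cls):
        # Each detection: corner coordinates, confidence score, and class index.
        print(box.tolist(), float(conf), int(cls))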

Chapter 5
Result analysis
This chapter presents the performance analysis of bleeding detection using different approaches.

5.1 Dataset
All the experimental work was carried out on the Auto-WCEBleedGen challenge dataset [1]. A segment of Kvasir-Capsule, a video capsule endoscopy dataset, is used in this project. Each image has 224 × 224 dimensions and is encoded in RGB. There are a total of 2618 images belonging to the bleeding and non-bleeding categories.

Table 5.1: Training dataset composition.

Category            Number of images
Bleeding            1309
Non-bleeding        1309

Table 5.2: Testing dataset composition.

Dataset             Number of images
Testing dataset 1   49
Testing dataset 2   515

The experimentation was carried out on multiple subsets of the original dataset. Initially, the dataset contained two folders, bleeding and non-bleeding. In the bleeding folder, bounding boxes were provided in XML, normal, and YOLO formats. In the YOLO format only one bounding box was given per image, whereas the normal format provided multiple boxes per image. Therefore, to increase the training data, the dataset was modified: the normal bounding boxes were converted into the YOLO bounding-box format, which successfully increased the size of the training dataset. Training and validation images were then defined by splitting this modified dataset with an 80:20 ratio. A few sample images from the dataset are shown in Fig. 5.1.
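
A minimal sketch of this conversion is shown below, under the assumption that the "normal" format stores absolute corner coordinates (x_min, y_min, x_max, y_max) and that the images are 224 × 224; the function name is ours:

def to_yolo(x_min, y_min, x_max, y_max, img_w=224, img_h=224):
    # Convert absolute corner coordinates to YOLO's normalized (cx, cy, w, h).
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return cx, cy, w, h

# Example: a 50 x 40 box with top-left corner at (100, 60) in a 224 x 224 frame.
print(to_yolo(100, 60, 150, 100))  # (0.5580..., 0.3571..., 0.2232..., 0.1785...)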


Figure 5.1: Bleeding images from training dataset [1]

5.2 Experimental analysis on YOLO models


Given below are the accuracy metrics of various models that were used for training.

Model     Average precision   Mean Average Precision   Intersection over union
YOLOv3    0.666               0.596                    0.5
YOLOv5    0.3461              0.253                    0.5
YOLOv7    0.5352              0.652                    0.5
YOLOv8    0.5081              0.652                    0.5

Table 5.3: Comparison of accuracy metrics among various models
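
The Intersection over union column gives the IoU value used when matching predicted boxes to ground truth (our reading of the table). A minimal sketch of the standard IoU computation:

def iou(a, b):
    # a, b: boxes as (x_min, y_min, x_max, y_max).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14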


Model     Image 1   Image 2   Image 3
YOLOv3    NA        0.42      0.48
YOLOv5    NA        NA        NA
YOLOv7    0.32      0.39      0.38
YOLOv8    0.29      0.37      0.52

Table 5.4: Comparison of confidence scores of the best detected segments among various models from Figure 5.3

From Table 5.3, we observe that the least accurate model is YOLOv5; YOLOv3 achieves higher accuracy than YOLOv5, and YOLOv7 and YOLOv8 perform better than YOLOv3.

From Figure 5.2, we can observe that the area under the PR and F1 curves is larger for YOLOv7 and YOLOv8.
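
For reference, the F1 score plotted in these curves is the harmonic mean of precision and recall (standard definitions, stated here as our assumption about the challenge's evaluation): F1 = (2 · Precision · Recall) / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).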

From Figure 5.3, we see that YOLOv3, YOLOv7, and YOLOv8 produce detections on these three test images, and that YOLOv8 performs the best across all of them.

From the above observations, we conclude that YOLOv8 is the best of the models used for object detection; although YOLOv7 is less accurate than YOLOv8, its accuracy is close to that of YOLOv8.


(Panels: PR and F1 curves for YOLOv3, YOLOv5, YOLOv7, and YOLOv8)

Figure 5.2: Comparison of PR and F1 curves of various models


(Panels: detection results of YOLOv3, YOLOv5, YOLOv7, and YOLOv8 on test images 1, 2, and 3)

Figure 5.3: Comparison of results on some test images among various models [1]

Chapter 6
Conclusion and Future scope

6.1 Conclusion
We have presented an automated technique for bleeding detection in wireless cap-
sule endoscopy (WCE) videos based on color and texture features. The proposed
method combines color information and texture information to effectively detect bleed-
ing frames and perform pixel-level segmentation of bleeding areas. Clinical studies
have been conducted to evaluate the performance of the proposed method, and the re-
sults demonstrate that the method is at least as good as the state-of-the-art in terms of
detection accuracy. The proposed method has the potential to significantly improve the
efficiency and accuracy of bleeding detection in WCE videos.
The proposed method has several advantages over existing methods. First, it is fully
automated, which can significantly reduce the workload of clinical experts. Second, it is
based on machine-learning-based classification methods, which can learn from a large
amount of training data to improve detection accuracy. Third, it utilizes both color and
texture information, which can provide more comprehensive information for bleeding
detection.
We believe that the proposed method is a significant step forward in the development
of automated bleeding detection techniques for WCE. The method has the potential to
improve the efficiency and accuracy of bleeding detection, and it can be used to improve
the diagnosis and treatment of patients with gastrointestinal diseases.

6.2 Future scope


• Train the proposed method on a larger dataset of WCE videos to improve its
generalization ability.

• Incorporate the proposed method into a clinical workflow to provide real-time bleeding detection feedback to clinicians.

• Extend the proposed method to detect other types of abnormalities in WCE videos,
such as polyps and ulcers.


6.3 Certificate of Participation

(Certificate image: SVNIT_MLIP)

Figure 6.1: Auto WCE-Bleedgen Challenge

Bibliography
[1] Palak, H. Mangotra, D. Nautiyal, J. Dhatarwal, and N. Gooel, “WCEbleedGen: A
wireless capsule endoscopy dataset containing bleeding and non-bleeding frames,”
Nov. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.10156571

[2] R. B S, “Object detection using region based convolutional neural network: A survey,” International Journal for Research in Applied Science and Engineering Technology, vol. 8, pp. 1927–1932, 07 2020.

[3] K. Sharma, S. Rawat, D. Parashar, S. Sharma, S. Roy, and S. Sahoo, “State-of-the-art analysis of multiple object detection techniques using deep learning,” International Journal of Advanced Computer Science and Applications, vol. 14, pp. 527–534, 07 2023.

[4] U. Handalage and L. Kuganandamurthy, “Real-time object detection using yolo: A review,” 05 2021.

[5] M. Horvat, L. Jelečević, and G. Gledec, “A comparative study of yolov5 models performance for image localization and classification,” 09 2022.

[6] J. Wei and Y. Qu, “Lightweight improvement of yolov6 algorithm for small target
detection,” 03 2023.

[7] C.-Y. Wang, A. Bochkovskiy, and H.-y. Liao, “Yolov7: Trainable bag-of-freebies
sets new state-of-the-art for real-time object detectors,” 07 2022.

[8] P. Muruganantham and S. Balakrishnan, “A survey on deep learning models for wireless capsule endoscopy image analysis,” International Journal of Cognitive Computing in Engineering, vol. 2, pp. 83–92, 06 2021.

[9] V. Upadhya and P. Sastry, “An overview of restricted boltzmann machines,” Jour-
nal of the Indian Institute of Science, vol. 99, 02 2019.

[10] Y. Chu, X. Zhao, Y. Zou, P. Xu, J. Han, and Y. Zhao, “A decoding scheme for in-
complete motor imagery eeg with deep belief network,” Frontiers in Neuroscience,
vol. 12, p. 680, 09 2018.

[11] S. Koul and U. Singhania, “Flower species detection over large dataset using con-
volution neural networks and image processing techniques,” 08 2020.

[12] B. P, S. P. Ramalingam, N. Rk, and A. Krishnan, “Type 2: Diabetes mellitus prediction using deep neural networks classifier,” International Journal of Cognitive Computing in Engineering, vol. 1, pp. 55–61, 06 2020.


[13] K. Masita, A. Hasan, and T. Shongwe, “Deep learning in object detection: a re-
view,” 08 2020, pp. 1–11.

[14] J. Laguna, A. Olaya, and D. Borrajo, “A dynamic sliding window approach for
activity recognition,” vol. 6787, 07 2011, pp. 219–230.

[15] A. Bochkovskiy, C.-Y. Wang, and H.-y. Liao, “Yolov4: Optimal speed and accu-
racy of object detection,” 04 2020.

[16] H. Lou, X. Duan, J. Guo, H. Liu, J. Gu, L. Bi, and H. Chen, “Dc-yolov8: Small
size object detection algorithm based on camera sensor,” 04 2023.

[17] K. J C and A. Suruliandi, “Texture and color feature extraction for classification
of melanoma using svm,” 01 2016, pp. 1–6.

[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,
real-time object detection,” 06 2016, pp. 779–788.

[19] Z. Zhang, “Drone-yolo: An efficient neural network method for target detection
in drone images,” Drones, vol. 7, p. 526, 08 2023.

[20] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger, “Densely connected
convolutional networks,” 07 2017.

