Outlier Detection in
Gastrointestinal Tract Images
using Machine Learning Algorithms
By
Rimsha Qaisar
Master of Science in
Statistics
Department of Statistics
Islamabad, Pakistan
(2024)
DEDICATION
To my parents and sisters, whose constant love and encouragement have been my pil-
lars of strength. Their sacrifices and unwavering faith have guided me through every
challenge and made this achievement possible.
ACKNOWLEDGEMENTS
All praise and gratitude are due to Allah Almighty, the most compassionate and
benevolent, who created the universe and endowed me with countless blessings. I
am deeply thankful for the strength and perseverance granted to me, which enabled
me to complete this thesis. My heartfelt appreciation extends to my supervisor, Dr.
Tahir Mehmood, whose unwavering support, insightful guidance, and patience have
been instrumental throughout this journey. May Allah bestow upon him His abun-
dant blessings. This research would not have been possible without his expert advice
and encouragement, which have significantly deepened my understanding and respect
for this field. I am also really greatful to my GEC members, Dr. Firdos Khan and
Dr. Zamir Hussain, for their support and direction in finalizing this thesis. Lastly, I
am profoundly grateful to my family and friends for their continuous support and en-
couragement throughout my academic endeavors. May God bless them all for mak-
ing this challenging journey a success.
Contents
LIST OF TABLES VI
LIST OF ABBREVIATIONS IX
ABSTRACT X
1 Introduction 1
1.1 A Brief Overview of Machine Learning and Its Methodologies . . . . . 1
1.1.1 Methodologies of Machine Learning . . . . . . . . . . . . . . . . 1
1.2 Applications of Machine Learning in Image Recognition . . . . . . . . . 2
1.2.1 Medical Image Analysis . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Deep Learning in Medical Image Analysis . . . . . . . . . . . . 2
1.2.2.1 Motivation for the Study . . . . . . . . . . . . . . . . . 3
1.2.3 Study Design and Methodology . . . . . . . . . . . . . . . . . . 4
1.3 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1.1 Classes Included in the Dataset . . . . . . . . . . . . . 5
1.3.1.1.1 Anatomical Landmarks: . . . . . . . . . . . . 5
1.3.1.1.2 Pathological Findings: . . . . . . . . . . . . . 5
1.3.1.1.3 Polyp Removal: . . . . . . . . . . . . . . . . . 6
1.3.1.2 Dataset Categories . . . . . . . . . . . . . . . . . . . . 6
2 LITERATURE REVIEW 7
2.1 Previous Researches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Research Gaps: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 MATERIALS AND METHODS 19
3.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Kvasir Dataset Overview . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1.1 Dataset Categories . . . . . . . . . . . . . . . . . . . . 20
3.1.1.1.1 Anatomical Landmarks . . . . . . . . . . . . 20
3.1.1.1.2 Pathological Findings . . . . . . . . . . . . . 21
3.1.1.1.3 Polyp Removal . . . . . . . . . . . . . . . . . 24
3.1.1.2 Endoscopic Procedures . . . . . . . . . . . . . . . . . . 25
3.1.1.2.1 Colonoscopy . . . . . . . . . . . . . . . . . . . 25
3.1.1.2.2 Gastroscopy . . . . . . . . . . . . . . . . . . . 26
3.1.1.3 Clinical Significance . . . . . . . . . . . . . . . . . . . 26
3.1.1.3.1 Impact on Research . . . . . . . . . . . . . . 26
3.2 Research Workflow: From Data Preprocessing to Model Training . . . . 26
3.2.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Convolutional Neural Network (CNN) Implementation . . . . . 28
3.3.1.1 Data Augmentation and Preprocessing . . . . . . . . . 28
3.3.1.2 Binary CNN Architecture . . . . . . . . . . . . . . . . 29
3.3.1.2.1 Model Compilation and Training for Binary
CNN: . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1.3 Multi-Class CNN Architecture . . . . . . . . . . . . . 30
3.3.1.3.1 Model Compilation and Training for Multi-
Class CNN: . . . . . . . . . . . . . . . . . . . 31
3.3.2 DenseNet121 Architecture Overview . . . . . . . . . . . . . . . 31
3.3.2.1 Binary and Multi-Class DenseNet Architecture . . . . 32
3.3.2.2 Model Compilation and Training . . . . . . . . . . . . 33
3.3.3 Outlier Detection and Data Refinement . . . . . . . . . . . . . . 34
3.3.3.1 K-Means Clustering Algorithm . . . . . . . . . . . . . . 35
3.3.3.1.1 How K-Means Clustering Works: . . . . . . . 35
3.3.3.1.2 Objective of K-Means Clustering: . . . . . . . 36
3.3.3.1.3 Evaluating Clustering Quality Using Silhou-
ette Analysis: . . . . . . . . . . . . . . . . . . 36
3.3.3.1.3.1 Silhouette Score Calculation: . . . . . . 36
3.3.3.1.3.2 Average Silhouette Score: . . . . . . . . 37
3.3.3.1.4 Selecting the Optimal K: . . . . . . . . . . . . 37
3.3.3.1.5 Visual Demonstration: Clustering M1 and
M2 Using K-Means . . . . . . . . . . . . . . . 38
3.3.3.2 Outlier Detection Using K-Means Clustering . . . . . . 45
3.3.3.2.1 Distance Calculation: . . . . . . . . . . . . . . 45
3.3.3.2.2 Compute Quartiles: . . . . . . . . . . . . . . . 45
3.3.3.2.3 Calculate the IQR: . . . . . . . . . . . . . . . 45
3.3.3.2.4 Determine the Upper Bound: . . . . . . . . . 46
3.3.3.2.5 Identify Outliers: . . . . . . . . . . . . . . . . 46
3.3.3.3 Post-Outlier Detection Data Refinement and Classifi-
cation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3.3.1 Multi-Class Classification: . . . . . . . . . . . 46
3.3.3.3.2 Binary Classification: . . . . . . . . . . . . . . 47
List of Tables
4.1 Accuracy per class across different folds, including average accuracy
for each class and the total average accuracy. . . . . . . . . . . . . . . 49
4.2 Accuracy per class across different folds, including average accuracy
for each class and the total average accuracy. . . . . . . . . . . . . . . . 52
4.3 Accuracy per class across different folds, including average accuracy
for each class and the total average accuracy. . . . . . . . . . . . . . . . 71
4.4 Accuracy per class across different folds, including average accuracy
for each class and the total average accuracy. . . . . . . . . . . . . . . . 74
4.5 Accuracy of CNN on refined data across different runs. . . . . . . . . . 76
4.6 Model accuracy across different runs . . . . . . . . . . . . . . . . . . . 76
4.7 Sample Images of Outliers Across Categories . . . . . . . . . . . . . . . 80
List of Figures
3.1 Esophagitis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Polyp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Ulcerative colitis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 dyed-and-lifted-polyp . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 dyed-resection-margin . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Various types of endoscopy examinations . . . . . . . . . . . . . . . . 26
3.7 Convolutional Neural Network (CNN) . . . . . . . . . . . . . . . . . . . 31
3.8 Densely Connected Convolutional Network . . . . . . . . . . . . . . . . 34
3.9 x-y axis scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10 Randomly Selected Centroids for Initial Clustering with K=2 . . . . . . 39
3.11 Centroid Distance Calculation . . . . . . . . . . . . . . . . . . . . . . . 39
3.12 Cluster Assignment Visualization . . . . . . . . . . . . . . . . . . . . . 40
3.13 Centroid Recalculation Process . . . . . . . . . . . . . . . . . . . . . . 41
3.14 New Cluster Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.15 Centroid Initialization and Assignment . . . . . . . . . . . . . . . . . . 42
3.16 Updated Cluster Centroids . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.17 Cluster Reassignment Process . . . . . . . . . . . . . . . . . . . . . . . 43
3.18 Converged Clustering Result . . . . . . . . . . . . . . . . . . . . . . . . 44
3.19 Final Cluster Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Distance Distribution Plot with Outlier Threshold . . . . . . . . . . . . 58
4.7 Distance Distribution Plot with Outlier Threshold . . . . . . . . . . . . 58
4.8 Silhouette Score for Class 2 . . . . . . . . . . . . . . . . . . . . . . . . 59
4.9 Silhouette Score for Class 3 . . . . . . . . . . . . . . . . . . . . . . . . 60
4.10 Silhouette Score for Class 4 . . . . . . . . . . . . . . . . . . . . . . . . 60
4.11 Silhouette Score for Class 5 . . . . . . . . . . . . . . . . . . . . . . . . 61
4.12 Silhouette Score for Class 6 . . . . . . . . . . . . . . . . . . . . . . . . 61
4.13 Silhouette Score for Class 7 . . . . . . . . . . . . . . . . . . . . . . . . 62
4.14 Silhouette Score for Class 8 . . . . . . . . . . . . . . . . . . . . . . . . 62
4.15 Distance Distribution Plot with Outlier Threshold for Class 2 . . . . . 63
4.16 Distance Distribution Plot with Outlier Threshold for Class 2 . . . . . 64
4.17 Distance Distribution Plot with Outlier Threshold for Class 3 . . . . . 64
4.18 Distance Distribution Plot with Outlier Threshold for Class 3 . . . . . 65
4.19 Distance Distribution Plot with Outlier Threshold for Class 4 . . . . . 65
4.20 Distance Distribution Plot with Outlier Threshold for Class 4 . . . . . 66
4.21 Distance Distribution Plot with Outlier Threshold for Class 5 . . . . . 66
4.22 Distance Distribution Plot with Outlier Threshold for Class 5 . . . . . 67
4.23 Distance Distribution Plot with Outlier Threshold for Class 6 . . . . . 67
4.24 Distance Distribution Plot with Outlier Threshold for Class 6 . . . . . 68
4.25 Distance Distribution Plot with Outlier Threshold for Class 7 . . . . . 68
4.26 Distance Distribution Plot with Outlier Threshold for Class 7 . . . . . 69
4.27 Distance Distribution Plot with Outlier Threshold for Class 8 . . . . . 69
4.28 Distance Distribution Plot with Outlier Threshold for Class 8 . . . . . 70
4.29 CNN Test Accuracies Per Class . . . . . . . . . . . . . . . . . . . . . . 73
4.30 DENSENET Test Accuracies Per Class . . . . . . . . . . . . . . . . . . 75
4.31 Training and validation accuracy and loss for Run 3. . . . . . . . . . . 76
4.32 Accuracy and loss during training and validation for Run 4. . . . . . . 77
LIST OF ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
GI Gastrointestinal Tract
CNN Convolutional Neural Network
DENSENET Densely Connected Convolutional Network
KMC K-Means Clustering
Abstract
This study explores the effectiveness of Convolutional Neural Networks (CNNs) and
DenseNet models for binary and multi-class classification tasks using the Kvasir dataset,
which consists of gastrointestinal tract images. The research aims to perform a com-
parative analysis of CNN and DenseNet models on both raw and refined datasets.
Initially, binary and multi-class classifications were conducted on the original dataset
to establish baseline performances for both models. Following this, K-Means Cluster-
ing was applied for outlier detection to refine the dataset by removing anomalies. The
refined dataset was then used to re-evaluate the models’ performances on the same
classification tasks.
The results demonstrated that CNN outperformed DenseNet in binary classifica-
tion tasks, achieving an accuracy of 91.1% on raw data and 91.0% on refined data,
while DenseNet’s accuracy dropped from 83.5% to 73.1% after refinement. For multi-
class classification, DenseNet performed better on the raw dataset with an accuracy
of 84%, compared to CNN’s 74%. However, after data refinement, DenseNet’s perfor-
mance decreased slightly to 79%, whereas CNN showed a minor improvement, achieving
78%. The data refinement process, involving outlier removal, did not significantly af-
fect the overall performance of the CNN model but had a notable negative impact on
DenseNet, especially in binary classification tasks. These findings suggest that while
CNN is robust across both binary and multi-class tasks with or without data refine-
ment, DenseNet is more sensitive to changes in the dataset. This study highlights the
importance of dataset refinement and its varying impact on different neural network
architectures in medical image classification.
Chapter 1
Introduction
1.2 Applications of Machine Learning in Image Recognition
In the domain of image recognition, machine learning is crucial for enabling computers
to accurately identify and interpret visual information, much like how humans perceive
images. Machine learning algorithms, at their core, learn from large sets of labeled
images, figuring out patterns and features that distinguish one object or scene from
another. As they are exposed to more examples, these algorithms become better at
recognizing specific visual concepts. Techniques like CNNs analyze images by detecting
features and patterns, improving with more training data. Once trained, CNNs can be
used to analyze new images, extracting important features and making predictions about
what objects are present. They excel at tasks such as object detection, scene
understanding, face recognition, and even diagnosing medical conditions from images.
The accurate diagnosis and classification of gastrointestinal (GI) tract diseases are
crucial for effective treatment and patient care. With the increasing availability of
medical imaging data, particularly from endoscopic procedures, there is a pressing
need for robust models that can efficiently classify these images. The Kvasir dataset,
which contains diverse images of the GI tract, provides an excellent foundation for
developing and testing such models.
In recent years, Convolutional Neural Networks (CNNs) and DenseNet architectures
have emerged as powerful tools for image classification. However, their performance
can vary significantly based on the quality and characteristics of the dataset used.
Traditional datasets often contain outliers that can negatively impact model training
and reduce overall accuracy. This issue is particularly relevant in medical imaging,
where outlier data can represent rare or unusual cases that might skew the model’s
performance.
The motivation for this study stems from the need to understand how these ad-
vanced neural network models perform on both raw and refined datasets. By applying
outlier detection using K-Means Clustering, this research aims to refine the dataset and
enhance the reliability of classification results. Evaluating the impact of data refine-
ment on the performance of CNN and DenseNet models will provide valuable insights
into the best practices for preparing medical imaging data, ultimately contributing to
more accurate diagnostic tools.
Furthermore, comparing the performance of CNN and DenseNet on both binary
and multi-class classification tasks, before and after outlier removal, will offer a com-
prehensive understanding of the strengths and limitations of these models in handling
various levels of data complexity. This study not only aims to improve model accuracy
but also seeks to optimize data preprocessing techniques, potentially setting a new
standard for medical image analysis.
Figure 1.2: Sample visuals from the Kvasir dataset featuring eight classes
This classification groups the dataset into three classes based on the types of gas-
trointestinal tract images: anatomical landmarks, pathological findings, and images
related to polyp removal techniques. Each class contains distinct categories that repre-
sent different aspects of GI conditions and procedures, offering a structured approach
for analysis and research in medical imaging.
Chapter 2
LITERATURE REVIEW
vs. diseased pictures). They achieved noteworthy accuracies of 99.7% and 96.4%, re-
spectively, outperforming existing models. Validation on datasets such as ETIS-Larib
Polyp DB (10,000 images) and KVASIR showed that FLATer outperformed CNNs and
ViTs in terms of accuracy, precision, and recall. Notably, FLATer achieved a remarkable
throughput of 16.4k images per second while maintaining strong performance even in
the absence of pre-training. Their ablation study emphasized the importance of the
spatial attention module and residual block for improving classification accuracy.
Although FLATer represents a major step forward in the classification of GIT diseases,
its usefulness could be improved by larger datasets and additional clinical validation,
according to the authors (1). Dheir et al. (2022) from Al-Azhar University employed
deep learning techniques to enhance the classification of gastrointestinal (GI) tract
anomalies using the Kvasir dataset. This dataset, consisting of 8,000 annotated images
across eight classes, includes anatomical landmarks (pylorus, z-line, cecum), patho-
logical findings (esophagitis, polyps, ulcerative colitis), and procedural images (dyed
lifted polyps, dyed resection margins). The researchers retrained and evaluated five
prominent neural network architectures—VGG16, ResNet, MobileNet, Inception-v3,
and Xception—achieving varying accuracies. VGG16 and Xception outperformed the
others with accuracies of 98.3% and demonstrated robust performance due to their pre-
training on ImageNet, effectively handling the classification challenges posed by medical
images. Their approach included robust image preprocessing, data augmentation, and
model evaluation using the F-score metric, highlighting VGG16 as the most effective
model for GI anomaly classification (2). Pogorelov et al. (2017) introduced the Kvasir
dataset to enhance computer-aided detection of gastrointestinal (GI) diseases through
medical imaging. This dataset, curated with input from medical experts, comprises
4,000 annotated images categorized into eight classes. The study conducted baseline
experiments employing global feature extraction (GF), convolutional neural networks
(CNN), and transfer learning (TFL) with models like Inception v3. Results showed that
combining six global features with the Logistic Model Tree (LMT) classifier achieved
the highest performance, yielding an F1 score of 0.747 and 80 frames per second (FPS).
While the 6-layer CNN outperformed the 3-layer CNN in detection performance, TFL
demonstrated superior accuracy among the deep learning methods tested, highlighting
its efficacy. The research underscores the dataset’s pivotal role in enabling reproducible
studies and innovation in medical multimedia applications, serving as a fundamental
resource for advancing GI tract diagnostics (3). Gao et al. (2020) developed a novel
approach for outlier detection in wireless capsule endoscopy (WCE) images. They intro-
duced the Semi-Supervised Deep Model (SODM) framework, leveraging a combination
of Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks
(LSTMs). The model focused on identifying anomalous patterns in WCE images by
analyzing spatial-scale trends across sequential image regions. They utilized a dataset
comprising approximately 22,000 WCE images, categorizing images into normal and
abnormal classes representative of various small intestinal diseases. They compared
their approach with traditional outlier detection methods such as K-nearest neighbors
(KNN), Local Outlier Factor (LOF), and Support Vector Data Description (SVDD),
showing superior performance in terms of accuracy (93.27%) and sensitivity (86.17%).
Their findings underscored the efficacy of integrating deep learning architectures with
anomaly detection techniques for enhancing diagnostic capabilities in WCE imaging,
paving the way for future advancements in medical image analysis and disease detec-
tion (4). Iakovidis et al. (2018) developed a pioneering method using deep learning to
automatically detect and pinpoint gastrointestinal (GI) anomalies in endoscopic video
frames. They leveraged weakly annotated images for training, which proved cost-
effective compared to detailed pixel-level annotations. Their methodology comprised
three phases: first, a weakly supervised CNN classified video frames as normal or ab-
normal; second, a deep saliency detection algorithm identified key points in abnormal
images; and third, an iterative cluster unification technique localized GI anomalies us-
ing these points. Evaluating their approach on the MICCAI Gastroscopy Challenge
Dataset and the KID Dataset, they achieved impressive results with AUC scores sur-
passing 80%, peaking at 96% for anomaly detection in gastroscopy images and 88% for
wireless capsule endoscopy images. Their use of WCNN for classification, coupled with
DSD and ICU for localization, demonstrated significant efficacy in anomaly detection
and localization tasks, offering a robust framework for analyzing GI endoscopy videos
without necessitating intricate pixel-level annotations (5). Habte et al. (2019) aimed
to identify gastrointestinal (GI) disorders in endoscopic images by
applying deep learning algorithms. To train their models, they used the Kvasir dataset,
an openly accessible database of GI images divided into eight classes. Out
of the 4,000 images in the dataset, 2,000 were used, with 60% allocated to training,
30% to testing, and 10% to validation. The authors used
two convolutional neural network (CNN) architectures that were optimized using pre-
trained ImageNet weights: ResNet50 and DenseNet121. After evaluation
on a distinct set of 600 images, DenseNet121 and ResNet50 demonstrated accuracy
rates of 86.9% and 87.8%, respectively. Overall, the study indicated that the ResNet50
model performed marginally better than the DenseNet121 model, especially in
reliably differentiating between specific classes such as dyed lifted polyps and
dyed resection margins. Additionally, they noted many misclassifications, such as the
confusion of esophagitis with a normal z-line because of visual similarities between
the images (6). Ramzan et al. (2022) introduced the Graft-U-Net, a deep learning
model tailored for the segmentation of gastrointestinal tract polyps from colonoscopy
images. Utilizing datasets Kvasir-SEG and CVC-ClinicDB, comprising 1000 and 612
images respectively, they aimed to improve early detection of colorectal anomalies cru-
cial for cancer prevention. The Graft-U-Net, an enhanced version of UNet, integrates
three stages: preprocessing to enhance image contrast, an encoder for feature analy-
sis, and a decoder for feature synthesis. Evaluations demonstrated superior segmen-
tation performance with mean Dice coefficients of 96.61% (Kvasir-SEG) and 89.95%
(CVC-ClinicDB), surpassing previous models like UNet and ResUNet (7). Ismael et
al. (2020), from various Iraqi institutions, developed an automated system for classi-
fying white blood cells (WBCs) based on shape features extracted from medical im-
ages. They focused on five WBC types: Basophil, Eosinophil, Lymphocyte, Monocyte,
and Neutrophil, aiming to streamline diagnosis and reduce errors in medical settings.
The system comprises image preprocessing, segmentation, feature extraction (includ-
ing shape and texture), and classification using machine learning algorithms like K*
classifier, Additive Regression, Bagging, Input Mapped Classifier, and Decision Table.
Evaluations showed that the K* classifier performed best, achieving high accuracy in
classifying WBCs. The study used an undisclosed dataset of WBC images, emphasizing
robust feature selection and classifier performance to enhance diagnostic capabilities
(8).
The usefulness of Generative Adversarial Networks (GANs) for anomaly detection
(AD) in biomedical imaging was assessed by Esmaeili et al. (2023) using seven differ-
ent medical image datasets. The research emphasizes the difficulties in AD caused
by the absence of annotated data and examines state-of-the-art GAN-based tech-
niques from both a model-centric and data-centric standpoint, such as f-AnoGAN,
GANomaly, and Multi-KD. The datasets included blood cancer images, mammo-
grams, retinal OCT, CT, and MRI images, and varied in sample size, image dimensions,
and anomaly kinds. When performance measures including AUC, F1-Score, Precision,
Recall, and Specificity were applied, the findings were somewhat inconsistent (AUC:
0.475-0.991; Sensitivity: 0.17-0.98; Specificity: 0.14-0.97). The results showed that
none of the techniques worked consistently well, highlighting the need for more reliable
and broadly applicable models. In light of the need for more research to improve AD
models for biomedical imaging, the authors concluded that the current unsupervised
DL-based AD methods are unreliable for clinical applications and suggested taking
anomaly subtlety, spread, and tissue differences into consideration in future AD algo-
rithm designs (10).
The study by Cai et al. (2024) comprehensively evaluates anomaly detection (AD)
methods in medical images across seven datasets. They focused on developing a bench-
mark for fair evaluation, addressing the lack of comprehensive assessments in the field.
The datasets cover a variety of abnormality patterns and comprise images from chest
X-rays, brain MRIs, dermatoscopic images, retinal fundus images, and histology whole-
slide images. Twenty-seven AD techniques addressing both pixel-level segmentation
and image-level classification were tested. Notably, they compared reconstruction-
based and self-supervised learning (SSL) methods, highlighting the effectiveness of
SSL approaches like AnatPaste and NSA for generating realistic anomalies. Results
showed SSL methods with realistic synthetic data generally outperforming others, es-
pecially in two-stage paradigms. Additionally, they found that methods utilizing Ima-
geNet pre-trained weights, such as ResNet18, demonstrated strong performance across
datasets, indicating the potential of pre-trained models in medical AD (11). A group
of researchers proposed a technique for the early detection and classification of COVID-
19 using chest X-ray images in "COVID-19 Anomaly Detection and Classification
Method Based on Supervised Machine Learning of Chest X-ray Images" (2021). Rec-
ognizing the importance of prompt diagnosis in improving recovery prospects and
limiting viral spread, they applied an array of image processing method-
ologies. The procedure entails preprocessing (morphological operations, thresholding,
and noise reduction), segmenting and identifying the Region of Interest (ROI), and
extracting features using the Histogram of Oriented Gradients (HOG), Local Bi-
nary Pattern (LBP), and Haralick texture features. They used Support Vector Ma-
chine (SVM) and K-Nearest Neighbour (KNN) for classification, yielding six distinct
models: LBP-KNN, HOG-KNN, Haralick-KNN, LBP-SVM, HOG-SVM, and Haralick-
SVM. These models underwent 5-fold cross-validation testing on 5,000 images. With
an average accuracy of 98.66%, sensitivity of 97.76%, specificity of 100%, and precision
of 100%, the LBP-KNN model performed the best. This method eliminates the need
for manual feature extraction and selection by demonstrating a reliable and automated
end-to-end solution for the early detection and classification of COVID-19 (13).
Tiwari et al. (2024) discuss techniques for enhancing outlier detection and dimen-
sionality reduction in machine learning, particularly focusing on extreme value analysis.
They emphasize the detrimental impact outliers can have on machine learning mod-
els, leading to inaccurate results and prolonged training times. The paper explores
various methods for detecting different types of outliers in
high-dimensional datasets. Challenges such as computational complexity in feature
reduction for streaming data are highlighted, alongside the classification of outlier de-
tection techniques into predictive and direct methods. The authors advocate for the de-
velopment of efficient techniques capable of handling large volumes of data while main-
taining accuracy. They underscore the importance of these techniques in real-world
applications such as medical diagnosis and fraud detection, concluding with insights
into dimensionality reduction methods like t-SNE for preserving valuable data structure
(16). Sri Krishna et al. (2024) conducted a study on outlier detection in smart home
energy consumption data using various machine learning and statistical techniques on
the "Tracebase" dataset. They compared methods such as ARIMA, autoencoder, DB-
SCAN, isolation forest, k-means, HDBSCAN, SVM, LOF, LSTM, winsorization, IQR,
and Z-score. Their findings revealed that DBSCAN consistently outperformed other
techniques in accurately identifying outliers, especially those indicating nonlinearity
and unexpected load behavior. DBSCAN’s robust performance was highlighted by its
ability to effectively isolate significant deviations from the dataset’s norm. Conversely,
methods like Z-score, IQR, and winsorization struggled with the complexities of non-
linear data patterns (17). Omar Alghushairy et al. (2024) introduce an anomaly-based
network outlier detection system (NODS) designed to enhance cybersecurity by iden-
tifying abnormal network traffic. Utilizing the NSL-KDD and CICIDS2017 datasets,
the study employs various techniques including normalization, feature selection using
PCA and CFS, and hyperparameter tuning with a Genetic Algorithm (GA) to optimize
detection accuracy. Results demonstrate that their SVM-based approach significantly
reduces false alarms and detection times while improving classification accuracy com-
pared to traditional methods (18). Yalla et al. (2022) developed the OALOFS-MLC
model for financial crisis prediction (FCP) within a big data framework. Utilizing the
German Credit dataset (1,000 samples, 24 features) and the Australian Credit dataset
(690 instances, 14 features), they applied an oppositional ant lion optimizer-based fea-
ture selection and the DRVFLN classification model. Their approach outperformed
other methods such as PIOFS, ACOFS, GWOFS, and PSOFS across various metrics.
The OALOFS-MLC model achieved impressive results: on the German Credit dataset,
it attained an accuracy of 98.75%, sensitivity (sensy) of 97.36%, specificity (specy) of
97.06%, F-score of 97.31%, Matthews correlation coefficient (MCC) of 96.13%, and
kappa of 96.19%. Similarly, on the Australian Credit dataset, it reached an accuracy
of 98.50%, sensy of 97.41%, specy of 96.53%, F-score of 97.92%, MCC of 97.53%, and
kappa of 96.22%. These findings underscore the OALOFS-MLC model’s effectiveness
in enhancing FCP accuracy, suggesting its potential utility in economic forecasting
and risk management applications (19). In their 2024 survey, Rahimighazvini et al.
review methods for anomaly detection and diagnosis in power electronics using ma-
chine learning and deep learning techniques. The study highlights the crucial role of
power electronics in applications like renewable energy and electric vehicles, noting the
systems’ susceptibility to cyber and physical anomalies. They categorize anomalies
into point, contextual, and collective types and discuss detection methods including
supervised, unsupervised, and statistical approaches. Supervised methods like Random
Forest, Extreme Gradient Boosting, Logistic Regression, and K-Nearest Neighbors are
used for classifying anomalies, while deep learning methods such as autoencoders and
LSTM networks are highlighted for their pattern recognition capabilities. Unsuper-
vised techniques, including K-means, DBSCAN, and OPTICS, and statistical methods
like the Mahalanobis Distance and Local Outlier Factor are also detailed for their ef-
ficacy in identifying outliers. The survey emphasizes the importance of distinguishing
between cyber-attacks and physical faults to ensure system reliability and security, un-
derscoring the need for advanced detection and diagnosis systems to handle the growing
complexity of power electronics (20).
In the paper "Blood Donation Prediction using Artificial Neural Network" by Eman
Alajrami et al. (2019), the researchers explore the efficacy of the JustNN environment
in predicting blood donation needs. This study addresses the increasing demand for
blood due to surgeries, accidents, and diseases. By developing an Artificial Neural Net-
work model, the researchers aimed to determine if the JustNN tool could significantly
enhance prediction performance. Accurate forecasting of blood donor numbers is cru-
cial for medical professionals to plan effectively and attract enough volunteers to meet
the rising demand. The study concluded that the ANN model using the JustNN tool
achieved a test set performance accuracy of 99.31%, which is superior to other studies’
results. This indicates that JustNN is a highly effective tool for blood donation pre-
diction (21). In the paper "Age and Gender Prediction and Validation Through Single
User Images Using CNN" by Abdullah M. Abu Nada et al. (2020), published in the
International Journal of Academic Engineering Research (IJAER), the authors pro-
pose a novel method to validate user gender and age from photos using Convolutional
Neural Networks (CNN). The study, utilizing a dataset of 430 University of Palestine
students’ photos, achieved a gender prediction accuracy of 82% overall (89% for males
and 74% for females), but struggled with age prediction, which had an accuracy of 57%.
Challenges included less distinct female facial features and hijabs obscuring features,
as well as natural variations in aging. The research highlights the need for improved
models and diverse datasets to enhance demographic predictions from images (22). In
the work by Mukrimah Nawir et al. (2018), researchers proposed an efficient approach
to network anomaly detection using machine learning algorithms. They addressed
the challenge of limited labeled network datasets by focusing on the UNSW-NB15
dataset, modifying it to enhance experimental reliability by excluding certain irrele-
vant features. The study utilized three Bayesian algorithms—Average One Dependence
Estimator (AODE), Bayesian Network (BN), and Naive Bayes (NB)—implemented
through WEKA tools. Through rigorous experimentation, they demonstrated that
AODE performed exceptionally well with an accuracy of 94.37%. Their findings un-
derscored AODE’s efficiency and effectiveness in handling network anomaly detection,
particularly on the UNSW-NB15 dataset, making it a robust choice compared to BN
and NB algorithms (23). Yadav et al. investigated deep learning techniques for pneu-
monia classification from chest X-ray images in their 2019 study. Three methods were
assessed: a capsule network that was trained from scratch, transfer learning on VGG16
and InceptionV3 CNNs, and a linear SVM classifier with orientation-free and local ro-
tation variables. Using a dataset of 5,232 training and 624 testing images,
they found that data augmentation enhanced performance in all cases. The best
results were obtained with transfer learning on VGG16, which achieved an accuracy
of 90.2%. The study stressed how crucial it is to tune specific parameters and bal-
ance network complexity with dataset quantity. The effectiveness of capsule networks
was lower than that of VGG16. The generalisation of the approach was confirmed by
validating these results on an OCT dataset (24).
Faes et al. (2019) investigated the feasibility of automated deep learning tools for
medical image classification, targeted at healthcare professionals without coding ex-
pertise. They utilized five public datasets: MESSIDOR (retinal fundus), Guangzhou
Medical University and Shiley Eye Institute (OCT), HAM10000 (skin lesions), and NIH
(pediatric and adult chest X-rays). Employing Google Cloud AutoML for neural archi-
tecture search, they developed models achieving high diagnostic properties (sensitivity
73.3% to 97.0%; specificity 67% to 100%) and discriminative performance (AUPRC
0.57 to 1.00) in internal validations. External validation on the Edinburgh Dermofit
Library dataset showed lower performance (AUPRC 0.47, sensitivity 49%, positive pre-
dictive value 52%). The study highlighted the potential of automated tools in medical
image analysis while noting limitations in dataset quality and model complexity for
advanced tasks (25). Amiri et al. (2023) conducted a systematic literature review
on Deep Learning (DL) techniques for pattern recognition across cyber-physical-social
systems. They analyzed 60 articles focusing on DL methods like CNNs, RNNs, GANs,
and more, categorizing them by application and performance metrics. Using Python
for implementation, they evaluated models based on accuracy, adaptability, and se-
curity across various datasets including medical imaging and visual recognition tasks.
The review highlighted advancements and limitations in current DL approaches, em-
phasizing the need for improved security measures and adaptive capabilities in future
research to enhance pattern recognition accuracy and applicability (26). In their 2020
study, Abu-Saqer and Al-Shawwa developed a Grapefruit classification system using
deep learning techniques. They utilized a dataset from Kaggle comprising 1,312 images
of Pink and White Grapefruit, with 70% for training and 30% for validation, achieving
100% accuracy on both sets. Implementing Convolutional Neural Networks (CNNs)
with four layers and a dropout of 0.2, their model successfully classified Grapefruit
types based on image features extracted through CNNs. The system aims to automate
classification tasks in various applications, such as restaurants and factories, demon-
strating robust performance in distinguishing between different Grapefruit varieties
(28). Huang et al. (2022) proposed a novel hybrid neural network for medical image
classification using a combination of PCANet and DenseNet architectures. They aimed
to enhance classification accuracy despite limited training data. Utilizing datasets in-
cluding DDSM, osteosarcoma histology images, and MIAS, their approach involved
a modified PCANet for initial feature extraction followed by a simplified DenseNet
for precise classification. Achieving superior results compared to popular models like
VGG, ResNet, and DenseNet, their HybridNet demonstrated an accuracy of 83%, sen-
sitivity of 89.3%, and specificity of 78.7%. The hybrid approach effectively addressed
overfitting issues while outperforming other networks in classifying breast tissue densi-
ties, showing promise for future medical imaging applications (29). A DenseNet-based
model for metastatic cancer image categorization was presented by Zhong et al. in
2020. They used a modified version of the PatchCamelyon (PCam) dataset,
which was designed specifically for binary image classification of metastasis detection.
There were 220,025 samples in the collection, of which 89,117 were positive (malig-
nant) and 130,908 were negative (non-cancerous). They used DenseNet201 and
its enhanced variant, known as DenseNet201 TTA. Their tests showed that in terms
of accuracy and AUC-ROC score, DenseNet201 models performed better than ResNet34
and VGG19. With an accuracy of 0.989 and the highest AUC-ROC score of 0.971,
DenseNet201 (TTA) outperformed the other models by a wide margin. The study
showcased the robust performance of DenseNet designs and their promise for future
improvements in medical diagnostics, emphasizing their usefulness in enhancing the
accuracy of cancer image categorization (30).
In 2020, Poornima et al. proposed an online anomaly detection system for Wireless
Sensor Networks (WSNs) using the OLWPR algorithm. Their study focused on enhanc-
ing WSN security by identifying anomalous sensor data. They utilized a dataset from
IBRL comprising 40,000 sensor readings, including temperature, humidity, and light
measurements. After preprocessing to handle missing and duplicate records, anoma-
lies were injected for testing. OLWPR, aided by PCA for dimensionality reduction,
achieved an 86% detection rate with a low 16% error rate. The advantage of OL-
WPR was shown in comparisons with Gaussian, SMO, and Linear regression in terms
of RMSE, percentage error, accuracy, F1-score, sensitivity, and specificity. Accord-
ing to the study’s findings, OLWPR performed better in real-time anomaly detection
in WSNs than conventional techniques like Logistic Regression, Decision Tree, Ran-
dom Forest, Adaboost, SVM, and ANN (31). Pratta et al. (2016) presented a study at
MIUA 2016 detailing a Convolutional Neural Network (CNN) approach for diagnos-
ing diabetic retinopathy (DR) using fundus images. They utilized a Kaggle dataset
of 80,000 images, training a CNN with data augmentation to classify DR severity
levels. Achieving 75% accuracy and 95% specificity on 5,000 validation images, the
CNN demonstrated robust performance in automated DR diagnosis, particularly in
distinguishing proliferative cases and absence of DR. However, sensitivity for mild and
moderate DR cases was lower, indicating challenges in detecting subtle features. Fu-
ture plans include refining the CNN with improved datasets and comparing it with
other classification methods like SVM (32). Rao et al. (2011) introduced a method
using K-means clustering and ID3 decision trees for anomaly detection in computer
networks. They applied these techniques to classify normal and anomalous activities,
focusing on both supervised and unsupervised learning approaches. Using datasets
like iris.arff and weather.nominal, they achieved clustering with 67% and 33% distribu-
tion among clusters, and utilized ID3 decision trees to classify weather data effectively.
The combined K-means and ID3 approach aimed to enhance classification performance
by refining decision boundaries within clusters (33). In their 2023 study, Vania et al.
investigated the use of deep learning (DL) and machine learning (ML) methods for
identifying lesions in the upper gastrointestinal (GI) tract. They reviewed 65 studies
using datasets like KVASIR, MEDICO 2018, BIOMEDIA 2019, and others, focusing on
ML models like SVM which achieved accuracies, sensitivities, and specificities ranging
from 0.87 to 0.98, 0.85 to 0.98, and 0.93 to 0.98, respectively. DL models, particularly
CNN-based supervised learning models like SSD and Mask RCNN, were also promi-
nent in GI image analysis. RGB imaging proved crucial for detecting features like
bleeding. Challenges included dataset variability, suggesting a need for standardized
databases to train robust AI systems for GI endoscopy (34). Ramzan et al. (2021) devel-
oped a computer-aided diagnostic system (CADx) for classifying gastrointestinal (GI)
tract infections using deep learning techniques. They utilized color image datasets like
KVASIR, NERTHUS, and stomach ULCER, evaluating models such as InceptionNet,
ResNet50, and VGG-16. Preprocessing in LAB color space and feature fusion with
local binary patterns (LBP) enhanced disease prediction accuracy. Feature selection
methods like PCA and mRMR were employed to optimize characteristics for various
classifiers. The subspace discriminant classifier achieved notable results, with 95.02%
accuracy on KVASIR, outperforming other classifiers. On NERTHUS, the best accu-
racy was 99.9% with cubic SVM, and on ULCER, cubic SVM reached 100% accuracy,
indicating robust performance across datasets (35). The article by de Lange et al.
in 2018 focuses on developing machine learning algorithms to enhance gastrointesti-
nal (GI) endoscopy performance. They address the variability in diagnostic accuracy
among endoscopists, which affects detection rates of mucosal lesions, leading to chal-
lenges like the 20% average polyp miss-rate in colonoscopies. The research employs a
range of machine learning methodologies, encompassing both conventional and deep
learning approaches such as generative adversarial networks (GANs) and convolutional
neural networks (CNNs). They emphasize the importance of dataset quality and size,
recommending at least 1000 images per class for robust deep learning applications.
Results show promising accuracies above 90%, with CNNs often outperforming sim-
pler methods. They advocate for standardized metrics and open datasets to facilitate
reproducibility and comparisons in AI-assisted GI endoscopy systems (36). Using the
Kvasir dataset, Cogan et al.'s (2019) study focuses on applying deep learning to ac-
curately detect illnesses and anatomical landmarks in gastrointestinal tract images.
They present the MAPGI framework for image preprocessing in order to handle is-
sues such as sparse annotations and image variability. With accuracies of 98.45%,
98.48%, and 97.35%, respectively, three deep neural network architectures—Inception-
v4, Inception-ResNet-v2, and NASNet—are trained and compared. With excellent
recall (93.9%), specificity (99.1%), F1 score (93.8%), precision (93.8%), and Matthews
correlation coefficient (MCC) of 92.9%, Inception-v4 performs better than other mod-
els. The authors point out that smaller models—such as Inception-v4—perform better
on this dataset than larger models like NASNet because of their computational effi-
ciency and lower risk of overfitting (37). The study by Song et al. (2021) tackles the
challenge of localizing a colonoscope in the GI tract using monocular images. They in-
novate by blending deep learning with traditional geometry-based methods to improve
localization accuracy despite limited labeled data. Using a Siamese architecture, their
DL models classify images into anatomical zones based on expert-segmented GI tract
zones, aiding in initial pose estimation. Validation on synthetic and in-vivo datasets
shows high zone classification accuracies of up to 98.6% for synthetic data and around
97-98% for in-vivo data. Pose accuracy results are impressive, with deviations as small
as 1.41 degrees and 0.05 units. Comparative analyses indicate their hybrid approach
outperforms pure DL or geometry-based methods, especially when trained on synthetic
data and tested on in-vivo data, achieving superior zone classification accuracies, up to
79%. Future plans include incorporating depth information and enhancing the realism
of synthetic datasets through adversarial learning (38). Future objectives include us-
ing adversarial learning to improve the realism of synthetic datasets and adding depth
information (38). Gautam Buddha University’s Pachauri et al. (2015) discuss fault
detection in medical wireless sensor networks (WSNs). In order to improve anomaly
detection capabilities, they apply machine learning methods, concentrating on cat-
egorising and identifying anomalous sensor readings from the MIMIC dataset (121
records), such as heart rate, SpO2, PULSE, body temperature, and respiration rate.
For classification, algorithms such as J48, Random Forests, and k-Nearest Neighbours
are used; Random Forests perform better in ROC analysis and mean absolute error
comparison. Additive Regression with k-NN produces the best correlation coefficient
and lowest error for regression tasks. Overall, their methodology highlights the poten-
tial of machine learning in healthcare applications by showing promise in enhancing
fault detection efficiency in medical WSNs (39). In their 2021 study, Reddy et al. from
various institutions in India explore machine learning techniques for outlier detection
in medical datasets. They propose a novel algorithm combining supervised and un-
supervised learning to identify outliers based on attributes like heart rate and oxygen
saturation from datasets sourced from Kaggle. Using their approach, they compare
various methods and find their machine learning model achieves superior accuracy in
outlier detection, particularly on real-time medical data. Their experiments highlight
the effectiveness of this approach in enhancing anomaly detection efficiency, suggest-
ing its potential for reducing healthcare industry workloads and improving diagnostic
accuracy (40).
In this study, we provide an overview of the dataset used, emphasizing its pivotal role
in our methodology aimed at refining outlier detection in gastrointestinal tract images.
Our approach involves initial classification using Convolutional Neural Networks (CNN)
and DenseNet models on the dataset, followed by outlier detection using k-means
clustering. We then explore how refining the dataset by removing outliers impacts
classification accuracy using these deep learning models.
The Cancer Registry of Norway (CRN) and Vestre Viken medical specialists have
painstakingly annotated every image in the dataset. The CRN, affiliated with the
South-Eastern Norway Regional Health Authority and independently operated under
Oslo University Hospital Trust, conducts cancer research and manages national cancer
screening programs aimed at early detection and prevention of cancer-related deaths.
The Kvasir dataset focuses on images and annotations related to the gastrointestinal
(GI) tract, crucial for understanding and diagnosing diseases such as the three most
common cancers worldwide, which affect this system.
• Z-Line
• The Z-line marks the border where the esophagus meets the stomach. When
viewed through an endoscope, it appears as a clear line where the white esophageal
tissue meets the reddish stomach lining.
• Recognizing the Z-line is important to assess if any disease is present, such as
signs of gastro-esophageal reflux.
• It’s also helpful for describing any problems in the esophagus.
• Pylorus
• The pylorus is the area around the opening from the stomach into the small
intestine (duodenum). This opening has circular muscles that regulate the flow
of food from the stomach.
• Identifying the pylorus is crucial for navigating the endoscope into the duodenum,
which can be challenging during gastroscopy.
• In an endoscopic image from inside the stomach, the pylorus appears as a smooth,
round opening surrounded by uniform pink stomach tissue.
• Cecum
• The cecum is the first part of the large intestine, located at the beginning of the
colon.
• Reaching the cecum confirms a thorough colonoscopy, and its successful exami-
nation is an important quality indicator.
• One distinctive feature of the cecum is the appendiceal orifice, seen as a crescent-
shaped slit.
• Documentation of the cecum, including its appearance and location via photos
or notes in reports, is essential for verifying the completeness of the colonoscopy.
• In the endoscopic view, the green picture-in-picture display shows the scope’s
position to confirm the cecum’s location.
• Esophagitis
• Breaks in the esophageal lining around the Z-line indicate the presence of esophagi-
tis, an inflammation of the esophagus. As an illustration, image 3.1 shows red
streaks on the white esophageal lining. The length of these breaks and
the proportion of the circumference affected indicate the degree of inflammation.
• The most prevalent causes of this illness include hernias, vomiting, and acid
reflux—the backflow of stomach acid into the esophagus.
• Polyps
• Polyps are abnormal growths in the bowel lining that can vary in shape (flat,
raised, or on a stalk). They can be distinguished from normal tissue by their color
and surface texture. While the majority of polyps are benign, some may
eventually become malignant.
• The green boxes in the image 3.2 illustrate how endoscope positions are tracked
during live procedures, aiding in locating and assessing polyps.
• Ulcerative Colitis
• Ulcerative colitis is a chronic inflammatory disease that affects the large intestine,
causing symptoms like bleeding, swelling, and ulceration of the intestinal lining.
• Diagnosis is primarily based on findings from colonoscopy. The severity of the dis-
ease varies, with mild cases showing swollen and reddened mucosa, and moderate
cases displaying prominent ulcerations.
• Image 3.3 depicts ulcerative colitis, where the mucosa is covered in a white layer
(fibrin) over the ulcers.
• Using automated computer systems for assessing disease severity could improve
accuracy in grading and managing this condition.
• Dyed and Lifted Polyps
• A polyp that has been lifted by injection of indigo carmine and saline is shown in
Figure 3.4. The polyp's pale blue borders stand out sharply against the deeper hue
of the surrounding tissue.
• Further useful information for automated reporting might include the success
of lifting and any areas that remain unliftable, which could indicate potential
malignancy.
• Dyed Resection Margins
• It is crucial to assess the margins of the resected tissue to confirm whether the
entire polyp has been completely removed.
• Any remaining polyp tissue could lead to further growth and, in the worst case,
develop into cancer.
• Figure 3.5 illustrates the site after removing a polyp. Automatically recognizing
the location of polyp removals is valuable for automated reporting systems and
for assessing how effectively the polyp has been removed.
• Image Resizing: Within each category, the function loops through all the image
files. For each image:
o The image is read using OpenCV’s cv2.imread function.
o A check is performed to ensure the image was read correctly.
o The image is resized to the target size (64x64 pixels) using cv2.resize.
• Saving Resized Images: The resized image is saved to the corresponding cat-
egory subdirectory in the output path using cv2.imwrite.
By implementing this method, I successfully resized all images in the Kvasir
dataset to 64x64 pixels, creating a new dataset that was uniformly sized and
ready for further preprocessing and model training. This step was essential for
ensuring that the images fed into the convolutional neural networks (CNN and
DenseNet) were of a consistent size, thereby improving the model training process
and overall performance.
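As a minimal sketch of this resizing step, the loop described above might look as follows; the directory names and function name are illustrative assumptions, not the exact script used in the thesis:

```python
import os
import cv2  # OpenCV, as referenced above


def resize_dataset(input_dir, output_dir, size=(64, 64)):
    """Resize every image in each category subdirectory to a uniform size."""
    for category in os.listdir(input_dir):
        in_cat = os.path.join(input_dir, category)
        out_cat = os.path.join(output_dir, category)
        os.makedirs(out_cat, exist_ok=True)
        for fname in os.listdir(in_cat):
            img = cv2.imread(os.path.join(in_cat, fname))
            if img is None:  # skip files that could not be read correctly
                continue
            resized = cv2.resize(img, size)  # resize to the 64x64 target
            cv2.imwrite(os.path.join(out_cat, fname), resized)
```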
To enhance the training process and improve model generalization, I applied data
augmentation techniques using the ImageDataGenerator class from TensorFlow. This
included rescaling pixel values, applying shear transformations, zooming, and horizon-
tal flipping of the images. This step ensures that the model is exposed to a variety of
image transformations, helping it generalize better to unseen data.
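For illustration, this augmentation setup can be sketched as below; the specific parameter values (shear and zoom ranges) are assumptions, since the text does not state them:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation pipeline; exact parameter values are assumed
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,     # rescale pixel values to [0, 1]
    shear_range=0.2,       # random shear transformations
    zoom_range=0.2,        # random zoom
    horizontal_flip=True,  # random horizontal flipping
)
```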
3.3.1.2 Binary CNN Architecture
• Input Layer: The input layer is designed to accept images of dimensions 64x64
pixels with 3 color channels (RGB).
• Convolutional Layer: The initial layer applies convolutional filters to the in-
put image. Each filter is of size 3x3, and with 32 filters, the operation can be
represented as:
$$\mathrm{Output}_{i,j} = \sigma\left(\sum_{m=0}^{2}\sum_{n=0}^{2} \mathrm{Input}_{i+m,\,j+n} \times \mathrm{Filter}_{m,n} + \mathrm{Bias}\right)$$
• Flattening Layer: After the pooling layer, the feature maps are flattened into
a single vector. This step prepares the data for the fully connected layers by
converting the 2D feature maps into a 1D feature vector.
• Fully Connected Layer: Two dense layers with ReLU activation. The first
layer with 128 neurons learns higher-level representations from the flattened input
data:
$$\mathrm{Output} = \sigma\left(\sum_{i=1}^{n} \mathrm{Input}_i \times \mathrm{Weight}_i + \mathrm{Bias}\right)$$
• Output Layer: The final output layer uses a sigmoid activation function, pre-
dicting probabilities for binary classification (positive class = 1).
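A minimal Keras sketch of the binary architecture described above is given below; the 2x2 pooling size and the single convolution-pooling block are assumptions where the text does not pin them down:

```python
from tensorflow.keras import layers, models

# Sketch of the binary CNN described above (assumed details hedged in comments)
binary_cnn = models.Sequential([
    layers.Input(shape=(64, 64, 3)),            # 64x64 RGB input images
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 filters of size 3x3
    layers.MaxPooling2D((2, 2)),                # pooling layer referenced above
    layers.Flatten(),                           # 2D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),       # first dense layer, 128 neurons
    layers.Dense(1, activation="sigmoid"),      # binary output probability
])
```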
3.3.1.2.1 Model Compilation and Training for Binary CNN: The binary
CNN model was compiled using the Adam optimizer with binary cross-entropy loss,
optimized for binary classification tasks:
Binary Cross-Entropy Loss:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i$ are the true labels (0 or 1), and $\hat{y}_i$ are the predicted probabilities.
Training involved fitting the model to augmented data in batches of size 16 for 10
epochs. Performance was monitored on a validation set to optimize training accuracy
and generalization. The test set was used for final evaluation to assess the model’s
performance on unseen data.
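A sketch of this compilation and training step, continuing the `binary_cnn` model above, might look as follows; `train_gen` and `val_gen` are assumed names for the augmented data generators:

```python
binary_cnn.compile(optimizer="adam",
                   loss="binary_crossentropy",  # the loss defined above
                   metrics=["accuracy"])

# train_gen / val_gen are assumed generator names; the batch size of 16 is
# set where the generators are created (e.g., in flow_from_directory)
history = binary_cnn.fit(train_gen, validation_data=val_gen, epochs=10)
```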
3.3.1.3 Multi-Class CNN Architecture
• Convolutional Layers: Three layers with increasing filters (32, 64, 128) and
ReLU activation functions extract hierarchical features from the input images:
$$\mathrm{Output}_{i,j,k} = \sigma\left(\sum_{m=0}^{2}\sum_{n=0}^{2}\sum_{p=0}^{2} \mathrm{Input}_{i+m,\,j+n,\,p} \times \mathrm{Filter}_{m,n,p} + \mathrm{Bias}\right)$$
• Pooling Layers: Max-pooling layers after each convolutional layer with a pool
size of 2x2 reduce spatial dimensions while retaining significant features.
• Fully Connected Layers: Two dense layers with ReLU activation. The first
layer with 128 neurons learns high-level representations:
$$\mathrm{Output} = \sigma\left(\sum_{i=1}^{n} \mathrm{Input}_i \times \mathrm{Weight}_i + \mathrm{Bias}\right)$$
• Output layer: The final output layer uses softmax activation, outputting prob-
abilities across 8 classes for multi-class classification.
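A corresponding sketch of this multi-class architecture, with the three convolutional blocks (32, 64, 128 filters) and an 8-way softmax head, is shown below; the pooling sizes are assumptions:

```python
from tensorflow.keras import layers, models

# Sketch of the multi-class CNN described above (assumed details hedged)
multi_cnn = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),                 # 2x2 max-pooling after each block
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),        # high-level representations
    layers.Dense(8, activation="softmax"),       # probabilities over 8 classes
])
```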
3.3.1.3.1 Model Compilation and Training for Multi-Class CNN: The multi-
class CNN model was compiled using the Adam optimizer with categorical cross-
entropy loss, suitable for multi-class classification problems:
$$L = -\sum_{i=1}^{N}\sum_{j=1}^{C} y_{i,j}\log(\hat{y}_{i,j})$$
where $y_{i,j}$ are the true labels (one-hot encoded) and $\hat{y}_{i,j}$ are the predicted probabilities.
Training involved 10 epochs with a batch size of 32, using augmented data for batch
processing. Evaluation used accuracy as the metric, with performance monitored on
a validation set during training to optimize model performance. The final model’s
performance was assessed using a separate test set to provide an unbiased estimate of
its generalization capabilities.
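Combining the layers and training settings above, a sketch of the multi-class model is given below; the 64×64 input size and the 64-unit second dense layer are assumptions, while the filter counts, pooling, softmax output, and loss follow the text.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

multi_cnn = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation="relu"),    # first fully connected layer
    Dense(64, activation="relu"),     # second dense layer (size assumed)
    Dense(8, activation="softmax"),   # probabilities over the 8 classes
])
multi_cnn.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
# Training: 10 epochs with batch size 32 on the augmented generators.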
3.3.2 DenseNet Architecture
Input Layer: The input layer accepts images of dimensions 48 × 48 pixels with 3 color channels (RGB).
Base Model: The architecture utilizes DenseNet121 with pre-trained weights on
ImageNet. The top classification layer is excluded to leverage the learned features for
both binary and multi-class classification tasks.
Dense Blocks: DenseNet121 contains four dense blocks. Each block consists of multiple convolutional layers with batch normalization and ReLU activation. The number of layers in each block is as follows:
• Block 1: 6 layers
• Block 2: 12 layers
• Block 3: 24 layers
• Block 4: 16 layers
Transition Layers: Between dense blocks, transition layers are used to reduce the feature-map size. Each transition layer includes batch normalization, a 1×1 convolution, and 2×2 average pooling.
Additional Layers:
• Flattening Layer: After the dense blocks, the output feature maps are flattened
into a one-dimensional vector.
• Output Layer: For binary classification, the output layer uses a sigmoid acti-
vation function:
$$\text{Probability} = \frac{1}{1 + e^{-(\text{Input}\times\text{Weight}+\text{Bias})}}$$
For multi-class classification, the output layer uses softmax activation.
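A sketch of this transfer-learning setup is shown below; the binary head is included and the multi-class head is indicated in a comment. The input shape, ImageNet weights, excluded top layer, and flattening step follow the text.

from tensorflow.keras import Model
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, Flatten

base = DenseNet121(weights="imagenet",      # pre-trained on ImageNet
                   include_top=False,       # top classification layer excluded
                   input_shape=(48, 48, 3))

features = Flatten()(base.output)           # dense-block output -> 1D vector
output = Dense(1, activation="sigmoid")(features)    # binary head
# For multi-class: output = Dense(8, activation="softmax")(features)

densenet_model = Model(inputs=base.input, outputs=output)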
Binary Classification: The binary DenseNet model is compiled using the Adam
optimizer with binary cross-entropy loss:
Binary Cross-Entropy Loss:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i$ are the true labels (0 or 1) and $\hat{y}_i$ are the predicted probabilities.
Training involves:
• Data Preparation: Data is shuffled, split into training, validation, and test sets, and augmented using an ImageDataGenerator.
Multi-Class Classification: The multi-class DenseNet model is compiled using the Adam optimizer with categorical cross-entropy loss:
$$L = -\sum_{i=1}^{N}\sum_{j=1}^{C} y_{i,j}\log(\hat{y}_{i,j})$$
where $y_{i,j}$ are the true labels (one-hot encoded) and $\hat{y}_{i,j}$ are the predicted probabilities.
Training involves:
• Data Preparation: Generators for the training, validation, and test datasets are created with augmentation.
3.3.3.1 K-Means Clustering
Clustering is a specialized area within machine learning focused on grouping data into homogeneous clusters based on shared characteristics. The K-means algorithm is a widely recognized unsupervised method used in clustering.
Unsupervised Machine Learning involves training a computer to work with unla-
beled and unclassified data, allowing the algorithm to function independently without
guidance. In this approach, the machine organizes the data based on similarities, pat-
terns, and variations without prior training on the data.
1. Initialization: Randomly select K data points from the dataset to serve as the initial centroids.
2. Assignment: For each data point, compute the distance to each of the K
centroids and assign the data point to the cluster with the nearest centroid. This step
creates K clusters.
3. Update Centroids: After all data points are assigned to clusters, recalculate
the centroids of each cluster by averaging the positions of all data points within the
cluster.
4. Repeat: Iterate through steps 2 and 3 until convergence is achieved, which oc-
curs when the centroids stabilize or a predetermined number of iterations is completed.
5. Final Result: Upon convergence, the algorithm produces the final centroids
and assigns each data point to a cluster.
The goal of this iterative procedure is to minimize the sum of distances between
data points and their assigned cluster centroids.
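As a sketch, this entire iterative procedure can be run with scikit-learn as below; X is assumed to be an (n_samples, n_features) array of image feature vectors, and n_init and random_state are illustrative settings.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # assignment/update steps run to convergence
centroids = kmeans.cluster_centers_   # final centroids after convergence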
1. Grouping Similar Data Points: K-Means groups data points with similar characteristics together, making it possible to uncover hidden patterns in the data.
1- Compute the Average Intra-Cluster Distance $a_i$: For a data point $i$, $a_i$ is the mean distance to all other points in its own cluster:
$$a_i = \frac{1}{|C_i| - 1}\sum_{\substack{j \in C_i \\ j \neq i}} d(i,j)$$
where $|C_i|$ is the number of points in the cluster $C_i$ containing point $i$, and $d(i,j)$ is the distance between points $i$ and $j$.
2- Compute the Mean Nearest-Cluster Distance $b_i$: For the same point, $b_i$ is the smallest mean distance from $i$ to the points of any other cluster:
$$b_i = \min_{C \neq C_i} \frac{1}{|C|}\sum_{j \in C} d(i,j)$$
where $C$ ranges over clusters other than $C_i$, and $d(i,j)$ is the distance between points $i$ and $j$.
3- Compute the Silhouette Score $s_i$: The silhouette score for a data point $i$ is given by
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $s_i$ ranges from −1 to +1. A score close to −1 suggests the data point may be improperly clustered, whereas a score near +1 indicates that the data point is well-clustered.
3.3.3.1.4 Selecting the Optimal K: The silhouette scores are compared for various values of K in order to find the optimal number of clusters, K.
1- Compute the Average Silhouette Score: For each candidate value of K, run K-Means and compute the mean silhouette score over all data points.
2- Identify the Best K: The optimal K is selected based on the highest average
silhouette score. This value indicates the number of clusters that provides the best
separation and cohesion, meaning that the clusters are well-defined and distinct from
one another.
By leveraging silhouette analysis, we can effectively evaluate and select the most
suitable number of clusters, ensuring that the K-Means clustering results are both
meaningful and robust.
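A sketch of this selection procedure is given below, assuming X holds the feature vectors of one class and that candidate values of K range from 2 to 6 (the range is an assumption).

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 7):                               # candidate values of K
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)             # average silhouette score
    if score > best_score:
        best_k, best_score = k, score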
• The scatter plot's data points will now be assigned to the nearest centroid or K-point; a median line will thus be drawn between the two centroids, as seen in Figure 3.10.
• As can be seen in Figure 3.10, the points on the left side of the line are closest to K1, the blue centroid, whereas the points on the right side are closest to the yellow centroid. To make them easier to see, we colour them blue and yellow, as Figure 3.11 illustrates.
• We then choose new centroids and repeat the process until the clusters stabilize. As illustrated in Figure 3.12, we compute the centres of gravity of the current clusters in order to select new centroids.
Figure 3.13: Centroid Recalculation Process
• Every data point will then be assigned to the new centroid. We go through the same steps again to obtain the median line, which will resemble what Figure 3.13 illustrates.
• Since reassignment has occurred, we again proceed to step 4, locating new centroids or K-points. The procedure is repeated to determine the centroids' centres of gravity, resulting in new centroids resembling those in Figure 3.15.
Figure 3.16: Updated Cluster Centroids
• Once we obtain the new centroids, we reassign the data points and redraw the median line. The result will resemble Figure 3.16.
• Figure 3.16 shows that there are no dissimilar data points on either side of the line, indicating that the clustering has converged. See Figure 3.17.
Figure 3.18: Converged Clustering Result
• Now that the model is complete, we can remove the assumed centroids, leaving the two final clusters depicted in Figure 3.18.
3.3.3.2 Outlier Detection Using the IQR Method
In this section, we employ the Interquartile Range (IQR) method to detect outliers within each cluster formed by the K-Means algorithm. Below is a detailed explanation of each step involved in the outlier detection process.
3.3.3.2.1 Distance Calculation: For each cluster, compute the Euclidean distance of each data point from the cluster centroid; this measures how far each point lies from the centre of its cluster. The distance $d$ of a point $x_i$ from the centroid $c$ is calculated as
$$d(x_i, c) = \sqrt{\sum_{j=1}^{n} (x_{ij} - c_j)^2}$$
where $x_{ij}$ and $c_j$ are the coordinates of the data point and the centroid, respectively.
• In Colab-Based Implementation:
The Euclidean distances are calculated using NumPy’s linear algebra functions
to find the norm of the difference between each data point and the centroid.
3.3.3.2.2 Compute Quartiles: Calculate the first quartile (Q1) and the third
quartile (Q3) of the distances. These quartiles represent the 25th and 75th percentiles
of the distance values, respectively.
• In Colab-Based Implementation:
NumPy’s percentile function is used to compute Q1 and Q3 for the distances of
data points in each cluster.
3.3.3.2.3 Calculate the IQR: Find the Interquartile Range (IQR), which is the difference between the third and first quartiles. The IQR measures the spread of the middle 50% of the data.
IQR = Q3 − Q1
• In Colab-Based Implementation:
The IQR is computed by subtracting Q1 from Q3 using basic arithmetic opera-
tions in NumPy.
3.3.3.2.4 Determine the Upper Bound: Compute the upper bound for detecting outliers. In my implementation, the upper bound is set to Q3 + 1 × IQR, a stricter threshold than the conventional Q3 + 1.5 × IQR. This threshold helps identify data points that lie significantly farther from the centroid than the majority of points.
3.3.3.2.5 Identify Outliers: Compare each distance with the upper bound. If the
distance of a point exceeds this threshold, it is marked as an outlier.
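Putting the five steps above together, a minimal NumPy sketch consistent with the Colab-based implementation notes is given below; the variable names X, labels, and centroids are assumptions carried over from the K-Means step.

import numpy as np

outlier_mask = np.zeros(len(X), dtype=bool)
for k, centroid in enumerate(centroids):
    idx = np.where(labels == k)[0]
    # Euclidean distance of every point in cluster k to its centroid
    distances = np.linalg.norm(X[idx] - centroid, axis=1)
    q1, q3 = np.percentile(distances, [25, 75])   # first and third quartiles
    iqr = q3 - q1
    upper_bound = q3 + 1 * iqr                    # threshold used in this study
    outlier_mask[idx] = distances > upper_bound   # flag points beyond the bound

X_refined = X[~outlier_mask]                      # dataset with outliers removed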
• Training Runs: Each model was trained across 10 different runs, with each run
consisting of up to 30 epochs and utilizing early stopping to prevent overfitting.
This approach ensured robustness and reliability of the results, balancing com-
putational efficiency with thorough exploration of different model configurations.
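A sketch of one such run is given below; model stands for whichever network is being trained, the generators are those described earlier, the patience value is an assumption, and the 30-epoch cap follows the text.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
history = model.fit(train_generator,
                    validation_data=val_generator,
                    epochs=30,                  # upper limit per run
                    callbacks=[early_stop])     # halts when val_loss stops improving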
3.3.3.3.2 Binary Classification:
• Output Layer: 1 unit with a sigmoid activation function for binary classification
• Optimizer: Adam
• Metrics: accuracy
• Batch Size: 16
4 RESULTS
Table 4.1. Accuracy per class across different folds, including average accuracy for
each class and the total average accuracy.
As we see in Table 4.1, for each fold and each class, I obtained different accuracies.
The highest accuracies for each class are as follows:
• Class 2: The highest accuracies were achieved in both Fold 3 and Fold 5, each
with a value of 93%.
• Class 3: The highest accuracies were observed in Folds 1, 3, and 5, all reaching
95%.
These results reflect the model’s performance variability across different folds and
highlight the folds where the model performed best for each class.
To visually represent these accuracies, refer to Figure 4.1, which illustrates the accuracy distributions across different folds for each class.
For DenseNet model training, images were resized to 64×64 pixels and normalized,
with labels one-hot encoded for each class. We used 5-fold cross-validation, further
splitting the data into training, validation, and test sets using a split of 70% for training,
10% for validation, and 20% for testing sets for each fold. Data augmentation with
ImageDataGenerator, including rescaling, was applied to enhance generalization. Each
model was trained for 10 epochs per fold.
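A sketch of this evaluation scheme is given below, assuming X and y_onehot are the prepared image and one-hot label arrays; each of the five folds re-splits the data 70/10/20 as described above.

from sklearn.model_selection import train_test_split

for fold in range(5):
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y_onehot, train_size=0.70, shuffle=True, random_state=fold)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=1 / 3, random_state=fold)  # 10% val, 20% test
    # A fresh DenseNet model is built and trained for 10 epochs on each fold.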
After training DenseNet, Table 4.2 shows that I obtained different accuracies for each fold and each class. The highest accuracies for each class are as follows:
• The highest accuracy for Class 1 was 89%, attained at Fold 5.
• At Folds 3, 4, and 5, the maximum accuracy for Class 2 was 88%.
• The maximum accuracy for Class 3 was 87%, attained at Folds 2 and 4.
• For Class 4, Folds 1 and 2 yielded the maximum accuracy of 88%.
• The maximum accuracy for Class 5 was attained at Fold 5, with an accuracy of 87%.
• The best accuracy for Class 6 was 85%, attained at Folds 1, 2, and 4.
• The maximum accuracy for Class 7 was attained at Fold 2, with an accuracy of 88%.
• For Class 8, Fold 3 yielded the highest accuracy of 87%.
To visually represent these accuracies, refer to Figure 4.2, which illustrates the accuracy distributions across different folds for each class:
CLASSES FOLD 1 FOLD 2 FOLD 3 FOLD 4 FOLD 5 Average
0 87% 87% 88% 87% 89% 87.6%
1 86% 87% 88% 88% 88% 87.4%
2 84% 87% 86% 87% 86% 86.0%
3 88% 88% 85% 87% 84% 86.4%
4 82% 86% 82% 78% 87% 81.0%
5 85% 85% 79% 85% 76% 80.0%
6 53% 88% 83% 87% 83% 78.8%
7 84% 84% 87% 78% 86% 83.8%
Total Average 83.5%
Table 4.2. Accuracy per class across different folds, including average accuracy for
each class and the total average accuracy.
We observe that CNN outperforms DenseNet when considering the overall average
accuracy across all classes. Specifically, CNN achieves an average accuracy of 91.1%,
while DenseNet falls short at 83.5%.
Looking closer at the class-wise average accuracies, the results further highlight
CNN’s superiority. For CNN, the average accuracies across classes 0 to 7 are 88.2%,
90.2%, 92.4%, 94.6%, 93.4%, 87.8%, 88.0%, and 92.0%, respectively. In contrast,
DenseNet’s average accuracies for the same classes are 87.6%, 87.4%, 86.0%, 86.4%,
81.0%, 80.0%, 78.8%, and 83.8%. These figures demonstrate that CNN consistently
achieves higher accuracy across most classes, highlighting its overall effectiveness in
binary classification tasks.
For the multi-class classification task on raw data, the following hyperparameters were set when configuring the CNN model:
• Conv2D Layers: The model included three convolutional layers with 32, 64, and
128 filters, respectively, each with a kernel size of (3, 3).
• Pooling: MaxPooling2D was applied after each convolutional layer with a pool
size of (2, 2) to reduce the spatial dimensions.
• Output Layer: A softmax activation function was used for the final layer to
classify the images into the respective classes.
• Optimizer: Adam.
• Metrics: Accuracy.
The CNN achieved a test accuracy of 74%. To visualize the model’s performance,
figures of CNN accuracy and CNN loss over the epochs are provided in figure 4.3.
After working with the CNN model, I proceeded with training a DenseNet model
for the multi-class classification task. The dataset was consistent with the previous
setup.
The following hyperparameters were set when configuring the DenseNet model:
• Global Average Pooling: Applied after the base model to reduce the feature
dimensions.
• Output Layer: A softmax activation function was used for the final layer to
classify the images into the respective classes.
• Optimizer: Adam.
• Loss Function: Categorical cross-entropy.
• Metrics: Accuracy.
The DenseNet achieved a test accuracy of 84%. To visualize the model's performance, figures of DenseNet accuracy and DenseNet loss over the epochs are provided in Figure 4.4.
With an overall accuracy of 84%, the DenseNet model outperformed the CNN model
in the multi-class classification setting. By contrast, the accuracy of the CNN model
was 74%. This outcome demonstrates how well DenseNet manages the complexity of
multi-class classification tasks.
IQR = Q3 - Q1
upper_bound = Q3 + 1 * IQR
outliers = distances > upper_bound
Points with distances exceeding this upper bound were marked as outliers and re-
moved from the dataset.
The refined dataset, free from outliers, was then used for binary and multi-class
classification tasks with CNN and DenseNet models to ensure more accurate and reli-
able classification results.
The plots shown are the "Distance Distribution Plot with Outlier Threshold" for Cluster 0 and Cluster 1 of Class 1, as illustrated in Figures 4.6 and 4.7.
Figure 4.6: Distance Distribution Plot with Outlier Threshold
From these plots, it is evident that distances exceeding the red dotted line are con-
sidered outliers. After applying the outlier detection process, 29 points were identified
as outliers and removed from the dataset. Consequently, 971 points remained for class
1, which is designated as "dyed and lifted polyps."
To find the optimal number of clusters, the silhouette score for each class was computed.
Below are the silhouette score plots for each class:
The K-Means clustering algorithm was used to determine the optimal number of clus-
ters for each class, and the distances between each data point and the centroid of its
cluster were then calculated. Then, to find outliers, the Interquartile Range (IQR)
approach was applied. The following figures display the "Distance Distribution Plot
with Outlier Threshold" for each class:
Figures 4.15–4.28: Distance Distribution Plots with Outlier Threshold for Classes 2 through 8 (two plots per class).
In all figures, the red dotted line represents the threshold beyond which points are
considered outliers. Distances greater than this threshold are marked as outliers and
are removed from the dataset.
For each class, the number of outliers identified and removed is as follows:
• Class 2: 7
• Class 3: 67
• Class 4: 38
• Class 5: 61
• Class 6: 43
• Class 7: 47
• Class 8: 50
In total, 342 outlier points were identified and removed across all 8 classes. After
excluding these outliers, the refined dataset comprises 7,658 data points out of the
original 8,000. This refined dataset was then used for subsequent classification tasks.
The rest of the hyperparameters, including Conv2D filters, pooling layers, Dense
layers, optimizer, loss function, metrics, batch size, and early stopping criteria, are the
same as in the raw training scenario.
Table 4.3. Accuracy per class across different folds, including average accuracy for
each class and the total average accuracy.
Table 4.3 shows that the accuracies varied for each class across different folds. The
highest accuracies for each class are summarized as follows:
• Class 0: The highest accuracy was achieved in Folds 1, 4, and 5, each with a
value of 89%.
• Class 2: The highest accuracies were achieved in both Fold 2 and Fold 5, each
with a value of 94%.
• Class 3: The highest accuracies were observed in Folds 1, 2, 3, and 5, all reaching
95%.
• Class 4: The highest accuracy reached 99%.
These findings show how the model’s performance varies at various folds and em-
phasize the folds where the model achieved the highest accuracy for each class.
For a visual representation of these accuracies, see Figure 4.29, which shows the distribution of accuracy across the different folds for each class.
Figure 4.29: CNN Test Accuracies Per Class
After the Convolutional Neural Network (CNN) was trained on the refined dataset for binary classification, DenseNet was used in the analysis. The DenseNet model was configured with an input shape of (48, 48, 3) at this stage. It used DenseNet121 as its base, pre-trained on ImageNet with the top layer omitted, followed by a Dense layer of 128 units with ReLU activation. For binary classification, dropout was set to 0.5 and the output layer contained one unit with a sigmoid activation function. The model employed the binary cross-entropy loss function and the Adam optimiser, with accuracy as the evaluation metric. Training ran for three epochs with a batch size of 4288. ImageDataGenerator was used to augment the data with rescaling, shearing, zooming, and horizontal flipping. For evaluation, the dataset was divided into training, validation, and test sets, and 5-fold cross-validation was applied. The remaining hyperparameters (optimiser, loss function, metrics, and dropout rate) align with those used in the raw training scenario.
Table 4.4 illustrates the variation in accuracies for each class across different folds.
The summary of the highest accuracies achieved for each class is as follows:
• Class 0: The highest accuracy of 0.8731 was consistently achieved across all
folds (Folds 1, 2, 3, 4, and 5).
Table 4.4. Accuracy per class across different folds, including average accuracy for
each class and the total average accuracy.
• Class 2: The highest accuracy of 0.8787 was achieved in Fold 4 and Fold 5.
• Class 3: The highest accuracy of 0.8750 was consistently observed across Folds
1, 2, 4, and 5.
• Class 5: The highest accuracy of 0.8750 was consistent across all folds (Folds 1,
2, 3, 4, and 5).
• Class 7: Folds 1, 2, and 5 had the highest accuracy, which was 0.8759.
Figure 4.30: DENSENET Test Accuracies Per Class
For binary classification on refined data, CNN outperforms DenseNet when considering
the overall average accuracy across all classes. CNN achieves an average accuracy of
91.0%, while DenseNet falls behind at 73.1%. Examining class-wise average accuracies,
CNN consistently shows strong performance across all classes, with accuracies ranging
from 88.6% to 95.2%. In contrast, DenseNet’s performance is notably inconsistent,
with accuracies varying significantly between 44.6% and 88.0%, indicating a less reliable
classification across different classes.
Table 4.5. Accuracy across 10 training runs for the multi-class CNN on refined data.
Runs     Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Run 8  Run 9  Run 10
Accuracy 75%    71%    78%    75%    75%    75%    73%    72%    71%    75%
Here we can see from Table 4.5 that Run 3 achieved the highest accuracy of 78%.
Figure 4.31: Training and validation accuracy and loss for Run 3.
After training the CNN, the multi-class classification task was carried out using DenseNet on the refined data. Images were resized to (48, 48, 3), matching the size used for the CNN in this scenario. The DenseNet model was trained for up to 30 epochs with early stopping, and the training process was repeated for 10 runs. Accuracy was plotted for each run to assess performance. All other hyperparameters, including the optimizer, loss function, and model architecture, are consistent with the configurations used in the raw data scenario.
Runs     Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Run 8  Run 9  Run 10
Accuracy 67%    69%    61%    79%    73%    69%    55%    66%    75%    67%
For the multi-class classification task using refined data, the accuracy achieved by the Convolutional Neural Network (CNN) was 78% at Run 3, whereas the DenseNet model achieved a slightly higher accuracy of 79% in Run 4. This result indicates competitive performance between the two models, with DenseNet showing a marginal advantage in this scenario. The performance of both models highlights their effectiveness in handling multi-class classification problems when refined data is utilized.
When evaluating the performance of CNN and DenseNet in both binary and multi-class
classification tasks on refined data, distinct differences in overall average accuracy are
observed.
For CNN, the binary classification scenario demonstrates superior performance com-
pared to the multi-class classification task. The CNN achieves an overall average accu-
racy of 91.0% in binary classification, which is significantly higher than its performance
in multi-class classification, where the overall accuracy is 78% in Run 3. This indicates
that CNN performs notably better in binary classification scenarios on refined data,
reflecting its strength in distinguishing between two classes.
In the case of DenseNet, binary classification yields an overall average accuracy of 73.1%, while its multi-class performance is slightly better, with an accuracy of 79% in Run 4. Although DenseNet performs better in multi-class classification than in its binary scenario, the difference is relatively modest compared with the contrast observed for CNN.
• Darkened Images: Some images appear significantly darker than their non-
outlier counterparts. This reduction in brightness can obscure important details,
making it challenging to accurately assess the condition of the gastrointestinal
tract.
• Slight Blurring: A number of outlier images are slightly blurred, which dimin-
ishes the clarity of the visual information. Blurred images may hinder the precise
identification of features and conditions.
• Polyp Removal Class: Specifically for the "Polyp Removal" class, the outlier
images exhibit significant issues that impede the visibility of critical details. In
these images, it is difficult to discern whether the polyps have been properly dyed
or removed, making it challenging to evaluate the effectiveness of the removal
procedure.
To illustrate these issues, two sample outlier images from each category are shown
in Figure 4.7. This figure presents representative examples of the visual anomalies
observed across the different categories, providing a clear view of the types of challenges
encountered with outlier data.
These visual anomalies highlight the limitations and challenges of working with
outlier data, underscoring the importance of ensuring high-quality, well-lit, and sharp
images for accurate medical analysis.
Figure 4.7: Sample outlier images (two per category) for Normal Z-Line, Normal Pylorus, Normal Cecum, Esophagitis, Polyps, and Ulcerative Colitis.
Conclusions
The analysis of model performance on both raw and refined datasets reveals notable
insights into the effectiveness of data refinement through outlier detection.
For binary classification tasks, CNN demonstrated consistent performance, achieving a high accuracy of 91.1% with raw data and a nearly identical 91.0% with refined data. This indicates that the CNN model's ability to classify binary categories is robust and essentially unaffected by the refinement process.
In contrast, DenseNet showed a significant drop in binary classification accuracy,
falling from 83.5% with raw data to 73.1% with refined data. This decline suggests
that the data refinement process, which involves outlier detection, adversely impacted
DenseNet’s performance in binary classification.
For multi-class classification, DenseNet initially outperformed CNN with an accu-
racy of 84% on raw data, compared to CNN’s 74%. However, after data refinement,
DenseNet’s accuracy decreased to 79%, while CNN’s accuracy improved slightly to 78%.
Despite this improvement, CNN still did not surpass DenseNet’s original multi-class
performance.
In summary, the refinement process through outlier detection had no significant impact on CNN's binary classification accuracy and produced only a modest improvement in its multi-class accuracy. Conversely, it led to a notable decrease in DenseNet's performance across both classification tasks. Overall, the refinement process showed limited benefits for CNN and detrimental effects for DenseNet, and did not significantly enhance the overall performance of the models.
Bibliography
[1] Shibin Wu, Ruxin Zhang, Jiayi Yan, Chengquan Li, Qicai Liu, Liyang Wang,
and Haoqian Wang. High-speed and accurate diagnosis of gastrointestinal disease:
Learning on endoscopy images using lightweight transformer with local feature
attention. Bioengineering, 10(12):1416, 2023.
[2] Ibtesam M Dheir and Samy S Abu-Naser. Classification of anomalies in gastroin-
testinal tract using deep learning. 2022.
[3] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada
Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien
Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, et al. Kvasir: A multi-class
image dataset for computer aided gastrointestinal disease detection. In Proceedings
of the 8th ACM on Multimedia Systems Conference, pages 164–169, 2017.
[4] Yan Gao, Weining Lu, Xiaobei Si, and Yu Lan. Deep model-based semi-supervised
learning way for outlier detection in wireless capsule endoscopy images. IEEE
Access, 8:81621–81632, 2020.
[5] Dimitris K Iakovidis, Spiros V Georgakopoulos, Michael Vasilakakis, Anastasios
Koulaouzidis, and Vassilis P Plagianakos. Detecting and locating gastrointestinal
anomalies using deep learning and iterative cluster unification. IEEE transactions
on medical imaging, 37(10):2196–2210, 2018.
[6] Abel Kahsay Gebreslassie, Misgina Tsighe Hagos, et al. Automated gastrointestinal
disease recognition for endoscopic images. In 2019 International Conference on
Computing, Communication, and Intelligent Systems (ICCCIS), pages 312–316.
IEEE, 2019.
[7] Muhammad Ramzan, Mudassar Raza, Muhammad Imran Sharif, and Seifedine
Kadry. Gastrointestinal tract polyp anomaly segmentation on colonoscopy images
using graft-u-net. Journal of Personalized Medicine, 12(9):1459, 2022.
[8] Sami H Ismael, Shahab W Kareem, and Firas H Almukhtar. Medical image
classification using different machine learning algorithms. AL-Rafidain Journal of
Computer Sciences and Mathematics, 14(1):135–147, 2020.
[9] Marc D Kohli, Ronald M Summers, and J Raymond Geis. Medical image data
and datasets in the era of machine learning—whitepaper from the 2016 c-mimi
meeting dataset session. Journal of digital imaging, 30:392–399, 2017.
[10] Marzieh Esmaeili, Amirhosein Toosi, Arash Roshanpoor, Vahid Changizi, Marjan
Ghazisaeedi, Arman Rahmim, and Mohammad Sabokrou. Generative adversarial
networks for anomaly detection in biomedical imaging: A study on seven medical
image datasets. IEEE Access, 11:17906–17921, 2023.
[11] Yu Cai, Weiwen Zhang, Hao Chen, and Kwang-Ting Cheng. Medianomaly:
A comparative study of anomaly detection in medical images. arXiv preprint
arXiv:2404.04518, 2024.
[12] Mengfang Li, Yuanyuan Jiang, Yanzhou Zhang, and Haisheng Zhu. Medical image
analysis using deep learning algorithms. Frontiers in Public Health, 11:1273253,
2023.
[13] Jamal N Hasoon, Ali Hussein Fadel, Rasha Subhi Hameed, Salama A Mostafa,
Bashar Ahmed Khalaf, Mazin Abed Mohammed, and Jan Nedoma. Covid-19
anomaly detection and classification method based on supervised machine learning
of chest x-ray images. Results in Physics, 31:105045, 2021.
[14] Alexander P Abadir, Mohammed Fahad Ali, William Karnes, and Jason B Sama-
rasena. Artificial intelligence in gastrointestinal endoscopy. Clinical endoscopy,
53(2):132–141, 2020.
[15] Justin Ker, Lipo Wang, Jai Rao, and Tchoyoson Lim. Deep learning applications
in medical image analysis. IEEE Access, 6:9375–9389, 2017.
[16] Ashish Jain, Rohit Singh, and Priyanka Singh. Enhancing outlier detection and
dimensionality reduction in machine learning for extreme value analysis. Int. J.
Advanced Networking and Applications, 15(06):6204–6210, 2024.
[17] N Sri Krishna, YV Pavan Kumar, K Purna Prakash, and G Pradeep Reddy.
Machine learning and statistical techniques for outlier detection in smart home
energy consumption. In 2024 IEEE Open Conference of Electrical, Electronic and
Information Sciences (eStream), pages 1–4. IEEE, 2024.
[22] AM Abu Nada, Eman Alajrami, Ahmed A Al-Saqqa, and Samy S Abu-Naser. Age
and gender prediction and validation through single user images using cnn. Int.
J. Acad. Eng. Res.(IJAER), 4:21–24, 2020.
[23] Mukrimah Nawir, Amiza Amir, Ong Bi Lynn, Naimah Yaakob, and
R Badlishah Ahmad. Performances of machine learning algorithms for binary
classification of network anomaly detection system. In Journal of Physics: Con-
ference Series, volume 1018, page 012015. IOP Publishing, 2018.
[24] Samir S Yadav and Shivajirao M Jadhav. Deep convolutional neural network based
medical image classification for disease diagnosis. Journal of Big data, 6(1):1–18,
2019.
[25] Livia Faes, Siegfried K Wagner, Dun Jack Fu, Xiaoxuan Liu, Edward Korot,
Joseph R Ledsam, Trevor Back, Reena Chopra, Nikolas Pontikos, Christoph Kern,
et al. Automated deep learning design for medical image classification by health-
care professionals with no coding experience: a feasibility study. The Lancet Digital
Health, 1(5):e232–e242, 2019.
[26] Zahra Amiri, Arash Heidari, Nima Jafari Navimipour, Mehmet Unal, and Ali
Mousavi. Adventures in data analysis: A systematic review of deep learning
techniques for pattern recognition in cyber-physical-social systems. Multimedia
Tools and Applications, 83(8):22909–22973, 2024.
[27] Joost N Kok, Jacek Koronacki, Ramon Lopez de Mantaras, Stan Matwin, and
Dunja Mladenic. Knowledge Discovery in Databases: PKDD 2007: 11th European
Conference on Principles and Practice of Knowledge Discovery in Databases, War-
saw, Poland, September 17-21, 2007, Proceedings, volume 4702. Springer Science
& Business Media, 2007.
[29] Zhiwen Huang, Xingxing Zhu, Mingyue Ding, and Xuming Zhang. Medical image
classification using a light-weighted hybrid neural network based on pcanet and
densenet. IEEE Access, 8:24697–24712, 2020.
[30] Ziliang Zhong, Muhang Zheng, Huafeng Mai, Jianan Zhao, and Xinyi Liu. Cancer
image classification based on densenet model. In Journal of physics: conference
series, volume 1651, page 012143. IOP Publishing, 2020.
[31] I Gethzi Ahila Poornima and B Paramasivan. Anomaly detection in wireless sensor
network using machine learning algorithm. Computer communications, 151:331–
337, 2020.
[32] Harry Pratt, Frans Coenen, Deborah M Broadbent, Simon P Harding, and Yalin
Zheng. Convolutional neural networks for diabetic retinopathy. Procedia computer
science, 90:200–205, 2016.
[33] K Hanumantha Rao, G Srinivas, Ankam Damodhar, and M Vikas Krishna. Imple-
mentation of anomaly detection technique using machine learning algorithms. In-
ternational journal of computer science and telecommunications, 2(3):25–31, 2011.
[34] Malinda Vania, Bayu Adhi Tama, Hasan Maulahela, and Sunghoon Lim. Re-
cent advances in applying machine learning and deep learning to detect upper
gastrointestinal tract lesions. IEEE Access, 2023.
[36] Thomas De Lange, Pål Halvorsen, and Michael Riegler. Methodology to develop
machine learning algorithms to improve performance in gastrointestinal endoscopy.
World journal of gastroenterology, 24(45):5057, 2018.
[37] Timothy Cogan, Maribeth Cogan, and Lakshman Tamil. Mapgi: Accurate identi-
fication of anatomical landmarks and diseased tissue in gastrointestinal tract using
deep learning. Computers in biology and medicine, 111:103351, 2019.
[38] Jingwei Song, Mitesh Patel, Andreas Girgensohn, and Chelhwon Kim. Combining
deep learning with geometric features for image-based localization in the gastroin-
testinal tract. Expert Systems with Applications, 185:115631, 2021.
[39] Girik Pachauri and Sandeep Sharma. Anomaly detection in medical wireless sensor
networks using machine learning algorithms. Procedia Computer Science, 70:325–
333, 2015.
[40] R. Vijaya Kumar Reddy et al. Machine learning based outlier detection for med-
ical data. Indonesian Journal of Electrical Engineering and Computer Science,
24(1):564–569, 2021.