Received 10 February 2023, accepted 27 February 2023, date of publication 6 March 2023, date of current version 15 March 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3253022
PAACDA: Comprehensive Data Corruption
Detection Algorithm
CHARVI BANNUR 1 , (Student Member, IEEE), CHAITRA BHAT1 , (Student Member, IEEE),
KUSHAGRA SINGH1 , (Student Member, IEEE),
SHRIRANG AMBAJI KULKARNI 2 , (Senior Member, IEEE),
AND MRITYUNJAY DODDAMANI 3
1 Department of Computer Science and Engineering, People's Education Society, Bengaluru 560085, India
2 Department of Computer Science and Engineering, National Institute of Engineering, Mysore 570008, India
3 School of Mechanical and Materials Engineering, Indian Institute of Technology–Mandi, Mandi, Himachal Pradesh 175075, India
Corresponding author: Shrirang Ambaji Kulkarni ([email protected])
ABSTRACT With the advent of technology, data and its analysis are no longer just values and attributes strewn across spreadsheets; they are now seen as a stepping stone to bring about revolution in any significant
field. Data corruption can be brought about by a variety of unethical and illegal sources, making it crucial
to develop a method that is highly effective to identify and appropriately highlight the various corrupted
data existing in the dataset. Detection of corrupted data, as well as recovering data from a corrupted dataset,
is a challenging problem that requires utmost attention; if not addressed at earlier stages, it may pose problems in later stages of data processing with machine or deep learning algorithms. In the following work
we begin by introducing the PAACDA: Proximity based Adamic Adar Corruption Detection Algorithm
and consolidating the results whilst particularly accentuating the detection of corrupted data rather than
outliers. Current state-of-the-art models, such as Isolation Forest, DBSCAN (also called ''Density-Based Spatial Clustering of Applications with Noise'') and others, are reliant on fine-tuning parameters to provide high accuracy and recall, but they also have a significant level of uncertainty when factoring in the corrupted
data. In the present work, the authors look into the most niche performance issues of several unsupervised
learning algorithms for linear and clustered corrupted datasets. Also, a novel PAACDA algorithm is proposed
which outperforms other unsupervised learning benchmarks on 15 popular baselines including K-means
clustering, Isolation forest and LOF (Local Outlier Factor) with an accuracy of 96.35% for clustered data
and 99.04% for linear data. This article also conducts a thorough exploration of the relevant literature from
the previously stated perspectives. In this research work, we pinpoint all the shortcomings of the present
techniques and draw direction for future work in this field.
INDEX TERMS Adamic Adar algorithm, corrupted datasets, outlier detection, probabilistic models,
statistical models, unsupervised learning.
I. INTRODUCTION
Ever since technological evolution dawned upon humankind
there has been massive progress in about every domain that
the human mind can perceive. The major credit for driving the
ongoing technological advancement lies in intensive amounts
of data, without which the majority of this industry might
come to a standstill [1]. Data seems to have reached such a
significant level of importance that companies possessing larger amounts of data are often deemed to have
a monopoly in that sector. Data often lays the foundation
for the development, growth and maturity of an algorithm
or technology. In today’s world data is significant to all
organizations and thereby it becomes all the more crucial to
protect this critical entity from being manipulated by malicious means [1].
A dataset can undergo a snowball effect with just a few
changes, which could ultimately be detrimental. Even though
there are multiple unethical ways to corrupt data, persistent research has been conducted over time to identify efficient ways to learn about data corruption; a few commendable works in detecting data corruption include [2], [3], [4], and [5]. Before figuring out a unique approach
for identifying data corruption we delved deep into other
pre-existing methods available for the detection of data corruption primarily concerning outliers. The deep study of
various approaches provided us with insightful knowledge of
various algorithms having varying levels of accuracy when
tested against the dataset containing corrupted data rather
than outliers.
K-means clustering uses clusters and their centroids as
part of an unsupervised technique to address issues with
categories and their classification [6], [7]. DBSCAN is another clustering-based technique; it tends to perform well with data containing clusters of similar density, as it finds core samples of high density and expands
upon them. When applied to identify outliers in the supplied
dataset, both of these methods provided a satisfactory level of
accuracy [8], [9].
Moving along the research we started exploring methodologies such as Isolation forest, Elliptic envelope outlier
detection and histogram-based outlier detection. Isolation
forest provides us with an algorithm through which we can
partition the dataset features in order to identify the outliers
which exceed the defined range [10], [11]. The Elliptic envelope model tends to create an ellipse around the scatter plot
for the dataset and all points lying outside its boundaries
signify the outliers present in the dataset [12], [13]. Another
approach involving plotting and analyzing histograms is the
Histogram based algorithm for outlier detection (HBOD)
method which is also an effective unsupervised method to
detect anomalies. These algorithms also have a fairly decent
level of accuracy in terms of identifying anomalies in the
dataset [14], [15].
Algorithms such as ‘Principal Component Analysis’
(PCA), 'DeepSVDD' and 'Rotation based Outlier Detection' (ROD) were also looked at to tabulate the level of accuracy in predicting the outliers for the synthetically generated
dataset. To name a few PCA [16], [17], ROD [18], [19], Local
Outlier Factor [20], [21], DeepSVDD [22], [23] and more
were used. In spite of the distinct strategies given out by
various models, the unique methods proposed as a part of this
research stood out in the following metrics: Accuracy, Recall, Precision, Sensitivity and F1 score.
Adamic Adar is a promising algorithm for data correlation in graph networks and hence has increasing potential in data corruption detection. The study mentioned above
led to the realization that it is feasible to avoid the current
work’s inefficiencies. The motivation behind this work would
be to consolidate the current work in this field of study and
enhance the accuracy of the present corruption detection algorithm by leveraging the Adamic Adar algorithm's prominence in data correlation. The novel method proposed as a part of the research largely revolves around a graph-based algorithm, Adamic Adar. Adamic Adar gives us access to the Adamic
Adar index, which aids in anticipating links, particularly in
areas like social networks [24]. By taking into account how
many common links there are between two nodes, the Adamic
Adar index is determined [25]. The research puts forth a
modified approach of Adamic Adar called PAACDA (Proximity based Adamic Adar Corruption Detection Algorithm)
which, when put into use to detect data corruption, provides
us with the best accuracy compared to the above-mentioned
algorithms.
After a deep study involving the existing methods for the
detection of corruption and the novel method presented as
a part of this research work, attention was diverted towards
figuring out feasible methods to revert back to the original
data for the ones deemed as corrupt. However, this is beyond
the scope of this study. The linear regression approach works considerably well on two-attribute datasets for data regeneration; however, most datasets deal with humongous amounts of data consisting of multiple features. GANs (Generative Adversarial Networks) can be a plausible approach to regenerating corrupted data using the generator and discriminator model [26], [27]. However, utilizing the various evolved forms of GAN, mainly tabular GANs, in order to address the problems of regeneration of contaminated data still remains
unexplored.
The remainder of this article is divided into the following sections. After highlighting some related work about
the approaches taken into consideration for this study in
Section II, we go on to illustrate the data and methods used
in Section III, as well as the proposed methodology to tackle
the issue in Section IV. In Section V, we put forth the results
that indicate the cogency of our strategies, and in Section VI,
we present the conclusions and future scope of this research.
II. RELATED WORK
A key area of research that has numerous practical applications is anomaly identification in a given dataset. As a result,
this topic has frequently been the focus of research. Multiple
approaches utilizing various aspects of the dataset have been
proposed to detect anomalies; however, only a few methodologies lay emphasis on the detection of corrupted data which
would further provide the most efficient results with respect
to varying dataset sizes, higher dimensionality or varying
degrees of corruption present. A study by Chandola et al.
in their publication [2] compares numerous anomaly detection methods for diverse applications. By contrasting the
benefits and drawbacks of various techniques, Hodge and
Austin [28] conducted a review of outlier detection methods. An overview of cutting-edge methods for spotting suspicious behaviour is presented by Patcha and Park [29] and Jiang et al. [30], together with detection scenarios for several
real-world settings.
Dimensionality reduction approaches and the underlying mathematical understandings are categorized by
Sorzano et al. [31]. The issues with anomaly detection are
further laid out by a number of other reports, including papers
by Gama et al. [32], Gupta et al. [33], Heydari et al. [34],
Jindal and Liu [35], and many more.
Outliers make up the majority of anomalies that can exist
in a dataset. The first method based on distance for detection of outliers was put forth by Knorr et al. [36], and
Ramaswamy et al. [37] expanded on it by suggesting that
the greatest n locations with the highest Pk be deemed outliers (Pk(p) signifies the distance to the kth nearest neighbour corresponding
to p). They used a clustering technique to separate a dataset
into several categories. To improve the success of outlier
detection for these groups, batch processing and pruning
may be beneficial [38]. Deviation-based outlier detection was
another method that was suggested for effectively detecting
outliers. Objects or data points that vary significantly from
the bulk of data points constitute outliers. Therefore, outliers
are frequently called deviations [39] as given by the name
deviation-based outlier detection.
Several other methods have been invented over the years to detect anomalies; to name one, Breunig et al. [21] developed
a method based on density. Cluster-based anomaly identification methods pinpointed anomalies by eliminating clusters
from the actual dataset [40] or by classifying small clusters
as outliers [41]. Additionally, Aggarwal and Yu [42] proposed a novel strategy for catching outliers that is remarkably effective for extremely high-dimensional datasets.
Their methodology focuses on finding locally sparse lower
dimensional projections which are otherwise difficult to differentiate using brute force methods due to the vast amount
of possible combinations. However, the study is inclined
towards detection of outliers and does not focus on the detection of corrupted or modified datasets.
Li et al. [43] in their paper proposed a unique outlier detection approach called ‘Empirical Cumulative
distribution-based Outlier Detection’ (ECOD). This method
uses the empirical cumulative distribution to measure outlier
values present in the dataset. They extensively applied it
to 30 datasets which showed that ECOD outperformed the
existing state-of-the-art models as it is fast and scalable. However,
the method doesn't deal with an outlier that might not be in either the left or right tail, and it leaves readers to come up with another promising route, that is, to find a mechanism to
expand ECOD to such environments while keeping it quick
and scalable.
The bulk of them, meanwhile, are primarily focused on
outlier identification without paying much attention to data
that contains corrupted values. Many cutting edge poisoning
and outlier identification practices have been developed and
they may generally belong to one of the following categories:
distribution based [43], [44], [45], depth based [46], distance based [47], [48], [49], density based [50], [51], cluster based [52], [53], [54] and generative models [55].
Thus in this work the following research objectives have
been addressed:
• To explore the efficacy of various unsupervised models
to detect the corrupted data in an efficient manner and
provide a comprehensive and detailed review.
• To propose a novel method PAACDA, an unsupervised model to detect corrupted data in a more accurate
manner.
Despite the many different approaches that have been
suggested, each of which has its own set of benefits and
downsides, the search for the ideal, all-encompassing algorithm never seems to stop. However, here we present a novel
practice that, when thoroughly compared to prior cutting-edge
approaches and evaluated against various sets of data sizes,
yields satisfactory to superior results.
III. MATERIALS AND METHODS
In this section we put forward our proposed approach to
pursue the data corruption detection problem. We elucidate
the process in detail in the subdivisions below. An overview
of our approach is illustrated in FIGURE 1.
FIGURE 1. Illustration of proposed methodology.
A. EXPERIMENTAL ENVIRONMENT
All the tests were implemented on Google Colab; Python 3.7.13 [62] was used for implementing all the
algorithms. We used Keras [63] backend as the deep learning
framework. We plan to make the research and dataset fully
reproducible on GitHub to the research community.
B. DATASETS AND CORRUPTION TECHNIQUE
The scikit-learn library for Python was used to create the
synthetic datasets utilised in the following study. The datasets
are linear and clustering in nature. Testing must be done on a
wide variety of data and corruption rates in order to detect
corrupted data in datasets that have been tainted. We use
univariate data produced especially for this research paper.
The authors would like to mention that this research work
focuses solely on corrupted data and not just outliers and
anomalies unlike the work mentioned above. However, the
corrupted data may contain outliers and anomalies as part
of the corruption. In light of this, we will now provide a
description of our curated dataset and a discussion of the
curation methods.
The linear dataset has the following parameters:
• Number of features =1
• Noise = 10
• The graph for the same is shown in FIGURE 2.
The clustering dataset has the following parameters:
• Number of features = 2
• Number of centres = 5
• The graph for the same is shown in FIGURE 3.
TABLE 1. A review of prominent data corruption detection techniques based on different algorithms.
This is extended to more complex situations, including
replacing the original value with exaggerated and incorrect
ones. To explore in more detail, a piece of code is used to
randomly select cells with random rows and columns and
replace them with a completely random value. The final contaminated dataset is then retrieved, and different techniques
are applied to precisely predict and address the corrupted
values.
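To make this curation concrete, the sketch below shows one way such datasets could be generated and corrupted; the make_regression/make_blobs calls and the corrupt helper are our illustrative assumptions under the parameters listed above, not the authors' exact script.

```python
import numpy as np
from sklearn.datasets import make_regression, make_blobs

rng = np.random.default_rng(42)

# Linear dataset: 1 feature, noise = 10 (parameters listed above).
X_lin, y_lin = make_regression(n_samples=10_000, n_features=1,
                               noise=10.0, random_state=42)

# Clustering dataset: 2 features, 5 centres.
X_clu, _ = make_blobs(n_samples=10_000, n_features=2, centers=5,
                      random_state=42)

def corrupt(data, rate=0.2):
    """Overwrite a random fraction of cells with an exaggerated value, NaN or 0."""
    out = data.astype(float).copy()
    flat = out.ravel()
    lo, hi = np.nanmin(flat), np.nanmax(flat)
    picked = rng.choice(flat.size, size=int(rate * flat.size), replace=False)
    for i in picked:
        flat[i] = rng.choice([rng.uniform(lo, hi) * 10, np.nan, 0.0])
    return out

X_lin_corrupt = corrupt(X_lin, rate=0.2)   # 20% corruption level
X_clu_corrupt = corrupt(X_clu, rate=0.2)
```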
FIGURE 2. Representing linear data generated for small dataset.
1) REALISTIC DATA
The authors also conducted the data corruption detection
analysis on a Realistic dataset to improve the practicality of
the proposed methodology. The dataset used here is a standard corruption detection dataset - ‘‘The Complete Pokemon
dataset’’, with 802 instances and a 3% corruption rate. The
dataset was not additionally or synthetically corrupted. The
various corruption detection models along with PAACDA
were applied to precisely predict the corrupted values.
FIGURE 3. Representing clustering data generated for small dataset.
The datasets are of 3 different sizes:
• 10,000 samples (Small)
• 40,000 samples (Medium)
• 75,000 samples (Medium-Large)
The datasets are further categorised on the basis of the percentage of corruption:
• 20 % corrupted values
• 40 % corrupted values
• 60 % corrupted values
TABLE 2. Summarizing the different types of dataset and corruption levels used.
Thus in the present work, a total of 18 datasets of varying
sizes and corruptions were used to demonstrate the impact on
the 16 underlying models and our proposed Proximity based
Adamic Adar Corruption Detection Algorithm (PAACDA).
The data points in these provenant datasets are corrupted
at random in every conceivable way, including by substituting fake values for actual ones, outliers, and missing data
(0 or NaN). The most fundamental form of data corruption includes deleting the data from the datasets, which is
similar to how lost system data is frequently experienced.
C. METHODS
1) LOCAL OUTLIER FACTOR
LOF is a type of density-based system [64]. Outliers are segregated in density-based [65], [66] systems because anomalies emerge in low-density areas [67]. LOF compares a location’s local density to that of k of its neighbours, indicating
points with considerably lower density than their neighbours.
Breunig et al. [21] in their work emphasised LOF as quite promising, as it can detect relevant local outliers that earlier techniques could not find. They demonstrated that
their strategy of discovering local outliers is effective for the
datasets with closest neighbour searches. For other objects,
they provide strict upper and lower constraints on the value
of LOF, irrespective of whether the MinPts nearest neighbours are
from single or several clusters. In addition, they investigated the effect of the MinPts parameter on the value of LOF. The
experimental findings show that their heuristic is effective.
Eq. (1) calculates Average RD(Reachability Density) and
Eq. (2) calculates Local RD which is reciprocal of RD.
$$\mathrm{RD}(A) = \frac{1}{k}\sum \max\big(k\text{th\_distance}(A\text{'s neighbour}),\; \mathrm{distance}(A, k\text{th neighbour})\big) \quad (1)$$
$$\mathrm{LRD} = \frac{1}{\mathrm{RD}} \quad (2)$$
Furthermore, we obtain LOF as in Eq. (3), using which the
points are classed as an outlier(-1) or not (1) [68].
$$\mathrm{LOF}(A) = \frac{1}{k}\sum_{i=0}^{k}\frac{\mathrm{LRD}(i)}{\mathrm{LRD}(A)} \quad (3)$$
Typically, we recognise A as an anomaly when its local density is lower than that of its k neighbours [69], [70], i.e. when LOF > 1.1, albeit this depends on the context.
Lee and Tukhvatov [71] further proposed three augmentation schemes, LOF', LOF'' and GridLOF, which optimise the known state-of-the-art LOF model. By offering a new computation technique to find neighbours, the LOF''
addresses scenarios that the LOF cannot effectively handle.
By trimming inliers, the GridLOF enhances the efficacy of
outlier identification. Because of its intricacy, this approach
has several drawbacks, including a lengthy run time.
FIGURE 4. Representing corrupted data detected using various methods.
In the present work and on experimentation, the improved
LOF resulted in a performance of 59.47% on the clustering dataset and 58.79% for linear data with a corruption
rate of 20%, decreasing as the corruption percentage grew
and imperceptible change in accuracy as the dataset size
increased.
FIGURE 4(a) Shows the resulting corrupt data detected
using the model.
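As a minimal sketch of how this detector can be run, the snippet below uses scikit-learn's LocalOutlierFactor on the corrupted clustering data from the earlier sketch; the n_neighbors value and the contamination rate matching the 20% corruption level are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# X_clu_corrupt comes from the dataset sketch above; LOF cannot handle
# missing values, so NaNs are imputed with 0 first.
X = np.nan_to_num(X_clu_corrupt, nan=0.0)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.2)
pred = lof.fit_predict(X)               # -1 = flagged as corrupted, 1 = normal
scores = -lof.negative_outlier_factor_  # larger -> more anomalous (cf. LOF > 1.1)
flagged = np.where(pred == -1)[0]
```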
2) ONE-CLASS SVM
For the past decade, SVM has been one of the most effective machine learning approaches. To discriminate between
distinct classes of data, SVMs [72] use hyperplanes in multidimensional space. Naturally, SVM is utilised to handle
multi-class classification challenges [73]. A semi-supervised variant of SVM, i.e. One-Class SVM, exists for anomaly detection. In this case, the algorithm has been trained to comprehend ''normal,'' so that whenever new data is provided, it will determine whether or not it must be included. Otherwise, the new data is labelled as anomalous or out of the norm. It employs the One-vs-All Anomaly Detection concept.
It effectively optimises the distance from this hyperplane to (0,0) while keeping all data points away from (0,0) (in feature space F). As a result, a binary function is generated that captures the regions of input space where the density of the data is high: it returns +1 in a small region, and -1 elsewhere [74], [75], [76]. Kernels can also be used to map data to a higher-dimensional space for improved performance.
However, it has been found that a one-class SVM is prone
to data outliers. Amer et al. [75] in their paper, apply two
changes to one-class SVMs to make them more suited for
unsupervised anomaly identification: robust and eta one-class
SVMs. The main notion behind these changes is that outliers
should influence less than regular cases. According to the
research on UCI machine learning collections, their alterations are quite promising: The upgraded one-class SVMs
outperform other standard unsupervised anomaly detection
techniques on 2 of 4 datasets. The suggested eta one-class
SVM in particular has yielded encouraging results. In the
present work and on experimentation, the One-Class SVM
achieved 76.82% on the clustering dataset and 72.28% for
linear data with a corruption rate of 20%, dropping as the corruption percentage rose and showing no discernible change in
accuracy as the dataset size increased. FIGURE 4(b) Shows
the resulting corrupt data detected using the model.
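A minimal sketch with scikit-learn's OneClassSVM follows; setting nu near the 20% corruption rate is an assumption, since nu only bounds the fraction of training errors.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# X: NaN-imputed corrupted array from the LOF sketch above.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.2)
pred = ocsvm.fit_predict(X)          # +1 inside the learned region, -1 outside
flagged = np.where(pred == -1)[0]
```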
3) K-MEANS CLUSTERING
Another well-known state-of-the-art model is the K-means-based outlier identification [77] approach, wherein the data are divided into k groups by allocating them to the closest cluster centres. Once assigned, we can calculate how far
each item is from the cluster’s centre and select those with
the largest gaps as outliers. It determines the distance and
its associated centroids. After determining the distance, the
threshold ratio is chosen as a percentile. The data is deemed
poisoned if the threshold ratio is exceeded.
The Elbow procedure is an empirical method to get the best
value of k. Here, a metric known as ‘‘Within Cluster Sum of
Squares’’ (WCSS) [78] as shown in Eq. (4) is determined with
respect to its cluster centroid and recorded.
$$\mathrm{WCSS}(m) = \arg\min_{o} \sum_{j=1}^{n} \sum_{y_i \in \text{cluster}} \|y_i - \bar{y}_j\|^2 \quad (4)$$
where o is the collection of observations, m is the total set of
predictors, yi is the observational data point in cluster i and ȳi
is the sample mean in the cluster i [79].
The number of clusters must either be manually supplied to the K-means algorithm or determined by additional procedures. Instead of finding global optimum solutions for nonconvex problems, the K-means method becomes trapped on
local optimum solutions.
FIGURE 5. The corrupted data detected using K-means Clustering.
Xiong et al. [80], in their paper on the optimisation of initial cluster centres for text classification, came up with an algorithm wherein the density parameter of
the data items was used to calculate the first cluster centres,
ensuring the logic of the initial cluster centres. Their new
approach, to a considerable part, decreased the susceptibility
of the K-means algorithm to the original cluster centres and
produced improved text clustering results.
When there are anomalies in the data, they have a crucial impact on all cluster centroids, as K-means focuses on the mean of
the values for its centre [80], [81]. Hence, the groups that
would be made in the presence and absence of these outliers
would vary greatly. The distances of the values from the
centres would also vary and a new set of corrupted outliers
would be produced every time. In the present work and on
experimentation, the K-Means obtained 86.06% on the clustering dataset and 86.70% for linear data with a corruption
rate of 20%, declining as the corruption percentage grew
and demonstrating no discernable change in accuracy as the
dataset size increased. FIGURE 5 depicts the model-detected
faulty data.
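The distance-and-percentile scheme described above can be sketched as follows; k = 5 (matching the five generated centres) and the 80th-percentile threshold (mirroring 20% corruption) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# X: NaN-imputed corrupted array from the earlier sketches.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Distance of every point to its assigned centroid.
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Percentile threshold ratio, as described above.
threshold = np.percentile(dist, 80)
flagged = np.where(dist > threshold)[0]   # points deemed poisoned
```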
4) ISOLATION FOREST
An Isolation Tree is a Random Forest variant that may be
utilised for anomaly identification. They extract one random characteristic at a time and divide it into homogenous
partitions. However, the goal of Isolation Forest [82], [83],
[84] is not to create homogeneous partitions, but rather to
create partitions in which each datapoint is isolated (That
particular isolation contains only the datapoint). The rationale
underlying Isolation Trees is that a regular point is more
difficult to isolate than an aberrant one.
During the training phase of this approach, we take a sample of the data and generate an iTree until each point is isolated. A feature is chosen at random and split at a random value. The
forecast is then completed by calculating the Anomaly Scores
as given in Eq. (5) for the new points [85].
$$S(x_1, n_1) = 2^{-\frac{E(p(x_1))}{c(n_1)}} \quad (5)$$
where:
• x1: data point
• n1: sample size
• PS(x1, n1): Prediction Score
• E(p(x1)): iTrees' average search height for x1
• c(n1): average value of p(x1)
The samples that travelled farther into the tree are less
likely to be abnormal since it requires more cuttings to separate them. Shorter branches are likely to include anomalies
since the tree finds it easier to identify them from other
data [86].
If E(p(x1)) ≪ c(n1) ⇒ PS(x1, n1) → 1 ⇒ Anomaly
If E(p(x1)) = c(n1) ⇒ PS(x1, n1) = 0.5 ⇒ Regular
As a result, Isolation Forest produces a score bound in
the range of 0 to 1, where values close to 1 are regarded
Anomalous and values less than 0.5 are considered Regular. However, the values produced by sklearn [86] have
an inverted interpretation, i.e. numbers less than −0.5 are
more regular while values more than −0.5 are more likely
to be anomalous. In the present work and on experimentation, the Isolation Forest achieved 82.37% on the clustering dataset and 82.2% with a corruption rate of 20%,
with accuracy decreasing as the corruption percentage rose
and showing no discernible change as the dataset size
increased. The model-detected erroneous data is depicted
in FIGURE 4(c).
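A minimal sketch with scikit-learn's IsolationForest is given below; the contamination rate is assumed to match the 20% corruption level.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# X: NaN-imputed corrupted array from the earlier sketches.
iso = IsolationForest(contamination=0.2, random_state=0)
pred = iso.fit_predict(X)           # -1 = isolated quickly, i.e. anomalous
scores = iso.decision_function(X)   # lower (more negative) -> more anomalous
flagged = np.where(pred == -1)[0]
```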
5) DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS
WITH NOISE(DBSCAN)
The abbreviation DBSCAN refers to Density-Based Spatial
Clustering of Applications with Noise [87]. It is an unsupervised technique that divides a set of points into sets
with comparable qualities. It uses density-based clustering
to find outliers that do not fit into any of the clusters or
sets [88], [89]. FIGURE 4(d) Shows the resulting corrupt data
detected using the model.
DBSCAN takes in 2 input parameters:
• ε
• minpts() [90], [91]
where ε represents the radius of the circle formed with a data object as centre and minpts() represents the number of points inside the circle.
As a result, three types of datapoints are obtained
(i) Core point - Satisfies the input requirements.
(ii) Boundary point - the core point’s neighbour.
(iii) Noise point - Neither centre nor border.
The DBSCAN starts by determining the surroundings
starting from an unexplored, random starting point. If the point has enough neighbours, clustering begins and the point is labelled as visited; otherwise the point is labelled as noise. This procedure is repeated until all points in a cluster have been realised and all points have been marked
visited.
In the present work and on experimentation, the DBSCAN
obtained 39.60% accuracy on the clustering dataset and
43.20% with a 20% corruption rate, with accuracy falling as
the corruption percentage grew and displaying no noticeable
change as the dataset size increased. FIGURE 4(d) depicts the
model-detected incorrect data.
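A minimal sketch using scikit-learn's DBSCAN follows; eps and min_samples are illustrative guesses that normally require tuning per dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X: NaN-imputed corrupted array from the earlier sketches.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
flagged = np.where(labels == -1)[0]   # label -1 marks noise points
```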
6) ELLIPTICAL ENVELOPE
The basis of the Elliptical Envelope algorithm is to create
a hypothetical oval shape similar to an ellipse around the
given dataset values [92]. The points which fall inside the
elliptical shape are regarded as normal data and the values in
the dataset outside the elliptical shapes are considered outliers
or anomalies. This unsupervised algorithm is mostly used
on a Gaussian distributed dataset. To find the data points which are at a greater distance from the boundary of the shape, a minimum-covariance matrix is found. FIGURE 4(e) Shows
the resulting corrupt data detected using the model.
In essence, the Elliptical Envelope algorithm fits a Gaussian onto the data and then tries to find the outliers which
are the data points which do not fit adequately. Since this
is primarily intended to be used for the outlier detection
job, our aim is to fetch a reliable estimation of the mean
and covariance matrix that will allow us to accept certain
outliers in the training dataset while still attempting to recover
the true covariance matrix. The Mahalanobis distance dMH is
used to obtain the distance measure between an instance ‘P’
and a given allocation denoted by ‘D’. It is computed with
respect to all the multidimensional data vector x, and the
resultant distances(dMH ) are sorted in ascending order. The
dMH calculated is then used to define a threshold in order
to define a boundary which would classify the data points
as normal or anomalous. Mahalanobis defined ‘‘Mahalanobis
distance’’ [93], [94] as shown in Eq. (6).
$$d_{MH} = \sqrt{(y - \mu)^T C^{-1} (y - \mu)} \quad (6)$$
where C denotes the covariance matrix. When the covariance
is equal to the identity matrix, dMH simplifies to Euclidean
distance [95] and if a covariance matrix is a diagonal matrix,
to the normalized Euclidean distance [96].
In the present work and on experimentation, the Elliptic
Envelope outlier detection algorithm has performed significantly well for datasets of varying sizes and levels of corruption. It delivered an accuracy of about 86% for datasets of
smaller size but 60% corruption rate. It kept up its consistency
in the identification of corruption for medium and big datasets
as well, maintaining an accuracy of about 80% and 85%,
respectively, for the datasets injected with 60% corruption
rate.
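This behaviour can be sketched with scikit-learn's EllipticEnvelope, which fits a robust Gaussian and thresholds the Mahalanobis distance of Eq. (6); the contamination value is an assumption.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# X: NaN-imputed corrupted array from the earlier sketches.
env = EllipticEnvelope(contamination=0.2, random_state=0)
pred = env.fit_predict(X)     # -1 = outside the fitted ellipse
d_mh = env.mahalanobis(X)     # squared Mahalanobis distances, cf. Eq. (6)
flagged = np.where(pred == -1)[0]
```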
7) ROTATION-BASED OUTLIER DETECTION
‘Rotation Based Outlier Detection’ or ROD is an approach
that can be used for anomalies and outliers’ detection in
multivariate data. ROD was first designed to deal with
scenarios such as complicated outliers concealed in subspaces [97], [98], taking into account data generated by disparate means [99], and smoothing the detection of outliers in
higher dimensions that would otherwise go unnoticed [100].
The robust approach known as ‘‘rotation-based outlier detection’’ rotates the three-dimensional (3D) vectors that represent the data points twice counterclockwise around the
geometric median. This rotation is done in accordance with
the Rodrigues rotation formula [101]. The rotation produces
parallelepipeds, the volumes of which are investigated using
mathematical means such as cost functions and utilized to
determine the median absolute deviations and generate the
outlier score. When the original data space is divided into
3D subspaces, the total score is determined by averaging
the 3D-subspace scores for high dimensions. FIGURE 4(f)
Shows the resulting corrupt data detected using the model.
The algorithm performed fairly well for datasets of varying
sizes having lower level of corruption. As per the tests run
by the authors ROD provided an accuracy around 62.71% in
detecting corruption for clustering data and 62.83% for linear
data with 20% corruption, however the accuracy drastically
declined as the level of corruption was increased keeping the
dataset size fixed.
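A minimal sketch using the ROD implementation shipped with the Python PyOD library is shown below; the contamination rate is assumed, and the exact constructor options may differ between PyOD versions.

```python
import numpy as np
from pyod.models.rod import ROD

# X: NaN-imputed corrupted array from the earlier sketches.
rod = ROD(contamination=0.2)
rod.fit(X)
scores = rod.decision_scores_            # rotation-based outlier scores
flagged = np.where(rod.labels_ == 1)[0]  # 1 = flagged as corrupted
```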
8) PRINCIPAL COMPONENT ANALYSIS
Principal component analysis or PCA is a traditional mathematical approach where the data matrix is split into principal
components. The principal components’ poor interpretability
and the trade-off between losing crucial information/data and
reducing dimensionality, which has a powerful impact on
the accuracy, are the main reasons why the method is significantly less effective than other approaches when applied
to the synthetic datasets used in this study. The disadvantages of this technique are exacerbated by the requirement to
provide the dataset’s contamination rate in order to identify
outliers. The key elements can be used in multiple situations. As demonstrated in this work, ‘‘Principal Component
Analysis’’ (PCA), which is frequently utilized for exploratory
analysis and dimension reduction, can also be used to detect
corrupted data. FIGURE 4(g) Displays the model’s detection
of corrupt data.
Principal Component Analysis is primarily used to
decrease the dimensionality for the dataset which consists of
numerous variables that are correlated, whilst simultaneously
retaining the variation and essence of the original dataset.
This feat is achieved by converting the data into a new set of variables, the principal components, which are largely uncorrelated and ordered such that the first few hold the majority of the variation with respect to the original variables [102].
FIGURE 6 adequately puts forth the basic steps in order to use PCA for dimension reduction [103], [104], [105], [106]. Anomaly detection using PCA is based on the decomposition of data metrics; however, anomaly detection using this technique is mostly restricted to numerical data, which is also one of the drawbacks of this methodology. The various tests conducted as a part of this research to measure the effectiveness of PCA for detection of data corruption revealed some astonishing results. As the level of corruption was increased from 20% to 60% for small, medium and large dataset sizes, a significant drop in detecting corruption was observed. The accuracy dropped from around 70% for 20% corruption to around 15-20% for 60% corruption across datasets of all sizes, thus highlighting the algorithm's ineffectiveness in situations where there are highly corrupted datasets.
FIGURE 6. The steps involved in PCA.
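One simple way to apply PCA here, sketched below, is to keep the leading component and flag the points with the largest reconstruction error; the single component and the 80th-percentile cut are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# X: NaN-imputed corrupted array from the earlier sketches.
pca = PCA(n_components=1).fit(X)
recon = pca.inverse_transform(pca.transform(X))

# Points that reconstruct poorly from the principal component
# are candidate corrupted values.
error = np.linalg.norm(X - recon, axis=1)
threshold = np.percentile(error, 80)     # assumes ~20% contamination
flagged = np.where(error > threshold)[0]
```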
9) DEEP SUPPORT VECTOR DATA DESCRIPTION
The deep support vector data description (DeepSVDD) approach proposes a modification of the support vector data description model, which is another traditional paradigm for anomaly
detection. DeepSVDD employs a specific type of neural
network to learn appropriate data representations. To distinguish between regular and anomalous data, DeepSVDD [107]
employs the hyper-sphere rather than the hyper-plane.
DeepSVDD extracts discriminative features from the initial
data using a neural network.
$$\min_{W} \; \frac{1}{n}\sum_{i=1}^{n} \|\phi(x_i; W) - a\|^2 + \frac{\lambda}{2}\sum_{b=1}^{L} \|W^b\|_F^2 \quad (7)$$
In Eq. (7), ‘‘a’’ represents the center of the sphere, x represents features extracted and W being the weights of the
hidden layers and subscript F represents Frobenius norm
which cycles through all the entries, adds their squares and
then takes the square root as represented in Eq. (8).
$$\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} |a_{ij}|^2} \quad (8)$$
The first part of Eq. (7) [108] reflects the loss which varies with the distance to the sphere's centre. The next term denotes a weight decay regularizer with λ > 0 inserted as a
hyperparameter. The Python PyOD library’s Deep SVDD,
which was employed to test this model, calls for the specification of the contamination or corruption rate. The default
hyper-parameters offered by the PyOD module serve as the
foundation for the results that the deepSVDD generates on the
aforementioned datasets which are trained for 100 epochs.
DeepSVDD is used to train a neural network [109], [110],
which reduces the size of the hypersphere that surrounds the
data network representations, driving the network to identify recurring sources of variation. DeepSVDD, like PCA,
may be used to discover outlier items in data by calculating their distance from the center. FIGURE 4(h) depicts
the model-detected faulty data. DeepSVDD when tested on
detection of corruption did not show much deviation in
the accuracies for different size and rate of corruption for
the dataset. In the present work and on experimentation,
it gave an accuracy of 70-75% for 20% corruption of small,
medium and large datasets which only dropped slightly to
around 50-55% for 40% and 60% corruption for small and
medium sized datasets. Whereas it maintained a consistent
accuracy of around 70% for large datasets with varying corruption rates.
10) LUNAR
LUNAR stands for Learnable Unified Neighbourhood-based Anomaly Ranking [60]. It develops a trainable method for
using data from each node’s closest neighbours to detect
anomalies. Along with the PAACDA algorithm, LUNAR is
another graph-based outlier detection technique that offers
comparable comparison points.
LUNAR unifies ideas from K-NN, DBSCAN and LOF with graph neural networks (GNNs) to provide a faster computing speed and better performance. The outlier score for the KNN is given by Eq. (9), where y is the data sample and n is the number of neighbours.
$$\mathrm{KNN}(y_i) = \mathrm{dist}(y_i, y_i^k) \quad (9)$$
The outlier score for the LOF factor is given by the Eq. (10)
where lrd is ‘‘local reachability density’’.
$$\mathrm{LOF}(y_k) = \frac{\sum_{i \in D_k} \mathrm{lrd}_j(y_i)}{|D_k|\,\mathrm{lrd}_j(y_k)} \quad (10)$$
The edges in the graph component are computed using the
Eq. (11) which is the Euclidean distance of the 2 points, where
yi and yj are the two data samples and N is the number of
neighbours.
$$e_{i,j} = \mathrm{dist}(y_i, y_j) \text{ if } j \in N_i,\; 0 \text{ otherwise} \quad (11)$$
The K closest neighbours of a sequence of data points
are produced as input to the neural network. Finding the
k nearest neighbours is one of LUNAR’s limitations, as it is
with all local outlier approaches. This is mostly a problem
in extremely high-dimensional spaces, but because the aforementioned datasets have smaller dimensions, the LUNAR
model outperforms conventional probabilistic models in
these situations. Other outlier detectors like the local outlier
factor (LOF) and DBSCAN have been compared to LUNAR.
However, we will focus on the detection of damaged data in
tainted datasets in this paper rather than just outliers. In the
present work and on experimentation, the LUNAR model seems
to have performed fairly well in small, medium and large
datasets for 20% corruption rate and thereby provided an
accuracy of around 85-87%. However the accuracy dropped
to about 70% for 40% corruption rate and subsequently to
around 50% when the level of corruption was increased to
60% in the datasets. FIGURE 4(i) Shows the resulting corrupt
data detected using the model.
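A minimal sketch with PyOD's LUNAR wrapper is given below; LUNAR trains a small neural network on nearest-neighbour distance vectors and therefore requires PyTorch, and the default hyper-parameters used here are an assumption.

```python
import numpy as np
from pyod.models.lunar import LUNAR

# X: NaN-imputed corrupted array from the earlier sketches.
lunar = LUNAR()                # default nearest-neighbour and network settings
lunar.fit(X)
scores = lunar.decision_scores_
flagged = np.where(lunar.labels_ == 1)[0]   # 1 = flagged as corrupted
```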
11) EMPIRICAL CUMULATIVE OUTLIER DETECTION
It is important to note that for outlier detection with the Empirical Cumulative distribution function, there is a class called ECOD (Empirical Cumulative Outlier Detection) [43].
ECOD is a highly interpretable approach for outlier detection and requires no parameters. The synthetically generated datasets have a rather high sensitivity metric when this
method is applied to them. Here the investigators formulated
that one of the algorithm’s important characteristics is that
there are no hyper-parameters, which makes it simpler to
implement [111]. However, the function in the Python PyOD
package also requires the definition of the corruption percentage, just like other statistical approaches. The authors
proposed that ECOD [43] was easily interpretable by looking at the left or right tailed probability which was highly
optimised as both of them contributed to the total outlier
score. The importance of the tail probabilities was thus illustrated in the work we drew inspiration from, yielding crucial results. In the present work and on experimentation, the optimized ECOD led to a performance of 82.71%
on the clustering dataset and 82.30% for linear data with a
corruption rate of 20%, decreasing with increase in corruption
percentage and indistinguishable change in accuracy when
size of the dataset increased.
As an overview of the methodology, ECOD employs information about the distribution of the data: where data is less likely (low-density), it is more likely to be corrupted.
For each variable in the data, ECOD individually estimates an
Empirical Cumulative Distribution Function (ECDF) [112],
[113]. ECOD uses a univariate ECDF to determine tail probabilities for each variable and then combines them together
to produce a score for an observation. The computation takes
into account both the left and right tails of each dimension and
is performed in log space. Although this method is designed
to find outliers, we’ll use it to find instances of faulty data in a
dataset [114]. FIGURE 4(j) Shows the resulting corrupt data
detected using the model. The outlier score for the ECOD
algorithm is calculated using the below-mentioned formulae.
The Outlier scores are calculated for the left tail, right tail and
another measure called auto as shown in Eq. (12), Eq. (13),
Eq. (14).
$$O_{\text{left-only}}(Y_k) := -\log \hat{D}_{\text{left}}(Y_k) = -\sum_{i=1}^{a} \log\big(\hat{D}^{(i)}_{\text{left}}(Y^{(i)}_k)\big) \quad (12)$$
$$O_{\text{right-only}}(Y_k) := -\log \hat{D}_{\text{right}}(Y_k) = -\sum_{i=1}^{a} \log\big(\hat{D}^{(i)}_{\text{right}}(Y^{(i)}_k)\big) \quad (13)$$
$$O_{\text{auto}}(Y_k) = -\sum_{i=1}^{a} \Big[\mathbb{1}\{\gamma_i < 0\} \log\big(\hat{D}^{(i)}_{\text{left}}(Y^{(i)}_k)\big) + \mathbb{1}\{\gamma_i \geq 0\} \log\big(\hat{D}^{(i)}_{\text{right}}(Y^{(i)}_k)\big)\Big] \quad (14)$$
where O is the outlier scores, Y is the input data, D is the
corresponding ECDF. The final outlier score is derived by
aggregating the above-mentioned outlier scores using the
formula mentioned in Eq. (15).
$$O_i = \max\big(O_{\text{left-only}}(Y_i),\, O_{\text{right-only}}(Y_i),\, O_{\text{auto}}(Y_i)\big) \quad (15)$$
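A minimal sketch with PyOD's ECOD follows; apart from the contamination rate used to cut the scores of Eq. (15), no parameters are needed.

```python
import numpy as np
from pyod.models.ecod import ECOD

# X: NaN-imputed corrupted array from the earlier sketches.
ecod = ECOD(contamination=0.2)
ecod.fit(X)
scores = ecod.decision_scores_            # aggregated tail probabilities, Eq. (15)
flagged = np.where(ecod.labels_ == 1)[0]
```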
12) GAUSSIAN MIXTURE MODELS
GMMs or Gaussian mixture models use sets of parameterised probabilistic functions as the weighted components of
the pre-trained model using expectation maximisation technique [115]. Previous work on outlier detection has shown
that gaussian mixture models have proved effective in detecting outliers providing optimised performance at k=2 gaussians and other parameters being randomly initialised [116].
In the present work and on experimentation, the optimized GMM led to a performance of 82.71% on the clustering dataset and 82.83% on linear data with a corruption rate of 20%, decreasing with increase in corruption percentage and with no distinguishable change in accuracy as the size of the dataset increased.
When used on the aforementioned datasets, GMMs have a very high recall but a very low accuracy. In terms of outlier detection, they perform better than other similar clustering
techniques like K-Means clustering and DBSCAN clustering.
It necessitates the specification of the contamination rate, just
like other probabilistic models. This unsupervised clustering method is the Gaussian Mixture Model and can be described using Eq. (16) as a sum of component densities for a particular point. In contrast to K-Means, we fit 'k' Gaussians in
the data in this technique. The parameters such as the mean
and the variance for each of the cluster as well as the weight
of the cluster also called the distribution parameters are then
determined [116]. We then determine the odds that each
data point will belong to each of the clusters. FIGURE 4(k)
Shows the resulting corrupt data detected using the model.
The unsupervised method is applied on univariate data and
produces an outlier score which then becomes the threshold
and the primary criteria to filter the outliers and anomalies
from the data [117]. The outlier score for a particular data
instance is calculated by the Eq. (17), where p(y) is the PDF or
the probability density function and f would be any constant
to scale the outlier score and u is the mean [118]. Further the
outlier score is normalised to an interval of 0 to 10 in order to
ease the comparison.
$$p(y|\lambda) = \sum_{k=1}^{N} v_k\, g(y|\mu_k, \Sigma_k) \quad (16)$$
$$OS_y = (\log(p(y)))^2 \cdot f \quad (17)$$
GMMs can also be used to identify potential anomalies in
multidimensional datasets. It can also be aggregated with an
LSTM (long short term memory) to examine the correlation
between the multivariate parameterised data in order to obtain
improved outcome in detecting potential outliers [119].
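A minimal sketch using scikit-learn's GaussianMixture is shown below; k = 2 components follows the prior work cited above, while thresholding the log-likelihood at the 20th percentile is our illustrative stand-in for the outlier score of Eq. (17).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X: NaN-imputed corrupted array from the earlier sketches.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_p = gmm.score_samples(X)               # log p(y | lambda), cf. Eq. (16)
threshold = np.percentile(log_p, 20)       # assumes ~20% contamination
flagged = np.where(log_p < threshold)[0]   # low-likelihood points
```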
13) MEDIAN ABSOLUTE DEVIATIONS
The MAD or ‘‘median absolute deviation’’ for a set of
attributes is the median of that dataset’s absolute deviation.
This concept is the crux of the Median Absolute Deviation
also called MAD algorithm [120]. The absolute deviation of
an instance is the displacement between that tuple and the distribution's median. In previous works, the MAD has been shown to be a reliable indicator of the variability in a sample of numeric data and is widely used in statistics. Because it is so successful and
efficient, the Median Absolute Deviation model is frequently
employed for this kind of anomaly identification [121].
Instead of pure outliers and anomalies, we will concentrate
on corrupt data and provide optimised results [121]. The primary reason this algorithm does well on the aforementioned
datasets is that MAD is primarily used for evaluating the
distance between data instance and its median in the terms of
the median distance and is strictly for univariate data [122].
The time required to compute the MAD score is fairly less
and the MAD algorithm is aimed towards symmetric distributions [123]. Unlike the other probabilistic models, the Median
Absolute Deviation does not require the corruption rate to be
specified which adds to its credibility and provides optimised
results [123]. FIGURE 4(l) Shows the resulting corrupt data
detected using the model.
The MAD is an alternative to the earlier used threshold
equal to the sum of the mean of a distribution and three
standard deviations which causes problems as the mean value
and the standard deviations are extremely sensitive to outliers.
This model is another threshold based outlier detection technique to identify outliers based on the statistical formulae like
mean, median, mode. Unfortunately though this algorithm
like many other statistical algorithms adds a bias to multiple statistical measures used in outlier detection algorithms
which lead to inaccurate results [124].
In the present work and on experimentation, the optimised
MAD successfully accomplished an effectiveness of 94.46%
on the clustering dataset and 94.77% on linear data with a
corruption rate of 20%, decreasing with increasing corruption percentage and indistinguishable change in accuracy as
dataset size increased.
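Since MAD needs no contamination rate, it is easy to sketch directly in NumPy; the 0.6745 constant and the 3.5 cutoff are the conventional modified z-score choices and are assumptions, as the exact threshold used here is not stated.

```python
import numpy as np

def mad_outliers(col, cutoff=3.5):
    """Flag univariate values whose modified z-score exceeds the cutoff."""
    med = np.nanmedian(col)
    mad = np.nanmedian(np.abs(col - med))      # median absolute deviation
    mod_z = 0.6745 * (col - med) / mad         # modified z-score
    return np.where(np.abs(mod_z) > cutoff)[0]

# MAD is strictly univariate, so it is applied one column at a time.
flagged = mad_outliers(X[:, 0])
```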
14) COPULA-BASED OUTLIER DETECTION
The Python PyOD module contains a set of probabilistic algorithms, one of which is COPOD [57], the Copula-Based Outlier Detector. COPOD is an empirical
copula model-based, parameter-free, and highly interpretable
outlier detection algorithm. It is important to note that
achieving top optimised performance on anomaly detection
datasets, interpretable and straightforward corruption visualisation, speed and computational efficiency and scaling to
high-dimensional datasets are some of its key distinguishing
characteristics [125]. The corruption rate is the only known
parameter we use in this investigation. FIGURE 4(m) Shows
the resulting corrupt data detected using the model. In previous work the authors have highlighted that the optimised
COPOD is highly correlated with the ECOD algorithm [57]
and that it outperforms all its variants by being deterministic
without any hyper parameters and highly effective for high
dimensional datasets.
In statistics and probability, a copula is a multivariate cumulative distribution function where the marginal probability distribution of each variable is uniform in the range [0,1]. Copulas are also
used for representing or modeling the dependence of random
variables [126]. This is the main motivation behind using
copulas to detect anomalies in various application [127].
The COPOD algorithm has three main steps to detect
outliers [57]:
• Calculate the cumulative distribution function based on the dataset.
• Calculate the empirical copula.
• Find the tail probabilities [128] using the above-mentioned empirical copula.
• The outlier score is found using the max of the tail probabilities calculated for each data instance.
COPOD (Copula-Based Outlier Detector) does not
require pairwise distance measurement, in contrast to other
proximity-based algorithms. It mainly applies to multivariate
data. To determine and identify the anomalies contained
in the dataset, the COPOD generates an empirical copula
before calculating the tail-based probability for each data
occurrence [129].
In the present work and on experimentation, the optimised COPOD achieved a clustering dataset effectiveness
of 92.43% with a corruption rate of 20%, decreasing to
88% with increasing corruption percentage and indistinguishable change in accuracy as dataset size increased.
The proposed linear data algorithm had a 92.67% accuracy
for a dataset with 20% corrupted data, with comparable
trends when the corruption percentage and dataset size were
varied.
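A minimal sketch with PyOD's COPOD follows; the contamination rate, the only known parameter used in this investigation, is set to the 20% corruption level.

```python
import numpy as np
from pyod.models.copod import COPOD

# X: NaN-imputed corrupted array from the earlier sketches.
copod = COPOD(contamination=0.2)
copod.fit(X)                     # fits the empirical copula; no pairwise distances
scores = copod.decision_scores_  # tail-probability based outlier scores
flagged = np.where(copod.labels_ == 1)[0]
```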
15) HISTOGRAM-BASED OUTLIER DETECTION
A potent unsupervised method is histogram-based outlier identification, also called HBOS. It establishes the level
of corruption while assuming feature independence by producing histograms. The focus of the authors’ work will be on
the application of histogram-based outlier detection (HBOS),
a statistical model that is primarily for outliers, to identify corrupted data in datasets that have been tainted. The histogram
algorithm assesses the level of anomalies while assuming
feature independence by producing histograms [14]. After
multivariate anomaly detection, a histogram for each feature
can be generated, graded separately, and aggregated [130].
It is mostly relevant to multivariate data, although outperforming many other probabilistic models when applied with
the aforementioned bespoke dataset. Similar to other probabilistic and statistical models, the corruption rate must be
given for this histogram based algorithm for outlier detection.
FIGURE 4(n) Shows the resulting corrupt data detected using
the model.
The histogram algorithm for numerical data is essentially
based on two approaches: the first uses the renowned histogram bins with static bin-width that do not vary, and the
second uses the bin-width that changes approach (dynamic
bin-width). These bins with a wider interval of values or range
have lesser density and less height. As a result, the density of
each bin is represented by its height, which is then normalised
to guarantee that the anomaly is given the same weight and
score. The following step involves applying the Eq. (18) to
calculate the HBOS value [131].
$$\mathrm{HBOS}(q) = \sum_{j=0}^{a} \log\left(\frac{1}{\mathrm{hist}_j(q)}\right) \quad (18)$$
where HBOS for instance q with dimension a is calculated
by use of height of the bins where q is located. HBOS works
well on tasks involving global anomaly identification, but it
is unable to identify local outliers since it is unable to model
or depict histograms with local anomaly density. The algorithm works well on univariate data [132]. A similar anomaly
detection approach can be used there, even though histograms
for multidimensional data are computationally intensive and
need a large number of operations [133], [134].
In the present work and on experimentation, the optimised HBOS achieved a clustering dataset effectiveness of
95.05% with a corruption rate of 20%, decreasing to 88%
with increasing corruption percentage and indistinguishable
change in accuracy as dataset size increased. For a dataset
with 20% corrupted data, the proposed linear data algorithm
had an accuracy of 95%, with comparable trends when the
corruption percentage and dataset size were varied.
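A minimal sketch with PyOD's HBOS is given below; the number of static-width bins and the contamination rate are illustrative assumptions.

```python
import numpy as np
from pyod.models.hbos import HBOS

# X: NaN-imputed corrupted array from the earlier sketches.
hbos = HBOS(n_bins=10, contamination=0.2)
hbos.fit(X)
scores = hbos.decision_scores_   # HBOS values, cf. Eq. (18)
flagged = np.where(hbos.labels_ == 1)[0]
```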
IV. PROPOSED METHODOLOGY
PAACDA (Proximity based Adamic Adar Corruption Detection Algorithm): Adamic Adar [135], [136] is a graph algorithm used to link nodes in a social network. In this study,
we utilise the concept behind this algorithm to detect outliers and missing and modified values while leveraging its
prominence in data correlation in graph networks by applying
PAACDA to a numerical, tabular dataset. This algorithm is
used to compute the accuracy of a particular data instance
within a dataset as a whole. The Adamic Adar Index [65] is
calculated using Eq. (19).
$$A(x, y) = \sum_{n \in D(x) \cap D(y)} \frac{1}{\log |Deg(n)|} \quad (19)$$
The formula presented in Eq. (19) determines the Adamic Adar index for each node in a network, where D(x) represents the neighbors of x, D(y) represents the neighbors of y and Deg(n) is the degree of the common neighbours. Typically, the Adamic Adar Algorithm is utilized to evaluate the closeness of two nodes in a graph. It is based on the notion that common neighbours with fewer connections are weighted more heavily than widely shared, highly connected ones. A value's network closeness is inversely related [71] to the Adamic-Adar index.
are dealing with numerical data, we use the data’s mean as
a metric to verify the link with each data point and spot the
altered or distorted values. The proposed Eq. (20) is used in
the PAACDA Algorithm for data corruption detection.
$$\text{PAACDA Index} = \sum_{x=1}^{N} \frac{1}{\log(\text{Number of values within } x \cdot \text{range})} \quad (20)$$
The steps listed below illustrate the method followed:
• The mean is calculated for the column being analysed.
• The range is set as mean/4.
• Each data instance is iterated and Eq. (20) is applied, where x is each data instance.
• If the instance is missing then the PAACDA Index value is set to infinity.
• The PAACDA Index values are compared amongst each other and the set of corrupted values is determined.
• The accuracy metric is obtained and the confusion matrix is determined.
The intuition behind defining the range as mean/4 is that the mean accounts for all the components of the provided data and contains information from each observation in a dataset. The mean serves as a link between the actual values and the corrupted data in this manner. Through probing and analysis, we come to the conclusion that mean/4 is the appropriate range for the following method. Again, the
algorithm’s iteration count is determined experimentally and
we utilise 3 iterations for this algorithm and dataset as shown
in Algorithm 1. The proposed algorithm uses two nested loops. The outer loop iterates through each datapoint and the inner loop further iterates and compares each cell's data with the one held by the outer loop. The time complexity is therefore quadratic: the algorithm runs in O(n²) time, where n is the number of data entries in a column.
The PAACDA algorithm is based on the notion that values
with fewer similarities [71] are more likely to be viewed as
significant than values with greater differences. The performance of this algorithm might vary from dataset to dataset
depending on the data distribution as well as the percentage
and level of corruption it is subjected to. The PAACDA
algorithm uses mean to compute the index which acts as an
advantage because each time the data is corrupted the mean
varies accordingly. The PAACDA has proven to be more
effective than the other clustering and statistical techniques
at locating outliers, missing values and corrupted data where
accuracy is the primary concern. This makes PAACDA the
most suitable for data corruption detection for numerical
datasets.
Algorithm 1 Proximity Based Adamic Adar Corruption Detection Algorithm (PAACDA)
indices ← [ ]
for each i ∈ Corrupted Column do
    first ← 1
    second ← 1
    third ← 1
    for each j ∈ Corrupted Column do
        if abs(j − i) ≤ range then
            first ← first + 1
        end if
        if abs(j − i) ≤ 2 ∗ range then
            second ← second + 1
        end if
        if abs(j − i) ≤ 3 ∗ range then
            third ← third + 1
        end if
    end for
    index ← 0
    if first ≠ 1 then
        index ← 1 / log(first)
    end if
    if second ≠ 1 then
        index ← index + 1 / log(second)
    end if
    if third ≠ 1 then
        index ← index + 1 / log(third)
    end if
    indices.append(index)
end for
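To complement the pseudocode, below is a direct Python transcription of Algorithm 1, with the missing-value rule taken from the step list above. The natural logarithm, the toy data, the 30% corruption fraction and the top-k selection at the end are illustrative assumptions, not the authors' exact pipeline.

    import math

    def paacda_indices(column, rng):
        # One PAACDA index per value; rng is set to mean/4 in this work.
        indices = []
        for i in column:
            if math.isnan(i):                  # missing instance -> infinity
                indices.append(float("inf"))
                continue
            first = second = third = 1
            for j in column:
                if math.isnan(j):
                    continue
                if abs(j - i) <= rng:
                    first += 1
                if abs(j - i) <= 2 * rng:
                    second += 1
                if abs(j - i) <= 3 * rng:
                    third += 1
            index = 0.0
            if first != 1:
                index = 1.0 / math.log(first)  # natural log assumed
            if second != 1:
                index += 1.0 / math.log(second)
            if third != 1:
                index += 1.0 / math.log(third)
            indices.append(index)
        return indices

    # Illustrative usage: corrupted values have fewer close neighbours and
    # hence larger indices, so the top p% of scores are flagged.
    data = [10.2, 9.8, 10.1, 10.4, 55.0, 9.9, float("nan")]
    clean = [v for v in data if not math.isnan(v)]
    rng = (sum(clean) / len(clean)) / 4        # range = mean / 4
    scores = paacda_indices(data, rng)
    p = 0.30                                   # assumed corruption fraction
    k = max(1, int(p * len(data)))
    print(sorted(range(len(data)), key=lambda t: scores[t], reverse=True)[:k])

Note that the self-comparison (j = i) makes first at least 2 for any present value, keeping log(first) strictly positive; the missing entry and the isolated value 55.0 receive the largest indices and are flagged.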
V. RESULTS AND DISCUSSION
Several experiments were carried out on the various sizes and corruption variations of the synthetically created dataset. The study primarily focuses on 2 types of datasets: clustered and linear.
Each of the aforementioned datasets was subsequently examined in various sizes:
• Small Dataset - 10000 values
• Medium Dataset - 40000 values
• Medium-Large Dataset - 75000 values
Again, different corruption levels were explored for each category of dataset size:
• 20%
• 40%
• 60%
After extensive experiments, the following conclusions were drawn. Table 3 shows the accuracy values of the top performing models. The rest of the results can be found in the appendix.
A. RESULTS FOR CLUSTERED DATA
PAACDA, HBOS and MAD perform best in this setting, with accuracy values of 99.74%, 95.05% and 94.46% respectively. The middling performers include COPOD, GMM, LUNAR, Elliptic Envelope, K-Means clustering, ECOD and Isolation Forest, with accuracy values of 92.43%, 91.95%, 87.01%, 72.17%, 86.06%, 82.71% and 82.37% respectively. One-Class SVM, DeepSVDD, PCA, ROD, LOF and DBSCAN, with accuracies of 76.82%, 72.25%, 72.53%, 62.71%, 59.47% and 39.60% respectively, generally performed the worst, although the results varied depending on the degree of corruption and the size of the sample.
There were several noticeable variations as the corruption rate rose from 20% to 60%, including some models that did well at smaller sizes but poorly as the size expanded. PAACDA, however, consistently demonstrated its superiority over the competition with unwavering accuracy.
TABLE 3. Accuracy values of the top performing algorithms.
FIGURE 7. Depicts results for the clustering data for small dataset and corruption rates 20%, 40% and 60%.
FIGURE 8. Depicts results for the clustering data for medium dataset and corruption rates 20%, 40% and 60%.
FIGURE 9. Depicts results for the clustering data for medium-large dataset and for corruption rates 20%, 40% and 60%.
FIGURE 10. Depicts results for the linear data for small dataset and corruption rates 20%, 40% and 60%.
FIGURE 11. Depicts results for the linear data for medium dataset and corruption rates 20%, 40% and 60%.
FIGURE 12. Depicts results for the linear data for medium-large dataset and for corruption rates 20%, 40% and 60%.
TABLE 4. Results of realistic dataset.
Furthermore, the observed pattern shows clearly that the PAACDA model's performance declined as the extent of corruption rose, as shown in FIGURES 7, 8 and 9. Although the hierarchy of model performance was, in general, not greatly affected by the amount of corruption, since the same models performed best and worst, the accuracy figures reduced substantially as the level of corruption rose.
B. RESULTS FOR LINEAR DATA
The same findings as in the previous instance were reached. PAACDA fared better in this instance as well, with an accuracy of 99.94%. Most of the previously higher-performing models, including HBOS, MAD, COPOD and GMM, did better in this instance, with accuracies of 95.00%, 94.77%, 92.27% and 92.15% respectively. K-Means clustering, LUNAR, Isolation Forest, ECOD and DeepSVDD fared moderately, with accuracies of 86.70%, 86.87%, 82.22%, 82.83% and 76.25% respectively. Results were better for models that were more geared toward linear data. Once again, PCA, One-Class SVM, ROD, LOF and DBSCAN clustering were the worst performers, with accuracies of 73.01%, 72.28%, 62.83%, 58.79% and 43.20% respectively. As the dataset size changed, there were no appreciable changes in accuracy.
TABLE 5. Accuracy values for clustering data.
TABLE 6. Accuracy values for linear data.
However, just as in the previous instance, performance suffered as the amount of corruption rose.
Slightly different trends were observed for the other metrics, such as Precision, Recall, Sensitivity and F1 score, for both the clustered and linear datasets. Several models that previously had moderate or semi-moderate accuracy now showed either high precision, recall or F1 score, even though the ordering hierarchy remained relatively constant, as these models were geared towards linear data when compared to clustered data, as shown in FIGURES 10, 11 and 12.
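As a hedged illustration of how such metrics can be derived, the snippet below computes a confusion matrix and the four scores with scikit-learn. The tiny label vectors are placeholders (1 marks a corrupted instance) and are not taken from the paper's experiments.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Placeholder labels: 1 = corrupted, 0 = clean. In the experiments,
    # y_true would come from the synthetic corruption step and y_pred from
    # thresholding the detector's scores.
    y_true = [0, 0, 1, 0, 1, 0, 1, 0]
    y_pred = [0, 0, 1, 0, 0, 0, 1, 1]

    print(confusion_matrix(y_true, y_pred))
    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))   # a.k.a. sensitivity
    print("f1 score :", f1_score(y_true, y_pred))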
TABLE 7. Recall values for clustering data.
TABLE 8. Recall values for linear data.
C. RESULTS FOR REALISTIC DATA
The same set of experiments with 16 different models was carried out on a realistic existing dataset with outliers. PAACDA performed the best, with 99.75% accuracy. Unlike on the synthetic data, the close competitors of the proposed model were not HBOS and MAD but COPOD and Elliptic Envelope, with accuracies of 99.12% and 98.50% respectively. Every model performed decently well except K-Means, which had an accuracy of 7.74%, as it is not suitable for all kinds of datasets. Further, for each of the above models, other metrics such as precision, recall, sensitivity and F1 score are tabulated in Table 4.
TABLE 9. Precision values for clustering data.
TABLE 10. Precision values for linear data.
D. LIMITATIONS AND CONSTRAINTS
The proposed methodology nevertheless has some limitations. PAACDA requires the corruption percentage to be specified as one of its parameters, which in turn requires parameter tuning. In addition, PAACDA works well only with univariate data, unlike ROD, GMM and HBOS, which can handle multivariate data.
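Although PAACDA itself is univariate, one hypothetical workaround is to run it column by column over a data frame. The sketch below assumes the paacda_indices function from the earlier transcription and an assumed per-column corruption fraction p; it is a sketch of one possible extension, not part of the proposed method.

    import numpy as np
    import pandas as pd

    def paacda_frame(df, p):
        # Boolean mask of flagged cells, one univariate pass per numeric column.
        flags = pd.DataFrame(False, index=df.index, columns=df.columns)
        for col in df.select_dtypes(include=[np.number]).columns:
            values = df[col].tolist()
            scores = paacda_indices(values, df[col].mean(skipna=True) / 4)
            k = max(1, int(p * len(values)))
            top = sorted(range(len(values)), key=lambda t: scores[t],
                         reverse=True)[:k]
            flags.loc[df.index[top], col] = True
        return flags

Treating each column independently mirrors the limitation noted above: cross-column structure that a truly multivariate detector such as ROD or GMM exploits is ignored.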
TABLE 11. Sensitivity values for clustering data.
TABLE 12. Sensitivity values for linear data.
VI. CONCLUSION AND FUTURE SCOPE
Data that is reliable and accurate is essential for conducting effective research, because faulty and untrustworthy data produces erroneous or false results. Inadvertently entering incorrect data into a computer will still produce an output, which could be fatal in fields such as healthcare and defence. Data might become corrupted while being written, edited, or transferred to another drive. Additionally, a virus can damage files; usually this is done on purpose to harm crucial system files. Finding the outliers silently present in a dataset is only half the issue: data with high rates of corruption can seriously impair model accuracy and the outcomes of data analytics. In this case, verifying accuracy is a necessary requirement for validating information from its sources, since it establishes whether the information gathered is trustworthy. Therefore, it is crucial to check the accuracy of any survey.
TABLE 13. F1-Score values for clustering data.
In this article, we first introduce the primary concepts of outlier detection and how these models and techniques can be used to detect corrupted data. Then, we encapsulate the quality-improvement approaches to data corruption detection and split the data into two categories based on behaviour, linearly distributed data and clustered data, evaluated on 3 highly structured synthetic datasets: small, medium and medium-large. The results showed that PAACDA outperformed the other algorithms, with an accuracy of 96.35% for clustered data and 99.04% for linear data. Lastly, we lay out an experiment-based comparison of multiple cutting-edge quality-improvement approaches using a plethora of quality evaluation metrics; the authors have combined the findings of various other probabilistic and statistical models and explained how the novel PAACDA algorithm was used to obtain the desired results on the data. With accuracy values of 95.05% and 94.46% respectively, HBOS and MAD are among the other top performers on the clustered dataset. COPOD, GMM, LUNAR, Elliptic Envelope, K-Means clustering, ECOD and Isolation Forest are among the middle performers, with accuracy values of 92.43%, 91.95%, 87.01%, 72.17%, 86.06%, 82.71% and 82.37% respectively. One-Class SVM, DeepSVDD, PCA, ROD, LOF and DBSCAN performed the worst, with accuracies of 76.82%, 72.25%, 72.53%, 62.71%, 59.47% and 39.60% respectively. Most of the previously higher-performing models, including HBOS, MAD, COPOD and GMM, performed better on the linear dataset, with accuracies of 95.00%, 94.77%, 92.27% and 92.15% respectively. Accuracy rates for K-Means clustering, LUNAR, Isolation Forest, ECOD and DeepSVDD were 86.70%, 86.87%, 82.22%, 82.83% and 76.25% respectively. Models that were more geared toward linear data produced better results. PCA, One-Class SVM, ROD, LOF and DBSCAN clustering performed the worst, with accuracies of 73.01%, 72.28%, 62.83%, 58.79% and 43.20% respectively. There were no discernible changes in accuracy as the dataset size increased. However, as in the previous case, performance suffered as the level of corruption increased.
To the authors' knowledge, this paper is the first to systematically address the detection of corrupted data from different aspects such as data distribution, dataset size and variations in corruption rate. We reviewed most of the relevant papers published in well-reputed libraries. With an accuracy of 96.35% for clustered data and 99.04% for linear data, the PAACDA algorithm exceeds the other models. In this work an exhaustive review of many unsupervised and probabilistic models was conducted. The other top-performing algorithms are the histogram-based outlier detection model, K-Means clustering, Elliptic Envelope outlier detection and Isolation Forest.
This study identified the PAACDA algorithm as one of the better methods and provided a comprehensive compilation of numerous alternative approaches for solving the problem of correctly differentiating corrupted data in a dataset.
TABLE 14. F1-Score values for linear data.
Future work in this field should focus on the following aspects: 1. More algorithmic research could be explored to find tainted values that stray only slightly from the original values. 2. The current effort leans towards outliers and focuses mostly on the detection of tainted data. 3. Additional studies could address recovering the original data using tools such as backpropagation and GANs (Generative Adversarial Networks); when the output feature is known, backpropagation can be used to reconstruct the input features. 4. Missing values can be filled in using GANs. Furthermore, traditional restoration models are comparatively complex, which limits the scope for pragmatic studies and applications. The potential of these strategies is genuinely tremendous, and listing them only scratches the surface. This study can be expanded further to cover categorical and even image datasets in addition to numerical data. 5. In the realm of image collections, GANs have a wide range of uses, particularly for identifying fake images and producing deepfakes.
APPENDIX
The entire set of tables under the results and discussion section has been included in Appendix A below.
APPENDIX A TABLES
Additional tables that support the experiment can be found here. All experimental results, including those for the small, medium and medium-large datasets at all corruption rates (20%, 40%, 60%), are included. Tables 5-14 report the accuracy, recall, precision, sensitivity and F1-score for both linear and clustered data.
REFERENCES
[1] E. Burgdorf, ‘‘Predicting the impact of data corruption on the operation of
cyber-physical systems,’’ Missouri Univ. Sci. Technol., Rolla, MO, USA,
Tech. Rep. 27929030, 2017.
[2] V. Chandola, A. Banerjee, and V. Kumar, ‘‘Anomaly detection: A survey,’’
ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, Jul. 2009.
[3] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. London,
U.K.: Pearson, 2016.
[4] H. M. Touny, A. S. Moussa, and A. S. Hadi, ‘‘Fuzzy multivariate outliers
with application on BACON algorithm,’’ in Proc. IEEE Int. Conf. Fuzzy
Syst. (FUZZ-IEEE), Jul. 2020, pp. 1–7.
[5] S. Thudumu, P. Branch, J. Jin, and J. Singh, ‘‘A comprehensive survey
of anomaly detection techniques for high dimensional big data,’’ J. Big
Data, vol. 7, no. 1, pp. 1–30, 2020, doi: 10.1186/s40537-020-00320-x.
[6] O. J. Oyelade, O. O. Oladipupo, and I. C. Obagbuwa, ‘‘Application
of K means clustering algorithm for prediction of students academic
performance,’’ 2010, arXiv:1002.2425.
[7] H. L. Sari, D. Suranti, and L. N. Zulita, ‘‘Implementation of kmeans clustering method for electronic learning model,’’ J. Phys.,
Conf. Ser., vol. 930, Dec. 2017, Art. no. 012021, doi: 10.1088/17426596/930/1/012021.
[8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, ‘‘A density-based algorithm
for discovering clusters in large spatial databases with noise,’’ in Proc.
KDD, vol. 96, Jan. 1996, pp. 226–231.
[9] D. Deng, ‘‘DBSCAN clustering algorithm based on density,’’ in Proc. 7th
Int. Forum Electr. Eng. Autom. (IFEEA), Sep. 2020, pp. 949–953.
[10] F. T. Liu, K. M. Ting, and Z.-H. Zhou, ‘‘Isolation forest,’’ in Proc. 8th
IEEE Int. Conf. Data Mining, Dec. 2008, pp. 413–422.
[11] R. Gao, T. Zhang, S. Sun, and Z. Liu, ‘‘Research and improvement of
isolation forest in detection of local anomaly points,’’ J. Phys., Conf.
Ser., vol. 1237, no. 5, Jun. 2019, Art. no. 052023, doi: 10.1088/17426596/1237/5/052023.
[12] M. Ashrafuzzaman, S. Das, A. A. Jillepalli, Y. Chakhchoukh, and
F. T. Sheldon, ‘‘Elliptic envelope based detection of stealthy false data
injection attacks in smart grid control systems,’’ in Proc. IEEE Symp. Ser.
Comput. Intell. (SSCI), Dec. 2020, pp. 1131–1137.
[13] C. McKinnon, J. Carroll, A. McDonald, S. Koukoura, D. Infield, and
C. Soraghan, ‘‘Comparison of new anomaly detection technique for wind
turbine condition monitoring using gearbox SCADA data,’’ Energies,
vol. 13, no. 19, p. 5152, Oct. 2020, doi: 10.3390/en13195152.
[14] M. Goldstein and A. Dengel, ‘‘Histogram-based outlier score (HBOS):
A fast unsupervised anomaly detection algorithm,’’ in Proc. KI, Poster
Demo Track, vol. 9, 2012, pp. 59–63.
[15] N. Paulauskas and A. Baskys, ‘‘Application of histogram-based outlier
scores to detect computer network anomalies,’’ Electronics, vol. 8, no. 11,
p. 1251, Nov. 2019, doi: 10.3390/electronics8111251.
[16] I. T. Jolliffe and J. Cadima, ‘‘Principal component analysis: A review
and recent developments,’’ Phil. Trans. Roy. Soc. A, Math., Phys. Eng.
Sci., vol. 374, no. 2065, Apr. 2016, Art. no. 20150202, doi: 10.1098/
rsta.2015.0202.
[17] S. Mishra, U. Sarkar, S. Taraphder, S. Datta, D. Swain, R. Saikhom,
S. Panda, and M. Laishram, ‘‘Principal component analysis,’’ Int. J.
Livestock Res., vol. 2, no. 4, pp. 433–459, 2017, doi: 10.5455/ijlr.
20170415115235.
[18] A. Karimian, Z. Yang, and R. Tron, ‘‘Rotational outlier identification in
pose graphs using dual decomposition,’’ in Computer Vision ECCV 2020.
Cham, Switzerland: Springer, 2020, pp. 391–407.
[19] Y. Almardeny, N. Boujnah, and F. Cleary, ‘‘A novel outlier detection
method for multivariate data,’’ IEEE Trans. Knowl. Data Eng., vol. 34,
no. 9, pp. 4052–4062, Sep. 2022, doi: 10.1109/tkde.2020.3036524.
[20] O. Alghushairy, R. Alsini, T. Soule, and X. Ma, ‘‘A review of local outlier
factor algorithms for outlier detection in big data streams,’’ Big Data Cognit. Comput., vol. 5, no. 1, p. 1, Dec. 2020, doi: 10.3390/bdcc5010001.
[21] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, ‘‘LOF: Identifying
density-based local outliers,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000,
pp. 93–104.
[22] L. Ruff, ‘‘Deep one-class classification,’’ in Proc. Int. Conf. Mach. Learn.,
2018, pp. 4393–4402.
[23] Z. Zhang and X. Deng, ‘‘Anomaly detection using improved deep SVDD
model with data structure preservation,’’ Pattern Recognit. Lett., vol. 148,
pp. 1–6, Aug. 2021, doi: 10.1016/j.patrec.2021.04.020.
[24] L. Adamic and E. Adar, ‘‘How to search a social network,’’ Social Netw.,
vol. 27, no. 3, pp. 187–203, 2005, doi: 10.1016/j.socnet.2005.01.007.
[25] F. Gao, K. Musial, C. Cooper, and S. Tsoka, ‘‘Link prediction methods
and their accuracy for different social networks and network metrics,’’ Sci.
Program., vol. 2015, pp. 1–13, Jan. 2015, doi: 10.1155/2015/172879.
[26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial networks,’’ Commun. ACM, vol. 63, no. 11, pp. 139–144, 2020, doi:
10.1145/3422622.
[27] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni,
‘‘Modeling tabular data using conditional GAN,’’ 2019,
arXiv:1907.00503.
[28] V. Hodge and J. Austin, ‘‘A survey of outlier detection methodologies,’’ Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, Oct. 2004, doi:
10.1007/s10462-004-4304-y.
[29] A. Patcha and J.-M. Park, ‘‘An overview of anomaly detection techniques: Existing solutions and latest technological trends,’’ Comput.
Netw., vol. 51, no. 12, pp. 3448–3470, Aug. 2007, doi: 10.1016/j.
comnet.2007.02.001.
[30] M. Jiang, P. Cui, and C. Faloutsos, ‘‘Suspicious behavior detection:
Current trends and future directions,’’ IEEE Intell. Syst., vol. 31, no. 1,
pp. 31–39, Jan. 2016, doi: 10.1109/mis.2016.5.
[31] C. O. S. Sorzano, J. Vargas, and A. P. Montano, ‘‘A survey of dimensionality reduction techniques,’’ 2014, arXiv:1403.2877.
[32] J. Gama, A. Ganguly, O. Omitaomu, R. Vatsavai, and M. Gaber, ‘‘Knowledge discovery from data streams,’’ Intell. Data Anal., vol. 13, no. 3,
pp. 403–404, May 2009, doi: 10.3233/ida-2009-0372.
[33] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, ‘‘Outlier detection for
temporal data: A survey,’’ IEEE Trans. Knowl. Data Eng., vol. 26, no. 9,
pp. 2250–2267, Sep. 2014, doi: 10.1109/tkde.2013.184.
[34] A. Heydari, M. A. Tavakoli, N. Salim, and Z. Heydari, ‘‘Detection of
review spam: A survey,’’ Expert Syst. Appl., vol. 42, no. 7, pp. 3634–3642,
2015, doi: 10.1016/j.eswa.2014.12.029.
[35] N. Jindal and B. Liu, ‘‘Review spam detection,’’ in Proc. 16th Int. Conf.
World Wide Web, May 2007, pp. 1189–1190.
[36] E. M. Knorr, R. T. Ng, and V. Tucakov, ‘‘Distance-based outliers: Algorithms and applications,’’ Very Large Data Bases J., vol. 8, nos. 3–4,
pp. 237–253, 2000, doi: 10.1007/s007780050006.
[37] S. Ramaswamy, R. Rastogi, and K. Shim, ‘‘Efficient algorithms for
mining outliers from large data sets,’’ in Proc. ACM SIGMOD Int. Conf.
Manage. Data, May 2000, pp. 93–104.
[38] C. C. Aggarwal and P. S. Yu, ‘‘Outlier detection for high dimensional
data,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, May 2001,
pp. 37–46.
[39] J. Han and M. Kamber, Data Mining Concepts and Techniques.
San Mateo, CA, USA: Morgan Kaufmann, 2001.
[40] D. Yu and G. Sheikholeslami, ‘‘A find out: Finding outliers in
very large datasets,’’ in Knowledge and Information Systems, 2002,
pp. 387–412.
[41] M. F. Jiang, S. S. Tseng, and C. M. Su, ‘‘Two-phase clustering process for
outlier detection,’’ Pattern Recognit. Lett., vol. 22, no. 6–7, pp. 691–700,
2001.
[42] C. C. Aggarwal and P. S. Yu, ‘‘An effective and efficient algorithm for high-dimensional outlier detection,’’ Int. J. Very Large Data
Bases, vol. 14, no. 2, pp. 211–221, 2005, doi: 10.1007/s00778-0040125-5.
[43] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. Chen, ‘‘ECOD:
Unsupervised outlier detection using empirical cumulative distribution
functions,’’ IEEE Trans. Knowl. Data Eng., early access, Mar. 16, 2022,
doi: 10.1109/tkde.2022.3159580.
[44] G. Dudek and J. Szkutnik, ‘‘Daily load curves in distribution networks—
Analysis of diversity and outlier detection,’’ in Proc. 18th Int. Sci. Conf.
Electr. Power Eng. (EPE), May 2017, pp. 1–5.
[45] E. Andersen, M. Chiarandini, M. Hassani, S. Janicke, P. Tampakis, and
A. Zimek, ‘‘Evaluation of probability distribution distance metrics in
traffic flow outlier detection,’’ in Proc. 23rd IEEE Int. Conf. Mobile Data
Manage. (MDM), Jun. 2022, pp. 64–69.
[46] Y. Chen, X. Dang, H. Peng, H. L. Bart, and H. L. Bart, ‘‘Outlier
detection with the kernelized spatial depth function,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 288–305, Feb. 2009, doi:
10.1109/TPAMI.2008.72.
[47] S. Lu, L. Liu, J. Li, and T. D. Le, ‘‘Effective outlier detection based on
Bayesian network and proximity,’’ in Proc. IEEE Int. Conf. Big Data (Big
Data), Dec. 2018, pp. 134–139.
[48] M. Kim, S. Jung, and S. Kim, ‘‘Fault detection method using inverse
distance weight-based local outlier factor,’’ in Proc. Int. Conf. Fuzzy
Theory Appl. (iFUZZY), Oct. 2021, pp. 1–5.
[49] X. Wang, Y. Chen, and X. L. Wang, ‘‘A centroid-based outlier detection method,’’ in Proc. Int. Conf. Comput. Sci. Comput. Intell. (CSCI),
Dec. 2017, pp. 1411–1416.
[50] M. A. Haque and H. Mineno, ‘‘Proposal of online outlier detection in
sensor data using kernel density estimation,’’ in Proc. 6th IIAI Int. Congr.
Adv. Appl. Informat. (IIAI-AAI), Jul. 2017, pp. 1051–1052.
[51] Y. Tao and D. Pi, ‘‘Unifying density-based clustering and outlier detection,’’ in Proc. 2nd Int. Workshop Knowl. Discovery Data Mining,
Jan. 2009, pp. 644–647.
[52] G. Liu, J. Pang, X. Piao, and S. Huang, ‘‘The discovery of attribute
feature cluster for any clustering result based on outlier detection technique,’’ in Proc. Int. Conf. Internet Comput. Sci. Eng., Jan. 2008,
pp. 68–72.
[53] R. Pamula, J. K. Deka, and S. Nandi, ‘‘An outlier detection method
based on clustering,’’ in Proc. 2nd Int. Conf. Emerg. Appl. Inf. Technol.,
Feb. 2011, pp. 253–256.
[54] B. Angelin and A. Geetha, ‘‘Outlier detection using clustering
techniques—K-means and K-median,’’ in Proc. 4th Int.
Conf. Intell. Comput. Control Syst. (ICICCS), May 2020,
pp. 373–378.
[55] Y. Wang, B. Dai, G. Hua, J. Aston, and D. Wipf, ‘‘Recurrent variational autoencoders for learning nonlinear generative models in the
presence of outliers,’’ IEEE J. Sel. Topics Signal Process., vol. 12, no. 6,
pp. 1615–1627, Dec. 2018, doi: 10.1109/jstsp.2018.2876995.
[56] Y. Li and H. Wu, ‘‘A clustering method based on K-means
algorithm,’’ Phys. Proc., vol. 25, pp. 1104–1109, Jan. 2012, doi:
10.1016/j.phpro.2012.03.206.
[57] Z. Li, Y. Zhao, N. Botta, C. Ionescu, and X. Hu, ‘‘COPOD: Copulabased outlier detection,’’ in Proc. IEEE Int. Conf. Data Mining (ICDM),
Nov. 2020, pp. 1118–1123.
[58] K. J. Paul and R. Harilal, ‘‘Implementation of MAD and mean absolute
deviation based smoothing algorithm for displacement data in digital
image correlation technique,’’ Indian Inst. Technol. Hyderabad, Hyderabad, India, Tech. Rep., 2014, pp. 1–6.
[59] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. Chen, ‘‘ECOD:
Unsupervised outlier detection using empirical cumulative distribution
functions,’’ IEEE Trans. Knowl. Data Eng., early access, Mar. 16, 2022,
doi: 10.1109/tkde.2022.3159580.
[60] A. Goodge, B. Hooi, S. K. Ng, and W. S. Ng, ‘‘LUNAR: Unifying local outlier detection methods via graph neural networks,’’ 2021,
arXiv:2112.05355.
[61] A. Bounsiar and M. G. Madden, ‘‘One-class support vector machines
revisited,’’ in Proc. Int. Conf. Inf. Sci. Appl. (ICISA), May 2014, pp. 1–4.
[62] Welcome to Python.org. Accessed: Dec. 24, 2022. [Online]. Available:
http://www.python.org
[63] F. Chollet, Deep Learning for Humans. Mountain View, CA, USA: Keras,
2017.
[64] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, ‘‘LOF: Identifying
density-based local outliers,’’ in Proc. ACM SIGMOD Int. Conf. Manage.
Data (SIGMOD), 2000, pp. 93–104.
[65] Z. Cheng, C. Zou, and J. Dong, ‘‘Outlier detection using isolation forest
and local outlier factor,’’ in Proc. Conf. Res. Adapt. Convergent Syst.,
Sep. 2019, pp. 161–168.
[66] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, ‘‘LoOP: Local outlier
probabilities,’’ in Proc. 18th ACM Conf. Inf. Knowl. Manage., 2009,
pp. 1649–1652.
[67] R. Gupta and K. Pandey, ‘‘Density based outlier detection technique,’’
in Advances in Intelligent Systems and Computing, New Delhi, India:
Springer, 2016, pp. 51–58.
[68] O. Alghushairy, R. Alsini, T. Soule, and X. Ma, ‘‘A review of local outlier
factor algorithms for outlier detection in big data streams,’’ Big Data Cognit. Comput., vol. 5, no. 1, p. 1, Dec. 2020, doi: 10.3390/bdcc5010001.
[69] H. Xu, L. Zhang, P. Li, and F. Zhu, ‘‘Outlier detection algorithm based on
k-nearest neighbors-local outlier factor,’’ J. Algorithms Comput. Technol.,
vol. 16, Jan. 2022, Art. no. 174830262210781.
[70] A. Liu and J. Zhang, ‘‘An outlier mining algorithm based on local
weighted k-density,’’ in Proc. 8th Int. Conf. Fuzzy Syst. Knowl. Discovery
(FSKD), Jul. 2011, pp. 1504–1508.
[71] J. Y. Lee and R. Tukhvatov, ‘‘Evaluations of similarity measures on VK
for link prediction,’’ Data Sci. Eng., vol. 3, no. 3, pp. 277–289, 2018, doi:
10.1007/s41019-018-0073-5.
[72] E. M. Jordaan and G. F. Smits, ‘‘Robust outlier detection using SVM
regression,’’ in Proc. IEEE Int. Joint Conf. Neural Netw., Jul. 2004,
pp. 2017–2022.
[73] X.-Y. Yang, J. Liu, M.-Q. Zhang, and K. Niu, ‘‘A new multi-class SVM
algorithm based on one-class SVM,’’ in Computational Science ICCS
2007. Berlin, Germany: Springer, 2007, pp. 677–684.
[74] E. H. Budiarto, A. E. Permanasari, and S. Fauziati, ‘‘Unsupervised
anomaly detection using K-means, local outlier factor and one class
SVM,’’ in Proc. 5th Int. Conf. Sci. Technol. (ICST), Jul. 2019,
pp. 1–5.
[75] M. Amer, M. Goldstein, and S. Abdennadher, ‘‘Enhancing one-class
support vector machines for unsupervised anomaly detection,’’ in Proc.
ACM SIGKDD Workshop Outlier Detection Description, Aug. 2013,
pp. 8–15.
[76] H. Lukashevich, S. Nowak, and P. Dunker, ‘‘Using one-class SVM
outliers detection for verification of collaboratively tagged image training sets,’’ in Proc. IEEE Int. Conf. Multimedia Expo., Jun. 2009,
pp. 682–685.
[77] J. A. Hartigan and M. A. Wong, ‘‘Algorithm AS 136: A K-means
clustering algorithm,’’ Appl. Statist., vol. 28, no. 1, p. 100, 1979, doi:
10.2307/2346830.
[78] D. Marutho, S. H. Handaka, E. Wijaya, and Muljono, ‘‘The determination
of cluster number at k-mean using elbow method and purity evaluation
on headline news,’’ in Proc. Int. Seminar Appl. Technol. Inf. Commun.,
Sep. 2018, pp. 533–538.
[79] T. M. Kodinariya and P. R. Makwana, ‘‘Review on determining number of cluster in K-means clustering,’’ Int. J., vol. 1, no. 6, pp. 90–95,
2013.
[80] C. Xiong, Z. Hua, K. Lv, and X. Li, ‘‘An improved K-means text clustering algorithm by optimizing initial cluster centers,’’ in Proc. 7th Int. Conf.
Cloud Comput. Big Data (CCBD), Nov. 2016, pp. 265–268.
[81] A. Kuraria, N. Jharbade, and M. Soni, ‘‘Centroid selection process using
WCSS and elbow method for K-mean clustering algorithm in data mining,’’ Int. J. Sci. Res. Sci., Eng. Technol., pp. 190–195, Dec. 2018, doi:
10.32628/ijsrset21841122.
[82] R. Gao, T. Zhang, S. Sun, and Z. Liu, ‘‘Research and improvement
of isolation forest in detection of local anomaly points,’’ J. Phys.
Conf. Ser., vol. 1237, no. 5, 2019, Art. no. 052023, doi: 10.1088/17426596/1237/5/052023.
[83] M. Tokovarov and P. Karczmarek, ‘‘A probabilistic generalization of
isolation forest,’’ Inf. Sci., vol. 584, pp. 433–449, Jan. 2022, doi:
10.1016/j.ins.2021.10.075.
[84] W. S. Al Farizi, I. Hidayah, and M. N. Rizal, ‘‘Isolation forest based
anomaly detection: A systematic literature review,’’ in Proc. 8th Int.
Conf. Inf. Technol., Comput. Electr. Eng. (ICITACEE), Sep. 2021,
pp. 118–122.
[85] F. T. Liu, K. M. Ting, and Z.-H. Zhou, ‘‘Isolation forest,’’ in Proc. 8th
IEEE Int. Conf. Data Mining, Dec. 2008, pp. 413–422.
[86] M. U. Togbe, ‘‘Anomaly detection for data streams based on isolation forest using scikit-multiflow,’’ in Computational Science and
Its Applications ICCSA 2020. Cham, Switzerland: Springer, 2020,
pp. 15–30.
[87] S. Wibisono, M. T. Anwar, A. Supriyanto, and I. H. A. Amin, ‘‘Multivariate weather anomaly detection using DBSCAN clustering algorithm,’’
J. Phys., Conf. Ser., vol. 1869, no. 1, Apr. 2021, Art. no. 012077, doi:
10.1088/1742-6596/1869/1/012077.
[88] M. Celik, F. Dadaser-Celik, and A. S. Dokuz, ‘‘Anomaly detection in
temperature data using DBSCAN algorithm,’’ in Proc. Int. Symp. Innov.
Intell. Syst. Appl., Jun. 2011, pp. 91–95.
[89] D. Birant and A. Kut, ‘‘Spatio-temporal outlier detection in large
databases,’’ in Proc. 28th Int. Conf. Inf. Technol. Interface, 2006,
pp. 291–297.
[90] Z. Akbari and R. Unland, ‘‘Automated determination of the input parameter of DBSCAN based on outlier detection,’’ in IFIP Advances in Information and Communication Technology. Cham, Switzerland: Springer,
2016, pp. 280–291.
[91] T. Manh Thang and J. Kim, ‘‘The anomaly detection by using DBSCAN
clustering with multiple parameters,’’ in Proc. Int. Conf. Inf. Sci. Appl.,
Apr. 2011, pp. 1–5.
[92] J. Dugundji, ‘‘Envelopes and pre-envelopes of real waveforms,’’
IRE Trans. Inf. Theory, vol. 4, no. 1, pp. 53–57, Mar. 1958, doi:
10.1109/tit.1958.1057435.
[93] P. Mahalanobis, ‘‘On the generalized distance in statistics,’’
Tech. Rep., 1936.
[94] G. J. McLachlan, ‘‘Mahalanobis distance,’’ Resonance, vol. 4, no. 6,
pp. 20–26, Jun. 1999, doi: 10.1007/bf02834632.
[95] M. D’Agostino and V. Dardanoni, ‘‘What’s so special about Euclidean
distance?: A characterization with applications to mobility and spatial
voting,’’ Social Choice Welfare, vol. 33, no. 2, pp. 211–233, Aug. 2009,
doi: 10.1007/s00355-008-0353-5.
[96] R. Hidayat, I. T. R. Yanto, A. A. Ramli, M. F. M. Fudzee, and
A. S. Ahmar, ‘‘Generalized normalized Euclidean distance based fuzzy
soft set similarity for data classification,’’ Comput. Syst. Sci. Eng., vol. 38,
no. 1, pp. 119–130, 2021, doi: 10.32604/csse.2021.015628.
[97] E. Müller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm, ‘‘Outlier ranking
via subspace analysis in multiple views of the data,’’ in Proc. IEEE 12th
Int. Conf. Data Mining, Dec. 2012, pp. 529–538.
[98] C. C. Aggarwal, ‘‘High-dimensional outlier detection: The subspace
method,’’ in Outlier Analysis. Cham, Switzerland: Springer, 2017,
pp. 149–184.
[99] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, ‘‘Outlier detection
in arbitrarily oriented subspaces,’’ in Proc. IEEE 12th Int. Conf. Data
Mining, Dec. 2012, pp. 379–388.
[100] L. Parsons, E. Haque, and H. Liu, ‘‘Subspace clustering for high dimensional data: A review,’’ ACM SIGKDD Explorations Newslett., vol. 6,
no. 1, pp. 90–105, Jun. 2004.
[101] Y. Almardeny, N. Boujnah, and F. Cleary, ‘‘A novel outlier detection
method for multivariate data,’’ IEEE Trans. Knowl. Data Eng., vol. 34,
no. 9, pp. 4052–4062, Sep. 2022, doi: 10.1109/tkde.2020.3036524.
[102] Q. Wang, Q. Gao, X. Gao, and F. Nie, ‘‘Angle principal component analysis,’’ in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017,
pp. 2936–2942.
[103] S. Dray, ‘‘On the number of principal components: A test of dimensionality based on measurements of similarity between matrices,’’ Comput. Statist. Data Anal., vol. 52, no. 4, pp. 2228–2237, Jan. 2008, doi:
10.1016/j.csda.2007.07.015.
[104] H. T. Eastment and W. J. Krzanowski, ‘‘Cross-validatory choice of the
number of components from a principal component analysis,’’ Technometrics, vol. 24, no. 1, p. 73, 1982, doi: 10.2307/1267581.
[105] J. Gower, ‘‘Statistical methods of comparing different multivariate analyses of the same data,’’ Math. Archaeolog. historical Sci., vol. 138, p. 149,
Jan. 1971.
[106] R. J. Harris, A Primer of Multivariate Statistics. Mahway, NJ, USA:
Lawrence, 2001.
[107] W.-C. Chang, C.-P. Lee, and C.-J. Lin, ‘‘A revisit to support vector data
description,’’ Dept. Comput. Sci., Nat. Taiwan Univ., Taipei, Taiwan,
Tech. Rep., 2013.
[108] C. Liu and K. Gryllias, ‘‘A deep support vector data description method
for anomaly detection in helicopters,’’ in Proc. PHM Soc. Eur. Conf.,
vol. 6, 2021, p. 9.
[109] S. Kim, Y. Choi, and M. Lee, ‘‘Deep learning with support vector data
description,’’ Neurocomputing, vol. 165, pp. 111–117, Oct. 2015, doi:
10.1016/j.neucom.2014.09.086.
[110] D. M. J. Tax and R. P. W. Duin, ‘‘Support vector data description,’’ Mach.
Learn., vol. 54, no. 1, pp. 45–66, Jan. 2004.
[111] K. Song, Y. Qin, B. Xu, N. Zhang, and J. Yang, ‘‘Study on outlier
detection method of the near infrared spectroscopy analysis by probability
metric,’’ Spectrochimica Acta A, Mol. Biomolecular Spectrosc., vol. 280,
Nov. 2022, Art. no. 121473, doi: 10.1016/j.saa.2022.121473.
[112] S. N. Lahiri, M. S. Kaiser, N. Cressie, and N.-J. Hsu, ‘‘Prediction of
spatial cumulative distribution functions using subsampling,’’ J. Amer.
Stat. Assoc., vol. 94, no. 445, p. 86, 1999, doi: 10.2307/2669680.
[113] W. W. Esty and J. D. Banfield, ‘‘The box-percentile plot,’’ J. Stat. Softw.,
vol. 8, no. 17, pp. 1–14, 2003, doi: 10.18637/jss.v008.i17.
[114] C. Reimann, P. Filzmoser, and R. G. Garrett, ‘‘Background and
threshold: Critical comparison of methods of determination,’’
Sci. Total Environ., vol. 346, nos. 1–3, pp. 1–16, Jun. 2005, doi:
10.1016/j.scitotenv.2004.11.023.
[115] R. Chellappa, ‘‘Gaussian mixture models,’’ in Encyclopedia Biometrics.
Boston, MA, USA: Springer, 2009, pp. 659–663.
[116] D. W. Scott, ‘‘Outlier detection and clustering by partial
mixture modeling,’’ in COMPSTAT 2004 Proceedings in
Computational Statistics, Heidelberg, Germany: Physica-Verlag, 2004,
pp. 453–464.
[117] L. Li, J. Hansman, R. Palacios, and R. Welsch, ‘‘Anomaly detection via
a Gaussian mixture model for flight operation and safety monitoring,’’
Transp. Res. C, Emerg. Technol., vol. 64, pp. 45–57, Mar. 2016, doi:
10.1016/j.trc.2016.01.007.
[118] A. Reddy, M. Ordway-West, M. Lee, M. Dugan, J. Whitney, R. Kahana,
B. Ford, J. Muedsam, A. Henslee, and M. Rao, ‘‘Using Gaussian
mixture models to detect outliers in seasonal univariate network traffic,’’ in Proc. IEEE Secur. Privacy Workshops (SPW), May 2017,
pp. 229–234.
[119] N. Ding, H. Ma, H. Gao, Y. Ma, and G. Tan, ‘‘Real-time anomaly
detection based on long short-term memory and Gaussian mixture
model,’’ Comput. Electr. Eng., vol. 79, Oct. 2019, Art. no. 106458, doi:
10.1016/j.compeleceng.2019.106458.
[120] D. C. Howell, ‘‘Median absolute deviation,’’ in Encyclopedia of Statistics
in Behavioral Science. Hoboken, NJ, USA: Wiley, 2005.
[121] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, ‘‘Detecting outliers:
Do not use standard deviation around the mean, use absolute deviation around the median,’’ J. Experim. Social Psychol., vol. 49, no. 4,
pp. 764–766, Jul. 2013, doi: 10.1016/j.jesp.2013.03.013.
[122] K. Kannan, K. Senthamarai, and S. Manoj, ‘‘Labeling methods for
identifying outliers,’’ Int. J. Statist. Syst., vol. 10, no. 2, pp. 231–238,
2015.
[123] P. J. Rousseeuw and C. Croux, ‘‘Alternatives to the median absolute
deviation,’’ J. Amer. Stat. Assoc., vol. 88, no. 424, p. 1273, 1993, doi:
10.2307/2291267.
[124] J. Yang, S. Rahardja, and P. Fränti, ‘‘Outlier detection: How to threshold
outlier scores?’’ in Proc. Int. Conf. Artif. Intell., Inf. Process. Cloud
Comput., Dec. 2019, pp. 1–6.
[125] A. Kharitonov, A. Nahhas, M. Pohl, and K. Turowski, ‘‘Comparative
analysis of machine learning models for anomaly detection in manufacturing,’’ Proc. Comput. Sci., vol. 200, pp. 1288–1297, Jan. 2022, doi:
10.1016/j.procs.2022.01.330.
[126] J.-D. Fermanian, D. Radulovic, and M. Wegkamp, ‘‘Weak convergence
of empirical copula processes,’’ Bernoulli, vol. 10, no. 5, pp. 847–860,
Oct. 2004, doi: 10.3150/bj/1099579158.
[127] X. Wang, L. Wang, J. Wang, K. Sun, and Q. Wang, ‘‘Hyperspectral
anomaly detection via background purification and spatial difference
enhancement,’’ IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1–5, 2022,
doi: 10.1109/lgrs.2021.3140087.
[128] M. L. Katz, ‘‘The probability in the tail of a distribution,’’ Ann.
Math. Statist., vol. 34, no. 1, pp. 312–318, Mar. 1963, doi: 10.1214/
aoms/1177704268.
[129] A. Agarwal and N. Gupta, ‘‘Comparison of outlier detection techniques
for structured data,’’ 2021, arXiv:2106.08779.
[130] J. Zhang, ‘‘Advancements of outlier detection: A survey,’’ ICST
Trans. Scalable Inf. Syst., vol. 13, no. 1, p. e2, Feb. 2013, doi:
10.4108/trans.sis.2013.01-03.e2.
[131] N. Paulauskas and A. Baskys, ‘‘Application of histogram-based outlier
scores to detect computer network anomalies,’’ Electronics, vol. 8, no. 11,
p. 1251, Nov. 2019, doi: 10.3390/electronics8111251.
[132] Y. Wang, S. Zhu, and C. Li, ‘‘Research on an ensemble anomaly
detection algorithm,’’ J. Phys., Conf. Ser., vol. 1314, no. 1, Oct. 2019,
Art. no. 012198, doi: 10.1088/1742-6596/1314/1/012198.
[133] M. Gebski and R. K. Wong, ‘‘An efficient histogram method for outlier
detection,’’ in Advances in Databases: Concepts, Systems and Applications. Berlin, Germany: Springer, 2007, pp. 176–187.
[134] X. Zhao, Y. Zhang, S. Xie, Q. Qin, S. Wu, and B. Luo, ‘‘Outlier
detection based on residual histogram preference for geometric multimodel fitting,’’ Sensors, vol. 20, no. 11, p. 3037, May 2020, doi:
10.3390/s20113037.
[135] L. Pappalardo, G. Rossetti, and D. Pedreschi, ‘‘‘How well do we know
each other?’ detecting tie strength in multidimensional social networks,’’
in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining, Aug. 2012,
pp. 1040–1045.
[136] L. A. Adamic and E. Adar, ‘‘Friends and neighbors on the web,’’
Soc. Netw., vol. 25, no. 3, pp. 211–230, 2003, doi: 10.1016/s03788733(03)00009-1.
CHARVI BANNUR (Student Member, IEEE) is
currently pursuing the B.Tech. degree in computer
science engineering with People’s Education Society (PES), Bengaluru, India.
Since 2021, she has been a Research Assistant
with the Research Laboratory of PES. She is the
author of multiple research articles in the fields
of machine learning and artificial intelligence. Her
research interests include deep neural networks,
graph theory applications in the realm of social
network analysis, data mining, and information retrieval.
Ms. Bannur has received numerous academic accolades and scholarships
at her university and strives toward academic excellence. She was a recipient
of the 7th IEEE International Conference on Recent Advances and Innovations in Engineering Best Paper Award, in 2022.
CHAITRA BHAT (Student Member, IEEE) is currently pursuing the B.Tech. degree in computer
science and engineering with People’s Education
Society (PES) University, Bengaluru.
Since 2021, she has been a Research Assistant
with PES University, in the field of natural language processing. She is the author of multiple
research articles in the fields of machine learning
and graph theory. Her research interests include
machine learning, deep learning, image classification and recognition, graph theory and applications, and natural language
processing.
Ms. Bhat received the Best Paper Award at the 7th IEEE International
Conference on Recent Advances and Innovations in Engineering (ICRAIE),
in 2022.
KUSHAGRA SINGH (Student Member, IEEE) is
currently pursuing the bachelor’s degree in computer science with People’s Education Society
(PES) University, Bengaluru. In addition to being
a motivated learner, who places a strong focus
on academic performance, he favors exploring a
number of areas, particularly those related to computer science. He has initiated and collaborated in
numerous projects that make use of his knowledge
in machine learning, deep learning, blockchain
technology, and big data. He has taken part in numerous hackathons and
extracurricular events, earning honors in many of them. He successfully
completed the IEEE Bangalore Chapter's Internship Program.
His research article ‘‘Data Regeneration From Poisoned Dataset,’’ written
along with his colleagues, was presented at the ICRAIE 2022 Conference
and was chosen for the Best Paper Award.
MRITYUNJAY DODDAMANI received the Ph.D.
degree in mechanical engineering from the
National Institute of Technology Karnataka,
Surathkal, in 2012. He is currently an Associate
Professor with the School of Mechanical and
Materials Engineering, Indian Institute of Technology (IIT), Mandi, Himachal Pradesh, India. He has
published more than 75 articles in the areas of
materials development for specified applications,
additive manufacturing, and machine learning.
He was funded by various funding agencies in India for his research works.
SHRIRANG AMBAJI KULKARNI (Senior Member, IEEE) received the B.E. degree in computer
science and engineering from Karnatak University,
Dharwad, in 2000, the M.Tech. degree in computer
science and engineering from Visvesvaraya Technological University, Belgaum, in 2004, and the
Ph.D. degree from the Faculty of Computer and
Information Science, Visvesvaraya Technological
University, in 2012.
From 2001 to 2018, he has worked in various
capacities as an assistant professor, an associate professor, and a professor in
multiple universities and engineering institutes in India. From 2021 to 2022,
he was a Postdoctoral Research Fellow with the Health Informatics Laboratory, University of Central Florida, USA. He is currently an Associate Professor with the Department of Computer Science and Engineering, National
Institute of Engineering, Mysore, India. He is also a Project Consultant with
Mobirey Technologies Pvt. Ltd., USA, on AI/ML technologies applied to the
financial domain. He is the author of three books, more than 40 articles, and
many patents under review. He is a Senior Member of ACM.