ADII11 Metode Deteksi Outlier
ADII11 Metode Deteksi Outlier
ADII11 Metode Deteksi Outlier
2
Supervised and Unsupervised
Approaches
3
-
4
-
Note:
This slides are based on the additional material provided with the textbook that we use: J. Han, M. Kamber and J. Pei, “Data Mining:
Concepts and Techniques” and P. Tan, M. Steinbach, and V. Kumar "Introduction to Data Mining“.
Outlier Analysis Method
Statistical Approach
● Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as
outliers
7
Parametric Methods I: Detection Univariate Outliers Based on
Normal Distribution
○ Taking derivatives with respect to μ and σ2, we derive the following
maximum likelihood estimates
8
Parametric Methods I: The Grubb’s Test
10
Parametric Methods II: Detection of Multivariate Outliers
11
Parametric Methods III: Using Mixture of Parametric Distributions
12
Parametric Methods III: Using Mixture of Parametric Distributions
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
● Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from
data
● An object o is an outlier if it does not belong to any cluster
13
Non-Parametric Methods: Detection Using Histogram
14
Non-Parametric Methods: Detection Using Histogram
transactions
○ A transaction in the amount of $7,500 is an outlier, since only
positive
○ Too big bin size → outliers in some frequent bins, false
negative
15
Non-Parametric Methods: Detection Using Histogram
● Solution:
○ Adopt kernel density estimation to estimate the probability
16
Proximity-Based Approaches: Distance-Based vs. Density-Based Outlier
Detection
● Intuition: Objects that are far away from the others are
outliers
● Assumption of proximity-based approach: The proximity of
an outlier deviates significantly from that of most of the
others in the data set
● Two types of proximity-based outlier detection methods
○ Distance-based outlier detection: An object o is an outlier if its
neighborhood does not have enough other points
○ Density-based outlier detection: An object o is an outlier if its
density is relatively much lower than that of its neighbors
17
Distance-Based Outlier Detection
18
Statistical Approaches
loop
○ Otherwise, oi is a DB(r, π) outlier
● Efficiency: Actually CPU time is not O(n2) but linear to the data set
size since for most non-outlier objects, the inner loop terminates
early
19
Distance-Based Outlier Detection: A Grid-Based Method
20
Statistical Approaches
21
Density-Based Outlier Detection
22
Statistical Approaches
identical distance to o
23
Local Outlier Factor: LOF
● LOF (Local outlier factor) of an object o is the average of the ratio of local
reachability of o and those of o’s k-nearest neighbors
24
Local Outlier Factor: LOF
● The lower the local reachability density of o, and the higher the local
reachability density of the kNN of o, the higher LOF
● This captures a local outlier whose local density is relatively low
comparing to the local densities of its kNN
25
Clustering-Base Approaches
26
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
● An object is an outlier if (1) it does not belong to any cluster, (2)
there is a large distance between the object and its closest cluster
, or (3) it belongs to a small or sparse cluster
● Case I: Not belong to any cluster
○ Identify animals not part of a flock: Using a density-based
clustering method such as DBSCAN
● Case 2: Far from its closest cluster
○ Using k-means, partition data points of into clusters
○ For each object o, assign an outlier score based on its
distance from its closest center
○ If dist(o, co)/avg_dist(co) is large, likely an outlier
27
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
● Ex. Intrusion detection: Consider the similarity between data
points and the clusters in a training data set
○ Use a training set to find patterns of “normal” data, e.g.,
28
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
● FindCBLOF: Detect outliers in small clusters
○ Find clusters, and sort them in decreasing size
○ To each data point, assign a cluster-based local outlier
factor (CBLOF):
○ If obj p belongs to a large cluster, CBLOF = cluster_size X
similarity between p and cluster
○ If p belongs to a small one, CBLOF = cluster size X
similarity betw. p and the closest large cluster
● Ex. In the figure, o is outlier since its closest large cluster is C1, but the similarity
between o and C1 is small. For any point in C3, its closest large cluster is C2 but
its similarity from C2 is low, plus |C3| = 3 is small
29
Clustering-Based Method: Strength and Weakness
Strength Weaknesses
● Detect outliers without requiring any ● Effectiveness depends highly on the clustering
labeled data method used—they may not be optimized for
outlier detection
● Work for many types of data
● High computational cost: Need to first find
● Clusters can be regarded as summaries clusters
of the data ● A method to reduce the cost: Fixed-width
● Once the cluster are obtained, need only clustering
compare any object against the clusters ○ A point is assigned to a cluster if the
to determine whether it is an outlier (fast) center of the cluster is within a pre-
defined distance threshold from the point
○ If a point cannot be assigned to any
existing cluster, a new cluster is created
and the distance threshold may be learned
from the training data under certain
conditions
30
Classification-Based Method I: One-Class Model
31
Classification-Based Method II: Semi-Supervised Learning
32
Mining Contextual Outliers I: Transform into
Conventional Outlier Detection
● If the contexts can be clearly identified, transform it to
conventional outlier detection
○ Identify the context of the object using the contextual
attributes
○ Calculate the outlier score for the object in the context using a
33
Mining Contextual Outliers I: Transform into
Conventional Outlier Detection
● Steps: (1) locate c’s context, (2) compare c with the other customers in
the same group, and (3) use a conventional outlier detection method
● If the context contains very few customers, generalize contexts
○ Ex. Learn a mixture model U on the contextual attributes, and another
mixture model V of the data on the behavior attributes
○ Learn a mapping p(Vi|Uj): the probability that a data object o
belonging to cluster Uj on the contextual attributes is generated by
cluster Vi on the behavior attributes
○ Outlier score:
34
Mining Contextual Outliers II: Modeling Normal Behavior with
Respect to Contexts
● In some applications, one cannot clearly partition the data into contexts
○ Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products
browsed earlier should be considered as the context
● Model the “normal” behavior with respect to contexts
○ Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute
values
○ An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
35
Mining Contextual Outliers II: Modeling Normal Behavior with
Respect to Contexts
● Using a prediction model that links the contexts and behavior,
these methods avoid the explicit identification of specific
contexts
● Methods: A number of classification and prediction techniques
can be used to build such models, such as:
○ Regression
○ Markov Models
36
Mining Collective Outliers I: On the Set of “Structured Objects”
37
Statistical Approaches
unit
○ Collective outlier: An outlier subgraph in the social network
39
Mining Collective Outliers II: Direct Modeling of the Expected
Behavior of Structure Units
● Ex. 2. Detect collective outliers in temporal sequences
○ Learn a Markov model from the sequences
application dependent
○ The computational cost is often high due to the sophisticated
mining process
40
Challenges for Outlier Detection in High-Dimensional Data
● Interpretation of outliers
○ Detecting outliers without saying why they are outliers is not
● Data subspaces
○ Adaptive to the subspaces signifying the outliers
42
Approach I: Extending Conventional Outlier Detection
43
Approach I: Extending Conventional Outlier Detection
44
Approach II: Finding Outliers in Subspaces
45
Approach II: Finding Outliers in Subspaces
48
Summary
● Types of outliers
○ global, contextual & collective outliers
● Outlier detection
○ supervised, semi-supervised, or unsupervised