ADII11: Outlier Detection Methods


Week 11:

Outlier Detection Methods

Dr.Eng. Fitra Abdurrachman Bachtiar, S.T., M.Eng


Dr. Diva Kurnianingtyas, S.Kom
Outline
1. Statistical Approaches
2. Supervised and Unsupervised Approaches
Note:
These slides are based on the additional material provided with the textbooks that we use: J. Han, M. Kamber, and J. Pei, "Data Mining: Concepts and Techniques", and P. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining".
Outlier Analysis Methods
Statistical Approach

● Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model)
● Idea: learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers
● Methods are divided into two categories:
○ parametric
○ non-parametric
Parametric vs. Non-Parametric

Parametric methods
● Assume that the normal data is generated by a parametric distribution with parameter θ
● The probability density function of the parametric distribution, f(x, θ), gives the probability that object x is generated by the distribution
● The smaller this value, the more likely x is an outlier

Non-parametric methods
● Do not assume an a-priori statistical model; instead, determine the model from the input data
● Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
● Examples: histogram and kernel density estimation
Parametric Methods I: Detecting Univariate Outliers Based on a Normal Distribution

● Univariate data: a data set involving only one attribute or variable
● Often assume that data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers
● Ex. Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
○ Use the maximum likelihood method to estimate μ and σ
Parametric Methods I: Detecting Univariate Outliers Based on a Normal Distribution

○ Taking derivatives with respect to μ and σ², we derive the following maximum likelihood estimates:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

○ For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51
○ Then, since (24 − 28.61)/1.51 = −3.04 < −3, the value 24 lies more than 3σ̂ below the mean, so 24 is an outlier
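A minimal NumPy sketch of this recipe (the synthetic temperature readings below are illustrative, not the textbook sample): estimate μ̂ and σ̂ by maximum likelihood, then flag any reading more than 3σ̂ from the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = np.append(rng.normal(29.0, 0.2, 99), 24.0)  # 99 typical readings + one outlier

mu = temps.mean()      # maximum likelihood estimate of mu
sigma = temps.std()    # MLE of sigma (divides by n, not n - 1)

z = (temps - mu) / sigma       # standardized score of each reading
print(temps[np.abs(z) > 3])    # 3-sigma rule: 24.0 is flagged
```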
Parametric Methods I: Grubbs' Test

● Univariate outlier detection: Grubbs' test (maximum normed residual test), another statistical method under the normal distribution assumption
○ For each object x in a data set, compute its z-score:

$$z = \frac{|x - \bar{x}|}{s}$$

○ x is an outlier if

$$z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where t_{α/(2N), N−2} is the value taken by a t-distribution at a significance level of α/(2N) with N − 2 degrees of freedom, and N is the number of objects in the data set
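A sketch of Grubbs' test as stated above; SciPy's t-distribution supplies the critical value, and α = 0.05 is an illustrative default.

```python
import numpy as np
from scipy import stats

def grubbs_outliers(x, alpha=0.05):
    """Return the values whose z-score exceeds the Grubbs critical value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)      # z-score of each object
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)   # t critical value, N - 2 d.f.
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return x[z > g_crit]

print(grubbs_outliers([24.0, 28.9, 28.9, 29.0, 29.1, 29.1,
                       29.2, 29.2, 29.3, 29.4]))  # flags 24.0
```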
Parametric Methods II: Detection of Multivariate Outliers

● Multivariate data: a data set involving two or more attributes or variables
● Transform the multivariate outlier detection task into a univariate outlier detection problem
● Method 1: Compute the Mahalanobis distance
○ Let ō be the mean vector for a multivariate data set. The Mahalanobis distance from an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō), where S is the covariance matrix
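A short NumPy sketch of Method 1 on toy data: compute MDist for every object, yielding a univariate score to which a test such as Grubbs' can then be applied.

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 3))   # toy multivariate data
o_bar = X.mean(axis=0)                               # mean vector
S_inv = np.linalg.inv(np.cov(X, rowvar=False))       # inverse covariance matrix

diff = X - o_bar
mdist = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # (o - obar)^T S^-1 (o - obar)
# mdist is univariate: apply, e.g., Grubbs' test to it to flag outliers
```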
Parametric Methods II: Detection of Multivariate Outliers

○ Use Grubbs' test on this measure to detect outliers

● Method 2: Use the χ² (chi-square) statistic

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$$

○ where Eᵢ is the mean of the i-th dimension among all objects, and n is the dimensionality
○ If the χ² statistic is large, then object o is an outlier
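An illustrative sketch of Method 2 (the toy data are kept positive-valued so that the per-dimension means Eᵢ make a sensible denominator):

```python
import numpy as np

X = np.random.default_rng(11).normal(10.0, 2.0, (200, 4))  # toy positive-valued data
E = X.mean(axis=0)                      # E_i: mean of dimension i over all objects
chi2 = ((X - E) ** 2 / E).sum(axis=1)   # a large statistic marks a likely outlier
```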
Parametric Methods III: Using a Mixture of Parametric Distributions

● Assuming the data are generated by a single normal distribution can sometimes be an oversimplification
● Example: objects lying between two clusters cannot be captured as outliers by a single-Gaussian model, since they are close to the estimated mean
Parametric Methods III: Using a Mixture of Parametric Distributions

● To overcome this problem, assume the normal data is generated by two normal distributions. For any object o in the data set, the probability that o is generated by the mixture of the two distributions is given by

$$\Pr(o \mid \Theta_1, \Theta_2) = f_{\theta_1}(o) + f_{\theta_2}(o)$$

where fθ₁ and fθ₂ are the probability density functions of θ₁ and θ₂
● Then use the EM algorithm to learn the parameters μ₁, σ₁, μ₂, σ₂ from the data
● An object o is an outlier if it does not belong to any cluster
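A sketch using scikit-learn's GaussianMixture, which fits a two-component mixture by EM; flagging the lowest-density percentile is an illustrative thresholding choice, not part of the method itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # cluster 1
               rng.normal(8, 1, (100, 2)),   # cluster 2
               [[4.0, 4.0]]])                # a point between the clusters

gm = GaussianMixture(n_components=2).fit(X)  # EM learns mu_1, sigma_1, mu_2, sigma_2
log_density = gm.score_samples(X)            # log mixture density at each point
outliers = X[log_density < np.percentile(log_density, 1)]  # lowest-density points
```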
Non-Parametric Methods: Detection Using Histograms

● The model of normal data is learned from the input data without any a-priori structure
● Such methods often make fewer assumptions about the data, and thus can be applicable in more scenarios
Non-Parametric Methods: Detection Using Histograms

● Outlier detection using a histogram:
○ Suppose the figure shows the histogram of purchase amounts in transactions
○ A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
● Problem: it is hard to choose an appropriate bin size for the histogram
○ Too small a bin size → normal objects fall into empty or rare bins: false positives
○ Too big a bin size → outliers fall into some frequent bins: false negatives
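A small NumPy sketch of histogram-based scoring on synthetic purchase amounts (the 0.5% rarity threshold and the bin count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
amounts = np.concatenate([rng.gamma(2.0, 300.0, 1000), [7500.0]])  # one extreme value

counts, edges = np.histogram(amounts, bins=20)
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
freq = counts[bin_idx] / len(amounts)  # fraction of transactions in each point's bin
outliers = amounts[freq < 0.005]       # points in very rare bins; note that a poor
                                       # bin size also sweeps in normal tail values
```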
Non-Parametric Methods: Detection Using Histograms

● Solution:
○ Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density at an object is high, the object is likely normal; otherwise, it is likely an outlier.
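A sketch of the kernel density alternative using scikit-learn's KernelDensity; the Gaussian kernel and the bandwidth of 100 are assumed tuning choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(4)
amounts = rng.gamma(2.0, 300.0, 1000).reshape(-1, 1)  # synthetic purchase amounts

kde = KernelDensity(kernel='gaussian', bandwidth=100.0).fit(amounts)
log_density = kde.score_samples(np.array([[600.0], [7500.0]]))
print(log_density)  # high density -> likely normal; very low -> likely outlier
```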
Proximity-Based Approaches: Distance-Based vs. Density-Based Outlier Detection

● Intuition: objects that are far away from the others are outliers
● Assumption of proximity-based approaches: the proximity of an outlier deviates significantly from that of most of the other objects in the data set
● Two types of proximity-based outlier detection methods:
○ Distance-based outlier detection: an object o is an outlier if its neighborhood does not have enough other points
○ Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbors
Distance-Based Outlier Detection

● For each object o, examine the number of other objects in the r-neighborhood of o, where r is a user-specified distance threshold
● An object o is an outlier if most of the objects in D are far away from o, i.e., not in the r-neighborhood of o (taking π as a fraction threshold)
● An object o is a DB(r, π) outlier if

$$\frac{\|\{o' \mid dist(o, o') \le r\}\|}{\|D\|} \le \pi$$

● Equivalently, one can check the distance between o and its k-th nearest neighbor o_k, where k = ⌈π‖D‖⌉. o is an outlier if dist(o, o_k) > r
Distance-Based Outlier Detection: A Nested-Loop Algorithm

● Efficient computation: nested-loop algorithm
○ For each object oᵢ, calculate its distance from the other objects, and count the number of other objects in its r-neighborhood
○ If π·n other objects are within distance r, terminate the inner loop
○ Otherwise, oᵢ is a DB(r, π) outlier
● Efficiency: the actual CPU time is not O(n²) but linear in the data set size, since for most non-outlier objects the inner loop terminates early
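A minimal sketch of the nested-loop algorithm (plain O(n²) worst case, with the early termination described above):

```python
import numpy as np

def db_outliers(X, r, pi):
    """Flag object i if fewer than pi * n objects lie within distance r of it."""
    n = len(X)
    outliers = []
    for i, o in enumerate(X):
        count = 0
        for j, other in enumerate(X):
            if i != j and np.linalg.norm(o - other) <= r:
                count += 1
                if count >= pi * n:  # enough near neighbors: not an outlier,
                    break            # so terminate the inner loop early
        else:
            outliers.append(i)       # inner loop never broke: a DB(r, pi) outlier
    return outliers
```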
Distance-Based Outlier Detection: A Grid-Based Method

● Why is efficiency still a concern? When the complete set of objects cannot be held in main memory, there is an I/O swapping cost
● The major costs: (1) each object is tested against the whole data set (why not only against its close neighbors?); (2) objects are checked one by one (why not group by group?)
● Grid-based method (CELL): the data space is partitioned into a multi-dimensional grid, where each cell is a hypercube with diagonal length r/2
Distance-Based Outlier Detection: A Grid-Based Method

● Pruning using the level-1 and level-2 cell properties:
○ For any possible point x in cell C and any possible point y in a level-1 cell, dist(x, y) ≤ r
○ For any possible point x in cell C and any point y such that dist(x, y) ≥ r, y is in a level-2 cell
● Thus, we only need to check the objects that cannot be pruned, and even for such an object o, we only need to compute the distance between o and the objects in the level-2 cells (since beyond level 2, the distance from o is more than r)
Density-Based Outlier Detection

● Local outliers: outliers relative to their local neighborhoods, instead of the global data distribution
● In the figure, o₁ and o₂ are local outliers relative to C₁, o₃ is a global outlier, but o₄ is not an outlier. However, distance-based detection cannot identify o₁ and o₂ as outliers (e.g., when compared with o₄).
● Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors
Density-Based Outlier Detection

● Method: use the relative density of an object compared with its neighbors as the indicator of the degree to which the object is an outlier
● k-distance of an object o, dist_k(o): the distance between o and its k-th nearest neighbor
● k-distance neighborhood of o: N_k(o) = {o′ | o′ ∈ D, dist(o, o′) ≤ dist_k(o)}
○ N_k(o) could contain more than k objects, since multiple objects may have identical distance to o
Local Outlier Factor: LOF

● Reachability distance from o′ to o:

$$reachdist_k(o \leftarrow o') = \max\{dist_k(o),\ dist(o, o')\}$$

○ where k is a user-specified parameter
● Local reachability density of o:

$$lrd_k(o) = \frac{\|N_k(o)\|}{\sum_{o' \in N_k(o)} reachdist_k(o' \leftarrow o)}$$

● The local outlier factor (LOF) of an object o is the average of the ratios of the local reachability densities of o's k-nearest neighbors to that of o:

$$LOF_k(o) = \frac{1}{\|N_k(o)\|} \sum_{o' \in N_k(o)} \frac{lrd_k(o')}{lrd_k(o)}$$
Local Outlier Factor: LOF

● The lower the local reachability density of o, and the higher the local reachability densities of o's k-nearest neighbors, the higher the LOF value
● This captures a local outlier whose local density is relatively low compared with the local densities of its k-nearest neighbors
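LOF is available directly in scikit-learn; a minimal sketch on synthetic data with one local outlier near the dense cluster:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),  # dense cluster
               rng.normal(5, 2.0, (100, 2)),  # sparse cluster
               [[2.5, 2.5]]])                 # local outlier near the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks outliers
scores = -lof.negative_outlier_factor_  # the LOF values themselves
```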
Clustering-Based Approaches
Clustering-Based Outlier Detection (1 & 2):
Not Belonging to Any Cluster, or Far from the Closest One
● An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
● Case 1: Does not belong to any cluster
○ Identify animals not part of a flock: use a density-based clustering method such as DBSCAN
● Case 2: Far from its closest cluster
○ Using k-means, partition the data points into clusters
○ For each object o, assign an outlier score based on its distance from its closest center c_o
○ If dist(o, c_o)/avg_dist(c_o) is large, o is likely an outlier (see the sketch below)
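A sketch of the Case 2 score using scikit-learn's k-means (three clusters and the synthetic data are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(6).normal(size=(300, 2))
km = KMeans(n_clusters=3, n_init=10).fit(X)

closest = km.cluster_centers_[km.labels_]          # closest center c_o of each point
dist = np.linalg.norm(X - closest, axis=1)         # dist(o, c_o)
avg_dist = np.array([dist[km.labels_ == c].mean()  # avg_dist(c_o), per cluster
                     for c in range(3)])[km.labels_]
score = dist / avg_dist                            # large ratio -> likely outlier
```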
Clustering-Based Outlier Detection (1 & 2):
Not Belonging to Any Cluster, or Far from the Closest One
● Ex. Intrusion detection: consider the similarity between data points and the clusters in a training data set
○ Use a training set to find patterns of "normal" data, e.g., frequent itemsets in each segment, and cluster similar connections into groups
○ Compare new data points with the mined clusters: outliers are possible attacks
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
● FindCBLOF: detect outliers in small clusters
○ Find clusters, and sort them in decreasing order of size
○ Assign to each data point a cluster-based local outlier factor (CBLOF):
■ If point p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
■ If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
● Ex. In the figure, o is an outlier since its closest large cluster is C₁, but the similarity between o and C₁ is small. For any point in C₃, its closest large cluster is C₂, but its similarity to C₂ is low; in addition, |C₃| = 3 is small
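A simplified sketch of the CBLOF idea; the original FindCBLOF defines large vs. small clusters and the similarity function via its own parameters, so the 10% size cutoff and the inverse-distance similarity below are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(12).normal(size=(300, 2))
km = KMeans(n_clusters=4, n_init=10).fit(X)
sizes = np.bincount(km.labels_)
large = np.where(sizes >= 0.1 * len(X))[0]   # "large" clusters (cutoff assumed)

# similarity stand-in: 1 / (1 + distance to the nearest large-cluster center)
d_large = np.min(np.linalg.norm(
    X[:, None, :] - km.cluster_centers_[large][None, :, :], axis=2), axis=1)
cblof = sizes[km.labels_] / (1.0 + d_large)  # small CBLOF -> likely outlier
```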
Clustering-Based Methods: Strengths and Weaknesses

Strengths
● Detect outliers without requiring any labeled data
● Work for many types of data
● Clusters can be regarded as summaries of the data
● Once the clusters are obtained, an object need only be compared against the clusters to determine whether it is an outlier (fast)

Weaknesses
● Effectiveness depends highly on the clustering method used; such methods may not be optimized for outlier detection
● High computational cost: the clusters must be found first
● A method to reduce the cost: fixed-width clustering
○ A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold of the point
○ If a point cannot be assigned to any existing cluster, a new cluster is created; the distance threshold may be learned from the training data under certain conditions
Classification-Based Method I: One-Class Model

● Idea: train a classification model that can distinguish "normal" data from outliers
● A brute-force approach: consider a training set that contains samples labeled "normal" and others labeled "outlier"
○ But the training set is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples
○ Such a classifier cannot detect unseen anomalies
● One-class model: instead, train a classifier to describe only the "normal" objects; any object the model does not accept is declared an outlier
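The slide does not prescribe a particular classifier; a common concrete choice is a one-class SVM trained on normal data only, sketched here:

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.default_rng(7).normal(0, 1, (500, 2))  # "normal"-only training set

oc = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(X_train)
print(oc.predict([[0.2, -0.1], [6.0, 6.0]]))  # +1 = normal, -1 = outlier
```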
Classification-Based Method II: Semi-Supervised Learning

● Semi-supervised learning: combining classification-based and clustering-based methods
● Method:
○ Using a clustering-based approach, find a large cluster C and a small cluster C₁
○ Since some objects in C carry the label "normal", treat all objects in C as normal
○ Use the one-class model of this cluster to identify normal objects in outlier detection
○ Since some objects in cluster C₁ carry the label "outlier", declare all objects in C₁ outliers
○ Any object that does not fall into the model for C (such as a) is considered an outlier as well
Mining Contextual Outliers I: Transform into Conventional Outlier Detection

● If the contexts can be clearly identified, transform the problem into conventional outlier detection:
○ Identify the context of the object using the contextual attributes
○ Calculate the outlier score for the object in the context using a conventional outlier detection method
● Ex. Detect outlier customers in the context of customer groups
○ Contextual attributes: age group, postal code
○ Behavioral attributes: number of transactions per year, annual total transaction amount
Mining Contextual Outliers I: Transform into Conventional Outlier Detection

● Steps: (1) locate customer c's context, (2) compare c with the other customers in the same group, and (3) use a conventional outlier detection method
● If the context contains very few customers, generalize contexts
○ Ex. Learn a mixture model U of the data on the contextual attributes, and another mixture model V of the data on the behavioral attributes
○ Learn a mapping p(Vᵢ|Uⱼ): the probability that a data object o belonging to cluster Uⱼ on the contextual attributes is generated by cluster Vᵢ on the behavioral attributes
○ Outlier score:

$$S(o) = \sum_{U_j} p(o \in U_j) \sum_{V_i} p(V_i \mid U_j)\, p(o \in V_i)$$
Mining Contextual Outliers II: Modeling Normal Behavior with Respect to Contexts

● In some applications, the data cannot be clearly partitioned into contexts
○ Ex. If a customer suddenly purchased a product unrelated to those she recently browsed, it is unclear how many of the products browsed earlier should be considered as the context
● Model the "normal" behavior with respect to contexts:
○ Using a training data set, train a model that predicts the expected behavioral attribute values with respect to the contextual attribute values
○ An object is a contextual outlier if its behavioral attribute values significantly deviate from the values predicted by the model
Mining Contextual Outliers II: Modeling Normal Behavior with Respect to Contexts

● By using a prediction model that links the contexts and the behavior, these methods avoid the explicit identification of specific contexts
● Methods: a number of classification and prediction techniques can be used to build such models, for example (see the regression sketch below):
○ Regression
○ Markov models
○ Finite-state automata
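A sketch of the regression option on synthetic data: learn the expected behavior from the context, then flag objects whose residuals are large (the 3-standard-deviation cutoff is illustrative).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
age = rng.uniform(20, 70, (500, 1))              # contextual attribute
spend = 2.0 * age[:, 0] + rng.normal(0, 5, 500)  # behavioral attribute

model = LinearRegression().fit(age, spend)       # expected behavior given context
residual = spend - model.predict(age)
flags = np.abs(residual) > 3 * residual.std()    # contextual outliers
```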
Mining Collective Outliers I: On the Set of "Structured Objects"

● A group of objects forms a collective outlier if the objects as a group deviate significantly from the entire data set
● Need to examine the structure of the data set, i.e., the relationships between multiple data objects
● Each of these structures is inherent to its respective type of data:
○ For temporal data (such as time series and sequences), we explore the structures formed by time, which occur in segments of the time series or in subsequences
○ For spatial data, we explore local areas
○ For graph and network data, we explore subgraphs
Mining Collective Outliers I: On the Set of "Structured Objects"

● Difference from contextual outlier detection: the structures are often not explicitly defined, and have to be discovered as part of the outlier detection process
● Collective outlier detection methods fall into two categories:
○ (1) Reduce the problem to conventional outlier detection
■ Identify structure units, treat each structure unit (e.g., a subsequence, a time-series segment, a local area, or a subgraph) as a data object, and extract features
■ Then apply outlier detection on the set of "structured objects" constructed in this way, using the extracted features
Mining Collective Outliers II: Direct Modeling of the Expected Behavior of Structure Units

● (2) Model the expected behavior of structure units directly
● Ex. 1. Detect collective outliers in an online social network of customers
○ Treat each possible subgraph of the network as a structure unit
○ Collective outlier: an outlier subgraph in the social network
■ Small subgraphs that are of very low frequency
■ Large subgraphs that are surprisingly frequent
Mining Collective Outliers II: Direct Modeling of the Expected Behavior of Structure Units

● Ex. 2. Detect collective outliers in temporal sequences
○ Learn a Markov model from the sequences
○ A subsequence can then be declared a collective outlier if it significantly deviates from the model (see the sketch below)
● Collective outlier detection is subtle due to the challenge of exploring the structures in the data
○ The exploration typically uses heuristics, and thus may be application dependent
○ The computational cost is often high due to the sophisticated mining process
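A toy sketch of the Markov-model approach: score a subsequence by its average log-likelihood under a learned transition matrix (the two-state matrix below is hypothetical, standing in for one estimated from training sequences).

```python
import numpy as np

def markov_score(seq, trans, states):
    """Average log-likelihood of a sequence under a transition matrix."""
    idx = [states.index(s) for s in seq]
    logp = [np.log(trans[a, b]) for a, b in zip(idx, idx[1:])]
    return np.mean(logp)

states = ['A', 'B']
trans = np.array([[0.9, 0.1],   # hypothetical transition probabilities,
                  [0.2, 0.8]])  # as if learned from training sequences

print(markov_score(list('AAAABBBB'), trans, states))  # typical: higher score
print(markov_score(list('ABABABAB'), trans, states))  # unusual: much lower score
```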
Challenges for Outlier Detection in High-Dimensional Data

● Interpretation of outliers
○ Detecting outliers without saying why they are outliers is not very useful in high-dimensional settings, because many features (or dimensions) are involved in a high-dimensional data set
○ E.g., indicate which subspaces manifest the outliers, or provide an assessment regarding the "outlier-ness" of the objects
● Data sparsity
○ Data in high-dimensional spaces are often sparse
○ The distance between objects becomes heavily dominated by noise as the dimensionality increases
Challenges for Outlier Detection in High-Dimensional Data

● Data subspaces
○ Methods should be adaptive to the subspaces signifying the outliers
○ Methods should capture the local behavior of the data
● Scalability with respect to dimensionality
○ The number of subspaces increases exponentially with dimensionality
Approach I: Extending Conventional Outlier Detection

● Method 1: detect outliers in the full space, e.g., the HilOut algorithm
○ Find distance-based outliers, but use the ranks of distances instead of the absolute distances in outlier detection
○ For each object o, find its k nearest neighbors: nn₁(o), ..., nn_k(o)
○ The weight of object o:

$$w(o) = \sum_{i=1}^{k} dist(o,\, nn_i(o))$$

○ All objects are ranked in weight-descending order
○ The top-l objects by weight are output as outliers (l is a user-specified parameter)
○ Employ space-filling curves for approximation: scalable in both time and space w.r.t. data size and dimensionality (see the sketch below)
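A sketch of HilOut's weight-and-rank step only; the space-filling-curve approximation that makes the full algorithm scalable is omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(13).normal(size=(500, 10))
k, l = 10, 5

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 1st NN
dists, _ = nn.kneighbors(X)
w = dists[:, 1:].sum(axis=1)                     # weight = sum of k-NN distances
top_l = np.argsort(w)[::-1][:l]                  # top-l weights reported as outliers
```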
Approach I: Extending Conventional Outlier Detection

● Method 2: dimensionality reduction
○ Works only when, in the lower-dimensional space, normal instances can still be distinguished from outliers
○ PCA: heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to each other and outliers often deviate from the majority
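One common way to operationalize the PCA heuristic is to project onto the high-variance components and score each object by what remains in the low-variance directions, i.e., its reconstruction error; a sketch (the component count and cutoff are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(9).normal(size=(300, 50))  # high-dimensional toy data

pca = PCA(n_components=5).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))      # rebuild from top components
error = np.linalg.norm(X - X_rec, axis=1)            # deviation along the remaining
flags = error > error.mean() + 3 * error.std()       # low-variance directions
```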
Approach II: Finding Outliers in Subspaces

● Extending conventional outlier detection makes outlier interpretation hard
● Finding outliers in much lower-dimensional subspaces makes it easy to interpret why, and to what extent, an object is an outlier
○ E.g., find outlier customers in a certain subspace: average transaction amount >> average, and purchase frequency << average
● Ex. A grid-based subspace outlier detection method
○ Project the data onto various subspaces to find areas whose density is much lower than average
○ Discretize the data into a grid with φ equi-depth regions per attribute (equi-depth, so that each region contains the same fraction f = 1/φ of the objects)
Approach II: Finding Outliers in Subspaces

○ Search for regions that are significantly sparse
■ Consider a k-dimensional cube formed by k ranges on k dimensions, with n objects in total
■ If the objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/φ)ᵏ n = fᵏ n, and the standard deviation is

$$\sqrt{f^k (1 - f^k)\, n}$$

■ The sparsity coefficient of cube C:

$$S(C) = \frac{n(C) - f^k n}{\sqrt{f^k (1 - f^k)\, n}}$$

where n(C) is the number of objects in C
■ If S(C) < 0, then C contains fewer objects than expected
■ The more negative S(C) is, the sparser C is, and the more likely the objects in C are outliers in the subspace
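A quick hypothetical check of these formulas: take φ = 10 equi-depth ranges per attribute, a k = 2 dimensional cube, and n = 10,000 objects, so f = 1/10. The expected count is fᵏn = 100 and the standard deviation is √(fᵏ(1 − fᵏ)n) = √99 ≈ 9.95; a cell actually holding 20 objects then has S(C) = (20 − 100)/9.95 ≈ −8.0, far below zero, marking its objects as strong subspace outlier candidates.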
Approach III: Modeling High-Dimensional Outliers

● Develop new models for high-dimensional outliers directly
● Avoid proximity measures and adopt new heuristics that do not deteriorate in high-dimensional data
● Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
● For each point o, examine the angle ∠xoy for every pair of points x, y:
○ For a point in the center (e.g., a), the angles formed differ widely
○ For an outlier (e.g., c), the angle variance is substantially smaller
Approach III: Modeling High-Dimensional Outliers

● Use the variance of the angles observed at a point to determine whether it is an outlier
● Combine angles and distances to model outliers:
○ Use the distance-weighted angle variance as the outlier score
○ Angle-based outlier factor (ABOF):

$$ABOF(o) = \operatorname{VAR}_{x,y \in D} \frac{\langle \overrightarrow{ox}, \overrightarrow{oy} \rangle}{\|\overrightarrow{ox}\|^2\, \|\overrightarrow{oy}\|^2}$$

○ An efficient approximate computation method has been developed
○ The approach can be generalized to handle arbitrary types of data
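A direct (O(n³)) sketch of the ABOF formula above on toy data; the efficient approximation mentioned on the slide is not implemented here.

```python
import numpy as np
from itertools import combinations

def abof(o, X):
    """Variance of the distance-weighted angles seen at point o."""
    diffs = [x - o for x in X if not np.array_equal(x, o)]
    vals = [np.dot(a, b) / (np.dot(a, a) * np.dot(b, b))
            for a, b in combinations(diffs, 2)]
    return np.var(vals)  # a small variance suggests o is an outlier

X = np.vstack([np.random.default_rng(10).normal(0, 1, (50, 2)), [[9.0, 9.0]]])
scores = np.array([abof(o, X) for o in X])
print(X[scores.argmin()])  # the smallest ABOF, expected to be [9., 9.]
```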
Summary

● Types of outliers
○ global, contextual, and collective outliers
● Outlier detection
○ supervised, semi-supervised, or unsupervised
● Statistical (or model-based) approaches
● Proximity-based approaches
● Clustering-based approaches
● Classification-based approaches
● Mining contextual and collective outliers
● Outlier detection in high-dimensional data
Thank you
