Machine Learning General Concepts
Contents

1 Machine learning
1.1 Overview
1.1.1 Types of problems and tasks
1.2.1 Relation to statistics
1.3 Theory
1.4 Approaches
1.4.6 Clustering
1.4.7 Bayesian networks
1.4.8 Reinforcement learning
1.4.9 Representation learning
1.4.12 Genetic algorithms
1.5 Applications
1.6 Software
1.6.1 Open-source software
1.6.2 Commercial software with open-source editions
1.6.3 Commercial software
1.7 Journals
1.8 Conferences
1.9 See also
1.10 References

2 Data mining
2.1 Etymology
2.2 Background
2.3 Process
2.3.1 Pre-processing
2.3.2 Data mining
2.3.3 Results validation
2.4 Standards
2.5 Notable uses
2.5.1 Games
2.5.2 Business
2.5.4 Human rights
2.5.9 Visual data mining
2.5.11 Surveillance
2.5.14 Knowledge grid
2.6 Privacy concerns and ethics
2.6.1 Situation in Europe
2.7 Copyright Law
2.7.1 Situation in Europe
2.8 Software
2.8.3 Marketplace surveys
2.9 See also
2.10 References

3 Statistical classification
3.1 Relation to other problems
3.2 Frequentist procedures
3.3 Bayesian procedures
3.5 Feature vectors
3.6 Linear classifiers
3.7 Algorithms
3.8 Evaluation
3.9 Application domains
3.11 References

4 Cluster analysis
4.1 Definition
4.2 Algorithms
4.2.2 Centroid-based clustering
4.2.3 Distribution-based clustering
4.2.4 Density-based clustering
4.2.5 Recent developments
4.2.6 Other methods
4.3.1 Internal evaluation
4.3.2 External evaluation
4.4 Applications
4.5 See also
4.5.4 Other
4.6 References
4.7 External links

5 Anomaly detection
5.1 Applications
5.2 Popular techniques
5.4 Software
5.5 See also
5.6 References

6 Association rule learning
6.1 Definition
6.2 Useful Concepts
6.3 Process
6.4 History
6.7 Algorithms
6.7.1 Apriori algorithm
6.7.2 Eclat algorithm
6.7.3 FP-growth algorithm
6.7.4 Others
6.8 Lore
6.11 References
6.12.1 Bibliographies
6.12.2 Implementations

7 Reinforcement learning
7.1 Introduction
7.2 Exploration
7.3.1 Criterion of optimality
7.3.2 Brute force
7.4 Theory
7.5 Current research
7.6 Literature
7.6.1 Conferences, journals
7.7 See also
7.8 Implementations
7.9 References

8 Structured prediction
8.2 Structured perceptron
8.3 See also
8.4 References
8.5 External links

9 Feature learning
9.1.2 Neural networks
9.2.1 K-means clustering
9.3 Multilayer/Deep architectures
9.3.2 Autoencoder
9.4 See also
9.5 References

11 Semi-supervised learning
11.2 History
11.6 References

12 Grammar induction
12.1 Grammar Classes
12.3 Methodologies
12.4 Applications
12.6 Notes
12.7 References
12.8.1 Text
12.8.2 Images
Chapter 1
Machine learning
For the journal, see Machine Learning (journal).

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the construction and study of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions.

Machine learning is closely related to and often overlaps with computational statistics, a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit, rule-based algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR),[4] search engines and computer vision. Machine learning is sometimes conflated with data mining,[5] although that focuses more on exploratory data analysis.[6] Machine learning and pattern recognition can be viewed as two facets of the same field.[3]:vii

When employed in industrial contexts, machine learning methods may be referred to as predictive analytics or predictive modelling.

1.1 Overview

In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed".[7]

Tom M. Mitchell provided a widely quoted, more formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E".[8] This definition is notable for its defining machine learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's proposal in his paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"[9]

1.1.1 Types of problems and tasks

Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning signal or feedback available to a learning system. These are:[10]

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end.

Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent.[3]:3

Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some (often many) of the target outputs missing. Transduction is a special case of this principle where the entire set of problem instances is known at learning time, except that part of the targets are missing.

Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience. Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers, and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.
Another categorization of machine learning tasks arises when one considers the desired output of a machine-learned system:[3]:3

In classification, inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one (or multi-label classification) or more of these classes. This is typically tackled in a supervised way. Spam filtering is an example of classification, where the inputs are email (or other) messages and the classes are "spam" and "not spam".
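As a concrete illustration of supervised classification, the following sketch trains a naive Bayes spam filter on a handful of made-up labelled messages; it assumes scikit-learn is available, and the messages, labels and expected output are invented for illustration only.

# A minimal sketch of supervised classification (spam filtering).
# Assumes scikit-learn; the toy messages and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",         # spam
    "limited offer, claim money",   # spam
    "meeting agenda for monday",    # not spam
    "lunch tomorrow with the team", # not spam
]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn raw text into bag-of-words feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Fit a naive Bayes classifier on the labelled examples.
classifier = MultinomialNB()
classifier.fit(X, labels)

# Apply the learned rule to an unseen message.
new_message = vectorizer.transform(["claim your free money"])
print(classifier.predict(new_message))  # expected: ['spam']

The fitted model is the "general rule" described above: it maps a new, unseen message to one of the two classes.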
The two fields overlap in many ways, and machine learning also employs data mining methods as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Machine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[12]
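The difference can be made concrete with a small sketch, assuming only NumPy: a very flexible model can drive the loss on the training set very low while typically doing worse on held-out samples. The data, noise level and polynomial degrees below are arbitrary illustrative choices.

# Illustrative sketch: low training loss does not guarantee low loss on unseen data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)

# Hold out half of the samples to stand in for "unseen" data.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    # Mean squared loss of a polynomial fit on the given samples.
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

for degree in (3, 9):
    # polyfit minimizes the squared loss on the training set only.
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, "train:", mse(coeffs, x_train, y_train),
          "test:", mse(coeffs, x_test, y_test))

The higher-degree fit typically reports a lower training loss but not a lower loss on the held-out points, which is the distinction drawn above.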
1.2.1 Relation to statistics

1.3 Theory

The training examples are assumed to be representative of the space of occurrences, and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. The bias–variance decomposition is one way to quantify generalization error.

In addition to performance bounds, computational learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

There are many similarities between machine learning theory and statistical inference, although they use different terms.

1.4 Approaches

Main article: List of machine learning algorithms
1.4.2 Association rule learning

Association rule learning is a method for discovering interesting relations between variables in large databases.

1.4.3 Artificial neural networks

Artificial neural network learning algorithms, usually called "neural networks", are learning algorithms inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.
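A minimal sketch of such a network, assuming scikit-learn is available; the dataset, hidden-layer size and activation function are arbitrary choices.

# A small feed-forward neural network classifier on a toy non-linear dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of interconnected "neurons" with a non-linear activation.
net = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("held-out accuracy:", net.score(X_test, y_test))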
1.4.6 Clustering

Main article: Cluster analysis

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters. Other methods are based on estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis.
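A brief sketch of clustering as unsupervised learning, assuming scikit-learn is available; the synthetic "blob" data and the choice of three clusters are illustrative only.

# K-means clustering on synthetic data: no labels are given to the algorithm.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # assigns each observation to a cluster

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)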
1.4.9 Representation learning

Several learning algorithms, mostly unsupervised learning algorithms, aim at discovering better representations of the inputs provided during training. Classical examples include principal components analysis and cluster analysis. Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing to reconstruct the inputs coming from the unknown data generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.

Manifold learning algorithms attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse (has many zeros). Multilinear subspace learning algorithms aim to learn low-dimensional representations directly from tensor representations for multidimensional data, without reshaping them into (high-dimensional) vectors.[17] Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.[18]
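As a small illustration of representation learning in the classical sense described above, the sketch below uses principal components analysis to transform inputs into a lower-dimensional representation; it assumes scikit-learn, and the dataset choice is arbitrary.

# Principal components analysis as a learned, lower-dimensional representation.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features each

pca = PCA(n_components=2)
Z = pca.fit_transform(X)              # new 2-dimensional representation of the inputs

print("original shape:", X.shape, "reduced shape:", Z.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())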
1.4.11 Sparse dictionary learning

In this method, a datum is represented as a linear combination of basis functions, and the coefficients are assumed to be sparse. Let x be a d-dimensional datum and D be a d by n matrix, where each column of D represents a basis function. r is the coefficient vector used to represent x using D. Mathematically, sparse dictionary learning means solving x ≈ Dr where r is sparse. Generally speaking, n is assumed to be larger than d to allow the freedom for a sparse representation.

Learning a dictionary along with sparse representations is strongly NP-hard and also difficult to solve approximately.[19] A popular heuristic method for sparse dictionary learning is K-SVD.
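A rough sketch of sparse coding with a learned dictionary, assuming scikit-learn's DictionaryLearning estimator; the random data, dictionary size and sparsity penalty are arbitrary illustrative choices, not a reference implementation of K-SVD.

# Learning an overcomplete dictionary (n = 16 atoms for d = 8 dimensions)
# and sparse coefficient vectors for each datum. Assumes scikit-learn.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))      # 50 data points, d = 8 dimensions

dico = DictionaryLearning(n_components=16, transform_algorithm="lasso_lars",
                          transform_alpha=0.5, random_state=0, max_iter=50)
codes = dico.fit_transform(X)         # rows are the sparse coefficient vectors

print("dictionary shape:", dico.components_.shape)   # (16, 8)
print("average non-zero coefficients per datum:",
      float(np.mean(np.count_nonzero(codes, axis=1))))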
1.4.12 Genetic algorithms

Machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[23]

1.5 Applications

Applications for machine learning include:

Computational advertising
Computational finance
Machine perception
Medical diagnosis
Natural language processing[25]
Optimization and metaheuristic
Recommender systems
Robot locomotion
Search engines
Sentiment analysis (or opinion mining)
Sequence mining
Software engineering
Speech and handwriting recognition
Stock market analysis
Structural health monitoring
Syntactic pattern recognition
In 2006, the online movie company Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[26] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[27]

In 2010 The Wall Street Journal wrote about money management firm Rebellion Research's use of machine learning to predict economic movements. The article describes Rebellion Research's prediction of the financial crisis and economic recovery.[28]

In 2014 it was reported that a machine learning algorithm had been applied in Art History to study fine art paintings, and that it may have revealed previously unrecognized influences between artists.[29]
1.6 Software

Software suites containing a variety of machine learning algorithms include the following:

1.6.1 Open-source software

dlib
ELKI
Encog
H2O
Mahout
mlpy
MLPACK
MOA (Massive Online Analysis)
ND4J with Deeplearning4j
OpenCV
OpenNN
Orange
R
scikit-learn
Shogun
Spark
Yooreeka
Weka

1.6.2 Commercial software with open-source editions

KNIME
RapidMiner

1.6.3 Commercial software

Amazon Machine Learning
Angoss KnowledgeSTUDIO
Databricks
IBM SPSS Modeler
KXEN Modeler
LIONsolver
Mathematica
MATLAB
Microsoft Azure
NeuroSolutions
Oracle Data Mining
RCASE
SAS Enterprise Miner
STATISTICA Data Miner

1.7 Journals

Journal of Machine Learning Research
Machine Learning
Neural Computation

1.8 Conferences
1.9 See also

Adaptive control
Adversarial machine learning
Automatic reasoning
Cache language model
Cognitive model
Cognitive science
Computational intelligence
Computational neuroscience
Ethics of artificial intelligence
Existential risk of artificial general intelligence
Explanation-based learning
Hidden Markov model
Important publications in machine learning
List of machine learning algorithms
1.10 References
[1] http://www.britannica.com/EBchecked/topic/1116194/
machine-learning This is a tertiary source that clearly
includes information from other sources but does not
name them.
[2] Ron Kohavi; Foster Provost (1998). Glossary of terms.
Machine Learning 30: 271274.
[3] C. M. Bishop (2006). Pattern Recognition and Machine
Learning. Springer. ISBN 0-387-31073-8.
[4] Wernick, Yang, Brankov, Yourganov and Strother, Machine Learning in Medical Imaging, IEEE Signal Processing Magazine, vol. 27, no. 4, July 2010, pp. 25-38
[5] Mannila, Heikki (1996). Data mining: machine learning, statistics, and databases. Int'l Conf. Scientific and Statistical Database Management. IEEE Computer Society.
[6] Friedman, Jerome H. (1998). Data Mining and Statistics: What's the connection?. Computing Science and Statistics
29 (1): 39.
[7] Phil Simon (March 18, 2013). Too Big to Ignore: The
Business Case for Big Data. Wiley. p. 89. ISBN 9781118638170.
[8]
[9] Harnad, Stevan (2008), The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence,
in Epstein, Robert; Peters, Grace, The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest
for the Thinking Computer, Kluwer
[11] Langley, Pat (2011). The changing science of machine learning. Machine Learning 82 (3): 275279.
doi:10.1007/s10994-011-5242-y.
[12] Le Roux, Nicolas; Bengio, Yoshua; Fitzgibbon, Andrew
(2012). Improving First and Second-Order Methods by
Modeling Uncertainty. In Sra, Suvrit; Nowozin, Sebastian; Wright, Stephen J. Optimization for Machine Learning. MIT Press. p. 404.
[13] MI Jordan (2014-09-10). statistics and machine learning. reddit. Retrieved 2014-10-01.
[14] http://projecteuclid.org/download/pdf_1/euclid.ss/
1009213726
[15] Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning.
Springer. p. vii.
[16] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
(2012) Foundations of Machine Learning, MIT Press
ISBN 9780262018258.
[17] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N.
(2011). A Survey of Multilinear Subspace Learning for
Tensor Data (PDF). Pattern Recognition 44 (7): 1540
1551. doi:10.1016/j.patcog.2011.01.004.
[18] Yoshua Bengio (2009). Learning Deep Architectures for AI. Now Publishers Inc. pp. 1–3. ISBN 978-1-60198-294-0.
[19] A. M. Tillmann, "On the Computational Intractability of
Exact and Approximate Dictionary Learning", IEEE Signal Processing Letters 22(1), 2015: 4549.
[20] Aharon, M, M Elad, and A Bruckstein. 2006. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. Signal Processing, IEEE Transactions on 54 (11): 4311–4322.
[21] Goldberg, David E.; Holland, John H. (1988). Genetic
algorithms and machine learning. Machine Learning 3
(2): 9599.
[22] Michie, D.; Spiegelhalter, D. J.; Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classication. Ellis
Horwood.
[23] Zhang, Jun; Zhan, Zhi-hui; Lin, Ying; Chen, Ni; Gong,
Yue-jiao; Zhong, Jing-hui; Chung, Henry S.H.; Li, Yun;
Shi, Yu-hui (2011). Evolutionary Computation Meets
Machine Learning: A Survey (PDF). Computational Intelligence Magazine (IEEE) 6 (4): 6875.
[24] Tesauro, Gerald (March 1995). Temporal Difference Learning and TD-Gammon. Communications of the
ACM 38 (3).
[25] Daniel Jurafsky and James H. Martin (2009). Speech and
Language Processing. Pearson Education. pp. 207 .
[26] BelKor Home Page research.att.com
[27]
[28]
[29] When A Machine Learning Algorithm Studied Fine Art
Paintings, It Saw Things Art Historians Had Never Noticed, The Physics at ArXiv blog
Chapter 2
Data mining
Not to be confused with analytics, information extraction, or data analysis.

Data mining (the analysis step of the Knowledge Discovery in Databases process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.

2.1 Etymology

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself.[5] It also is a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The popular book Data mining: Practical machine learning tools and techniques with Java[7] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[8] Often the more general terms "(large scale) data analysis" or "analytics", or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.

In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "Data Mining" appeared around 1990 in the database community. For a short time in the 1980s, the phrase "database mining" was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[9] researchers consequently turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic (KDD-1989) and this term became more popular in the AI and Machine Learning Community. However, the term data mining became more popular in the business and press communities.[10] Currently, Data Mining and Knowledge Discovery are used interchangeably. Since about 2007, "Predictive Analytics" and, since 2011, "Data Science" have also been used to describe this field.

2.2 Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology has dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[11] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.
The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[12][13] Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[14] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[15]

Computer science conferences on data mining include:

CIKM Conference – ACM Conference on Information and Knowledge Management
DMIN Conference – International Conference on Data Mining
DMKD Conference – Research Issues on Data Mining and Knowledge Discovery
ECDM Conference – European Conference on Data Mining
ECML-PKDD Conference – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

2.3 Process

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.[1]

It exists, however, in many variations on this theme, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Modeling, (5) Evaluation, and (6) Deployment.
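As a loose, illustrative mapping of these stages onto code (assuming scikit-learn; the dataset, scaler, transformation and model are arbitrary choices rather than part of any standard):

# A rough sketch of the KDD stages on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# (1) Selection: choose the data relevant to the task.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("preprocess", StandardScaler()),              # (2) Pre-processing: normalize
    ("transform", PCA(n_components=10)),           # (3) Transformation: derive features
    ("mine", LogisticRegression(max_iter=1000)),   # (4) Data mining: fit a model
])
model.fit(X_train, y_train)

# (5) Interpretation/Evaluation: validate the discovered model on held-out data.
print("held-out accuracy:", model.score(X_test, y_test))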
2.3.1 Pre-processing

2.5 Notable uses

2.5.1 Games

2.5.2 Business
In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition and acquiring new customers, cross-selling to existing customers, and profiling customers with more accuracy.[23]

In today's world raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a centralized database, but would be useless without some type of data mining software to analyze it. If Walmart analyzed their point-of-sale data with data mining techniques they would be able to determine sales trends, develop marketing campaigns, and more accurately predict customer loyalty.[24]

Every time a credit card or a store loyalty card is being used, or a warranty card is being filled, data is being collected about the user's behavior.
One example in semiconductor manufacturing consists of mining historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. Other examples[32][33] of the application of data mining methodologies in semiconductor manufacturing environments suggest that data mining methodologies may be particularly useful when data is scarce, and the various physical and chemical parameters that affect the process exhibit highly complex interactions. Another implication is that on-line monitoring of the semiconductor manufacturing process using data mining may be highly effective.
The Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and supporting technology in the United States.[46] The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining.[47] Prior to the signing of this law, estimates of only 20% of United States-based physicians were utilizing electronic patient records.[46] Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities".[46] Hence, electronic patient records further expand the possibilities regarding medical data mining, thereby opening the door to a vast source of medical data analysis.
2.5.4 Human rights

2.5.6 Spatial data mining

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[50] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses (GDWs): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy, and position.

Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e., lines and polygons) and relationships (i.e., non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.

Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

A characteristic of sensor networks used for applications such as air pollution monitoring[51] is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy due to the spatial correlation between sensor observations inspires the techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed to develop more efficient spatial data mining algorithms.[52]

2.5.9 Visual data mining

In the process of turning from analogical into digital, large data sets have been generated, collected, and stored, discovering statistical patterns, trends and information which is hidden in data, in order to build predictive patterns. Studies suggest visual data mining is faster and much more intuitive than is traditional data mining.[53][54][55] See also Computer vision.
2.5.11 Surveillance
Data mining has been used by the U.S. government. Programs include the Total Information Awareness (TIA) program, Secure Flight (formerly known as Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE),[57] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[58] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[59]
For example, an association rule "beer → potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise."[60][61][62] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search methods.
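As a small worked example of how such a rule is quantified, support and confidence can be computed directly from a list of transactions; the baskets below are invented so that the rule holds for four out of five beer buyers.

# Computing support and confidence for the rule {beer} -> {potato chips}
# on a made-up list of transactions (pure Python, no libraries required).
transactions = [
    {"beer", "potato chips", "salsa"},
    {"beer", "potato chips"},
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips", "diapers"},
    {"beer", "milk"},            # bought beer but no chips
    {"bread", "milk"},
]

has_beer = [t for t in transactions if "beer" in t]
has_both = [t for t in has_beer if "potato chips" in t]

support = len(has_both) / len(transactions)     # rule appears in 4 of 6 baskets
confidence = len(has_both) / len(has_beer)      # 4 of 5 beer buyers bought chips

print(f"support = {support:.2f}, confidence = {confidence:.2f}")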
2.5.14 Knowledge grid

2.6 Privacy concerns and ethics

While the term "data mining" itself has no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[67]

Data mining of government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[69][70]

2.6.1 Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the U.S.-E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.
In the United States, privacy concerns have been addressed in part through regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[76] This underscores the necessity for data anonymity in data aggregation and mining practices.
2.7 Copyright Law

In the United States, text and data mining of copyrighted works has in some cases been treated as fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitisation project displayed, one being text and data mining.[80]

2.8 Software

See also: Category:Data mining and machine learning software.

SCaViS: Java cross-platform data analysis framework developed at Argonne National Laboratory.
Tanagra: A visualisation-oriented data mining software, also for teaching.
Torch: An open source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.
Weka: A suite of machine learning software applications written in the Java programming language.

2.8.3 Marketplace surveys

Several researchers and organizations have conducted reviews of data mining tools and surveys of data miners. These identify some of the strengths and weaknesses of the software packages. They also provide an overview of the behaviors, preferences and views of data miners. Some of these reports include:

2.9 See also
Anomaly/outlier/change detection
Association rule learning
Classification
Cluster analysis
Decision tree
Factor analysis
Genetic algorithms
Intention mining
Multilinear subspace learning
Neural networks
Regression analysis
Sequence mining
Structured data analysis
Support vector machines
Text mining
Online analytical processing (OLAP)
Application domains
Analytics
Bioinformatics
Business intelligence
Data analysis
Data warehouse
Decision support system
Drug discovery
Exploratory data analysis
Predictive analytics
Web mining
Application examples
See also: Category:Applied data mining.
Customer analytics
Data mining in agriculture
Data mining in meteorology
Educational data mining
National Security Agency
Police-enforced ANPR in the UK
Quantitative structure–activity relationship
Surveillance / Mass surveillance (e.g., Stellar Wind)
Related topics

Data mining is about analyzing data; for information about extracting information out of data, see:

Data integration
Data transformation
Electronic discovery
Information extraction
Information integration
Named-entity recognition

2.10 References
[1] Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth,
Padhraic (1996). From Data Mining to Knowledge
Discovery in Databases (PDF). Retrieved 17 December
2008.
[2] Data Mining Curriculum. ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
[3] Clifton, Christopher (2010). Encyclopædia Britannica: Definition of Data Mining. Retrieved 2010-12-09.
[4] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome
(2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Retrieved 2012-08-07.
[5] Han, Jiawei; Kamber, Micheline (2001). Data mining: concepts and techniques. Morgan Kaufmann. p. 5. ISBN 9781558604896. "Thus, data mining should have been more appropriately named knowledge mining from data, which is unfortunately somewhat long."
[6] See e.g. OKAIRP 2005 Fall Conference, Arizona State
University About.com: Datamining
[7] Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning
Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12374856-0.
[8] Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). WEKA Experiences with a Java open-source project. Journal of Machine Learning Research 11: 2533–2541. "the original title, Practical machine learning, was changed ... The term data mining was [added] primarily for marketing reasons."
[9] Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4.
[10] Piatetsky-Shapiro, Gregory; Parker, Gary (2011).
Lesson: Data Mining, and Knowledge Discovery: An
Introduction. Introduction to Data Mining. KD Nuggets.
Retrieved 30 August 2012.
[11] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336.
[43] Zernik, Joseph; Data Mining as a Civic Duty Online Public Prisoners Registration Systems, International
Journal on Social Media: Monitoring, Measurement, Mining, 1: 8496 (2010)
[60] Agrawal, Rakesh; Mannila, Heikki; Srikant, Ramakrishnan; Toivonen, Hannu; and Verkamo, A. Inkeri; Fast discovery of association rules, in Advances in knowledge discovery and data mining, MIT Press, 1996, pp. 307328
[61] National Research Council, Protecting Individual Privacy
in the Struggle Against Terrorists: A Framework for Program Assessment, Washington, DC: National Academies
Press, 2008
[62] Haag, Stephen; Cummings, Maeve; Phillips, Amy (2006). Management Information Systems for the information age. Toronto: McGraw-Hill Ryerson. p. 28. ISBN 0-07-095569-7. OCLC 63194770.
[63] Ghanem, Moustafa; Guo, Yike; Rowe, Anthony; Wendel, Patrick (2002). Grid-based knowledge discovery services for high throughput informatics. Proceedings 11th IEEE International Symposium on High Performance Distributed Computing. p. 416. doi:10.1109/HPDC.2002.1029946. ISBN 0-7695-1686-6.
[64] Ghanem, Moustafa; Curcin, Vasa; Wendel, Patrick;
Guo, Yike (2009).
Building and Using Analytical Workflows in Discovery Net.
Data Mining Techniques in Grid Computing Environments.
p. 119. doi:10.1002/9780470699904.ch8. ISBN
9780470699904.
[65] Cannataro, Mario; Talia, Domenico (January 2003). The
Knowledge Grid: An Architecture for Distributed Knowledge Discovery (PDF). Communications of the ACM 46
(1): 8993. doi:10.1145/602421.602425. Retrieved 17
October 2011.
[66] Talia, Domenico; Trunfio, Paolo (July 2010). How distributed data mining tasks can thrive as knowledge services (PDF). Communications of the ACM 53 (7): 132–137. doi:10.1145/1785414.1785451. Retrieved 17 October 2011.
[53] Zhao, Kaidi; and Liu, Bing; Tirpark, Thomas M.; and
Weimin, Xiao; A Visual Data Mining Framework for Convenient Identification of Useful Knowledge
[67] Seltzer, William. The Promise and Pitfalls of Data Mining: Ethical Issues (PDF).
[68] Pitts, Chip (15 March 2007). The End of Illegal Domestic Spying? Don't Count on It. Washington Spectator.
[69] Taipale, Kim A. (15 December 2003). Data Mining and
Domestic Security: Connecting the Dots to Make Sense
of Data. Columbia Science and Technology Law Review
5 (2). OCLC 45263753. SSRN 546782.
[70] Resig, John; and Teredesai, Ankur (2004). A Framework for Mining Instant Messaging Services. Proceedings of the 2004 SIAM DM Conference.
[71] Think Before You Dig: Privacy Implications of Data Mining & Aggregation, NASCIO Research Brief, September
2004
[72] Ohm, Paul. Don't Build a Database of Ruin. Harvard
Business Review.
[73] Darwin Bond-Graham, Iron Cagebook - The Logical End of Facebook's Patents, Counterpunch.org, 2013.12.03
[74] Darwin Bond-Graham, Inside the Tech industry's Startup
Conference, Counterpunch.org, 2013.09.11
[75] AOL search data identified individuals, SecurityFocus,
August 2006
[76] Biotech Business Week Editors (June 30, 2008);
BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical
Research, Biotech Business Week, retrieved 17 November
2009 from LexisNexis Academic
[77] UK Researchers Given Data Mining Right Under New
UK Copyright Laws. Out-Law.com. Retrieved 14
November 2014
[78] Licences for Europe - Structured Stakeholder Dialogue
2013. European Commission. Retrieved 14 November
2014.
[79] Text and Data Mining: Its importance and the need for
change in Europe. Association of European Research Libraries. Retrieved 14 November 2014.
[80] Judge grants summary judgment in favor of Google
Books a fair use victory. Lexology.com. Antonelli
Law Ltd. Retrieved 14 November 2014.
[81] Mikut, Ralf; Reischl, Markus (SeptemberOctober
2011). Data Mining Tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1 (5): 431
445. doi:10.1002/widm.24. Retrieved October 21, 2011.
[82] Karl Rexer, Heather Allen, & Paul Gearan (2011);
Understanding Data Miners, Analytics Magazine,
May/June 2011 (INFORMS: Institute for Operations
Research and the Management Sciences).
[83] Kobielus, James; The Forrester Wave: Predictive Analytics
and Data Mining Solutions, Q1 2010, Forrester Research,
1 July 2008
[84] Herschel, Gareth; Magic Quadrant for Customer DataMining Applications, Gartner Inc., 1 July 2008
[85] Nisbet, Robert A. (2006); Data Mining Tools: Which One
is Best for CRM? Part 1, Information Management Special
Reports, January 2006
[86] Haughton, Dominique; Deichmann, Joel; Eshghi, Abdolreza; Sayek, Selin; Teebagy, Nicholas; and Topi, Heikki
(2003); A Review of Software Packages for Data Mining,
The American Statistician, Vol. 57, No. 4, pp. 290309
[87] Goebel, Michael; Gruenwald, Le (1999); A Survey of
Data Mining and Knowledge Discovery Software Tools,
SIGKDD Explorations, Vol. 1, Issue 1, pp. 2033
Chapter 3
Statistical classification
For the unsupervised learning approach, see Cluster analysis.

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes, or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

Terminology across fields is quite varied. In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. There is also some argument over whether classification methods that do not involve a statistical model can be considered "statistical". Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis, i.e. a type of unsupervised learning, rather than the supervised learning described in this article.

3.1 Relation to other problems

In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function.

A common subclass of classification is probabilistic classification. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. However, such an algorithm has numerous advantages over non-probabilistic classifiers:

It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier).
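As an illustration of probabilistic classification, the sketch below trains a logistic regression model and prints the per-class probabilities together with the selected best class; it assumes scikit-learn, and the dataset choice is arbitrary.

# Probabilistic classification sketch: per-class probabilities plus the
# highest-probability class.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[:1]                                      # one observation
print("class probabilities:", clf.predict_proba(sample)[0])
print("selected class:", clf.predict(sample)[0])    # argmax of the probabilities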
Algorithms with this basic setup are known as linear classifiers. What distinguishes them is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted; a small sketch of this setup follows the list below.

Examples of such algorithms are:
Logistic regression and Multinomial logistic regression
Probit regression
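As a concrete illustration of the linear, probabilistic setup described above, the following minimal sketch (an assumption of this edit, not part of the original article) computes per-class linear scores for one feature vector and converts them into class probabilities with a softmax, roughly as a multinomial logistic regression model would; the weights and feature values are hypothetical, not learned.

# Minimal sketch of a linear, probabilistic classifier (hypothetical weights).
import numpy as np

classes = ["spam", "non-spam"]
W = np.array([[1.2, -0.4, 0.8],    # weight vector for "spam" (illustrative)
              [-0.9, 0.3, -0.2]])  # weight vector for "non-spam" (illustrative)
b = np.array([-0.5, 0.1])

x = np.array([2.0, 0.0, 1.0])      # feature vector for one e-mail

scores = W @ x + b                 # linear scores, one per class
probs = np.exp(scores - scores.max())
probs /= probs.sum()               # softmax turns scores into probabilities

best = classes[int(np.argmax(probs))]
print(dict(zip(classes, probs.round(3))), "->", best)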
3.7 Algorithms
Examples of classification algorithms include:

Linear classifiers
Fisher's linear discriminant
Logistic regression
Naive Bayes classifier
Perceptron
Support vector machines
Least squares support vector machines
Quadratic classifiers
Kernel estimation
k-nearest neighbor
Boosting (meta-algorithm)
Decision trees
Random forests
Neural networks
Learning vector quantization
3.8 Evaluation
Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is, however, still more an art than a science.

The measures precision and recall are popular metrics used to evaluate the quality of a classification system. More recently, receiver operating characteristic (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms.

As a performance metric, the uncertainty coefficient has the advantage over simple accuracy in that it is not affected by the relative sizes of the different classes.[10] Further, it will not penalize an algorithm for simply rearranging the classes.
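A short, hedged sketch of the measures just mentioned: given a binary confusion matrix (the counts below are invented for illustration), precision, recall and simple accuracy can be computed directly; the true- and false-positive rates are the quantities plotted on a ROC curve.

# Precision, recall, accuracy and one ROC point from a confusion matrix.
tp, fp, fn, tn = 40, 10, 5, 45   # illustrative counts, not from the article

precision = tp / (tp + fp)
recall    = tp / (tp + fn)        # also the true-positive rate
fpr       = fp / (fp + tn)        # false-positive rate (x-axis of a ROC curve)
accuracy  = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"FPR={fpr:.2f} accuracy={accuracy:.2f}")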
3.11 References
[1] Alpaydin, Ethem (2010). Introduction to Machine Learning. MIT Press. p. 9. ISBN 978-0-262-01243-0.

[2] Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics 7: 179–188.

[3] Fisher, R.A. (1938). "The statistical utilization of multiple measurements". Annals of Eugenics 8: 376–386.

[4] Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley. ISBN 0-471-30845-5. (pp. 83–86)

[5] Rao, C.R. (1952). Advanced Statistical Methods in Multivariate Analysis. Wiley. (Section 9c)

[6] Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley.

[7] Binder, D.A. (1978). "Bayesian cluster analysis". Biometrika 65: 31–38.

[8] Binder, D.A. (1981). "Approximations to Bayesian clustering rules". Biometrika 68: 275–285.

[9] Har-Peled, S.; Roth, D.; Zimak, D. (2003). "Constraint Classification for Multiclass Classification and Ranking". In: Becker, B.; Thrun, S.; Obermayer, K. (Eds.), Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference. MIT Press. ISBN 0-262-02550-7.

[10] Mills, Peter (2011). "Efficient statistical classification of satellite measurements". International Journal of Remote Sensing. doi:10.1080/01431161.2010.507795.
Chapter 4
Cluster analysis
For the supervised learning approach, see Statistical classification.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). Cluster analysis itself is not one specific algorithm but the general task to be solved; it is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.
Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939,[1][2] and was famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology.
4.1 Definition

According to Vladimir Estivill-Castro, the notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.[4] There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as multivariate normal
distributions used by the Expectation-maximization
algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.

Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.

Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques.

A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:

hard clustering: each object belongs to a cluster or not
soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

strict partitioning clustering: here each object belongs to exactly one cluster
strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers
overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster
hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap

4.2 Algorithms

Main category: Data clustering algorithms

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.

There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder."[4] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model.[4] For example, k-means cannot find non-convex clusters.[4]

4.2.1 Connectivity based clustering (hierarchical clustering)

Main article: Hierarchical clustering

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram; this explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion to use (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to). Popular choices are known as single-linkage clustering (the minimum of object distances), complete-linkage clustering (the maximum of object distances) or UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular with single-linkage clustering). In the general case, the complexity is O(n³), which makes these methods too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n²)) are known: SLINK[5] for single-linkage and CLINK[6] for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but often considered obsolete. They did however provide inspiration for many later methods such as density-based clustering.
Linkage clustering examples

[Figure] Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect.
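The following sketch shows single-linkage agglomerative clustering in the sense described above, using SciPy's hierarchical-clustering routines; the toy data, the distance threshold and the choice of linkage are illustrative assumptions, not taken from the article.

# Single-linkage hierarchical clustering on toy 2-D data (illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

Z = linkage(X, method="single")                     # merge table / dendrogram
labels = fcluster(Z, t=1.0, criterion="distance")   # cut the hierarchy at distance 1.0

print(sorted(set(labels)))                          # cluster ids found at this cut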
4.2.4 Density-based clustering
4.2.6 Other methods
4.3.1 Internal evaluation
Dunn index:

D = \frac{\min_{1 \le i < j \le n} d(i,j)}{\max_{1 \le k \le n} d'(k)}

where d(i,j) is the distance between clusters i and j, and d'(k) measures the intra-cluster distance of cluster k.

4.3.2 External evaluation

Rand measure:

RI = \frac{TP + TN}{TP + FP + FN + TN}

Precision and recall:

P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}

F-measure:

F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}

Recall has no impact on the F-measure when β = 0, and increasing β allocates an increasing amount of weight to recall in the final F-measure.
Jaccard index

The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}

Fowlkes–Mallows index

The Fowlkes–Mallows index computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications:

FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The FM index is the geometric mean of the precision and recall P and R, while the F-measure is their harmonic mean.[35] Moreover, precision and recall are also known as Wallace's indices B^I and B^II.[36]
The Mutual Information is an information theoretic measure of how much information is shared
between a clustering and a ground-truth classication that can detect a non-linear similarity between
two clusterings. Adjusted mutual information is the
corrected-for-chance variant of this that has a reduced bias for varying cluster numbers.
Confusion matrix
A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster
is from the gold standard cluster.
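As a sketch of the pair-counting quantities used above (TP, FP, FN and TN counted over pairs of objects), the code below compares a clustering against a ground-truth labelling and derives the Rand index, precision, recall, Jaccard and Fowlkes–Mallows values; the two labelings are invented purely for illustration.

# Pair-counting comparison of a clustering with a ground-truth classification.
from itertools import combinations
from math import sqrt

truth   = [0, 0, 0, 1, 1, 1, 2, 2]   # illustrative gold-standard classes
cluster = [0, 0, 1, 1, 1, 1, 2, 2]   # illustrative clustering result

tp = fp = fn = tn = 0
for i, j in combinations(range(len(truth)), 2):
    same_truth = truth[i] == truth[j]
    same_clust = cluster[i] == cluster[j]
    if same_truth and same_clust:       tp += 1
    elif not same_truth and same_clust: fp += 1
    elif same_truth and not same_clust: fn += 1
    else:                               tn += 1

rand      = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
jaccard   = tp / (tp + fp + fn)
fm        = sqrt(precision * recall)    # geometric mean of P and R

print(rand, precision, recall, jaccard, fm)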
4.4 Applications

Biology, computational biology and bioinformatics

Plant and animal ecology — cluster analysis is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.

Transcriptomics — clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High-throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.

Sequence analysis — clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

High-throughput genotyping platforms — clustering algorithms are used to automatically assign genotypes.

Human genetic clustering — the similarity of genetic data is used in clustering to infer population structures.

Medicine

Medical imaging — on PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three-dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.

Business and marketing

Market research

Computer science

Software evolution

World wide web

Social science

Crime analysis

Others

4.5 See also

4.5.1 Specialized types of cluster analysis

Clustering high-dimensional data
Conceptual clustering
Consensus clustering
Constrained clustering
Data stream clustering
Sequence clustering
Spectral clustering

4.5.2

4.5.3

Dimension reduction
Principal component analysis
Multidimensional scaling

4.5.4 Other

Cluster-weighted modeling
Curse of dimensionality
Determining the number of clusters in a data set
Parallel coordinates
Structured data analysis
4.6 References
[1] Bailey, Ken (1994). Numerical Taxonomy and Cluster
Analysis. Typologies and Taxonomies. p. 34. ISBN
9780803952591.
[2] Tryon, Robert C. (1939). Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers.

[3] Cattell, R. B. (1943). "The description of personality: Basic traits resolved into clusters". Journal of Abnormal and Social Psychology 38: 476–506. doi:10.1037/h0054116.
[5] Sibson, R. (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method" (PDF). The Computer Journal (British Computer Society) 16 (1): 30–34. doi:10.1093/comjnl/16.1.30.

[6] Defays, D. (1977). "An efficient algorithm for a complete link method". The Computer Journal (British Computer Society) 20 (4): 364–366. doi:10.1093/comjnl/20.4.364.

[7] Lloyd, S. (1982). "Least squares quantization in PCM". IEEE Transactions on Information Theory 28 (2): 129–137. doi:10.1109/TIT.1982.1056489.

[8] Kriegel, Hans-Peter; Kröger, Peer; Sander, Jörg; Zimek, Arthur (2011). "Density-based Clustering". WIREs Data Mining and Knowledge Discovery 1 (3): 231–240. doi:10.1002/widm.30.

[9] Microsoft academic search: most cited data mining articles: DBSCAN is on rank 24, when accessed on 4/18/2010.

[10] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9. CiteSeerX: 10.1.1.71.1980.

[11] Ankerst, Mihael; Breunig, Markus M.; Kriegel, Hans-Peter; Sander, Jörg (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. CiteSeerX: 10.1.1.129.6542.

[12] Achtert, E.; Böhm, C.; Kröger, P. (2006). "DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". LNCS: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 3918: 119–128. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.

[13] Roy, S.; Bhattacharyya, D. K. (2005). "An Approach to find Embedded Clusters Using Density Based Techniques". LNCS Vol. 3816. Springer Verlag. pp. 523–535.

[14] Sculley, D. (2010). "Web-scale k-means clustering". Proc. 19th WWW.

[15] Huang, Z. (1998). "Extensions to the k-means algorithm for clustering large data sets with categorical values". Data Mining and Knowledge Discovery 2: 283–304.
[18] Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases". ACM Transactions on Database Systems 15 (4): 483. doi:10.1145/99935.99938.

[19] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005). "Automatic Subspace Clustering of High Dimensional Data". Data Mining and Knowledge Discovery 11: 5. doi:10.1007/s10618-005-1396-1.

[20] Karin Kailing, Hans-Peter Kriegel and Peer Kröger. "Density-Connected Subspace Clustering for High-Dimensional Data". In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246–257, 2004.

[21] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2006). "Finding Hierarchies of Subspace Clusters". LNCS: Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science 4213: 446–453. doi:10.1007/11871637_42. ISBN 978-3-540-45374-1.

[22] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2007). "Detection and Visualization of Subspace Cluster Hierarchies". LNCS: Advances in Databases: Concepts, Systems and Applications. Lecture Notes in Computer Science 4443: 152–163. doi:10.1007/978-3-540-71703-4_15. ISBN 978-3-540-71702-7.

[23] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). "Mining Hierarchies of Correlation Clusters". Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119–128. doi:10.1109/SSDBM.2006.35. ISBN 0-7695-2590-3.

[24] Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. p. 455. doi:10.1145/1007568.1007620. ISBN 1581138598.

[25] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A. (2007). "On Exploring Complex Relationships of Correlation Clusters". 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007). p. 7. doi:10.1109/SSDBM.2007.21. ISBN 0-7695-2868-6.

[26] Meilă, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777: 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1.

[27] Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (1 December 2003) [28 November 2003]. "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.

[28] Auffarth, B. (July 18–23, 2010). "Clustering by a Genetic Algorithm with Biased Mutation Operator". WCCI CEC (IEEE). CiteSeerX: 10.1.1.170.869.

[29] Frey, B. J.; Dueck, D. (2007). "Clustering by Passing Messages Between Data Points". Science 315 (5814): 972–976. doi:10.1126/science.1136800. PMID 17218491.
Chapter 5
Anomaly detection
In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset.[1] Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or finding errors in text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.[2]
In particular in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.[3]
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference to many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.
5.1 Applications
Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often used in preprocessing to remove anomalous data from the dataset. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy.[4][5]

5.2 Popular techniques

Several anomaly detection techniques have been proposed in the literature. Some of the popular techniques are:
Density-based techniques (k-nearest neighbor,[6][7][8] local outlier factor,[9] and many more
variations of this concept[10] ).
Subspace-[11] and correlation-based [12] outlier detection for high-dimensional data.[13]
One class support vector machines.[14]
Replicator neural networks.
Cluster analysis based outlier detection.[15]
Deviations from association rules and frequent itemsets.
Fuzzy logic based outlier detection.
Ensemble techniques, using feature bagging,[16][17]
score normalization[18][19] and different sources of
diversity.[20][21]
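A minimal sketch of the distance-based (k-nearest-neighbour) idea from the list above: the outlier score of each point is its distance to its k-th nearest neighbour, so points in sparse regions score high. The toy data and the choice k = 3 are illustrative assumptions, not taken from the article.

# k-nearest-neighbour distance as a simple outlier score (illustrative data).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # "normal" points
               np.array([[8.0, 8.0]])])          # one obvious anomaly

k = 3
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
np.fill_diagonal(d, np.inf)                                 # ignore self-distances
knn_dist = np.sort(d, axis=1)[:, k - 1]                     # distance to k-th neighbour

print("most anomalous index:", int(np.argmax(knn_dist)))    # expected: the last point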
5.4 Software
ELKI is an open-source Java data mining toolkit that
contains several anomaly detection algorithms, as
well as index acceleration for them.
5.6 References
[1] Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey" (PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.

[2] Hodge, V. J.; Austin, J. (2004). "A Survey of Outlier Detection Methodologies" (PDF). Artificial Intelligence Review 22 (2): 85. doi:10.1007/s10462-004-4304-y.

[13] Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining 5 (5): 363–387. doi:10.1002/sam.11161.

[15] He, Z.; Xu, X.; Deng, S. (2003). "Discovering cluster-based local outliers". Pattern Recognition Letters 24 (9–10): 1641. doi:10.1016/S0167-8655(03)00003-5.

[16] Lazarevic, A.; Kumar, V. (2005). "Feature bagging for outlier detection". Proc. 11th ACM SIGKDD international conference on Knowledge Discovery in Data Mining: 157–166. doi:10.1145/1081870.1081891.

[17] Nguyen, H. V.; Ang, H. H.; Gopalkrishnan, V. (2010). "Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces". Database Systems for Advanced Applications. Lecture Notes in Computer Science 5981. p. 368. doi:10.1007/978-3-642-12026-8_29. ISBN 978-3-642-12025-1.

[18] Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2011). "Interpreting and Unifying Outlier Scores" (PDF). Proceedings of the 2011 SIAM International Conference on Data Mining. pp. 13–24. doi:10.1137/1.9781611972818.2. ISBN 978-0-89871-992-5.

[19] Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. (2012). "On Evaluation of Outlier Rankings and Outlier Scores" (PDF). Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0.
Chapter 6

Association rule learning
6.1 Definition

Following the original definition by Agrawal et al.,[2] the problem of association rule mining is defined as: Let I = {i_1, i_2, ..., i_n} be a set of n binary attributes called items. Let D = {t_1, t_2, ..., t_m} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side, LHS) and consequent (right-hand side, RHS) of the rule, respectively.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer, diapers}, and the table to the right shows a small database containing the items (1 codes the presence and 0 the absence of an item in a transaction). An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
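To make the definition concrete, the sketch below encodes a tiny transaction database in the spirit of the supermarket example (the exact transactions are invented here, since the article's table is not reproduced in this copy) and computes the standard support and confidence of the rule {butter, bread} ⇒ {milk}; support and confidence are the usual interestingness measures, even though their formal definitions fall outside the surviving text.

# Support and confidence of {butter, bread} => {milk} on a toy database.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

X, Y = {"butter", "bread"}, {"milk"}

supp_XY = sum(X | Y <= t for t in transactions) / len(transactions)  # support of X and Y
supp_X  = sum(X <= t for t in transactions) / len(transactions)      # support of X alone

confidence = supp_XY / supp_X
print(f"support={supp_XY:.2f} confidence={confidence:.2f}")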
Other commonly used measures of the interestingness of a rule include:

Conviction[13]
Leverage[14]
When searching massive numbers of possible associations for collections of items that appear to be associated, there is a large risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing two items in the left-hand side and one item in the right-hand side. There are approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with a significance level of 0.05, it means there is only a 5% chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless expect to find 50,000,000,000 rules. Statistically sound association discovery[17][18] controls this risk, in most cases reducing the risk of finding any spurious associations to a user-specified significance level.
6.7 Algorithms
Many algorithms for generating association rules have been presented over time. Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done afterwards to generate rules from the frequent itemsets found in a database. Once the recursive process has completed, all large itemsets with minimum coverage have been found, and association rule creation begins.[19]
6.7.1 Apriori algorithm

Apriori[7] is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function which exploits the downward closure property of support.
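The following is a compact, illustrative sketch of the Apriori idea just described (breadth-first candidate generation plus support counting, with downward-closure pruning); it is not the article's reference implementation, and the toy database and minimum-support threshold are arbitrary choices for the example.

# Apriori-style frequent itemset mining (illustrative sketch).
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in level}

    k = 2
    while level:
        # Join step: candidates of size k from frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent

db = [{"milk", "bread"}, {"butter"}, {"beer", "diapers"},
      {"milk", "bread", "butter"}, {"bread"}]
for itemset, s in apriori(db, min_support=0.4).items():
    print(set(itemset), round(s, 2))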
6.7.2 Eclat algorithm

Eclat[8] (alt. ECLAT, which stands for Equivalence Class Transformation) is a depth-first search algorithm using set intersection. It is a naturally elegant algorithm suitable for both sequential and parallel execution, with locality-enhancing properties. It was first introduced by Zaki, Parthasarathy, Li and Ogihara in a series of papers written in 1997.

6.7.4 Others

AprioriDP

AprioriDP[20] utilizes dynamic programming in frequent itemset mining. Its working principle is to eliminate candidate generation, as FP-tree does, but it stores support counts in a specialized data structure instead of a tree.

Context Based Association Rule Mining Algorithm

Main article: Context Based Association Rules

CBPNARM is a newly developed algorithm (2013) for mining association rules on the basis of context. It uses a context variable on the basis of which the support of an itemset is changed, and the rules are finally populated to the rule set accordingly.
Multi-Relation Association Rules: Multi-Relation Association Rules (MRAR) are a new class of association rules in which, in contrast to primitive, simple and even multi-relational association rules (that are usually extracted from multi-relational databases), each rule item consists of one entity but several relations. These relations indicate indirect relationships between the entities. Consider the following MRAR, where the first item consists of three relations live in, nearby and humid: "Those who live in a place which is nearby a city with a humid climate type and also are younger than 20 → their health condition is good". Such association rules are extractable from RDBMS data or semantic web data.[30]
GUHA is a general method for exploratory data analysis that has theoretical foundations in observational calculi.[24]
6.8 Lore
A famous story about association rule mining is the beer
and diaper story. A purported survey of behavior of supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer.
This anecdote became popular as an example of how unexpected association rules might be found from everyday
data. There are varying opinions as to how much of the
story is true.[29] Daniel Powers says:[29]
Sequential rules — discovering relationships between items while considering the time ordering. They are generally applied on a sequence database. For example, a sequential rule found in a database of sequences of customer transactions can be that customers who bought a computer and CD-ROMs later bought a webcam, with a given confidence and support.

Warmr is shipped as part of the ACE data mining suite. It allows association rule learning for first-order relational rules.[38]
6.11 References
[1] Piatetsky-Shapiro, Gregory (1991), Discovery, analysis,
and presentation of strong rules, in Piatetsky-Shapiro,
Gregory; and Frawley, William J.; eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
[2] Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93. p. 207. doi:10.1145/170035.170072. ISBN 0897915925.
[3] Michael Hahsler (2015). A Probabilistic Comparison
of Commonly Used Interest Measures for Association
Rules. http://michael.hahsler.net/research/association_
rules/measures.html
[4] Hipp, J.; Güntzer, U.; Nakhaeizadeh, G. (2000). "Algorithms for association rule mining – a general survey and comparison". ACM SIGKDD Explorations Newsletter 2: 58. doi:10.1145/360402.360421.
[5] Tan, Pang-Ning; Michael, Steinbach; Kumar, Vipin
(2005). Chapter 6. Association Analysis: Basic Concepts and Algorithms (PDF). Introduction to Data Mining. Addison-Wesley. ISBN 0-321-32136-7.
[6] Pei, Jian; Han, Jiawei; and Lakshmanan, Laks V. S.; Mining frequent itemsets with convertible constraints, in Proceedings of the 17th International Conference on Data Engineering, April 2–6, 2001, Heidelberg, Germany, 2001, pages 433-442
[7] Agrawal, Rakesh; and Srikant, Ramakrishnan; Fast algorithms for mining association rules in large databases, in
Bocca, Jorge B.; Jarke, Matthias; and Zaniolo, Carlo; editors, Proceedings of the 20th International Conference on
Very Large Data Bases (VLDB), Santiago, Chile, September 1994, pages 487-499
[8] Zaki, M. J. (2000). "Scalable algorithms for association mining". IEEE Transactions on Knowledge and Data Engineering 12 (3): 372–390. doi:10.1109/69.846291.

[9] Hájek, Petr; Havel, Ivan; Chytil, Metoděj; The GUHA method of automatic hypotheses determination, Computing 1 (1966) 293-308

[10] Hájek, Petr; Feglar, Tomas; Rauch, Jan; and Coufal, David; The GUHA method, data preprocessing and mining, Database Support for Data Mining Applications, Springer, 2004, ISBN 978-3-540-22479-2

[11] Omiecinski, Edward R.; Alternative interest measures for mining associations in databases, IEEE Transactions on Knowledge and Data Engineering, 15(1):57-69, Jan/Feb 2003
[12] Aggarwal, Charu C.; and Yu, Philip S.; A new framework
for itemset generation, in PODS 98, Symposium on Principles of Database Systems, Seattle, WA, USA, 1998, pages
18-24
[13] Brin, Sergey; Motwani, Rajeev; Ullman, Jeffrey D.; and Tsur, Shalom; Dynamic itemset counting and implication rules for market basket data, in SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona, USA, May 1997, pp. 255-264
[14] Piatetsky-Shapiro, Gregory; Discovery, analysis, and
presentation of strong rules, Knowledge Discovery in
Databases, 1991, pp. 229-248
[15] Brin, Sergey; Motwani, Rajeev; Ullman, Jeffrey D.; and Tsur, Shalom; Dynamic itemset counting and implication rules for market basket data, in SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona, USA, May 1997, pp. 265-276
[16] Tan, Pang-Ning; Kumar, Vipin; and Srivastava, Jaideep;
Selecting the right objective measure for association analysis, Information Systems, 29(4):293-313, 2004
[17] Webb, Geoffrey I. (2007); Discovering Significant Patterns, Machine Learning 68(1), Netherlands: Springer, pp. 1-33 online access
[18] Gionis, Aristides; Mannila, Heikki; Mielikäinen, Taneli; and Tsaparas, Panayiotis; Assessing Data Mining Results via Swap Randomization, ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 3 (December 2007), Article No. 14
[19] Witten, Frank, Hall: Data mining practical machine
learning tools and techniques, 3rd edition
[20] D. Bhalodiya, K. M. Patel and C. Patel. An Efficient way to Find Frequent Pattern with Dynamic Programming Approach. NIRMA UNIVERSITY INTERNATIONAL CONFERENCE ON ENGINEERING, NUiCONE-2013, 28-30 NOVEMBER, 2013.
[21] Z. H. Deng and S. L. Lv. Fast mining frequent itemsets using Nodesets. Expert Systems with Applications, 41(10): 4505–4512, 2014.
[22] Z. H. Deng, Z. Wang,and J. Jiang. A New Algorithm for
Fast Mining Frequent Itemsets Using N-Lists . SCIENCE
CHINA Information Sciences, 55 (9): 2008 - 2030, 2012.
[25] Hájek, Petr; Havránek, Tomáš (1978). Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory. Springer-Verlag. ISBN 3-540-08738-9.

[26] Webb, Geoffrey I. (1995); OPUS: An Efficient Admissible Algorithm for Unordered Search, Journal of Artificial Intelligence Research 3, Menlo Park, CA: AAAI Press, pp. 431-465 online access

[27] Bayardo, Roberto J., Jr.; Agrawal, Rakesh; Gunopulos, Dimitrios (2000). "Constraint-based rule mining in large, dense databases". Data Mining and Knowledge Discovery 4 (2): 217–240. doi:10.1023/A:1009895914772.

[28] Webb, Geoffrey I. (2000); Efficient Search for Association Rules, in Ramakrishnan, Raghu; and Stolfo, Sal; eds.; Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, New York, NY: The Association for Computing Machinery, pp. 99-107 online access

[29] http://www.dssresources.com/newsletters/66.php

[38] "Warmr: a data mining tool for chemical data". J Comput Aided Mol Des 15 (2): 173–81. Feb 2001. PMID 11272703.
6.12.1 Bibliographies
Extensive Bibliography on Association Rules by
J.M. Luna
Annotated Bibliography on Association Rules by M.
Hahsler
Statsoft Electronic Statistics Textbook: Association
Rules by Dell Software
6.12.2 Implementations
Open-Source data-mining suites
Christian Borgelt's implementations of Apriori, FP-Growth and Eclat, written in C with Python bindings.
Ruby implementation (AI4R)
Zaki, Mohammed J.; Data Mining Software
Commercial offers

KNIME, an open source workflow-oriented data preprocessing and analysis platform
KXEN, a commercial Data Mining software
LISp Miner, mines for generalized (GUHA) association rules (uses bitstrings, not apriori algorithm)
Magnum Opus, a system for statistically sound association discovery
RapidMiner, a Java data mining software suite
STATISTICA, commercial statistics software with
an Association Rules module
Chapter 7
Reinforcement learning
For reinforcement learning in psychology, see Reinforcement.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. The problem has also been studied in the theory of optimal control, though most studies there are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

7.1 Introduction

The basic reinforcement learning model consists of:

a set of environment states S;
a set of actions A;
rules of transitioning between states;
rules that determine the scalar immediate reward of a transition; and
rules that describe what the agent observes.

A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o_t, which typically includes the reward r_t. It then chooses an action a_t from the set of actions available, which is subsequently sent to the environment. The environment moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection.

When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of regret. Note that in order to act near optimally, the agent must reason about the long-term consequences of its actions: "In order to maximize my future income I had better go to school now, although the immediate monetary reward associated with this might be negative."

Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers.
The expected return of a policy π is

ρ = E[R|π],

where the random variable R denotes the return and is defined by

R = \sum_{t=0}^{N-1} \gamma^t r_{t+1},

giving rise to the total expected discounted reward criterion. Here 0 ≤ γ ≤ 1 is the so-called discount factor. Since the undiscounted return is a special case of the discounted return, from now on we will assume discounting. Although this looks innocent enough, discounting is in fact problematic if one cares about online performance. This is because discounting makes the initial time steps more important. Since a learning agent is likely to make mistakes during the first few steps after its "life" starts, no uninformed learning algorithm can achieve near-optimal performance under discounting, even if the class of environments is restricted to that of finite MDPs. (This does not mean, though, that given enough time a learning agent cannot figure out how to act near-optimally, if time was restarted.)

The problem then is to specify an algorithm that can be used to find a policy with maximum expected return. From the theory of MDPs it is known that, without loss of generality, the search can be restricted to the set of so-called stationary policies.

The only way to collect information about the environment is by interacting with it; the first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered a genuine learning problem.
7.3.2 Brute force
The brute force approach entails the following two steps:

1. For each possible policy, sample returns while following it.
2. Choose the policy with the largest expected return.

One problem with this is that the number of policies can be extremely large, or even infinite. Another is that the variance of the returns might be large, in which case a large number of samples will be required to accurately estimate the return of each policy.

These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one policy to influence the estimates made for another. The two main approaches for achieving this are value function estimation and direct policy search.

7.3.3 Value function approaches

Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the "current" or the optimal one).

These methods rely on the theory of MDPs, where optimality is defined in a sense which is stronger than the above one: a policy is called optimal if it achieves the best expected return from any initial state (i.e., initial distributions play no role in this definition). Again, one can always find an optimal policy amongst stationary policies. To define optimality in a formal manner, define the value of a policy π by

V^π(s) = E[R|s, π],

where R stands for the random return associated with following π from the initial state s. Define V*(s) as the maximum possible value of V^π(s), where π is allowed to change:

V*(s) = \sup_π V^π(s).

The action value of a state-action pair (s, a) under a policy π is defined analogously by

Q^π(s, a) = E[R|s, a, π],

where, now, R stands for the random return associated with first taking action a in state s and following π thereafter.

Assuming full knowledge of the MDP, there are two basic approaches to compute the optimal action-value function: value iteration and policy iteration. Both algorithms compute a sequence of functions Q_k (k = 0, 1, 2, ...) which converge to Q*. Computing these functions involves computing expectations over the whole state space, which is impractical for all but the smallest (finite) MDPs, never mind the case when the MDP is unknown. In reinforcement learning methods the expectations are approximated by averaging over samples, and one uses function approximation techniques to cope with the need to represent value functions over large state-action spaces.
In the policy evaluation step, the action value Q(s, a) is estimated by averaging the sampled returns that originated from (s, a) over time. Given enough time, this procedure can thus construct a precise estimate Q of the action-value function Q^π. This finishes the description of the policy evaluation step. In the policy improvement step, as is done in the standard policy iteration algorithm, the next policy is obtained by computing a greedy policy with respect to Q: given a state s, this new policy returns an action that maximizes Q(s, ·). In practice, one often avoids computing and storing the new policy, but uses lazy evaluation to defer the computation of the maximizing actions to when they are actually needed.
A few problems with this procedure are as follows:

The procedure may waste too much time on evaluating a suboptimal policy.
It uses samples inefficiently, in that a long trajectory is used to improve the estimate only of the single state-action pair that started the trajectory.
When the returns along the trajectories have high variance, convergence will be slow.
It works in episodic problems only.
It works in small, finite MDPs only.

Temporal difference methods

The first issue is easily corrected by allowing the procedure to change the policy (at all, or at some states) before the values settle. However good this sounds, this may be dangerous, as this might prevent convergence. Still, most current algorithms implement this idea, giving rise to the class of generalized policy iteration algorithms. We note in passing that actor-critic methods belong to this category.

The second issue can be corrected within the algorithm by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is to use Sutton's temporal difference (TD) methods, which are based on the recursive Bellman equation. Note that the computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away) or batch (when the transitions are collected and then the estimates are computed once based on a large number of transitions). Batch methods, a prime example of which is the least-squares temporal difference method due to Bradtke and Barto (1996), may use the information in the samples better, whereas incremental methods are the only choice when batch methods become infeasible due to their high computational or memory complexity. In addition, there exist methods that try to unify the advantages of the two approaches. Methods based on temporal differences also overcome the second but last issue.

Function approximation methods

In order to address the last issue mentioned in the previous section, function approximation methods are used. In linear function approximation, the action value of a state-action pair (s, a) is written as a linear combination of a set of features φ_i(s, a) with weights θ_i:

Q(s, a) = \sum_{i=1}^{d} \theta_i \phi_i(s, a)

The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. However, linear function approximation is not the only choice. More recently, methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored.

So far, the discussion was restricted to how policy iteration can be used as a basis for designing reinforcement learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the Q-learning algorithm (Watkins 1989) and its many variants. The problem with methods that use action values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy.
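A minimal tabular Q-learning sketch in the spirit of the algorithm named above (Watkins 1989); the toy chain environment, learning rate, discount factor and ε are all assumptions made for illustration, not details from the article.

# Tabular Q-learning on a toy 5-state chain; reaching state 4 gives reward 1.
import random

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
random.seed(0)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = random.randrange(n_actions) if random.random() < epsilon \
            else max(range(n_actions), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([[round(v, 2) for v in row] for row in Q])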
In direct policy search, one considers a parameterized family of policies π_θ together with the associated performance function

ρ(θ) = ρ^{π_θ}.

Under mild conditions this function will be differentiable as a function of the parameter vector θ. If the gradient of ρ were known, one could use gradient ascent; in practice only a noisy estimate of the gradient is available.
A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy search or methods of evolutionary computation. Many gradient-free methods can achieve (in theory and in the limit) a global optimum. In a number of cases they have indeed demonstrated remarkable performance.

The issue with policy search methods is that they may converge slowly if the information based on which they act is noisy. For example, this happens when, in episodic problems, the trajectories are long and the variance of the returns is large. As argued beforehand, value-function based methods that rely on temporal differences might help in this case. In recent years, several actor-critic algorithms have been proposed following this idea and were demonstrated to perform well in various problems.

Reinforcement learning algorithms such as TD learning are also being investigated as a model for dopamine-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has also been used as part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995–1996, and there have been many follow-up studies). See http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism for further details of these research areas.
7.4 Theory
The theory for small, nite MDPs is quite mature. Both
the asymptotic and nite-sample behavior of most algorithms is well-understood. As mentioned beforehand,
algorithms with provably good online performance (addressing the exploration issue) are known. The theory of large MDPs needs more work. Ecient exploration is largely untouched (except for the case of bandit problems). Although nite-time performance bounds
appeared for many algorithms in the recent years, these
bounds are expected to be rather loose and thus more
work is needed to better understand the relative advantages, as well as the limitations of these algorithms.
For incremental algorithm asymptotic convergence issues
have been settled. Recently, new incremental, temporaldierence-based algorithms have appeared which converge under a much wider set of conditions than was previously possible (for example, when used with arbitrary,
smooth function approximation).
52
inforcement Learning (ADPRL) and the biannual European Workshop on Reinforcement Learning (EWRL) are
two regularly held meetings where RL researchers meet.
7.8 Implementations
RL-Glue provides a standard interface that allows
you to connect agents, environments, and experiment programs together, even if they are written in
different languages.
Maja Machine Learning Framework The Maja Machine Learning Framework (MMLF) is a general
framework for problems in the domain of Reinforcement Learning (RL) written in python.
Software Tools for Reinforcement Learning (Matlab
and Python)
PyBrain (Python)
TeachingBox is a Java reinforcement learning
framework supporting many features like RBF networks, gradient descent learning methods, ...
C++ and Python implementations for some well
known reinforcement learning algorithms with
source.
Orange, a free data mining software suite, module
orngReinforcement
Policy Gradient Toolbox provides a package for
learning about policy gradient approaches.
BURLAP is an open source Java library that provides a wide range of single and multi-agent learning
and planning methods.
7.9 References
Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement Learning (PhD thesis). University of Massachusetts, Amherst, MA.
Williams, Ronald J. (1987). "A class of gradient-estimating algorithms for reinforcement learning in neural networks". Proceedings of the IEEE First International Conference on Neural Networks.
Sutton, Richard S. (1988). "Learning to predict by the method of temporal differences". Machine Learning (Springer) 3: 9–44. doi:10.1007/BF00115009.
Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). King's College, Cambridge, UK.
Bradtke, Steven J.; Andrew G. Barto (1996). "Linear least-squares algorithms for temporal difference learning". Machine Learning (Springer) 22: 33–57. doi:10.1023/A:1018056104778.
Bertsekas, Dimitri P.; John Tsitsiklis (1996).
Neuro-Dynamic Programming.
Nashua, NH:
Athena Scientific. ISBN 1-886529-10-8.
Kaelbling, Leslie P.; Michael L. Littman; Andrew W. Moore (1996). "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research 4: 237–285.
Sutton, Richard S.; Barto, Andrew G. (1998).
Reinforcement Learning: An Introduction. MIT
Press. ISBN 0-262-19398-1.
Peters, Jan; Sethu Vijayakumar; Stefan Schaal
(2003). Reinforcement Learning for Humanoid
Robotics (PDF). IEEE-RAS International Conference on Humanoid Robots.
Powell, Warren (2007). Approximate dynamic programming: solving the curses of dimensionality.
Wiley-Interscience. ISBN 0-470-17155-3.
Auer, Peter; Thomas Jaksch; Ronald Ortner (2010). "Near-optimal regret bounds for reinforcement learning". Journal of Machine Learning Research 11: 1563–1600.
Szita, Istvan; Csaba Szepesvári (2010). "Model-based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds" (PDF). ICML 2010. Omnipress. pp. 1031–1038.
The Reinforcement Learning Toolbox from the Graz University of Technology
Hybrid reinforcement learning
Piqle: a Generic Java Platform for Reinforcement
Learning
A Short Introduction To Some Reinforcement
Learning Algorithms
Reinforcement Learning applied to Tic-Tac-Toe
Game
Scholarpedia Reinforcement Learning
Scholarpedia Temporal Difference Learning
Stanford Reinforcement Learning Course
Real-world reinforcement learning experiments at
Delft University of Technology
Reinforcement Learning Tools for Matlab
Stanford University Andrew Ng Lecture on Reinforcement Learning
Chapter 8
Structured prediction
Structured prediction or structured (output) learning
is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather
than scalar discrete or real values.[1]
For example, the problem of translating a natural language sentence into a syntactic representation such as a
parse tree can be seen as a structured prediction problem in which the structured output domain is the set of
all possible parse trees.
Probabilistic graphical models form a large class of structured prediction models. In particular, Bayesian networks and random fields are popularly used to solve structured
prediction problems in a wide variety of application domains including bioinformatics, natural language processing, speech recognition, and computer vision. Other algorithms and models for structured prediction include
inductive logic programming, structured SVMs, Markov
logic networks and constrained conditional models.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained
by means of observed data in which the true prediction
value is used to adjust model parameters. Due to the
complexity of the model and the interrelations of predicted variables the process of prediction using a trained
model and of training itself is often computationally infeasible and approximate inference and learning methods
are used.
For example, in part-of-speech tagging each word in a sentence is annotated with a tag indicating its word class, as in the following tagged sentence:

This DT
is VBZ
a DT
tagged JJ
sentence NN
. .
The main challenge in this problem is to resolve ambiguity: the word "sentence" can also be a verb in English, and so can "tagged".
While this problem can be solved by simply performing classification of individual tokens, that approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong conditional dependence on the tag of the previous word. This fact can be exploited in a sequence model such as a hidden Markov model or conditional random field[2] that predicts the entire tag sequence for a sentence, rather than just individual tags, by means of the Viterbi algorithm.
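As a sketch of the kind of sequence model just described, the code below runs the Viterbi algorithm on a tiny hand-made HMM over two tags; the transition and emission probabilities are invented for illustration and are not drawn from the article.

# Viterbi decoding of the most likely tag sequence under a toy HMM.
tags = ["DT", "NN"]
start = {"DT": 0.7, "NN": 0.3}
trans = {"DT": {"DT": 0.2, "NN": 0.8}, "NN": {"DT": 0.4, "NN": 0.6}}
emit = {"DT": {"this": 0.6, "sentence": 0.05},
        "NN": {"this": 0.1, "sentence": 0.5}}

def viterbi(words):
    # V[t][tag] = (probability of best path ending in tag, that path)
    V = [{t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        V.append({t: max(((p * trans[prev][t] * emit[t].get(w, 1e-6), path + [t])
                          for prev, (p, path) in V[-1].items()), key=lambda x: x[0])
                  for t in tags})
    return max(V[-1].values(), key=lambda x: x[0])[1]

print(viterbi(["this", "sentence"]))   # expected: ['DT', 'NN']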
8.4 References
[1] Gökhan Bakır, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and SVN Vishwanathan (2007), Predicting Structured Data, MIT Press.

[2] Lafferty, J.; McCallum, A.; Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" (PDF). Proc. 18th International Conf. on Machine Learning. pp. 282–289.

[3] Collins, Michael (2002). "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms" (PDF). Proc. EMNLP 10.
Chapter 9
Feature learning
Feature learning or representation learning[1] is a set
of techniques that learn a transformation of raw data input to a representation that can be effectively exploited in
machine learning tasks.
Feature learning is motivated by the fact that machine
learning tasks such as classification often require input
that is mathematically and computationally convenient to
process. However, real-world data such as images, video,
and sensor measurement is usually complex, redundant,
and highly variable. Thus, it is necessary to discover useful features or representations from raw data. Traditional
hand-crafted features often require expensive human labor and often rely on expert knowledge. Also, they normally do not generalize well. This motivates the design
of efficient feature learning techniques.
The weights may be found by minimizing the average representation error (over the input data), together with an L1 regularization on the weights to enable sparsity (i.e., the representation of each data point has only a few nonzero weights).

Multilayer neural networks can be used to perform feature learning, since they learn a representation of their input at the hidden layer(s) which is subsequently used for classification or regression at the output layer.
9.2.1
K-means clustering
K-means clustering is an approach for vector quantization. In particular, given a set of n vectors, k-means clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster with the closest mean. The problem is computationally NP-hard, and suboptimal greedy algorithms have been developed for k-means clustering.

In feature learning, k-means clustering can be used to group an unlabeled set of inputs into k clusters, and then use the centroids of these clusters to produce features. These features can be produced in several ways. The simplest way is to add k binary features to each sample, where each feature j has value one if and only if the jth centroid learned by k-means is the closest to the sample under consideration.[3] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks[9]). Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.[10]
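The following sketch shows both feature constructions just described, using scikit-learn's KMeans (an assumed dependency): the k binary indicators of the closest centroid, and the centroid distances passed through a radial basis function. The data, k and the RBF width are arbitrary.

    # Sketch: turning k-means cluster centroids into features.
    # Assumes scikit-learn; data, k and gamma are arbitrary choices.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                 # unlabeled inputs

    k = 8
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # (a) k binary features: feature j is 1 iff centroid j is the closest one.
    closest = km.predict(X)                        # index of the nearest centroid
    onehot = np.eye(k)[closest]                    # shape (500, k)

    # (b) distances to all centroids, passed through a radial basis function.
    dists = km.transform(X)                        # shape (500, k), Euclidean distances
    gamma = 0.5
    rbf_features = np.exp(-gamma * dists ** 2)

    features = np.hstack([onehot, rbf_features])
    print(features.shape)                          # (500, 2 * k)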
In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.[3] K-means has also been shown to improve performance in the domain of NLP, specifically for named-entity recognition;[11] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).[8]
9.2.2 Principal component analysis
Principal component analysis (PCA) is often used for dimension reduction. Given an unlabeled set of n input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample mean from the data vector).
Equivalently, these singular vectors are the eigenvectors
corresponding to the p largest eigenvalues of the sample
covariance matrix of the input vectors. These p singular vectors are the feature vectors learned from the input
data, and they represent directions along which the data
has the largest variations.
PCA is a linear feature learning approach, since the p singular vectors are linear functions of the data matrix. The singular vectors can be generated via a simple algorithm with p iterations. In the ith iteration, the projection of the data matrix on the (i−1)th eigenvector is subtracted, and the ith singular vector is found as the right singular vector corresponding to the largest singular value of the residual data matrix.
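A small numpy-only sketch of this procedure: the data matrix is centered by subtracting the sample mean, and the top p right singular vectors are taken as the learned feature directions (computed here with a single full SVD rather than the iterative deflation described above). The data and p are arbitrary.

    # Sketch: PCA via singular value decomposition of the centered data matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 50))              # rows are input data vectors

    p = 5
    X_centered = X - X.mean(axis=0)             # subtract the sample mean

    # Right singular vectors of the centered data matrix = principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:p]                         # the p learned feature vectors

    # New p-dimensional representation of each data vector.
    X_reduced = X_centered @ components.T
    print(X_reduced.shape)                      # (300, 5)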
PCA has several limitations. First, it assumes that the directions with large variance are of most interest, which may not be the case in many applications. PCA only relies on orthogonal transformations of the original data, and it only exploits the first- and second-order moments of the data, which may not well characterize the distribution of the data. Furthermore, PCA can effectively reduce dimension only when the input data vectors are correlated (which results in a few dominant eigenvalues).
9.3.1 Restricted Boltzmann machine

In general, training an RBM by solving the above maximization problem tends to result in non-sparse representations. The sparse RBM,[19] a modification of the RBM, was proposed to enable sparse representations. The idea is to add a regularization term to the objective function of the data likelihood, which penalizes the deviation of the expected hidden variables from a small constant p.
9.3.2 Autoencoder

An autoencoder, consisting of an encoder and a decoder, is a paradigm for deep learning architectures. An example is provided by Hinton and Salakhutdinov,[18] where the encoder uses raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original input raw data as output. The encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture are trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, they are fed to upper layers as visible variables for training the corresponding RBM. The process can be repeated until some stopping criterion is satisfied.
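The sketch below illustrates the encoder/decoder idea in a much simplified form: instead of stacking RBMs and training greedily as described above, it trains a single-hidden-layer network (scikit-learn's MLPRegressor, an assumed dependency) to reconstruct its input by backpropagation, and then reads the hidden activations off as learned features. Sizes and data are arbitrary.

    # Sketch: a single-hidden-layer autoencoder trained to reconstruct its input.
    # This is a simplification of the stacked-RBM construction described above.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 30))                 # raw inputs

    ae = MLPRegressor(
        hidden_layer_sizes=(8,),                   # 8-dimensional learned representation
        activation="tanh",
        max_iter=2000,
        random_state=0,
    )
    ae.fit(X, X)                                   # target = input (reconstruction)

    # The hidden-layer activations are the learned features (the "encoder" output).
    W0, b0 = ae.coefs_[0], ae.intercepts_[0]
    features = np.tanh(X @ W0 + b0)
    print(features.shape)                          # (400, 8)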
9.5 References
[1] Y. Bengio; A. Courville; P. Vincent (2013). Representation Learning: A Review and New Perspectives. IEEE Trans. PAMI, special issue Learning Deep Architectures.
[2] Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). Maximum-Margin Matrix Factorization. NIPS.
[3] Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in unsupervised feature learning (PDF). Int'l Conf. on AI and Statistics (AISTATS).
[4] Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). Visual categorization with bags of keypoints (PDF). ECCV Workshop on Statistical Learning in Computer Vision.
[5] Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson Education International. pp. 145–146.
[6] Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo; Zisserman, Andrew (2009). Supervised Dictionary Learning. Advances in Neural Information Processing Systems.
[7] Percy Liang (2005). Semi-Supervised Learning for Natural Language (PDF) (M. Eng.). MIT. pp. 44–52.
[8] Joseph Turian; Lev Ratinov; Yoshua Bengio (2010). Word representations: a simple and general method for semi-supervised learning (PDF). Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
[9] Schwenker, Friedhelm; Kestler, Hans A.; Palm, Günther (2001). Three learning phases for radial-basis-function networks. Neural Networks 14: 439–458. doi:10.1016/s0893-6080(01)00027-2. CiteSeerX: 10.1.1.109.312.
[10] Coates, Adam; Ng, Andrew Y. (2012). Learning feature representations with k-means. In G. Montavon, G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer.
[11] Dekang Lin; Xiaoyun Wu (2009). Phrase clustering for discriminative learning (PDF). Proc. J. Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP. pp. 1030–1038.
[12] Roweis, Sam T; Saul, Lawrence K (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, New Series 290 (5500): 2323–2326. doi:10.1126/science.290.5500.2323.
[13] Saul, Lawrence K; Roweis, Sam T (2000). An Introduction to Locally Linear Embedding.
[14] Hyvärinen, Aapo; Oja, Erkki (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks (4): 411–430.
[15] Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y (2007). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems.
Chapter 10
Online machine learning
In the setting of supervised learning, or learning from examples, we are interested in learning a function f : X → Y, where X is thought of as a space of inputs and Y as a space of outputs, that predicts well on instances drawn from a joint probability distribution p(x, y) on X × Y. In this setting, we are given a loss function V : Y × Y → R, such that V(f(x), y) measures the difference between the predicted value f(x) and the true value y. The ideal goal is to select a function f ∈ H, where H is a space of functions called a hypothesis space, so as to minimize the expected risk:

I[f] = E[V(f(x), y)] = ∫ V(f(x), y) dp(x, y).

In reality, the learner never knows the true distribution p(x, y) over instances. Instead, the learner usually has access to a training set of examples (x_1, y_1), ..., (x_n, y_n) that are assumed to have been drawn i.i.d. from the true distribution p(x, y). A common paradigm in this situation is to estimate a function f through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines.

Ideally in online learning, the memory needed to store the function remains constant even with added data points, since the solution computed at one step is updated when a new data point becomes available, after which that data point can be discarded. For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used. In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than with batch learning techniques.

As in all machine learning problems, the goal of the algorithm is to minimize some performance criterion using a loss function. For example, in stock market prediction the algorithm may attempt to minimize the mean squared error between the predicted and true value of a stock. Another popular performance criterion is to minimize the number of mistakes when dealing with classification problems. In addition to applications of a sequential nature, online learning algorithms are also relevant in applications with huge amounts of data, where traditional learning approaches that use the entire data set in aggregate are computationally infeasible.

The batch paradigm described above is not well suited to the online learning setting, as it requires complete a priori knowledge of the entire training set. In the pure online learning approach, the learning algorithm should update a sequence of functions f_1, f_2, ... in such a way that the function f_{t+1} depends only on the previous function f_t and the next data point (x_t, y_t). This approach has low memory requirements in the sense that it only requires storage of a representation of the current function f_t and the next data point (x_t, y_t). A related approach with larger memory requirements allows f_{t+1} to depend on f_t and all previous data points (x_1, y_1), ..., (x_t, y_t). We focus solely on the former approach here, and we consider both the case where the data comes from an infinite stream and the case of a finite but large training set processed in one or more passes.
10.1.1
A prototypical online learning algorithm maintains a linear model, parametrized by a vector of weights w_t, and upon receiving the next data point (x_t, y_t) performs the gradient update

w_{t+1} ← w_t − γ_t ∇V(w_t, x_t, y_t),

where w_1 ← 0, ∇V(w_t, x_t, y_t) is the gradient of the loss for the next data point (x_t, y_t) evaluated at the current linear functional w_t, and γ_t > 0 is a step-size parameter. In the case of an infinite stream of data, one can run this iteration, in principle, forever, and in the case of a finite but large set of data, one can consider a single pass or multiple passes (epochs) through the data.

Interestingly enough, the above simple iterative online learning algorithm has three distinct interpretations, each of which has distinct implications about the predictive quality of the sequence of functions w_1, w_2, .... The first interpretation considers the above iteration as an instance of the stochastic gradient descent method applied to the problem of minimizing the expected risk I[w] defined above.[1] Indeed, in the case of an infinite stream of data, since the examples (x_1, y_1), (x_2, y_2), ... are assumed to be drawn i.i.d. from the distribution p(x, y), the sequence of gradients of V(·, ·) in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk I[w], and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation I[w_t] − I[w*], where w* is the minimizer of I[w].[2] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The third interpretation of the above recursion is distinctly different from the first two and concerns the case of sequential trials discussed above, where the data are potentially not i.i.d. and can perhaps be selected in an adversarial manner. At each step of this process, the learner is given an input x_t and makes a prediction based on the current linear function w_t. Only after making this prediction does the learner see the true label y_t, at which point the learner is allowed to update w_t to w_{t+1}. Since we are not making any distributional assumptions about the data, the goal here is to perform as well as if we could view the entire sequence of examples ahead of time; that is, we would like the sequence of functions w_1, w_2, ... to have low regret relative to any fixed vector w*:

R_T(w*) = Σ_{t=1}^{T} V(w_t, x_t, y_t) − Σ_{t=1}^{T} V(w*, x_t, y_t).
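As a concrete illustration of the recursion, the sketch below instantiates it for the squared loss V(w, x, y) = (w·x − y)^2 on a synthetic data stream; the stream, its length, and the decaying step size are arbitrary choices made only for this example.

    # Sketch: online gradient descent w_{t+1} = w_t - gamma_t * grad V(w_t, x_t, y_t)
    # for the squared loss on a simulated data stream.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    w_true = rng.normal(size=d)                 # used only to generate the stream

    w = np.zeros(d)                             # w_1 = 0
    losses = []
    for t in range(1, 2001):                    # one pass over a simulated stream
        x = rng.normal(size=d)                  # next data point (x_t, y_t)
        y = w_true @ x + 0.1 * rng.normal()
        pred = w @ x
        losses.append((pred - y) ** 2)          # loss suffered before seeing y_t
        grad = 2.0 * (pred - y) * x             # gradient of the squared loss at w_t
        step = 1.0 / np.sqrt(t)                 # decaying step size gamma_t
        w = w - step * grad

    print("average loss over the stream:", np.mean(losses))
    print("distance to the generating vector:", np.linalg.norm(w - w_true))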
10.2.1 Batch Learning

10.2.2 Online Learning
The recursive least squares algorithm considers an online approach to the least squares problem. It can be shown that, for suitable initializations of w_0 ∈ R^d and Γ_0 ∈ R^{d×d}, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:

Γ_i = Γ_{i−1} − ( Γ_{i−1} x_i x_i^T Γ_{i−1} ) / ( 1 + x_i^T Γ_{i−1} x_i ),
w_i = w_{i−1} − Γ_i x_i (x_i^T w_{i−1} − y_i).
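A numpy sketch of this iteration, with the (arbitrary) initialization w_0 = 0 and Γ_0 = I on synthetic data; with that initialization the recursion should reproduce the batch solution of the corresponding regularized least squares problem, which the last line checks.

    # Sketch: recursive least squares, checked against the batch solution.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 4, 500
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

    w = np.zeros(d)                                       # w_0
    Gamma = np.eye(d)                                     # Gamma_0
    for x_i, y_i in zip(X, y):
        Gx = Gamma @ x_i
        Gamma = Gamma - np.outer(Gx, Gx) / (1.0 + x_i @ Gx)   # Gamma_i update
        w = w - Gamma @ x_i * (x_i @ w - y_i)                 # w_i update

    # Gamma_0 = I corresponds to a ridge penalty with unit weight.
    w_batch = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
    print(np.allclose(w, w_batch))                        # should print True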
10.5 References
[1] Bottou, Léon (1998). Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
[2] Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.
[3] Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
[4] Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Chapter 11
Semi-supervised learning
Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

As in the supervised learning framework, we are given a set of l independently identically distributed examples x_1, ..., x_l ∈ X with corresponding labels y_1, ..., y_l ∈ Y. Additionally, we are given u unlabeled examples x_{l+1}, ..., x_{l+u} ∈ X. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.

Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, ..., x_{l+u} only. The goal of inductive learning is to infer the correct mapping from X to Y.
In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of data. Semi-supervised learning algorithms make use of at least one of the following assumptions.[1]

11.1.1 Smoothness assumption

Points which are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that there are fewer points close to each other but in different classes.

11.1.2 Cluster assumption

11.1.3 Manifold assumption

11.2 History

The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s.[4] Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.[5]

Semi-supervised learning has recently become more popular and practically relevant due to the variety of problems for which vast quantities of unlabeled data are available, e.g. text on websites, protein sequences, or images. For a review of recent work see the survey article by Zhu (2008).[6]

11.3 Methods

11.3.1

11.3.2 Low-density separation
Another major class of methods attempts to place boundaries in regions where there are few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM. It selects a function f(x) = h(x) + b, with h drawn from a reproducing kernel Hilbert space H, so as to minimize a regularized hinge loss over both the labeled and the unlabeled data:

f* = argmin_f [ Σ_{i=1}^{l} (1 − y_i f(x_i))_+ + λ_1 ||h||_H^2 + λ_2 Σ_{i=l+1}^{l+u} (1 − |f(x_i)|)_+ ]

An exact solution is intractable due to the non-convex term (1 − |f(x)|)_+, so research has focused on finding useful approximations.[8]

Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).

11.3.3 Graph-based methods

Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its k nearest neighbors or to examples within some distance ε. The weight W_ij of an edge between x_i and x_j is then set to e^{−||x_i − x_j||^2 / ε}.

Within the framework of manifold regularization, an intrinsic smoothness term is added to the standard (ambient) Tikhonov regularization problem:

argmin_{f∈H} [ (1/l) Σ_{i=1}^{l} V(f(x_i), y_i) + λ_A ||f||_H^2 + λ_I ∫_M ||∇_M f(x)||^2 dp(x) ]

where H is a reproducing kernel Hilbert space, M is the manifold on which the data lie, and the regularization parameters λ_A and λ_I control smoothness in the ambient and intrinsic spaces respectively.[8] The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian L = D − W, where D_ii = Σ_{j=1}^{l+u} W_ij, and f the vector [f(x_1), ..., f(x_{l+u})], we have

f^T L f = (1/2) Σ_{i,j=1}^{l+u} W_ij (f_i − f_j)^2,

so the intrinsic term can be approximated on the sampled points by a multiple of f^T L f.
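The sketch below builds such a graph on random data with Gaussian edge weights, forms the Laplacian L = D − W, and numerically checks the smoothness identity quoted above. The scale parameter is set by a common median heuristic; all data and constants are arbitrary.

    # Sketch: graph construction with Gaussian weights and the graph Laplacian.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                   # labeled + unlabeled points

    # Pairwise squared distances and weights W_ij = exp(-||xi - xj||^2 / eps).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    eps = np.median(sq_dists)                      # median heuristic for the scale
    W = np.exp(-sq_dists / eps)
    np.fill_diagonal(W, 0.0)                       # no self-loops

    D = np.diag(W.sum(axis=1))                     # degree matrix, D_ii = sum_j W_ij
    L = D - W                                      # graph Laplacian

    # Check the identity f^T L f = 1/2 * sum_ij W_ij (f_i - f_j)^2.
    f = rng.normal(size=X.shape[0])
    lhs = f @ L @ f
    rhs = 0.5 * (W * (f[:, None] - f[None, :]) ** 2).sum()
    print(np.isclose(lhs, rhs))                    # True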
11.3.4 Heuristic approaches

Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples x_1, ..., x_{l+u} may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples.

Self-training is a wrapper method for semi-supervised learning. First a supervised learning algorithm is used to select a classifier based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for another supervised learning problem. Generally only the labels the classifier is most confident of are added at each step.

Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.
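A small sketch of the self-training wrapper described above, using scikit-learn's LogisticRegression as the base supervised learner (an assumed choice, as are the 0.95 confidence threshold and the synthetic two-blob data).

    # Sketch: self-training as a wrapper around a supervised classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Two Gaussian blobs; only ten points keep their labels (five per class).
    X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
    y_true = np.array([0] * 100 + [1] * 100)
    labeled = np.concatenate([np.arange(5), 100 + np.arange(5)])
    X_lab, y_lab = X[labeled], y_true[labeled]
    X_unlab = np.delete(X, labeled, axis=0)

    threshold = 0.95
    for _ in range(5):                                   # a few self-training rounds
        clf = LogisticRegression().fit(X_lab, y_lab)     # train on labeled data only
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold       # keep only confident labels
        if not confident.any():
            break
        new_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~confident]

    print("final training-set size:", len(X_lab))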
11.4 In human cognition

Human learning appears to combine a small amount of direct instruction (e.g. parental labeling of objects during childhood) with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback). Infants and children take into account not only the unlabeled examples available, but the sampling process from which labeled examples arise.[13][14]
11.6 References
[1] Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander (2006). Semi-supervised learning. Cambridge, Mass.: MIT Press. ISBN 978-0-262-03358-9.
[2] Stevens, K. N. (2000), Acoustic Phonetics, MIT Press, ISBN 0-262-69250-3, 978-0-262-69250-2.
[3] Scudder, H. J. Probability of Error of Some Adaptive Pattern-Recognition Machines. IEEE Transactions on Information Theory, 11:363–371 (1965). Cited in Chapelle et al. 2006, page 3.
[4] Vapnik, V. and Chervonenkis, A. Theory of Pattern Recognition [in Russian]. Nauka, Moscow (1974). Cited in Chapelle et al. 2006, page 3.
[5] Ratsaby, J. and Venkatesh, S. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 412–417 (1995). Cited in Chapelle et al. 2006, page 4.
[6] Zhu, Xiaojin. Semi-supervised learning literature survey. Computer Sciences, University of Wisconsin-Madison (2008).
[7] Cozman, F. and Cohen, I. Risks of semi-supervised learning: how unlabeled data can degrade performance of generative classifiers. In: Chapelle et al. (2006).
[8] Zhu, Xiaojin. Semi-Supervised Learning. University of Wisconsin-Madison.
[9] M. Belkin, P. Niyogi. Semi-Supervised Learning on Riemannian Manifolds. Machine Learning, 56, Special Issue on Clustering, 209–239, 2004.
[10] M. Belkin, P. Niyogi, V. Sindhwani. On Manifold Regularization. AISTATS 2005.
[11] Zhu, Xiaojin; Goldberg, Andrew B. (2009). Introduction to semi-supervised learning. Morgan & Claypool. ISBN 9781598295481.
[12] Younger, B. A. and Fearing, D. D. (1999), Parsing Items into Separate Categories: Developmental Change in Infant Categorization. Child Development, 70: 291–303.
[13] Xu, F. and Tenenbaum, J. B. (2007), Sensitivity to sampling in Bayesian word learning. Developmental Science, 10: 288–297.
[14] Gweon, H., Tenenbaum, J. B., and Schulz, L. E. (2010), Infants consider both the sample and the sampling process in inductive generalization. Proc Natl Acad Sci U S A, 107(20): 9066–71.
Chapter 12
Grammar induction
Grammar induction, also known as grammatical inference or syntactic pattern recognition, refers to the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or, alternatively, as a finite state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is the branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs. There is now a rich literature on learning different types of grammar and automata, under various different learning models and using various different methodologies.
12.3 Methodologies

There are a wide variety of methods for grammatical inference. Two of the classic sources are Fu (1977) and Fu (1982). Duda, Hart & Stork (2001) also devote a brief section to the problem, and cite a number of references. The basic trial-and-error method they present is discussed below. For approaches to infer subclasses of regular languages in particular, see Induction of regular languages. A more recent textbook is de la Higuera (2010),[1] which covers the theory of grammatical inference of regular languages and finite state automata. D'Ulizia, Ferri and Grifoni[2] provide a survey that explores grammatical inference methods for natural languages.
Early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach.

Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the Lisp code. Similar analogues between the tree-structured Lisp representation and the representation of grammars as trees made the application of genetic programming techniques possible for grammar induction.

In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol of a production rule corresponds to a leaf node of the tree. Its parent nodes correspond to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal.

12.3.3 Grammatical inference by greedy algorithms

Like all greedy algorithms, greedy grammar inference algorithms make, in an iterative manner, decisions that seem to be the best at that stage. The decisions made usually deal with things like the creation of new rules, the removal of existing rules, the choice of the rule to apply, or the merging of some existing rules. Because there are several ways to define 'the stage' and 'the best', there are also several greedy grammar inference algorithms.

12.3.4 Distributional Learning

A more recent approach is based on distributional learning. Algorithms using these approaches have been applied to learning context-free grammars and mildly context-sensitive languages, and have been proven to be correct and efficient for large subclasses of these grammars.[3]

12.3.5 Learning of Pattern languages

Angluin defines a pattern to be a string of constant symbols from Σ and variable symbols from a disjoint set. The language of such a pattern is the set of all of its nonempty ground instances, i.e. all strings obtained by consistently replacing each variable symbol with a nonempty string of constant symbols. Erlebach et al. give a more efficient version of Angluin's pattern learning algorithm, as well as a parallelized version.[5]
Among the aims of pattern theory are to:

• Create the basic classes of stochastic models applied by listing the deformations of the patterns.
• Synthesize (sample) from the models, not just analyze signals with it.

Broad in its mathematical coverage, Pattern Theory spans algebra and statistics, as well as local topological and global entropic properties.
12.4 Applications
The principle of grammar induction has been applied to other aspects of natural language processing, and has been applied (among many other problems) to morpheme analysis and even place-name derivations. Grammar induction has also been used for lossless data compression and statistical inference via MML and MDL principles.
12.6 Notes
[1] The language of a pattern with at least two occurrences
of the same variable is not regular due to the pumping
lemma.
[2] x may occur several times, but no other variable y may
occur
12.7 References
[1] de la Higuera, Colin (2010). Grammatical Inference: Learning Automata and Grammars. Cambridge: Cambridge University Press.
[2] D'Ulizia, A., Ferri, F., Grifoni, P. (2011). A Survey of Grammatical Inference Methods for Natural Language Learning. Artificial Intelligence Review, Vol. 36, No. 1, pp. 1–27.
[3] Clark and Eyraud (2007), Journal of Machine Learning Research; Ryo Yoshinaka (2011), Theoretical Computer Science.