ML - Co4 Enotes
Motivation: [https://ieeexplore.ieee.org/document/149926]
• It is notoriously difficult to obtain rules directly from human experts for prediction, diagnosis,
simulation, etc. – this is the Knowledge Acquisition Bottleneck.
• Hence, if an existing database of sample data from the domain is available, a rule induction
algorithm is very useful.
• This is known as a “data reduction process”: we want to reduce a large database of
information to a small number of rules describing the domain.
Introduction
Learning sets of rules from data is a useful way to learn a target concept, and rule sets are among the
most expressive and human-readable representations for learned hypotheses. A rule-based system
(containing an algorithm for inducing rules from examples) is a special type of expert system, which
typically consists of a set of if–then rules. Rule-based systems are used in both knowledge discovery
tasks and predictive modeling tasks.
• Rule: (Condition) → y
• where Condition is a conjunction of attribute tests (the rule antecedent) and y is the class label (the rule consequent)
Distinct instances, syntactically distinct hypotheses, and semantically distinct hypotheses for the
considered dataset
• You add two possible values for each feature, 1. the wild card (?) and 2. the null value (Φ), so the
number of syntactically distinct hypotheses = 17 × 4 × 8 × 4 × 5 × 4 × 4 × 4 = 6,96,320 (syntactically
distinct hypotheses)
• If each feature allows only its values plus the wild card, and one extra hypothesis stands for the
empty set (Φ), you get the number of semantically distinct hypotheses, i.e., 1 + (16 × 3 × 7 × 3 × 4 × 3 × 3 × 3) =
1,08,865 (semantically distinct hypotheses); both counts are verified in the sketch below
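A minimal sketch to verify these counts, assuming eight attributes with 15, 2, 6, 2, 3, 2, 2, 2 possible values respectively (the attribute cardinalities are inferred from the factors above, not stated in the source):

from math import prod

# assumed attribute cardinalities, inferred from the factors 17 x 4 x 8 x 4 x 5 x 4 x 4 x 4
values = [15, 2, 6, 2, 3, 2, 2, 2]

# syntactically distinct: every attribute also allows "?" (wild card) and "Φ" (null)
syntactic = prod(v + 2 for v in values)

# semantically distinct: every attribute allows only "?" in addition to its values,
# plus one hypothesis representing the empty set (any Φ)
semantic = 1 + prod(v + 1 for v in values)

print(syntactic)   # 696320
print(semantic)    # 108865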
This is a very large hypothesis space. The solution is to take advantage of the naturally occurring
(general-to-specific) structure over the hypothesis space in order to search for a hypothesis that matches
the target concept.
CN2 Algorithm:
• Determine the rule consequent by taking the majority class of the instances covered by the
rule.
• The CN2 algorithm is a greedy, depth-first search for hypotheses (set of rules) with no
backtracking.
• With greedy search, there is a danger of selecting a suboptimal rule at any step.
• On each search step, descendants (specializations) are generated for each of these ‘k’
best candidates, and the resulting set is again reduced to the ‘k’ most promising
members.
• Beam search keeps track of the most promising alternatives to the current top-rated
hypothesis, so that all of their successors can be considered at each search step.
• At each step, the preconditions of the best rule are specialized in all possible ways (a minimal sketch of this beam search is given below).
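A minimal sketch of this general-to-specific beam search in the spirit of CN2, using the entropy of the covered examples as the rule-evaluation measure (the dataset format, attribute dictionary and function names are illustrative assumptions, not the full CN2 algorithm):

import math

def entropy(labels):
    """Entropy (in bits) of the class distribution of the covered examples."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def covers(rule, example):
    """A rule is a dict {attribute: required value}; the empty rule covers everything."""
    return all(example[a] == v for a, v in rule.items())

def learn_one_rule(examples, labels, attributes, k=3):
    """Greedy beam search (beam width k) from the most general rule, no backtracking."""
    beam = [dict()]                                  # start with the most general rule (empty body)
    best_rule, best_score = dict(), entropy(labels)
    while beam:
        candidates = []
        for rule in beam:
            # specialize each beam member in all possible ways by adding one attribute test
            for attr, vals in attributes.items():
                if attr in rule:
                    continue
                for v in vals:
                    new_rule = {**rule, attr: v}
                    covered = [y for x, y in zip(examples, labels) if covers(new_rule, x)]
                    if covered:
                        candidates.append((entropy(covered), new_rule))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])          # lower entropy = purer coverage
        if candidates[0][0] < best_score:
            best_score, best_rule = candidates[0]
        beam = [r for _, r in candidates[:k]]        # keep only the k most promising rules
    # determine the rule consequent: majority class of the instances covered by the rule
    covered = [y for x, y in zip(examples, labels) if covers(best_rule, x)]
    consequent = max(set(covered), key=covered.count) if covered else None
    return best_rule, consequent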
Entropy can also be used as the rule-evaluation measure (worked-example values: 0.863 bits and 0.721 bits).
• An important observation is that when both coverage and accuracy are taken for rule
evaluation, one rule is selected for further growing.
• When entropy is taken for rule evaluation, another rule is selected for rule growing.
• When the number-of-positive-examples heuristic is considered, a beam of width k is
maintained [in beam search].
Note: Heuristics can also be used to select the best node for further expansion.
[https://d-nb.info/1106118391/34]
Rule Coverage and Accuracy for determining the quality of the rule
• Coverage of a rule:
• Fraction of records that satisfy the antecedent (A) of a rule
Coverage(r) = |A| / |D|
• Accuracy of a rule:
• Fraction of records that satisfy both the antecedent (A) and consequent (y) of a rule
Accuracy(r) = |A ∩ y| / |A| (a toy illustration of both measures is given below)
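A toy illustration of coverage and accuracy on a hypothetical four-record dataset (the records and the rule "outlook = sunny → play" are made up for this example):

# D: the full set of records, each an (attributes, class label) pair
D = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True},  "no_play"),
    ({"outlook": "rain",  "windy": True},  "no_play"),
    ({"outlook": "sunny", "windy": False}, "play"),
]

antecedent = lambda rec: rec["outlook"] == "sunny"    # A: records satisfying the rule body
consequent = "play"                                   # y: the predicted class

A  = [(x, y) for x, y in D if antecedent(x)]          # records covered by the antecedent
Ay = [(x, y) for x, y in A if y == consequent]        # covered records with the right class

coverage = len(A) / len(D)      # |A| / |D|      -> 3/4 = 0.75
accuracy = len(Ay) / len(A)     # |A ∩ y| / |A|  -> 2/3 ≈ 0.67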
Rule Evaluation
The problem is that propositional representations offer no general way to describe the
essential relations among the values of the attributes.
FOIL Algorithm
TRAINING DATA
Evaluation of candidate literals for extending the rule (column interpretation assumed: p1, p0, n0, t, n1, FOIL gain):
Candidate literal   p1   p0   n0   t   n1   Gain
Female(x)            1    1   15   1    3   2
Female(y)            1    1   15   1    3   2
Father(x, y)         0    1   15   0    3   0
Father(y, x)         1    1   15   1    2   2.401
Father(x, z)         0    1   15   0    3   0
Father(z, x)         1    1   15   1    2   2.401
Father(y, z)         1    1   15   1    1   3
Father(z, y)         1    1   15   1    2   2.401
From this table, Father(y, z) has the highest gain and is added to the rule first. In this manner, the next
predicates that will be chosen to add to the existing rule are Father(z, x) and Female(y). At this point all
positive examples covered by the rule will be removed by the FOIL algorithm (a sketch of the gain
computation follows below).
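A minimal sketch of the FOIL gain computation assumed to produce the Gain column above (t = positive bindings of the current rule still covered after adding the literal; p0, n0 and p1, n1 = positive/negative bindings before and after):

import math

def foil_gain(t, p0, n0, p1, n1):
    """FOIL_Gain = t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))."""
    if p1 == 0:
        return 0.0                       # a literal covering no positive bindings gains nothing
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

print(foil_gain(t=1, p0=1, n0=15, p1=1, n1=3))   # Female(x):    2.0
print(foil_gain(t=1, p0=1, n0=15, p1=1, n1=1))   # Father(y, z): 3.0 (highest, chosen first)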
• The expression X ⊢ Y is read "Y follows deductively from X," or alternatively "X entails
Y."
Example
Inverting Resolution
The inverse entailment operator must derive C2 given the resolvent C and C1.
Say C = A ∨ B, and C1= B ∨ D. How do we derive C2 s.t. C1 ∧ C2 → C?
Find the literal L that appears in C1 but not in C, then form C2 from the
following literals:
C2 = (C - (C1 - {L})) ∪ {¬L}
so
C2 = A ∨ ¬D
C2 can also be A ∨ ¬D ∨ B.
In general, inverse resolution can produce multiple clauses C2.
https://jmvidal.cse.sc.edu/talks/learningrules/invertingresolution.html
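A small sketch of this propositional inverse-resolution step, representing clauses as sets of literal names with "~" marking negation (the representation and function names are illustrative assumptions):

def negate(lit):
    """Return the complementary literal: A <-> ~A."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def inverse_resolve(C, C1):
    """Given the resolvent C and one parent C1, return candidate clauses C2
    such that resolving C1 with C2 (on the literal L removed from C1) yields C."""
    candidates = []
    for L in C1 - C:                              # L appears in C1 but not in C
        C2 = (C - (C1 - {L})) | {negate(L)}       # C2 = (C - (C1 - {L})) ∪ {¬L}
        candidates.append(C2)
    return candidates

# Example from the notes: C = A ∨ B and C1 = B ∨ D give C2 = A ∨ ¬D
print(inverse_resolve(frozenset({"A", "B"}), frozenset({"B", "D"})))   # [frozenset({'A', '~D'})]

As noted above, other clauses C2 (e.g., A ∨ ¬D ∨ B) are also valid inverse resolvents; this sketch returns only the minimal one for each choice of L.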
Prolog-EBG
Prolog-EBG(TargetConcept, TrainingExamples, DomainTheory)
1. LearnedRules = {}
2. Pos = the positive examples from TrainingExamples
3. for each PositiveExample in Pos that is not covered by LearnedRules, do:
   a. Explain: construct an explanation (proof), in terms of the DomainTheory, that PositiveExample satisfies the TargetConcept
   b. Analyze: compute the most general set of features (the weakest preconditions) under which the explanation holds
   c. Refine: add to LearnedRules a new Horn clause of the form "TargetConcept ← WeakestPreconditions"
4. return LearnedRules
Give a proof, using the domain theory, that the (positive) training
example satisfies the target concept.
In our ongoing example the positive example SafeToStack(o1, o2)
can be explained using the domain theory as follows:
Which of the many features of the objects are relevant to the target
concept?
Those that appear in the explanation we just built.
We can collect these and substitute x,y for o1,o2 to get
SafeToStack(x, y)
Regressing the target concept back through each rule used in the explanation gives, step by step:
4. Rule used: Lighter(o1, o2) → SafeToStack(o1, o2)
   Regressed frontier: Lighter(x, y)
3. Rule used: Weight(o1, 0.6) ∧ LessThan(0.6, 5) ∧ Weight(o2, 5) → Lighter(o1, o2)
   Regressed frontier: Weight(x, wx), LessThan(wx, wy), Weight(y, wy)
2. Rule used: Type(o2, endtable) → Weight(o2, 5)
   Regressed frontier: Weight(x, wx), LessThan(wx, 5), Type(y, endtable)
1. Rule used: Volume(o1, 2) ∧ Density(o1, 0.3) ∧ Equal(0.6, 2*0.3) → Weight(o1, 0.6)
   Regressed frontier: Volume(x, vx), Density(x, dx), Equal(wx, vx*dx), LessThan(wx, 5), Type(y, endtable)
The final Horn clause has a body that corresponds to the weakest
preconditions, and the head is the target concept:
SafeToStack(x, y) ← Volume(x, vx) ∧ Density(x, dx) ∧ Equal(wx, vx*dx) ∧ LessThan(wx, 5) ∧ Type(y, endtable)
Supervised learning is a general method for training a parameterized function approximator, such as a
neural network, to represent functions. However, supervised learning requires sample input-output
pairs from the function to be learned. In other words, supervised learning requires a set of questions
with the right answers. For example, we might not know the best way to program a computer to
recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we
do know whether each picture contains a tank or not. Supervised learning could look at all the
examples with answers, and learn how to recognize tanks in general.
Unfortunately, there are many situations where we don’t know the correct answers that supervised
learning requires. For example, in a flight control system, the question would be the set of all sensor
readings at a given time, and the answer would be how the flight control surfaces should move during
the next millisecond. Simple neural networks can’t learn to fly the plane unless there is a set of known
answers, so if we don’t know how to build a controller in the first place, simple supervised learning
won’t help.
For these reasons there has been much interest recently in a different approach known as
reinforcement learning (RL). Reinforcement learning is not a type of neural network, nor is it an
alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more
difficult question. Reinforcement learning combines the fields of dynamic programming and
supervised learning to yield powerful machine-learning systems. Reinforcement learning appeals to
many researchers because of its generality. In RL, the computer is simply given a goal to achieve. The
computer then learns how to achieve that goal by trial-and-error interactions with its environment.
Thus, many researchers are pursuing this form of machine intelligence and are excited about the
possibility of solving problems that have been previously unsolvable.
To provide the intuition behind reinforcement learning consider the problem of learning to ride a
bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first
trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle
being tilted 45 degrees to the right. At this point there are two possible actions: turn the handle bars
left or turn them right. The RL system turns the handle bars to the left and immediately crashes to the
ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handle
bars left when tilted 45 degrees to the right. In the next trial the RL system performs a series of actions
that again result in the bicycle being tilted 45 degrees to the right. The RL system knows not to turn
the handle bars to the left, so it performs the only other possible action: turn right. It immediately
crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has
not only learned that turning the handle bars right or left when tilted 45 degrees to the right is bad,
but that the "state" of being tilted 45 degrees to the right is bad. Again, the RL system begins another
trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. Two
actions are possible: turn right or turn left. The RL system turns the handle bars left which results in
the bicycle being tilted 45 degrees to the right, and ultimately results in a strong negative
reinforcement. The RL system has just learned not to turn the handle bars to the left when tilted 40
degrees to the right. By performing enough of these trial-and-error interactions with the environment,
the RL system will ultimately learn how to prevent the bicycle from ever falling over.
http://www.cs.toronto.edu/~zemel/documents/411/rltutorial.pdf
The Environment
Every RL system learns a mapping from situations to actions by trial-and-error interactions with a
dynamic environment. This environment must at least be partially observable by the reinforcement
learning system, and the observations may come in the form of sensor readings, symbolic descriptions,
or possibly “mental” situations (e.g., the situation of being lost). The actions may be low level (e.g.,
voltage to motors), high level (e.g., accept job offer), or even “mental” (e.g., shift in focus of attention).
If the RL system can observe perfectly all the information in the environment that might influence the
choice of action to perform, then the RL system chooses actions based on true “states” of the
environment. This ideal case is the best possible basis for reinforcement learning and, in fact, is a
necessary condition for much of the associated theory.
It is the job of the RL system designer to define a reinforcement function that properly defines the
goals of the RL agent. Although complex reinforcement functions can be defined, there are at least
three noteworthy classes often used to construct reinforcement functions that properly define the
desired goals.
Q-Learning
Q-learning (Watkins, 1989 and 1992) is another extension to traditional dynamic programming (value
iteration) that solves the following problem.
A deterministic Markov decision process is one in which the state transitions are deterministic (an
action performed in state xt always transitions to the same successor state xt+1). Alternatively, in a
nondeterministic Markov decision process, a probability distribution function defines a set of potential
successor states for a given action in a given state. If the MDP is non-deterministic, then value
iteration requires that we find the action that returns the maximum expected value (the sum of the
reinforcement and the integral over all possible successor states for the given action). For example, to
find the expected value of the successor state associated with a given action, one must perform that
action an infinite number of times, taking the integral over the values of all possible successor states
for that action. The reason this is necessary is demonstrated in the figure below.
In the above figure there are two possible actions in state x. Each action returns a reinforcement of 0.
Action u1 causes a transition to one of two possible successor states with equal probability. The same
is true for action u2. The values of the successor states are 0 and 1 for both actions. Value iteration
requires that the value of state x be equal to the maximum over actions of the sum of reinforcement
and the expected value of the successor state. By taking an infinite number of samples of successor
states for action u1, one would be able to calculate that the actual expected value is 0.5. The same is
true for action u2. Therefore, the value of state x is 0.5. However, if one were to naively perform value
iteration on this MDP by taking a single sample of the successor state associated with each action
instead of the integral, then x would converge to a value of 0.75 (the maximum of two independent
samples, each equally likely to be 0 or 1, equals 1 with probability 3/4). This is clearly the wrong answer.
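A quick Monte-Carlo check of this claim (a hypothetical simulation, assuming each action's successor value is 0 or 1 with equal probability):

import random

N = 100_000
# naive value iteration: one sample of the successor value per action, then max over actions
naive = sum(max(random.choice([0, 1]), random.choice([0, 1])) for _ in range(N)) / N
# correct value iteration: max over actions of the *expected* successor value
true_value = max(0.5, 0.5)

print(true_value)        # 0.5
print(round(naive, 2))   # ~0.75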
Rather than finding a mapping from states to state values (as in value iteration), Q-learning finds a
mapping from state/action pairs to values (called Q-values). Instead of having an associated value
function, Q-learning makes use of the Q-function. In each state, there is a Q-value associated with each
action. The definition of a Q-value is the sum of the (possibly discounted) reinforcements received
when performing the associated action and then following the given policy thereafter. Likewise, the
definition of an optimal Q-value is the sum of the reinforcements received when performing the
associated action and then following the optimal policy thereafter.
In the context of Q-learning, the value of a state is defined to be the maximum Q-value in the given
state. Given this definition it is easy to derive the equivalent of the Bellman equation for Q-learning.
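In the usual notation (with discount factor γ, reinforcement r(xt, ut) and successor state xt+1), this Bellman equation for Q-learning can be written as:
Q(xt, ut) = r(xt, ut) + γ · max over u of Q(xt+1, u)
where the value of a state is V(x) = max over u of Q(x, u).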
Q-learning differs from value iteration in that it doesn’t require that in a given state each action be
performed and the expected values of the successor states be calculated. While value iteration
performs an update that is analogous to a one level breadth-first search, Q-learning takes a single-step
sample of a Monte-Carlo roll-out. This process is demonstrated in below figure.
The Q-value is a prediction of the sum of the reinforcements one will receive when performing the
associated action and then following the given policy. To update that prediction Q(xt,ut) one must
perform the associated action ut, causing a transition to the next state xt+1 and returning a scalar
reinforcement r(xt,ut). Then one need only find the maximum Q-value in the new state to have all the
necessary information for revising the prediction (Q-value) associated with the action just performed.
Q-learning does not require one to calculate the integral over all possible successor states in the case
that the state transitions are nondeterministic. The reason is that a single sample of a successor state
for a given action is an unbiased estimate of the expected value of the successor state. In other words,
after many updates the Q-value associated with a particular action will converge to the expected sum
of all reinforcements received when performing that action and following the optimal policy
thereafter.
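A minimal sketch of a tabular Q-learning update loop matching this description (the environment interface step(x, u) -> (reward, next_state, done), the state/action lists, and the hyper-parameters are illustrative assumptions):

import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Learn Q(state, action) by trial-and-error interaction with the environment."""
    Q = defaultdict(float)                       # Q-values, initialised to 0
    for _ in range(episodes):
        x = random.choice(states)                # start each episode in a random state
        done = False
        while not done:
            # epsilon-greedy action selection: mostly exploit, occasionally explore
            if random.random() < eps:
                u = random.choice(actions)
            else:
                u = max(actions, key=lambda a: Q[(x, a)])
            r, x_next, done = step(x, u)         # perform u, observe reinforcement and successor
            # value of the successor state = maximum Q-value in that state
            target = r if done else r + gamma * max(Q[(x_next, a)] for a in actions)
            # move the prediction Q(xt, ut) toward the sampled target; averaging over many
            # updates handles nondeterministic transitions (unbiased single-sample estimates)
            Q[(x, u)] += alpha * (target - Q[(x, u)])
            x = x_next
    return Q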