ML Unit-IV
• Fitness Function:
• The fitness of each hypothesized rule set is based on its classification accuracy
over the training data. In particular, the function used to measure fitness is
Fitness(h) = (correct(h))², where correct(h) is the percentage of training examples
correctly classified by hypothesis h.
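A minimal sketch of this fitness computation in Python, assuming a hypothetical classify(rule_set, example) helper:

```python
# Sketch (not the textbook's code): fitness of a rule set is the square of its
# classification accuracy on the training data.
def fitness(rule_set, training_examples, classify):
    correct = sum(1 for example, label in training_examples
                  if classify(rule_set, example) == label)
    accuracy = correct / len(training_examples)
    return accuracy ** 2
```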
Extensions
• There are two other operators that can be used:
• AddAlternative:
• This operator generalizes the constraint on a specific attribute by changing a single 0 to a 1 in the substring corresponding to that attribute, thereby adding another allowed value.
• DropCondition:
• This operator replaces all bits for a particular attribute by 1s, effectively dropping that attribute from the rule precondition.
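An illustrative sketch of these two operators acting on a rule's bit-string representation; the attribute's position is assumed to be given as an index range:

```python
import random

# Sketch of the two generalization operators on one attribute's substring
# [start, end) within the rule's bit string (a list of 0/1 values).
def add_alternative(bits, start, end):
    """Generalize the attribute constraint by changing a single 0 to a 1."""
    zero_positions = [i for i in range(start, end) if bits[i] == 0]
    if zero_positions:
        bits[random.choice(zero_positions)] = 1
    return bits

def drop_condition(bits, start, end):
    """Drop the attribute from the precondition by setting all of its bits to 1."""
    for i in range(start, end):
        bits[i] = 1
    return bits
```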
Hypothesis Space Search
• GAs employ a randomized beam search method to seek a maximally
fit hypothesis.
• Let us compare the hypothesis space search of GAs with the gradient
descent of backpropagation to understand the difference between the
learning algorithms.
• Gradient descent moves smoothly from one hypothesis to a new
hypothesis that is very similar.
• GA search moves abruptly replacing a parent hypothesis by an
offspring that may be radically different from the parent.
• GA search is less likely to fall into the same kind of local minima as
gradient descent.
Hypothesis Space Search
• One practical difficulty in some GA applications is the problem of
crowding.
• Crowding is a phenomenon in which an individual that is more highly fit
than others in the population quickly reproduces, so that copies of it and
very similar individuals take over a large fraction of the population.
• The negative impact of crowding is that it reduces the diversity of the
population, thereby slowing further progress by the GA.
• The following strategies can be used to overcome crowding:
• Alter the selection function, using criteria such as tournament selection or rank
selection.
• Another strategy is "fitness sharing," in which the measured fitness of an
individual is reduced by the presence of other, similar individuals in the
population (a small sketch follows this list).
• A third approach is to restrict the kinds of individuals allowed to recombine to
form offspring.
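As an illustration of the fitness-sharing strategy above, here is a minimal sketch in which an individual's measured fitness is divided by a niche count; `similarity` is an assumed helper returning a value in [0, 1]:

```python
# Sketch of fitness sharing: the more similar individuals there are in the
# population, the larger the niche count and the smaller the shared fitness.
def shared_fitness(individual, population, raw_fitness, similarity):
    niche_count = sum(similarity(individual, other) for other in population)
    return raw_fitness(individual) / max(niche_count, 1e-9)
```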
Population Evolution and the Schema
Theorem
• An interesting question is whether we can mathematically
characterize the evolution over time of the population within a GA.
• The schema theorem, which is based on the concept of schemas that
describe sets of bit strings, is one such characterization.
• A schema is any string composed of 0s, 1s, and *s, where * is
interpreted as "don't care."
• The schema theorem characterizes the evolution of the population
within a GA in terms of the number of instances representing each
schema.
Population Evolution and the Schema
Theorem
• Let m(s,t) denote the number of instances of schema s in the
population at time t.
• The schema theorem describes the expected value of m(s,t+1) in
terms of m(s,t) and other properties of the schema, population, and
GA algorithm parameters.
• The evolution of the population in the GA depends on the selection
step, the crossover step, and the mutation step.
• Let us start by considering just the effect of the selection step.
Population Evolution and the Schema
Theorem
• Let f(h) denote the fitness of the individual bit string h.
• Let f̄(t) denote the average fitness of all individuals in the population at
time t.
• Let n be the total number of individuals in the population.
• Let h ∈ s ∩ pt indicate that the individual h is both a representative of
schema s and a member of the population at time t.
• Let û(s, t) denote the average fitness of the instances of schema s in the
population at time t.
Population Evolution and the Schema
Theorem
• We are interested in calculating the expected value of m(s, t + 1),
which we denote E[m(s, t + 1)].
• Under fitness-proportional selection, the probability of selecting hypothesis h is
Pr(h) = f(h) / (n f̄(t)), so the probability that a single hypothesis selected by the
GA will be an instance of schema s is:
Pr(h ∈ s) = Σ over h ∈ s ∩ pt of f(h) / (n f̄(t)) = û(s, t) m(s, t) / (n f̄(t))
• The expected number of instances of s resulting from the n independent
selection steps that create the entire new generation is just n times this
probability:
E[m(s, t + 1)] = (û(s, t) / f̄(t)) m(s, t)
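A small worked example of this selection-only growth equation, with illustrative values:

```python
# E[m(s, t+1)] = (u_hat(s, t) / f_bar(t)) * m(s, t)
m_st = 10      # instances of schema s in the current population
u_hat = 1.2    # average fitness of those instances
f_bar = 1.0    # average fitness of the whole population
expected_next = (u_hat / f_bar) * m_st
print(expected_next)   # 12.0 -- above-average schemas are expected to grow
```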
Population Evolution and the Schema
Theorem
• While the above analysis considered only the selection step of the GA,
the crossover and mutation steps must be considered as well.
• The full schema theorem thus provides a lower bound on the expected
frequency of schema s, as follows:
E[m(s, t + 1)] ≥ (û(s, t) / f̄(t)) m(s, t) (1 − pc d(s)/(l − 1)) (1 − pm)^o(s)
where pc is the probability of single-point crossover, pm is the per-bit mutation
probability, l is the length of the bit strings, o(s) is the number of defined
(non-*) bits in s, and d(s) is the defining length of s (the distance between its
leftmost and rightmost defined bits).
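A sketch of this bound as a Python function, with parameters named as defined above:

```python
# Lower bound on E[m(s, t+1)] under fitness-proportional selection,
# single-point crossover (probability p_c) and per-bit mutation (probability p_m).
def schema_lower_bound(m_st, u_hat, f_bar, p_c, p_m, d_s, o_s, l):
    selection_growth = (u_hat / f_bar) * m_st
    crossover_survival = 1 - p_c * d_s / (l - 1)
    mutation_survival = (1 - p_m) ** o_s
    return selection_growth * crossover_survival * mutation_survival
```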
• When variables occur only in the rule preconditions, they are assumed
to be existentially quantified.
Terminology
• In first order logic all expressions are composed of constants (Ravi,
Raja), variables (x,y), predicate symbols (Married, Greater-Than), and
functions (age).
• The difference between predicates and functions is that predicates
take on only the values True or False, whereas functions may take on any
constant value.
Learning Sets of First Order Rules: FOIL
• The most important differences between FOIL and our earlier
SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms are as follows:
• In its general-to-specific search to learn each new rule, FOIL employs different
detailed steps to generate candidate specializations of the rule. This
difference follows from the need to accommodate variables in the rule
preconditions.
• FOIL employs a performance measure, Foil_Gain, that differs from the
entropy measure used by LEARN-ONE-RULE. This difference follows from
the need to distinguish between different bindings of the rule variables and
from the fact that FOIL seeks only rules that cover positive examples.
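A sketch of the usual Foil_Gain computation, assuming the standard definition in terms of positive/negative bindings before and after adding a candidate literal L:

```python
from math import log2

# p0, n0: positive/negative bindings of rule R; p1, n1: those of R with L added;
# t: number of positive bindings of R that are still covered after adding L.
def foil_gain(p0, n0, p1, n1, t):
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
```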
Generating Candidate Specializations in FOIL
• Let us assume the current rule being considered is:
P(x1, x2, ..., xk) ← L1, L2, ..., Ln
where L1, L2, ..., Ln are literals forming the current rule preconditions
and where P(x1, x2, ..., xk) is the literal that forms the rule head, or
postconditions.
Generating Candidate Specializations in FOIL
• FOIL generates candidate specializations of this rule by considering
new literals Ln+1 that fit one of the following forms:
• Q(v1, . . . , vn), where Q is any predicate name occurring in Predicates and
where the vi are either new variables or variables already present in the rule.
At least one of the vi in the created literal must already exist as a variable in
the rule.
• Equal(xj, xk), where xj and xk are variables already present in the rule.
• The negation of either of the above forms of literals.
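An illustrative sketch of generating the first form of candidate literal, Q(v1, ..., vn), enforcing that at least one argument is a variable already in the rule (Equal literals and type handling are omitted; names are hypothetical):

```python
from itertools import product

# predicates: list of (name, arity) pairs; rule_vars: variables already in the rule.
def candidate_literals(predicates, rule_vars, n_new=1):
    new_vars = [f"z{i}" for i in range(n_new)]       # fresh variable names
    pool = list(rule_vars) + new_vars
    for name, arity in predicates:
        for args in product(pool, repeat=arity):
            if any(a in rule_vars for a in args):    # must reuse an existing variable
                yield (name, args)
                yield ("not " + name, args)          # the negated form of the literal
```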
Generating Candidate Specializations in FOIL
• Let us look at an example to predict the target literal GrandDaughter(x,y)
where the other predicates used to describe examples are Father and
Female.
• The general-to-specific search in FOIL begins with the most general rule:
GrandDaughter(x, y) ←
• The candidate literals generated as additions to the rule precondition are:
Equal(x, y), Female(x), Female(y), Father(x, y), Father(y, x), Father(x, z),
Father(z, x), Father(y, z), Father(z, y), and the negations of each of these
literals (e.g., ¬Equal(x, y)).
Generating Candidate Specializations in FOIL
• Now suppose that among the above literals FOIL greedily selects
Father(y, z) as the most promising, leading to the more specific rule:
GrandDaughter(x, y) ← Father(y, z)
• A new variable z is added here which is to be considered while
selecting the next literal.
• The final rule will be:
GrandDaughter(x, y) ← Father(y, z) ∧ Father(z, x) ∧ Female(y)
Guiding the Search in FOIL
• To select the most promising literal from the candidates generated at each
step, FOIL considers the performance of the rule over the training data.
• Let us look at the same example of GrandDaughter(x,y). Here P(x,y) can be
read as “The P of x is y”.
• Recursive rules, in which the target predicate (e.g., Ancestor) occurs in its
own precondition, are also potentially within reach of FOIL's search,
provided Ancestor is included in the list Predicates that determines which
predicates may be considered when generating new literals.
• Whether such a rule would actually be learned depends on whether these
particular literals outscore competing candidates during FOIL's greedy
search for increasingly specific rules.
Summary of FOIL
• To learn each rule FOIL performs a general-to-specific search, at each step
adding a single new literal to the rule preconditions.
• The new literal may refer to variables already mentioned in the rule
preconditions or post-conditions, and may introduce new variables as well.
• At each step, it uses the Foil_Gain function to select among the candidate
new literals.
• In the case of noise-free training data, FOIL may continue adding new
literals to the rule until it covers no negative examples.
• FOIL post-prunes each rule it learns, using the same rule post-pruning
strategy used for decision trees.
Induction as Inverted Deduction
• Another approach to inductive logic programming is based on the
simple observation that induction is just the inverse of deduction.
• Learning can then be described as the task of finding a hypothesis h such that
(∀⟨xi, f(xi)⟩ ∈ D) (B ∧ h ∧ xi) ⊢ f(xi)
where B is the background knowledge, D the training data, xi the description of
the ith training instance, and f(xi) its target value.
• Given two clauses C1 and C2, the resolution operator first identifies a
literal L that occurs as a positive literal in one of these two clauses
and as a negative literal in the other.
• It then draws the conclusion C = (C1 − {L}) ∪ (C2 − {¬L}).
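A minimal sketch of the propositional resolution operator, representing a clause as a set of literal strings where negation is written as a "not " prefix:

```python
def negate(literal):
    return literal[4:] if literal.startswith("not ") else "not " + literal

# Given clauses C1 and C2, find a literal L in C1 whose negation is in C2 and
# return the resolvent (C1 - {L}) union (C2 - {not L}).
def resolve(c1, c2):
    for literal in c1:
        if negate(literal) in c2:
            return (c1 - {literal}) | (c2 - {negate(literal)})
    return None   # no complementary pair of literals

# Example: (PassExam v not KnowMaterial) and (KnowMaterial v not Study)
# resolve to (PassExam v not Study).
print(resolve({"PassExam", "not KnowMaterial"}, {"KnowMaterial", "not Study"}))
```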
Inverting Resolution
First Order Resolution
• The resolution rule then constructs the resolvent C according to the
equation:
C = (C1 − {L1})θ ∪ (C2 − {L2})θ
where L1 is a literal in clause C1, L2 is a literal in clause C2, and θ is the
unifying substitution satisfying L1θ = ¬L2θ.
• The second property is that, throughout the training process, every Q̂(s, a)
value will remain in the interval between zero and its true Q value.
Convergence
• The question to be answered is: will the algorithm discussed converge
toward a Q̂ equal to the true Q function?
• The answer is yes, under the following conditions:
• We assume the system is a deterministic MDP.
• We assume the immediate reward values are bounded.
• We assume the agent selects actions in such a fashion that it visits every possible
state-action pair infinitely often.
• The key idea underlying the proof of convergence is that the table entry
Q̂(s, a) with the largest error must have its error reduced by a factor of γ
whenever it is updated.
Experimentation Strategies
• Q learning uses a probabilistic approach to select actions.
• Actions with higher values are assigned higher probabilities, but
every action is assigned a nonzero probability.
• One way to assign such probabilities is:
P(ai | s) = k^Q̂(s, ai) / Σj k^Q̂(s, aj)
• Here P(ai | s) is the probability of selecting action ai, given that the agent is
in state s.
• k > 0 is a constant that determines how strongly the selection favors
actions with high Q̂ values: a large k exploits, while a small k explores.
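A sketch of this probabilistic selection rule, using a dictionary-based Q̂ table (names are illustrative):

```python
import random

# P(a_i | s) = k ** Q_hat(s, a_i) / sum_j k ** Q_hat(s, a_j)
# Larger k favors exploitation of high-valued actions; smaller k favors exploration.
def select_action(q_hat, state, actions, k=2.0):
    weights = [k ** q_hat.get((state, a), 0.0) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```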
Updating Sequences
• Q learning can learn the Q function while training from actions chosen
completely at random at each step, as long as the resulting training
sequence visits every state-action pair infinitely often.
• If we run repeated identical episodes in chronological order, the Q̂ table
fills in backward from the goal state, at the rate of one new state-action
transition per episode.
• Now consider training on these same state-action transitions, but in
reverse chronological order for each episode.
• This training process will clearly converge in fewer iterations, although
it requires that the agent use more memory to store the entire episode
before beginning the training for that episode.
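A sketch of this reverse-order update over a stored episode, using the deterministic Q-learning rule (names are illustrative):

```python
# episode: list of (s, a, r, s_next) transitions in the order they occurred.
# Updating in reverse order lets values propagate back from the goal in one pass.
def train_on_episode(q_hat, episode, actions, gamma=0.9):
    for s, a, r, s_next in reversed(episode):
        best_next = max(q_hat.get((s_next, a2), 0.0) for a2 in actions)
        q_hat[(s, a)] = r + gamma * best_next
```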
Updating Sequences
• A second strategy for improving the rate of convergence is to store past
state-action transitions, along with the immediate reward that was
received, and retrain on them periodically.
• Throughout the above discussion we have kept our assumption that the
agent does not know the state-transition function δ(s, a) used by the
environment to create the successor state s' = δ (s, a), or the function
r(s, a) used to generate rewards.
• If the two functions are known in advance, then many more efficient
methods are possible.
• A large number of efficient algorithms from the field of dynamic
programming can be applied when the functions δ and r are known.
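As one example of such a dynamic-programming method, here is a sketch of value iteration for a known, deterministic δ and r (names are illustrative):

```python
# Compute state values V(s) = max_a [ r(s, a) + gamma * V(delta(s, a)) ]
# by repeated sweeps over all states, assuming delta and r are known functions.
def value_iteration(states, actions, delta, r, gamma=0.9, sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = max(r(s, a) + gamma * V[delta(s, a)] for a in actions)
    return V
```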
Nondeterministic Rewards and Actions
• The training rule for the nondeterministic environment is:
Q̂n(s, a) ← (1 − αn) Q̂n−1(s, a) + αn [r + γ maxa' Q̂n−1(s', a')]
where αn = 1 / (1 + visitsn(s, a)), and visitsn(s, a) is the total number of times
the state-action pair (s, a) has been visited up to and including the nth iteration.
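A sketch of this update with the decaying learning rate αn = 1 / (1 + visitsn(s, a)), using dictionary-based tables with illustrative names:

```python
# Later visits to (s, a) change Q_hat more gradually, averaging over the
# nondeterministic rewards and transitions instead of overwriting the estimate.
def q_update(q_hat, visits, s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max(q_hat.get((s_next, a2), 0.0) for a2 in actions)
    old = q_hat.get((s, a), 0.0)
    q_hat[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
```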
Temporal Difference Learning
• Q learning can be treated as a special case of a general class of
temporal difference algorithms that learn by reducing discrepancies
between estimates made by the agent at different times.
• Let Q(1)(st, at) denote the training value calculated by this one-step
lookahead:
Q(1)(st, at) ≡ rt + γ maxa Q̂(st+1, a)