MODULE 5
Instance-Based Learning: Introduction, k-Nearest Neighbor Learning, Locally Weighted Regression, Radial Basis Functions, Case-Based Reasoning.
Reinforcement Learning: Introduction, The learning task, Q-Learning.
Table 5.1: The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued target function
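The procedure named in Table 5.1 can be sketched in Python as follows. This is a minimal sketch; the helper names `knn_classify` and `euclidean` are illustrative, not part of the original table.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance d(x_q, x_i) between two instances given as attribute tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_classify(training, x_q, k=5):
    """Return the most common class value among the k training
    examples nearest to the query instance x_q.

    training: list of (instance, class_label) pairs
    """
    # Sort the training examples by distance to the query point
    # and keep the k nearest ones.
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    # Majority vote over the class labels of the k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Example: the 1-NN and 5-NN answers can differ, as in Figure 5.1.
data = [((1, 1), '+'), ((2, 1), '+'), ((5, 5), '-'),
        ((6, 5), '-'), ((6, 6), '-'), ((7, 6), '-')]
print(knn_classify(data, (2, 2), k=1))  # '+'
print(knn_classify(data, (2, 2), k=5))  # '-'
```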
Figure 5.1: k-NEAREST NEIGHBOR. A set of positive and negative training examples is shown on the left, along with a query instance xq to be classified. The 1-NEAREST NEIGHBOR algorithm classifies xq as positive, whereas 5-NEAREST NEIGHBOR classifies it as negative. On the right is the decision surface induced by the 1-NEAREST NEIGHBOR algorithm for a typical set of training examples. The convex polygon surrounding each training example indicates the region of instance space closest to that point (i.e., the instances for which the 1-NEAREST NEIGHBOR algorithm will assign the classification belonging to that training example).
One refinement to the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbors according to its distance from the query point xq, giving greater weight to closer neighbors. When the algorithm approximates discrete-valued target functions, we might weight the vote of each neighbor according to the inverse square of its distance from xq.
This can be accomplished by replacing the final line of the algorithm by

\hat{f}(x_q) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))

where

w_i \equiv \frac{1}{d(x_q, x_i)^2}
To accommodate the case where the query point xq exactly matches one of the training instances xi, so that the denominator d(xq, xi)^2 is zero, we assign \hat{f}(xq) to be f(xi) in this case. If there are several such training examples, we assign the majority classification among them.
We can distance-weight the instances for real-valued target functions in a similar fashion, replacing the final line of the algorithm in this case by

\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}

Note the denominator in the above equation is a constant that normalizes the contributions of the various weights (e.g., it assures that if f(xi) = c for all training examples, then \hat{f}(xq) ← c as well).
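A minimal Python sketch of these distance-weighted variants, assuming Euclidean distance and the exact-match convention described above; all names are illustrative.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def weighted_knn(training, x_q, k=5, discrete=True):
    """Distance-weighted k-NN with weights w_i = 1 / d(x_q, x_i)^2."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]

    # If the query exactly matches stored instances, return their
    # majority class (or, for real-valued targets, their mean value).
    exact = [y for x, y in nearest if euclidean(x, x_q) == 0]
    if exact:
        if discrete:
            return max(set(exact), key=exact.count)
        return sum(exact) / len(exact)

    weights = [(1.0 / euclidean(x, x_q) ** 2, y) for x, y in nearest]
    if discrete:
        votes = defaultdict(float)
        for w, y in weights:
            votes[y] += w                      # weighted vote per class
        return max(votes, key=votes.get)
    # Real-valued target: normalized weighted average of f(x_i).
    return sum(w * y for w, y in weights) / sum(w for w, _ in weights)
```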
5.3 Locally Weighted Regression
The nearest-neighbor approaches described in the previous section can be thought of as approximating the target function f(x) at the single query point x = xq. Locally weighted regression is a generalization of this approach.
It constructs an explicit approximation to f over a local region surrounding xq. Locally weighted regression uses nearby or distance-weighted training examples to form this local approximation to f. For example, we might approximate the target function in the neighborhood surrounding xq using a linear function, a quadratic function, a multilayer neural network, or some other functional form.
The phrase "locally weighted regression" is called local because the function is approximated based only on data near the query point, weighted because the contribution of each training example is weighted by its distance from the query point, and regression because this is the term used widely in the statistical learning community for the problem of approximating real-valued functions.
Let us consider the case of locally weighted regression in which the target function f is approximated near xq using a linear function of the form

\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)

As before, ai(x) denotes the value of the ith attribute of the instance x.
Earlier we derived methods to choose weights that minimize the squared error summed over the set D of training examples,

E \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2

which led us to the gradient descent training rule

\Delta w_j = \eta \sum_{x \in D} \left( f(x) - \hat{f}(x) \right) a_j(x)

where η is a constant learning rate.
To obtain a local rather than a global approximation, the error criterion must be redefined to emphasize fitting the training examples near the query point. Three possible criteria are given below. Note we write the error E(xq) to emphasize the fact that the error is now being defined as a function of the query point xq.

1. Minimize the squared error over just the k nearest neighbors:

E_1(x_q) \equiv \frac{1}{2} \sum_{x \in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2

2. Minimize the squared error over the entire set D of training examples, weighting the error of each example by some decreasing function K of its distance from xq:

E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))

3. Combine 1 and 2:

E_3(x_q) \equiv \frac{1}{2} \sum_{x \in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))
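A minimal sketch of locally weighted linear regression, assuming a Gaussian kernel of the distance to xq weights each training example's squared error (in the spirit of criterion 2) and the weighted least-squares fit is solved in closed form rather than by gradient descent; the names `lwr_predict` and `tau` are illustrative.

```python
import numpy as np

def lwr_predict(X, y, x_q, tau=1.0):
    """Locally weighted linear regression prediction at query point x_q.

    X : (m, n) array of training instances, y : (m,) array of target values.
    Each example is weighted by a Gaussian kernel K(d) = exp(-d^2 / (2 tau^2))
    of its distance to x_q, then a weighted least-squares linear fit
    w_0 + w_1 a_1(x) + ... + w_n a_n(x) is solved in closed form.
    """
    m = X.shape[0]
    A = np.hstack([np.ones((m, 1)), X])          # add intercept column for w_0
    a_q = np.concatenate([[1.0], x_q])

    d2 = np.sum((X - x_q) ** 2, axis=1)          # squared distances d(x_q, x)^2
    K = np.exp(-d2 / (2 * tau ** 2))             # Gaussian kernel weights

    # Weighted normal equations: (A^T W A) w = A^T W y
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return a_q @ w

# Tiny usage example on noiseless quadratic data.
X = np.linspace(-3, 3, 25).reshape(-1, 1)
y = (X ** 2).ravel()
print(lwr_predict(X, y, np.array([1.0]), tau=0.5))  # ≈ 1 (value of x^2 at x = 1)
```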
5.4 Radial Basis Functions

One approach to function approximation that is closely related to distance-weighted regression, and also to artificial neural networks, is learning with radial basis functions. In this approach, the learned hypothesis is a function of the form

\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))

where each xu is an instance from X and the kernel function Ku(d(xu, x)) is defined so that it decreases as the distance d(xu, x) increases. Here k is a user-provided constant that specifies the number of kernel functions to be included. Even though \hat{f}(x) is a global approximation to f(x), the contribution from each of the Ku(d(xu, x)) terms is localized to a region near the point xu.
It is common to choose each kernel function Ku(d(xu, x)) to be a Gaussian function centered at the point xu with some variance σu²:

K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}
The function given by the equation above can be viewed as describing a two-layer network, where the first layer of units computes the values of the various Ku(d(xu, x)) and the second layer computes a linear combination of these first-layer unit values.
Given a set of training examples of the target function, RBF networks are typically trained in a two-stage process. First, the number k of hidden units is determined, and each hidden unit u is defined by choosing the values of xu and σu² that define its kernel function Ku(d(xu, x)). Second, the weights wu are trained to maximize the fit of the network to the training data, using a global squared-error criterion. Because the kernel functions are held fixed during this second stage, the linear weight values wu can be trained very efficiently.
One approach is to allocate a Gaussian kernel function for each training example (xi, f(xi)), centering this Gaussian at the point xi. Each of these kernels may be assigned the same width σ². Given this approach, the RBF network learns a global approximation to the target function in which each training example (xi, f(xi)) can influence the value of \hat{f} only in the neighborhood of xi. One advantage of this choice of kernel functions is that it allows the RBF network to fit the training data exactly: for any set of m training examples, the weights w0 ... wm for combining the m Gaussian kernel functions can be set so that \hat{f}(xi) = f(xi) for each training example.
Figure 5.2: A radial basis function network. Each hidden unit produces an activation determined
by a Gaussian function centered at some instance xu. Therefore, its activation will be close to
zero unless the input x is near xu. The output unit produces a linear combination of the hidden unit
activations. Although the network shown here has just one output, multiple output units can also
be included.
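The two-stage RBF training just described can be sketched as follows, assuming one Gaussian kernel is centered on every training example with a shared width σ and the linear output weights are fit by least squares; the class name `RBFNetwork` is illustrative.

```python
import numpy as np

class RBFNetwork:
    """Radial basis function network: f_hat(x) = w_0 + sum_u w_u K_u(d(x_u, x))."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def _phi(self, X):
        # Hidden-layer activations: Gaussian kernels centered at each x_u,
        # plus a constant column for the bias weight w_0.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / (2 * self.sigma ** 2))
        return np.hstack([np.ones((X.shape[0], 1)), K])

    def fit(self, X, y):
        # Stage 1: fix the kernel functions (here: one kernel per training example).
        self.centers = np.asarray(X, dtype=float)
        # Stage 2: with the kernels held fixed, fit the linear weights w_u
        # to the training data by least squares on the hidden activations.
        Phi = self._phi(self.centers)
        self.w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, X):
        return self._phi(np.asarray(X, dtype=float)) @ self.w

# Usage: with one kernel per example the network can fit the data exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
net = RBFNetwork(sigma=0.5).fit(X, y)
print(net.predict(X))   # ≈ [0, 1, 4, 9]
```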
5.5 CASE-BASED REASONING
Instance-based methods such as k-NEAREST NEIGHBOR and locally weighted
regression share three key properties. First, they are lazy learning methods in that they
defer the decision of how to generalize beyond the training data until a new query instance
is observed.
Second, they classify new query instances by analyzing similar instances while ignoring
instances that are very different from the query. Third, they represent instances as real-
valued points in an n-dimensional Euclidean space.
Case-based reasoning (CBR) is a learning paradigm based on the first two of these principles, but not the third. In CBR, instances are typically represented using richer symbolic descriptions, and the methods used to retrieve similar instances are correspondingly more elaborate.
CBR has been applied to problems such as conceptual design of mechanical devices
based on a stored library of previous designs.
Let us consider a prototypical example of a case-based reasoning system to ground our
discussion. The CADET system (Sycara et al. 1992) employs case-based reasoning to
assist in the conceptual design of simple mechanical devices such as water faucets. It uses
a library containing approximately 75 previous designs and design fragments to suggest
conceptual designs to meet the specifications of new design problems.
Each instance stored in memory (e.g., a water pipe) is represented by describing both its
structure and its qualitative function. New design problems are then presented by
specifying the desired function and requesting the corresponding structure.
This problem setting is illustrated in Figure 5.3. The top half of the figure shows the
description of a typical stored case called a T-junction pipe. Its function is represented in
terms of the qualitative relationships among the water flow levels and temperatures at its
inputs and outputs.
In the functional description at its right, an arrow with a "+" label indicates that the variable at the arrowhead increases with the variable at its tail. For example, the output water flow Q3 increases with increasing input water flow Q1. Similarly, a "-" label indicates that the variable at the arrowhead decreases as the variable at its tail increases.
FIGURE 5.3: A stored case and a new problem. The top half of the figure describes a typical
design fragment in the case library of CADET. The function is represented by the graph of
qualitative dependencies among the T-junction variables (described in the text). The bottom half
of the figure shows a typical design problem.
a "-" label indicates that the variable at the head decreases with the variable at the tail. The
bottom half of this figure depicts a new design problem described by its desired function.
This particular function describes the required behavior of one type of water faucet. Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water, and Qm to the single mixed flow out of the faucet. Similarly, Tc, Th, and Tm refer to the temperatures of the cold water, hot water, and mixed water respectively. The variable Ct denotes the control signal for temperature that is input to the faucet, and Cf denotes the control signal for water flow. Note the description of the desired function specifies that these controls Ct and Cf are to influence the water flows Qc and Qh, thereby indirectly influencing the faucet output flow Qm and temperature Tm.
5.6 REINFORCEMENT LEARNING

Figure 5.4: An agent interacting with its environment. The agent exists in an environment described by some set of possible states S. It can perform any of a set of possible actions A. Each time it performs an action at in some state st, the agent receives a real-valued reward rt that indicates the immediate value of this state-action transition. This produces a sequence of states si, actions ai, and immediate rewards ri as shown in the figure. The agent's task is to learn a control policy, π : S → A, that maximizes the expected sum of these rewards, with future rewards discounted exponentially by their delay.
The reinforcement learning problem differs from other function approximation tasks in several important respects.
Delayed reward. The task of the agent is to learn a target function π that maps from the current state s to the optimal action a = π(s). When learning some target function such as π, each training example would be a pair of the form (s, π(s)). In reinforcement learning,
however, training information is not available in this form. Instead, the trainer provides
only a sequence of immediate reward values as the agent executes its sequence of actions.
The agent, therefore, faces the problem of temporal credit assignment: determining which
of the actions in its sequence are to be credited with producing the eventual rewards.
Exploration. In reinforcement learning, the agent influences the distribution of training
examples by the action sequence it chooses. This raises the question of which
experimentation strategy produces the most effective learning. The learner faces a tradeoff in
choosing whether to favor exploration of unknown states and actions (to gather new
information), or exploitation of states and actions that it has already learned will yield high
reward.
Partially observable states. Although it is convenient to assume that the agent's sensors
can perceive the entire state of the environment at each time step, in many practical
situations sensors provide only partial information. For example, a robot with a forward-
pointing camera cannot see what is behind it. In such cases, it may be necessary for the
agent to consider its previous observations together with its current sensor data when
choosing actions, and the best policy may be one that chooses actions specifically to
improve the observability of the environment.
Life-long learning. Unlike isolated function approximation tasks, robot learning often
requires that the robot learn several related tasks within the same environment, using the
same sensors. For example, a mobile robot may need to learn how to dock on its battery
charger, how to navigate through narrow corridors, and how to pick up output from laser
printers. This setting raises the possibility of using previously obtained experience or
knowledge to reduce sample complexity when learning new tasks.
We require that the agent learn a policy π : S → A that selects its next action at based on the current observed state st; that is, π(st) = at. How shall we specify precisely which policy π we would like the agent to learn? One obvious approach is to require the policy that produces the greatest possible cumulative reward for the robot over time. To state this requirement more precisely, we define the cumulative value V^π(st) achieved by following an arbitrary policy π from an arbitrary initial state st as follows:

V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}

where the sequence of rewards rt+i is generated by beginning at state st and repeatedly using the policy π to select actions as described above (i.e., at = π(st), at+1 = π(st+1), etc.), and where 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards.
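As a small illustration of this definition, the sketch below computes a truncated version of the discounted sum for a given reward sequence; the finite reward list is a practical stand-in for the infinite sum.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative value r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    computed over a finite reward sequence (a truncation of the infinite sum)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Rewards 0, 0, 100 followed by zeros give 0 + 0 + (0.9**2)*100 ≈ 81.
print(discounted_return([0, 0, 100], gamma=0.9))  # ≈ 81
```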
We are now in a position to state precisely the agent's learning task. We require that the agent learn a policy π that maximizes V^π(s) for all states s. We will call such a policy an optimal policy and denote it by π*:

\pi^* \equiv \underset{\pi}{\arg\max}\; V^{\pi}(s), \;\; (\forall s)

To simplify notation, we will refer to the value function V^{π*}(s) of such an optimal policy as V*(s). V*(s) gives the maximum discounted cumulative reward that the agent can obtain starting from state s; that is, the discounted cumulative reward obtained by following the optimal policy beginning at state s.
To illustrate these concepts, a simple grid-world environment is depicted in the topmost
diagram of Figure 5.5. The six grid squares in this diagram represent six possible states,
or locations, for the agent. Each arrow in the diagram represents a possible action the
agent can take to move from one state to another.
The number associated with each arrow represents the immediate reward r(s, a) the agent
receives if it executes the corresponding state-action transition. Note the immediate reward
in this particular environment is defined to be zero for all state-action transitions except
for those leading into the state labeled G. G is called an absorbing state.
Once the states, actions, and immediate rewards are defined, and once we choose a value for the discount factor γ, we can determine the optimal policy π* and its value function V*(s). In this case, let us choose γ = 0.9. The diagram at the bottom of the figure shows
one optimal policy for this setting (there are others as well). Like any policy, this policy
specifies exactly one action that the agent will select in any given state.
FIGURE 5.5 A simple deterministic world to illustrate the basic concepts of Q-learning. Each
grid square represents a distinct state; each arrow a distinct action. The immediate reward
function, r(s, a) gives reward 100 for actions entering the goal state G, and zero otherwise. Values
of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ = 0.9. An optimal policy,
corresponding to actions with maximal Q values, is also shown.
The diagram at the right of Figure 5.5 shows the values of V* for each state. For example,
consider the bottom right state in this diagram. The value of V* for this state is 100 because the
optimal policy in this state selects the "moveup" action that receives immediate reward 100.
Thereafter, the agent will remain in the absorbing state and receive no further rewards. Similarly,
the value of V* for the bottom center state is 90. This is because the optimal policy will move the
agent from this state to the right (generating an immediate reward of zero), then upward
(generating an immediate reward of 100). Thus, the discounted future reward from the bottom
center state is 0 + γ·100 + γ²·0 + γ³·0 + ⋯ = 90. Recall that V* is defined to be the sum of discounted
future rewards over the infinite future. In this particular environment, once the agent reaches the
absorbing state G its infinite future will consist of remaining in this state and receiving rewards of
zero.
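A minimal sketch of how V* can be computed by iterating the recurrence V(s) ← max_a [r(s, a) + γ V(δ(s, a))]. The tiny three-state corridor below is a hypothetical stand-in for the six-state grid of Figure 5.5, not its exact layout.

```python
# Hypothetical three-state corridor ending in an absorbing goal G:
#   s0 --right--> s1 --right--> G   (reward 100 on entering G, zero otherwise)
delta = {('s0', 'right'): 's1', ('s1', 'right'): 'G'}   # transition function δ(s, a)
r     = {('s0', 'right'): 0,    ('s1', 'right'): 100}   # immediate reward r(s, a)
gamma = 0.9

V = {'s0': 0.0, 's1': 0.0, 'G': 0.0}   # G is absorbing, so V*(G) stays 0
for _ in range(50):                    # iterate V(s) <- max_a [r(s,a) + gamma*V(delta(s,a))]
    for s in ('s0', 's1'):
        V[s] = max(r[(s, a)] + gamma * V[delta[(s, a)]]
                   for (s2, a) in delta if s2 == s)

print(V)   # V* ≈ {'s0': 90.0, 's1': 100.0, 'G': 0.0}
```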
5.6.3 Q LEARNING
Q-learning is a value-based learning algorithm in reinforcement learning. How can an agent learn an optimal policy π* for an arbitrary environment? It is difficult to learn the function π* : S → A directly, because the available training data does not provide training examples of the form (s, a). Instead, the only training information available to the learner is the sequence of immediate rewards r(si, ai) for i = 0, 1, 2, .... As we shall see, given this kind of training information it is easier to
learn a numerical evaluation function defined over states and actions, then implement the optimal
policy in terms of this evaluation function.
The Q Function
Let us define the evaluation function Q(s, a) so that its value is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action:

Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))

In other words, the value of Q is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.
Note that Q(s, a) is exactly the quantity that is maximized when choosing the optimal action a in state s. Therefore, we can rewrite the definition of the optimal policy in terms of Q(s, a) as

\pi^*(s) = \underset{a}{\arg\max}\; Q(s, a)
It may at first seem surprising that one can choose globally optimal action sequences by reacting
repeatedly to the local values of Q for the current state. This means the agent can choose the
optimal action without ever conducting a lookahead search to explicitly consider what state
results from the action. Part of the beauty of Q learning is that the evaluation function is defined
to have precisely this property: the value of Q for the current state and action summarizes in a
single number all the information needed to determine the discounted cumulative reward that
will be gained in the future if action a is selected in state s.
An Algorithm for Learning Q
The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation. To see how, notice the close relationship between Q and V*:

V^*(s) = \max_{a'} Q(s, a')

which allows rewriting the definition of Q as

Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')

This recursive definition of Q provides the basis for algorithms that iteratively approximate Q.
To describe the algorithm, we will use the symbol Q̂ to refer to the learner's estimate, or hypothesis, of the actual Q function. In this algorithm the learner represents its hypothesis Q̂ by a large table with a separate entry for each state-action pair. The table entry for the pair (s, a) stores the value for Q̂(s, a), the learner's current hypothesis about the actual but unknown value Q(s, a). The table can be initially filled with random values (though it is easier to understand the algorithm if one assumes initial values of zero). The agent repeatedly observes its current state s, chooses some action a, executes this action, then observes the resulting reward r = r(s, a) and the new state s' = δ(s, a). It then updates the table entry for Q̂(s, a) following each such transition, according to the rule:

\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')
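A minimal sketch of this table-based Q-learning loop for a deterministic world; the environment interface (`delta`, `reward`, `actions`) and the purely random action choice are illustrative assumptions, not part of the algorithm statement above.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, reward, goal, gamma=0.9, episodes=1000):
    """Table-based Q-learning for a deterministic world.

    states  : list of states; actions(s) -> list of actions available in s
    delta(s, a) -> next state; reward(s, a) -> immediate reward r(s, a)
    Returns Q_hat, the learner's table of estimates of the true Q function.
    """
    Q_hat = defaultdict(float)                    # every entry starts at zero
    for _ in range(episodes):
        s = random.choice(states)                 # episode starts in a random state
        while s != goal:                          # run until the absorbing goal state
            a = random.choice(actions(s))         # purely exploratory action choice
            r, s_next = reward(s, a), delta(s, a)
            # Training rule: Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q_hat[(s, a)] = r + gamma * max(
                (Q_hat[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
            s = s_next
    return Q_hat
```

Run on a world like that of Figure 5.5, the table entries converge toward the Q values shown there (for example, 90 for the rightward move out of the bottom center state).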
An Illustrative Example
To illustrate the operation of the Q-learning algorithm, consider a single action taken by an agent and the corresponding refinement to Q̂. In this example, the agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition. It then applies the training rule above to refine its estimate Q̂ for the state-action transition it just executed. According to the training rule, the new Q̂ estimate for this transition is the sum of the received reward (zero) and the highest Q̂ value associated with the resulting state (100), discounted by γ = 0.9, giving 0 + 0.9 × 100 = 90.
Each time the agent moves forward from an old state to a new one, Q-learning propagates Q̂ estimates backward from the new state to the old. At the same time, the immediate reward received by the agent for the transition is used to augment these propagated values of Q̂.
Consider applying this algorithm to the grid world and reward function shown in Figure 5.5, for which the reward is zero everywhere except when entering the goal state. Since this world contains an absorbing goal state, we will assume that training consists of a series of episodes. During each episode, the agent begins at some randomly chosen state and is allowed to execute actions until it reaches the absorbing goal state. When it does, the episode ends and the agent is transported to a new, randomly chosen initial state for the next episode.
Convergence