AI Mod3 - Notes Cha1

MODULE – 3

CHAPTER-1
Contents:

• Heuristic Search.
• Greedy best-first search.
• A* search.
• Heuristic Function.

INFORMED (HEURISTIC) SEARCH:
This section shows how an informed search strategy—one that uses problem-specific
knowledge beyond the definition of the problem itself—can find solutions more efficiently than can
an uninformed strategy.
BEST-FIRST SEARCH:
The general approach we consider is called best-first search. Best-first search is an instance of
the general TREE-SEARCH or GRAPH-SEARCH algorithm in which a node is selected for
expansion based on an evaluation function, f(n).
EVALUATION FUNCTION:
The evaluation function is construed as a cost estimate, so the node with the lowest evaluation
is expanded first. The implementation of best-first graph search is identical to that for uniform-cost
search, except for the use of f instead of g to order the priority queue.
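
Since the implementation differs from uniform-cost search only in how the priority queue is ordered, the whole best-first family can be captured by one routine parameterized by f. Below is a minimal Python sketch of that idea; the dictionary-based graph representation and the function names are illustrative assumptions, not the book's pseudocode.

```python
import heapq

def best_first_search(graph, start, goal, f):
    """Generic best-first graph search: always expand the frontier node
    with the lowest f-value. `graph` maps a state to {neighbor: step_cost};
    `f` takes (state, path_cost_so_far) and returns the evaluation value."""
    frontier = [(f(start, 0), start, 0, [start])]   # (f, state, g, path)
    explored = set()
    while frontier:
        _, state, g, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for nbr, cost in graph[state].items():
            if nbr not in explored:
                g2 = g + cost
                heapq.heappush(frontier, (f(nbr, g2), nbr, g2, path + [nbr]))
    return None, float('inf')

# Uniform-cost search is the special case f(n) = g(n):
# best_first_search(graph, start, goal, f=lambda s, g: g)
```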

1. Explain Heuristic/informed function & Greedy best-first search:


HEURISTIC FUNCTION:
The choice of f determines the search strategy. Most best-first algorithms include as a
component of f a heuristic function, denoted h(n):
h(n) = estimated cost of the cheapest path from the state at node n to a goal state.
(Notice that h(n) takes a node as input, but, unlike g(n), it depends only on the state at that node.)
For example, in Romania, one might estimate the cost of the cheapest path from Arad to Bucharest via
the straight-line distance from Arad to Bucharest. Heuristic functions are the most common form in
which additional knowledge of the problem is imparted to the search algorithm. We study heuristics
in more depth in Section 3.6. For now, we consider them to be arbitrary, nonnegative, problem-
specific functions, with one constraint: if n is a goal node, then h(n)=0. The remainder of this section
covers two ways to use heuristic information to guide search.
GREEDY BEST-FIRST SEARCH:
Greedy best-first search tries to expand the node that is closest to the goal, on the grounds
that this is likely to lead to a solution quickly. Thus, it evaluates nodes by using just the heuristic
function; that is, f(n) = h(n).
STRAIGHT-LINE DISTANCE:
Let us see how this works for route-finding problems in Romania; we use the straight-line
distance heuristic, which we will call hSLD. If the goal is Bucharest, we need to know the straight-line
distances to Bucharest, which are shown in Figure 3.22. For example, hSLD (In(Arad)) = 366. Notice
that the values of hSLD cannot be computed from the problem description itself.
Moreover, it takes a certain amount of experience to know that hSLD is correlated with actual road
distances and is, therefore, a useful heuristic.

Figure 3.23 shows the progress of a greedy best-first search using hSLD to find a path from
Arad to Bucharest. The first node to be expanded from Arad will be Sibiu because it is closer to
Bucharest than either Zerind or Timisoara. The next node to be expanded will be Fagaras because it is
closest. Fagaras in turn generates Bucharest, which is the goal. For this particular problem, greedy
best-first search using hSLD finds a solution without ever expanding a node that is not on the solution
path; hence, its search cost is minimal. It is not optimal, however: the path via Sibiu and Fagaras to
Bucharest is 32 kilometers longer than the path through Rimnicu Vilcea and Pitesti. This shows why
the algorithm is called “greedy”—at each step it tries to get as close to the goal as it can. Greedy best-
first tree search is also incomplete even in a finite state space, much like depth-first search. Consider
the problem of getting from Iasi to Fagaras. The heuristic suggests that Neamt be expanded first
because it is closest to Fagaras, but it is a dead end. The solution is to go first to Vaslui—a step that is
actually farther from the goal according to the heuristic—and then to continue to Urziceni, Bucharest,
and Fagaras. The algorithm will never find this solution, however, because expanding Neamt puts Iasi
back into the frontier, Iasi is closer to Fagaras than Vaslui is, and so Iasi will be expanded again,
leading to an infinite loop. (The graph search version is complete in finite spaces, but not in infinite
ones.)
The worst-case time and space complexity for the tree version is O(b^m), where m is the
maximum depth of the search space. With a good heuristic function, however, the complexity can be
reduced substantially. The amount of the reduction depends on the particular problem and on the
quality of the heuristic.
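
To make the Romania example concrete, here is a small fragment of the map with road distances and hSLD values taken from the textbook's Romania figures, driven through the best_first_search sketch above with f(n) = h(n). The greedy search heads straight through Sibiu and Fagaras and returns the 450 km path rather than the optimal 418 km one.

```python
# Road distances and h_SLD values follow the AIMA Romania map figures.
romania = {
    'Arad':          {'Zerind': 75, 'Timisoara': 118, 'Sibiu': 140},
    'Zerind':        {'Arad': 75},
    'Timisoara':     {'Arad': 118},
    'Sibiu':         {'Arad': 140, 'Fagaras': 99, 'RimnicuVilcea': 80},
    'Fagaras':       {'Sibiu': 99, 'Bucharest': 211},
    'RimnicuVilcea': {'Sibiu': 80, 'Pitesti': 97},
    'Pitesti':       {'RimnicuVilcea': 97, 'Bucharest': 101},
    'Bucharest':     {'Fagaras': 211, 'Pitesti': 101},
}
h_sld = {'Arad': 366, 'Zerind': 374, 'Timisoara': 329, 'Sibiu': 253,
         'Fagaras': 176, 'RimnicuVilcea': 193, 'Pitesti': 100, 'Bucharest': 0}

# Greedy best-first search: f(n) = h(n), path cost ignored entirely.
path, cost = best_first_search(romania, 'Arad', 'Bucharest',
                               f=lambda s, g: h_sld[s])
print(path, cost)   # ['Arad', 'Sibiu', 'Fagaras', 'Bucharest'] 450
```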

-----
2. Explain informed A* Search:
A* search: Minimizing the total estimated solution cost
The most widely known form of best-first search is called A∗ search. It evaluates nodes by
combining g(n), the cost to reach the node, and h(n), the cost to get from the node to the goal:
f(n) = g(n) + h(n).
Since g(n) gives the path cost from the start node to node n, and h(n) is the estimated cost of
the cheapest path from n to the goal, we have
f(n) = estimated cost of the cheapest solution through n.
Thus, if we are trying to find the cheapest solution, a reasonable thing to try first is the node
with the lowest value of g(n) + h(n).
It turns out that this strategy is more than just reasonable: provided that the heuristic function
h(n) satisfies certain conditions, A∗ search is both complete and optimal. The algorithm is identical to
UNIFORM-COST-SEARCH except that A∗ uses g + h instead of g.
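
With the sketch and map fragment above, switching from greedy search to A∗ is a one-line change to f. On the Romania fragment this recovers the optimal 418 km route through Rimnicu Vilcea and Pitesti (again an illustrative run, assuming the names defined earlier):

```python
# A* search: f(n) = g(n) + h(n).
path, cost = best_first_search(romania, 'Arad', 'Bucharest',
                               f=lambda s, g: g + h_sld[s])
print(path, cost)
# ['Arad', 'Sibiu', 'RimnicuVilcea', 'Pitesti', 'Bucharest'] 418
```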

-----

Conditions for optimality: Admissibility and consistency:


The first condition we require for optimality is that h(n) be an admissible heuristic. An admissible
heuristic is one that never overestimates the cost to reach the goal. Because g(n) is the actual cost to
reach n along the current path, and f(n) = g(n) + h(n), we have as an immediate consequence that f(n)
never overestimates the true cost of a solution along the current path through n. Admissible heuristics
are by nature optimistic because they think the cost of solving the problem is less than it actually is.
An obvious example of an admissible heuristic is the straight-line distance hSLD that we used in getting
to Bucharest. Straight-line distance is admissible because the shortest path between any two points is
a straight line, so the straight line cannot be an overestimate.
In Figure 3.24, we show the progress of an A∗ tree search for Bucharest. The values of g are
computed from the step costs in Figure 3.24, and the values of hSLD are given in Figure 3.22. Notice in
particular that Bucharest first appears on the frontier at step (e), but it is not selected for expansion
because its f-cost (450) is higher than that of Pitesti (417). Another way to say this is that there might
be a solution through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution
that costs 450.
CONSISTENCY AND MONOTONICITY:
A second, slightly stronger condition called consistency (or sometimes monotonicity) is
required only for applications of A∗ to graph search. A heuristic h(n) is consistent if, for every node n
and every successor n’ of n generated by any action a, the estimated cost of reaching the goal from n
is no greater than the step cost of getting to n’ plus the estimated cost of reaching the goal from n’:
h(n) ≤ c(n, a, n’) + h(n’).
TRIANGLE INEQUALITY:
This is a form of the general triangle inequality, which stipulates that each side of a triangle
cannot be longer than the sum of the other two sides. Here, the triangle is formed by n, n’, and the
goal Gn closest to n. For an admissible heuristic, the inequality makes perfect sense: if there were a
route from n to Gn via n’ that was cheaper than h(n), that would violate the property that h(n) is a
lower bound on the cost to reach Gn. It is fairly easy to show (Exercise 3.29) that every consistent
heuristic is also admissible. Consistency is therefore a stricter requirement than admissibility, but one
has to work quite hard to concoct heuristics that are admissible but not consistent. All the admissible
heuristics we discuss in this chapter are also consistent. Consider, for example, hSLD . We know that
the general triangle inequality is satisfied when each side is measured by the straight-line distance and
that the straight-line distance between n and n’ is no greater than c(n, a, n’ ). Hence, hSLD is a
consistent heuristic.
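
On an explicit graph, consistency can be verified directly by testing the inequality on every edge. A tiny illustrative helper, assuming the romania and h_sld names defined earlier:

```python
def is_consistent(graph, h):
    """Check h(n) <= c(n, a, n') + h(n') for every edge of the graph."""
    return all(h[n] <= cost + h[nbr]
               for n in graph
               for nbr, cost in graph[n].items())

print(is_consistent(romania, h_sld))   # True: h_SLD is consistent
```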

Optimality of A*:
➢ As we mentioned earlier, A∗ has the following properties: the tree-search version of A∗ is
optimal if h(n) is admissible, while the graph-search version is optimal if h(n) is consistent.
We show the second of these two claims since it is more useful. The argument essentially
mirrors the argument for the optimality of uniform-cost search, with g replaced by f—just as
in the A∗ algorithm itself.
➢ The first step is to establish the following: if h(n) is consistent, then the values of f(n) along
any path are nondecreasing. The proof follows directly from the definition of consistency.
Suppose n’ is a successor of n; then g(n’) = g(n) + c(n, a, n’) for some action a, and we have
f(n’) = g(n’) + h(n’) = g(n) + c(n, a, n’) + h(n’) ≥ g(n) + h(n) = f(n).
➢ The next step is to prove that whenever A∗ selects a node n for expansion, the optimal path to
that node has been found. Were this not the case, there would have to be another frontier node
n’ on the optimal path from the start node to n, by the graph separation property of Figure 3.9;
because f is nondecreasing along any path, n’ would have lower f-cost than n and would have
been selected first.
From the two preceding observations, it follows that the sequence of nodes expanded by A∗ using
GRAPH-SEARCH is in nondecreasing order of f(n). Hence, the first goal node selected for expansion
must be an optimal solution because f is the true cost for goal nodes (which have h = 0) and all later
goal nodes will be at least as expensive.

CONTOUR:
The fact that f-costs are nondecreasing along any path also means that we can draw contours in the
state space, just like the contours in a topographic map. Figure 3.25 shows an example. Inside the
contour labeled 400, all nodes have f(n) less than or equal to 400, and so on. Then, because A∗
expands the frontier node of lowest f-cost, we can see that an A∗ search fans out from the start node,
adding nodes in concentric bands of increasing f-cost.
With uniform-cost search (A∗ search using h(n)=0), the bands will be “circular” around the
start state. With more accurate heuristics, the bands will stretch toward the goal state and become
more narrowly focused around the optimal path.
If C∗ is the cost of the optimal solution path, then we can say the following:
• A∗ expands all nodes with f(n) < C∗.
• A∗ might then expand some of the nodes right on the “goal contour” (where f(n) = C∗)
before selecting a goal node.
Completeness requires that there be only finitely many nodes with cost less than or equal to C∗, a
condition that is true if all step costs exceed some finite ε and if b is finite.

PRUNING
Notice that A∗ expands no nodes with f(n) > C∗—for example, Timisoara is not expanded in Figure
3.24 even though it is a child of the root. We say that the subtree below Timisoara is pruned; because
hSLD is admissible, the algorithm can safely ignore this subtree while still guaranteeing optimality. The
concept of pruning—eliminating possibilities from consideration without having to examine them—is
important for many areas of AI.
OPTIMALLY EFFICIENT:
One final observation is that among optimal algorithms of this type—algorithms that extend search
paths from the root and use the same heuristic information—A∗ is optimally efficient for any given
consistent heuristic. That is, no other optimal algorithm is guaranteed to expand fewer nodes than
A∗ (except possibly through tie-breaking among nodes with f(n) = C∗). This is because any algorithm
that does not expand all nodes with f(n) < C∗ runs the risk of missing the optimal solution.
ABSOLUTE ERROR AND RELATIVE ERROR:
That A∗ search is complete, optimal, and optimally efficient among all such algorithms is rather
satisfying. Unfortunately, it does not mean that A∗ is the answer to all our searching needs. The catch
is that, for most problems, the number of states within the goal contour search space is still
exponential in the length of the solution. The details of the analysis are beyond the scope of this book,
but the basic results are as follows. For problems with constant step costs, the growth in run time as a
function of the optimal solution depth d is analyzed in terms of the absolute error or the relative
error of the heuristic. The absolute error is defined as Δ ≡ h∗ − h, where h∗ is the
actual cost of getting from the root to the goal, and the relative error is defined as

ε ≡ (h∗ − h)/h∗.
The complexity results depend very strongly on the assumptions made about the state space. The
simplest model studied is a state space that has a single goal and is essentially a tree with reversible
actions. (The 8-puzzle satisfies the first and third of these assumptions.) In this case, the time
complexity of A∗ is exponential in the maximum absolute error, that is, O(b^Δ). For constant step
costs, we can write this as O(b^(εd)), where d is the solution depth. For almost all heuristics in practical
use, the absolute error is at least proportional to the path cost h∗, so ε is constant or growing and the
time complexity is exponential in d. We can also see the effect of a more accurate heuristic:
O(b^(εd)) = O((b^ε)^d), so the effective branching factor (defined more formally in the next section) is b^ε.
When the state space has many goal states—particularly near-optimal goal states—the
search process can be led astray from the optimal path and there is an extra cost proportional to the
number of goals whose cost is within a factor ε of the optimal cost. Finally, in the general case of a
graph, the situation is even worse. There can be exponentially many states with f(n) < C∗ even if the
absolute error is bounded by a constant.
For example, consider a version of the vacuum world where the agent can clean up any square for unit
cost without even having to visit it: in that case, squares can be cleaned in any order. With N initially
dirty squares, there are 2^N states where some subset has been cleaned and all of them are on an
optimal solution path—and hence satisfy f(n) < C∗—even if the heuristic has an error of 1. The
complexity of A∗ often makes it impractical to insist on finding an optimal solution. One can use
variants of A∗ that find suboptimal solutions quickly, or one can sometimes design heuristics that are
more accurate but not strictly admissible. In any case, the use of a good heuristic still provides
enormous savings compared to the use of an uninformed search. In Section 3.6, we look at the
question of designing good heuristics. Computation time is not, however, A∗’s main drawback.
Because it keeps all generated nodes in memory (as do all GRAPH-SEARCH algorithms), A∗ usually
runs out of space long before it runs out of time. For this reason, A∗ is not practical for many large-
scale problems. There are, however, algorithms that overcome the space problem without sacrificing
optimality or completeness, at a small cost in execution time. We discuss these next.

3. Explain Heuristic function by taking 8-puzzle game as example:


HEURISTIC FUNCTIONS:
The 8-puzzle was one of the earliest heuristic search problems. The object of the puzzle is to
slide the tiles horizontally or vertically into the empty space until the configuration matches the goal
configuration (Figure 3.28). The average solution cost for a randomly generated 8-puzzle instance is
about 22 steps. The branching factor is about 3. (When the empty tile is in the middle, four moves are
possible; when it is in a corner, two; and when it is along an edge, three.) This means that an
exhaustive tree search to depth 22 would look at about 3^22 ≈ 3.1 × 10^10 states. A graph search would
cut this down by a factor of about 170,000 because only 9!/2 = 181,440 distinct states are reachable.
This is a manageable number, but the corresponding number for the 15-puzzle is roughly 10^13, so the
next order of business is to find a good heuristic function. If we want to find the shortest solutions by
using A∗, we need a heuristic function that never overestimates the number of steps to the goal.

There is a long history of such heuristics for the 15-puzzle; here are two commonly used candidates:
• h1 = the number of misplaced tiles. For Figure 3.28, all of the eight tiles are out of position,
so the start state would have h1 = 8. h1 is an admissible heuristic because it is clear that any
tile that is out of place must be moved at least once.
• h2 = the sum of the distances of the tiles from their goal positions. Because tiles cannot
move along diagonals, the distance we will count is the sum of the horizontal and vertical
distances. This is sometimes called the city block distance or Manhattan distance. h2 is
also admissible because all any move can do is move one tile one step closer to the goal. Tiles
1 to 8 in the start state give a Manhattan distance of
h2 = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18.
As expected, neither of these overestimates the true solution cost, which is 26.
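
Both heuristics are only a few lines of code. The sketch below computes h1 and h2 for the Figure 3.28 start state (the tile layout is assumed from that figure; 0 stands for the blank):

```python
start = (7, 2, 4, 5, 0, 6, 8, 3, 1)   # Figure 3.28 start state (assumed layout)
goal  = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # Figure 3.28 goal state

def h1(state, goal=goal):
    """Number of misplaced tiles (the blank is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal=goal):
    """Sum of Manhattan (city-block) distances of tiles from their goals."""
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        j = goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total

print(h1(start), h2(start))   # 8 18, matching the values in the text
```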

-------
4. Explain the effect of heuristic accuracy on performance bring out the
comparison between Iterative-Deepening-Search & A* algorithm:
EFFECTIVE BRANCHING FACTOR:
One way to characterize the quality of a heuristic is the effective branching factor b∗. If the
total number of nodes generated by A∗ for a particular problem is N and the solution depth is d, then
b∗ is the branching factor that a uniform tree of depth d would have to have in order to contain N + 1
nodes. Thus,
N + 1 = 1 + b* + (b*)^2 + ··· + (b*)^d
For example, if A∗ finds a solution at depth 5 using 52 nodes, then the effective branching factor is
1.92. The effective branching factor can vary across problem instances, but usually it is fairly constant
for sufficiently hard problems. (The existence of an effective branching factor follows from the result,
mentioned earlier, that the number of nodes expanded by A∗ grows exponentially with solution
depth.) Therefore, experimental measurements of b∗ on a small set of problems can provide a good
guide to the heuristic’s overall usefulness. A well-designed heuristic would have a value of b∗ close to
1, allowing fairly large problems to be solved at reasonable computational cost.
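
The defining equation has no closed-form solution for b*, but since the polynomial is increasing in b it is easy to solve numerically. A small illustrative helper using bisection:

```python
def effective_branching_factor(N, d, tol=1e-6):
    """Solve N + 1 = 1 + b + b^2 + ... + b^d for b by bisection."""
    def total(b):
        return sum(b ** i for i in range(d + 1))
    lo, hi = 1.0, float(N)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The example from the text: 52 nodes generated at solution depth 5.
print(round(effective_branching_factor(52, 5), 2))   # 1.92
```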
To test the heuristic functions h1 and h2, we generated 1200 random problems with solution
lengths from 2 to 24 (100 for each even number) and solved them with iterative deepening search and
with A∗ tree search using both h1 and h2. Figure 3.29 gives the average number of nodes generated
by each strategy and the effective branching factor. The results suggest that h2 is better than h1, and is
far better than using iterative deepening search. Even for small problems with d = 12, A∗ with h2 is
50,000 times more efficient than uninformed iterative deepening search.

----------
DOMINATION:
One might ask whether h2 is always better than h1. The answer is “Essentially, yes.” It is easy
to see from the definitions of the two heuristics that, for any node n, h2(n) ≥ h1(n). We thus say that h2
dominates h1. Domination translates directly into efficiency: A∗ using h2 will never expand more
nodes than A∗ using h1 (except possibly for some nodes with f(n) = C∗). The argument is simple. Recall
that every node with f(n) < C∗ will surely be expanded. This is the same as saying that every node with
h(n) < C∗ − g(n) will surely be expanded. But because h2 is at least as big as h1 for all nodes, every
node that is surely expanded by A∗ search with h2 will also surely be expanded with h1, and h1 might
cause other nodes to be expanded as well. Hence, it is generally better to use a heuristic function with
higher values, provided it is consistent and that the computation time for the heuristic is not too long.
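
Domination is easy to observe empirically. The illustrative experiment below runs A∗ graph search on the 8-puzzle with h1 and then with h2 (reusing the start, goal, h1, and h2 names defined earlier) and reports the solution cost and the number of nodes expanded; h2 solves the same instance with far fewer expansions.

```python
import heapq

def neighbors(state):
    """Yield the states reachable by sliding one tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < 3 and 0 <= c2 < 3:
            j = r2 * 3 + c2
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

def astar_stats(start, goal, h):
    """Return (solution cost, nodes expanded) for A* graph search."""
    frontier = [(h(start), 0, start)]
    explored = set()
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state in explored:
            continue
        if state == goal:
            return g, len(explored)
        explored.add(state)
        for nbr in neighbors(state):
            if nbr not in explored:
                heapq.heappush(frontier, (g + 1 + h(nbr), g + 1, nbr))
    return None

print(astar_stats(start, goal, h1))   # cost 26, many expansions
print(astar_stats(start, goal, h2))   # cost 26, far fewer expansions
```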

5. Explain Generating admissible heuristics from relaxed problems:


We have seen that both h1 (misplaced tiles) and h2 (Manhattan distance) are fairly good
heuristics for the 8-puzzle and that h2 is better. How might one have come up with h2? Is it possible
for a computer to invent such a heuristic mechanically? h1 and h2 are estimates of the remaining path
length for the 8-puzzle, but they are also perfectly accurate path lengths for simplified versions of the
puzzle. If the rules of the puzzle were changed so that a tile could move anywhere instead of just to
the adjacent empty square, then h1 would give the exact number of steps in the shortest solution.
Similarly, if a tile could move one square in any direction, even onto an occupied square, then h2
would give the exact number of steps in the shortest solution. A problem with fewer restrictions on
the actions is called a relaxed problem. The state-space graph of the relaxed problem is a supergraph
of the original state space because the removal of restrictions creates added edges in the graph.
Because the relaxed problem adds edges to the state space, any optimal solution in the original
problem is, by definition, also a solution in the relaxed problem; but the relaxed problem may have
better solutions if the added edges provide short cuts.
Hence, the cost of an optimal solution to a relaxed problem is an admissible heuristic for the original
problem. Furthermore, because the derived heuristic is an exact cost for the relaxed problem, it must
obey the triangle inequality and is therefore consistent.
If a problem definition is written down in a formal language, it is possible to construct relaxed
problems automatically. For example, if the 8-puzzle actions are described as
A tile can move from square A to square B
if A is horizontally or vertically adjacent to B and B is blank,
we can generate three relaxed problems by removing one or both of the conditions:
(a) A tile can move from square A to square B if A is adjacent to B.
(b) A tile can move from square A to square B if B is blank.
(c) A tile can move from square A to square B.
From (a), we can derive h2 (Manhattan distance). The reasoning is that h2 would be the proper score
if we moved each tile in turn to its destination. From (c), we can derive h1 (misplaced tiles) because it
would be the proper score if tiles could move to their intended destination in one step. Notice that it is
crucial that the relaxed problems generated by this technique can be solved essentially without search,
because the relaxed rules allow the problem to be decomposed into eight independent subproblems. If
the relaxed problem is hard to solve, then the values of the corresponding heuristic will be expensive
to obtain.
A program called ABSOLVER can generate heuristics automatically from problem
definitions, using the “relaxed problem” method and various other techniques (Prieditis, 1993).
ABSOLVER generated a new heuristic for the 8-puzzle that was better than any preexisting heuristic
and found the first useful heuristic for the famous Rubik’s Cube puzzle.
One problem with generating new heuristic functions is that one often fails to get a single
“clearly best” heuristic. If a collection of admissible heuristics h1 ...hm is available for a problem and
none of them dominates any of the others, which should we choose? As it turns out, we need not make
a choice. We can have the best of all worlds, by defining
h(n) = max{h1(n), ..., hm(n)}.
This composite heuristic uses whichever function is most accurate on the node in question. Because
the component heuristics are admissible, h is admissible; it is also easy to prove that h is consistent.
Furthermore, h dominates all of its component heuristics.
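
In code, the composite is a one-liner; the sketch below builds it from the h1 and h2 functions defined earlier (any collection of admissible heuristics would do):

```python
def h_max(*hs):
    """Pointwise maximum of several admissible heuristics."""
    return lambda n: max(h(n) for h in hs)

h = h_max(h1, h2)   # admissible, consistent if the components are,
                    # and dominates every component
print(h(start))     # 18: h2 wins on this state
```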
---------
Generating admissible heuristics from subproblems: Pattern databases:
SUBPROBLEM:
Admissible heuristics can also be derived from the solution cost of a subproblem of a given
problem. For example, Figure 3.30 shows a subproblem of the 8-puzzle instance in Figure 3.28. The
subproblem involves getting tiles 1, 2, 3, 4 into their correct positions. Clearly, the cost of the optimal
solution of this subproblem is a lower bound on the cost of the complete problem. It turns out to be
more accurate than Manhattan distance in some cases.

PATTERN DATABASE:
The idea behind pattern databases is to store these exact solution costs for every possible
subproblem instance—in our example, every possible configuration of the four tiles and the blank.
(The locations of the other four tiles are irrelevant for the purposes of solving the subproblem, but
moves of those tiles do count toward the cost.) Then we compute an admissible heuristic hDB for each
complete state encountered during a search simply by looking up the corresponding subproblem
configuration in the database. The database itself is constructed by searching back from the goal
and recording the cost of each new pattern encountered; the expense of this search is amortized over
many subsequent problem instances. The choice of 1-2-3-4 is fairly arbitrary; we could also construct
databases for 5-6-7-8, for 2-4-6-8, and so on. Each database yields an admissible heuristic, and these
heuristics can be combined, as explained earlier, by taking the maximum value. A combined heuristic
of this kind is much more accurate than the Manhattan distance; the number of nodes generated when
solving random 15-puzzles can be reduced by a factor of 1000.
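
For the 8-puzzle, the whole 1-2-3-4 database is small enough to build in a fraction of a second. The sketch below abstracts a state to the positions of the blank and the four pattern tiles (everything else is a "don't care"), breadth-first searches back from the goal pattern with every move costing 1, and exposes the result as a lookup heuristic. This is an illustrative reconstruction of the idea, not code from the book; it reuses the start state defined earlier.

```python
from collections import deque

# Cells adjacent to each cell of the 3x3 board.
ADJ = {i: [j for j in range(9)
           if abs(i // 3 - j // 3) + abs(i % 3 - j % 3) == 1]
       for i in range(9)}

def build_pdb(pattern=(1, 2, 3, 4), goal=(0, 1, 2, 3, 4, 5, 6, 7, 8)):
    """BFS back from the goal over abstract states (blank + pattern tiles)."""
    g0 = (goal.index(0),) + tuple(goal.index(t) for t in pattern)
    pdb, frontier = {g0: 0}, deque([g0])
    while frontier:
        state = frontier.popleft()
        blank, tiles = state[0], state[1:]
        for j in ADJ[blank]:   # slide whatever occupies cell j into the blank
            tiles2 = tuple(blank if p == j else p for p in tiles)
            s2 = (j,) + tiles2
            if s2 not in pdb:
                pdb[s2] = pdb[state] + 1
                frontier.append(s2)
    return pdb

pdb = build_pdb()

def h_db(state, pattern=(1, 2, 3, 4)):
    """Exact cost of getting the blank and the pattern tiles into place."""
    return pdb[(state.index(0),) + tuple(state.index(t) for t in pattern)]

print(len(pdb), h_db(start))   # 15120 abstract states; a lower bound for start
```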
DISJOINT PATTERN DATABASES:
One might wonder whether the heuristics obtained from the 1-2-3-4 database and the 5-6-7-8
database could be added, since the two subproblems seem not to overlap. Would this still give an admissible
heuristic? The answer is no, because the solutions of the 1-2-3-4 subproblem and the 5-6-7-8
subproblem for a given state will almost certainly share some moves—it is unlikely that 1-2-3-4 can
be moved into place without touching 5-6-7-8, and vice versa. But what if we don’t count those
moves? That is, we record not the total cost of solving the 1-2-3-4 subproblem, but just the number of
moves involving 1-2-3-4. Then it is easy to see that the sum of the two costs is still a lower bound on
the cost of solving the entire problem. This is the idea behind disjoint pattern databases. With such
databases, it is possible to solve random 15-puzzles in a few milliseconds—the number of nodes
generated is reduced by a factor of 10,000 compared with the use of Manhattan distance. For 24-
puzzles, a speedup of roughly a factor of a million can be obtained. Disjoint pattern databases work
for sliding-tile puzzles because the problem can be divided up in such a way that each move affects
only one subproblem—because only one tile is moved at a time. For a problem such as Rubik’s Cube,
this kind of subdivision is difficult because each move affects 8 or 9 of the 26 cubies. More general
ways of defining additive, admissible heuristics have been proposed that do apply to Rubik’s cube
(Yang et al., 2008), but they have not yielded a heuristic better than the best nonadditive heuristic for
the problem.
Learning heuristics from experience:
A heuristic function h(n) is supposed to estimate the cost of a solution beginning from the
state at node n. How could an agent construct such a function? One solution was given in the
preceding sections—namely, to devise relaxed problems for which an optimal solution can be found
easily. Another solution is to learn from experience. “Experience” here means solving lots of 8-
puzzles, for instance. Each optimal solution to an 8-puzzle problem provides examples from which
h(n) can be learned. Each example consists of a state from the solution path and the actual cost of the
solution from that point. From these examples, a learning algorithm can be used to construct a
function h(n) that can (with luck) predict solution costs for other states that arise during search.
FEATURE:
Inductive learning methods work best when supplied with features of a state that are relevant
to predicting the state’s value, rather than with just the raw state description. For example, the feature
“number of misplaced tiles” might be helpful in predicting the actual distance of a state from the goal.
Let’s call this feature x1(n). We could take 100 randomly generated 8-puzzle configurations and
gather statistics on their actual solution costs. We might find that when x1(n) is 5, the average solution
cost is around 14, and so on. Given these data, the value of x1 can be used to predict h(n). Of course,
we can use several features. A second feature x2(n) might be “number of pairs of adjacent tiles that
are not adjacent in the goal state.” How should x1(n) and x2(n) be combined to predict h(n)? A
common approach is to use a linear combination:
h(n) = c1x1(n) + c2x2(n).
The constants c1 and c2 are adjusted to give the best fit to the actual data on solution costs.
One expects both c1 and c2 to be positive because misplaced tiles and incorrect adjacent pairs make
the problem harder to solve. Notice that this heuristic does satisfy the condition that h(n)=0 for goal
states, but it is not necessarily admissible or consistent.
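
Fitting c1 and c2 is an ordinary least-squares problem. A minimal illustrative sketch (the feature values and solution costs below are made-up placeholders standing in for statistics gathered from solved instances):

```python
import numpy as np

x1   = np.array([5, 7, 3, 6, 8])        # misplaced tiles per sampled state
x2   = np.array([4, 6, 2, 5, 7])        # wrong adjacent pairs per sampled state
cost = np.array([14, 19, 8, 16, 22])    # actual solution cost of each state

A = np.column_stack([x1, x2])           # model: cost ~= c1*x1 + c2*x2
(c1, c2), *_ = np.linalg.lstsq(A, cost, rcond=None)

h = lambda v1, v2: c1 * v1 + c2 * v2    # learned heuristic on feature values
print(c1, c2, h(5, 4))
```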
