Artificial Intelligence and Machine Learning - Final
LAKSHMI PUBLICATIONS
Plot No.73, VGP Gokul Nagar, 2nd Main Road (40 Feet Road),
Perumbakkam, Chennai-600 100, Tamil Nadu, INDIA.
Phone: 044-49523977, 98945 98598
E-mail: [email protected]
[email protected]
Website: www.lakshmipublications.com
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
by
Price: Rs.
ISBN
Published by and copies can be had from:
This book is tightly focused on the goals and purposes of its audience. Considerable
effort has been put into producing accurate content, and readers will feel at ease with,
and comprehend the depth of, the subject thanks to its contents.
A brief introduction and concise captions for the figures and tables are provided
for the content in the book. Emphasis has been placed on avoiding grammatical
mistakes and on using a proper, consistent format.
- AUTHORS
ACKNOWLEDGEMENT
We first and foremost express our gratitude to GOD ALMIGHTY for giving us the mental
fortitude required to complete the book work preparation.
We would like to extend our sincere appreciation to everyone who helped out, discussed
things, supplied feedback, enabled us to use their comments, and helped with the editing,
proofreading, and design.
We would like to express our gratitude to the Management, Principal Dr. N. Balaji, Vice
Principal Dr. S. Soundararajan, and HOD Dr. V. P. Gladispushparathi.
With great concern for the welfare of engineering students, we would like to extend a very
special thanks to Mrs. Nirmala Durai, Proprietrix of Lakshmi Publications,
Mr. A. Durai, B.E., Founder and Managing Director of Lakshmi Publications, and
Mrs. P. S. Malathi for providing professional support in all of the activities throughout the
development and production of this book.
Suggestions and comments for further improvements of this book are most
welcome. Please mail us at [email protected]
For B.E., COMPUTER SCIENCE AND ENGINEERING BRANCH
UNIT II
UNIT III
UNIT IV
UNIT V
Apple Siri is a good example of Narrow AI, as it operates within a limited,
pre-defined range of functions.
IBM's Watson supercomputer also comes under Narrow AI, as it uses an
Expert system approach combined with Machine learning and natural
language processing.
Some examples of Narrow AI are playing chess, purchasing suggestions
on e-commerce sites, self-driving cars, speech recognition, and image
recognition.
2. General AI:
General AI is a type of intelligence that could perform any intellectual
task as efficiently as a human.
The idea behind general AI is to make a system that is smarter than, and
thinks like, a human on its own.
Currently, no system exists that comes under general AI and can
perform any task as perfectly as a human.
Researchers worldwide are now focused on developing machines with
General AI.
Systems with general AI are still under research, and it will take a great
deal of effort and time to develop such systems.
3. Super AI:
Fig. 1.2.
1. Reactive Machines
Purely reactive machines are the most basic type of Artificial Intelligence.
Such AI systems do not store memories or past experiences for future
actions.
These machines focus only on the current scenario and react to it with the
best possible action.
IBM's Deep Blue system is an example of reactive machines.
Google's AlphaGo is also an example of reactive machines.
2. Limited Memory
Limited memory machines can store past experiences or some data for a
short period of time.
These machines can use stored data for a limited time period only.
Self-driving cars are one of the best examples of Limited Memory
systems. These cars can store recent speed of nearby cars, the distance of
other cars, speed limit, and other information to navigate the road.
3. Theory of Mind
Theory of Mind AI should understand human emotions, people, and
beliefs, and be able to interact socially like humans.
This type of AI machine has still not been developed, but researchers are
putting a great deal of effort into developing such machines.
4. Self-Awareness
Self-awareness AI is the future of Artificial Intelligence. These machines
will be super intelligent, and will have their own consciousness,
sentiments, and self-awareness.
These machines will be smarter than the human mind.
Self-Awareness AI does not yet exist in reality; it is a hypothetical
concept.
A.I. Approaches
The definitions of AI given in various textbooks are categorized into four
approaches, summarized in the table below:
Systems that think like humans
Systems that act like humans
Systems that think rationally
Systems that act rationally
Table 1.1. Some definitions of artificial intelligence, organized into four categories
1. Game Playing:
AI is widely used in gaming. Different strategic games, such as chess, where the
machine needs to think logically, and video games that provide real-time experiences,
use Artificial Intelligence.
2. Robotics:
Artificial Intelligence is commonly used in the field of Robotics to develop
intelligent robots. AI-implemented robots use real-time updates to sense any obstacle
in their path and can change their path instantly. AI robots can be used for carrying
goods in hospitals and industries and can also be used for many other purposes.
3. Healthcare:
In the healthcare sector, AI has diverse uses. In this field, AI can be used to detect
diseases and cancer cells. It also helps in finding new drugs with the use of historical
data and medical intelligence.
4. Computer Vision:
Computer vision enables the computer system to understand and derive
meaningful information from digital images, video, and other visual input with the
help of AI.
5. Agriculture:
AI is now widely used in Agriculture; for example, with the help of AI, we can
easily identify defects and nutrient absences in the soil. To identify these defects, AI
robots can be utilized. AI bots can also be used in crop harvesting at a higher speed
than human workers.
6. E-commerce
AI is one of the most widely used and in-demand technologies in the E-commerce
industry. With AI, e-commerce businesses gain more profit and grow their
business by recommending products as per user requirements.
7. Social Media
Different social media websites such as Facebook, Instagram, Twitter, etc., use AI
to make the user experiences much better by providing different features. For
example, Twitter uses AI to recommend tweets as per the user interest and search
history.
When the correct action to take is not immediately obvious, an agent may need to
plan ahead: to consider a sequence of actions that form a path to a goal state. Such
an agent is called a problem-solving agent, and the computational process it
undertakes is called search.
The agent can follow this four-phase problem-solving process:
Goal Formulation: Goals organize behavior by limiting the objectives
and hence the actions to be considered.
Problem Formulation: The agent devises a description of the states and
actions necessary to reach the goal - an abstract model of the relevant part
of the world.
Search: Before taking any action in the real world, the agent simulates
sequences of actions in its model, searching until it finds a sequence of
actions that reaches the goal. Such a sequence is called a solution. The
agent might have to simulate multiple sequences that do not reach the
goal, but eventually it will find a solution (such as going from Arad to
Sibiu to Fagaras to Bucharest), or it will find that no solution is possible.
Execution: The agent can now execute the actions in the solution, one at a
time. It is an important property that in a fully observable, deterministic,
known environment, the solution to any problem is a fixed sequence of
actions. If the model is correct, then once the agent has found a solution, it
can ignore its percepts while it is executing the actions—because the
solution is guaranteed to lead to the goal. Control theorists call this an
open-loop system: ignoring the percepts breaks the loop between agent
and environment. If there is a chance that the model is incorrect, or the
environment is nondeterministic, then the agent would be safer using a
closed-loop approach that monitors the percepts.
Search algorithms are one of the most important areas of Artificial Intelligence.
This topic will explain all about the search algorithms in AI.
Problem-solving agents:
In Artificial Intelligence, search techniques are universal problem-solving
methods. Rational agents or problem-solving agents in AI mostly use these search
strategies or algorithms to solve a specific problem and provide the best result.
Problem-solving agents are the goal-based agents and use atomic representation. In
this topic, we will learn various problem-solving search algorithms.
Fig. 1.3.
Informed search strategies can find a solution more efficiently than an uninformed
search strategy. Informed search is also called a heuristic search.
A heuristic is a technique that is not always guaranteed to find the best solution,
but is guaranteed to find a good solution in a reasonable time.
Informed search can solve much more complex problems that could not be solved
any other way.
The traveling salesman problem is a classic example of a problem tackled with
informed search algorithms.
1. Greedy Search
2. A* Search
Advantages:
BFS will provide a solution if any solution exists.
If there is more than one solution for a given problem, then BFS will
provide the minimal solution, i.e., the one that requires the least number of steps.
Disadvantages
It requires lots of memory since each level of the tree must be saved into
memory to expand the next level.
BFS needs lots of time if the solution is far away from the root node.
Example:
In the tree structure below, we show the traversal of the tree using the BFS
algorithm from the root node S to the goal node K. The BFS algorithm traverses in
layers, so it will follow the path shown by the dotted arrow, and the
traversed path will be:
S → A → B → C → D → G → H → E → F → I → K
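The following is a minimal Python sketch of breadth-first search over this example.
Since the figure is not reproduced here, the adjacency structure of the tree is an
assumption chosen to match the traversal order above.

    from collections import deque

    # Hypothetical tree matching the example's layer-by-layer traversal.
    tree = {
        'S': ['A', 'B', 'C', 'D'],
        'A': ['G', 'H'], 'B': [], 'C': ['E', 'F'], 'D': [],
        'G': ['I'], 'H': [], 'E': ['K'], 'F': [], 'I': [], 'K': [],
    }

    def bfs(start, goal):
        # Expand nodes level by level using a FIFO queue of paths.
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            path = frontier.popleft()
            node = path[-1]
            if node == goal:
                return path          # the first path found is the shallowest
            for child in tree.get(node, []):
                if child not in visited:
                    visited.add(child)
                    frontier.append(path + [child])
        return None

    print(bfs('S', 'K'))             # ['S', 'C', 'E', 'K']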
Advantage:
DFS requires very little memory, as it only needs to store a stack of the
nodes on the path from the root node to the current node.
It takes less time to reach the goal node than the BFS algorithm (if it
traverses along the right path).
Disadvantage:
There is the possibility that many states keep re-occurring, and there is no
guarantee of finding a solution.
The DFS algorithm goes deep down in its search, and it may sometimes
enter an infinite loop.
Example:
In the search tree below, we show the flow of depth-first search, which follows
the order:
Root node → Left node → Right node
It will start searching from root node S and traverse A, then B, then D and E. After
traversing E, it will backtrack the tree, as E has no other successor and the goal node
has still not been found. After backtracking, it will traverse node C and then G, where
it will terminate, as it has found the goal node.
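A short Python sketch of this depth-first traversal follows; the tree structure is an
assumption matching the description above (S has children A and C, A leads to B, and
so on).

    def dfs(tree, start, goal):
        # Expand the deepest node first using an explicit stack of paths,
        # backtracking when a branch is exhausted.
        stack = [[start]]
        while stack:
            path = stack.pop()
            node = path[-1]
            if node == goal:
                return path
            # Push children in reverse so the leftmost child is expanded first.
            for child in reversed(tree.get(node, [])):
                stack.append(path + [child])
        return None

    # Hypothetical tree matching the example's description.
    tree = {'S': ['A', 'C'], 'A': ['B'], 'B': ['D', 'E'], 'C': ['G']}
    print(dfs(tree, 'S', 'G'))   # visits S, A, B, D, E, then backtracks to C, G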
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
Example:
Disadvantages:
It does not care about the number of steps involved in the search; it is only
concerned with path cost, due to which this algorithm may get stuck in an
infinite loop.
Example:
Completeness:
Uniform-cost search is complete, such as if there is a solution, UCS will find it.
Time Complexity:
Let C* be the cost of the optimal solution, and ε the cost of each step toward the goal
node. Then the number of steps is C*/ε + 1; we add 1 because we start
from state 0 and end at C*/ε.
Hence, the worst-case time complexity of uniform-cost search is O(b^(1 + ⌊C*/ε⌋)).
Space Complexity:
By the same logic, the worst-case space complexity of uniform-cost search is also
O(b^(1 + ⌊C*/ε⌋)).
Optimal:
Uniform-cost search is always optimal as it only selects a path with the lowest
path cost.
Advantages:
It combines the benefits of the BFS and DFS search algorithms in terms of fast
search and memory efficiency.
Disadvantages:
The main drawback of IDDFS is that it repeats all the work of the previous
phase.
Example:
The following tree structure shows the iterative deepening depth-first search.
The IDDFS algorithm performs several iterations until it finds the goal node. The
iterations performed by the algorithm are given as:
Completeness:
This algorithm is complete if the branching factor is finite.
Time Complexity:
Let b be the branching factor and d the depth of the goal node; then the worst-case
time complexity is O(b^d).
Space Complexity:
The space complexity of IDDFS will be O(bd).
Optimal:
The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the
depth of the node.
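A compact Python sketch of iterative deepening, assuming the same adjacency-list
tree encoding used in the earlier examples:

    def depth_limited(tree, node, goal, limit, path):
        # Ordinary DFS that refuses to go deeper than `limit`.
        if node == goal:
            return path
        if limit == 0:
            return None
        for child in tree.get(node, []):
            result = depth_limited(tree, child, goal, limit - 1, path + [child])
            if result is not None:
                return result
        return None

    def iddfs(tree, start, goal, max_depth=20):
        # Repeat depth-limited search with limits 0, 1, 2, ..., combining
        # DFS's O(bd) memory with BFS-like completeness.
        for limit in range(max_depth + 1):
            result = depth_limited(tree, start, goal, limit, [start])
            if result is not None:
                return result
        return None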
A heuristic is a technique that is used to solve a problem faster than classic
methods. These techniques are used to find an approximate solution to a problem
when classical methods fail to find one. Heuristics are said to be problem-solving
techniques that result in practical and quick solutions.
Heuristics are strategies that are derived from past experience with similar
problems. Heuristics use practical methods and shortcuts used to produce the
solutions that may or may not be optimal, but those solutions are sufficient in a given
limited timeframe.
Fig. 1.10.
Some widely used heuristic search techniques are:
Hill Climbing
Best First search
Beam search
First, let's talk about the Hill climbing in Artificial intelligence.
Best-first search allows us to take the benefit of both algorithms (BFS and DFS). It
uses the heuristic function and search. With the help of the best-first search, at each
step, we can choose the most promising node.
Best first search algorithm:
Step 1: Place the starting node into the OPEN list.
Step 2: If the OPEN list is empty, Stop and return failure.
Step 3: Remove the node n from the OPEN list which has the lowest value of
h(n), and place it in the CLOSED list.
Step 4: Expand the node n, and generate the successors of node n.
Step 5: Check each successor of node n, and find whether any node is a goal
node or not. If any successor node is the goal node, then return
success and stop the search, else continue to next step.
Step 6: For each successor node, the algorithm checks its evaluation function
f(n) and then checks whether the node is already in the OPEN or
CLOSED list. If the node is in neither list, then add it to the OPEN list.
Step 7: Return to Step 2.
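The steps above can be sketched in Python with a priority queue ordered by h(n).
The graph and the heuristic table h are assumptions supplied by the caller; this is an
illustrative skeleton, not a complete implementation.

    import heapq

    def best_first_search(graph, h, start, goal):
        open_list = [(h[start], start, [start])]      # Step 1: OPEN holds (h, node, path)
        closed = set()
        while open_list:                              # Step 2: fail when OPEN is empty
            _, node, path = heapq.heappop(open_list)  # Step 3: lowest h(n)
            if node == goal:                          # Step 5: goal test
                return path
            closed.add(node)
            for succ in graph.get(node, []):          # Step 4: expand node n
                if succ not in closed:                # Step 6 (simplified membership check)
                    heapq.heappush(open_list, (h[succ], succ, path + [succ]))
        return None                                   # failure

Note that this sketch applies the goal test when a node is removed from the OPEN
list rather than at successor generation (Step 5); both orderings are common in
practice.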
A* Search Algorithm
A* search is the most commonly known form of best-first search. It uses the
heuristic function h(n) and the cost g(n) to reach node n from the start state. It
combines the features of UCS and greedy best-first search, by which it solves
problems efficiently.
It finds the shortest path through the search space using the heuristic function.
This search algorithm expands fewer search tree nodes and gives optimal results faster.
Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check if the OPEN list is empty or not. If the list is empty, then return
failure and stop.
Step 3: Select the node from the OPEN list which has the smallest value of
the evaluation function (g + h). If node n is the goal node, then return
success and stop; otherwise, proceed to Step 4.
Step 4: Expand node n and generate all of its successors, and put n into the
CLOSED list. For each successor n′, check whether n′ is already in the
OPEN or CLOSED list; if it is not in either list, compute its evaluation
function and add it to the OPEN list.
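A hedged Python sketch of the algorithm follows, with edges given as
(successor, step cost) pairs and h as a heuristic table; both are assumptions
supplied by the caller.

    import heapq

    def a_star(graph, h, start, goal):
        # OPEN list ordered by f(n) = g(n) + h(n).
        open_list = [(h[start], 0, start, [start])]   # (f, g, node, path)
        best_g = {start: 0}
        while open_list:
            f, g, node, path = heapq.heappop(open_list)
            if node == goal:
                return path, g
            for succ, cost in graph.get(node, []):
                g2 = g + cost
                if g2 < best_g.get(succ, float('inf')):  # keep the cheapest route found
                    best_g[succ] = g2
                    heapq.heappush(open_list, (g2 + h[succ], g2, succ, path + [succ]))
        return None, float('inf')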
Some of the real-life examples of heuristics that people use as a way to solve a
problem:
Common sense: It is a heuristic that is used to solve a problem based on
the observation of an individual.
Rule of thumb: In heuristics, we also use a term rule of thumb. This
heuristic allows an individual to make an approximation without doing an
exhaustive search.
Working backward: It lets an individual solve a problem by assuming
that the problem has already been solved and working backward in
their mind to see how such a solution could have been reached.
Availability heuristic: It allows a person to judge a situation based on the
examples of similar situations that come to mind.
Familiarity heuristic: It allows a person to approach a problem on the
fact that an individual is familiar with the same situation, so one should act
similarly as he/she acted in the same situation before.
Educated guess: It allows a person to reach a conclusion without doing an
exhaustive search. Using it, a person considers what they have observed in
the past and applies that history to a situation for which no definite
answer has been decided yet.
Types of heuristics
There are various types of heuristics, including the availability heuristic, affect
heuristic and representative heuristic. Each heuristic type plays a role in decision-
making. Let's discuss about the Availability heuristic, affect heuristic, and
Representative heuristic.
Availability heuristic
Availability heuristic is said to be the judgment that people make regarding the
likelihood of an event based on information that quickly comes into mind. On
making decisions, people typically rely on the past knowledge or experience of an
event. It allows a person to judge a situation based on the examples of similar
situations that come to mind.
Representative heuristic
It occurs when we evaluate an event's probability on the basis of its similarity with
another event.
Example: We can understand the representative heuristic by the example of
product packaging, as consumers tend to associate a product's quality with its
external packaging. If a company packages its products in a way that reminds you
of a high-quality, well-known product, then consumers will assume that product has
the same quality as the branded product.
So, instead of evaluating the product based on its quality, customers judge the
product's quality by the similarity in packaging.
Affect heuristic
It is based on the negative and positive feelings that are linked with a certain
stimulus. It includes quick feelings that are based on past beliefs. Its theory is one's
emotional response to a stimulus that can affect the decisions taken by an individual.
When people do not take the time to evaluate a situation carefully, they may base
their decisions on their emotional response.
Example: The affect heuristic can be understood by the example of
advertisements. Advertisements can influence the emotions of consumers, so it
affects the purchasing decision of a consumer. The most common examples of
advertisements are the ads of fast food. When fast-food companies run the
advertisement, they hope to obtain a positive emotional response that pushes you to
positively view their products.
If someone carefully analyzes the benefits and risks of consuming fast food, they
might decide that fast food is unhealthy. But people rarely take time to evaluate
everything they see and generally make decisions based on their automatic emotional
response. So, Fast food companies present advertisements that rely on such type of
Affect heuristic for generating a positive emotional response which results in sales.
Each point (state) in the landscape has an "elevation," defined by the value of the
objective function. If elevation corresponds to an objective function, then the aim is
to find the highest peak, a global maximum; this is known as hill climbing. If
elevation corresponds to cost, then the aim is to find the lowest valley, a global
minimum; this is known as gradient descent.
Hill-climbing Search
The hill-climbing search algorithm is the most basic local search
technique. At each step the current node is replaced by the best neighbor. The hill-
climbing search algorithm keeps track of one current state and on each iteration
moves to the neighboring state with the highest value; that is, it heads in the direction
that provides the steepest ascent. It terminates when it reaches a "peak" where no
neighbor has a higher value. Hill climbing does not look ahead beyond the
immediate neighbors of the current state.
Plateaus: A plateau is a flat area of the state-space landscape. It can be a flat local
maximum, from which no uphill exit exists, or a shoulder, from which progress is
possible.
Many variants of hill climbing have been invented. Stochastic hill climbing
chooses at random from among the uphill moves; the probability of selection can
vary with the steepness of the uphill move. This usually converges more slowly than
steepest ascent, but in some state landscapes, it finds better solutions. First-choice
hill climbing implements stochastic hill climbing by generating successors randomly
until one is generated that is better than the current state. This is a good strategy
when a state has many (e.g., thousands) of successors.
Another variant is random-restart hill climbing, which adopts the adage "If at first
you don't succeed, try, try again." It conducts a series of hill-climbing searches from
randomly generated initial states, until a goal is found. It is complete with probability
1, because it will eventually generate a goal state as the initial state. If each hill-
climbing search has a probability p of success, then the expected number of restarts
required is 1/p. The expected number of steps is the cost of one successful iteration
plus (1 − p)/p times the cost of failure. For 8-queens, random-restart hill climbing is
very effective indeed. Even for three million queens, the approach can find solutions
in seconds.
The success of hill climbing depends very much on the shape of the state-space
landscape: if there are few local maxima and plateaus, random-restart hill climbing
will find a good solution very quickly.
Example:
To illustrate hill climbing, consider the 8-queens problem. The complete-state
formulation is used here, which means that every state has all the components of a
solution, but they might not all be in the right place. In this case every state has 8
queens on the board, one per column. The initial state is chosen at random, and the
successors of a state are all possible states generated by moving a single queen to
another square in the same column (so each state has 8 × 7 = 56 successors). The
heuristic cost function is the number of pairs of queens that are attacking each other;
this will be zero only for solutions. (It counts as an attack if two pieces are in the
same line, even if there is an intervening piece between them.)
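The following Python sketch implements steepest-ascent hill climbing for this
formulation, with a random-restart wrapper as described earlier. The board encoding
(one queen per column, with state[c] giving the row of the queen in column c) is a
standard convention, assumed here.

    import random

    def conflicts(state):
        # Heuristic h: number of pairs of queens attacking each other
        # (same row or same diagonal; columns are distinct by construction).
        n, h = len(state), 0
        for i in range(n):
            for j in range(i + 1, n):
                if state[i] == state[j] or abs(state[i] - state[j]) == j - i:
                    h += 1
        return h

    def hill_climb(n=8):
        # Steepest ascent: move to the best of the 8 x 7 = 56 single-queen moves.
        state = [random.randrange(n) for _ in range(n)]
        while True:
            h = conflicts(state)
            if h == 0:
                return state
            best_h, best_state = h, None
            for col in range(n):
                for row in range(n):
                    if row != state[col]:
                        neighbor = state[:]
                        neighbor[col] = row
                        nh = conflicts(neighbor)
                        if nh < best_h:
                            best_h, best_state = nh, neighbor
            if best_state is None:       # stuck on a local minimum or plateau
                return None
            state = best_state

    def random_restart(n=8):
        # Restart from a fresh random state until a solution is found.
        while True:
            result = hill_climb(n)
            if result is not None:
                return result

    print(random_restart())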
Figure (a) : The 8-queens problem: place 8 queens on a chess board so that no queen
attacks another. (A queen attacks any piece in the same row, column, or diagonal.)
The figure (b) shows the h values of all its successors.
Solution:
Define the heuristic function h(x) as follows:
h(x) = +1 for every block in the structure that is correctly positioned, and
h(x) = −1 for every incorrectly placed block in the structure.
Adversarial Search
Adversarial search is a search in which we examine the problem that arises when
we try to plan ahead in a world where other agents are planning against us.
In previous topics, we have studied search strategies that are associated
with a single agent aiming to find a solution, often expressed in the
form of a sequence of actions.
But there might be some situations where more than one agent is
searching for a solution in the same search space; this situation
usually occurs in game playing.
An environment with more than one agent is termed a multi-agent
environment, in which each agent is an opponent of the other agents and
plays against them. Each agent needs to consider the actions of the
other agents and the effect of those actions on its own performance.
So, Searches in which two or more players with conflicting goals are
trying to explore the same search space for the solution, are called
adversarial searches, often known as Games.
Games are modeled as a Search problem and heuristic evaluation function,
and these are the two main factors which help to model and solve games in
AI.
Zero-Sum Game
Zero-sum games are adversarial searches that involve pure competition.
In a zero-sum game, each agent's gain or loss of utility is exactly balanced
by the losses or gains of utility of the other agent.
One player of the game tries to maximize one single value, while the other
player tries to minimize it.
Each move by one player in the game is called a ply.
Chess and tic-tac-toe are examples of zero-sum games.
Game tree:
A game tree is a tree where nodes of the tree are the game states and Edges of the
tree are the moves by players. Game tree involves initial state, actions function, and
result Function.
Example Explanation:
From the initial state, MAX has 9 possible moves, as he plays first. MAX
places x and MIN places o, and the two players move alternately until we
reach a leaf node where one player has three in a row or all the squares are
filled.
Both players compute the minimax value of each node: the best
achievable utility against an optimal adversary.
Suppose both players are well aware of tic-tac-toe and play their best.
Each player does his best to prevent the other one from winning. MIN
acts against MAX in the game.
So in the game tree, we have a layer of MAX and a layer of MIN, and each
layer is called a ply. MAX places x, then MIN puts o to prevent MAX from
winning, and this game continues until a terminal node is reached.
In this either MIN wins, MAX wins, or it's a draw. This game-tree is the
whole search space of possibilities that MIN and MAX are playing tic-tac-
toe and taking turns alternately.
Hence the adversarial search for the minimax procedure works as follows:
It aims to find the optimal strategy for MAX to win the game.
It follows the approach of depth-first search.
In the game tree, the optimal leaf node could appear at any depth of the tree.
The minimax values are propagated up the tree once terminal nodes are
discovered.
In a given game tree, the optimal strategy can be determined from the minimax
value of each node, which can be written as MINIMAX(n). MAX prefers to move to a
state of maximum value and MIN prefers to move to a state of minimum value. Then:
For a state s:

MINIMAX(s) =
    UTILITY(s)                                          if TERMINAL-TEST(s)
    max over a ∈ Actions(s) of MINIMAX(RESULT(s, a))    if PLAYER(s) = MAX
    min over a ∈ Actions(s) of MINIMAX(RESULT(s, a))    if PLAYER(s) = MIN
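The recursive definition above translates almost line for line into Python. The
`game` object and its method names (terminal_test, utility, actions, result) are
assumptions standing in for a concrete game description.

    def minimax(state, player, game):
        # MINIMAX(s): utility at terminal states; otherwise the max over MAX's
        # moves or the min over MIN's moves of the successor values.
        if game.terminal_test(state):
            return game.utility(state)
        next_player = 'MIN' if player == 'MAX' else 'MAX'
        values = [minimax(game.result(state, a), next_player, game)
                  for a in game.actions(state)]
        return max(values) if player == 'MAX' else min(values)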
A problem is solved when each variable has a value that satisfies all the
constraints on the variable. Such a problem is called a constraint satisfaction
problem, or CSP.
For example, if X1 and X2 both have the domain {1, 2, 3}, then the constraint
saying that X1 must be greater than X2 can be written as <(X1, X2), {(2, 1), (3, 1), (3, 2)}>
or <(X1, X2), X1 > X2>.
CSPs deal with assignments of values to variables, {Xi = vi, Xj = vj, …}. An
assignment that does not violate any constraints is called a consistent or legal
assignment. A complete assignment is one in which every variable is assigned a
value, and a solution to a CSP is a consistent, complete assignment. A partial
assignment is one that leaves some variables unassigned, and a partial solution is a
partial assignment that is consistent.
The domain of every variable is the set Di = {red, green, blue}. The constraints
require neighboring regions to have distinct colors:
C = {SA ≠ WA, SA ≠ NT, SA ≠ Q, SA ≠ NSW, SA ≠ V, WA ≠ NT, NT ≠ Q,
Q ≠ NSW, NSW ≠ V}.
There are many possible solutions to this problem, such as
{WA = red, NT = green, Q = red, NSW = green, V = red, SA = blue, T = red}.
It can be helpful to visualize a CSP as a constraint graph. The nodes of the graph
correspond to variables of the problem, and an edge connects any two variables that
participate in a constraint.
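A small Python sketch of this CSP solved by simple backtracking search; the
variable and value orderings are naive, which is enough for a problem of this size.

    domains = {v: ['red', 'green', 'blue']
               for v in ['WA', 'NT', 'SA', 'Q', 'NSW', 'V', 'T']}
    neighbors = {
        'SA': ['WA', 'NT', 'Q', 'NSW', 'V'],
        'WA': ['SA', 'NT'], 'NT': ['SA', 'WA', 'Q'],
        'Q': ['SA', 'NT', 'NSW'], 'NSW': ['SA', 'Q', 'V'],
        'V': ['SA', 'NSW'], 'T': [],
    }

    def backtrack(assignment):
        # Extend a consistent partial assignment one variable at a time.
        if len(assignment) == len(domains):
            return assignment                      # complete and consistent
        var = next(v for v in domains if v not in assignment)
        for value in domains[var]:
            if all(assignment.get(nb) != value for nb in neighbors[var]):
                result = backtrack({**assignment, var: value})
                if result is not None:
                    return result
        return None                                # no value works: backtrack

    print(backtrack({}))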
Constraints:
Assign a decimal digit to each of the letters in such a way that the answer
is correct.
Every occurrence of a letter must be assigned the same digit.
No two letters may be assigned the same digit; a unique digit is assigned
to each letter.
Rules:
From column 5, M = 1, since it is the only carry-over possible from the sum of
two single-digit numbers in column 4.
Column:   5    4    3    2    1
               S    E    N    D
          +    1    O    R    E
Carries:       c3   c2   c1
          1    O    N    E    Y
To produce a carry from column 4 to column 5, 'S + M' must be at least 9, so
S = 8 or 9 and S + M = 9 or 10, and so O = 0 or 1. But M = 1, so O = 0.
Column:   5    4    3    2    1
               9    E    N    D
          +    1    0    R    E
Carries:       c3   c2   c1
          1    0    N    E    Y
The key idea is local consistency. If each variable is treated as a node in a graph
and each binary constraint as an arc, then the process of enforcing local consistency
in each part of the graph causes inconsistent values to be eliminated throughout the
graph. There are different types of local consistency, which are as follows.
Node consistency
A single variable (corresponding to a node in the CSP network) is node-consistent
if all the values in the variable's domain satisfy the variable's unary constraints. For
example, in the variant of the Australia map-coloring problem where South
Australians dislike green, the variable SA starts with domain {red , green, blue}, and
this can be made node consistent by eliminating green, leaving SA with the reduced
domain {red , blue}. Thus a network is node-consistent if every variable in the
network is node-consistent. It is always possible to eliminate all the unary constraints
in a CSP by running node consistency.
Arc consistency
A variable in a CSP is arc-consistent if every value in its domain satisfies the
variable's binary constraints. More formally, Xi is arc-consistent with respect to
another variable Xj if for every value in the current domain Di there is some value in
the domain Dj that satisfies the binary constraint on the arc (Xi, Xj). A network is
arc-consistent if every variable is arc-consistent with every other variable. For
example, consider the constraint Y = X² where the domain of both X and Y is the set
of digits. This constraint can be written explicitly as
<(X, Y), {(0, 0), (1, 1), (2, 4), (3, 9)}>.
To make X arc-consistent with respect to Y, reduce X's domain to {0, 1, 2, 3}.
To also make Y arc-consistent with respect to X, Y's domain becomes
{0, 1, 4, 9}, and then the whole CSP is arc-consistent.
The most popular algorithm for arc consistency is called AC-3 .
function AC-3(csp) returns false if an inconsistency is found and true otherwise
    queue ← a queue of arcs, initially all the arcs in csp
    while queue is not empty do
        (Xi, Xj) ← POP(queue)
        if REVISE(csp, Xi, Xj) then
            if size of Di = 0 then return false
            for each Xk in Xi.NEIGHBORS − {Xj} do
                add (Xk, Xi) to queue
    return true
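The same algorithm can be sketched in Python. Here `constraints` maps each arc
(Xi, Xj) to a predicate on (x, y); this representation is an assumption made for
illustration.

    from collections import deque

    def revise(domains, constraints, xi, xj):
        # Remove values of Xi that have no consistent partner in Dj.
        revised = False
        for x in set(domains[xi]):
            if not any(constraints[xi, xj](x, y) for y in domains[xj]):
                domains[xi].discard(x)
                revised = True
        return revised

    def ac3(domains, constraints):
        queue = deque(constraints)             # initially all the arcs
        while queue:
            xi, xj = queue.popleft()
            if revise(domains, constraints, xi, xj):
                if not domains[xi]:
                    return False               # empty domain: inconsistency
                for xk, _ in (a for a in constraints if a[1] == xi and a[0] != xj):
                    queue.append((xk, xi))     # re-examine Xi's other neighbors
        return True

    # The Y = X^2 example from the text, over the decimal digits:
    domains = {'X': set(range(10)), 'Y': set(range(10))}
    constraints = {('X', 'Y'): lambda x, y: y == x ** 2,
                   ('Y', 'X'): lambda y, x: y == x ** 2}
    ac3(domains, constraints)
    print(domains)   # X -> {0, 1, 2, 3}, Y -> {0, 1, 4, 9}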
Path Consistency
Path consistency tightens the binary constraints by using implicit constraints that
are inferred by looking at triples of variables.
A two-variable set {Xi, Xj} is path-consistent with respect to a third variable Xm
if, for every assignment {Xi = a, Xj = b} consistent with the constraints on {Xi, Xj},
there is an assignment to Xm that satisfies the constraints on {Xi, Xm} and {Xm, Xj}.
This is called path consistency because one can think of it as looking at a path
from Xi to Xj with Xm in the middle.
Let's consider how path consistency fares in coloring the Australia map with two
colors. Consider the set {WA, SA}, which is path-consistent with respect to NT, by
enumerating the consistent assignments to the set. There are only two assignments:
{WA = red, SA = blue} and {WA = blue, SA = red}. In both of these assignments NT
can be neither red nor blue (because it would conflict with either WA or SA).
Because there is no valid choice for NT, both assignments can be eliminated, and
there are no valid assignments for {WA, SA}. Therefore, there can be no solution to this
problem.
K -consistency
Stronger forms of propagation can be defined with the notion of k-consistency. A
CSP is k-consistent if, for any set of k − 1 variables and for any consistent assignment
to those variables, a consistent value can always be assigned to any kth variable.
A CSP is strongly k-consistent if it is k-consistent and is also (k − 1)-consistent,
(k − 2)-consistent, all the way down to 1-consistent.
Global constraints
A global constraint is one involving an arbitrary number of variables (but not
necessarily all variables). Global constraints occur frequently in real problems and
can be handled by special- purpose algorithms. For example, the All diff constraint
says that all the variables involved must have distinct values (as in the
cryptarithmetic problem and Sudoku puzzles).
One simple form of inconsistency detection for Alldiff constraints works as
follows: if m variables are involved in the constraint, and if they have n possible
distinct values altogether, and m > n, then the constraint cannot be satisfied.
This leads to the following simple algorithm:
First, remove any variable in the constraint that has a singleton domain,
and delete that variable's value from the domains of the remaining
variables.
Repeat as long as there are singleton variables.
If at any point an empty domain is produced or there are more variables
than domain values left, then an inconsistency has been detected.
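The simple algorithm above can be sketched in Python as follows, where `domains`
is a dict mapping each variable in the Alldiff constraint to its set of possible values.

    def alldiff_inconsistent(domains):
        # Return True if the Alldiff constraint is detectably unsatisfiable.
        domains = {v: set(d) for v, d in domains.items()}
        singles = [v for v in domains if len(domains[v]) == 1]
        while singles:
            v = singles.pop()
            (val,) = domains.pop(v)            # remove the singleton variable
            for w, d in domains.items():
                if val in d:
                    d.discard(val)             # delete its value elsewhere
                    if not d:
                        return True            # empty domain produced
                    if len(d) == 1:
                        singles.append(w)      # repeat while singletons remain
        values = set().union(*domains.values()) if domains else set()
        return len(domains) > len(values)      # m > n: cannot be satisfied

    # Three variables sharing only two possible values is inconsistent:
    print(alldiff_inconsistent({'A': {1, 2}, 'B': {1, 2}, 'C': {1, 2}}))   # True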
Sudoku
The popular Sudoku puzzle has introduced millions of people to constraint
satisfaction problems, although they may not realize it.
A Sudoku board consists of 81 squares, some of which are initially filled
with digits from 1 to 9. The puzzle is to fill in all the remaining squares
such that no digit appears twice in any row, column, or 3 × 3 box.
A row, column, or box is called a unit.
Fig. 1.11.
A Sudoku puzzle can be considered a CSP with 81 variables, one for each square.
The variable names A1 through A9 are used for the top row (left to right), down to I1
through I9 for the bottom row. The empty squares have the domain {1, 2, 3, 4, 5, 6, 7,
8, 9}, and the pre-filled squares have a domain consisting of a single value. In
addition, there are 27 different Alldiff constraints, one for each unit (row, column,
and box of 9 squares):
Alldiff(A1, A2, A3, A4, A5, A6, A7, A8, A9)
Alldiff(B1, B2, B3, B4, B5, B6, B7, B8, B9)
…
Alldiff(A1, B1, C1, D1, E1, F1, G1, H1, I1)
Alldiff(A2, B2, C2, D2, E2, F2, G2, H2, I2)
…
Alldiff(A1, A2, A3, B1, B2, B3, C1, C2, C3)
Alldiff(A4, A5, A6, B4, B5, B6, C4, C5, C6)
…
Fig. 1.12.
where variables are assigned in the order WA, NT, Q, …. Because the
representation of CSPs is standardized, there is no need to supply
BACKTRACKING-SEARCH with a domain-specific initial state, action function,
transition model, or goal test.
2. What is A.I?
Artificial Intelligence is a branch of computer science that deals with
developing intelligent machines which can behave like humans, think like humans,
and have the ability to make decisions on their own.
Artificial Intelligence is a combination of two words Artificial and
Intelligence, which refers to man-made intelligence. Therefore, when machines
are equipped with man-made intelligence to perform intelligent tasks similar to
humans, it is known as Artificial Intelligence.
Search
Execution
goal nodes. Uninformed search applies a way in which the search tree is searched
without any information about the search space, such as the initial state, operators, and
test for the goal, so it is also called blind search. It examines each node of the
tree until it achieves the goal node.
It can be divided into five main types:
Breadth-first search
Uniform cost search
Depth-first search
Iterative deepening depth-first search
Bidirectional Search
Breadth-first Search:
Breadth-first search is the most common search strategy for traversing a
tree or graph. This algorithm searches breadthwise in a tree or graph, so it
is called breadth-first search.
The BFS algorithm starts searching from the root node of the tree and expands
all successor nodes at the current level before moving to nodes of the next
level.
Depth-first Search
Depth-first search is a recursive algorithm for traversing a tree or graph
data structure.
It is called the depth-first search because it starts from the root node and
follows each path to its greatest depth node before moving to the next
path.
DFS uses a stack data structure for its implementation.
The process of the DFS algorithm is similar to the BFS algorithm.
******************
UNIT II
PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – naïve bayes models. Probabilistic
reasoning – Bayesian networks – exact inference in BN – approximate inference in
BN – causal networks.
Uncertainty:
Till now, we have learned knowledge representation using first-order logic and
propositional logic with certainty, which means we were sure about the predicates.
With this knowledge representation we might write A → B, which means if A is true
then B is true. But consider a situation where we are not sure whether A is true
or not; then we cannot express this statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates,
we need uncertain reasoning or probabilistic reasoning.
Causes of Uncertainty:
Following are some leading causes of uncertainty to occur in the real world.
1. Information obtained from unreliable sources.
2. Experimental errors.
3. Equipment faults.
4. Temperature variations.
5. Climate change.
Probabilistic Reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the
concept of probability to indicate the uncertainty in knowledge. In probabilistic
reasoning, we combine probability theory with logic to handle the uncertainty.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and
objects in the real world.
Prior probability: The prior probability of an event is probability computed
before observing new information.
Posterior Probability: The probability that is calculated after all evidence or
information has been taken into account. It is a combination of prior probability and
new information.
Conditional Probability:
Conditional probability is the probability of an event occurring given that another
event has already happened.
Suppose we want to calculate the probability of event A when event B has already
occurred, "the probability of A under the conditions of B"; it can be written as:
P(A | B) = P(A ∧ B) / P(B)

where P(A ∧ B) = joint probability of A and B, and
P(B) = marginal probability of B.
Fig. 2.1.
If the probability of A is given and we need to find the probability of B, then it
will be given as:
P(B | A) = P(A ∧ B) / P(A)
This can be explained using the Venn diagram below: since B has occurred, the
sample space is reduced to the set B, and we can now calculate the probability of
event A given B by dividing P(A ∧ B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both
English and mathematics. What percent of the students who like English also like
mathematics?
Solution:
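Applying the conditional probability formula given earlier:

P(Mathematics | English) = P(English ∧ Mathematics) / P(English)
                         = 0.40 / 0.70
                         ≈ 0.57

Hence, about 57% of the students who like English also like mathematics.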
Statistics is the study that helps us quantify the way we measure uncertainty, and
hence the concept of 'Probability' was introduced.
There are 3 different approaches available to determine the probability of an
event.
Classical
Frequentist
Bayesian
Let's understand the differences among these 3 approaches with the help of a
simple example.
Suppose we're rolling a fair six-sided die and we want to ask what is the
probability that the die shows a four? Under the Classical framework, all the possible
probability that the die shows a four? Under the Classical framework, all the possible
outcomes are equally likely i.e., they have equal probabilities or chances. Hence,
answering the above question, there are six possible outcomes and they are all
equally likely. So, the probability of a four on a fair six-sided die is just 1/6. This
Classical approach works well when we have well-defined equally likely outcomes.
But when things get a little subjective then it may become a little complex.
data is divided into several buckets and the frequencies of each bucket loss are
determined by expert judgment and then fitted into probability distributions.
Naïve Bayes
The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which
can be described as:
Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features. For
example, if a fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Hence each feature
individually contributes to identifying it as an apple, without depending on
the others.
Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
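In symbols, for a hypothesis A and observed evidence B:

P(A | B) = P(B | A) · P(A) / P(B)

where P(A) is the prior probability of the hypothesis, P(B | A) is the likelihood of
the evidence given the hypothesis, and P(A | B) is the posterior probability.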
Weather     No            Yes
Overcast    0             5              5/14 = 0.35
Rainy       2             2              4/14 = 0.29
Sunny       2             3              5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
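As a worked illustration of how such a frequency table feeds Bayes' theorem, the
short Python sketch below estimates whether to play on a sunny day, using the
counts from the table above:

    # Likelihoods and priors read off the frequency table.
    p_yes = 10 / 14                  # P(Yes)
    p_no = 4 / 14                    # P(No)
    p_sunny_yes = 3 / 10             # P(Sunny | Yes)
    p_sunny_no = 2 / 4               # P(Sunny | No)
    p_sunny = 5 / 14                 # P(Sunny)

    # Bayes' theorem: P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny)
    p_yes_sunny = p_sunny_yes * p_yes / p_sunny
    p_no_sunny = p_sunny_no * p_no / p_sunny
    print(p_yes_sunny, p_no_sunny)   # 0.60 vs 0.40, so predict "Yes"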
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
Fig. 2.2.
Each node corresponds to a random variable, and a variable can
be continuous or discrete.
Arcs or directed arrows represent the causal relationships or conditional
probabilities between random variables. These directed links or arrows
connect pairs of nodes in the graph.
These links represent that one node directly influences the other node;
if there is no directed link, the nodes are independent of each
other.
In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycle. Hence,
it is known as a directed acyclic graph, or DAG.
Solution:
The Bayesian network for the above problem is given below. The network
structure shows that burglary and earthquake are the parent nodes of the
alarm and directly affect the probability of the alarm going off, whereas David's
and Sophia's calls depend on the alarm's probability.
Fig. 2.3.
We can write the events of the problem statement in the form of the probability
P[D, S, A, B, E], and can rewrite this probability statement using the joint probability
distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
                 = P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
                 = P[D | A] · P[S | A, B, E] · P[A, B, E]
                 = P[D | A] · P[S | A] · P[A | B, E] · P[B, E]
                 = P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]
Let's take the observed probability for the Burglary and earthquake component:
P(B = True) = 0.002, which is the probability of burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake
P(E = False) = 0.999, which is the probability that an earthquake did not occur.
We can provide the conditional probabilities as per the below tables:
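A sketch of this factorization in Python follows. The CPT values below are the
standard figures used with this example; since the tables themselves are not
reproduced here, treat them as assumptions.

    P_B = {True: 0.002, False: 0.998}        # P(Burglary)
    P_E = {True: 0.001, False: 0.999}        # P(Earthquake)
    P_A = {                                  # P(Alarm = true | B, E)
        (True, True): 0.94, (True, False): 0.95,
        (False, True): 0.31, (False, False): 0.001,
    }
    P_D = {True: 0.91, False: 0.05}          # P(David calls = true | Alarm)
    P_S = {True: 0.75, False: 0.02}          # P(Sophia calls = true | Alarm)

    def joint(d, s, a, b, e):
        # P[D, S, A, B, E] = P[D|A] P[S|A] P[A|B,E] P[B] P[E]
        pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
        pd = P_D[a] if d else 1 - P_D[a]
        ps = P_S[a] if s else 1 - P_S[a]
        return pd * ps * pa * P_B[b] * P_E[e]

    # Both neighbours call, the alarm rang, but there was neither a
    # burglary nor an earthquake:
    print(joint(True, True, True, False, False))   # ~0.00068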
Inference by enumeration
Any conditional probability can be computed by summing terms from the full
joint distribution. More specifically, a query P (X | e) can be answered using
Equation which we repeat here for convenience:
P(X | e) = α P(X, e) = α Σy P(X, e, y)
Now, a Bayesian network gives a complete representation of the full joint
distribution. More specifically, Equation shows that the terms P(x, e, y) in the joint
distribution can be written as products of conditional probabilities from the network.
Therefore, a query can be answered using a Bayesian network by computing sums of
products of conditional probabilities from the network.
In Figure, an algorithm, ENUMERATE-JOINT-ASK, was given for inference by
enumeration from the full joint distribution.
The algorithm takes as input a full joint distribution P and looks up values therein.
It is a simple matter to modify the algorithm so that it takes as input a Bayesian
network bn and "looks up" joint entries by multiplying the corresponding CPT entries
from bn.
Consider the query P(Burglary | John Calls = true, Mary Calls = true).
The hidden variables for this query are Earthquake and Alarm. Using
initial letters for the variables to shorten the expressions, we have:

P(B | j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)
The P(b) term is a constant and can be moved outside the summations over a and
e, and the P(e) term can be moved outside the summation over a. Hence, we have:
This expression can be evaluated by looping through the variables in order,
multiplying CPT entries as we go. For each summation, we also need to loop over
the variable's possible values. The structure of this computation is shown in the
figure. Using the numbers from the figure, we obtain P(b | j, m) = α × 0.00059224.
The corresponding computation for ¬b yields α × 0.0014919; hence

P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
That is, the chance of a burglary, given calls from both neighbors, is about 28%.
The evaluation process for the expression is shown as an expression tree
in the figure. The ENUMERATION-ASK algorithm evaluates such trees using depth-
first recursion. Thus, the space complexity of ENUMERATION-ASK is only linear
in the number of variables: effectively, the algorithm sums over the full joint
distribution without ever constructing it explicitly. Unfortunately, its time complexity
for a network with n Boolean variables is always O(2^n), better than the O(n 2^n) for the
simple approach described earlier, but still rather grim. One thing to note about the
tree is that it makes explicit the repeated subexpressions that are evaluated by the
algorithm. The products P(j | a) P(m | a) and P(j | ¬a) P(m | ¬a) are computed twice, once
for each value of e. The next section describes a general method that avoids such
wasted computations.
Fig. 2.4. The structure of the expression shown in the equation. The evaluation proceeds top-
down, multiplying values along each path and summing at the '+' nodes. Notice the
repetition of the paths for j and m.
Direct Sampling
Take samples of events
We expect the frequency of the samples to converge on the probability of
the event
Rejection Sampling
Used to compute conditional probabilities P(X | e)
Generate samples as before
Reject samples that do not match the evidence
Estimate by counting how often event X occurs in the remaining samples
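A minimal Python sketch of rejection sampling on a tiny two-node network
(Rain → WetGrass); the network and its probabilities are illustrative assumptions,
not taken from the text.

    import random

    def prior_sample():
        # Draw one complete sample in topological order.
        rain = random.random() < 0.2
        wet = random.random() < (0.9 if rain else 0.1)
        return {'Rain': rain, 'Wet': wet}

    def rejection_sampling(query, evidence, n=100_000):
        # Sample from the prior, reject samples inconsistent with the
        # evidence, and estimate from the samples that remain.
        hits = total = 0
        for _ in range(n):
            s = prior_sample()
            if all(s[v] == val for v, val in evidence.items()):
                total += 1
                hits += s[query]
        return hits / total if total else None

    print(rejection_sampling('Rain', {'Wet': True}))   # ~0.69 analytically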
Likelihood Weighting
Avoid inefficiency of rejection sampling
Fix values for evidence variables and only sample the remaining variables
Weight samples with regard to how likely they are
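For contrast, a likelihood-weighting sketch for the same illustrative two-node
network: the evidence variable Wet is fixed rather than sampled, and each sample
carries a weight equal to the likelihood of that evidence.

    import random

    def weighted_sample(evidence):
        w = 1.0
        rain = random.random() < 0.2          # Rain is not evidence: sample it
        p_wet = 0.9 if rain else 0.1
        if 'Wet' in evidence:                 # Wet is evidence: fix it and weight
            w *= p_wet if evidence['Wet'] else 1 - p_wet
            wet = evidence['Wet']
        else:
            wet = random.random() < p_wet
        return {'Rain': rain, 'Wet': wet}, w

    def likelihood_weighting(query, evidence, n=100_000):
        totals = {True: 0.0, False: 0.0}
        for _ in range(n):
            s, w = weighted_sample(evidence)
            totals[s[query]] += w             # every sample counts, weighted
        return totals[True] / (totals[True] + totals[False])

    print(likelihood_weighting('Rain', {'Wet': True}))   # ~0.69, no rejections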
Fig. 2.5.
Fig. 2.6.
The figure above shows the procedure for diagrammatically creating a causal
network from a mobile automaton
In an evolution of a multiway system, each substitution event is a vertex in a
causal network. Two events which are related by causal dependence, meaning one
occurs just before the other, have an edge between the corresponding vertices in the
causal network. More precisely, the edge is a directed edge leading from the past
event to the future event.
Some causal networks are independent of the choice of evolution, and these are
called causally invariant.
Fig. 2.7.
Whenever two rule hypotheses overlap in an evolution, the corresponding system
is not causally invariant. Hence, the simplest way to search for causal invariance is to
use rules whose hypotheses can never overlap except trivially. An overlap can
involve one or two strings. For example, AB does not have any overlaps. However,
ABA can overlap as ABABA, and the set of strings {ABB, AAB} can overlap
as AABB.
A mobile automaton simulated by a string substitution system is an example of a
causally invariant network in which rule hypotheses overlap, as long as the initial
condition contains only a single active cell.
Multiway System
A multiway system is a kind of substitution system in which multiple states are
permitted at any stage. This accommodates rule systems in which there is more than
one possible way to perform an update.
Probabilistic Reasoning 2.25
Fig. 2.8.
A simple example is a string substitution system. For instance, take the
rules {AB → A, BA → B} and the initial condition ABA. There are two choices for
how to proceed. Applying the first rule yields the evolution ABA → AA, while
applying the second rule yields the evolution ABA → AB. So at the first step,
there is a single state ({ABA}), at the second step there are two states ({AA, AB}),
and at the third step there is a single state ({A}).
A path through a multiway system arising from a choice of which substitutions to
make is called an evolution. Typically, a multiway system will have a large number
of possible evolutions. For example, consider strings of As and Bs with the rule
AB → BA. Then most strings will have more than one occurrence of the substring
AB, and each occurrence leads down another path in the multiway system.
******************
UNIT III
SUPERVISED LEARNING
Introduction to machine learning – Linear Regression Models: Least squares, single
& multiple variables, Bayesian linear regression, gradient descent, Linear
Classification Models: Discriminant function – Perceptron algorithm, Probabilistic
discriminative model - Logistic regression, Probabilistic generative model – Naive
Bayes, Maximum margin classifier – Support vector machine, Decision Tree,
Random Forests
Linear Regression has actually been around for a very long time (around 200
years). It is a linear model, i.e., it assumes a linear relationship between the input
variables (x) and a single output variable (y). The y here is calculated by a linear
combination of the input variables. Linear regression is a type of machine learning
algorithm that is used to model the relation between a scalar dependent variable and one
or more independent variables. The case of one independent variable is known
as simple linear regression, while the case of multiple independent variables is
known as multiple linear regression.
Fig. 3.1.
In both of these linear regressions, the model is constructed using a linear
predictor function. The unknown model parameters are estimated from the available
dataset. Linear regression has various applications in finance, economics,
epidemiology, etc.
Fig. 3.2.
After finding the line equation, if you are given a point, say (x3, y3), you would
easily be able to predict whether the point lies on the line, or the distance of the point
from the line. This was the basic regression I did in school without even
realizing that it would have such great importance in Machine Learning. We
generally do this to identify the equation of the line or curve that fits the input and
output of the training data set properly, and then use the same equation to predict the
output value of the test data set. This results in a continuous predicted value.
Two Types of Linear Regression
Let's talk about two types of Linear Regression
Simple Linear Regression
Multiple Linear Regression
1. Simple Linear Regression
In Simple Linear Regression, we try to find the relationship between a single
independent variable (input) and a corresponding dependent variable (output). This
can be expressed in the form of a straight line.
The same equation of a line can be re-written as:

Y = β0 + β1X + ε

1. Y represents the output or dependent variable.
2. β0 and β1 are two unknown constants that represent the intercept and the
coefficient (slope) respectively.
3. ε (epsilon) is the error term.
The following is a sample graph of a Simple Linear Regression Model :
The least squares method is the process of finding the best-fitting curve or line of
best fit for a set of data points by minimizing the sum of the squares of the offsets
(residuals) of the points from the curve. In the process of finding the relation
between two variables, the trend of the outcomes is estimated quantitatively; this
process is termed regression analysis. Curve fitting is an approach
to regression analysis, and least squares is the method of fitting equations that
approximates the curves to the given raw data.
It is quite obvious that the fitting of curves for a particular data set are not always
unique. Thus, it is required to find a curve having a minimal deviation from all the
measured data points. This is known as the best-fitting curve and is found by using
the least-squares method.
Fig. 3.5.
The offsets (residuals) of each data point from the line are minimized. Vertical
offsets are generally used in surface, polynomial, and hyperplane problems, while
perpendicular offsets are utilized in common practice.
S = d1² + d2² + d3² + … + dn² = minimum
Suppose when we have to determine the equation of line of best fit for the given
data, then we first use the following formula.
The equation of the least squares line is given by Y = a + bX.
Normal equation for 'a':
∑Y = na + b∑X
Normal equation for 'b':
∑XY = a∑X + b∑X²
Solving these two normal equations, we can get the required trend line equation.
Thus, we can get the line of best fit with the formula y = a + bx.
Solved Example
The least squares model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
passes through the point (xa, ya), where xa is the average of the xi's and ya is
the average of the yi's. The example below explains how to find the equation of a
straight line, or least squares line, using the least squares method.
Question:
Consider the time series data given below:
xi:   8    3    2    10   11   3    6    5    6    8
yi:   4    12   1    12   9    4    9    6    1    14
Use the least square method to determine the equation of line of best fit for the
data. Then plot the line.
Solution:
Mean of x i values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8) / 10 = 62 / 10 = 6.2
Mean of y i values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/ 10 = 72 / 10 = 7.2
Straight line equation is y = a + bx.
The normal equations are
∑y = an + b∑x
∑xy = a∑x + b∑x²
x      y      x²     xy
8      4      64     32
3      12     9      36
2      1      4      2
10     12     100    120
11     9      121    99
3      4      9      12
6      9      36     54
5      6      25     30
6      1      36     6
8      14     64     112
∑x = 62   ∑y = 72   ∑x² = 468   ∑xy = 503
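Substituting these sums into the normal equations with n = 10 gives
72 = 10a + 62b and 503 = 62a + 468b. Solving, b = 56.6 / 83.6 ≈ 0.677 and
a ≈ 3.002, so the line of best fit is approximately y = 3.002 + 0.677x. A quick
NumPy check of this arithmetic:

    import numpy as np

    x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8])
    y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14])

    # polyfit solves the same least squares problem; it returns the
    # coefficients highest power first, i.e. [slope b, intercept a].
    b, a = np.polyfit(x, y, 1)
    print(a, b)    # approximately 3.002 and 0.677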
Fig. 3.6.
Single variable linear regression is one of the fundamental tools for any
interpretation of data. Using linear regression, we can predict continuous variable
outcomes given some data, if the data has a roughly linear shape, i.e., it generally has
the shape of a line.
Visually, it appears that this data is approximated pretty well by a "line of best
fit". This is certainly not the only way to approximate this data, but for now it's pretty
good. Single variable linear regression is the tool to find this line of best fit. The line
of best fit can then be used to guess how many homicide deaths there would be for
ages we don't have data on.
By the end of this tutorial you can run linear regression on this homicide data, and
in fact solve any single variable regression problem.
Fig. 3.7.
where m is the number of examples in the data set. For a concrete example, the
homicide data set plotted above looks like:
D = {(21, 652), (22, 633), …, (50, 197)}
We will write code to load data sets from files later.
Model Concept
So, how can we mathematically model single linear regression? Since the goal is
to find the perfect line, let's start by defining the model (the mathematical description
of how predictions will be created) as a line:
y′(x, a, b) = ax + b
where x is an input, y′ is the prediction for the input x, and a and b are model
parameters. Note that although this is an equation for a line with x as the variable,
the values of a and b determine which specific line it is. To find the best line, we just
need to find the best values for a (the slope) and b (the y-intercept).
For example, the line of best fit for the homicide data above has a slope of
about a ≈ −17.69 and a y-intercept of b ≈ 1000. How we find the magic best values
for a and b we don't know yet, but once we find them, prediction is easy, since we
just use the formula above.
So, how do we find the correct values of the model parameters a and b? First, we
need a way to define what the "best line" is exactly. To do so, we define a loss
function (also called a cost function), which measures how bad a particular choice
of a and b are. Values of a and b that seem poor (a line that does not fit the data set)
should result in a large value of the loss function, whereas good values of a and b (a
line that fits the data set well) should result in small values of the loss function.
In other words, the loss function should measure how far the predicted line is
from each of the data points, and add this value up for all data points. We can write
this as:
L(a, b) = ∑_{i=1}^{m} (y′(x_i, a, b) − y_i)²
Recall that there are m examples in the data set, x_i is the ith input, and y_i is the ith desired output. So, (y′(x_i, a, b) − y_i)² measures how far the ith prediction is from the ith desired output. For example, if the prediction y′ is 7, and the correct output y is 10, then we would get (7 − 10)² = 9. Squaring it is important so that it is always positive.
Finally, we just add up all of these individual losses. Since the smallest possible
values for the squared terms indicate that the line fits the data as closely as possible,
the line of best fit (determined by the choice of a and b) occurs exactly at the smallest value of L(a, b). For this reason, the model is also called least squares regression.
Fig. 3.8.
The red dot marked on the plot of L shows where the desired minimum is. We
need an algorithm to find this minimum. From calculus, we know that at the
minimum L must be entirely flat, that is the derivatives are both 0:
∂L/∂a = ∑_{i=1}^{m} 2(ax_i + b − y_i)x_i = 0
∂L/∂b = ∑_{i=1}^{m} 2(ax_i + b − y_i) = 0
If you need to review this aspect of calculus, I would recommend Khan Academy
videos. Now, for this problem it is possible to solve for a and b using the equations
above, like we would in a typical calculus course. But for more advanced machine
learning this is impossible, so instead we will learn to use an algorithm
called gradient descent to find the minimum.
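As a preview, the sketch below minimises L(a, b) for this problem directly with gradient descent. It is only an illustration under assumed settings: the learning rate and number of steps are hypothetical and would need tuning for a real data set.

import numpy as np

def gradient_descent(x, y, lr=1e-4, steps=10000):
    """Minimise L(a, b) = sum((a*x_i + b - y_i)**2) by gradient descent."""
    a, b = 0.0, 0.0                   # arbitrary starting point
    for _ in range(steps):
        err = a * x + b - y           # prediction errors
        grad_a = 2 * np.sum(err * x)  # dL/da, as derived above
        grad_b = 2 * np.sum(err)      # dL/db
        a -= lr * grad_a              # step opposite the gradient
        b -= lr * grad_b              # (lr plays the role of alpha below)
    return a, b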
The idea is intuitive: place a ball at an arbitrary location on the surface of L, and
it will naturally roll downhill towards the flat valley of L and thus find the minimum.
We know the direction of "downhill" at any location since we know the derivatives
of L: the derivatives are the direction of greatest upward slope (this is known as
the gradient), so the opposite (negative) derivatives are the most downhill direction.
Therefore, if the ball is currently at location (a , b), we can see where it would go by
moving it to location (a′, b′) like so:
a′ = a − α ∂L/∂a
b′ = b − α ∂L/∂b
where α is a constant called the learning rate, which we will talk about more
later. If we repeat this process then the ball will continue to roll downhill into the
minimum. An animation of this process looks like:
Fig. 3.9.
When we run the gradient descent algorithm for long enough, then it will find the
optimal location for (a , b). Once we have the optimal values of a and b , then that's
it, we can just use them to predict a rate of homicide deaths given any age, using the
model:
y′(x) = ax + b
Let's stop and think about what this means. In contrast to OLS, we have a
posterior distribution for the model parameters that is proportional to the likelihood
of the data multiplied by the prior probability of the parameters. Here we can observe
the two primary benefits of Bayesian Linear Regression.
1. Priors: If we have domain knowledge, or a guess for what the model
parameters should be, we can include them in our model, unlike in the
frequentist approach which assumes everything there is to know about the
parameters comes from the data. If we don't have any estimates ahead of
time, we can use non-informative priors for the parameters such as a
normal distribution.
2. Posterior: The result of performing Bayesian Linear Regression is a
distribution of possible model parameters based on the data and the prior.
This allows us to quantify our uncertainty about the model: if we have
fewer data points, the posterior distribution will be more spread out.
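Under Gaussian assumptions this posterior is available in closed form, so both benefits can be demonstrated in a few lines. The sketch below assumes a known noise variance sigma2 and a zero-mean normal prior with variance tau2; both values, and the synthetic data, are illustrative assumptions, not taken from the text.

import numpy as np

def posterior(X, y, sigma2=1.0, tau2=10.0):
    """Posterior over weights for y = Xw + noise, with prior w ~ N(0, tau2*I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2  # posterior precision
    cov = np.linalg.inv(precision)                   # posterior covariance
    mean = cov @ X.T @ y / sigma2                    # posterior mean
    return mean, cov

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(0, 10, 20)])  # intercept + feature
y = 3.0 + 0.7 * X[:, 1] + rng.normal(0, 1, 20)
mean, cov = posterior(X, y)
lines = rng.multivariate_normal(mean, cov, size=100)  # 100 plausible (b, a) pairs

With few data points the sampled lines spread widely; with more data the posterior covariance shrinks and the lines overlap, exactly as described next.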
As the amount of data points increases, the likelihood washes out the prior, and in the case of infinite data, the outputs for the parameters converge to the values obtained from OLS.
Fig. 3.10.
When we want to show the linear fit from a Bayesian model, instead of showing only one estimate, we can draw a range of lines, with each one representing a different
estimate of the model parameters. As the number of datapoints increases, the lines
begin to overlap because there is less uncertainty in the model parameters. In order to
demonstrate the effect of the number of datapoints in the model, I used two models,
the first, with the resulting fits shown on the left, used 500 datapoints and the one on
the right used 15000 datapoints. Each graph shows 100 possible models drawn from
the model parameter posteriors.
Bayesian Linear Regression Model Results with 500 (left) and 15000 observations
(right)
There is much more variation in the fits when using fewer data points, which
represents a greater uncertainty in the model. With all of the data points, the OLS and
Bayesian Fits are nearly identical because the priors are washed out by the
likelihoods from the data. When predicting the output for a single datapoint using our
Bayesian Linear Model, we also do not get a single value but a distribution.
Following is the probability density plot for the number of calories burned exercising
for 15.5 minutes. The red vertical line indicates the point estimate from OLS.
Fig. 3.11. Posterior Probability Density of Calories Burned from Bayesian Model
We see that the probability of the number of calories burned peaks around 89.3,
but the full estimate is a range of possible values.
Fig. 3.12.
This entire procedure is known as Gradient Descent, also called steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
1. Compute the gradient, i.e., the first-order derivative of the function at the current point.
2. Move in the direction opposite to the gradient, since the gradient points towards the steepest increase.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and actual values. To minimize the cost function, two factors are required: direction and learning rate.
These two factors determine the partial derivative calculations of future iterations and allow the algorithm to reach the point of convergence, a local or global minimum. Let's discuss the learning rate factor in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is
typically a small value that is evaluated and updated based on the behavior of the
cost function. If the learning rate is high, the algorithm takes larger steps but risks overshooting the minimum. At the same time, a low learning rate yields small step sizes, which compromises overall efficiency but gives the advantage of more precision.
Fig. 3.13.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special type of gradient descent with higher computational efficiency and a less noisy gradient.
Fig. 3.14.
Whenever the slope of the cost function is at or near zero, the model stops learning further. Apart from the global minimum, some other scenarios can show this slope: saddle points and local minima. Local minima have a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient only occurs on one side of the point: the point is a local maximum on one side and a local minimum on the other. The name saddle point is taken from a horse's saddle. A local minimum is so named because the value of the loss function is minimum at that point in a local region. In contrast, the global minimum is so named because the value of the loss function is minimum there, globally across the entire domain of the loss function.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, so earlier layers of the network learn more slowly than later layers. When this happens, the weight updates become insignificant and the network effectively stops training.
Exploding Gradient:
The exploding gradient is just the opposite of the vanishing gradient: it occurs when the gradient is too large and creates an unstable model. In this scenario, the model weights grow too large and may end up being represented as NaN. This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity within the model.
Fig. 3.15.
f(x, W, b) = ∑_j (W_j x_j) + b
x : input vector
W : Weight matrix
b : Bias vector
Fig. 3.16.
The weight matrix will have one row for every class that needs to be classified and one column for every element (feature) of x. In the picture above, each line will be represented by a row in our weight matrix.
Parametric Approach
Fig. 3.17.
The idea is that our hypothesis/model has parameters that will aid the mapping between the input vector and a specific class score. The parametric model has two
important components:
Score Function: Is a function f (x ,W, b) that will map our raw input
vector to a score vector
Loss Function: Quantifies how well our current set of weights maps some input x to an expected output y; the loss function is used during training time.
In this approach, the training phase will find us a set of parameters that change the hypothesis/model so as to map each input to one of the output classes.
During the training phase, which consists of an optimisation problem, the weights (W) and bias (b) are the only things that we can change.
Now some topics that are important on the diagram below:
The input image x is stretched into a single-dimension vector; this loses spatial information
The weight matrix will have one column for every element on the input
The weight matrix will have one row for every element of the output (on
this case 3 labels)
The bias will have one row for every element of the output (on this case 3
labels)
The loss will receive the current scores and the expected output for its
current input X
Consider each row of W a kind of pattern match for a specific class. The score for each class is calculated by taking an inner product between the input vector x and the specific row for that class.
Ex: score_cat = [0.2(56) − 0.5(231) + 0.1(24) + 2(2)] + 1.1 = −96.8
Fig. 3.18.
Bias trick
Some learning library implementations use a trick to consider the bias as part of the weight matrix; the advantage of this approach is that we can solve the linear classification with a single matrix multiplication.
f(x, W) = W · x
Fig. 3.19.
Basically, you add an extra entry (a constant 1) at the end of the input vector and concatenate a column holding the biases onto the W matrix.
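The bias trick is easy to verify numerically. The sketch below reuses the cat-score numbers from the example above; the variable names are our own illustration.

import numpy as np

W = np.array([[0.2, -0.5, 0.1, 2.0]])   # one row per class (here: 'cat')
b = np.array([1.1])                     # one bias per class
x = np.array([56.0, 231.0, 24.0, 2.0])  # flattened input vector

score = W @ x + b                       # ordinary form: f(x, W, b) = Wx + b

W_aug = np.hstack([W, b[:, None]])      # concatenate the bias column onto W
x_aug = np.append(x, 1.0)               # extra constant entry at the end of x
score_trick = W_aug @ x_aug             # single multiplication: f(x, W) = Wx
assert np.allclose(score, score_trick)  # both give -96.8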
Fig. 3.20.
Example:
Suppose we have two sets of data points belonging to two different classes that we
want to classify. As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there's no straight line that can separate the two classes of the data
points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used
which reduces the 2D graph into a 1D graph in order to maximize the separability
between the two classes.
Fig. 3.21.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a
new axis and projects data onto a new axis in a way to maximize the separation of
the two categories and hence, reducing the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class
Fig. 3.22.
In the above graph, it can be seen that a new axis (in red) is generated and plotted
in the 2D graph such that it maximizes the distance between the means of the two
classes and minimizes the variation within each class. In simple terms, this newly
generated axis increases the separation between the data points of the two classes.
After generating this new axis using the above-mentioned criteria, all the data points
of the classes are plotted on this new axis and are shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it then becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
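A short sketch of LDA in practice, using scikit-learn on synthetic two-class 2D data; the data itself is an assumption made up for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),    # class 0 around (0, 0)
               rng.normal(3, 1, size=(50, 2))])   # class 1 around (3, 3)
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)   # the 2D data projected onto the new 1D axis
print(lda.score(X, y))           # training accuracy after projection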
Mathematics
Let's suppose we have two classes and d-dimensional samples x_1, x_2, …, x_n, where:
n_1 samples come from the class c_1 and n_2 come from the class c_2.
If x_i is a data point, then its projection on the line represented by the unit vector v can be written as vᵀx_i.
Let μ_1 and μ_2 be the means of the samples of classes c_1 and c_2 respectively before projection, and let μ̃_1 denote the mean of the samples of class c_1 after projection; it can be calculated by:
μ̃_1 = (1/n_1) ∑_{x_i ∈ c_1} vᵀx_i = vᵀμ_1
Similarly,
μ̃_2 = vᵀμ_2
Now, in LDA we need to normalize |μ̃_1 − μ̃_2|. Let y_i = vᵀx_i be the projected samples; then the scatter for the samples of c_1 is:
s̃_1² = ∑_{y_i ∈ c_1} (y_i − μ̃_1)²
Face Recognition
Face recognition is the popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is used
to minimize the number of features to a manageable number before going through
the classification process. It generates a new template in which each dimension
consists of a linear combination of pixel values. If a linear combination is generated
using Fisher's linear discriminant, then it is called Fisher's face.
Medical
In the medical field, LDA has a great application in classifying the patient disease
on the basis of various parameters of patient health and the medical treatment which
is going on. On such parameters, it classifies disease as mild, moderate, or severe.
This classification helps the doctors in either increasing or decreasing the pace of the
treatment.
Customer Identification
LDA is also currently applied in customer identification. With the help of LDA, we can easily identify and select the features that characterise the group of customers who are likely to purchase a specific product in a shopping mall. This can help businesses target such customers more effectively.
Activation Function:
These are the final and important components that help to determine whether the
neuron will fire or not. Activation Function can be considered primarily as a step
function.
Fig. 3.23.
The perceptron first adds up all the weighted inputs. After adding all inputs, if the total sum is more than a pre-determined value, the model gets activated and shows the output value as +1. If the outcome equals the pre-determined (threshold) value, the performance of the model is stated as satisfactory and the weights are not changed. However, discrepancies arise in this model when multiple weighted input values are fed into it. Hence, to find the desired output and minimize errors, some changes to the input weights may be necessary.
"Single-layer perceptron can learn only linearly separable patterns."
Fig. 3.24.
The data scientist uses the activation function to take a subjective decision based
on various problem statements and forms the desired outputs. Activation function
may differ (e.g., Sign, Step, and Sigmoid) in perceptron models by checking whether
the learning process is slow or has vanishing or exploding gradients.
Fig. 3.26.
Most of the Machine Learning and Deep Learning problems that you solve are
conceptualized from the Generative and Discriminative Models. In Machine
Learning, one can clearly distinguish between the two modelling types:
Classifying an image as a dog or a cat falls under Discriminative
Modelling
The four most common and interesting problems in the Image domain.
Fig. 3.27.
Many of you may have already used Discriminative Modelling to solve a
classification problem in Machine Learning or Deep Learning. Being a superset of
Machine Learning and Deep Learning algorithms, Discriminative Modelling is not
limited to classification tasks. It is also widely used in object detection, semantic
segmentation, panoptic segmentation, keypoint detection, regression problems, and
language modelling.
The discriminative model falls under the supervised learning branch. In a
classification task, given that the data is labelled, it tries to distinguish among
classes, for example, a car, a traffic light and a truck. Also known as classifiers, these models map image samples X to class labels Y, and discover the probability of image sample x ∈ X belonging to class label y ∈ Y.
They learn to model the decision boundaries among classes (such as cats, dogs
and tigers ). The decision boundary could be linear or non-linear. The data points that
are far away from the decision boundary (i.e. the outliers) are not very important.
The discriminative model tries to learn a boundary that separates the positive from
the negative class, and comes up with the decision boundary. Only those closest to
this boundary are considered.
Fig. 3.28. Image showing the classification of data points by Discriminative models.
In trying to classify a sample x belonging to class label y, the discriminative
model indirectly learns certain features of the dataset that make its task easier. For
example, a car has four wheels of a circular shape and more length than width, while
the traffic light is vertical with three circular rings. These features help the model
distinguish between the two classes.
The discriminative models could be:
Probabilistic
logistic regression
a deep neural network, which models P(Y|X)
Non-probabilistic
Support Vector Machine (SVM), which tries to learn the mappings directly
from the data points to the classes with a hyperplane.
Discriminative modelling learns to model the conditional probability of class label y given a set of features x as P(Y|X).
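As a concrete illustration, logistic regression in scikit-learn models exactly this P(Y|X); the dataset here is a synthetic assumption for the sketch.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# A discriminative model returns the conditional probability P(Y | X):
print(clf.predict_proba(X[:3]))  # class probabilities for three samples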
Some of the discriminative models are:
Fig. 3.29.
Neural networks are widely used in supervised learning and reinforcement
learning problems. These networks are based on a set of layers connected to each
other. In deep learning, the number of hidden layers, mostly non-linear, can be large;
say about 1000 layers. DL models produce much better results than normal ML
networks. We mostly use the gradient descent method for optimizing the network
and minimising the loss function.
Each node in output and hidden layers has its own classifiers. The input layer
takes inputs and passes on its scores to the next hidden layer for further activation
and this goes on till the output is reached. This progress from input to output from
left to right in the forward direction is called forward propagation.
Credit assignment path (CAP) in a neural network is the series of transformations
starting from the input to the output. CAPs elaborate probable causal connections
between the input and the output. For a given feed-forward neural network, the CAP depth is the number of hidden layers plus one, as the output layer is included. For recurrent neural networks, where a signal may propagate through a layer several times, the CAP depth can be potentially limitless.
Fig. 3.30.
To understand deep neural networks, go back in time and first know Linear and
Logistic Regression.
Fig. 3.31.
To simplify a bit, assume x is the area of a house (in square feet), while y is its
price. The variables m and b are randomly initialized, and their values are updated till the loss (mean squared error) between the prediction (ŷ) and the true value (y) is minimum (does not reduce further).
Fig. 3.34.
Probability of x sampled from training data or true distribution (Left). Probability
of x sampled from modelled distribution (Right).
The above image is an example of sample generation, using generative modelling.
The goal is to input training samples with probability distribution P_data and learn a distribution P_model such that, when you sample from P_model, it generates realistic observations that represent the true distribution.
To generate such training samples, you need a training dataset, which consists of
unlabelled data points. Each data point has its own features, such as individual pixel
values (image domain) or a set of vocabulary (text domain). This whole process of learning a distribution from unlabelled data and sampling new observations from it is what generative modelling refers to.
Fig. 3.35. Image visualizing the concept 'noise adds randomness, so the images generated are diverse'
Generative models, you know by now, try to learn the probabilistic distribution of
the training data. This helps them represent the data more realistically. In the above
figure, the generative model learns to generate urban-scene images, taking a random
noise as a matrix or vector. Its task is to generate realistic samples X, with
probability distribution similar to Pdata (original data from the training set). The
noise adds randomness to the model and ensures that the images generated are
diverse.
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
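A small numeric sketch of Bayes' theorem; the numbers are invented purely for illustration (a toy spam-filter scenario).

p_A = 0.01          # prior P(A): a message is spam
p_B_given_A = 0.90  # likelihood P(B|A): the word 'offer' appears in spam
p_B_given_not_A = 0.05

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))    # about 0.154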
The maximal margin classifier is the optimal hyperplane defined in the (rare) case
where two classes are linearly separable. Given an n × p data matrix X with a binary response variable defined as y ∈ {−1, 1}, it might be possible to define a p-dimensional hyperplane
h(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p = x_iᵀβ + β₀ = 0
such that all observations of each class fall on opposite sides of the hyperplane.
This separating hyperplane has the property that if β is constrained to be a unit
vector, ‖β‖² = ∑_j β_j² = 1, then the products of the hyperplane values and the response variables are positive perpendicular distances from the hyperplane, the smallest of which may be termed the hyperplane margin, M,
y_i(x_iᵀβ + β₀) ≥ M
The maximal margin classifier is the hyperplane with the maximum
margin, max{M} subject to ||β|| = 1. A separating hyperplane rarely exists. In fact,
even if a separating hyperplane does exist, its resulting margin is probably
undesirably narrow. Here is the maximal margin classifier.
Fig. 3.36.
The data set has two linearly separable classes, y ∈ {−1, 1}, described by two features, X₁ and X₂. The code is unimportant - just trying to produce the visualization.
Fig. 3.37.
Based on the support vectors and their relative positions, the maximum margin is derived and an optimal hyperplane is drawn at the midpoint. The hyperplane is N − 1 dimensional, where N is the number of features present in the given dataset. For example, a line will indicate the decision boundary if a dataset has two features (2D input space).
Fig. 3.38.
Image depicting the selection of the hyperplane that maximises the margin
between the data points.
When the data points are not separated in a linear fashion, the Non-Linear SVM is
used. A kernel function or a kernel trick helps obtain a new hyperplane for all the
training data. As shown in the figure above, the input space is projected to a higher-
dimensional feature space, such that the distribution of data points in the new
hyperplane will be linear. Various kernel tricks, such as the polynomial and radial basis function kernels, are used to solve non-linear classification problems with SVM.
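The kernel trick is easy to demonstrate on data that is not linearly separable; concentric circles are a standard toy case (this sketch and its parameters are illustrative assumptions only).

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # radial basis function kernel

print(linear_svm.score(X, y))  # poor: the classes are not linearly separable
print(rbf_svm.score(X, y))     # close to 1.0 thanks to the kernel trick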
Like SVM, Random Forest also falls in the class of discriminative modelling. It is
one of the most popular and powerful Machine Learning algorithms to perform
classification and regression. Random Forest became a hit in the Kaggle
Community as it helped win many competitions.
It is an ensemble of decision-tree models. Hence, to understand the Random
Forest algorithm, you need to know Decision Trees. Microsoft trained a deep,
randomized decision-forest classifier to predict 3D positions of body joints, from a
single-depth image. There was no temporal information, and hundreds of thousands
of training images were used.
Fig. 3.39. Image showing the architecture and working of a decision tree.
Fig. 3.41.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more
homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from
the tree.
Parent/Child node: The root node of the tree is called the parent node,
and other nodes are called the child nodes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where you cannot classify the nodes further; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM). The root node splits further into
the next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
Fig. 3.42.
Using an attribute selection measure (ASM), we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
1. Information Gain:
Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the
decision tree.
A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information gain
is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It
specifies randomness in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
S = Total number of samples
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
Gini index is a measure of impurity or purity used while creating a
decision tree in the CART(Classification and Regression Tree) algorithm.
An attribute with the low Gini index should be preferred as compared to
the high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index
to create binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − ∑_j P_j²
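Both measures are one-liners in code; here is a minimal sketch of the formulas above.

import numpy as np

def entropy(p_yes):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    p_no = 1 - p_yes
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(probs):
    # Gini Index = 1 - sum_j P_j^2
    return 1 - sum(p ** 2 for p in probs)

print(entropy(0.5))      # 1.0: a 50/50 split is maximally impure
print(gini([0.5, 0.5]))  # 0.5
print(gini([1.0, 0.0]))  # 0.0: a pure node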
9. Write about Stochastic Gradient Descent?
Stochastic gradient descent (SGD) runs one training example per iteration: it processes a training epoch for each example within a dataset and updates each training example's parameters one at a time. As it requires only one training example at a time, it is easier to store in allocated memory.
10. Write about Mini Batch Gradient Descent?
Mini Batch gradient descent is the combination of both batch gradient descent
and stochastic gradient descent. It divides the training datasets into small batch
sizes then performs the updates on those batches separately. Splitting training
datasets into smaller batches make a balance to maintain the computational
efficiency of batch gradient descent and speed of stochastic gradient descent.
Hence, we can achieve a special type of gradient descent with higher
computational efficiency and less noisy gradient descent.
11. What is QDA?
Quadratic Discriminant Analysis (QDA): for multiple input variables, each class deploys its own estimate of variance.
12. What is RDA?
Regularized Discriminant Analysis (RDA) introduces regularization into the estimate of the variance (actually covariance) and hence moderates the influence of different variables on LDA.
13. What is Binary classifier?
Binary classifiers are defined as functions that help in deciding whether input data, represented as a vector of numbers, belongs to some specific class. They can be considered linear classifiers. In simple words, a binary classifier is a classification algorithm that predicts using a linear predictor function in terms of weights and feature vectors.
14. Some of the discriminative models?
Support Vector Machine
Logistic Regression
k-Nearest Neighbour (kNN)
Random Forest
Deep Neural Network ( such as AlexNet, VGGNet, and ResNet )
15. Type of Generative Models?
Naive Bayes
*********************
UNIT IV
ENSEMBLE TECHNIQUES AND
UNSUPERVISED LEARNING
Combining multiple learners: Model combination schemes, Voting, Ensemble
Learning - bagging, boosting, stacking, Unsupervised learning: K-means, Instance
Based Learning: KNN, Gaussian mixture models and Expectation maximization
A model composed of multiple learners that complement each other can be combined to attain higher accuracy. The individual algorithms are called base learners. Each base learner should be simple and reasonably accurate; together they produce the required accuracy.
Ensemble Technique
It is a machine learning technique that combines several base models in order to
produce one optimal predictive model.
Decision Trees are convenient for outlining the definition and practicality of ensemble methods (however, it is important to note that ensemble methods are not limited to Decision Trees).
A Decision Tree determines the predictive value based on a series of questions and conditions. For instance, consider a simple Decision Tree deciding whether an individual should play outside or not. The tree takes several weather factors into account and, given each factor, either makes a decision or asks another question. In this example, every time it is overcast, we play. If it is raining, we need to ask whether it is windy or not: if windy, do not play; but given no wind, we are ready to go outside and play.
Fig. 4.1.
1. Max Voting
The max voting method is generally used for classification problems. In this
technique, multiple models are used to make predictions for each data point. The
predictions by each model are considered as a ‘vote’. The predictions which we get
from the majority of the models are used as the final prediction.
For example, suppose you asked 5 of your colleagues to rate your movie (out of 5); assume three of them rated it 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating will be taken as 4. You can consider this as taking the mode of all the predictions.
Ratings: 5, 4, 5, 4, 4 → Final prediction (mode): 4
2. Averaging
Similar to the max voting technique, multiple predictions are made for each data
point in averaging. In this method, we take an average of predictions from all the
models and use it to make the final prediction. Averaging can be used for making
predictions in regression problems or while calculating probabilities for classification
problems.
For example, in the below case, the averaging method would take the average of
all the values.
i.e. (5 + 4 + 5 + 4 + 4) / 5 = 4.4
Ratings: 5, 4, 5, 4, 4 → Final prediction (average): 4.4
3. Weighted Average
This is an extension of the averaging method. All models are assigned different
weights defining the importance of each model for prediction. For instance, if two of
your colleagues are critics, while others have no prior experience in this field, then
the answers by these two friends are given more importance as compared to the other
people.
The result is calculated as [(5 × 0.23) + (4 × 0.23) + (5 × 0.18) + (4 × 0.18) + (4 × 0.18)] = 4.41.
Ratings: 5, 4, 5, 4, 4 (weights 0.23, 0.23, 0.18, 0.18, 0.18) → Final prediction (weighted average): 4.41
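The three combination schemes above fit in a few lines of NumPy, using the same ratings and weights as in the examples.

import numpy as np

ratings = np.array([5, 4, 5, 4, 4])

# Max voting: take the mode of the predictions
values, counts = np.unique(ratings, return_counts=True)
print(values[counts.argmax()])           # 4

# Simple averaging
print(ratings.mean())                    # 4.4

# Weighted average, with the weights used above
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])
print(round(float(ratings @ weights), 2))  # 4.41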
4.2.1. STACKING
Stacking is an ensemble learning technique that uses predictions from multiple
models (for example decision tree, knn or svm) to build a new model. This model is
used for making predictions on the test set. Below is a step-wise explanation for a
simple stacked ensemble:
1. The train set is split into 10 parts.
2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.
3. The base model (in this case, decision tree) is then fitted on the whole train dataset. Using this model, predictions are made on the test set.
4. Steps 2 to 4 are repeated for another base model (say knn) resulting in
another set of predictions for the train set and test set.
5. The predictions from the train set are used as features to build a new
model.
6. This model is used to make final predictions on the test prediction set.
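scikit-learn automates this procedure in StackingClassifier; the sketch below is illustrative only, and the dataset and base learners are our own assumed choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,   # out-of-fold predictions, matching the 10-part split above
)
print(stack.fit(X_tr, y_tr).score(X_te, y_te))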
4.2.2. BLENDING
Blending follows the same approach as stacking but uses only a holdout
(validation) set from the train set to make predictions. In other words, unlike
stacking, the predictions are made on the holdout set only. The holdout set and the
predictions are used to build a model which is run on the test set. Here is a detailed
explanation of the blending process:
1. The train set is split into training and validation sets.
2. Model(s) are fitted on the training set.
3. The predictions are made on the validation set and the test set.
4. The validation set and its predictions are used as features to build a new model.
5. This model is used to make final predictions on the test set using the meta-features.
4.2.3. BAGGING
The idea behind bagging is combining the results of multiple models (for instance,
all decision trees) to get a generalized result. Here's a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result since they are getting the same input. So how can we solve this problem?
One of the techniques is bootstrapping.
Bootstrapping is a sampling technique in which we create subsets of observations
from the original dataset, with replacement. The size of the subsets is the same as the
size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a
fair idea of the distribution (complete set). The size of subsets created for bagging
may be less than the original set.
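A short bagging sketch with scikit-learn; the parameter values are illustrative, and the base-model parameter is named estimator in recent scikit-learn versions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=25, bootstrap=True, random_state=0)
print(bag.fit(X, y).score(X, y))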
Fig. 4.2.
Fig. 4.3.
4.2.4. BOOSTING
Before we go further, here's another question for you: if a data point is incorrectly predicted by the first model, and then by the next (and probably by all models), will combining the predictions provide better results? Such situations are taken care of by boosting.
Boosting is a sequential process, where each subsequent model attempts to correct
the errors of the previous model. The succeeding models are dependent on the
previous model. Let’s understand the way boosting works in the below steps.
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted, are given higher
weights. (Here, the three misclassified blue-plus points will be given
higher weights)
7. Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)
Fig. 4.4.
8. Similarly, multiple models are created, each correcting the errors of the
previous model.
9. The final model (strong learner) is the weighted mean of all the models
(weak learners).
10. Thus, the boosting algorithm combines a number of weak learners to form
a strong learner. The individual models would not perform well on the
entire dataset, but they work well for some part of the dataset. Thus, each
model actually boosts the performance of the ensemble.
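AdaBoost is the classic realisation of these steps; the sketch below is illustrative (50 weak learners on a synthetic dataset of our own choosing).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each weak learner is trained on data re-weighted toward the previous
# model's mistakes; the final prediction is a weighted vote of all of them.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(boost.fit(X, y).score(X, y))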
Unsupervised learning models are the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
In other words, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the model itself finds hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain while learning new things.
Unsupervised learning cannot be directly applied to a regression or classification
problem because unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format.
Fig. 4.5.
Fig. 4.6.
Fig. 4.7.
KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
Step-4: Among these k neighbors, count the number of the data points in
each category.
Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
Fig. 4.8.
Firstly, we will choose the number of neighbors, so we will choose the k = 5.
Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
As we can see, 3 of the 5 nearest neighbors are from category A; hence this new data point must belong to category A.
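The whole procedure is a few lines with scikit-learn; the toy points below are invented stand-ins for categories A and B.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 3], [3, 1],    # category A (label 0)
              [6, 5], [7, 7], [8, 6]])   # category B (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5)  # Step 1: choose K = 5
knn.fit(X, y)                              # training just stores the data

print(knn.predict([[3, 3]]))  # majority vote among the 5 nearest neighbours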
Advantages:
1. Instead of estimating for the entire instance set, local approximations can
be made to the target function.
2. This algorithm can adapt easily to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query
involves starting the identification of a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
This model is a soft probabilistic clustering model that allows us to describe the
membership of points to a set of clusters using a mixture of Gaussian densities. It is a
soft classification (in contrast to a hard one) because it assigns probabilities of
belonging to a specific class instead of a definitive choice. In essence, each
observation will belong to every class but with different probabilities.
GMM consists of two parts – mean vectors (μ) & covariance matrices (Σ). A
Gaussian distribution is defined as a continuous probability distribution that takes on
a bell-shaped curve. Another name for Gaussian distribution is the normal
distribution. Here is a picture of Gaussian mixture models:
Fig. 4.9.
GMM has many applications, such as density estimation, clustering, and image
segmentation. For density estimation, GMM can be used to estimate the probability
density function of a set of data points. For clustering, GMM can be used to group
together data points that come from the same Gaussian distribution. And for image
segmentation, GMM can be used to partition an image into different regions.
Gaussian mixture models can be used for a variety of use cases, including
identifying customer segments, detecting fraudulent activity, and clustering images.
In each of these examples, the Gaussian mixture model is able to identify clusters in
the data that may not be immediately obvious. As a result, Gaussian mixture models
are a powerful tool for data analysis and should be considered for any clustering task.
The following are three different steps to using gaussian mixture models:
Determining a covariance matrix that defines how each Gaussian is related
to one another. The more similar two Gaussians are, the closer their means
will be and vice versa if they are far away from each other in terms of
similarity. A gaussian mixture model can have a covariance matrix that is
diagonal or symmetric.
Determining the number of Gaussians in each group defines how many
clusters there are.
Selecting the hyperparameters which define how to optimally separate data
using gaussian mixture models as well as deciding on whether or not each
gaussian’s covariance matrix is diagonal or symmetric.
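A short scikit-learn sketch of those three steps, on synthetic data generated from two Gaussians (the data and hyperparameter choices are our own illustration).

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # first Gaussian
               rng.normal(5, 1, size=(100, 2))])  # second Gaussian

gmm = GaussianMixture(n_components=2,           # number of Gaussians
                      covariance_type="full",   # full covariance matrices
                      random_state=0).fit(X)

print(gmm.means_)                # estimated mean vectors
print(gmm.predict_proba(X[:3]))  # soft (probabilistic) cluster memberships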
The following are different scenarios when GMMs can be used:
Gaussian mixture models can be used in a variety of scenarios, including
when data is generated by a mix of Gaussian distributions when there is
uncertainty about the correct number of clusters, and when clusters have
different shapes. In each of these cases, the use of a Gaussian mixture
model can help to improve the accuracy of results. For example, when data
is generated by a mix of Gaussian distributions, using a Gaussian mixture
model can help to better identify the underlying patterns in the data. In
addition, when there is uncertainty about the correct number of clusters,
the use of a Gaussian mixture model can help to reduce the error rate.
Gaussian mixture models can be used for anomaly detection; by fitting a
model to a dataset and then scoring new data points, it is possible to flag
points that are significantly different from the rest of the data (i.e. anomalies or outliers).
Time-series analysis: Gaussian mixture models can be used to detect change points in time series data and help find turning points of stock prices or other market movements that are otherwise difficult to spot due to volatility and noise.
Gene expression data analysis: Gaussian mixture models can be used for
gene expression data analysis. In particular, GMMs can be used to detect
differentially expressed genes between two conditions and identify which
genes might contribute toward a certain phenotype or disease state.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of
various unsupervised machine learning algorithms, which is used to determine
the local maximum likelihood estimates (MLE) or maximum a posteriori
estimates (MAP) for unobservable variables in statistical models. Further, it is a
technique to find maximum likelihood estimation when the latent variables are
present. It is also referred to as the latent variable model. A latent variable model
consists of both observable and unobservable variables where observable can be
predicted while unobserved are inferred from the observed variable. These
unobservable variables are known as latent variables.
Key Points:
It is known as the latent variable model to determine MLE and MAP
parameters for latent variables.
It is used to predict values of parameters in instances where data is missing
or unobservable for learning, and this is done until convergence of the
values occurs.
4.6.1. EM ALGORITHM
The EM algorithm is the combination of various unsupervised ML algorithms,
such as the k-means clustering algorithm. Being an iterative approach, it consists
of two modes. In the first mode, we estimate the missing or latent variables. Hence it
is referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more
clearly. The second mode is known as the maximization-step or M-step.
Fig. 4.10.
Expectation step (E - step): It involves the estimation (guess) of all
missing values in the dataset so that after completing this step, there
should not be any missing value.
Maximization step (M - step): This step involves the use of estimated
data in the E-step and updating the parameters.
Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
Well, if we already knew where the Gaussians are in the above plot, for each
observation, we could compute the cluster probabilities. Let’s draw this picture so
you have it in mind. So we would be assigning a color to the points below:
Intuitively, for one selected observation and one selected Gaussian, the probability
of the observation belonging to the cluster would be the ratio between the Gaussian
value and the sum of all the Gaussians. Something like:
P(x_i belongs to cluster 1) = π₁ N(x_i | μ₁, Σ₁) / Z
P(x_i belongs to cluster 2) = π₂ N(x_i | μ₂, Σ₂) / Z
P(x_i belongs to cluster 3) = π₃ N(x_i | μ₃, Σ₃) / Z
with Z = ∑_{j=1}^{K} π_j N(x_i | μ_j, Σ_j)
Conversely, to find where the Gaussians should be located in the above plot, suppose we are already in possession of the observations' labels, like so:
Now we could easily find the parameter values and draw the Gaussians. It suffices
to consider the points independently, say the red points, and find the maximum
likelihood estimate. For a Gaussian distribution, one can demonstrate the following
results:
μ_MLE = (1/N) ∑_{i=1}^{N} x_i = x̄
σ²_MLE = (1/N) ∑_{i=1}^{N} (x_i − x̄)²
Applying the above formula, to the red points, then the blue points, and then the
yellow points, we get the following normal distributions:
So given the normal distribution parameters, we can find the observation labels,
and given the observation labels, we can find the normal distribution parameters.
First we can set the Gaussian parameters to random values. Then we perform the
optimization by iterating through the two successive steps until convergence :
1. We assign labels to the observations using the current Gaussian parameters
2. We update the Gaussian parameters so that the fit is more likely
This would give the following result:
More formally, EM alternates between two steps on a lower bound of the likelihood:
1. Fix Θ and adjust q so that the lower-bound gets maximized.
2. Fix q and adjust Θ so that the lower-bound gets maximized. For example, during the second step, we compute Θ₁:
Advantages of EM algorithm
It is very easy to implement the first two basic steps of the EM algorithm in various machine learning problems: the E-step and the M-step.
It is mostly guaranteed that likelihood will enhance after each iteration.
It often generates a solution for the M-step in the closed form.
Disadvantages of EM algorithm
The convergence of the EM algorithm is very slow.
It can make convergence for the local optima only.
It takes both forward and backward probability into consideration. It is
opposite to that of numerical optimization, which takes only forward
probabilities.
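To make the E-step/M-step loop concrete, here is a minimal hand-written EM for a two-component 1-D Gaussian mixture; the initial values, data, and iteration count are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

mu = np.array([1.0, 2.0])     # random initial means
var = np.array([1.0, 1.0])    # initial variances
pi = np.array([0.5, 0.5])     # initial mixing weights

for _ in range(50):
    # E-step: responsibility of each Gaussian for each observation
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
           / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

print(mu)   # converges close to the true means 0 and 6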
******************
UNIT V
NEURAL NETWORKS
Perceptron - Multilayer perceptron, activation functions, network training – gradient
descent optimization – stochastic gradient descent, error backpropagation, from
shallow networks to deep networks – Unit saturation (aka the vanishing gradient
problem) – ReLU, hyperparameter tuning, batch normalization, regularization,
dropout.
Fig. 5.1.
Fig. 5.2.
The data scientist uses the activation function to take a subjective decision based
on various problem statements and forms the desired outputs. Activation function
may differ (e.g., Sign, Step, and Sigmoid) in perceptron models by checking whether
the learning process is slow or has vanishing or exploding gradients.
A perceptron model is built from input values, weights and a bias, a net sum, and an activation function. The perceptron model begins with the multiplication of all input values and their weights, then adds these values together to create the weighted sum. This weighted sum is then applied to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.
Fig. 5.3.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply all input values with their corresponding weight values and then add the products to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑ w_i x_i = x_1 w_1 + x_2 w_2 + … + x_n w_n
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑ wi * xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned
weighted sum, which gives us output either in binary form or a continuous value as
follows:
Y = f (∑ w i * x i + b)
Perceptron Function
The perceptron function f(x) is achieved as output by multiplying the input x with the learned weight coefficient w.
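A perceptron's forward pass is only a weighted sum and a step function; the numbers below are invented purely for illustration.

import numpy as np

def perceptron(x, w, b):
    # f(x) = step(w . x + b): fire (1) if the weighted sum exceeds 0
    weighted_sum = np.dot(w, x) + b
    return 1 if weighted_sum > 0 else 0

x = np.array([1.0, 0.0, 1.0])    # inputs
w = np.array([0.6, 0.4, -0.3])   # weights
b = -0.2                         # bias
print(perceptron(x, w, b))       # 1, since 0.6 - 0.3 - 0.2 = 0.1 > 0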
Fig. 5.4.
MLP networks are used in a supervised learning format. A typical learning algorithm for MLP networks is the backpropagation algorithm. A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. MLP is a deep learning method.
The algorithm for the MLP is as follows:
1. Just as with the perceptron, the inputs are pushed forward through the
MLP by taking the dot product of the input with the weights that exist
between the input layer and the hidden layer (WH). This dot product
yields a value at the hidden layer. We do not push this value forward as we
would with a perceptron though.
2. MLPs utilize activation functions at each of their calculated layers. There
are many activation functions to discuss: rectified linear
units (ReLU), sigmoid function, tanh. Push the calculated output at the
current layer through any of these activation functions.
3. Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot
product with the corresponding weights.
4. Repeat steps two and three until the output layer is reached.
5. At the output layer, the calculations will either be used for a back
propagation algorithm that corresponds to the activation function that was
selected for the MLP (in the case of training) or a decision will be made
based on the output (in the case of testing).
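These steps are what scikit-learn's MLPClassifier performs internally; below is a small illustrative run, where the layer size and iteration budget are arbitrary assumed choices.

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# One hidden layer of 16 ReLU units, trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=2000, random_state=0)
print(mlp.fit(X, y).score(X, y))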
Like a single-layer perceptron model, a multi-layer perceptron model has the same model structure but with a greater number of hidden layers. The multi-layer perceptron model is also known as the backpropagation algorithm, which executes in two stages as follows:
Forward Stage: Activation functions start from the input layer in the forward stage and terminate on the output layer.
Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. In this stage, the error between the actual and demanded output is propagated backward, originating at the output layer and ending at the input layer.
Hence, a multi-layer perceptron model is considered as multiple artificial neural networks having various layers, in which the activation function does not remain linear.
Fig. 5.5.
1. Compared with the sigmoid function, the main difference is the output interval. The output interval of tanh is (−1, 1), and the whole function is 0-centric, which is better than sigmoid.
2. The major advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero in the tanh graph.
Fig. 5.6.
Fig. 5.7.
The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
f(z) = max(0, z) = { z if z ≥ 0; 0 if z < 0 }
The ReLU (Rectified Linear Unit) function is an activation function that is
currently more popular compared to other activation functions in deep learning.
Compared with the sigmoid function and the tanh function, it has the
following advantages:
1. When the input is positive, there is no gradient saturation problem.
2. The calculation speed is much faster. The ReLU function has only a linear
relationship. Whether it is forward or backward, it is much faster than
sigmoid and tanh. (Sigmoid and tanh need to calculate the exponent, which
will be slower.)
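The three activation functions discussed above can be compared directly; here is a minimal sketch.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # output in (0, 1); needs an exponent

def tanh(z):
    return np.tanh(z)             # output in (-1, 1); 0-centric

def relu(z):
    return np.maximum(0, z)       # f(z) = z for z >= 0, else 0; no exponent

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))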
In the process of training, we want to start with a badly performing neural network and wind up with a network with high accuracy. In terms of the loss function, we want the loss to be much lower at the end of training. Improving the network is possible because we can change its function by adjusting the weights; we want to find another function that performs better than the initial one. The problem of training is equivalent to the problem of minimizing the loss function. Why minimize loss instead of maximizing? It turns out that the loss is a much easier function to optimize. There are a lot of algorithms that optimize functions. These algorithms can be gradient-based or not, in the sense that they use not only the information provided by the function but also by its gradient.
First, we need to remember what a derivative is with respect to some variable.
Let's take an easy function, f(x) = x. If we remember the rules of calculus from high school, we know that its derivative is one at every value of x. The
derivative is the rate at which our function changes when we take an infinitely small step in the positive direction. Mathematically, it can be written as follows:
ΔL ≈ (∂L/∂x_i) Δx_i
This means that how much our function changes (the left term) approximately equals the derivative of that function with respect to some variable x, multiplied by how much we changed that variable. The approximation is exact when the step we take is infinitely small, and this is a very important concept of the derivative.
Fig. 5.10.
Let's now consider the function f(x) = x². Obviously, the minimum of that function is at the point x = 0, but how would a computer know it? Suppose we start off with some random value of x, say x = 2. The derivative of the function at x = 2 equals 4, which means that if we take a step in the positive direction, our function will change proportionally to 4.
Our derivative only guarantees that the function will decrease if we take an infinitely small step. Generally, you want to control how big a step you take with some kind of hyper-parameter. This hyper-parameter is called the learning rate, and I'll talk about it later. Let's now see what happens if we start at the point x = −2. The derivative now equals −4, which means that if we take a small step in the positive direction, our function will change proportionally to −4, thus it will decrease. That's exactly what we want.
When x > 0, the derivative is greater than zero and we need to go in the negative direction; when x < 0, the derivative is less than zero and we need to go in the positive direction. We always need to take a step in the direction opposite to the derivative. Let's apply the same idea to the gradient. The gradient is a vector which points in some direction in space; in fact, it points in the direction of the steepest increase of the function. Since we want to minimize our function, we take a step in the opposite direction of the gradient.
In a neural network we think of the inputs x and outputs y as fixed numbers. The variables with respect to which we take our derivatives are the weights w, since these are the values we want to change to improve our network. If we compute the gradient of the loss function w.r.t. our weights and take small steps in the opposite direction of the gradient, our loss will gradually decrease until it converges to some local minimum. This algorithm is called Gradient Descent. The rule for updating the weights on each iteration of Gradient Descent is the following:
w_j = w_j – lr · (∂L/∂w_j)
lr in the notation above means learning rate. It is there to control how big a step we take each iteration, and it is the most important hyper-parameter to tune when training neural networks.
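As a minimal sketch of this update rule in Python (assuming the simple one-variable loss L(w) = w² from the example above, so dL/dw = 2w; the learning rate and starting point are illustrative):

def dL_dw(w):
    # Derivative of the assumed loss L(w) = w**2
    return 2.0 * w

w = 2.0      # illustrative starting point
lr = 0.1     # illustrative learning rate
for step in range(50):
    w = w - lr * dL_dw(w)    # w_j = w_j - lr * dL/dw_j
print(w)     # converges toward the minimum at w = 0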
Ingredients:
Artificial Neurons (processing nodes), each composed of:
(many) input connections (dendrites)
a computation unit (nucleus) composed of:
a linear function (ax + b)
an activation function (equivalent to the synapse)
an output (axon)
An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which takes a long time; if it is too high, we may jump over the optimal value.
Fig. 5.11.
A parabolic function with two dimensions (x, y)
In the above graph, the lowest point on the parabola occurs at x = 1. The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum. "y" here is termed the objective function that the gradient descent algorithm operates upon to descend to the lowest point.
Steps in SGD:
1. Find the slope of the objective function with respect to each parameter/feature; in other words, compute the gradient of the function.
2. Pick a random initial value for the parameters. (To clarify, in the parabola example, differentiate "y" with respect to "x". If we had more features like x1, x2, etc., we would take the partial derivative of "y" with respect to each of the features.)
3. Update the gradient function by plugging in the parameter values.
4. Calculate the step size for each feature as: step size = gradient × learning rate.
5. Calculate the new parameters as: new params = old params – step size.
6. Repeat steps 3 to 5 until the gradient is almost 0, as in the sketch below.
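A minimal sketch of these steps, assuming the parabola above is y = (x – 1)² (so its gradient is dy/dx = 2(x – 1)); the learning rate, starting point, and tolerance are illustrative:

def gradient(x):
    # Step 1: slope of the assumed objective y = (x - 1)**2 with respect to x
    return 2.0 * (x - 1.0)

x = 5.0                            # step 2: an illustrative initial value
lr = 0.1                           # illustrative learning rate
while abs(gradient(x)) > 1e-6:     # step 6: repeat until the gradient is almost 0
    step_size = gradient(x) * lr   # steps 3-4: step size = gradient * learning rate
    x = x - step_size              # step 5: new params = old params - step size
print(x)                           # descends to the lowest point at x = 1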
Fig. 5.12.
Definition
1. Calculate the forward phase for each input-output pair (x_d, y_d) and store the results ŷ_d, a_j^k and o_j^k for each node j in layer k by proceeding from layer 0, the input layer, to layer m, the output layer.
2. Calculate the backward phase for each input-output pair (x_d, y_d) and store the results ∂E_d/∂w_ij^k for each weight w_ij^k connecting node i in layer k – 1 to node j in layer k by proceeding from layer m, the output layer, to layer 1, the input layer.
(a) Evaluate the error term for the final layer, δ_1^m, by using the second equation.
(b) Backpropagate the error terms for the hidden layers, δ_j^k, working backwards from the final hidden layer k = m – 1, by repeatedly using the third equation.
(c) Evaluate the partial derivatives of the individual error E_d with respect to w_ij^k by using the first equation.
3. Combine the individual gradients ∂E_d/∂w_ij^k for each input-output pair to get the total gradient ∂E(X, θ)/∂w_ij^k for the entire set of input-output pairs X = {(x_1, y_1), …, (x_N, y_N)} by using the fourth equation (a simple average of the individual gradients).
4. Update the weights according to the learning rate α and total gradient ∂E(X, θ)/∂w_ij^k by using the fifth equation (moving in the direction of the negative gradient).
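To make the forward and backward phases concrete, the following is a minimal numpy sketch of one backpropagation pass for a two-layer sigmoid network on a single input-output pair; the shapes, random data, and learning rate are illustrative assumptions, not values from the text:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))       # input (layer 0)
y = np.array([[1.0]])             # desired output
W1 = rng.normal(size=(4, 3))      # weights into the hidden layer
W2 = rng.normal(size=(1, 4))      # weights into the output layer
lr = 0.5                          # illustrative learning rate

# Forward phase: store activations a and outputs o for each layer.
a1 = W1 @ x;  o1 = sigmoid(a1)
a2 = W2 @ o1; o2 = sigmoid(a2)

# Backward phase: error term for the final layer, then backpropagate it.
delta2 = (o2 - y) * o2 * (1 - o2)         # final-layer error term
delta1 = (W2.T @ delta2) * o1 * (1 - o1)  # hidden-layer error terms

# Partial derivatives of the error w.r.t. each weight, then update the
# weights by moving in the direction of the negative gradient.
W2 -= lr * (delta2 @ o1.T)
W1 -= lr * (delta1 @ x.T)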
Types of Backpropagation
There are two types of Backpropagation which are as follows −
Static Back Propagation − In this type of backpropagation, the static output is
created because of the mapping of static input. It is used to resolve static
classification problems like optical character recognition.
Recurrent Backpropagation − In recurrent backpropagation, activations are fed forward until a specific determined value or threshold is reached. After that value is reached, the error is evaluated and propagated backward.
Key Points
It simplifies the network structure by eliminating weighted links that have the least effect on the trained network.
You need to study a group of input and activation values to develop the relationship between the input and hidden unit layers.
It helps to assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented as rules.
Backpropagation is especially useful for deep neural networks working on error-prone projects, such as image or speech recognition.
Backpropagation's use of the chain and power rules allows it to function with any number of outputs.
The problem:
As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
Why:
Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output, and hence the derivative becomes small.
Fig. 5.13.
When many such layers are stacked, these small derivatives are multiplied together, so the gradient decreases exponentially as we propagate down to the initial layers. A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network.
Solution:
The simplest solution is to use other activation functions, such as ReLU, which doesn't cause a small derivative. Residual networks are another solution, as they provide residual connections straight to earlier layers. The residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x). This residual connection doesn't go through activation functions that "squash" the derivatives, resulting in a higher overall derivative of the block, as sketched after the figure below.
Fig. 5.14.
A residual block
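A minimal numpy sketch of the F(x) + x idea, assuming a simple two-layer block; the layer shapes and weight initialization are illustrative:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # F(x): two linear layers with a ReLU in between (one common form).
    f = relu(x @ w1) @ w2
    # The skip connection adds the block input x straight to F(x), so the
    # gradient has an identity path that no activation can squash.
    return relu(f + x)

x = np.ones(8)
w1 = np.random.randn(8, 8) * 0.1
w2 = np.random.randn(8, 8) * 0.1
print(residual_block(x, w1, w2))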
Finally, batch normalization layers can also resolve the issue. As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear. This is most clearly seen where |x| is big. Batch normalization reduces the problem by simply normalizing the input so that |x| doesn't reach the outer edges of the sigmoid function: it normalizes the input so that most of it falls in the region where the derivative isn't too small.
The rectified linear activation unit, or ReLU, is one of the few landmarks in the deep learning revolution. It's simple, yet it's far superior to previous activation functions like sigmoid or tanh.
ReLU formula is: f(x) = max(0, x)
Fig. 5.16.
Both the ReLU function and its derivative are monotonic. If the function receives any negative input, it returns 0; however, if the function receives any positive value x, it returns that value. As a result, the output has a range of 0 to infinity. ReLU is the most often used activation function in neural networks, especially CNNs, and is utilized as the default activation function.
def relu(x):
    return max(0.0, x)
We see from the plot that all the negative values have been set to zero, and the positive values are returned as they are. Note that we've given a set of consecutively increasing numbers as input, so we have a linear output with an increasing slope.
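For instance, calling the function above on a range of consecutively increasing integers (an illustrative input) reproduces this behaviour:

inputs = list(range(-5, 6))
outputs = [relu(x) for x in inputs]
print(outputs)   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 2, 3, 4, 5]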
Fig. 5.17.
Advantages of ReLU:
ReLU is used in the hidden layers instead of sigmoid or tanh, as using sigmoid or tanh in the hidden layers leads to the infamous problem of the "Vanishing Gradient": it prevents the earlier layers from learning important information when the network is backpropagating. The sigmoid, which is a logistic function, is preferable for regression or binary classification problems, and then only in the output layer, as the output of a sigmoid function ranges from 0 to 1. Also, sigmoid and tanh saturate and have lesser sensitivity.
Some of the advantages of ReLU are:
Simpler Computation: the derivative remains constant, i.e., 1 for a positive input, which reduces the time taken for the model to learn and to minimize the errors.
Representational Sparsity: it is capable of outputting a true zero value.
Linearity: linear activation functions are easier to optimize and allow for a smooth flow, so ReLU is best suited for supervised tasks on large sets of labelled data.
Disadvantages of ReLU:
Exploding Gradient: this occurs when the gradient gets accumulated, causing large differences in the subsequent weight updates. This results in instability when converging to the global minimum, and instability in the learning process too.
Dying ReLU: the problem of "dead neurons" occurs when a neuron gets stuck on the negative side and constantly outputs zero. Because the gradient of 0 is also 0, it is unlikely for the neuron to ever recover. This happens when the learning rate is too high or the negative bias is quite large. A common mitigation is sketched below.
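One common mitigation for the dying-ReLU problem is Leaky ReLU (it also appears in the activation-function table later); a minimal sketch, with the negative-side slope 0.1 matching that table:

def leaky_relu(x, alpha=0.1):
    # Unlike ReLU, negative inputs keep a small slope alpha, so the
    # gradient is never exactly zero and a "dead" neuron can recover.
    return max(alpha * x, x)

print(leaky_relu(-2.0))   # -0.2 instead of ReLU's 0.0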
Grid Search CV
In the GridSearchCV approach, the machine learning model is evaluated over a range of hyperparameter values. This approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values. For example, suppose we want to set the two hyperparameters C and Alpha of a Logistic Regression Classifier model, each with a different set of values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C = 0.3 and Alpha = 0.2 gives the highest performance score of 0.726, and is therefore selected.
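The code behind the output below is not reproduced in the text; a minimal GridSearchCV sketch of the same approach might look as follows (the dataset and the grid of C values are illustrative assumptions, so this sketch will not print exactly the values shown):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)    # illustrative dataset
param_grid = {"C": np.logspace(-5, 8, 15)}    # illustrative grid of C values
logreg_cv = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
logreg_cv.fit(X, y)                           # fits one model per grid point
print("Tuned Logistic Regression Parameters:", logreg_cv.best_params_)
print("Best score is", logreg_cv.best_score_)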
Output:
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381} Best score is 0.7708333333333334
Drawback:
GridSearchCV goes through all the intermediate combinations of hyperparameters, which makes grid search computationally very expensive.
9. Randomized Search CV
RandomizedSearchCV solves the drawback of GridSearchCV, as it goes through only a fixed number of hyperparameter settings. It moves within the grid in a random fashion to find the best set of hyperparameters. This approach reduces unnecessary computation.
The following code illustrates how to use RandomizedSearchCV:
# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
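The imports alone do not run a search; a minimal continuation might look as follows (the dataset and parameter distributions are illustrative assumptions, so the printed values in the output below come from the text's original run, not from this sketch):

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)    # illustrative dataset
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Unlike GridSearchCV, only n_iter randomly sampled settings are tried.
tree_cv = RandomizedSearchCV(DecisionTreeClassifier(), param_dist,
                             cv=5, n_iter=10, random_state=0)
tree_cv.fit(X, y)
print("Tuned Decision Tree Parameters:", tree_cv.best_params_)
print("Best score is", tree_cv.best_score_)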
Output:
Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3, 'max_features': 5, 'criterion': 'gini'} Best score is 0.7265625
L = Number of layers
Bias = 0
Activation Function: Sigmoid (σ)
Initially, our inputs X1, X2, X3, X4 are in normalized form, as they come from the pre-processing stage. When the input passes through the first layer, it is transformed: the sigmoid function is applied over the dot product of the input X and the weight matrix W1:
h1 = σ(W1X)
Similarly, this transformation takes place in the second layer, and continues up to the last layer L, as shown in the following image.
h1 = σ(W1X)
h2 = σ(W2h1) = σ(W2σ(W1X))
…
O = σ(WL hL–1)
Although our input X was normalized, with time the output will no longer be on the same scale. As the data go through multiple layers of the neural network, with L activation functions applied, this leads to an internal covariate shift in the data.
Here, m is the number of neurons in layer h. Once we have the mean μ of the hidden activations, the next step is to calculate their standard deviation:

σ = [ (1/m) Σ_i (h_i – μ)² ]^(1/2)
Further, once we have the mean and the standard deviation ready, we normalize the hidden activations using these values: we subtract the mean from each input and divide the whole value by the sum of the standard deviation and the smoothing term (ε). The smoothing term (ε) assures numerical stability within the operation by preventing division by zero.

h_i(norm) = (h_i – μ) / (σ + ε)
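A minimal numpy sketch of these two equations (the activation values and ε are illustrative):

import numpy as np

def batch_norm(h, eps=1e-5):
    # mu: mean of the m hidden activations; sigma: their standard deviation.
    mu = h.mean()
    sigma = np.sqrt(np.mean((h - mu) ** 2))
    # Subtract the mean and divide by (sigma + eps); eps prevents division by zero.
    return (h - mu) / (sigma + eps)

h = np.array([0.2, 1.5, -0.7, 3.0])   # illustrative hidden activations
print(batch_norm(h))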
When a model performs well on the training data but does not perform well on the testing data, the model is said to have a high generalization error. In other words, in such a scenario the model has low bias and high variance and is too complex. This is called overfitting. Overfitting means that the model fits the training data well compared to the test data, and it is also a result of the model being too complex.
Fig. 5.18.
In the above graph, the two lines represent the relationship between total years of experience and salary, where salary is the target variable. Their slopes indicate the change in salary per unit change in total years of experience. As the slope b1 + b3 decreases to the slope b1, we see that the salary becomes less sensitive to the total years of experience. By decreasing the slope, the target variable (salary) became less sensitive to changes in the independent X variables, which introduces bias into the model. Remember, bias is the difference between the predicted and the actual values.
With the increase in bias in the model, the variance (which is the difference between the predictions when the model fits different datasets) decreases. And by decreasing the variance, the overfitting gets reduced. Models having higher variance lead to overfitting, and as we saw above, we shrink or reduce the beta coefficients to overcome the overfitting. The beta coefficients or the weights of the features converge towards zero; this is known as shrinkage.
The goal of the linear regression model is to minimize the loss function. With regularization, the goal becomes to minimize the following cost function:

Σ_{i=1}^{n} (y_act – y_pred)² + penalty

where the penalty term comprises the regularization parameter and the weights associated with the variables. Hence, the penalty term is:
penalty = λ * w
where,
λ = the regularization parameter
w = the weights associated with the variables, generally measured by an L-p norm
The regularization parameter λ imposes a higher penalty on variables having higher values, and hence it controls the strength of the penalty term. This tuning parameter controls the bias-variance trade-off.
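A minimal sketch of this penalized cost in practice, using scikit-learn's Ridge (the L2 case, where alpha plays the role of λ; the synthetic data are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    # A larger alpha (lambda) means a stronger penalty, so the
    # coefficients shrink toward zero.
    print(alpha, model.coef_.round(2))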
Fig. 5.19.
Dropout is used as a regularization technique: it prevents overfitting by ensuring that no units are codependent (more on this later).
Weight decay: incentivize the network to use smaller weights by adding a penalty to the loss function (this ensures that the norms of the weights are relatively evenly distributed amongst all the weights in the network, which prevents just a few weights from heavily influencing the network output).
Noise: allow some random fluctuations in the data through augmentation (which makes the network robust to a larger distribution of inputs and hence improves generalization).
Model combination: average the outputs of separately trained neural networks (requires a lot of computational power, data, and time).
Dropout remains an extremely popular protective measure against overfitting because of its efficiency and effectiveness; a minimal sketch follows.
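A minimal numpy sketch of (inverted) dropout; the drop probability p and the activations are illustrative:

import numpy as np

def dropout(activations, p=0.5, training=True):
    # During training, zero each unit independently with probability p and
    # scale the survivors by 1/(1-p) so that expected activations match
    # what the full network produces at test time.
    if not training or p == 0.0:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask

a = np.ones((2, 4))
print(dropout(a, p=0.5))   # roughly half the units become 0.0, the rest 2.0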
Fig. 5.20.
Fig. 5.21.
It can be observed in figure (a) that the units don't seem to pick up on any meaningful feature, whereas in figure (b) the units seem to have picked up on distinct edges and spots in the data provided to them. This indicates that dropout helps break co-adaptations among units, and each unit can act more independently when dropout regularization is used. In other words, without dropout, the network would never catch unit A compensating for another unit B's flaws. With dropout, at some point unit A would be ignored and the training accuracy would decrease as a result, exposing the inaccuracy of unit B.
Activation Functions
Sigmoid:     σ(x) = 1 / (1 + e^(–x))
tanh:        tanh(x)
ReLU:        max(0, x)
Leaky ReLU:  max(0.1x, x)
Maxout:      max(w1ᵀx + b1, w2ᵀx + b2)
ELU:         x for x ≥ 0;  α(e^x – 1) for x < 0
5. Define an Activation function
An activation function is a function that is added to an artificial neural network in order to help the network learn complex patterns in the data. When comparing with a neuron-based model as in our brains, the activation function is what finally decides what is to be fired to the next neuron. The neuron on its own does not know how to bound the value and thus cannot decide the firing pattern; the activation function is therefore an important part of an artificial neural network.
6. How do you train a neural network?
In the process of training, we want to start with a badly performing neural network and wind up with a network with high accuracy. In terms of the loss function, we want the loss to be much lower at the end of training. Improving the network is possible because we can change its function by adjusting the weights: we want to find another function that performs better than the initial one.
7. What is the objective of Gradient Descent?
Gradient, in plain terms, means the slope or slant of a surface. So gradient descent literally means descending a slope to reach the lowest point on that surface.
8. List the types of Gradient Descent:
Typically, there are three types of Gradient Descent:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
Calculate the error in the outputs:
ErrorB = Actual Output – Desired Output
Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
14. List the Types of Backpropagation
There are two types of Backpropagation, which are as follows:
Static Back Propagation − In this type of backpropagation, the static output is created because of the mapping of static input. It is used to resolve static classification problems like optical character recognition.
Recurrent Backpropagation − In recurrent backpropagation, activations are fed forward until a specific determined value or threshold is reached. After that value is reached, the error is evaluated and propagated backward.
15. What is Unit saturation (vanishing gradient problem)?
The vanishing gradient problem is an issue that sometimes arises when training machine learning algorithms through gradient descent. It most often occurs in neural networks that have several layers, such as in a deep learning system, but it also occurs in recurrent neural networks.
The key point is that the calculated partial derivatives used to compute the gradient become smaller and smaller as one goes deeper into the network. Since the gradients control how much the network learns during training, if the gradients are very small or zero, little to no training can take place, leading to poor predictive performance.
16. Define ReLU
The rectified linear activation unit, or ReLU, is one of the few landmarks in the deep learning revolution. It's simple, yet it's far superior to previous activation functions like sigmoid or tanh.
ReLU formula is: f(x) = max(0, x)
17. State Hyperparameter tuning
A Machine Learning model is defined as a mathematical model with a number of parameters that need to be learned from the data. By training a model with existing data, we are able to fit the model parameters. However, there is another kind of parameter, known as a hyperparameter, which cannot be directly learned from the regular training process and must be set before training begins.
*********************
Model Question Paper - 1
B.E./B.Tech. DEGREE EXAMINATION
Fourth Semester
CS3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
(Regulations 2021)
Time: Three Hours Maximum: 100 Marks
Answer ALL Questions
PART – A (10 × 2 = 20 Marks)
1. What is A.I.?
Artificial Intelligence is a branch of computer science that deals with developing intelligent machines which can behave like humans, think like humans, and have the ability to make decisions on their own.
Artificial Intelligence is a combination of two words, Artificial and Intelligence, which refers to man-made intelligence. Therefore, when machines are equipped with man-made intelligence to perform intelligent tasks similar to humans, it is known as Artificial Intelligence.
2. Differentiate between Super A.I. and Weak A.I.
S.No | Weak AI | Super AI
1 | Narrow AI is a type of AI which is able to perform a dedicated task with intelligence. The most common and currently available AI in the world of Artificial Intelligence is Narrow AI. | Super AI is a level of intelligence of systems at which machines could surpass human intelligence and can perform any task better than humans, with cognitive properties. It is an outcome of general AI.
2 | Narrow AI cannot perform beyond its field or limitations, as it is only trained for one specific task; hence it is also termed weak AI. Narrow AI can fail in unpredictable ways if it goes beyond its limits. | Some key characteristics of strong AI include the ability to think, reason, solve puzzles, make judgments, plan, learn, and communicate on its own.
the steepness of this slope. Further, this slope will inform the updates to the
parameters (weights and bias).
8. Define Unsupervised Learning.
Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
9. Define a Stochastic Gradient Descent (SGD):
In Stochastic Gradient Descent, a few samples are selected randomly instead
of the whole data set for each iteration. In Gradient Descent, there is a term
called “batch” which denotes the total number of samples from a dataset that is
used for calculating the gradient for each iteration. In typical Gradient Descent
optimization, like Batch Gradient Descent, the batch is taken to be the whole
dataset.
10. Write an algorithm for stochastic gradient descent.
1. The algorithm starts at a random point by initializing the weights with
random values
2. Then it calculates the gradients at that random point
3. Then it moves in the opposite direction of the gradient
4. The process continues to repeat itself until it finds the point of minimum
loss
PART - B (5 × 13 = 65 Marks)
11. (a) Write about CSP with suitable example?
Ans: Refer Page No.1.36
[OR]
(b) Explain the types of Uninformed search algorithms?
Ans: Refer Page No.1.13
12. (a) Explain Bayes’ theorem with example.
Ans: Refer Page No.2.5
[OR]
(b) Illustrate causal network with neat diagram.
Ans: Refer Page No.2.22
PART - C (1 × 15 = 15 Marks)
16. (a) Solve the Cryptarithmetic problem (CSP): SEND + MORE = MONEY.
Ans: Refer Page No. 1.38
[OR]
(b) Explain Bayes’ theorem with example.
Ans: Refer Page No.2.10
*********************
tanh:        tanh(x)
Maxout:      max(w1ᵀx + b1, w2ᵀx + b2)
ReLU:        max(0, x)
ELU:         x for x ≥ 0;  α(e^x – 1) for x < 0
PART - B (5 × 13 = 65 Marks)
11. (a) Explain Hill Climbing Search algorithm.
Ans: Refer Page No.1.29
[OR]
(b) What are the applications of A.I.?
Ans: Refer Page No.1.7
12. (a) Explain Working of Naïve Bayes' Classifier with example.
Ans: Refer Page No.2.11
[OR]
(b) Explain in detail about approximate inference in Bayesian network.
Ans: Refer Page No.2.21
13. (a) Explain in detail about Maximum Margin Classifier?
Ans: Refer Page No.3.52
[OR]
(b) Write in detail about Decision tree?
Ans: Refer Page No.3.55
14. (a) Compare the Advanced Ensemble techniques.
Ans: Refer Page No.4.4
[OR]
(b) Explain the Working of Unsupervised Learning.
Ans: Refer Page No.4.11
15. (a) Clarify how to train a neural network.
Ans: Refer Page No.5.11
[OR]
(b) Explain Stochastic gradient descent.
Ans: Refer Page No.5.16
PART - C (1 × 15 = 15 Marks)
16. (a) Explain the types of Uninformed search and Informed search algorithms.
Ans: Refer Page No.1.13
[OR]
(b) Demonstrate Gaussian Mixture Model (GMM).
Ans: Refer Page No.4.17
*********************