AI TYCS Sem-V
Compiled By:
Asst. Prof. Megha Sharma
For video lectures, visit my YouTube channel: OMega TechEd.
References:
Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Third Edition.
www.javatpoint.com
Unit-1
Introduction to AI and Intelligent Agents
What Is AI: Foundations, History and State of the Art of AI
Intelligent Agents: Agents and Environments, Nature of Environments,
Structure of Agents.
Problem Solving by searching: Problem-Solving Agents, Uninformed
Search Strategies, Informed (Heuristic) Search Strategies
Introduction to AI
Approaches to AI
1. Acting humanly: The Turing Test approach:
In 1950, Alan Turing introduced a test to check whether a machine can think like a human or not; this test is known as the Turing Test. In this test, Turing proposed that a computer can be said to be intelligent if it can mimic human responses under specific conditions.
The test result does not depend on each answer being correct, but only on how close the responses are to human answers. The computer is permitted to do everything possible to force a wrong identification by the interrogator.
To pass the test, the computer would need to possess the following capabilities: natural language processing, knowledge representation, automated reasoning, and machine learning.
2. Thinking humanly: The cognitive modeling approach.
If we are going to say that a given program thinks like a human, we must have
some way of determining how humans think. We need to get inside the actual
workings of human minds. There are three ways to do this: through
introspection—trying to catch our own thoughts as they go by; through
psychological experiments—observing a person in action; and through brain
imaging—observing the brain in action. Once we have a sufficiently precise
theory of the mind, it becomes possible to express the theory as a computer
program. If the program’s input–output behavior matches corresponding human
behavior, that is evidence that some of the program’s mechanisms could also be
operating in humans.
3. Thinking rationally: The “laws of thought” approach
The Greek philosopher Aristotle was one of the first to attempt to codify “right
thinking,” that is, irrefutable reasoning processes. His syllogisms provided
patterns for argument structures that always yielded correct conclusions when
given correct premises—for example, “Socrates is a man; all men are mortal;
therefore, Socrates is mortal.” These laws of thought were supposed to govern
the operation of the mind; their study initiated the field called logic.
4. Acting rationally: The rational agent approach
An agent is just something that acts (agent comes from the Latin agere, to do).
Of course, all computer programs do something, but computer agents are
expected to do more: operate autonomously, perceive their environment, persist
over a prolonged time, adapt to change, and create and pursue goals. A rational
agent is one that acts to achieve the best outcome or, when there is uncertainty,
the best expected outcome. The rational-agent approach has two advantages
over the other approaches.
First, it is more general than the “laws of thought” approach because correct
inference is just one of several possible mechanisms for achieving rationality.
Second, it is more amenable to scientific development than are approaches
based on human behavior or human thought. The standard of rationality is
mathematically well defined and completely general and can be “unpacked” to
generate agent designs that provably achieve it. Human behavior, on the other
hand, is adapted for one specific environment and is defined by, well, the sum
of all the things that humans do.
Application of AI
Following are some sectors which have the application of Artificial Intelligence:
1. AI in Astronomy
o Artificial Intelligence can be very useful for solving complex problems about the
universe. AI technology can help us understand the universe, such as how it
works and its origin.
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare
industry and is going to have a significant impact on it.
o Healthcare Industries are applying AI to make a better and faster diagnosis than
humans. AI can help doctors with diagnoses and can inform when patients are
worsening so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games
like chess, where the machine needs to think of a large number of possible positions.
4. AI in Finance
o AI and finance industries are the best matches for each other. The finance
industry is implementing automation, chatbot, adaptive intelligence, algorithm
trading, and machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company and cyber-attacks are growing
very rapidly in the digital world. AI can be used to make your data safer and
more secure. Examples such as the AEG bot and the AI2 platform are used to
detect software bugs and cyber-attacks more effectively.
6. AI in social media
o Social Media sites such as Facebook, Twitter, and Snapchat contain billions of
user profiles, which need to be stored and managed in a very efficient way. AI
can organize and manage massive amounts of data. AI can analyze lots of data to
identify the latest trends, hashtags, and requirements of different users.
7. AI in Automotive Industry
o Some automotive industries are using AI to provide virtual assistants to their
users for better performance. For example, Tesla has introduced TeslaBot, an
intelligent virtual assistant.
o Various industries are currently working to develop self-driving cars which can
make your journey safer and more secure.
8. AI in Robotics:
o Artificial Intelligence has a remarkable role in robotics. Usually, general robots
are programmed to perform some repetitive tasks, but with the help of AI we can
create intelligent robots which can perform tasks from their own experience,
without being pre-programmed.
o Humanoid robots are among the best examples of AI in robotics; recently, the
intelligent humanoid robots named Erica and Sophia have been developed, which
can talk and behave like humans.
9. AI in Entertainment
o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI
algorithms, these services show recommendations for programs or shows.
10. AI in Agriculture
o Agriculture is an area which requires various resources: labor, money, and time
for the best result. Nowadays agriculture is becoming digital, and AI is emerging in
this field. Agriculture is applying AI in the form of agriculture robotics, soil and crop
monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.
11. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is
becoming more demanding in the e-commerce business. AI is helping shoppers
to discover associated products with recommended size, color, or even brand.
12. AI in Education:
o AI can automate grading so that the tutor can have more time to teach. AI chatbot
can communicate with students as a teaching assistant.
o AI in the future can be used as a personal virtual tutor for students, which will be
accessible easily at any time and any place.
History of AI
Artificial Intelligence is not a new term or a new technology for researchers. The idea
is much older than you might imagine: there are even myths of mechanical men in
ancient Greek and Egyptian mythology.
The following are some milestones in the history of AI which trace the journey from
AI's beginnings to its development today.
o Year 1943: The first work which is now recognized as AI was done by Warren
McCulloch and Walter Pitts in 1943. They proposed a model of artificial
neurons.
o Year 1949: Donald Hebb demonstrated an updating rule for modifying the
connection strength between neurons. His rule is now called Hebbian learning.
o Year 1950: Alan Turing, an English mathematician who pioneered machine
learning, published "Computing Machinery and Intelligence" in 1950, in which he
proposed a test that can check a machine's ability to exhibit intelligent behavior
equivalent to human intelligence, called the Turing test.
Around that time, high-level computer languages such as FORTRAN, LISP, and COBOL
were invented, and enthusiasm for AI was very high.
A boom of AI (1980-1987)
o Year 1980: After the AI winter, AI came back with "Expert Systems".
Expert systems were programmed to emulate the decision-making ability of a
human expert.
o In the year 1980, the first national conference of the American Association for
Artificial Intelligence (AAAI) was held at Stanford University.
Now AI has developed to a remarkable level. The concept of Deep learning, big data,
and data science are now trending like a boom. Nowadays companies like Google,
Facebook, IBM, and Amazon are working with AI and creating amazing devices. The
future of Artificial Intelligence is inspiring and will come with high intelligence.
Types of Agents
Agents can be grouped into five classes based on their degree of perceived intelligence
and capability. All these agents can improve their performance and generate better
actions over time. These are given below:
1. Simple Reflex Agents
o Simple reflex agents are the simplest agents. They choose actions only on the
basis of the current percept, ignoring the rest of the percept history, and work on
condition-action rules.
2. Model-based Reflex Agents
o A model-based agent can work in a partially observable environment. It keeps
track of the part of the world it cannot currently see by maintaining an internal
state, based on a model of how the world evolves.
3. Goal-based Agents
o Goal-based agents expand on model-based agents by using goal information that
describes desirable situations; they choose actions that help achieve their goals.
4. Utility-based agents
o These agents are like the goal-based agent but provide an extra component of
utility measurement which makes them different by providing a measure of
success at a given state.
o The utility-based agent acts based not only on goals but also on the best way to
achieve the goal.
o The Utility-based agent is useful when there are multiple possible alternatives,
and an agent must choose to perform the best action.
o The utility function maps each state to a real number to check how efficiently
each action achieves the goals.
5. Learning Agents
o A learning agent in AI is the type of agent which can learn from its past
experiences, or it has learning capabilities.
o It starts acting with basic knowledge and is then able to act and adapt
automatically through learning.
o A learning agent has mainly four conceptual components, which are:
a. Learning element: It is responsible for making improvements by learning
from the environment.
b. Critic: The learning element takes feedback from the critic, which describes
how well the agent is doing with respect to a fixed performance standard.
c. Performance element: It is responsible for selecting external action.
d. Problem generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.
o Hence, learning agents can learn, analyze performance, and look for new ways to
improve the performance.
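As a rough illustration of these four components, below is a minimal Python sketch. All class, method, and action names here are illustrative assumptions, not from the textbook.

class LearningAgent:
    def __init__(self):
        self.knowledge = {}                      # starts with basic knowledge

    def performance_element(self, percept):
        # selects an external action based on current knowledge
        return self.knowledge.get(percept, "default_action")

    def critic(self, percept, action, performance_standard):
        # gives feedback on how well the agent is doing
        # with respect to a fixed performance standard
        return performance_standard(percept, action)

    def learning_element(self, percept, feedback):
        # makes improvements using the critic's feedback
        if feedback < 0:
            self.knowledge[percept] = "try_something_else"

    def problem_generator(self):
        # suggests actions that lead to new and informative experiences
        return "exploratory_action"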
Intelligent Agents:
Rational Agent:
A rational agent is an agent which has clear preference, models uncertainty, and acts in
a way to maximize its performance measure with all possible actions.
A rational agent is said to perform the right things. AI is about creating rational agents
for use in game theory and decision theory in various real-world scenarios.
PEAS Representation
o P: Performance measure
o E: Environment
o A: Actuators
o S: Sensors
Here performance measure is the objective for the success of an agent's behavior.
Examples of agents with their PEAS representation:

1. Medical Diagnose
   Performance measure: Healthy patient, Minimized cost
   Environment: Patient, Hospital, Staff
   Actuators: Tests, Treatments
   Sensors: Keyboard (entry of symptoms)

2. Vacuum Cleaner
   Performance measure: Cleanness, Efficiency, Battery life, Security
   Environment: Room, Table, Wood floor, Carpet, Various obstacles
   Actuators: Wheels, Brushes, Vacuum extractor
   Sensors: Camera, Dirt detection sensor, Cliff sensor, Bump sensor, Infrared wall sensor

3. Part-picking Robot
   Performance measure: Percentage of parts in correct bins
   Environment: Conveyor belt with parts, Bins
   Actuators: Jointed arms, Hand
   Sensors: Camera, Joint angle sensors
Agent Environment
An environment is everything in the world which surrounds the agent, but it is not a
part of an agent itself. An environment can be described as a situation in which an agent
is present.
The environment is where the agent lives and operates, and it provides the agent with
something to sense and act upon. The environment is mostly said to be
non-deterministic.
Types of Environments
As per Russell and Norvig, an environment can have various features from the point of
view of an agent:
1. Fully observable vs Partially observable:
o If an agent's sensors give it access to the complete state of the environment at
each point in time, then the environment is fully observable; otherwise, it is
partially observable.
2. Deterministic vs Stochastic:
o If an agent's current state and selected action can completely determine the next
state of the environment, then such an environment is called a deterministic
environment.
o A stochastic environment is random in nature and cannot be determined
completely by an agent.
o In a deterministic, fully observable environment, agents do not need to worry
about uncertainty.
3. Episodic vs Sequential:
o In an episodic environment, there is a series of one-shot actions, and only the
current percept is required for the action.
o However, in Sequential environment, an agent requires memory of past actions to
determine the next best actions.
4. Single-agent vs Multi-agent:
o If only one agent is involved in an environment, and operating by itself then such
an environment is called single agent environment.
o However, if multiple agents are operating in an environment, then such an
environment is called a multi-agent environment.
o The agent design problems in the multi-agent environment are different from
single agent environment.
5. Static vs Dynamic:
o If the environment can change itself while an agent is deliberating, then such an
environment is called a dynamic environment or else it is called a static
environment.
o Static environments are easy to deal with because an agent does not need to
continue looking at the world while deciding for an action.
o However, for dynamic environment, agents need to keep looking at the world at
each action.
o Taxi driving is an example of a dynamic environment whereas Crossword
puzzles are an example of a static environment.
6. Discrete vs Continuous:
o If in an environment there are a finite number of percepts and actions that can be
performed within it, then such an environment is called a discrete environment or
else it is called continuous environment.
o A chess game comes under discrete environment as there is a finite number of
moves that can be performed.
o A self-driving car is an example of a continuous environment.
7. Known vs Unknown
o Known and unknown are not actually a feature of an environment, but it is an
agent's state of knowledge to perform an action.
o In a known environment, the results for all actions are known to the agent. While
in an unknown environment, agent needs to learn how it works to perform an
action.
o It is quite possible for a known environment to be partially observable and for an
unknown environment to be fully observable.
8. Accessible vs Inaccessible
o If an agent can obtain complete and accurate information about the state's
environment, then such an environment is called an Accessible environment or
else it is called inaccessible.
o An empty room whose state can be defined by its temperature is an example of
an accessible environment.
o Information about an event on earth is an example of Inaccessible environment.
Problem-solving agents:
Problem Formulation:
A problem can be defined formally by five components:
➢ The initial state that the agent starts in
➢ A description of the possible actions available to the agent.
➢ A description of what each action does; the formal name for this is the
transition model.
Together, the initial state, actions, and transition model implicitly define the
state space of the problem—the set of all states reachable from the initial
state by any sequence of actions.
➢ The goal test, which determines whether a given state is a goal state.
Sometimes there is an explicit set of possible goal states, and the test
simply checks whether the given state is one of them.
➢ A path cost function that assigns a numeric cost to each path. The
problem-solving agent chooses a cost function that reflects its own
performance measure.
A solution to a problem is an action sequence that leads from the initial
state to a goal state. Solution quality is measured by the path cost
function, and an optimal solution has the lowest path cost among all
solutions.
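As a rough sketch, the five components can be written down as a small Python class; the graph encoding and all names below are illustrative assumptions, not a standard API.

class Problem:
    def __init__(self, initial_state, goal_states, graph, costs):
        self.initial_state = initial_state     # 1. initial state
        self.goal_states = goal_states
        self.graph = graph                     # {state: {action: next_state}}
        self.costs = costs                     # {(state, action): step cost}

    def actions(self, state):                  # 2. possible actions
        return list(self.graph.get(state, {}))

    def result(self, state, action):           # 3. transition model
        return self.graph[state][action]

    def goal_test(self, state):                # 4. goal test
        return state in self.goal_states

    def step_cost(self, state, action):        # 5. path cost is the sum of step costs
        return self.costs.get((state, action), 1)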
Example: The 8-Queens Problem
The goal of the 8-queens problem is to place eight queens on a chessboard such
that no queen attacks any other.
There are two main kinds of formulation. An incremental formulation involves
operators that augment the state description, starting with an empty state; for the
8-queens problem, this means that each action adds a queen to the state. A
complete-state formulation starts with all 8 queens on the board and moves
them around. In either case, the path cost is of no interest because only the final
state counts. The first incremental formulation one might try is the following:
• States: Any arrangement of 0 to 8 queens on the board is a state.
• Initial state: No queens on board.
• Actions: Add a queen to any empty square.
• Transition model: Returns the board with a queen added to the specified
square.
• Goal test: 8 queens are on the board, none attacked.
In this formulation, we have 64 · 63 ··· 57 ≈ 1.8 × 10^14 possible sequences to
investigate.
A better formulation would prohibit placing a queen in any square that is
already attacked:
• States: All possible arrangements of n queens (0 ≤ n ≤ 8), one per column in
the leftmost n columns, with no queen attacking another.
• Actions: Add a queen to any square in the leftmost empty column such that it
is not attacked by any other queen. This formulation reduces the 8-queens state
space from 1.8 × 10^14 to just 2,057, and solutions are easy to find. On the other
hand, for 100 queens the reduction is from roughly 10^400 states to about 10^52
states.
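Below is a minimal Python sketch of this improved formulation: queens are placed one per column, left to right, only on squares not attacked by earlier queens. The backtracking loop is an illustrative addition, not part of the formulation itself.

def attacks(r1, c1, r2, c2):
    # two queens attack each other on the same row or the same diagonal
    return r1 == r2 or abs(r1 - r2) == abs(c1 - c2)

def legal_rows(state):
    # state = tuple of row positions for the leftmost len(state) columns
    col = len(state)
    return [r for r in range(8)
            if not any(attacks(r, col, state[c], c) for c in range(col))]

def solve(state=()):
    if len(state) == 8:              # goal test: 8 non-attacking queens placed
        return state
    for r in legal_rows(state):      # actions: add a queen to a safe square
        result = solve(state + (r,))
        if result:
            return result
    return None

print(solve())    # one non-attacking arrangement, e.g. (0, 4, 7, 5, 2, 6, 1, 3)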
Based on the search problems we can classify the search algorithms into uninformed
(Blind search) search and informed search (Heuristic search) algorithms.
Uninformed/Blind Search:
The uninformed search does not use any domain knowledge, such as the closeness or
location of the goal. It operates in a brute-force way, as it only includes information
about how to traverse the tree and how to identify leaf and goal nodes. Uninformed
search traverses the search tree without any information about the search space (such
as the initial state, operators, and a test for the goal), so it is also called blind
search. It examines each node of the tree until it reaches the goal node.
It can be divided into six main types:
o Breadth-first search
o Depth-first search
o Depth-limited search
o Uniform cost search
o Iterative deepening depth-first search
o Bidirectional search
1. Breadth-first Search:
o Breadth-first search is the most common search strategy for traversing a tree or
graph. This algorithm searches breadthwise in a tree or graph, so it is called
breadth-first search.
o The BFS algorithm starts searching from the root node of the tree and expands all
successor nodes at the current level before moving to nodes of the next level.
o The breadth-first search algorithm is an example of a general-graph search
algorithm.
o Breadth-first search is implemented using a FIFO queue data structure.
Advantages:
o BFS will provide a solution if any solution exists.
o If there is more than one solution for a given problem, then BFS will provide the
minimal solution, i.e., the one which requires the least number of steps.
Disadvantages:
o It requires lots of memory since each level of the tree must be saved into
memory to expand to the next level.
o BFS needs lots of time if the solution is far away from the root node.
Example:
In the below tree structure, we have shown the traversing of the tree using BFS
algorithm from the root node S to goal node K. BFS search algorithm traverse in
layers, so it will follow the path which is shown by the dotted arrow, and the traversed
path will be:
1. S---> A--->B---->C--->D---->G--->H--->E---->F---->I---->K
Time Complexity: The time complexity of BFS is O(b^d), where d is the depth of the
shallowest solution and b is the branching factor (the number of successors of every
node).
Space Complexity: The space complexity of BFS is given by the memory size of the
frontier, which is O(b^d).
Completeness: BFS is complete, which means if the shallowest goal node is at some
finite depth, then BFS will find a solution.
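Below is a minimal Python sketch of BFS using the FIFO queue described above. The example graph is an illustrative assumption, not the tree from the figure.

from collections import deque

def bfs(graph, start, goal):
    frontier = deque([[start]])          # FIFO queue of paths
    explored = set()
    while frontier:
        path = frontier.popleft()        # shallowest path first
        node = path[-1]
        if node == goal:
            return path
        if node not in explored:
            explored.add(node)
            for successor in graph.get(node, []):
                frontier.append(path + [successor])
    return None

graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
print(bfs(graph, 'S', 'G'))              # ['S', 'B', 'D', 'G']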
2. Depth-first Search
o Depth-first search is a recursive algorithm for traversing a tree or graph data
structure.
o It is called the depth-first search because it starts from the root node and follows
each path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is similar to that of the BFS algorithm.
Advantage:
o DFS requires much less memory, as it only needs to store a stack of the nodes on
the path from the root node to the current node.
o It takes less time to reach the goal node than BFS algorithm (if it traverses in the
right path).
Disadvantage:
o There is the possibility that many states keep re-occurring, and there is no
guarantee of finding the solution.
o The DFS algorithm goes for deep-down searching, and sometimes it may go into
an infinite loop.
Example:
In the below search tree, we have shown the flow of depth-first search, and it will
follow the order as:
It will start searching from root node S and traverse A, then B, then D and E; after
traversing E, it will backtrack, as E has no other successor and the goal node has not
yet been found. After backtracking, it will traverse node C and then G, where it
terminates as it has found the goal node.
Completeness: DFS search algorithm is complete within finite state space as it will
expand every node within a limited search tree.
Time Complexity: The time complexity of DFS is equivalent to the number of nodes
traversed by the algorithm. It is given by O(b^m), where m is the maximum depth of
any node, which can be much larger than d (the depth of the shallowest solution).
Space Complexity: DFS algorithm needs to store only single path from the root node,
hence space complexity of DFS is equivalent to the size of the fringe set, which
is O(bm).
Optimal: DFS search algorithm is non-optimal, as it may generate many steps or high
cost to reach the goal node.
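Below is a minimal Python sketch of DFS using an explicit stack (LIFO), as described above; the example graph is an illustrative assumption.

def dfs(graph, start, goal):
    stack = [[start]]                    # stack of paths
    while stack:
        path = stack.pop()               # deepest path first
        node = path[-1]
        if node == goal:
            return path
        for successor in reversed(graph.get(node, [])):
            if successor not in path:    # avoid cycling within a path
                stack.append(path + [successor])
    return None

graph = {'S': ['A', 'C'], 'A': ['B', 'D'], 'B': [], 'C': ['G'], 'D': [], 'G': []}
print(dfs(graph, 'S', 'G'))    # explores S, A, B, D before finding ['S', 'C', 'G']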
3. Depth-Limited Search
A depth-limited search algorithm is similar to depth-first search with a predetermined
depth limit ℓ: nodes at the depth limit are treated as if they have no successors.
Depth-limited search can terminate with two conditions of failure:
o Standard failure value: It indicates that the problem does not have any solution.
o Cutoff failure value: It defines no solution for the problem within the given depth
limit.
Advantages:
o Depth-limited search is memory efficient.
Disadvantages:
o Depth-limited search also has the disadvantage of incompleteness.
o It may not be optimal if the problem has more than one solution.
Completeness: The DLS algorithm is complete if the solution is within the depth
limit.
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is also
not optimal, even if ℓ > d.
4. Uniform-Cost Search
Uniform-cost search is used for traversing a weighted tree or graph, where a different
cost is available for each edge. Its primary goal is to find a path to the goal node with
the lowest cumulative cost: it expands nodes in order of their path cost g(n) from the
root node and is implemented with a priority queue.
Advantages:
o Uniform cost search is optimal because at every state the path with the least cost
is chosen.
Disadvantages:
o It does not care about the number of steps involved in searching; it is only
concerned with path cost. As a result, this algorithm may be stuck in an
infinite loop.
Completeness:
Uniform-cost search is complete: if there is a solution, UCS will find it.
Time Complexity:
Let C* be the cost of the optimal solution and ε the minimum cost of a single step.
Then the worst-case number of steps is 1 + ⌊C*/ε⌋ (the +1 because we start from
state 0 and end at C*/ε). Hence, the worst-case time complexity of uniform-cost
search is O(b^(1 + ⌊C*/ε⌋)).
Space Complexity:
By the same logic, the worst-case space complexity of uniform-cost search is also
O(b^(1 + ⌊C*/ε⌋)).
Optimal:
Uniform-cost search is always optimal as it only selects a path with the lowest path
cost.
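Below is a minimal Python sketch of uniform-cost search: the frontier is a priority queue ordered by the path cost g(n). The weighted graph is an illustrative assumption.

import heapq

def ucs(graph, start, goal):
    frontier = [(0, start, [start])]             # (g, node, path)
    best_g = {}
    while frontier:
        g, node, path = heapq.heappop(frontier)  # cheapest path first
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                             # already reached more cheaply
        best_g[node] = g
        for successor, step_cost in graph.get(node, []):
            heapq.heappush(frontier, (g + step_cost, successor, path + [successor]))
    return None

graph = {'S': [('A', 1), ('B', 5)], 'A': [('B', 2)], 'B': [('G', 1)], 'G': []}
print(ucs(graph, 'S', 'G'))                      # (4, ['S', 'A', 'B', 'G'])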
5. Iterative Deepening Depth-First Search
The iterative deepening algorithm is a combination of the DFS and BFS algorithms. This
search algorithm finds out the best depth limit and does it by gradually increasing the
limit until a goal is found.
This algorithm performs depth-first search up to a certain "depth limit", and it keeps
increasing the depth limit after each iteration until the goal node is found.
This Search algorithm combines the benefits of Breadth-first search's fast search and
depth-first search's memory efficiency.
The iterative deepening search algorithm is useful for uninformed search when the
search space is large and the depth of the goal node is unknown.
Advantages:
o It combines the benefits of BFS and DFS search algorithm in terms of fast
search and memory efficiency.
Disadvantages:
o The main drawback of IDDFS is that it repeats all the work of the previous
phase.
Example:
The following tree structure shows the iterative deepening depth-first search. IDDFS
algorithm performs various iterations until it does not find the goal node. The iteration
performed by the algorithm is given as:
1st Iteration -----> A
2nd Iteration ----> A, B, C
3rd Iteration -----> A, B, D, E, C, F, G
4th Iteration -----> A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the goal node.
Completeness:
This algorithm is complete if the branching factor is finite.
Time Complexity:
Let b be the branching factor and d the depth of the shallowest goal; then the
worst-case time complexity is O(b^d).
Space Complexity:
The space complexity of IDDFS is O(bd), the same as DFS.
Optimal:
The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the
depth of the node.
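Below is a minimal Python sketch of IDDFS: repeated depth-limited DFS with an increasing limit. The example tree reuses the node names from the iterations above, but its exact shape is a reconstruction.

def depth_limited(graph, node, goal, limit, path):
    if node == goal:
        return path
    if limit == 0:
        return None
    for successor in graph.get(node, []):
        result = depth_limited(graph, successor, goal, limit - 1, path + [successor])
        if result:
            return result
    return None

def iddfs(graph, start, goal, max_depth=10):
    for limit in range(max_depth + 1):   # gradually increase the depth limit
        result = depth_limited(graph, start, goal, limit, [start])
        if result:
            return result
    return None

graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
         'D': ['H', 'I'], 'E': [], 'F': ['K'], 'G': [],
         'H': [], 'I': [], 'K': []}
print(iddfs(graph, 'A', 'K'))    # found in the 4th iteration: ['A', 'C', 'F', 'K']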
6. Bidirectional Search
The bidirectional search algorithm runs two simultaneous searches to find the goal
node: one from the initial state, called the forward search, and the other from the goal
node, called the backward search. Bidirectional search replaces one single search graph
with two small subgraphs, one starting the search from the initial vertex and the other
from the goal vertex. The search stops when these two graphs intersect each other.
Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.
Advantages:
o Bidirectional search is fast.
o Bidirectional search requires less memory.
Disadvantages:
o Implementation of the bidirectional search tree is difficult.
o In bidirectional search, one should know the goal state in advance.
Example:
In the below search tree, bidirectional search algorithm is applied. This algorithm
divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the
forward direction and starts from goal node 16 in the backward direction.
Informed (Heuristic) Search Strategies
A heuristic is a technique which is not always guaranteed to find the best solution,
but is guaranteed to find a good solution in a reasonable time.
Informed search can solve many complex problems which could not be solved in
another way.
1. Greedy Search
2. A* Search
The informed search algorithm is more useful for large search spaces. Informed
search algorithms use the idea of a heuristic function, so they are also called
heuristic search.
Here h(n) is the heuristic cost and h*(n) is the actual (optimal) cost from n to the
goal. Admissibility requires h(n) ≤ h*(n): the heuristic cost should be less than or
equal to the actual cost, never overestimating it.
1. Greedy Best-First Search
The greedy best-first search algorithm always selects the path which appears best at
that moment. It is a combination of the depth-first and breadth-first search
algorithms, using the heuristic function to guide the search. Best-first search allows
us to take advantage of both algorithms: at each step, we can choose the most
promising node. In the greedy best-first search algorithm, we expand the node which
appears closest to the goal node, where closeness is estimated by the heuristic
function, i.e.
f(n) = h(n).
Advantages:
o Best first search can switch between BFS and DFS by gaining the advantages of
both the algorithms.
o This algorithm is more efficient than BFS and DFS algorithms.
Disadvantages:
o It can behave as an unguided depth-first search in the worst-case scenario.
o It can get stuck in a loop like DFS.
o This algorithm is not optimal.
Example:
Consider the below search problem, and we will traverse it using greedy best-first
search. At each iteration, each node is expanded using the evaluation function
f(n) = h(n), which is given in the below table.
In this search example, we are using two lists: the OPEN list and the CLOSED list.
The following are the iterations for traversing the above example.
Time Complexity: The worst-case time complexity of greedy best-first search is
O(b^m).
Space Complexity: The worst-case space complexity of greedy best-first search is
O(b^m), where m is the maximum depth of the search space.
Complete: Greedy best-first search is also incomplete, even if the given state space is
finite.
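Below is a minimal Python sketch of greedy best-first search: the OPEN list is a priority queue ordered by h(n) alone. The graph and heuristic values are illustrative assumptions.

import heapq

def greedy_best_first(graph, h, start, goal):
    open_list = [(h[start], start, [start])]
    closed = set()
    while open_list:
        _, node, path = heapq.heappop(open_list)  # node that looks closest to the goal
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for successor in graph.get(node, []):
            heapq.heappush(open_list, (h[successor], successor, path + [successor]))
    return None

graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['G'], 'G': []}
h = {'S': 5, 'A': 2, 'B': 4, 'G': 0}
print(greedy_best_first(graph, h, 'S', 'G'))      # ['S', 'A', 'G']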
2. A* Search
A* search is the best-known form of best-first search. It uses the heuristic function
h(n) and the cost g(n) to reach node n from the start state. It combines features of
UCS and greedy best-first search, by which it solves problems efficiently. The A*
search algorithm finds the shortest path through the search space using the heuristic
function, expands a smaller search tree, and provides an optimal result faster. The A*
algorithm is like UCS, except that it uses g(n) + h(n) instead of g(n).
In the A* search algorithm, we use the search heuristic as well as the cost to reach
the node. Hence, we can combine both costs as follows; this sum is called the fitness
number:
f(n) = g(n) + h(n)
Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check if the OPEN list is empty or not; if the list is empty, then return
failure and stop.
Step 3: Select the node from the OPEN list which has the smallest value of the
evaluation function (g + h); if node n is the goal node, then return success and stop.
Step 4: Otherwise, expand node n, generate all its successors, and put n into the
CLOSED list. For each successor n', check whether n' is already in the OPEN or
CLOSED list; if not, then compute the evaluation function for n' and place it into the
OPEN list.
Step 5: Else, if node n' is already in OPEN or CLOSED, attach it to the back pointer
which reflects the lowest g(n') value.
Step 6: Return to Step 2.
Advantages:
o The A* search algorithm performs better than other search algorithms.
o A* search algorithm is optimal and complete.
o This algorithm can solve very complex problems.
Disadvantages:
o It does not always produce the shortest path, as it relies on heuristics and
approximation; the shortest path is guaranteed only with an admissible heuristic.
o A* search algorithm has some complex issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes
in the memory, so it is not practical for various large-scale problems.
Example:
In this example, we will traverse the given graph using the A* algorithm. The heuristic
value of all states is given in the below table so we will calculate the f(n) of each state
using the formula f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start
state.
Here we will use OPEN and CLOSED list.
Solution:
Iteration 3: {(S--> A-->C--->G, 6), (S--> A-->C--->D, 11), (S--> A-->B, 7), (S-->G,
10)}
Iteration 4 gives the final result: S--->A--->C--->G, which provides the optimal path
with cost 6.
Points to remember:
o A* algorithm returns the path which occurred first, and it does not search for all
remaining paths.
o The efficiency of A* algorithm depends on the quality of heuristic.
o The A* algorithm expands all nodes which satisfy the condition f(n) < C*, where
C* is the cost of the optimal solution.
o Admissible: the first condition required for optimality is that h(n) should be an
admissible heuristic for A* tree search. An admissible heuristic is optimistic in
nature.
o Consistency: The second required condition is consistency, which is needed only
for A* graph search.
If the heuristic function is admissible, then A* tree search will always find the least
cost path.
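Below is a minimal Python sketch of A*: the OPEN list is ordered by f(n) = g(n) + h(n). The edge costs and heuristic values reconstruct the example above (S->A->C->G with cost 6 beating S->G with cost 10); treat the exact numbers as an assumption.

import heapq

def a_star(graph, h, start, goal):
    open_list = [(h[start], 0, start, [start])]   # (f, g, node, path)
    best_g = {}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                              # already reached more cheaply
        best_g[node] = g
        for successor, step_cost in graph.get(node, []):
            g2 = g + step_cost
            heapq.heappush(open_list,
                           (g2 + h[successor], g2, successor, path + [successor]))
    return None

graph = {'S': [('A', 1), ('G', 10)], 'A': [('B', 2), ('C', 1)],
         'B': [], 'C': [('D', 3), ('G', 4)], 'D': [], 'G': []}
h = {'S': 5, 'A': 3, 'B': 4, 'C': 2, 'D': 6, 'G': 0}
print(a_star(graph, h, 'S', 'G'))                 # (6, ['S', 'A', 'C', 'G'])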
Hill Climbing Algorithm
A hill-climbing algorithm is a local search algorithm which continuously moves in the
direction of increasing value to find the peak of the mountain, i.e., the best solution
to the problem. It terminates when it reaches a peak where no neighbor has a higher
value. Features of the hill-climbing algorithm:
o Generate and Test variant: Hill climbing is a variant of the Generate and Test
method. The Generate and Test method produces feedback which helps to
decide which direction to move in the search space.
o Greedy approach: Hill-climbing algorithm search moves in the direction which
optimizes the cost.
o No backtracking: It does not backtrack the search space, as it does not
remember the previous states.
State-space landscape: on the Y-axis we take the function, which can be an objective
function or a cost function, and the state space on the X-axis. If the function on the
Y-axis is cost, then the goal of the search is to find the global minimum or a local
minimum. If the function on the Y-axis is an objective function, then the goal of the
search is to find the global maximum or a local maximum.
Different regions in the state space landscape:
Local Maximum: Local maximum is a state which is better than its neighbor
states, but there is also another state which is higher than it.
Global Maximum: Global maximum is the best possible state of state space
landscape. It has the highest value of objective function.
Flat local maximum: It is a flat space in the landscape where all the neighboring
states of current states have the same value.
1. Simple Hill Climbing:
Simple hill climbing is the simplest way to implement a hill-climbing algorithm. It
evaluates only one neighbor node state at a time and selects the first one which
optimizes the current cost, setting it as the current state. It checks only one
successor state, and if it finds that state better than the current state, then it moves
there; otherwise, it stays in the same state. This algorithm has the following features:
o Less time consuming
o Less optimal solution, and the solution is not guaranteed
2. Stochastic Hill Climbing:
Stochastic hill climbing does not examine all its neighbors before moving. Rather,
this search algorithm selects one neighbor node at random and decides whether to
choose it as the current state or to examine another state.
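Below is a minimal Python sketch of simple and stochastic hill climbing over a one-dimensional state space. The objective function and neighborhood are illustrative assumptions.

import random

def objective(x):
    return -(x - 7) ** 2                          # single global maximum at x = 7

def neighbours(x):
    return [x - 1, x + 1]

def simple_hill_climb(x):
    while True:
        for n in neighbours(x):                   # take the FIRST improving move
            if objective(n) > objective(x):
                x = n
                break
        else:
            return x                              # no better neighbour: stop

def stochastic_hill_climb(x, steps=100):
    for _ in range(steps):
        n = random.choice(neighbours(x))          # pick ONE neighbour at random
        if objective(n) > objective(x):           # move only if it improves
            x = n
    return x

print(simple_hill_climb(0), stochastic_hill_climb(0))   # both typically reach 7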
Problems in the Hill Climbing Algorithm:
1. Local Maximum: A local maximum is a peak state in the landscape which is better
than each of its neighboring states, but there is another state also present which is
higher than the local maximum.
2. Plateau: A plateau is a flat area of the search space in which all the neighbor
states of the current state contain the same value; because of this, the algorithm
does not find any best direction to move. A hill-climbing search might get lost in the
plateau area.
Solution: The solution for the plateau is to take big steps or very little steps while
searching, to solve the problem. Randomly select a state which is far away from the
current state so it is possible that the algorithm could find non-plateau region.
3. Ridges: A ridge is a special form of the local maximum. It has an area which is
higher than its surrounding areas, but itself has a slope, and cannot be reached in a
single move.
Simulated Annealing:
A hill-climbing algorithm which never makes a move towards a lower value is
guaranteed to be incomplete, because it can get stuck on a local maximum. Simulated
annealing instead picks a random move: if the move improves the state, it is always
accepted; otherwise, it is accepted with some probability less than 1, which decreases
with the badness of the move and with the "temperature" T, lowered gradually
according to a schedule. This lets the search escape local maxima early on while
settling into a good solution as T decreases.
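Below is a minimal Python sketch of simulated annealing on the same kind of one-dimensional landscape; the cooling schedule and parameters are illustrative assumptions.

import math
import random

def objective(x):
    return -(x - 7) ** 2

def simulated_annealing(x, t0=10.0, cooling=0.95, steps=200):
    t = t0
    for _ in range(steps):
        n = x + random.choice([-1, 1])            # random move
        delta = objective(n) - objective(x)
        if delta > 0 or random.random() < math.exp(delta / t):
            x = n                       # bad moves accepted with prob e^(delta/T)
        t *= cooling                    # temperature decreases over time
    return x

print(simulated_annealing(0))           # usually ends at or near the maximum, 7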
Knowledge Representation
Knowledge representation and reasoning (KR, KRR) is the part of Artificial
Intelligence concerned with representing information about the real world in a form a
computer can understand and can utilize to solve complex real-world problems, such
as diagnosing a medical condition or communicating with humans in natural language.
○ Object: All the facts about objects in our world domain. E.g., guitars contain
strings; trumpets are brass instruments.
1. Declarative Knowledge: Knowledge about something; it includes concepts, facts,
and objects, and is expressed in declarative sentences.
2. Procedural Knowledge: Knowledge of how to do something; it includes rules,
strategies, and procedures.
3. Meta-knowledge: Knowledge about other types of knowledge.
4. Heuristic knowledge: The knowledge of experts in a field, based on previous
experience.
5. Structural knowledge: Basic knowledge for problem-solving.
○ It describes relationships between various concepts such as kind of, part of, and
grouping of something.
An AI system has the following components for displaying intelligent behavior (the AI
knowledge cycle):
○ Perception
○ Learning
○ Knowledge Representation and Reasoning
○ Planning
○ Execution
The above components show how an AI system can interact with the real world and
what helps it to show intelligence. An AI system has a Perception component by
which it retrieves information from its environment; this can be visual, audio, or
another form of sensory input. The Learning component is responsible for learning
from the data captured by the Perception component. In the complete cycle, the main
components are Knowledge Representation and Reasoning; these two components are
involved in showing intelligence in machines, like humans. The two components are
independent of each other but are also coupled together. Planning and Execution
depend on the analysis of knowledge representation and reasoning.
There are mainly four ways of knowledge representation, which are given as follows:
1. Logical Representation
2. Semantic Network Representation
3. Frame Representation
4. Production Rules
1. Logical Representation
Logical representation is a language with some concrete rules which deals with
propositions and has no ambiguity in representation. Logical representation means
drawing a conclusion based on various conditions. This representation lays down some
important communication rules. It consists of precisely defined syntax and semantics
which support the sound inference. Each sentence can be translated into logic using
syntax and semantics.
Syntax:
○ Syntax consists of the rules which decide how we can construct legal sentences
in the logic.
Semantics:
○ Semantics consists of the rules by which we can interpret the sentences in the
logic.
Logical representation can be categorized into mainly two logics:
a. Propositional logic
b. Predicate logic
Propositional logic (PL) is the simplest form of logic where all the statements are made
by propositions. A proposition is a declarative statement which is either true or false. It
is a technique of knowledge representation in logical and mathematical form.
Example:
1. It is Sunday. (proposition)
2. The Sun rises from the West. (false proposition)
3. 3 + 3 = 7. (false proposition)
4. 5 is a prime number. (proposition)
○ Propositions are combined using logical connectives; these connectives are also
called logical operators.
○ The propositions and connectives are the basic elements of propositional logic.
○ Questions, commands, and opinions, such as "Where is Rohini?", "How are
you?", and "What is your name?", are not propositions.
The syntax of propositional logic defines the allowable sentences for the knowledge
representation. There are two types of Propositions:
a. Atomic Propositions
b. Compound Propositions
Atomic propositions are simple propositions that consist of a single proposition
symbol. These are the sentences which must be either true or false. Compound
propositions are constructed by combining simpler or atomic propositions using
logical connectives.
Logical Connectives:
Logical connectives are used to connect two simpler propositions. A sentence which
has the ∧ connective, such as P ∧ Q, is called a conjunction.
Example: "Rohan is intelligent and hardworking" can be written as P ∧ Q, where
P = Rohan is intelligent,
Q = Rohan is hardworking.
2. Semantic Network Representation
Semantic networks represent knowledge in the form of graphical networks consisting
of nodes, which represent objects, and arcs, which describe the relationships between
them. Semantic networks mainly use two types of relations:
a. IS-A relation (inheritance)
b. Kind-of relation
Example: Following are some statements which we need to represent in the form of
nodes and arcs.
Statements:
a. Jerry is a cat.
b. Jerry is a mammal.
c. Jerry is brown.
Facets: The various aspects of a slot are known as facets. Facets are features of
frames which enable us to put constraints on the frames. Example: IF-NEEDED facets
are invoked when the data of a slot is needed. A frame may consist of any number of
slots, a slot may include any number of facets, and a facet may have any number of
values. A frame is also known as slot-filler knowledge representation in artificial
intelligence.
Frames are derived from semantic networks and later evolved into our modern-day
classes and objects. A single frame is not very useful; frame systems consist of a
collection of connected frames. In a frame, knowledge about an object or event can
be stored together in the knowledge base. The frame is a type of technology widely
used in various applications, including natural language processing and machine vision.
Example: a frame for a book:
Slots: Fillers
Title: Artificial Intelligence: A Modern Approach
Genre: Computer Science
Author: Peter Norvig
Edition: Third
Year: 1996
Page: 1152
4. Production Rules
A production rules system consists of (condition, action) pairs, which mean "IF
condition THEN action". It has mainly three parts:
○ The set of production rules
○ Working memory
○ The recognize-act cycle
In production rules, the agent checks for the condition, and if the condition holds,
then the production rule fires and the corresponding action is carried out. The
condition part of the rule determines which rule may be applied to a problem, and the
action part carries out the associated problem-solving steps. This complete process is
called a recognize-act cycle.
The working memory contains the description of the current state of problem-solving
and rules can write knowledge to the working memory. This knowledge matches and
may fire other rules.
If a new situation (state) is generated, then multiple production rules may be
triggered together; this set of rules is called the conflict set. In this situation, the
agent needs to select a rule from the set, and this selection is called conflict
resolution.
Example:
○ IF (at bus stop AND bus arrives) THEN action (get into the bus)
○ IF (on the bus AND paid AND empty seat) THEN action (sit down).
○ IF (bus arrives at destination) THEN action (get down from the bus)
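Below is a minimal Python sketch of the recognize-act cycle applied to the bus rules above. The set-based working-memory representation, and firing rules in listed order as the conflict-resolution strategy, are illustrative assumptions.

rules = [
    ({"at bus stop", "bus arrives"}, "get into the bus", {"on the bus"}),
    ({"on the bus", "paid", "empty seat"}, "sit down", {"seated"}),
]

working_memory = {"at bus stop", "bus arrives", "paid", "empty seat"}

fired = True
while fired:                                      # recognize-act cycle
    fired = False
    for condition, action, effects in rules:
        if condition <= working_memory and not effects <= working_memory:
            print("fire:", action)                # condition matches: the rule fires
            working_memory |= effects             # the action writes to working memory
            fired = True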
Reasoning in Artificial Intelligence
Reasoning is the process of deriving logical conclusions and making predictions from
available knowledge, facts, and beliefs. In artificial intelligence, reasoning is essential
so that the machine can think rationally, like a human brain, and can perform like a
human.
Types of Reasoning
○ Deductive reasoning
○ Inductive reasoning
○ Abductive reasoning
○ Monotonic Reasoning
○ Non-monotonic Reasoning
1. Deductive reasoning:
Deductive reasoning is a type of propositional logic in AI which requires various rules
and facts. It is sometimes referred to as top-down reasoning, and it is the opposite of
inductive reasoning.
In deductive reasoning, the truth of the premises guarantees the truth of the conclusion.
Deductive reasoning mostly starts from the general premises to the specific conclusion,
which can be explained as below example.
Example:
Premise 1: All men are mortal.
Premise 2: Socrates is a man.
Conclusion: Socrates is mortal.
2. Inductive Reasoning:
Inductive reasoning arrives at a generalized conclusion from specific facts or
examples. It is a form of bottom-up reasoning: the truth of the premises makes the
conclusion probable rather than certain.
Example:
Premise: All the pigeons we have seen in the zoo are white.
Conclusion: Therefore, we can expect all pigeons to be white.
3. Abductive reasoning:
Abductive reasoning is a form of logical reasoning which starts with single or multiple
observations then seeks to find the most likely explanation or conclusion for the
observation.
Example:
Observation: The grass is wet.
Rule: If it rains, the grass gets wet.
Conclusion: It is raining.
4. Common Sense Reasoning:
Common sense reasoning simulates the human ability to make presumptions about
events which occur every day. It relies on good judgment rather than exact logic and
operates on heuristic knowledge and heuristic rules.
Example:
1. One person can be at only one place at a time.
2. If I put my hand in a fire, it will burn.
The above two statements are examples of common-sense reasoning which a human
mind can easily understand and assume.
5. Monotonic Reasoning:
In monotonic reasoning, once the conclusion is taken, then it will remain the same
even if we add some other information to existing information in our knowledge base.
In monotonic reasoning, adding knowledge does not decrease the set of propositions
that can be derived.
To solve monotonic problems, we can derive the valid conclusion from the available
facts only, and it will not be affected by new facts.
Monotonic reasoning is not useful for real-time systems, as in real time, facts get
changed, so we cannot use monotonic reasoning.
Example: "The Earth revolves around the Sun." This is a fact, and it cannot be
changed even if we add other sentences to the knowledge base, such as "The moon
revolves around the earth" or "The Earth is not round."
6. Non-monotonic Reasoning
In non-monotonic reasoning, some conclusions may be invalidated if we add more
information to our knowledge base. "Human perceptions for various things in daily
life" is a general example of non-monotonic reasoning.
Example: Suppose the knowledge base contains the following knowledge:
○ Birds can fly.
○ Penguins cannot fly.
○ Pitty is a bird.
So, from the above sentences, we can conclude that Pitty can fly. However, if we
later learn that Pitty is a penguin, that conclusion is invalidated.
So, to represent uncertain knowledge, where we are not sure about the predicates, we
need uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
The following are some leading causes of uncertainty in the real world:
1. Information from unreliable sources
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change
Probabilistic reasoning:
In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations,"
"A match between two teams or two players." These are probable sentences for
which we can assume that it will happen but are not sure about it, so here we use
probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:
○ Bayes' rule
○ Bayesian Statistics
Probability: Probability can be defined as the chance that an uncertain event will
occur. It is the numerical measure of the likelihood that an event will occur. The
value of a probability always lies between 0 and 1.
We can find the probability of an uncertain event by using the formula:
Probability of occurrence = (Number of desired outcomes) / (Total number of outcomes)
○ P(¬A) + P(A) = 1, where P(¬A) is the probability of event A not occurring.
Event: Each possible outcome of a variable is called an event.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in
the real world.
Conditional probability:
Conditional probability is the probability of an event occurring given that another
event has already happened. Suppose we want to calculate the probability of event A
when event B has already occurred, "the probability of A under the conditions of B";
it can be written as:
P(A|B) = P(A⋀B) / P(B)
If the probability of A is given and we need to find the probability of B, then it will
be given as:
P(B|A) = P(A⋀B) / P(A)
This can be explained using a Venn diagram: once B has occurred, the sample space
is reduced to set B, and we can calculate event A given B by dividing the probability
P(A⋀B) by P(B).
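As a small numeric sketch of the formula (with assumed, illustrative numbers P(A⋀B) = 0.4 and P(B) = 0.7):

p_a_and_b = 0.4
p_b = 0.7

p_a_given_b = p_a_and_b / p_b        # P(A|B) = P(A⋀B) / P(B)
print(round(p_a_given_b, 2))         # 0.57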
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
Example: If cancer corresponds to one's age, then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability
of event A with known event B. From the product rule, P(A⋀B) = P(A|B) P(B), and
equally P(A⋀B) = P(B|A) P(A). Equating the two and dividing by P(B) gives:
P(A|B) = P(B|A) P(A) / P(B)    ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the
basis of most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the
probability of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the
probability of the evidence.
P(A) is called the prior probability, probability of hypothesis before considering the
evidence.
In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule
can be written as:
P(Ai|B) = P(B|Ai) P(Ai) / Σk P(B|Ak) P(Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
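Below is a small Python sketch of Bayes' rule with assumed, illustrative numbers: a test for a disease with prior P(disease) = 0.01, likelihood P(positive|disease) = 0.9, and P(positive|no disease) = 0.08.

p_d = 0.01                        # prior P(A)
p_pos_given_d = 0.90              # likelihood P(B|A)
p_pos_given_not_d = 0.08

# total probability: P(B) = sum over Ai of P(Ai) * P(B|Ai)
p_pos = p_d * p_pos_given_d + (1 - p_d) * p_pos_given_not_d

# posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))    # 0.102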
Fuzzy Logic
The word "fuzzy" means things that are not clear or are vague. Sometimes we cannot
decide in real life whether a given statement is true or false. At such times, fuzzy
logic provides many values between true and false and gives the flexibility to find the
best solution to the problem.
Fuzzy logic contains multiple logical values, and these values are the truth values of a
variable or problem between 0 and 1. This concept was introduced by Lotfi Zadeh in
1965, based on his Fuzzy Set Theory. This concept provides possibilities which are
not given by computers, but like the range of possibilities generated by humans.
In the Boolean system, only two possibilities (0 and 1) exist, where 1 denotes the
absolute truth value and 0 denotes the absolute false value. But in the fuzzy system,
there are multiple possibilities present between the 0 and 1, which are partially false
and partially true.
In the architecture of the Fuzzy Logic system, each component plays an important role.
The architecture consists of the different four components which are given below.
1. Rule Base
2. Fuzzification
3. Inference Engine
4. Defuzzification
1. Rule Base
The rule base is a component used for storing the set of rules and the IF-THEN
conditions given by the experts for controlling the decision-making systems. Many
recent updates to fuzzy theory offer effective methods for designing and tuning fuzzy
controllers; these developments decrease the number of fuzzy rules.
2. Fuzzification
Fuzzification is a module which transforms the system inputs, converting crisp
numbers into fuzzy sets. The crisp inputs measured by sensors are converted into
linguistic terms such as:
○ Large Positive (LP)
○ Medium Positive (MP)
○ Small (S)
○ Medium Negative (MN)
○ Large Negative (LN)
3. Inference Engine
This component is the main component of any Fuzzy Logic system (FLS), because all
the information is processed in the Inference Engine. It allows users to find the
matching degree between the current fuzzy input and the rules. After the matching
degree, this system determines which rule is to be added according to the given input
field. When all rules are fired, then they are combined for developing the control
actions.
4. Defuzzification
Defuzzification is a module or component, which takes the fuzzy set inputs generated
by the Inference Engine, and then transforms them into a crisp value. It is the last step
in the process of a fuzzy logic system. The crisp value is a type of value which is
acceptable by the user. Various techniques are present to do this, but the user has to
select the best one to reduce the number of errors.
Membership Function
The membership function is a function which represents the graph of fuzzy sets and
allows users to quantify the linguistic term. It is a graph which is used for mapping
each element of x to the value between 0 and 1.
For the fuzzy set B, the membership function on X is defined as μB : X → [0, 1].
Each element of X is mapped to a value between 0 and 1, called the degree of
membership or membership value.
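Below is a minimal Python sketch of a triangular membership function μB : X → [0, 1]; the break points and the "warm" temperature set are illustrative assumptions.

def triangular(x, a, b, c):
    # degree of membership of x in a fuzzy set rising from a, peaking at b, falling to c
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)     # rising edge: 0 at a, 1 at b
    return (c - x) / (c - b)         # falling edge: 1 at b, 0 at c

# e.g. a fuzzy set "warm" over temperature in degrees Celsius
for t in (10, 15, 20, 25, 30):
    print(t, triangular(t, 10, 20, 30))   # 0.0, 0.5, 1.0, 0.5, 0.0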
Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Why would we want an agent to learn?
There are three main reasons. First, the designers cannot anticipate all possible
situations that the agent might find itself in. For example, a robot designed to
navigate mazes must learn the layout of each new maze it encounters. Second,
the designers cannot anticipate all changes over time; a program designed to
predict tomorrow’s stock market prices must learn to adapt when conditions
change from boom to bust. Third, sometimes human programmers have no idea
how to program a solution themselves. For example, most people are good at
recognizing the faces of family members, but even the best programmers are
unable to program a computer to accomplish that task, except by using learning
algorithms.
Forms of Learning
Any component of an agent can be improved by learning from data. The
improvements, and the techniques used to make them, depend on four major
factors.
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
There are three types of feedback that determine the three main types of
learning:
In unsupervised learning the agent learns patterns in the input even though no
explicit feedback is supplied. The most common unsupervised learning task is
clustering: detecting potentially useful clusters of input examples. For example,
a taxi agent might gradually develop a concept of “good traffic days” and “bad
traffic days” without ever being given labeled examples of each by a teacher.
In reinforcement learning the agent learns from a series of reinforcements—
rewards or punishments. For example, the lack of a tip at the end of the journey
gives the taxi agent an indication that it did something wrong. The two points
for a win at the end of a chess game tell the agent it did something right. It is
up to the agent to decide which of the actions prior to the reinforcement were
most responsible for it.
In supervised learning the agent observes some example input–output pairs and
learns a function that maps from input to output. Consider, for example, an agent
training to become a taxi driver. Every time the instructor shouts "Brake!", the
agent might learn a condition-action rule for when to brake; the inputs are
percepts and the outputs are provided by a teacher who says "Brake!" or "Turn
left."
SEMI- SUPERVISED LEARNING: In practice, these distinctions are not always
so crisp. In semi-supervised learning we are given a few labelled examples and
must make what we can of a large collection of unlabeled examples.
Supervised Learning
In supervised learning, models are trained using a labelled dataset, where the
model learns about each type of data. Once the training process is completed,
the model is tested on held-out test data (examples distinct from the training
set), and then it predicts the output.
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1),(x2, y2),...(xN , yN ) ,
where each yj was generated by an unknown function y = f(x), discover a
function h that approximates the true function f. Here x and y can be any value;
they need not be numbers.
Function h is a hypothesis.
Learning is a search through the space of possible hypotheses for one that will
perform well, even on new examples beyond the training set. To measure the
accuracy of a hypothesis, we give it a test set of examples that are distinct from
the training set. We say a hypothesis generalizes well if it correctly predicts the
value of y for novel examples.
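Below is a small Python sketch of this setup: fit a hypothesis h to noisy (x, y) training pairs and measure its error on a distinct test set. The true function f and the noise level are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 3 * x + 2                        # the unknown true function y = f(x)

x_train = rng.uniform(0, 10, 50)
y_train = f(x_train) + rng.normal(0, 1, 50)    # noisy training examples
x_test = rng.uniform(0, 10, 20)                # novel examples, distinct from training
y_test = f(x_test) + rng.normal(0, 1, 20)

coeffs = np.polyfit(x_train, y_train, 1)       # hypothesis h: a straight line
h = np.poly1d(coeffs)

test_error = np.mean((h(x_test) - y_test) ** 2)
print(test_error)                              # small error: h generalizes well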
Supervised learning can be further divided into two types of problems:
1. Classification
2. Regression
Classification
The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the categorical
data.
Classification algorithms can be better understood with an example: suppose there
are two classes, Class A and Class B, whose members have features similar to each
other and dissimilar to those of the other class.
o Binary Classifier: If the classification problem has only two possible outcomes,
then it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Regression
Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor)
variables. More specifically, regression analysis helps
us to understand how the value of the dependent variable is changing
corresponding to an independent variable when other independent variables
are held fixed. It predicts continuous/real values such as temperature, age,
salary, price, etc.
In Regression, we plot a graph between the variables which best fits the given
datapoints. Using this plot, the machine learning model can make predictions about the
data. In simple words, "Regression shows a line or curve that passes through all the
datapoints on target-predictor graph in such a way that the vertical distance between
the datapoints and the regression line is minimum." The distance between datapoints
and line tells whether a model has captured a strong relationship or not.
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the very simple and easy algorithms which works on regression and
shows the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input variable,
then such linear regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be
explained using the below image. Here we are predicting the salary of an
employee based on the year of experience.
The mathematical equation for linear regression is:
Y = aX + b
where Y is the dependent (target) variable, X is the independent (predictor) variable,
and a and b are the linear coefficients (slope and intercept).
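Below is a small Python sketch estimating a and b by least squares; the experience/salary numbers are illustrative assumptions.

import numpy as np

experience = np.array([1, 2, 3, 4, 5], dtype=float)    # X, in years
salary = np.array([30, 35, 41, 44, 50], dtype=float)   # Y, in thousands

a = np.cov(experience, salary, bias=True)[0, 1] / np.var(experience)
b = salary.mean() - a * experience.mean()

print(a, b)                        # slope and intercept of the best-fit line
print(a * 6 + b)                   # predicted salary for 6 years of experience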
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to
solve classification problems. In classification problems, we have dependent
variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with categorical variables such as 0 or 1,
Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear
regression algorithm in the term how they are used.
o Logistic regression uses the sigmoid function or logistic function, which is a
complex cost function. This sigmoid function is used to model the data in
logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0.
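Below is a small Python sketch of the sigmoid function and the threshold rule; the weight and bias values are illustrative assumptions, not a trained model.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # maps any real z into (0, 1)

w, b = 2.0, -5.0                   # assumed model parameters
x = np.array([1.0, 2.5, 4.0])

prob = sigmoid(w * x + b)          # predicted probability of class 1
label = (prob >= 0.5).astype(int)  # threshold at 0.5
print(prob, label)                 # approx. [0.047 0.5 0.953] -> [0 1 1]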
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is like multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear
model. Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1,... bn are the regression
coefficients. x is our independent/input variable.
o The model is still a linear model, because the coefficients b0, ..., bn enter
linearly; only the features are raised to polynomial powers.
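Below is a small Python sketch: the original feature x is transformed into polynomial features, and the model is then fitted as a linear model in those features. The data is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 0.5 * x**3 - x + rng.normal(0, 1, 30)   # non-linear data

X = np.vander(x, N=4, increasing=True)      # columns: 1, x, x^2, x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coeffs)                    # [b0, b1, b2, b3]: linear in the coefficients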
Regression vs Classification:
o The task of the regression algorithm is to map the input value (x) to a continuous
output variable (y); the task of the classification algorithm is to map the input
value (x) to a discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms
are used with discrete data.
o In regression, we try to find the best-fit line, which can predict the output more
accurately; in classification, we try to find the decision boundary, which can
divide the dataset into different classes.
o Regression algorithms can be further divided into linear and non-linear
regression; classification algorithms can be divided into binary classifiers and
multi-class classifiers.
Regularization
Sometimes a machine learning model performs well on the training data but does not
perform well on the test data: it cannot predict the output for unseen data because it
has also fitted the noise in the training data. Such a model is called overfitted. This
problem can be dealt with using a regularization technique.
This technique can be used in such a way that it will allow us to maintain all variables
or features in the model by reducing the magnitude of the variables. Hence, it
maintains accuracy as well as a generalization of the model.
It mainly regularizes or reduces the coefficient of features to zero. In simple words, "In
regularization technique, we reduce the magnitude of the features by keeping the same
number of features."
Consider the multiple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn + b
Here β0, β1, ..., βn are the weights or magnitudes attached to the features, and b
represents the intercept (bias) of the model.
Linear regression models try to optimize these coefficients and b to minimize the cost
function. The loss function for linear regression is called RSS, or the residual sum of
squares, given below:
RSS = Σi (yi − (β0 + Σj βj xij))²
Regularization adds a penalty term to this loss. There are mainly two types of
regularization techniques:
o Ridge Regression
o Lasso Regression
Ridge Regression: Ridge regression is one of the types of linear regression in which
a small amount of bias is introduced so that we can get better long-term predictions.
In this technique, the cost function is altered by adding a penalty term to it. The
amount of bias added to the model is called the ridge regression penalty. We can
calculate it by multiplying lambda by the squared weight of each individual
feature.
The equation for the cost function in ridge regression (also called L2
regularization) will be:
Cost = RSS + λ Σj (βj)²
Lasso Regression: It is like ridge regression except that the penalty term
contains only the absolute values of the weights instead of the squares of the
weights.
Since it takes absolute values, it can shrink a slope all the way to 0, whereas
ridge regression can only shrink it near to 0.
It is also called L1 regularization. The equation for the cost function of lasso
regression will be:
Cost = RSS + λ Σj |βj|
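The following sketch contrasts the two penalties using scikit-learn's Ridge and Lasso, whose alpha parameter plays the role of λ; the synthetic data is chosen only so the shrink-to-zero effect is visible.
# Ridge (L2) vs Lasso (L1): Lasso can set coefficients exactly to 0,
# Ridge only shrinks them toward 0.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # 5 features, most irrelevant
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", ridge.coef_)   # small but non-zero
print("lasso coefficients:", lasso.coef_)   # several exactly 0.0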
Decision tree
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules, and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the
Leaf Node. Decision nodes are used to make decisions and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
o The decisions or the test are performed based on features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, like a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub-Tree: A subtree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches the leaf node of the
tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide the S into subsets that contain possible values for the best
attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in step -3. Continue this process until a stage is reached where you
cannot further classify the nodes and call the final node as a leaf node.
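A minimal sketch of the procedure above with scikit-learn's DecisionTreeClassifier; using criterion="entropy" gives an information-gain-style attribute selection measure, and the iris dataset and depth limit are our own illustrative choices.
# Decision tree on the built-in iris dataset, printed as if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)  # small tree
tree.fit(data.data, data.target)
print(export_text(tree, feature_names=data.feature_names))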
The positive examples are the ones in which the goal WillWait is true (x1,
x3,...); the negative examples are the ones in which it is false (x2, x5,...). The
decision-tree-learning algorithm adopts a greedy divide-and-conquer strategy:
always test the most important attribute first. This test divides the problem up
into smaller subproblems that can then be solved recursively. By “most
important attribute,” we mean the one that makes the most difference to the
classification of an example. That way, we hope to get to the correct
classification with a small number of tests, meaning that all paths in the tree will
be short and the tree will be shallow.
Figure shows that Type is a poor attribute, because it leaves us with four
possible outcomes, each of which has the same number of positive as negative
examples. On the other hand, we see that Patrons is an important attribute,
because if the value is None or Some, then we are left with example sets for
which we can answer definitively (No and Yes, respectively). If the value is
Full, we are left with a mixed set of examples. In general, after the first attribute
test splits up the examples, each outcome is a new decision tree learning
problem, with fewer examples and one less attribute. There are four cases to
consider for these recursive problems:
1. If the remaining examples are all positive (or all negative), then we are done:
we can answer Yes or No. Figure (b) shows examples of this happening in the
None and Some branches.
2. If there are some positive and some negative examples, then choose the best
attribute to split them. Figure (b) shows Hungry being used to split the
remaining examples.
3. If there are no examples left, it means that no example has been observed for
this combination of attribute values, and we return a default value calculated
from the plurality classification of all the examples that were used in
constructing the node's parent. These are passed along in the variable
parent_examples.
4. If there are no attributes left, but both positive and negative examples, it
means that these examples have the same description, but different
classifications. This can happen because there is an error or noise in the data;
because the domain is nondeterministic; or because we can’t observe an
attribute that would distinguish the examples.
The output of the learning algorithm on our sample training set is shown in the
Figure below. The tree is clearly different from the original tree shown in figure.
One might conclude that the learning algorithm is not doing a very good job
of learning the correct function. This would be the wrong conclusion to draw,
however. The learning algorithm looks at the examples, not at the correct
function, and in fact, its hypothesis not only is consistent with all the examples
but is considerably simpler than the original tree! The learning algorithm has no
reason to include tests for Raining and Reservation, because it can classify all
the examples without them.
Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models. The
main goal of each machine learning model is to generalize well.
Overfitting: Overfitting refers to the condition when the model fits the
training data almost perfectly but fails to generalize to unseen test data. The
overfit condition arises when the model memorizes the noise of the training
data and fails to capture the important patterns. The longer we train the model
on the same training data, the higher the chances of producing an overfitted
model.
Example: The concept of the overfitting can be understood by the below graph
of the linear regression output:
As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but it is not: since the goal of
the regression model is to find the best-fit line, and here the fitted curve does
not generalize, it will generate prediction errors on unseen data.
There are some ways by which we can reduce the occurrence of overfitting in
our model.
Decision Tree Pruning: Pruning refers to a technique for removing parts of the
decision tree to prevent it from growing to its full depth. By tuning the
hyperparameters of the decision tree model, one can prune the trees and
prevent them from overfitting.
For decision trees, a technique called decision tree pruning combats overfitting.
Pruning works by eliminating nodes that are not clearly relevant. We start with
a full tree, as generated by decision-tree-learning. We then look at a test node
that has only leaf nodes as descendants. If the test appears to be irrelevant—
detecting only noise in the data— then we eliminate the test, replacing it with a
leaf node. We repeat this process, considering each test with only leaf
descendants, until each one has either been pruned or accepted as is.
Pre-Pruning:
The pre-pruning technique refers to the early stopping of the growth of the
decision tree. It involves tuning the hyperparameters of the decision tree model
prior to training. Hyperparameters such as max_depth, min_samples_leaf, and
min_samples_split can be tuned to stop the growth of the tree early and prevent
the model from overfitting.
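A pre-pruning sketch; the hyperparameter values below are illustrative, and in practice they would be tuned by validation.
# Pre-pruning: constrain the tree before training so it cannot overfit.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pruned = DecisionTreeClassifier(
    max_depth=3,            # stop growing below depth 3
    min_samples_split=10,   # do not split nodes with fewer than 10 samples
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
).fit(X, y)
print("tree depth:", pruned.get_depth())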
Post-Pruning:
The Post-pruning technique allows the decision tree model to grow to its full
depth, then removes the tree branches to prevent the model from overfitting.
Cross Validation
In machine learning, there is always the need to test the stability of the model;
we cannot judge a model based only on the training dataset it was fitted to. For
this purpose, we reserve a particular sample of the dataset which was not part
of the training dataset, test our model on that sample before deployment, and
this complete process comes under cross-validation.
K-Fold Cross-Validation
Let's take an example of 5-fold cross-validation. The dataset is grouped into
5 folds. In the 1st iteration, the first fold is reserved for testing the model, and
the rest are used for training. In the 2nd iteration, the second fold is used to
test the model, and the rest are used for training. This process continues until
each fold has been used exactly once as the test fold.
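A 5-fold cross-validation sketch with scikit-learn's cross_val_score; the model and dataset are illustrative choices.
# 5-fold cross-validation: each fold serves exactly once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())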
Early stopping the training: The decision tree algorithm stops generating nodes
when there is no good attribute to split on, rather than going to all the trouble of
generating nodes and then pruning them away.
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. It can happen, for example, when the feeding of
training data is stopped too early in an attempt to avoid overfitting: the model
does not learn enough from the training data and may fail to find the best fit of
the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the
training data, and hence it reduces the accuracy and produces unreliable
predictions.
Example: We can understand the underfitting using below output of the linear
regression model:
As we can see from the above diagram, the model is unable to capture the data
points present in the plot.
Stationarity assumption: We want to learn a hypothesis that fits the future data
best. To make that precise we need to define “future data” and “best.”
We make the stationarity assumption: that there is a probability distribution
over examples that remains stationary over time.
Each example data point (before we see it) is a random variable Ej whose
observed value ej = (xj, yj) is sampled from that distribution, independently of
the previous examples.
Error rate: The next step is to define "best fit." We define the error rate of
a hypothesis as the proportion of mistakes it makes, that is, the proportion of
times that h(x) ≠ y for an (x, y) example.
Now, just because a hypothesis h has a low error rate on the training set does
not mean that it will generalize well. A professor knows that an exam will
not accurately evaluate students if they have already seen the exam
questions. Similarly, to get an accurate evaluation of a hypothesis, we need
to test it on a set of examples it has not seen yet. The simplest approach is
the one we have seen already: randomly split the available data into a
training set from which the learning algorithm produces h and a test set on
which the accuracy of h is evaluated. This method is sometimes called holdout
cross-validation.
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram in which two different categories are
classified using a decision boundary or hyperplane:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
The dimensions of the hyperplane depend on the number of features present in the
dataset: if there are 2 features (as shown in the image), then the hyperplane will be
a straight line, and if there are 3 features, then the hyperplane will be a
2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and affect its
position are termed support vectors; since these vectors support the hyperplane,
they are called support vectors.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that
if a dataset can be classified into two classes by using a single straight line,
then such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
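A sketch of both SVM types using scikit-learn's SVC; kernel="linear" corresponds to the linear SVM and kernel="rbf" to a common non-linear SVM, with the iris dataset as an arbitrary example.
# Linear vs non-linear SVM on the same data.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)
print("support vectors (linear):", linear_svm.support_vectors_.shape[0])
print("support vectors (rbf):", rbf_svm.support_vectors_.shape[0])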
There are around 86 billion neurons in the human brain. Each neuron has
connection points somewhere in the range of 1,000 to 100,000. In the human
brain, data is stored in a distributed manner, and we can extract more than one
piece of this data, when necessary, from our memory in parallel. We can say
that the human brain is made up of incredibly amazing parallel processors.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by
the programmer.
Hidden Layer:
The hidden layer is present in between input and output layers. It performs all
the calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a
transfer function.
The weighted total is then passed as an input to an activation function, which
decides whether a node should fire or not. Only the nodes that fire pass their
values on to the output layer. There are distinct activation functions available,
chosen according to the type of task we are performing.
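The computation just described, one neuron's weighted sum plus bias followed by a sigmoid activation, can be sketched from scratch with NumPy; the input, weight, and bias values are arbitrary.
# One artificial neuron: weighted sum of inputs plus bias, then activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.2, 0.8])    # inputs
w = np.array([0.4, -0.6, 0.9])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # transfer function: weighted sum + bias
print(sigmoid(z))                # activation decides how strongly to "fire"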
Types of Artificial Neural Networks (ANN):
Depending upon the human brain's neuron and network functions, an artificial
neural network performs tasks in a similar manner. Most artificial neural
networks have some similarities with their more complex biological
counterparts and are very effective at their intended tasks, for example
segmentation or classification.
Feedback ANN:
In this type of ANN, the output returns to the network to achieve the best-
evolved results internally. Feedback networks feed information back into
themselves and are well suited to solving optimization problems. Internal
system error correction is a typical use of feedback ANNs.
Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer,
an output layer, and at least one hidden layer of neurons. Data flows only
forward, from input to output, and the output is determined by the collective
behavior of the associated neurons. The primary advantage of this network is
that it learns to evaluate and recognize input patterns.
Nonparametric model
A nonparametric model is one that cannot be characterized by a bounded set of
parameters; instead, it retains the training examples themselves. Instance-based
methods such as k-nearest neighbors are examples of nonparametric models.
Ensemble Learning
Ensemble methods are techniques that create multiple models and then combine
them to produce improved results. Ensemble methods in machine learning usually
produce more accurate solutions than a single model would.
The models that contribute to the ensemble are called ensemble members. They
may be of the same type or of different types.
They may or may not be trained on the same training data.
Bagging is used when our objective is to reduce the variance of a decision tree.
Here the concept is to create a few subsets of data from the training sample,
chosen randomly with replacement. Each collection of subset data is then used
to train its own decision tree, so we end up with an ensemble of various
models. The average of all the predictions from the numerous trees is used,
which is more robust than a single decision tree.
Bagging avoids overfitting of data and is used for both regression and
classification models, specifically for decision tree algorithms.
Steps to Perform Bagging
• Consider there are n observations and m features in the training set. You
need to select random samples from the training dataset with
replacement (bootstrap samples).
• The feature offering the best split out of the lot is used to split the nodes.
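A bagging sketch with scikit-learn's BaggingClassifier (the estimator parameter was named base_estimator in older scikit-learn versions); bootstrap=True makes each tree train on a sample drawn with replacement.
# Bagging: many trees on bootstrap samples, predictions aggregated.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,    # number of bootstrap-trained trees
    bootstrap=True,     # sample the training data with replacement
).fit(X, y)
print("training accuracy:", bag.score(X, y))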
Boosting
The basic principle behind the working of the boosting algorithm is to generate
multiple weak learners and combine their predictions to form one strong rule.
These weak rules are generated by applying base Machine Learning algorithms on
different distributions of the data set. These algorithms generate weak rules for
each iteration. After multiple iterations, the weak learners are combined to form a
strong learner that will predict a more accurate outcome.
Here's how the algorithm works:
Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.
Step 2: False predictions made by the base learner are identified. In the next
iteration, these false predictions are assigned to the next base learner with a higher
weightage on these incorrect predictions.
Step 3: Repeat step 2 until the algorithm can correctly classify the output.
AdaBoost (Adaptive Boosting): The results from the first decision stump are
analyzed, and if any observations are wrongfully classified, they are assigned
higher weights. A new decision stump is drawn by considering the higher-weight
observations as more significant. Again, if any observations are misclassified,
they are given a higher weight, and this process continues until all the
observations fall into the right class.
AdaBoost can be used for both classification and regression-based problems.
However, it is more commonly used for classification purposes.
Gradient Boosting: The difference in this boosting type is that the weights for
misclassified outcomes are not incremented. Instead, the gradient boosting
method tries to optimize the loss function of the previous learner by adding a
new model that incorporates weak learners to reduce the loss.
The main idea here is to overcome the errors in the previous learner's predictions.
This boosting has three main components:
o Loss function: The use of the loss function depends on the type of
problem. The advantage of gradient boosting is that there is no need for a
new boosting algorithm for each loss function.
o Weak learner: In gradient boosting, decision trees are used as weak
learners. A regression tree is used to give true values, which can combine to
create correct predictions. Like in the AdaBoost algorithm, small trees with
a single split, i.e., decision stumps, are used; larger trees, with 4 to 8 levels,
can also be used.
o Additive Model: Trees are added one at a time in this model. Existing trees
remain the same. During the addition of trees, gradient descent is used to
minimize the loss function.
Like AdaBoost, Gradient Boosting can also be used for classification and
regression problems.
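The two boosting variants side by side, as a scikit-learn sketch; AdaBoost reweights misclassified samples, while gradient boosting fits each new tree to the previous ensemble's errors. Dataset and parameter values are illustrative.
# AdaBoost vs Gradient Boosting on the same data.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)           # stumps + reweighting
gbm = GradientBoostingClassifier(n_estimators=50).fit(X, y)   # trees fit to residuals
print("AdaBoost accuracy:", ada.score(X, y))
print("Gradient Boosting accuracy:", gbm.score(X, y))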
XGBoost (Extreme Gradient Boosting): The main aim of this algorithm is to
increase the speed and efficiency of computation. The gradient boosting
algorithm computes the output slowly, since it analyzes the data set
sequentially; therefore, XGBoost is used to boost, or extremely boost, the
model's performance.
Random Forest
Random forest is an ensemble classifier that builds a number of decision trees
on various subsets of the given dataset and combines their predictions to
improve accuracy. The greater number of trees in the forest leads to higher
accuracy and prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
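A random-forest sketch mirroring Steps 1-5, with scikit-learn; N = 100 trees is an arbitrary choice.
# Random forest: N trees on random subsets, combined by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)  # N = 100 trees
print(forest.predict(X[:3]))   # each prediction is a majority vote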
Gradient Descent
The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:
o If we move towards a negative gradient, i.e., away from the gradient of the
function at the current point, we will reach the local minimum of that function.
o If we move towards a positive gradient, i.e., towards the gradient of the
function at the current point, we will reach the local maximum of that function.
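A from-scratch sketch of moving against the gradient to find a local minimum; the function f(x) = (x - 3)², the learning rate, and the iteration count are our own illustrative choices.
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def grad(x):
    return 2 * (x - 3)     # derivative of (x - 3)^2

x = 0.0                    # starting point
lr = 0.1                   # learning rate
for _ in range(100):
    x -= lr * grad(x)      # step in the negative gradient direction
print(x)                   # approximately 3.0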
Statistical Learning
Statistics is the science that allows us to collect, analyze, interpret, present, and
organize data. It provides a robust set of tools for understanding patterns and
trends, and making inferences and predictions based on data. When we're dealing
with large datasets, statistics help us understand and summarize the data, allowing
us to make sense of complex phenomena.
Machine learning, on the other hand, is a powerful tool that allows computers to
learn from and make decisions or predictions based on data. The goal of machine
learning is to create models that can adapt and improve over time, as well as
generalize from specific examples to broader cases.
This is where the beauty of the fusion between statistics and machine learning
comes to light. The principles of statistics are the very pillars that uphold the
structure of machine learning.
• Constructing machine learning models. Statistics provides the
methodologies and principles for creating models in machine learning. For
instance, the linear regression model leverages the statistical method of
least squares to estimate the coefficients.
• Interpreting results. Statistical concepts allow us to interpret the results
generated by machine learning models. Measures such as p-value,
confidence intervals, R-squared, and others provide us with a statistical
perspective on the machine learning model’s performance.
• Validating models. Statistical techniques are essential for validating and
refining machine learning models. For instance, techniques like hypothesis
testing, cross-validation, and bootstrapping help us quantify the
performance of models and avoid problems like overfitting.
EM (Expectation-Maximization) Algorithm
The EM algorithm is an iterative method for estimating the parameters of models
that involve unobserved latent variables; it alternates between an expectation
step (E-step) and a maximization step (M-step).
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
Unsupervised Learning
The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters such
that the objects with the most similarities remain in one group and have few
or no similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which
is used for finding the relationships between variables in the large database.
It determines the set of items that occur together in the dataset. Association
rules make marketing strategy more effective. People who buy X item
(suppose a bread) also tend to purchase Y (Butter/Jam) item. A typical
example of Association rule is Market Basket Analysis.
K-means Clustering
It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in the unlabeled dataset on its own without the
need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value of k
should be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not be points from the
input dataset).
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step: reassign each datapoint to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
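A K-means sketch with scikit-learn; k = 2 and the six 2-D points are arbitrary, and fit() internally repeats the assign/update steps above until the centroids stop moving.
# K-means clustering with a predetermined k.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10).fit(X)
print("labels:", km.labels_)              # cluster index of each point
print("centroids:", km.cluster_centers_)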
Association Rule Learning
For example, if a customer buys bread, he will most likely also buy butter, eggs,
or milk, so these products are stored on the same shelf or mostly nearby.
Association rule learning works on the concept of If and Then statements, such
as "if A, then B". The strength of an association rule is measured using the
following metrics:
o Support
o Confidence
o Lift
Support
Support is the frequency of an item X, i.e., how frequently the item appears in
the dataset. It is defined as the fraction of the transactions T that contain the
itemset X: Supp(X) = Freq(X) / T.
Confidence
Confidence indicates how often the rule has been found to be true. Or how often
the items X and Y occur together in the dataset when the occurrence of X is
already given. It is the ratio of the transaction that contains X and Y to the number
of records that contain X.
Lift
It is the ratio of the observed support measure to the expected support if X and Y
were independent of each other. It has three possible values:
o Lift = 1: the occurrence of the antecedent and the consequent are independent
of each other.
o Lift > 1: the two itemsets are positively dependent on each other.
o Lift < 1: one item is a substitute for the other (a negative association).
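The three metrics can be computed directly; here is a from-scratch sketch for the rule "bread -> butter" on a toy transaction list (the transactions are invented).
# Support, confidence, and lift for the rule "bread -> butter".
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
n = len(transactions)
freq_bread = sum("bread" in t for t in transactions)
freq_butter = sum("butter" in t for t in transactions)
freq_both = sum({"bread", "butter"} <= t for t in transactions)

support = freq_both / n                 # fraction containing both items
confidence = freq_both / freq_bread     # P(butter | bread)
lift = confidence / (freq_butter / n)   # > 1 means positive association
print(support, confidence, lift)        # 0.5, 0.667, 1.333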
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules and
works on databases containing transactions, using a breadth-first search to count
itemsets efficiently. It is mainly used for market basket analysis and helps to
understand the products that can be bought together. It can also be used in the
healthcare field to find drug reactions for patients.
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. It uses a
depth-first search technique to find the frequent itemsets in a transaction
database and generally runs faster than the Apriori algorithm.
F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern growth, and it is an
improved version of the Apriori algorithm. It represents the database in the form
of a tree structure known as a frequent pattern tree; the purpose of this tree is
to extract the most frequent patterns.
Reinforcement Learning
Definition: Reinforcement learning is a feedback-based machine learning
technique in which an agent learns to behave in an environment by performing
actions and seeing the results of those actions; good actions bring positive
feedback (rewards) and bad actions bring negative feedback (penalties).
Terms used in Reinforcement Learning:
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is
random in nature.
o Action (a): Actions are the moves taken by an agent within the environment.
o State (s): The situation returned by the environment after each action taken
by the agent.
o Reward (R): Feedback returned to the agent from the environment to
evaluate the agent's action.
o Policy (π): The strategy applied by the agent to choose the next action based
on the current state.
o Value (V): The expected long-term return with the discount factor, as
opposed to the short-term reward.
o Q-value (Q): Mostly similar to the value, but it takes one additional
parameter, the current action (a).
Markov Decision Process
MDP is used to describe the environment for the RL, and almost all the RL
problems can be formalized using MDP.
MDP uses Markov property, and to better understand the MDP, we need to learn
about it.
Markov Property:
It says that "if the agent is present in the current state s1, performs an action a1,
and moves to the state s2, then the state transition from s1 to s2 depends only
on the current state and action; future actions and states do not depend on past
actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does not
depend on any past action or state. Hence, MDP is an RL problem that satisfies
the Markov property. Such as in a Chess game, the players only focus on the
current state and do not need to remember past actions or states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In
RL, we consider only the finite MDP.
Applications of Reinforcement Learning
• Robotics: RL is used in Robot navigation, Robo-soccer, walking, juggling,
etc.
• Control: RL can be used for adaptive control, such as factory processes and
admission control in telecommunication; helicopter piloting is another
example of reinforcement learning control.
• Game Playing: RL can be used in Game playing such as tic-tac-toe, chess,
etc.
• Chemistry: RL can be used for optimizing the chemical reactions.
• Business: RL is now used for business strategy planning.
• Manufacturing: In various automobile manufacturing companies, the
robots use deep reinforcement learning to pick goods and put them in some
containers.
• Finance Sector: The RL is currently used in the finance sector for
evaluating trading strategies.
1. Value-based:
The value-based approach is about finding the optimal value function,
which is the maximum value at a state under any policy. Therefore, the
agent expects a long-term return at any state(s) under policy π.
2. Policy-based:
A policy-based approach is to find the optimal policy for the maximum
future rewards without using the value function. In this approach, the
agent tries to apply such a policy that the action performed in each step
helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any
state.
o Stochastic: In this policy, probability determines the produced
action.
3. Model-based: In the model-based approach, a virtual model is created for
the environment, and the agent explores that environment to learn it. There
is no solution or algorithm for this approach because the model
representation is different for each environment.
To understand the working process of the RL, we need to consider two main
things:
o Environment: it can be anything such as a room, a maze, a football ground, etc.
o Agent: an intelligent agent such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore.
Consider the below image:
In the above image, the agent is at the very first block of the maze. The maze
consists of an S6 block, which is a wall, S8, a fire pit, and S4, a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the
S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward
point. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the
fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; then
it will get the +1 reward point.
The agent will try to remember the preceding steps that it has taken to reach the
final step. To memorize the steps, it assigns the value 1 to each previous step.
Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the
value 1 to each previous block. But what will the agent do if it starts moving
from a block that has blocks with value 1 on both sides? Consider the below
diagram:
This problem is resolved using the Bellman equation:
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = the value calculated at a particular state.
R(s,a) = the reward obtained by taking action a at state s.
γ = the discount factor.
V(s') = the value of the next state.
In the above equation, we take the maximum over the possible actions because
the agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find value at each state of the given
environment. We will start from the block, which is next to the target block.
V(s3) = max [R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to
move to, and R(s,a) = 1 for reaching the diamond, so V(s3) = max[1] = 1.
V(s2) = max [R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0
because there is no reward at this state, so V(s2) = 0.9 × 1 = 0.9.
V(s1) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0 because
there is no reward at this state either, so V(s1) = 0.9 × 0.9 = 0.81.
V(s5) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0 because
there is no reward at this state either, so V(s5) = 0.9 × 0.81 ≈ 0.73.
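The same value propagation can be sketched in a few lines of Python; each max in the equation is taken over the actions available at a state, and here only the best action's term is shown.
# Bellman updates V(s) = max[R(s, a) + gamma * V(s')] along s3 -> s2 -> s1 -> s5.
gamma = 0.9
V = {}
V["s3"] = 1 + gamma * 0.0       # reward 1 for stepping onto the diamond
V["s2"] = 0 + gamma * V["s3"]   # 0.9
V["s1"] = 0 + gamma * V["s2"]   # 0.81
V["s5"] = 0 + gamma * V["s1"]   # 0.729, approximately 0.73
print(V)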
Now, we will move further to the 6th block, and here the agent may change the
route because it always tries to find the optimal path. So now, let's consider from
the block next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box, it will
feel a bump; if it moves to the fire pit, it will get the -1 reward. But here we are
considering only positive rewards, so it will move upwards only.
The complete block values will be calculated using this formula. Consider the
below image:
Reinforcement Learning: Active vs Passive
Both active and passive reinforcement learning are types of RL. In the case of
passive RL, the agent’s policy is fixed which means that it is told what to do. In
contrast to this, in active RL, an agent needs to decide what to do as there’s no
fixed policy that it can act on. Therefore, the goal of a passive RL agent is to
execute a fixed policy (sequence of actions) and evaluate it while that of an active
RL agent is to act and learn an optimal policy.
Passive Learning
As the goal of the agent is to evaluate how good the given fixed policy is, the
agent needs to learn the expected utility Uπ(s) for each state s. This can be done
in three ways: direct utility estimation, adaptive dynamic programming (ADP),
and temporal difference (TD) learning.
1. Direct Utility Estimation: In this method, the agent executes a sequence of
trials or runs (sequences of state-action transitions that continue until the agent
reaches the terminal state). Each trial gives a sample value, and the agent
estimates the utility based on the sample values; the utility can be calculated as
a running average of the sample values.
The main drawback is that this method makes a wrong assumption that state
utilities are independent while they are Markovian. Also, it is slow to converge.
2. Adaptive Dynamic Programming (ADP): Here, the utilities can be solved for
using the value-iteration algorithm. The algorithm converges quickly but can
become quite costly to compute for large state spaces. ADP is a model-based
approach and requires the transition model of the environment.
3. Temporal Difference (TD) Learning: TD learning does not require the agent
to learn the transition model. The update occurs between successive states, and
the agent only updates states that are directly affected.
While ADP adjusts the utility of s with all of its successor states, TD learning
adjusts it with that of a single successor state s'. TD is slower in convergence but
much simpler in terms of computation.
Active Learning
The active learning agent learns a complete model with outcome probabilities
for all actions. In this case, the agent has a choice of actions. It uses the Bellman
equations to calculate utilities and define an optimal policy. After obtaining the
optimal utility function, the agent can extract an optimal action by looking
ahead one step so that the expected utility is maximized.
This can be done using a passive ADP agent which then, using value or policy
iteration, learns optimal actions. But this approach results in a greedy agent.
Hence, we use an approach that gives higher weights to unexplored actions and
lower weights to actions with lower utilities.
Q-Learning
Q-learning is a TD learning method which does not require the agent to learn
the transition model; instead, it learns Q-value functions Q(s, a). Q-values are
directly related to the utility values of states: U(s) = max over a of Q(s, a).
▪ Q-learning is a model-free method, as it does not need any model for learning
or for action selection.
▪ The Q-learning agent is an active learner. It learns the value Q(s, a) of each
action in every situation.
▪ The use of an exploration function is the same as in ADP, but Q-learning does
not learn the transition model as the other agents do; rather, it uses Q-values,
which can be related directly to those of neighboring states.
▪ This is simpler to compute but slower than ADP.
▪ Also, it need not pay attention to the policy being followed, because of the
use of Q-values. For this reason, it is also called an "off-policy" learning
algorithm.
▪ It can behave well even if the policy is random or adversarial.
Due to these properties, Q-learning methods are among the most successful
methods, even in games such as checkers and chess, where the evaluation
function can be learned by a Q-learning-based model.
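A tabular Q-learning sketch on an invented two-state toy problem; the dynamics, the learning rate alpha, and the discount gamma are our own assumptions, and the update line is the standard off-policy rule.
# Tabular Q-learning:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
import random

states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma = 0.5, 0.9

def step(s, a):
    # toy dynamics: action 1 in state 0 reaches state 1 and earns reward 1
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

s = 0
for _ in range(200):
    a = random.choice(actions)            # random exploration policy
    s_next, r = step(s, a)
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next
print(Q)   # Q(0, 1) ends up largest: it leads to the only reward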
Hidden Markov Models (HMMs) are a type of probabilistic model that is commonly
used in machine learning for tasks such as speech recognition, natural language
processing, and bioinformatics. They are a popular choice for modelling sequences of
data because they can effectively capture the underlying structure of the data, even
when the data is noisy or incomplete.
The basic idea behind an HMM is that the hidden states generate the observations,
and the observed data is used to estimate the hidden-state sequence; the standard
procedure for computing these hidden-state probabilities is the forward-backward
algorithm.
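As a concrete sketch of the forward pass of that algorithm, the NumPy code below computes the probability of an observation sequence by summing over all hidden-state paths; the transition, emission, and initial probabilities are invented for illustration.
# Forward algorithm for an HMM: P(observation sequence).
import numpy as np

T = np.array([[0.7, 0.3],   # transition: P(next hidden state | current)
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],   # emission: P(observed symbol | hidden state)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])   # initial hidden-state distribution
obs = [0, 1, 0]             # observed symbol indices

alpha = pi * E[:, obs[0]]           # initialize with the first observation
for o in obs[1:]:
    alpha = (alpha @ T) * E[:, o]   # propagate, then weight by emission
print("P(observations):", alpha.sum())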