
ARTIFICIAL INTELLIGENCE

T.Y.B.Sc. Computer Science


Semester-V

Compiled By:
Asst. Prof. Megha Sharma
For video lectures, visit my YouTube channel: OMega TechEd.

References:
Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach, Third Edition.

www.javatpoint.com
Unit-1
Introduction to AI and Intelligent Agents
What Is AI: Foundations, History and State of the Art of AI
Intelligent Agents: Agents and Environments, Nature of Environments,
Structure of Agents.
Problem Solving by searching: Problem-Solving Agents, Uninformed
Search Strategies, Informed (Heuristic) Search Strategies

Introduction to AI

Artificial Intelligence is composed of two words, Artificial and Intelligence: "artificial"
means "man-made" and "intelligence" means "thinking power", so AI means "man-made
thinking power."

Definition: "It is a branch of computer science by which we can create intelligent
machines which can behave like humans, think like humans, and make decisions."

Approaches to AI
1. Acting humanly: The Turing Test approach:
To pass the test, the computer would need to possess the following capabilities:

• natural language processing to communicate successfully in a human language;
• knowledge representation to store what it knows or hears;
• automated reasoning to use the stored information to answer questions and to
draw new conclusions;
• machine learning to adapt to new circumstances and to detect and extrapolate
patterns;
• computer vision to perceive objects; and
• robotics to manipulate objects and move about.

Acting humanly: The Turing Test approach

In 1950, Alan Turing introduced a test to check whether a machine can think like a
human or not, this test is known as the Turing Test. In this test, Turing proposed that
the computer can be said to be intelligent if it can mimic human response under specific
conditions.

Consider that Player A is a computer, Player B is a human, and Player C is an interrogator.

The interrogator is aware that one of them is a machine, but he needs to identify which one
based on questions and their responses. The conversation between all players is via
keyboard and screen, so the result does not depend on the machine's ability to render
words as speech.

The test result does not depend on whether each answer is correct, but only on how
closely the responses resemble human answers. The computer is permitted to do
everything possible to force a wrong identification by the interrogator.
2. Thinking humanly: The cognitive modeling approach.
If we are going to say that a given program thinks like a human, we must have
some way of determining how humans think. We need to get inside the actual
workings of human minds. There are three ways to do this: through
introspection—trying to catch our own thoughts as they go by; through
psychological experiments—observing a person in action; and through brain
imaging—observing the brain in action. Once we have a sufficiently precise
theory of the mind, it becomes possible to express the theory as a computer
program. If the program’s input–output behavior matches corresponding human
behavior, that is evidence that some of the program’s mechanisms could also be
operating in humans.
3. Thinking rationally: The “laws of thought” approach
The Greek philosopher Aristotle was one of the first to attempt to codify “right
thinking,” that is, irrefutable reasoning processes. His syllogisms provided
patterns for argument structures that always yielded correct conclusions when
given correct premises—for example, “Socrates is a man; all men are mortal;
therefore, Socrates is mortal.” These laws of thought were supposed to govern
the operation of the mind; their study initiated the field called logic.
4. Acting rationally: The rational agent approach
An agent is just something that acts (agent comes from the Latin agere, to do).
Of course, all computer programs do something, but computer agents are
expected to do more: operate autonomously, perceive their environment, persist
over a prolonged time, adapt to change, and create and pursue goals. A rational
agent is one that acts to achieve the best outcome or, when there is uncertainty,
the best expected outcome. The rational-agent approach has two advantages
over the other approaches.
First, it is more general than the “laws of thought” approach because correct
inference is just one of several possible mechanisms for achieving rationality.
Second, it is more amenable to scientific development than are approaches
based on human behavior or human thought. The standard of rationality is
mathematically well defined and completely general and can be “unpacked” to
generate agent designs that provably achieve it. Human behavior, on the other
hand, is adapted for one specific environment and is defined by, well, the sum
of all the things that humans do.

Application of AI

Artificial Intelligence has many applications today. It has become essential because it can
solve complex problems efficiently in multiple industries, such as healthcare,
entertainment, finance, and education. AI is making our daily life more comfortable and
faster.

Following are some sectors which have the application of Artificial Intelligence:

1. AI in Astronomy
o Artificial Intelligence can be very useful for solving complex problems about the
universe. AI technology can help us understand the universe, such as how it works
and how it originated.

2. AI in Healthcare

o In the last five to ten years, AI has become increasingly valuable to the healthcare
industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than
humans. AI can help doctors with diagnoses and can warn staff when a patient's
condition is worsening so that medical help can reach the patient before hospitalization.

3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games
like chess, where the machine needs to think about a large number of possible moves.

4. AI in Finance
o AI and the finance industry are a natural match. The finance
industry is implementing automation, chatbots, adaptive intelligence, algorithmic
trading, and machine learning in financial processes.

5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing
rapidly in the digital world. AI can be used to make data safer and
more secure. Tools such as the AEG bot and the AI2 platform are used to
detect software bugs and cyber-attacks more effectively.

6. AI in social media
o Social Media sites such as Facebook, Twitter, and Snapchat contain billions of
user profiles, which need to be stored and managed in a very efficient way. AI
can organize and manage massive amounts of data. AI can analyze large amounts of
data to identify the latest trends, hashtags, and the requirements of different users.

7. AI in Travel & Transport


o AI is in high demand in the travel industry. AI can handle various travel-related
tasks, from making travel arrangements to suggesting hotels, flights, and the best
routes to customers. The travel industry uses AI-powered chatbots that can interact
with customers in a human-like way for better and faster responses.

8. AI in Automotive Industry
o Some automotive companies use AI to provide virtual assistants to their
users for better performance. For example, Tesla has introduced TeslaBot, an
intelligent virtual assistant.
o Various companies are currently working to develop self-driving cars which can
make journeys safer and more secure.

9. AI in Robotics:
o Artificial Intelligence has a remarkable role in robotics. Usually, general robots
are programmed to perform repetitive tasks, but with the
help of AI, we can create intelligent robots which can perform tasks based on their
own experience without being pre-programmed.
o Humanoid robots are among the best examples of AI in robotics; recently, the
intelligent humanoid robots Erica and Sophia have been developed, which
can talk and behave like humans.
10. AI in Entertainment
o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI
algorithms, these services show recommendations for programs or shows.

11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time
for the best results. Nowadays agriculture is becoming digital, and AI is emerging in
this field. AI is applied in agriculture through agricultural robotics, soil and crop
monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.

12. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is
increasingly in demand in the e-commerce business. AI helps shoppers
discover associated products in their preferred size, color, or even brand.

13. AI in education:
o AI can automate grading so that the tutor can have more time to teach. An AI chatbot
can communicate with students as a teaching assistant.
o AI in the future can be used as a personal virtual tutor for students, which will be
accessible easily at any time and any place.

History of Artificial Intelligence

Artificial Intelligence is not a new word or a new technology for researchers. The idea is
much older than you might imagine; there are even myths of mechanical men in ancient
Greek and Egyptian mythology.

The following are some milestones in the history of AI, tracing its journey from its origins
to its present-day development.

Maturation of Artificial Intelligence (1943-1952)

o Year 1943: The first work which is now recognized as AI was done by Warren
McCulloch and Walter Pitts in 1943. They proposed a model of artificial
neurons.
o Year 1949: Donald Hebb demonstrated an updating rule for modifying the
connection strength between neurons. His rule is now called Hebbian learning.
o Year 1950: Alan Turing, an English mathematician and a pioneer of machine
intelligence, published "Computing Machinery and Intelligence" in 1950, in which
he proposed a test that checks a machine's ability to exhibit intelligent behavior
equivalent to human intelligence; this is now called the Turing test.

The birth of Artificial Intelligence (1952-1956)


o Year 1955: Allen Newell and Herbert A. Simon created the "first artificial
intelligence program” Which was named as "Logic Theorist". This program has
proved 38 of 52 Mathematics theorems and found new and more elegant proofs
for some theorems.
o Year 1956: The word "Artificial Intelligence" first adopted by American
Computer scientist John McCarthy at the Dartmouth Conference. For the first
time, AI was coined as an academic field.

Around that time, high-level computer languages such as FORTRAN, LISP, and COBOL
were invented, and enthusiasm for AI was very high.

The golden years-Early enthusiasm (1956-1974)


o Year 1966: Researchers emphasized developing algorithms that could solve
mathematical problems. Joseph Weizenbaum created the first chatbot, named
ELIZA, in 1966.
o Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in
Japan.

The first AI winter (1974-1980)


o The period from 1974 to 1980 was the first AI winter. An AI
winter refers to a time when computer scientists faced a severe shortage of
government funding for AI research.
o During AI winters, public interest in artificial intelligence declined.

A boom of AI (1980-1987)
o Year 1980: After AI winter duration, AI came back with "Expert System".
Expert systems were programmed to emulate the decision-making ability of a
human expert.
o In the Year 1980, the first national conference of the American Association of
Artificial Intelligence was held at Stanford University.

The second AI winter (1987-1993)


o The period from 1987 to 1993 was the second AI winter.
o Investors and governments again stopped funding AI research because of high
costs and disappointing results; even expert systems such as XCON proved very
expensive to maintain.

The emergence of intelligent agents (1993-2011)


o Year 1997: In 1997, IBM's Deep Blue beat world chess champion Garry
Kasparov, becoming the first computer to beat a reigning world chess champion.
o Year 2002: For the first time, AI entered the home in the form of Roomba, a
robotic vacuum cleaner.
o Year 2006: AI entered the business world. Companies like
Facebook, Twitter, and Netflix also started using AI.

Deep learning, big data, and artificial general intelligence (2011-present)


o Year 2011: In 2011, IBM's Watson won Jeopardy!, a quiz show in which it
had to answer complex questions as well as riddles. Watson proved that it
could understand natural language and answer tricky questions quickly.
o Year 2012: Google launched the Android feature "Google Now", which
could provide information to the user as predictions.
o Year 2014: The chatbot "Eugene Goostman" won a competition in the
famous "Turing test."
o Year 2018: IBM's "Project Debater" debated complex topics with
two master debaters and performed extremely well.
o Google demonstrated an AI program, "Duplex", a virtual assistant that booked
a hairdresser appointment over the phone, and the person on the other end did
not notice that she was talking to a machine.

AI has now developed to a remarkable level. Deep learning, big data,
and data science are booming. Companies like Google,
Facebook, IBM, and Amazon are working with AI and creating amazing devices. The
future of Artificial Intelligence is inspiring and promises highly intelligent systems.

Foundation of Artificial Intelligence

Artificial intelligence is a science and technology based on disciplines such as


Computer Science, Biology, Psychology, Linguistics, Mathematics, and
Engineering. A major thrust of AI is in the development of computer functions
associated with human intelligence, such as reasoning, learning, and problem
solving.
Agents

An agent is anything that can be viewed as perceiving its environment through


sensors and acting upon that environment through actuators. A human agent has
eyes, ears, and other organs for sensors and hands, legs, vocal tract, and so on
for actuators. A robotic agent might have cameras and infrared range finders for
sensors and various motors for actuators. An agent’s choice of action at any
given instant can depend on the entire percept sequence observed to date, but
not on anything it has not perceived.
Mathematically speaking, we say that an agent’s behavior is described by the
agent function that maps any given percept sequence to an action.
The agent function is an abstract mathematical description; the agent program
is a concrete implementation, running within some physical system.
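
To make the distinction concrete, here is a minimal Python sketch of a table-driven agent
program (the vacuum-world percepts and the lookup table below are illustrative
assumptions, not part of the text): the agent function is the abstract mapping encoded in
the table, while agent_program is the concrete implementation that records the percept
sequence and looks it up.

# A minimal, hypothetical sketch of a table-driven agent program (Python).
# The agent function is represented explicitly as a lookup table from
# percept sequences to actions; the agent program keeps the percept history.

def make_table_driven_agent(table):
    percepts = []                      # percept sequence observed so far

    def agent_program(percept):
        percepts.append(percept)
        # Look up the action for the whole percept sequence (default: do nothing).
        return table.get(tuple(percepts), "NoOp")

    return agent_program

# Example: a two-square vacuum world with locations A and B.
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("B", "Dirty"),): "Suck",
    (("B", "Clean"),): "Left",
}

agent = make_table_driven_agent(table)
print(agent(("A", "Dirty")))           # -> Suck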

Types of Agents

Agents can be grouped into five classes based on their degree of perceived intelligence
and capability. All these agents can improve their performance and generate better
action over time. These are given below:

o Simple Reflex Agent


o Model-based reflex agent
o Goal-based agents
o Utility-based agent
o Learning agent

1. Simple Reflex agent:


o The Simple reflex agents are the simplest agents. These agents take decisions
based on the current percepts and ignore the rest of the percept history.
o These agents only succeed in the fully observable environment.
o The Simple reflex agent does not consider any part of the percept history during
its decision and action process.
o The Simple reflex agent works on condition-action rules, which map the current
state directly to an action. For example, a room-cleaner agent sucks only if there is
dirt in the room (see the sketch after this list).
o Problems for the simple reflex agent design approach:
o They have very limited intelligence.
o They do not have knowledge of non-perceptual parts of the current state.
o The rule tables are usually too big to generate and to store.
o They are not adaptive to changes in the environment.
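
As a minimal sketch of the condition-action idea, assuming a two-square vacuum world
with locations A and B (an illustrative example, not prescribed by the notes):

# Hypothetical sketch: a simple reflex agent for a two-square vacuum world.
# It looks only at the current percept (location, status) and applies
# condition-action rules; it keeps no percept history.

def simple_reflex_vacuum_agent(percept):
    location, status = percept
    if status == "Dirty":      # condition: dirt present -> action: suck
        return "Suck"
    elif location == "A":      # clean at A -> move right
        return "Right"
    else:                      # clean at B -> move left
        return "Left"

print(simple_reflex_vacuum_agent(("A", "Dirty")))   # -> Suck
print(simple_reflex_vacuum_agent(("B", "Clean")))   # -> Left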

2. Model-based reflex agent


o The Model-based agent can work in a partially observable environment and track
the situation.
o A model-based agent has two important factors:
o Model: It is knowledge about "how things happen in the world," so it is
called a Model-based agent.
o Internal State: It is a representation of the current state based on percept
history.
o These agents have the model, "which is knowledge of the world" and based on
the model they perform actions.
o Updating the agent state requires information about:
a. How the world evolves
b. How the agent's action affects the world.
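
A minimal, hypothetical sketch of a model-based reflex agent for the same two-square
vacuum world: it keeps an internal state that is updated from each percept (the update
rules below are simple illustrative assumptions about how percepts reveal the world):

# Hypothetical sketch: a model-based reflex agent (Python).
# It keeps an internal state, updates it using a simple model of how percepts
# reveal the world, then applies condition-action rules to the updated state.

class ModelBasedVacuumAgent:
    def __init__(self):
        # Internal state: believed status of each square (unknown at first).
        self.state = {"A": None, "B": None}
        self.location = "A"

    def update_state(self, percept):
        location, status = percept
        self.location = location          # model: the percept reveals our location
        self.state[location] = status     # model: the percept reveals local status

    def program(self, percept):
        self.update_state(percept)
        if self.state[self.location] == "Dirty":
            return "Suck"
        # If the other square's status is dirty or unknown, go and inspect it.
        return "Right" if self.location == "A" else "Left"

agent = ModelBasedVacuumAgent()
print(agent.program(("A", "Clean")))      # -> Right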
3. Goal-based agents
o Knowledge of the current state of the environment is not always sufficient for an
agent to decide what to do.
o The agent needs to know its goal which describes desirable situations.
o Goal-based agents expand the capabilities of the model-based agent by having
the "goal" information.
o They choose an action, so that they can achieve the goal.
o These agents may have to consider a long sequence of possible actions before
deciding whether the goal is achieved or not. Such considerations of different
scenarios are called searching and planning, which makes an agent proactive.

4. Utility-based agents
o These agents are like the goal-based agent but provide an extra component of
utility measurement which makes them different by providing a measure of
success at a given state.
o Utility-based agent acts based not only on goals but also the best way to achieve
the goal.
o The Utility-based agent is useful when there are multiple possible alternatives,
and an agent must choose to perform the best action.
o The utility function maps each state to a real number to check how efficiently
each action achieves the goals.

5. Learning Agents
o A learning agent in AI is the type of agent which can learn from its past
experiences, or it has learning capabilities.
o It starts acting with basic knowledge and is then able to act and adapt automatically
through learning.
o A learning agent has mainly four conceptual components, which are:
a. Learning element: It is responsible for making improvements by learning
from the environment.
b. Critic: The learning element takes feedback from the critic, which describes
how well the agent is doing with respect to a fixed performance standard.
c. Performance element: It is responsible for selecting external action.
d. Problem generator: This component is responsible for suggesting actions
that will lead to new and informative experiences.
o Hence, learning agents can learn, analyze performance, and look for new ways to
improve the performance.

Intelligent Agents:

An intelligent agent is an autonomous entity which acts upon an environment using


sensors and actuators to achieve its goals. An intelligent agent may learn from the
environment to achieve those goals. A thermostat is an example of an intelligent agent.

Following are the main four rules for an AI agent:

o Rule 1: An AI agent must have the ability to perceive the environment.


o Rule 2: Observation must be used to make decisions.
o Rule 3: Decision should result in an action.
o Rule 4: The action taken by an AI agent must be a rational action.

Rational Agent:

A rational agent is an agent which has clear preference, models uncertainty, and acts in
a way to maximize its performance measure with all possible actions.

A rational agent is said to do the right thing. AI is concerned with creating rational agents,
drawing on game theory and decision theory for various real-world scenarios.

Rational action is important for an AI agent because, for example, in reinforcement
learning an agent gets a positive reward for each good action and a negative reward for
each wrong action.
The rationality of an agent is measured by its performance measure. Rationality can be
judged based on following points:

o Performance measure which defines the success criterion.


o Agent’s prior knowledge of its environment.
o Best possible actions that an agent can perform.
o The sequence of percepts.

This leads to a definition of a rational agent:


“For each possible percept sequence, a rational agent should select an action
that is expected to maximize its performance measure, given the evidence
provided by the percept sequence and whatever built-in knowledge the agent
has.”

PEAS Representation

PEAS is a model used to describe an AI agent. When we define an AI agent or a rational
agent, we can group its properties under the PEAS representation model. It is made up of
four terms:

o P: Performance measure
o E: Environment
o A: Actuators
o S: Sensors

Here performance measure is the objective for the success of an agent's behavior.

PEAS for self-driving cars:

Let's suppose a self-driving car then PEAS representation will be:

Performance: Safety, time, legal drive, comfort

Environment: Roads, other vehicles, road signs, pedestrian

Actuators: Steering, accelerator, brake, signal, horn

Sensors: Camera, GPS, speedometer, odometer, accelerometer, sonar.


Additional PEAS examples:

1. Medical Diagnosis system
   Performance measure: Healthy patient, minimized cost
   Environment: Patient, hospital, staff
   Actuators: Tests, treatments
   Sensors: Keyboard (entry of symptoms)

2. Vacuum Cleaner
   Performance measure: Cleanness, efficiency, battery life, security
   Environment: Room, table, wood floor, carpet, various obstacles
   Actuators: Wheels, brushes, vacuum extractor
   Sensors: Camera, dirt detection sensor, cliff sensor, bump sensor, infrared wall sensor

3. Part-picking Robot
   Performance measure: Percentage of parts in correct bins
   Environment: Conveyor belt with parts, bins
   Actuators: Jointed arms, hand
   Sensors: Camera, joint angle sensors

Agent Environment

An environment is everything in the world which surrounds the agent, but it is not a
part of an agent itself. An environment can be described as a situation in which an agent
is present.

The environment is where the agent lives and operates, and it provides the agent with
something to sense and act upon. The environment is often said to be non-deterministic.

Types of Environments

As per Russell and Norvig, an environment can have various features from the point of
view of an agent:

1. Fully observable vs Partially Observable


2. Static vs Dynamic
3. Discrete vs Continuous
4. Deterministic vs Stochastic
5. Single-agent vs multi-agent
6. Episodic vs sequential
7. Known vs Unknown.
8. Accessible vs Inaccessible

1. Fully observable vs Partially Observable:


o If an agent sensor can sense or access the complete state of an environment at
each point of time then it is a fully observable environment, else it is partially
observable.
o A fully observable environment is easy as there is no need to maintain the
internal state to keep track of the history of the world.
o If an agent has no sensors at all, then the environment is
called unobservable.

2. Deterministic vs Stochastic:
o If an agent's current state and selected action can completely determine the next
state of the environment, then such an environment is called a deterministic
environment.
o A stochastic environment is random in nature and cannot be determined
completely by an agent.
o In a deterministic, fully observable environment, agents do not need to worry
about uncertainty.

3. Episodic vs Sequential:
o In an episodic environment, there is a series of one-shot actions, and only the
current percept is required for the action.
o However, in Sequential environment, an agent requires memory of past actions to
determine the next best actions.

4. Single-agent vs Multi-agent:
o If only one agent is involved in an environment, and operating by itself then such
an environment is called single agent environment.
o However, if multiple agents are operating in an environment, then such an
environment is called a multi-agent environment.
o The agent design problems in the multi-agent environment are different from
single agent environment.
5. Static vs Dynamic:
o If the environment can change itself while an agent is deliberating, then such an
environment is called a dynamic environment or else it is called a static
environment.
o Static environments are easy to deal with because an agent does not need to
continue looking at the world while deciding for an action.
o However, for dynamic environment, agents need to keep looking at the world at
each action.
o Taxi driving is an example of a dynamic environment whereas Crossword
puzzles are an example of a static environment.

6. Discrete vs Continuous:
o If in an environment there are a finite number of percepts and actions that can be
performed within it, then such an environment is called a discrete environment or
else it is called continuous environment.
o A chess game comes under discrete environment as there is a finite number of
moves that can be performed.
o A self-driving car is an example of a continuous environment.

7. Known vs Unknown
o Known and unknown are not actually features of the environment itself but of the
agent's state of knowledge about how to act in it.
o In a known environment, the results for all actions are known to the agent. While
in an unknown environment, agent needs to learn how it works to perform an
action.
o It is quite possible for a known environment to be partially observable and for an
unknown environment to be fully observable.

8. Accessible vs Inaccessible
o If an agent can obtain complete and accurate information about the state's
environment, then such an environment is called an Accessible environment or
else it is called inaccessible.
o An empty room whose state can be defined by its temperature is an example of
an accessible environment.
o Information about an event on earth is an example of Inaccessible environment.
Problem-solving agents:

In Artificial Intelligence, search techniques are universal problem-solving
methods. Rational agents or problem-solving agents in AI mostly use these search
strategies or algorithms to solve a specific problem and provide the best result.
Problem-solving agents are goal-based agents and use an atomic representation.

o Search: Searching is a step-by-step procedure to solve a search-problem in each


search space. A search problem can have three main factors:
a. Search Space: Search space represents a set of possible solutions, which a
system may have.
b. Start State: It is a state from where the agent begins the search.
c. Goal test: It is a function which observes the current state and returns
whether the goal state is achieved or not.
o Search tree: A tree representation of search problem is called Search tree. The
root of the search tree is the root node which corresponds to the initial state.
o Actions: It gives the description of all the available actions to the agent.
o Transition model: A description of what each action does, can be represented as
a transition model.
o Path Cost: It is a function which assigns a numeric cost to each path.
o Solution: It is an action sequence which leads from the start node to the goal
node.
o Optimal Solution: A solution that has the lowest cost among all solutions.

Problem Formulation:
A problem can be defined formally by five components:
➢ The initial state that the agent starts in
➢ A description of the possible actions available to the agent.
➢ A description of what each action does; the formal name for this is the
transition model.
Together, the initial state, actions, and transition model implicitly define the
state space of the problem—the set of all states reachable from the initial
state by any sequence of actions.
➢ The goal test, which determines whether a given state is a goal state.
Sometimes there is an explicit set of possible goal states, and the test
simply checks whether the given state is one of them.
➢ A path cost function that assigns a numeric cost to each path. The
problem-solving agent chooses a cost function that reflects its own
performance measure.
A solution to a problem is an action sequence that leads from the initial
state to a goal state. Solution quality is measured by the path cost
function, and an optimal solution has the lowest path cost among all
solutions.
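
These five components can be captured directly in code. The following Python sketch is
an illustrative abstraction (the class and method names are my own choices, not a
standard API); a concrete problem overrides actions, result, goal_test, and step_cost:

# Hypothetical sketch: the five components of a formal problem definition.

class Problem:
    def __init__(self, initial_state, goal_state=None):
        self.initial_state = initial_state
        self.goal_state = goal_state

    def actions(self, state):
        """Return the actions available in the given state."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state that results from doing action in state."""
        raise NotImplementedError

    def goal_test(self, state):
        """Return True if the given state is a goal state."""
        return state == self.goal_state

    def step_cost(self, state, action, next_state):
        """Cost of one step; the path cost is the sum of step costs along a path."""
        return 1

A solution is then a sequence of actions leading from the initial state to a goal state, and
an optimal solution is one whose path cost (sum of step costs) is lowest.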

Example: 8-Puzzle

(Figure: an example initial state and the goal state of the 8-puzzle.)

• States: A state description specifies the location of each of the


eight tiles and the blank in one of the nine squares.
• Initial state: Any state can be designated as the initial state. Note
that any given goal can be reached from exactly half of the
possible initial states.
• Actions: The simplest formulation defines the actions as
movements of the blank space Left, Right, Up, or Down.
Different subsets of these are possible depending on where the
blank is.
• Transition model: Given a state and action, this returns the
resulting state.
• Goal test: This checks whether the state matches the goal
configuration. (Other goal configurations are possible.)
• Path cost: Each step costs 1, so the path cost is the number of
steps in the path.
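
The 8-puzzle formulation above can be sketched as follows (a hypothetical illustration:
states are 9-tuples read row by row, with 0 denoting the blank; the goal configuration
used here is one possible choice):

# Hypothetical sketch: an 8-puzzle problem formulation (Python).
# A state is a tuple of 9 entries read row by row; 0 denotes the blank.

GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

def actions(state):
    """Moves of the blank: Left, Right, Up, Down, depending on its position."""
    moves = []
    row, col = divmod(state.index(0), 3)
    if col > 0: moves.append("Left")
    if col < 2: moves.append("Right")
    if row > 0: moves.append("Up")
    if row < 2: moves.append("Down")
    return moves

def result(state, action):
    """Transition model: swap the blank with the neighbouring tile."""
    i = state.index(0)
    j = i + {"Left": -1, "Right": 1, "Up": -3, "Down": 3}[action]
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

def goal_test(state):
    return state == GOAL

def path_cost(path):
    """Each step costs 1, so the path cost is the number of steps."""
    return len(path)

state = (1, 2, 0, 3, 4, 5, 6, 7, 8)
print(actions(state))                    # -> ['Left', 'Down']
print(result(state, "Left"))             # -> (1, 0, 2, 3, 4, 5, 6, 7, 8)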
Example: 8-queens problem

The goal of the 8-queens problem is to place eight queens on a chessboard such
that no queen attacks any other.
There are two main kinds of formulation. An incremental formulation involves
operators that augment the state description, starting with an empty state; for the
8-queens problem, this means that each action adds a queen to the state. A
complete-state formulation starts with all 8 queens on the board and moves
them around. In either case, the path cost is of no interest because only the final
state counts. The first incremental formulation one might try is the following:
• States: Any arrangement of 0 to 8 queens on the board is a state.
• Initial state: No queens on board.
• Actions: Add a queen to any empty square.
• Transition model: Returns the board with a queen added to the specified
square.
• Goal test: 8 queens are on the board, none attacked.
In this formulation, we have 64 · 63 ··· 57 ≈ 1.8 × 10^14 possible sequences to
investigate.
A better formulation would prohibit placing a queen in any square that is
already attacked:
• States: All possible arrangements of n queens (0 ≤ n ≤ 8), one per column in
the leftmost n columns, with no queen attacking another.
• Actions: Add a queen to any square in the leftmost empty column such that it
is not attacked by any other queen. This formulation reduces the 8-queens state
space from 1.8 × 10^14 to just 2,057, and solutions are easy to find. On the other
hand, for 100 queens the reduction is from roughly 10^400 states to about 10^52
states.

Types of search algorithms

Based on the search problems we can classify the search algorithms into uninformed
(Blind search) search and informed search (Heuristic search) algorithms.

Uninformed/Blind Search:

Uninformed search does not use any domain knowledge, such as closeness to the goal or
the location of the goal. It operates in a brute-force way, as it only includes information
about how to traverse the tree and how to identify leaf and goal nodes. Uninformed
search explores the search tree without any information about the search space beyond
the problem definition (initial state, operators, and goal test), so it is also called blind
search. It examines the nodes of the tree until it reaches the goal node.
It can be divided into six main types:

o Breadth-first search
o Depth-first search
o Depth-limited search
o Uniform cost search
o Iterative deepening depth-first search
o Bidirectional search

Breadth-first Search:
o Breadth-first search is the most common search strategy for traversing a tree or
graph. This algorithm searches breadthwise in a tree or graph, so it is called
breadth-first search.
o The BFS algorithm starts searching from the root node of the tree and expands all
successor nodes at the current level before moving to the nodes of the next level.
o The breadth-first search algorithm is an example of a general-graph search
algorithm.
o Breadth-first search is implemented using a FIFO queue data structure.

Advantages:

o BFS will provide a solution if any solution exists.


o If there is more than one solution for a given problem, then BFS will provide the
minimal solution which requires the least number of steps.

Disadvantages:

o It requires lots of memory since each level of the tree must be saved into
memory to expand to the next level.
o BFS needs lots of time if the solution is far away from the root node.

Example:

In the tree structure below, we show the traversal of the tree using the BFS
algorithm from the root node S to the goal node K. The BFS algorithm traverses level by
level, so it follows the path shown by the dotted arrow, and the traversal order will be:

S---> A--->B---->C--->D---->G--->H--->E---->F---->I---->K
Time Complexity: The time complexity of BFS is given by the number of nodes generated
down to the shallowest goal node, where d is the depth of the shallowest solution and b
is the branching factor (the maximum number of successors of any node):

T(b) = 1 + b + b^2 + b^3 + ... + b^d = O(b^d)

Space Complexity: The space complexity of BFS is given by the memory size of the
frontier, which is O(b^d).

Completeness: BFS is complete, which means if the shallowest goal node is at some
finite depth, then BFS will find a solution.

Optimality: BFS is optimal if path cost is a non-decreasing function of the depth of


the node.
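
A minimal sketch of breadth-first graph search using a FIFO queue (the adjacency-list
graph below is an illustrative stand-in for the example tree, not taken from it):

# Hypothetical sketch: breadth-first search on an explicit graph (Python).
from collections import deque

def breadth_first_search(graph, start, goal):
    """Return a shallowest path from start to goal, or None."""
    frontier = deque([[start]])            # FIFO queue of paths
    explored = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for successor in graph.get(node, []):
            if successor not in explored:
                explored.add(successor)
                frontier.append(path + [successor])
    return None

# Illustrative graph in the spirit of the S ... K example above.
graph = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E", "F"],
         "E": ["H"], "F": ["I", "K"]}
print(breadth_first_search(graph, "S", "K"))   # -> ['S', 'B', 'F', 'K']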

2. Depth-first Search
o Depth-first search is a recursive algorithm for traversing a tree or graph data
structure.
o It is called the depth-first search because it starts from the root node and follows
each path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is like the BFS algorithm.
Advantage:

o DFS requires much less memory, as it only needs to store a stack of the nodes on
the path from the root node to the current node.
o It can take less time to reach the goal node than the BFS algorithm (if it happens to
traverse the right path).

Disadvantage:

o There is a possibility that many states keep re-occurring, and there is no
guarantee of finding the solution.
o The DFS algorithm keeps going deeper and deeper and may descend into an
infinite path.

Example:

In the below search tree, we have shown the flow of depth-first search, and it will
follow the order as:

Root node--->Left node ----> right node.

It will start searching from the root node S and traverse A, then B, then D and E. After
traversing E, it will backtrack because E has no other successors and the goal node has
not yet been found. After backtracking it will traverse node C and then G, where it will
terminate because it has found the goal node.
Completeness: DFS search algorithm is complete within finite state space as it will
expand every node within a limited search tree.

Time Complexity: The time complexity of DFS is proportional to the number of nodes
traversed by the algorithm. It is given by:

T = 1 + b + b^2 + ... + b^m = O(b^m)

where m is the maximum depth of any node, and this can be much larger than d
(the depth of the shallowest solution).

Space Complexity: The DFS algorithm needs to store only a single path from the root
node, so the space complexity of DFS is equivalent to the size of the fringe, which
is O(b × m).

Optimal: DFS search algorithm is non-optimal, as it may generate many steps or high
cost to reach the goal node.
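
A minimal sketch of depth-first search using an explicit LIFO stack (the graph below is an
illustrative stand-in that roughly mirrors the S, A, B, D, E, C, G example above):

# Hypothetical sketch: depth-first graph search using an explicit stack (Python).

def depth_first_search(graph, start, goal):
    """Return some path from start to goal found depth-first, or None."""
    frontier = [[start]]                   # LIFO stack of paths
    explored = set()
    while frontier:
        path = frontier.pop()              # take the most recently added path
        node = path[-1]
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for successor in reversed(graph.get(node, [])):
            if successor not in path:      # avoid cycles along the current path
                frontier.append(path + [successor])
    return None

graph = {"S": ["A", "C"], "A": ["B", "D"], "D": ["E"], "C": ["G"]}
print(depth_first_search(graph, "S", "G"))   # -> ['S', 'C', 'G']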

3. Depth-Limited Search Algorithm:

A depth-limited search algorithm is like depth-first search with a predetermined limit.


Depth-limited search can solve the drawback of infinite paths in depth-first
search. In this algorithm, the node at the depth limit is treated as if it has no further
successor nodes.

Depth-limited search can be terminated with two Conditions of failure:

o Standard failure value: It indicates that the problem does not have any solution.
o Cutoff failure value: It defines no solution for the problem within a given depth
limit.

Advantages:

Depth-limited search is Memory efficient.

Disadvantages:

o Depth-limited search also has a disadvantage of incompleteness.


o It may not be optimal if the problem has more than one solution.

Example:
Completeness: The DLS algorithm is complete if the shallowest solution lies within the
depth limit.

Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).

Space Complexity: The space complexity of the DLS algorithm is O(b × ℓ).

Optimal: Depth-limited search can be viewed as a special case of DFS, and it is also
not optimal even if ℓ>d.

4. Uniform-cost Search Algorithm:

Uniform-cost search is a searching algorithm used for traversing a weighted tree or


graph. This algorithm comes into play when a different cost is available for each edge.
The primary goal of uniform-cost search is to find a path to the goal node which
has the lowest cumulative cost. Uniform-cost search expands nodes according to their
path costs from the root node. It can be used to solve any graph/tree where an optimal
cost is in demand. The uniform-cost search algorithm is implemented using a priority
queue, which gives maximum priority to the lowest cumulative cost. Uniform-cost search
is equivalent to the BFS algorithm if the path cost of all edges is the same.

Advantages:

o Uniform cost search is optimal because at every state the path with the least cost
is chosen.
Disadvantages:

o It does not care about the number of steps involved in the search and is
concerned only with path cost. As a result, this algorithm may get stuck in an
infinite loop (for example, along a path of zero-cost actions).

Example:

Completeness:

Uniform-cost search is complete: if there is a solution, UCS will find it.

Time Complexity:

Let C* be the cost of the optimal solution, and let ε be the minimum cost of a single step
(action). Then the number of steps along the optimal path is at most C*/ε, and we add 1
because we start in the initial state before taking any step.

Hence, the worst-case time complexity of uniform-cost search is O(b^(1 + ⌊C*/ε⌋)).

Space Complexity:

By the same logic, the worst-case space complexity of
uniform-cost search is O(b^(1 + ⌊C*/ε⌋)).
Optimal:

Uniform-cost search is always optimal as it only selects a path with the lowest path
cost.
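
A minimal sketch of uniform-cost search using a priority queue keyed on cumulative path
cost (the weighted graph below is an illustrative assumption):

# Hypothetical sketch: uniform-cost search with a priority queue (Python).
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None."""
    frontier = [(0, [start])]              # priority queue ordered by path cost
    best_cost = {start: 0}
    while frontier:
        cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:                   # goal test on expansion => optimal
            return cost, path
        for successor, step_cost in graph.get(node, {}).items():
            new_cost = cost + step_cost
            if new_cost < best_cost.get(successor, float("inf")):
                best_cost[successor] = new_cost
                heapq.heappush(frontier, (new_cost, path + [successor]))
    return None

graph = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "G": 6}, "B": {"G": 1}}
print(uniform_cost_search(graph, "S", "G"))   # -> (4, ['S', 'A', 'B', 'G'])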

5. Iterative Deepening Depth-first Search:

The iterative deepening algorithm is a combination of DFS and BFS algorithms. This
search algorithm finds out the best depth limit and does it by gradually increasing the
limit until a goal is found.

This algorithm performs depth-first search up to a certain "depth limit", and it keeps
increasing the depth limit after each iteration until the goal node is found.

This Search algorithm combines the benefits of Breadth-first search's fast search and
depth-first search's memory efficiency.

The iterative search algorithm is useful for uninformed search when search space is
large, and depth of goal node is unknown.

Advantages:

o It combines the benefits of BFS and DFS search algorithm in terms of fast
search and memory efficiency.

Disadvantages:

o The main drawback of IDDFS is that it repeats all the work of the previous
phase.

Example:

The following tree structure shows iterative deepening depth-first search. The IDDFS
algorithm performs several iterations until it finds the goal node. The iterations
performed by the algorithm are as follows:
1'st Iteration-----> A
2'nd Iteration----> A, B, C
3'rd Iteration------>A, B, D, E, C, F, G
4'th Iteration------>A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the goal node.

Completeness:

This algorithm is complete if the branching factor is finite.

Time Complexity:

Suppose b is the branching factor and d is the depth of the shallowest goal; then the
worst-case time complexity is O(b^d).

Space Complexity:

The space complexity of IDDFS is O(b × d).

Optimal:

The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the depth
of the node.
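
A minimal sketch of iterative deepening depth-first search, built from a recursive
depth-limited search (the graph below mirrors the A ... K iteration example above; it is
otherwise an illustrative assumption):

# Hypothetical sketch: iterative deepening depth-first search (Python).

def depth_limited_search(graph, node, goal, limit, path=None):
    """Recursive DLS: return a path to goal within 'limit' steps, else None."""
    path = (path or []) + [node]
    if node == goal:
        return path
    if limit == 0:
        return None                        # cutoff: treat node as having no successors
    for successor in graph.get(node, []):
        if successor not in path:          # avoid cycles along the current path
            result = depth_limited_search(graph, successor, goal, limit - 1, path)
            if result is not None:
                return result
    return None

def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Run DLS with increasing depth limits 0, 1, 2, ... until the goal is found."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(graph, start, goal, limit)
        if result is not None:
            return result
    return None

graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
         "D": ["H", "I"], "F": ["K"]}
print(iterative_deepening_search(graph, "A", "K"))   # -> ['A', 'C', 'F', 'K']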

6. Bidirectional Search Algorithm:

The bidirectional search algorithm runs two simultaneous searches, one from the initial
state, called forward search, and the other from the goal node, called backward search,
to find the goal node. Bidirectional search replaces one single search graph with two
small subgraphs, in which one search starts from the initial vertex and the other starts
from the goal vertex. The search stops when the two graphs intersect each other.

Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.
Advantages:

o Bidirectional search is fast.


o Bidirectional search requires less memory.

Disadvantages:

o Implementation of the bidirectional search tree is difficult.


o In bidirectional search, one should know the goal state in advance.

Example:

In the below search tree, bidirectional search algorithm is applied. This algorithm
divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the
forward direction and starts from goal node 16 in the backward direction.

The algorithm terminates at node 9 where two searches meet.

Completeness: Bidirectional Search is complete if we use BFS in both searches.

Time Complexity: The time complexity of bidirectional search using BFS is O(b^(d/2)).

Space Complexity: The space complexity of bidirectional search is O(b^(d/2)).

Optimal: Bidirectional search is Optimal.


Informed Search

Informed search algorithms use domain knowledge. In an informed search, problem


information is available which can guide the search. Informed search strategies can
find a solution more efficiently than an uninformed search strategy. Informed search is
also called a Heuristic search.

A heuristic is a technique which is not guaranteed to find the best solution but is
designed to find a good solution in a reasonable time.

Informed search can solve many complex problems which could not be solved in
another way.

An example of a problem tackled with informed search algorithms is the traveling
salesman problem.

1. Greedy Search
2. A* Search

The informed search algorithm is more useful for large search space. Informed
search algorithm uses the idea of heuristic, so it is also called Heuristic search.

Heuristics function: Heuristic is a function which is used in Informed Search, and it


finds the most promising path. It takes the current state of the agent as its input and
produces an estimate of how close the agent is to the goal. The heuristic method
might not always give the best solution, but it is designed to find a good
solution in a reasonable time. The heuristic function estimates how close a state is to the
goal. It is represented by h(n), and it estimates the cost of an optimal path between a
pair of states. The value of the heuristic function is always non-negative.

Admissibility of the heuristic function is given as:

h(n) <= h*(n)

Here h(n) is the heuristic estimate, and h*(n) is the true (optimal) cost to reach the goal
from n. Hence an admissible heuristic never overestimates the true cost.

1.) Best-first Search Algorithm (Greedy Search):

The Greedy best-first search algorithm always selects the path which appears best at
that moment. It is the combination of depth-first search and breadth-first search
algorithms. It uses a heuristic function to guide the search. Best-first search allows us to take
advantage of both algorithms. With the help of best-first search, at each step, we can
choose the most promising node. In the best first search algorithm, we expand the node
which is closest to the goal node and the closest cost is estimated by heuristic function,
i.e.
f(n) = h(n)

where h(n) = estimated cost from node n to the goal.

The greedy best first algorithm is implemented by the priority queue.

Best first search algorithm:


o Step 1: Place the starting node into the OPEN list.
o Step 2: If the OPEN list is empty, Stop and return failure.
o Step 3: Remove the node n, from the OPEN list which has the lowest value of
h(n) and place it in the CLOSED list.
o Step 4: Expand the node n, and generate the successors of node n.
o Step 5: Check each successor of node n and find whether any node is a goal
node or not. If any successor node is goal node, then return success and
terminate the search, else proceed to Step 6.
o Step 6: For each successor node, the algorithm computes the evaluation function
f(n) and then checks whether the node is already in the OPEN or CLOSED list. If the
node is in neither list, add it to the OPEN list.
o Step 7: Return to Step 2.
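
A minimal sketch of greedy best-first search ordered purely by h(n), using a priority
queue as the OPEN list and a set as the CLOSED list (the graph and heuristic values below
are assumed for illustration and loosely follow the S -> B -> F -> G example given later in
this section):

# Hypothetical sketch: greedy best-first search ordered by h(n) alone (Python).
import heapq

def greedy_best_first_search(graph, h, start, goal):
    """Expand the node with the smallest heuristic value h(n) first."""
    open_list = [(h[start], [start])]      # priority queue keyed by h(n)
    closed = set()
    while open_list:
        _, path = heapq.heappop(open_list)
        node = path[-1]
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for successor in graph.get(node, []):
            if successor not in closed:
                heapq.heappush(open_list, (h[successor], path + [successor]))
    return None

# Illustrative graph and heuristic values (assumptions, not from the notes).
graph = {"S": ["A", "B"], "A": [], "B": ["E", "F"], "E": ["H"], "F": ["I", "G"]}
h = {"S": 13, "A": 12, "B": 4, "E": 8, "F": 2, "H": 4, "I": 9, "G": 0}
print(greedy_best_first_search(graph, h, "S", "G"))   # -> ['S', 'B', 'F', 'G']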

Advantages:
o Best first search can switch between BFS and DFS by gaining the advantages of
both the algorithms.
o This algorithm is more efficient than BFS and DFS algorithms.

Disadvantages:
o It can behave as an unguided depth-first search in the worst-case scenario.
o It can get stuck in a loop like DFS.
o This algorithm is not optimal.

Example:

Consider the below search problem, and we will traverse it using greedy best-first
search. At each iteration, each node is expanded using evaluation function f(n)=h(n) ,
which is given in the below table.
In this search example, we are using two lists which are OPEN and CLOSED Lists.
Following are the iteration for traversing the above example.

Expand the nodes of S and put in the CLOSED list.

Initialization: Open [A, B], Closed [S]

Iteration 1: Open [A], Closed [S, B]


Iteration 2: Open [E, F, A], Closed [S, B]
: Open [E, A], Closed [S, B, F]

Iteration 3: Open [I, G, E, A], Closed [S, B, F]


: Open [I, E, A], Closed [S, B, F, G]

Hence the final solution path will be S----> B----->F----> G

Time Complexity: The worst-case time complexity of greedy best-first search is
O(b^m).

Space Complexity: The worst-case space complexity of greedy best-first search is
O(b^m), where m is the maximum depth of the search space.

Complete: Greedy best-first search is also incomplete, even if the given state space is
finite.

Optimal: Greedy best first search algorithm is not optimal.

2.) A* Search Algorithm:

A* search is the best-known form of best-first search. It uses the heuristic function h(n)
and the cost to reach the node n from the start state, g(n). It combines features of UCS
and greedy best-first search, by which it solves problems efficiently. The A* search
algorithm finds the shortest path through the search space using the heuristic function.
This search algorithm expands a smaller search tree and provides an optimal result faster.
The A* algorithm is like UCS except that it uses g(n) + h(n) instead of g(n).

In the A* search algorithm, we use the search heuristic as well as the cost to reach the
node. Hence, we can combine both costs as f(n) = g(n) + h(n); this sum is called the
fitness number.

Algorithm of A* search:

Step1: Place the starting node in the OPEN list.

Step 2: Check if the OPEN list is empty or not, if the list is empty then return failure
and stops.
Step 3: Select the node from the OPEN list which has the smallest value of the evaluation
function (g + h); if node n is the goal node, then return success and stop, otherwise go to
Step 4.

Step 4: Expand node n, generate all of its successors, and put n into the CLOSED list. For
each successor n', check whether n' is already in the OPEN or CLOSED list; if not, then
compute the evaluation function for n' and place it into the OPEN list.

Step 5: If node n' is already in OPEN or CLOSED, then attach it to the back pointer
which reflects the lowest g(n') value.

Step 6: Return to Step 2.
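
A minimal sketch of A* search ordered by f(n) = g(n) + h(n) (the weighted graph and
heuristic values below are assumptions chosen to reproduce the worked example later in
this section, where the optimal path is S -> A -> C -> G with cost 6):

# Hypothetical sketch: A* search with f(n) = g(n) + h(n) (Python).
import heapq

def a_star_search(graph, h, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None."""
    open_list = [(h[start], 0, [start])]   # (f, g, path), ordered by f = g + h
    best_g = {start: 0}
    while open_list:
        f, g, path = heapq.heappop(open_list)
        node = path[-1]
        if node == goal:
            return g, path
        for successor, step_cost in graph.get(node, {}).items():
            new_g = g + step_cost
            if new_g < best_g.get(successor, float("inf")):
                best_g[successor] = new_g
                new_f = new_g + h[successor]
                heapq.heappush(open_list, (new_f, new_g, path + [successor]))
    return None

# Illustrative weighted graph and (assumed admissible) heuristic values.
graph = {"S": {"A": 1, "G": 10}, "A": {"B": 2, "C": 1},
         "B": {"D": 5}, "C": {"D": 3, "G": 4}}
h = {"S": 5, "A": 3, "B": 4, "C": 2, "D": 6, "G": 0}
print(a_star_search(graph, h, "S", "G"))   # -> (6, ['S', 'A', 'C', 'G'])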

Advantages:
o The A* search algorithm performs better than other search algorithms.
o A* search algorithm is optimal and complete.
o This algorithm can solve very complex problems.

Disadvantages:
o It does not always produce the shortest path if the heuristic is poor, as it relies on
heuristics and approximation.
o A* search algorithm has some complex issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes
in the memory, so it is not practical for various large-scale problems.

Example:
In this example, we will traverse the given graph using the A* algorithm. The heuristic
value of all states is given in the below table so we will calculate the f(n) of each state
using the formula f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start
state.
Here we will use OPEN and CLOSED list.
Solution:

Initialization: {(S, 5)}

Iteration1: {(S--> A, 4), (S-->G, 10)}

Iteration2: {(S--> A-->C, 4), (S--> A-->B, 7), (S-->G, 10)}

Iteration3: {(S--> A-->C--->G, 6), (S--> A-->C--->D, 11), (S--> A-->B, 7), (S-->G,
10)}
Iteration 4 will give the result, as S--->A--->C--->G it provides the optimal path with
cost 6.

Points to remember:

o A* algorithm returns the path which occurred first, and it does not search for all
remaining paths.
o The efficiency of A* algorithm depends on the quality of heuristic.
o A* algorithm expands all nodes which satisfy the condition f(n)<="" li="">

Complete: A* algorithm is complete as long as:

o Branching factor is finite.


o Every action has a fixed cost that is strictly positive.

Optimal: A* search algorithm is optimal if it follows below two conditions:

o Admissible: the first condition required for optimality is that h(n) should be an
admissible heuristic for A* tree search. An admissible heuristic is optimistic in
nature.
o Consistency: Second required condition is consistency for only A* graph-
search.

If the heuristic function is admissible, then A* tree search will always find the least
cost path.

Time Complexity: The time complexity of A* search algorithm depends on heuristic


function, and the number of nodes expanded is exponential to the depth of solution d.
So the time complexity is O(b^d), where b is the branching factor.

Space Complexity: The space complexity of A* search algorithm is O(b^d)

Hill Climbing Algorithm


o Hill climbing algorithm is a local search algorithm which continuously moves in
the direction of increasing elevation/value to find the peak of the mountain or
best solution to the problem. It terminates when it reaches a peak value where no
neighbor has a higher value.
o Hill climbing algorithm is a technique which is used for optimizing the
mathematical problems. One of the widely discussed examples of Hill climbing
algorithm is Traveling-salesman Problem in which we need to minimize the
distance traveled by the salesman.
o It is also called greedy local search as it only looks to its good immediate
neighbor state and not beyond that.
o A node of hill climbing algorithm has two components which are state and
value.
o Hill Climbing is mostly used when a good heuristic is available.
o In this algorithm, we don't need to maintain and handle the search tree or graph
as it only keeps a single current state.

Features of Hill Climbing:

Following are some main features of Hill Climbing Algorithm:

o Generate and Test variant: Hill Climbing is the variant of Generate and Test
method. The Generate and Test method produces feedback which helps to
decide which direction to move in the search space.
o Greedy approach: Hill-climbing algorithm search moves in the direction which
optimizes the cost.
o No backtracking: It does not backtrack the search space, as it does not
remember the previous states.

State-space Diagram for Hill Climbing:

The state-space landscape is a graphical representation of the hill-climbing algorithm


which shows a graph between various states of algorithm and Objective function/Cost.

The Y-axis shows the function value, which can be an objective function or a cost
function, and the X-axis shows the state space. If the function on the Y-axis is a cost, the
goal of the search is to find the global minimum; if the function on the Y-axis is an
objective function, the goal of the search is to find the global maximum.
Different regions in the state space landscape:

Local Maximum: Local maximum is a state which is better than its neighbor
states, but there is also another state which is higher than it.

Global Maximum: Global maximum is the best possible state of state space
landscape. It has the highest value of objective function.

Current state: It is a state in a landscape diagram where an agent is currently present.

Flat local maximum: It is a flat space in the landscape where all the neighboring
states of current states have the same value.

Shoulder: It is a plateau region which has an uphill edge.

Types of Hills Climbing Algorithm:


o Simple hill Climbing:
o Steepest-Ascent hill-climbing:
o Stochastic hill Climbing:

1. Simple Hill Climbing:

Simple hill climbing is the simplest way to implement a hill climbing algorithm. It
evaluates one neighbor node state at a time and moves to the first one that improves on
the current state, making it the new current state. It checks only one successor at a time;
if that successor is better than the current state it moves there, otherwise it stays in the
same state. This algorithm has the following features:

o Less time consuming


o Less optimal solution and the solution is not guaranteed.
Algorithm for Simple Hill Climbing:
o Step 1: Evaluate the initial state, if it is goal state then return success and stop.
o Step 2: Loop Until a solution is found or there is no new operator left to apply.
o Step 3: Select and apply an operator to the current state.
o Step 4: Check new state:
a. If it is goal state, then return success and quit.
b. Else if it is better than the current state then assigns new state as a current
state.
c. Else if not better than the current state, then return to step 2.
o Step 5: Exit.
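
A minimal sketch of simple hill climbing over a numeric objective function (the neighbors
and value functions below are illustrative assumptions):

# Hypothetical sketch: simple hill climbing over a numeric objective (Python).
# 'neighbors' and 'value' are illustrative problem-specific functions.

def simple_hill_climbing(initial_state, neighbors, value):
    """Move to the first neighbor that improves the objective; stop at a peak."""
    current = initial_state
    while True:
        improved = False
        for candidate in neighbors(current):
            if value(candidate) > value(current):   # first improving neighbor
                current = candidate
                improved = True
                break
        if not improved:                            # no better neighbor: local maximum
            return current

# Example: maximize f(x) = -(x - 3)^2 over integers, stepping by +/- 1.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(simple_hill_climbing(0, neighbors, value))    # -> 3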

2. Steepest-Ascent hill climbing:

The steepest-Ascent algorithm is a variation of simple hill climbing algorithm. This


algorithm examines all the neighboring nodes of the current state and selects one
neighbor node which is closest to the goal state. This algorithm consumes more time as
it searches for multiple neighbors.

Algorithm for Steepest-Ascent hill climbing:


o Step 1: Evaluate the initial state, if it is goal state then return success and stop,
else make current state as initial state.
o Step 2: Loop until a solution is found or the current state does not change.
a. Let SUCC be a state such that any possible successor of the current state will
be better than it (i.e., initialize SUCC to a worst-case value).
b. For each operator that applies to the current state:
a. Apply the new operator and generate a new state.
b. Evaluate the new state.
c. If it is goal state, then return it and quit, else compare it to the
SUCC.
d. If it is better than SUCC, then set the new state as SUCC.
e. If the SUCC is better than the current state, then set current state to
SUCC.
o Step 3: Exit.
3. Stochastic hill climbing:

Stochastic hill climbing does not examine all its neighbor before moving. Rather, this
search algorithm selects one neighbor node at random and decides whether to choose it
as a current state or examine another state.

Problems in Hill Climbing Algorithm (Limitations):

1. Local Maximum: A local maximum is a peak state in the landscape, which is better
than each of its neighboring states, but there is another state also present which is
higher than the local maximum.

Solution: Backtracking technique can be a solution of the local maximum in state


space landscape. Create a list of the promising paths so that the algorithm can
backtrack the search space and explore other paths as well.

2. Plateau: A plateau is a flat area of the search space in which all the neighboring
states of the current state have the same value; because of this, the algorithm cannot
find the best direction in which to move. A hill-climbing search may get lost in the
plateau area.

Solution: The solution for a plateau is to take bigger or smaller steps while searching.
For example, randomly select a state far away from the current state, so that the
algorithm may land in a non-plateau region.
3. Ridges: A ridge is a special form of the local maximum. It has an area which is
higher than its surrounding areas, but itself has a slope, and cannot be reached in a
single move.

Solution: With the use of bidirectional search, or by moving in different directions, we


can improve this problem.

Simulated Annealing:

A hill-climbing algorithm which never makes a move towards a lower value is
guaranteed to be incomplete because it can get stuck on a local maximum. If an
algorithm instead applies a pure random walk, moving to a randomly chosen successor,
it may be complete but it is not efficient. Simulated annealing is an algorithm which
yields both efficiency and completeness.

In metallurgy, annealing is the process of heating a metal or glass to a high temperature
and then cooling it gradually, which allows the material to reach a low-energy crystalline
state. The same idea is used in simulated annealing, in which the algorithm picks a
random move instead of the best move. If the random move improves the state, it is
accepted; otherwise, the algorithm accepts the move only with a probability less than 1,
so it sometimes moves downhill and can escape a local maximum.
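
A minimal sketch of simulated annealing maximizing a numeric objective (the cooling
schedule, temperature values, and acceptance rule exp(delta/T) below are standard
choices but are my illustrative assumptions, not taken from the notes):

# Hypothetical sketch: simulated annealing maximizing a numeric objective (Python).
import math
import random

def simulated_annealing(initial_state, neighbors, value,
                        initial_temp=10.0, cooling=0.95, steps=1000):
    """Accept downhill moves with probability exp(delta / T); T cools gradually."""
    current = initial_state
    temperature = initial_temp
    for _ in range(steps):
        if temperature < 1e-6:
            break
        candidate = random.choice(neighbors(current))   # pick a random move
        delta = value(candidate) - value(current)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current = candidate                          # accept (possibly downhill)
        temperature *= cooling                           # cooling schedule
    return current

# Example: maximize f(x) = -(x - 3)^2 over integers, stepping by +/- 1.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(simulated_annealing(0, neighbors, value))          # typically returns 3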
UNIT-2

Knowledge Representation, Reasoning, and Machine Learning


Knowledge Representation, different forms, Reasoning, Planning, Uncertainty in
Knowledge, Fuzzy logic, & fuzzification.
Machine Learning: Forms of Learning, Parametric & Non-Parametric
Models, Classification, Regression, Regularization, Decision Trees, SVM,
Artificial Neural Networks, Ensemble Learning, Boosting, K-NN, Gradient
Descent

Knowledge Representation

○ Knowledge representation and reasoning (KR, KRR) is the part of Artificial

intelligence which is concerned with AI agents thinking and how thinking


contributes to intelligent behavior of agents.

○ It is responsible for representing information about the real world so that a

computer can understand and can utilize this knowledge to solve complex real-
world problems such as diagnosing a medical condition or communicating with
humans in natural language.

○ It is also a way which describes how we can represent knowledge in artificial

intelligence. Knowledge representation is not just storing data into some


database, but it also enables an intelligent machine to learn from that knowledge
and experiences so that it can behave intelligently like a human.

Following are the kind of knowledge which needs to be represented in AI systems:

○ Object: All the facts about objects in our world domain. E.g., Guitars contain

strings, trumpets are brass instruments.

○ Events: Events are the actions which occur in our world.

○ Performance: It describes behavior which involves knowledge about how to do

things.

○ Meta-knowledge: It is knowledge about what we know.


○ Facts: Facts are the truths about the real world and what we represent.

○ Knowledgebase: The central component of knowledge-based agents is the knowledge base, represented as KB. The knowledge base is a group of sentences (here, "sentence" is used as a technical term; it is not identical to a sentence in the English language).

Knowledge: Knowledge is awareness or familiarity gained by experiences of facts,


data, and situations. Following are the types of knowledge in artificial intelligence:

Types of knowledge (Forms of Knowledge)

1. Declarative Knowledge:

○ Declarative knowledge is to know about something.

○ It includes concepts, facts, and objects.

○ It is also called descriptive knowledge and expressed in declarative sentences.

○ It is simpler than procedural knowledge.

2. Procedural Knowledge

○ It is also known as imperative knowledge.

○ Procedural knowledge is a type of knowledge which is responsible for knowing

how to do something.

○ It can be directly applied to any task.

○ It includes rules, strategies, procedures, agendas, etc.

○ Procedural knowledge depends on the task on which it can be applied.

3. Meta-knowledge:

○ Knowledge about the other types of knowledge is called Meta-knowledge.

4. Heuristic knowledge:

○ Heuristic knowledge represents knowledge of some experts in a field or subject.


○ Heuristic knowledge is rules of thumb based on previous experiences, awareness

of approaches, and which are good to work but not guaranteed.

5. Structural knowledge:

○ Structural knowledge is basic knowledge of problem-solving.

○ It describes relationships between various concepts such as kind of, part of, and

grouping of something.

○ It describes the relationship that exists between concepts or objects.

An Artificial intelligence system has the following components for displaying


intelligent behavior:

○ Perception

○ Learning

○ Knowledge Representation and Reasoning

○ Planning

○ Execution

These components work together to let an AI system interact with the real world and show intelligence. An AI system has a perception component by which it retrieves information from its environment; the input can be visual, audio, or another form of sensory input. The learning component is responsible for learning from the data captured by the perception component. The main components in the complete cycle are knowledge representation and reasoning; these two components are what make the machine behave intelligently, like a human. They are independent of each other but also coupled together. Planning and execution depend on the analysis of knowledge representation and reasoning.

There are mainly four ways of knowledge representation which are given as follows:

1. Logical Representation

2. Semantic Network Representation

3. Frame Representation

4. Production Rules

1. Logical Representation

Logical representation is a language with some concrete rules which deals with
propositions and has no ambiguity in representation. Logical representation means
drawing a conclusion based on various conditions. This representation lays down some
important communication rules. It consists of precisely defined syntax and semantics
which support the sound inference. Each sentence can be translated into logic using
syntax and semantics.

Syntax:

○ Syntaxes are the rules which decide how we can construct legal sentences in

logic.

○ It determines which symbols we can use in knowledge representation and how to write those symbols.


Semantics:

○ Semantics are the rules by which we can interpret the sentence in the logic.

○ Semantics also involves assigning a meaning to each sentence.

Logical representation can be categorized into mainly two logics:

a. Propositional Logics

b. Predicate logics

Propositional logic in Artificial intelligence

Propositional logic (PL) is the simplest form of logic where all the statements are made
by propositions. A proposition is a declarative statement which is either true or false. It
is a technique of knowledge representation in logical and mathematical form.

Example:

1. It is Sunday.

2. The Sun rises from West (False proposition)

3. 3+3= 7(False proposition)

4. 5 is a prime number.

Following are some basic facts about propositional logic:

○ Propositional logic is also called Boolean logic as it works on 0 and 1.

○ In propositional logic, we use symbolic variables to represent the logic, and we

can use any symbol to represent a proposition, such as A, B, C, P, Q, R, etc.

○ Propositions can be either true or false, but it cannot be both.

○ Propositional logic is built from propositions and logical connectives (objects, relations, and functions belong to predicate logic, not to propositional logic).
○ These connectives are also called logical operators.

○ The propositions and connectives are the basic elements of the propositional

logic.

○ Connectives can be said to be a logical operator which connects two sentences.

○ A proposition formula which is always true is called tautology, and it is also

called a valid sentence.

○ A proposition formula which is always false is called Contradiction.

○ Statements which are questions, commands, or opinions are not propositions

such as "Where is Rohini", "How are you", "What is your name", are not
propositions.

Syntax of propositional logic:

The syntax of propositional logic defines the allowable sentences for the knowledge
representation. There are two types of Propositions:

a. Atomic Propositions

b. Compound propositions

○ Atomic Proposition: Atomic propositions are simple propositions. It consists of

a single proposition symbol. These are the sentences which must be either true
or false.

a) 2+2 is 4, it is an atomic proposition as it is a true fact.

b) "The Sun is cold" is also a proposition as it is a false fact.

○ Compound proposition: Compound propositions are constructed by combining

simpler or atomic propositions, using parentheses and logical connectives.


Example:

1. "It is raining today, and the street is wet."

2. "Ankit is a doctor, and his clinic is in Mumbai."

Logical Connectives:

Logical connectives are used to connect two simpler propositions or represent a


sentence logically. We can create compound propositions with the help of logical
connectives. There are mainly five connectives, which are given as follows:

1. Negation: A sentence such as ¬ P is called negation of P. A literal can be either

Positive literal or negative literal.

2. Conjunction: A sentence which has ∧ connective such as, P ∧ Q is called a

conjunction.
Example: Rohan is intelligent and hardworking. It can be written as,
P= Rohan is intelligent,
Q= Rohan is hardworking. → P∧ Q.

3. Disjunction: A sentence which has ∨ connective, such as P ∨ Q. is called

disjunction, where P and Q are the propositions.


Example: "Ritika is a doctor or Engineer",
Here P= Ritika is Doctor. Q= Ritika is Doctor, so we can write it as P ∨ Q.

4. Implication: A sentence such as P → Q, is called an implication. Implications

are also known as if-then rules. It can be represented as


If it is raining, then the street is wet.
Let P= It is raining, and Q= Street is wet, so it is represented as P → Q

5. Biconditional: A sentence such as P⇔ Q is a Biconditional sentence, example If

I am breathing, then I am alive


P= I am breathing, Q= I am alive, it can be represented as P ⇔ Q.
The five connectives can be summarized in a truth table over P and Q, sketched below.
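
A small Python sketch that enumerates the five connectives over all truth assignments of P and Q:

from itertools import product

# The five propositional connectives as Boolean functions of P and Q.
connectives = {
    "¬P": lambda p, q: not p,
    "P ∧ Q": lambda p, q: p and q,
    "P ∨ Q": lambda p, q: p or q,
    "P → Q": lambda p, q: (not p) or q,   # false only when P is true and Q is false
    "P ⇔ Q": lambda p, q: p == q,
}

header = ["P", "Q"] + list(connectives)
print("  ".join(name.ljust(6) for name in header))
for p, q in product([True, False], repeat=2):
    row = [str(p), str(q)] + [str(f(p, q)) for f in connectives.values()]
    print("  ".join(cell.ljust(6) for cell in row))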

2. Semantic Network Representation

Semantic networks are alternatives to predicate logic for knowledge representation. In


Semantic networks, we can represent our knowledge in the form of graphical networks.
This network consists of nodes representing objects and arcs which describe the
relationship between those objects. Semantic networks can categorize the object in
different forms and can also link those objects. Semantic networks are easy to
understand and can be easily extended.

This representation consists of mainly two types of relations:

a. IS-A relation (Inheritance)

b. Kind-of-relation

Example: Following are some statements which we need to represent in the form of
nodes and arcs.

Statements:

a. Jerry is a cat.

b. Jerry is a mammal.

c. Jerry is owned by Priya.

d. Jerry is brown.

e. All Mammals are animals.


3. Frame Representation

A frame is a record-like structure which consists of a collection of attributes and its


values to describe an entity in the world. Frames are the AI data structure which
divides knowledge into substructures by representing stereotypical situations. It
consists of a collection of slots and slot values. These slots may be of any type and
size. Slots have names and values which are called facets.

Facets: The various aspects of a slot are known as facets. Facets are features of frames which enable us to put constraints on them. Example: an IF-NEEDED facet is invoked when the data of a slot is needed. A frame may consist of any number of slots, a slot may include any number of facets, and a facet may have any number of values. A frame is also known as slot-filler knowledge representation in artificial intelligence.

Frames are derived from semantic networks and later evolved into our modern-day
classes and objects. A single frame is not very useful. Frames systems consist of a
collection of frames which are connected. In the frame, knowledge about an object or
event can be stored together in the knowledge base. Frame technology is widely used in various applications, including natural language processing and machine vision.

Let's take an example of a frame for a book.

Slot       Filler
Title      Artificial Intelligence
Genre      Computer Science
Author     Peter Norvig
Edition    Third Edition
Year       1996
Page       1152
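
As an illustration (not a standard library API), the book frame above can be sketched in Python as a dictionary, with a second instance frame linked to it through a hypothetical "is_a" slot:

# The book frame written as a Python dictionary; slot names and fillers follow the example.
book_frame = {
    "Title": "Artificial Intelligence",
    "Genre": "Computer Science",
    "Author": "Peter Norvig",
    "Edition": "Third Edition",
    "Year": 1996,
    "Page": 1152,
}

# A frame system links frames together; here an instance frame points to the class frame above.
my_copy = {"is_a": book_frame, "Owner": "College Library"}
print(my_copy["is_a"]["Title"])   # slots of the linked frame remain accessible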
4. Production Rules

The production rules system consists of (condition, action) pairs which mean, "If
condition then action". It has mainly three parts:

○ The set of production rules

○ Working Memory

○ The recognize-act-cycle.

In production rules agent checks for the condition and if the condition exists then
production rule fires and corresponding action is carried out. The condition part of the
rule determines which rule may be applied to a problem. And the action part carries out
the associated problem-solving steps. This complete process is called a recognize-act
cycle.

The working memory contains the description of the current state of problem-solving
and rules can write knowledge to the working memory. This knowledge matches and
may fire other rules.

If a new situation (state) is generated, multiple production rules may be triggered together; this set of triggered rules is called the conflict set. In this situation, the agent needs to select one rule from the set, and doing so is called conflict resolution.

Example:

○ IF (at bus stop AND bus arrives) THEN action (get into the bus)

○ IF (on the bus AND paid AND empty seat) THEN action (sit down).

○ IF (on bus AND unpaid) THEN action (pay charges).

○ IF (bus arrives at destination) THEN action (get down from the bus)
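
A minimal Python sketch of the recognize-act cycle over these rules; the facts placed in working memory are illustrative:

# Working memory is a set of facts; each production rule is a (conditions, action) pair.
rules = [
    ({"at bus stop", "bus arrives"}, "get into the bus"),
    ({"on the bus", "paid", "empty seat"}, "sit down"),
    ({"on the bus", "unpaid"}, "pay charges"),
    ({"bus arrives at destination"}, "get down from the bus"),
]

working_memory = {"on the bus", "unpaid"}

# Recognize: collect every rule whose conditions all hold in working memory (the conflict set).
conflict_set = [action for conditions, action in rules if conditions <= working_memory]

# Resolve and act: a simple conflict-resolution strategy is to fire the first matching rule.
if conflict_set:
    print("Action:", conflict_set[0])   # -> Action: pay charges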
Reasoning in Artificial intelligence

Reasoning is the mental process of deriving logical conclusions and making


predictions from available knowledge, facts, and beliefs. Or we can say, "Reasoning is
a way to infer facts from existing data." It is a general process of thinking rationally, to
find valid conclusions.

In artificial intelligence, reasoning is essential so that the machine can also think
rationally as a human brain and can perform like a human.

Types of Reasoning

In artificial intelligence, reasoning can be divided into the following categories:

○ Deductive reasoning

○ Inductive reasoning

○ Abductive reasoning

○ Common Sense Reasoning

○ Monotonic Reasoning

○ Non-monotonic Reasoning

1. Deductive reasoning:

Deductive reasoning is deducing new information from logically related known


information. It is the form of valid reasoning, which means the argument's conclusion
must be true when the premises are true.

Deductive reasoning is a type of propositional logic in AI, and it requires various rules and facts. It is sometimes referred to as top-down reasoning, in contrast to inductive reasoning.

In deductive reasoning, the truth of the premises guarantees the truth of the conclusion.
Deductive reasoning mostly starts from the general premises to the specific conclusion,
which can be explained as below example.

Example:

Premise-1: All humans eat veggies.

Premise-2: Suresh is human.

Conclusion: Suresh eats veggies.

2. Inductive Reasoning:

Inductive reasoning is a form of reasoning to arrive at a conclusion using limited sets


of facts by the process of generalization. It starts with a series of specific facts or data
and reaches a general statement or conclusion.

Inductive reasoning is also known as cause-effect reasoning or bottom-up reasoning.

In inductive reasoning, we use historical data or various premises to generate a generic


rule, for which premises support the conclusion.

In inductive reasoning, premises provide probable support to the conclusion, so the


truth of premises does not guarantee the truth of the conclusion.

Example:

Premise: All the pigeons we have seen in the zoo are white.

Conclusion: Therefore, we can expect all the pigeons to be white.

3. Abductive reasoning:

Abductive reasoning is a form of logical reasoning which starts with single or multiple
observations then seeks to find the most likely explanation or conclusion for the
observation.

Abductive reasoning is an extension of deductive reasoning, but in abductive


reasoning, the premises do not guarantee the conclusion.

Example:

Implication: Cricket ground is wet if it is raining

Axiom: Cricket ground is wet.

Conclusion: It is raining.

4. Common Sense Reasoning

Common sense reasoning is an informal form of reasoning, which can be gained


through experiences.

Common Sense reasoning simulates the human ability to make presumptions about
events which occur every day.

It relies on good judgment rather than exact logic and operates on heuristic knowledge
and heuristic rules.

Example:

1. One person can be at one place at a time.

2. If I put my hand in a fire, then it will burn.

The above two statements are examples of common-sense reasoning which a human
mind can easily understand and assume.
5. Monotonic Reasoning:

In monotonic reasoning, once the conclusion is taken, then it will remain the same
even if we add some other information to existing information in our knowledge base.
In monotonic reasoning, adding knowledge does not decrease the set of propositions that can be derived.

To solve monotonic problems, we can derive the valid conclusion from the available
facts only, and it will not be affected by new facts.

Monotonic reasoning is not useful for real-time systems, as in real time, facts get
changed, so we cannot use monotonic reasoning.

Monotonic reasoning is used in conventional reasoning systems, and a logic-based


system is monotonic.

Any theorem proving is an example of monotonic reasoning.

○ The Earth revolves around the Sun.

It is a fact, and it cannot be changed even if we add another sentence in the knowledge
base like, "The moon revolves around the earth" Or "Earth is not round," etc.

6. Non-monotonic Reasoning

In Non-monotonic reasoning, some conclusions may be invalidated if we add some


more information to our knowledge base.

Logic will be said as non-monotonic if some conclusions can be invalidated by adding


more knowledge into our knowledge base.

Non-monotonic reasoning deals with incomplete and uncertain models.

"Human perceptions for various things in daily life, "is a general example of non-
monotonic reasoning.

Example: Let suppose the knowledge base contains the following knowledge:

○ Birds can fly.

○ Penguins cannot fly.

○ Pitty is a bird.

So, from the above sentences, we can conclude that Pitty can fly. However, if we add another sentence to the knowledge base, "Pitty is a penguin," then we must conclude that Pitty cannot fly, which invalidates the earlier conclusion.

Probabilistic reasoning in Artificial intelligence


Uncertainty: we have learned knowledge representation using first-order logic and
propositional logic with certainty, which means we were sure about the predicates.
With this knowledge representation, we might write A→B, which means if A is true
then B is true but consider a situation where we are not sure about whether A is true
or not then we cannot express this statement, this situation is called uncertainty.

So, to represent uncertain knowledge, where we are not sure about the predicates, we
need uncertain reasoning or probabilistic reasoning.

Causes of uncertainty:

The following are some leading causes of uncertainty to occur in the real world.

1. Information occurred from unreliable sources.

2. Experimental Errors

3. Equipment fault

4. Temperature variation

5. Climate change.

Probabilistic reasoning:

Probabilistic reasoning is a way of knowledge representation where we apply the


concept of probability to indicate the uncertainty in knowledge. In probabilistic
reasoning, we combine probability theory with logic to handle the uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the


uncertainty that is the result of someone's laziness and ignorance.

In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations,"
"A match between two teams or two players." These are probable sentences for
which we can assume that it will happen but are not sure about it, so here we use
probabilistic reasoning.

Need of probabilistic reasoning in AI:

○ When there are unpredictable outcomes.

○ When specifications or possibilities of predicates become too large to handle.

○ When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:

○ Bayes' rule

○ Bayesian Statistics

As probabilistic reasoning uses probability and related terms, so before understanding


probabilistic reasoning, let's understand some common terms:

Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1, where 0 and 1 represent the ideal certainties (impossible and certain, respectively).

We can find the probability of an uncertain event by using the formula:

P(A) = Number of favourable outcomes / Total number of outcomes

○ P(¬A) = probability of event A not happening.

○ P(¬A) + P(A) = 1.
Event: Each possible outcome of a variable is called an event.

Sample space: The collection of all possible events is called sample space.

Random variables: Random variables are used to represent the events and objects in
the real world.

Prior probability: The prior probability of an event is probability computed before


observing new information.

Posterior Probability: The probability that is calculated after all evidence or


information has been considered. It is a combination of prior probability and new
information.

Conditional probability:

Conditional probability is a probability of occurring an event when another event has


already happened.

Let's suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the condition B." It can be written as:

P(A|B) = P(A⋀B) / P(B)

where P(A⋀B) = joint probability of A and B, and P(B) = marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given as:

P(B|A) = P(A⋀B) / P(A)

This can be explained with a Venn diagram: once event B has occurred, the sample space is reduced to the set B, so we can calculate event A given B by dividing the probability P(A⋀B) by P(B).

Bayes' theorem:

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of


two random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The
Bayesian inference is an application of Bayes' theorem, which is fundamental to
Bayesian statistics.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

Bayes' theorem allows updating the probability prediction of an event by observing


new information of the real world.

Example: If cancer corresponds to one's age, then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.

Bayes' theorem can be derived using product rule and conditional probability of event
A with known event B:

As from the product rule we can write:

1. P(A ⋀ B) = P(A|B) P(B)

Similarly, the probability of event B with known event A:

2. P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)      ...(a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as posterior, which we need to calculate, and it will be read as
Probability of hypothesis A when we have occurred evidence B.

P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we
calculate the probability of evidence.

P(A) is called the prior probability, probability of hypothesis before considering the
evidence.

P(B) is called marginal probability, pure probability of an evidence.

In equation (a), in general, we can write P(B) = ∑i P(Ai) P(B|Ai), hence Bayes' rule can be written as:

P(Ai|B) = P(B|Ai) P(Ai) / ∑k P(B|Ak) P(Ak)

where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
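
A small Python sketch of Bayes' rule with hypothetical numbers (a disease with prior 0.01, a test with P(Positive|Disease) = 0.9, and an overall positive rate of 0.05):

def bayes(prior_a, likelihood_b_given_a, prob_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return likelihood_b_given_a * prior_a / prob_b

# P(Disease) = 0.01, P(Positive | Disease) = 0.9, P(Positive) = 0.05 (all made-up values).
posterior = bayes(prior_a=0.01, likelihood_b_given_a=0.9, prob_b=0.05)
print(round(posterior, 2))   # 0.18 -> the evidence raises the probability from 1% to 18%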

Fuzzy logic, & fuzzification.

The 'Fuzzy' word means the things that are not clear or are vague. Sometimes, we
cannot decide in real life whether the given problem or statement is either true or false.
At that time, this concept provides many values between the true and false and gives
the flexibility to find the best solution to that problem.

Fuzzy logic contains multiple logical values, and these values are the truth values of a
variable or problem between 0 and 1. This concept was introduced by Lotfi Zadeh in
1965 based on the Fuzzy Set Theory. This concept provides the possibilities which are
not given by computers, but like the range of possibilities generated by humans.

In the Boolean system, only two possibilities (0 and 1) exist, where 1 denotes the
absolute truth value and 0 denotes the absolute false value. But in the fuzzy system,
there are multiple possibilities present between the 0 and 1, which are partially false
and partially true.

The Fuzzy logic can be implemented in systems such as micro-controllers,


workstation-based or large network-based systems for achieving the definite output.
It can also be implemented in both hardware and software.

Architecture of a Fuzzy Logic System

In the architecture of the Fuzzy Logic system, each component plays an important role.
The architecture consists of the different four components which are given below.

1. Rule Base

2. Fuzzification

3. Inference Engine

4. Defuzzification

The process of a Fuzzy Logic system flows from a crisp input through Fuzzification, then the Inference Engine (which consults the Rule Base), and finally Defuzzification to produce a crisp output.

1. Rule Base

The Rule Base is a component used for storing the set of rules and the If-Then conditions given by experts, which are used for controlling the decision-making system. Recent developments in fuzzy theory offer effective methods for designing and tuning fuzzy controllers, and these developments reduce the number of fuzzy rules required.

2. Fuzzification

Fuzzification is a module or component for transforming the system inputs, i.e., it converts crisp numbers into fuzzy sets. The crisp numbers are the inputs measured by sensors, which fuzzification passes into the control system for further processing. This component divides the input signal into the following five states in a typical Fuzzy Logic system:

○ Large Positive (LP)

○ Medium Positive (MP)

○ Small (S)

○ Medium Negative (MN)

○ Large negative (LN)

3. Inference Engine

This is the main component of any Fuzzy Logic System (FLS), because all the information is processed in the Inference Engine. It finds the matching degree between the current fuzzy input and the rules; based on this matching degree, the system determines which rules are to be fired for the given input. When all fired rules have been evaluated, they are combined to develop the control actions.

4. Defuzzification

Defuzzification is a module or component, which takes the fuzzy set inputs generated
by the Inference Engine, and then transforms them into a crisp value. It is the last step
in the process of a fuzzy logic system. The crisp value is a type of value which is
acceptable by the user. Various techniques are present to do this, but the user has to
select the best one to reduce the number of errors.

Membership Function

The membership function is a function which represents the graph of fuzzy sets and
allows users to quantify the linguistic term. It is a graph which is used for mapping
each element of x to the value between 0 and 1.

This function is also known as an indicator or characteristic function.

For a fuzzy set B, the membership function on X is defined as μB: X → [0,1]. Each element of X is mapped to a value between 0 and 1, called its degree of membership or membership value.
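
A minimal Python sketch of fuzzification using a triangular membership function; the temperature ranges chosen for "warm" and "hot" are illustrative assumptions:

def triangular_membership(x, a, b, c):
    # Degree of membership of x in a triangular fuzzy set that rises from a to b
    # and falls from b to c; the result always lies in [0, 1].
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Fuzzification: map a crisp sensor reading to degrees of membership in fuzzy sets.
temperature = 27
print("warm:", triangular_membership(temperature, 15, 25, 35))   # 0.8
print("hot :", triangular_membership(temperature, 25, 35, 45))   # 0.2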

Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Why would we want an agent to learn?
There are three main reasons. First, the designers cannot anticipate all possible
situations that the agent might find itself in. For example, a robot designed to
navigate mazes must learn the layout of each new maze it encounters. Second,
the designers cannot anticipate all changes over time; a program designed to
predict tomorrow’s stock market prices must learn to adapt when conditions
change from boom to bust. Third, sometimes human programmers have no idea
how to program a solution themselves. For example, most people are good at
recognizing the faces of family members, but even the best programmers are
unable to program a computer to accomplish that task, except by using learning
algorithms.
Forms of Learning
Any component of an agent can be improved by learning from data. The
improvements, and the techniques used to make them, depend on four major
factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
There are three types of feedback that determine the three main types of
learning:
In unsupervised learning the agent learns patterns in the input even though no
explicit feedback is supplied. The most common unsupervised learning task is
clustering: detecting potentially useful clusters of input examples. For example,
a taxi agent might gradually develop a concept of “good traffic days” and “bad
traffic days” without ever being given labeled examples of each by a teacher.
In reinforcement learning the agent learns from a series of reinforcements—
rewards or punishments. For example, the lack of a tip at the end of the journey
gives the taxi agent an indication that it did something wrong. The two points
for a win at the end of a chess game tells the agent it did something right. It is
up to the agent to decide which of the actions prior to the reinforcement were
most responsible for it.
In supervised learning the agent observes some example input–output pairs and
learns a function that maps from input to output. Consider, for example, an agent
training to become a taxi driver. Every time the instructor shouts "Brake!" the
agent might learn a condition–action rule for when to brake; here the inputs are
percepts and the outputs are provided by a teacher who says "Brake!" or "Turn
left."
SEMI-SUPERVISED LEARNING: In practice, these distinctions are not always
so crisp. In semi-supervised learning we are given a few labelled examples and
must make what we can of a large collection of unlabeled examples.

Supervised Learning
In supervised learning, models are trained using a labelled dataset, where the
model learns about each type of data. Once the training process is completed,
the model is tested on test data (a set of examples kept separate from the
training data), and then it predicts the output.
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1),(x2, y2),...(xN , yN ) ,
where each yj was generated by an unknown function y = f(x), discover a
function h that approximates the true function f. Here x and y can be any value;
they need not be numbers.
Function h is a hypothesis.
Learning is a search through the space of possible hypotheses for one that will
perform well, even on new examples beyond the training set. To measure the
accuracy of a hypothesis we give it a test set of examples that are distinct from
the training set. We say a hypothesis generalizes well if it correctly predicts the
value of y for novel examples.
Supervised learning can be further divided into two types of problems:
1. Classification
2. Regression
Classification

The Classification algorithm is a Supervised Learning technique that is used


to identify the category of new observations based on training data. In
Classification, a program learns from the given dataset or observations and
then classifies new observation into a few classes or groups. Such as, Yes
or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called
targets/labels or categories.

The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the categorical
data.

Classification can be pictured with two classes, Class A and Class B: the examples within a class have features similar to each other and dissimilar to the examples of the other class.

The algorithm which implements the classification on a dataset is known as a


classifier. There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes,
then it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
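
A minimal classification sketch using scikit-learn (assuming it is installed); the toy spam features and labels are made up for illustration:

from sklearn.tree import DecisionTreeClassifier

# Toy email data: features = [number of links, number of suspicious words],
# label = 1 (spam) or 0 (not spam). The numbers are invented for illustration.
X_train = [[9, 6], [8, 5], [1, 0], [0, 1], [7, 4], [2, 0]]
y_train = [1, 1, 0, 0, 1, 0]

classifier = DecisionTreeClassifier().fit(X_train, y_train)
print(classifier.predict([[10, 5], [1, 1]]))   # likely [1 0]: spam, not spam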
Regression
Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor) variables.
More specifically, regression analysis helps us understand how the value of the
dependent variable changes with respect to one independent variable when the
other independent variables are held fixed. It predicts continuous/real values
such as temperature, age, salary, price, etc.

Regression is a supervised learning technique which helps in finding the correlation


between variables and enables us to predict the continuous output variable based on the
one or more predictor variables. It is mainly used for prediction, forecasting, time
series modeling, and determining the causal-effect relationship between variables.

In Regression, we plot a graph between the variables which best fits the given
datapoints. Using this plot, the machine learning model can make predictions about the
data. In simple words, "Regression shows a line or curve that passes through all the
datapoints on target-predictor graph in such a way that the vertical distance between
the datapoints and the regression line is minimum." The distance between datapoints
and line tells whether a model has captured a strong relationship or not.

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the very simple and easy algorithms which works on regression and
shows the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input variable,
then such linear regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be
illustrated by predicting the salary of an employee based on years of experience.
The mathematical equation for linear regression is:

Y = aX + b

where Y is the dependent variable (target), X is the independent variable (predictor), and a and b are the linear coefficients.
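
A small NumPy sketch that fits Y = aX + b by least squares; the experience/salary numbers are hypothetical:

import numpy as np

# Hypothetical data: years of experience vs. salary (in thousands).
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

# Fit Y = aX + b by least squares (a degree-1 polynomial fit).
a, b = np.polyfit(x, y, 1)
print(f"Y = {a:.2f}X + {b:.2f}")
print("Predicted salary for 6 years of experience:", round(a * 6 + b, 1))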

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to
solve classification problems. In classification problems, we have dependent
variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with categorical variables such as 0 or 1,
Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear
regression algorithm in the term how they are used.
o Logistic regression uses the sigmoid (logistic) function as its hypothesis. The sigmoid function is used to model the data in logistic regression and can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between 0 and 1.
o x = input to the function.
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it produces an S-shaped curve.

o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
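
A small Python sketch of the sigmoid function and the threshold rule described above:

import math

def sigmoid(x):
    # Logistic function: maps any real input to a value between 0 and 1.
    return 1 / (1 + math.exp(-x))

def classify(x, threshold=0.5):
    # Values above the threshold are labelled 1, values below are labelled 0.
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0), sigmoid(4), sigmoid(-4))   # 0.5, ~0.982, ~0.018
print(classify(2.3), classify(-1.7))         # 1, 0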

There are three types of logistic regression:

o Binary (0/1, pass/fail)


o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is like multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear
model. Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is derived from the linear regression
equation: the linear equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
o The model is still considered linear because the coefficients b0, ..., bn enter the
equation linearly, even though the features (x², x³, ...) are non-linear.

Difference between Classification and Regression

o Output type: In Regression, the output variable must be continuous (a real value); in Classification, the output variable must be a discrete value.

o Task: The regression algorithm maps the input value (x) to a continuous output variable (y); the classification algorithm maps the input value (x) to a discrete output variable (y).

o Data: Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.

o Goal: In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.

o Applications: Regression algorithms are used to solve regression problems such as weather prediction and house price prediction; Classification algorithms are used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.

o Types: Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.

Regularization

Regularization is one of the most important concepts of machine learning. It is a


technique to prevent the model from overfitting by adding extra information to it.

Sometimes the machine learning model performs well with the training data but does not perform well with the test data. The model has also learned the noise in the training data and is therefore unable to predict the output for unseen data; such a model is called overfitted. This problem can be dealt with using a regularization technique.

This technique can be used in such a way that it will allow us to maintain all variables
or features in the model by reducing the magnitude of the variables. Hence, it
maintains accuracy as well as a generalization of the model.

It mainly regularizes or reduces the coefficients of the features towards zero. In simple words, "in the regularization technique, we reduce the magnitude of the features by keeping the same number of features."

Regularization works by adding a penalty or complexity term to the complex model.


Let's consider the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted, and x1, x2, ..., xn are the features for y. β0, β1, ..., βn are the weights or magnitudes attached to the features; β0 represents the bias of the model, and b represents the intercept.

Linear regression models try to optimize the coefficients and b to minimize the cost function. The loss function for linear regression is called RSS, the Residual Sum of Squares:

RSS = ∑i ( yi − (β0 + β1xi1 + β2xi2 + ⋯ + βnxin + b) )²

Regularization adds a penalty (complexity) term to this loss so that the optimized coefficients stay small.

There are mainly two types of regularization techniques,

o Ridge Regression
o Lasso Regression

Ridge Regression: Ridge regression is one of the types of linear regression in which
a small amount of bias is introduced so that we can get better long-term predictions.

Ridge regression is a regularization technique, which is used to reduce the


complexity of the model. It is also called L2 regularization.

In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature.

The equation for the cost function in ridge regression is:

Cost = ∑i ( yi − ŷi )² + λ ∑j βj²
Lasso Regression: Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.

It is like Ridge Regression except that the penalty term contains the absolute values of the weights instead of their squares.

Since it takes absolute values, it can shrink a coefficient exactly to 0, whereas Ridge Regression can only shrink it close to 0.

It is also called L1 regularization. The equation for the cost function of Lasso regression is:

Cost = ∑i ( yi − ŷi )² + λ ∑j |βj|
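
A minimal scikit-learn sketch contrasting the two penalties (assuming scikit-learn is installed); the tiny dataset and alpha values are illustrative:

from sklearn.linear_model import Ridge, Lasso

# Tiny dataset in which the second feature is mostly noise.
X = [[1, 0.1], [2, -0.2], [3, 0.05], [4, -0.1], [5, 0.15]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks weights towards 0
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set a weight exactly to 0
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)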
Decision tree

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules, and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple
branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
o The decisions or the test are performed based on features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, like a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.

The general structure of a decision tree can be described using the following terms:

Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches the leaf node of the
tree. The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide the S into subsets that contain possible values for the best
attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in step -3. Continue this process until a stage is reached where you
cannot further classify the nodes and call the final node as a leaf node.

Decision tree Restaurant Waiting Problem

A decision tree represents a function that takes as input a vector of attribute


values and returns a “decision”—a single output value. The input and output
values can be discrete or continuous. A decision tree reaches its decision by
performing a sequence of tests.
As an example, we will build a decision tree to decide whether to wait for a
table at a restaurant. The aim here is to learn a definition for the goal predicate
WillWait.
First, we list the attributes that we will consider as part of the input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and
Full).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60,
or >60).
An example for a Boolean decision tree consists of an (x, y) pair, where x is a
vector of values for the input attributes, and y is a single Boolean output value.
Figure 1. A training set of 12 examples.

The positive examples are the ones in which the goal WillWait is true (x1,
x3,...); the negative examples are the ones in which it is false (x2, x5,...). The
decision-tree-learning algorithm adopts a greedy divide-and-conquer strategy:
always test the most important attribute first. This test divides the problem up
into smaller subproblems that can then be solved recursively. By “most
important attribute,” we mean the one that makes the most difference to the
classification of an example. That way, we hope to get to the correct
classification with a small number of tests, meaning that all paths in the tree will
be short and the tree will be shallow.
Figure shows that Type is a poor attribute, because it leaves us with four
possible outcomes, each of which has the same number of positive as negative
examples. On the other hand, we see that Patrons is an important attribute,
because if the value is None or Some, then we are left with example sets for
which we can answer definitively (No and Yes, respectively). If the value is
Full, we are left with a mixed set of examples. In general, after the first attribute
test splits up the examples, each outcome is a new decision tree learning
problem, with fewer examples and one less attribute. There are four cases to
consider for these recursive problems:

1. If the remaining examples are all positive (or all negative), then we are done
we can answer Yes or No. Figure (b) shows examples of this happening in the
None and Some branches.
2. If there are some positive and some negative examples, then choose the best
attribute to split them. Figure (b) shows Hungry being used to split the
remaining examples.
3. If there are no examples left, it means that no example has been observed for
this combination of attribute values, and we return a default value calculated
from the plurality classification of all the examples that were used in
constructing the node's parent. These are passed along in the variable
parent_examples.
4.If there are no attributes left, but both positive and negative examples, it
means that these examples have the same description, but different
classifications. This can happen because there is an error or noise in the data;
because the domain is nondeterministic; or because we can’t observe an
attribute that would distinguish the examples.
The output of the learning algorithm on our sample training set is shown in the
figure below. The learned tree is clearly different from the original tree. One
might conclude that the learning algorithm is not doing a very good job
of learning the correct function. This would be the wrong conclusion to draw,
however. The learning algorithm looks at the examples, not at the correct
function, and in fact, its hypothesis not only is consistent with all the examples
but is considerably simpler than the original tree! The learning algorithm has no
reason to include tests for Raining and Reservation, because it can classify all
the examples without them.

Choosing attribute tests


The greedy search used in decision tree learning is designed to approximately
minimize the depth of the final tree. The idea is to pick the attribute that goes as
far as possible toward providing an exact classification of the examples. A
perfect attribute divides the examples into sets, each of which are all positive or
all negative and thus will be leaves of the tree. The Patron’s attribute is not
perfect, but it is good. A useless attribute, such as Type, leaves the example sets
with roughly the same proportion of positive and negative examples as the
original set.
Entropy
Entropy is a measure of the uncertainty of a random variable; acquisition of
information corresponds to a reduction in entropy. A random variable with only
one value—a coin that always comes up heads—has no uncertainty and thus its
entropy is defined as zero; thus, we gain no information by observing its value.
In general, the entropy of a random variable V with values vk, each with probability P(vk), is defined as:

H(V) = ∑k P(vk) log2 (1 / P(vk)) = − ∑k P(vk) log2 P(vk)

We can check that the entropy of a fair coin flip is indeed 1 bit:

H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit

If the coin is loaded to give 99% heads, we get:

H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08 bits
It will help to define B(q) as the entropy of a Boolean random variable that is
true with probability q: B(q) = −(q log2 q + (1 − q) log2(1 − q)) .
Thus, H(Loaded ) = B(0.99) ≈ 0.08.
If a training set contains p positive examples and n negative examples, then the
entropy of the goal attribute on the whole set is H(Goal) = B(p / (p + n)).
The restaurant training set in Figure 1 has p = n = 6, so the corresponding
entropy is B(0.5), or exactly 1 bit.
An attribute A with d distinct values divides the training set E into subsets
E1, ..., Ed. Each subset Ek has pk positive examples and nk negative examples,
so if we go along that branch, we will need an additional B(pk / (pk + nk)) bits of
information to answer the question. A randomly chosen example from the
training set has the kth value for the attribute with probability (pk + nk) / (p + n),
so the expected entropy remaining after testing attribute A is:

Remainder(A) = ∑k=1..d (pk + nk) / (p + n) · B(pk / (pk + nk))

Information Gain:
The information gain from the attribute test on A is the expected reduction in
entropy:

Gain(A) = B(p / (p + n)) − Remainder(A)

Gain(Patrons) = 1 − [2/12·B(0/2) + 4/12·B(4/4) + 6/12·B(2/6)] ≈ 0.541 bits
Gain(Type) = 1 − [2/12·B(1/2) + 2/12·B(1/2) + 4/12·B(2/4) + 4/12·B(2/4)] = 0 bits,
confirming our intuition that Patrons is a better attribute to split on. In fact,
Patrons has the maximum gain of any of the attributes and would be chosen by
the decision-tree learning algorithm as the root.
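
The same calculation can be reproduced with a short Python sketch of B(q), Remainder(A), and Gain(A) on the restaurant counts:

import math

def B(q):
    # Entropy of a Boolean random variable that is true with probability q.
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def remainder(splits):
    # splits is a list of (pk, nk) pairs, one per value of the attribute being tested.
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * B(p / (p + n)) for p, n in splits)

goal_entropy = B(6 / 12)   # the restaurant set has 6 positive and 6 negative examples
gain_patrons = goal_entropy - remainder([(0, 2), (4, 0), (2, 4)])
gain_type = goal_entropy - remainder([(1, 1), (1, 1), (2, 2), (2, 2)])
print(round(gain_patrons, 3), round(gain_type, 3))   # ~0.541 and 0.0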
Generalization and overfitting

Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models. The
main goal of each machine learning model is to generalize well.

Generalization defines the ability of an ML model to provide a suitable output


by adapting the given set of unknown input. It means after providing training on
the dataset, it can produce reliable and accurate output. Hence, the underfitting
and overfitting are the two terms that need to be checked for the performance of
the model and whether the model is generalizing well or not.

Overfitting: Overfitting refers to the condition when the model completely fits
the training data but fails to generalize the testing unseen data. Overfit condition
arises when the model memorizes the noise of the training data and fails to
capture important patterns. The chance of overfitting increases the longer we
train our model on the same data: the more we train, the greater the risk of
producing an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from a linear regression
output in which the fitted curve tries to pass through every data point in the
scatter plot. It may look efficient, but it is not: the goal of the regression model
is to find the best-fit trend, and a curve that chases every single point will
generate prediction errors on new data.

There are some ways by which we can reduce the occurrence of overfitting in
our model:

o Decision tree pruning
o Cross-validation
o Training with more data
o Removing features
o Early stopping of training
o Regularization
o Ensembling

Decision Tree Pruning: Pruning refers to a technique to remove the parts of the
decision tree to prevent growing to its full depth. By tuning the hyperparameters
of the decision tree model one can prune the trees and prevent them from
overfitting.
For decision trees, a technique called decision tree pruning combats overfitting.
Pruning works by eliminating nodes that are not clearly relevant. We start with
a full tree, as generated by decision-tree-learning. We then look at a test node
that has only leaf nodes as descendants. If the test appears to be irrelevant—
detecting only noise in the data— then we eliminate the test, replacing it with a
leaf node. We repeat this process, considering each test with only leaf
descendants, until each one has either been pruned or accepted as is.

Suppose we are at a node consisting of p positive and n negative examples. If


the attribute is irrelevant, we would expect that it would split the examples into
subsets that each have roughly the same proportion of positive examples as the
whole set, p/(p + n), and so the information gain will be close to zero. Thus, the
information gain is a good clue to irrelevance.

There are two types of pruning: Pre-pruning and Post-pruning.


Pre-Pruning:

The pre-pruning technique refers to the early stopping of the growth of the
decision tree. The pre-pruning technique involves tuning the hyperparameters of
the decision tree model prior to the training pipeline. The hyperparameters of the
decision tree including max_depth, min_samples_leaf, min_samples_split can be
tuned to early stop the growth of the tree and prevent the model from overfitting.
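
A minimal scikit-learn sketch of pre-pruning through hyperparameters; the particular values are illustrative assumptions:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: restrict the growth of the tree through hyperparameters before training.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,            # never grow beyond depth 3
    min_samples_split=10,   # a node needs at least 10 examples before it can be split
    min_samples_leaf=5,     # every leaf must keep at least 5 examples
)
# pruned_tree.fit(X_train, y_train) would then learn a shallower, less overfit tree.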

Post-Pruning:

The Post-pruning technique allows the decision tree model to grow to its full
depth, then removes the tree branches to prevent the model from overfitting.

Cross Validation

Cross-validation is a technique for validating the model efficiency by training it


on the subset of input data and testing on previously unseen subset of the input
data. We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model; we
cannot judge the model using only the dataset it was trained on. For this purpose,
we reserve a particular sample of the dataset which is not part of the training
dataset. After that, we test our model on that sample before deployment, and
this complete process comes under cross-validation.

K-Fold Cross-Validation

K-fold cross-validation approach divides the input dataset into K groups of


samples of equal sizes. These samples are called folds. For each learning set,
the prediction function uses k-1 folds, and the rest of the folds are used for the
test set. This approach is a very popular CV approach because it is easy to
understand, and the output is less biased than other methods.

The steps for k-fold cross-validation are:


o Split the input dataset into K groups.
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset.
o Fit the model on the training set and evaluate the performance of
the model using the test set.

Let's take an example of 5-fold cross-validation: the dataset is grouped into 5
folds. In the 1st iteration, the first fold is reserved for testing the model and the
rest are used to train it. In the 2nd iteration, the second fold is used to test the
model and the rest are used to train it. This process continues until each fold
has been used once as the test fold.
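
A minimal scikit-learn sketch of k-fold cross-validation on a toy dataset (assuming scikit-learn is installed; the data and labels are made up):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.arange(20).reshape(10, 2)                 # 10 toy examples, 2 features each
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])     # made-up labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("Fold accuracies:", scores)
print("Mean accuracy  :", sum(scores) / len(scores))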

Early stopping the training: The decision tree algorithm stops generating nodes
when there is no good attribute to split on, rather than going to all the trouble of
generating nodes and then pruning them away.

Ensemble — Random Forest:

Random Forest is an ensemble technique for classification and regression by


bootstrapping multiple decision trees. Random Forest follows bootstrap
sampling and aggregation techniques to prevent overfitting.
Underfitting

Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the
training data, and hence it reduces the accuracy and produces unreliable
predictions.

Example: We can understand underfitting from the output of a linear regression model fitted to clearly non-linear data: the fitted straight line is unable to capture the data points present in the plot.

We can avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Evaluating and choosing the best hypothesis

Stationarity assumption: We want to learn a hypothesis that fits the future data
best. To make that precise we need to define “future data” and “best.”
We make the stationarity assumption: that there is a probability distribution
over examples that remains stationary over time.

Each example data point (before we see it) is a random variable Ej whose observed value ej = (xj, yj) is sampled from that distribution, independently of the previous examples:

P(Ej | Ej−1, Ej−2, ...) = P(Ej),

and each example has an identical prior probability distribution:

P(Ej ) = P(Ej−1) = P(Ej−2) = ··· . Examples that satisfy these assumptions


are called independent and identically distributed or i.i.d. An i.i.d.
assumption connects the past to the future; without some such connection, all
bets are off—the future could be anything.

ERROR RATE The next step is to define “best fit.” We define the error rate
of a hypothesis as the proportion of mistakes it makes—the proportion of
times that h(x) ≠y for an (x, y) example.

Now, just because a hypothesis h has a low error rate on the training set does
not mean that it will generalize well. A professor knows that an exam will
not accurately evaluate students if they have already seen the exam
questions. Similarly, to get an accurate evaluation of a hypothesis, we need
to test it on a set of examples it has not seen yet. The simplest approach is
the one we have seen already: randomly split the available data into a
training set from which the learning algorithm produces h and a test set on
which the accuracy of h is evaluated. This method is sometimes called holdout cross-validation.
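
A tiny holdout sketch, assuming scikit-learn; the dataset and the 70/30 split ratio are illustrative only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn h on the training set, then measure its error rate on unseen data.
h = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout error rate:", 1 - h.score(X_test, y_test))
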

K-FOLD CROSS-VALIDATION We can squeeze more out of the data and


still get an accurate estimate using a technique called k-fold cross-
validation. The idea is that each example serves double duty—as training
data and test data. First we split the data into k equal subsets. We then
perform k rounds of learning; on each round 1/k of the data is held out as a
test set and the remaining examples are used as training data. The average
test set score of the k rounds should then be a better estimate than a single
score. Popular values for k are 5 and 10—enough to give an estimate that is
statistically likely to be accurate, at a cost of 5 to 10 times longer
computation time.

The extreme is k = n, also known as leave-one-out cross-validation or


LOOCV.

PEEKING : Despite the best efforts of statistical methodologists, users


frequently invalidate their results by inadvertently peeking at the test data. If a hypothesis is selected because of its test-set error rate, then information about the test set has leaked into the learning algorithm. Peeking is a consequence of using test-set performance to both choose a hypothesis and evaluate it.

Support Vector Machine (SVM)

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a
hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. In the usual illustration, two different categories are separated by a decision boundary, or hyperplane. The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Hyperplane: There can be multiple lines/decision boundaries to segregate the


classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the hyperplane
of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support
the hyperplane, hence called a Support vector.
Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
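
A hedged usage sketch of the two SVM types, assuming scikit-learn and its iris data purely as stand-ins; kernel choices here only illustrate the linear vs. non-linear distinction:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)                  # Linear SVM
nonlinear_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)   # Non-linear SVM

print("linear SVM accuracy:    ", linear_svm.score(X_test, y_test))
print("non-linear SVM accuracy:", nonlinear_svm.score(X_test, y_test))
print("support vectors per class:", linear_svm.n_support_)
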

Artificial Neural Networks

The term "Artificial neural network" refers to a biologically inspired sub-field


of artificial intelligence modelled after the brain. An Artificial neural network is
usually a computational network based on biological neural networks that
construct the structure of the human brain. Like a human brain has neurons
interconnected to each other, artificial neural networks also have neurons that
are linked to each other in various layers of the networks. These neurons are
known as nodes.

Dendrites from Biological Neural Network represent inputs in Artificial Neural


Networks, cell nucleus represents Nodes, synapse represents Weights, and Axon
represents Output.

Typical diagrams of a Biological Neural Network and of an Artificial Neural Network illustrate this correspondence between dendrites/inputs, nucleus/nodes, synapses/weights, and axon/output.

An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up a human brain so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave like interconnected brain cells.

The human brain contains on the order of 100 billion neurons. Each neuron has association points numbering somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data, when necessary, from our memory in parallel. We can say that the human brain is made up of incredibly amazing parallel processors.

We can understand the artificial neural network with an example, consider an


example of a digital logic gate that takes an input and gives an output. "OR"
gate, which takes two inputs. If one or both the inputs are "On," then we get
"On" in output. If both the inputs are "Off," then we get "Off" in output. Here
the output depends upon input. Our brain does not perform the same task. The
outputs to inputs relationship keep changing because of the neurons in our
brain, which are "learning."

Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by


the programmer.

Hidden Layer:

The hidden layer is present in between input and output layers. It performs all
the calculations to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.

The artificial neural network takes the inputs, computes their weighted sum, and adds a bias. This computation is represented in the form of a transfer function.
The weighted total is then passed as input to an activation function, which produces the output. Activation functions decide whether a node should fire or not; only the nodes that fire pass their signal on towards the output layer. There are different activation functions available that can be chosen according to the kind of task we are performing.
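
A minimal sketch of a single artificial neuron, assuming NumPy; the input, weight, and bias values are made up purely for illustration:

import numpy as np

inputs = np.array([0.5, 0.3, 0.2])
weights = np.array([0.4, 0.7, -0.2])
bias = 0.1

# Transfer function: weighted sum of the inputs plus the bias.
weighted_sum = np.dot(inputs, weights) + bias

def sigmoid(z):
    # Activation function: squashes the weighted total into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

output = sigmoid(weighted_sum)
print("weighted sum:", weighted_sum, "-> neuron output:", output)
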

Types of Artificial Neural Network:

There are different types of Artificial Neural Networks (ANN). Depending upon the human brain's neurons and network functions, an artificial neural network performs tasks in a similar way. Most artificial neural networks have some similarities with their more complex biological counterpart and are very effective at their intended tasks, for example, segmentation or classification.

Feedback ANN:

In this type of ANN, the output is fed back into the network so that the best results can be evolved internally. Feedback networks feed information back into themselves and are well suited to solving optimization problems. Feedback ANNs are used for internal system error correction.

Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer, an output layer, and at least one hidden layer of neurons. Information flows in one direction only: by assessing the output produced for a given input, the strength of the network can be judged from the group behaviour of the associated neurons, and the output is decided. The primary advantage of this network is that it learns to evaluate and recognize input patterns.

Nonparametric model

Parametric model: A learning model that summarizes data with a set of


parameters of fixed size (independent of the number of training examples) is
called a parametric model.

Nonparametric model: A nonparametric model is one that cannot be


characterized by a bounded set of parameters. For example, suppose that each
hypothesis we generate simply retains within itself all the training examples and
uses all of them to predict the next example. Such a hypothesis family would be
nonparametric because the effective number of parameters is unbounded— it
grows with the number of examples. This approach is called instance-based
learning or memory-based learning.

Instance-based learning table lookup: The simplest instance-based learning


method is table lookup: take all the training examples, put them in a lookup
table, and then when asked for h(x), see if x is in the table; if it is, return
the corresponding y. The problem with this method is that it does not
generalize well: when x is not in the table all it can do is return some
default value.
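
A toy table-lookup sketch in plain Python; the training pairs and the default value are made up only to show how the method memorizes but does not generalize:

# Memorize every training example in a dictionary keyed by the input x.
training_examples = [((1, 2), "A"), ((3, 4), "B"), ((5, 6), "A")]
table = {x: y for x, y in training_examples}

def h(x, default="unknown"):
    # Return the memorized y if x was seen; otherwise fall back to a default.
    return table.get(x, default)

print(h((3, 4)))   # "B": this exact input was in the table
print(h((2, 2)))   # "unknown": never seen, so no generalization is possible
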

Ensemble Learning

Ensemble methods are techniques that create multiple models and then combine
them to produce improved results. Ensemble methods in machine learning usually
produce more accurate solutions than a single model would.
The models that contribute to the ensemble are called ensemble members. They
may be of the same type or of different types.
They may or may not be trained on the same training data.

Different types of ensemble classifiers are:


1. Bagging
2. Boosting
3. Random Forests

Bagging (Bootstrap Aggregating)

Bagging is used when our objective is to reduce the variance of a decision tree.
Here the concept is to create a few subsets of data from the training sample, which
is chosen randomly with replacement. Now each collection of subset data is used
to prepare their decision trees; thus, we end up with an ensemble of various models. The average of all the predictions from the numerous trees is used, which is more powerful than a single decision tree.

Bagging avoids overfitting of data and is used for both regression and
classification models, specifically for decision tree algorithms.
Steps to Perform Bagging

• Consider there are n observations and m features in the training set. You need to select random samples from the training dataset with replacement (bootstrap sampling).

• A subset of m features is chosen randomly to create a model using


sample observations.

• The feature offering the best split out of the lot is used to split the nodes.

• The tree is grown, so you have the best root nodes.

• The above steps are repeated n times. It aggregates the output of


individual decision trees to give the best prediction.
Advantages of Bagging in Machine Learning

• Bagging minimizes the overfitting of data.

• It improves the model’s accuracy.

• It deals with higher dimensional data efficiently.
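
A minimal bagging sketch, assuming scikit-learn; BaggingClassifier's default base learner is a decision tree, which matches the description above, and the parameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bagged_trees = BaggingClassifier(
    n_estimators=50,    # number of bootstrapped trees in the ensemble
    max_samples=0.8,    # each tree sees a random 80% bootstrap sample
    bootstrap=True,     # sample with replacement
    random_state=0,
)
print("bagging accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
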

Boosting

Boosting is an ensemble learning method that combines a set of weak learners


into strong learners to minimize training errors. In boosting, a random sample of
data is selected, fitted with a model, and then trained sequentially. That is, each
model tries to compensate for the weaknesses of its predecessor. Each classifier's
weak rules are combined at each iteration to form one strong prediction rule.

Boosting is an efficient algorithm that converts a weak learner into a strong learner. It uses the concept of weak-to-strong learner conversion through weighted average values and higher vote values for prediction. These algorithms use decision stumps and margin-maximizing classification for processing.

The basic principle behind the working of the boosting algorithm is to generate multiple weak learners and combine their predictions to form one strong rule. These weak rules are generated by applying base machine learning algorithms on different distributions of the data set. These algorithms generate weak rules at each iteration. After multiple iterations, the weak learners are combined to form a strong learner that will predict a more accurate outcome.
Here's how the algorithm works:

Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.

Step 2: False predictions made by the base learner are identified. In the next
iteration, these false predictions are assigned to the next base learner with a higher
weightage on these incorrect predictions.

Step 3: Repeat step 2 until the algorithm can correctly classify the output.

Three popular types of boosting methods include:

1. Adaptive boosting or AdaBoost: This method operates iteratively, identifying


misclassified data points and adjusting their weights to minimize the training
error. The model continues to optimize sequentially until it yields the strongest
predictor.

AdaBoost is implemented by combining several weak learners into a single strong


learner. The weak learners in AdaBoost consider a single input feature and draw
out a single split decision tree called the decision stump. Each observation is
weighted equally while drawing out the first decision stump.

The results from the first decision stump are analyzed, and if any observations are
wrongfully classified, they are assigned higher weights. A new decision stump is
drawn by considering the higher-weight observations as more significant. Again,
if any observations are misclassified, they're given a higher weight, and this
process continues until all the observations fall into the right class.
AdaBoost can be used for both classification and regression-based problems.
However, it is more commonly used for classification purposes.

2. Gradient Boosting: Gradient Boosting is also based on sequential ensemble


learning. Here the base learners are generated sequentially so that the present base
learner is always more effective than the previous one, i.e., and the overall model
improves sequentially with each iteration.

The difference in this boosting type is that the weights for misclassified outcomes
are not incremented. Instead, the Gradient Boosting method tries to optimize the
loss function of the previous learner by adding a new model that adds weak
learners to reduce the loss function.

The main idea here is to overcome the errors in the previous learner's predictions.
This boosting has three main components:

o Loss function: The use of the loss function depends on the type of
problem. The advantage of gradient boosting is that there is no need for a
new boosting algorithm for each loss function.
o Weak learner: In gradient boosting, decision trees are used as weak learners. A regression tree is used to output real values, which can be combined to create correct predictions. As in the AdaBoost algorithm, small trees with a single split (decision stumps) can be used; in practice, somewhat larger trees with around 4-8 levels are also common.
o Additive Model: Trees are added one at a time in this model. Existing trees
remain the same. During the addition of trees, gradient descent is used to
minimize the loss function.

Like AdaBoost, Gradient Boosting can also be used for classification and
regression problems.

3. Extreme gradient boosting or XGBoost: XGBoost is an advanced gradient boosting method. XGBoost, developed by Tianqi Chen, falls under the Distributed Machine Learning Community (DMLC) category.

The main aim of this algorithm is to increase the speed and efficiency of computation. The plain gradient boosting algorithm computes the output more slowly since it analyzes the data set sequentially; XGBoost is therefore used to boost, or "extremely boost", the model's performance.
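
A hedged boosting sketch with scikit-learn (XGBoost itself is a separate library, so the built-in AdaBoost and gradient boosting classifiers are used here instead); the data and parameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# AdaBoost's default base learner is a depth-1 tree, i.e. a decision stump.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
# Gradient boosting adds small trees sequentially to reduce the loss function.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

print("AdaBoost accuracy:         ", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient boosting accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
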

Random Forest

Random Forest is a popular machine learning algorithm that belongs to the


supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of


decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one
decision tree, the random forest takes the prediction from each tree and, based on the majority votes of those predictions, predicts the final output.

The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Random Forest works in two-phases: first is to create the random forest by


combining N decision tree, and second is to make predictions for each tree created
in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Choose the number N of decision trees that you want to build.

Step-2: Select K random data points from the training set.

Step-3: Build the decision tree associated with the selected data points (subset).

Step-4: Repeat Steps 2 & 3 until N trees have been built.

Step-5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority vote.
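
A minimal Random Forest sketch mirroring the steps above, assuming scikit-learn and its bundled iris data purely as stand-ins:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 bootstrapped trees; their majority vote decides each prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("majority-vote prediction for one new point:", forest.predict(X_test[:1]))
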

K-Nearest Neighbor (KNN)

o K-Nearest Neighbor is one of the simplest Machine Learning algorithms


based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most like the
available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much like the
new data.
o Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. To decide which of these categories the point belongs to, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
The K-NN working can be explained based on the below algorithm:

o Step-1: Select the number K of the neighbors.


o Step-2: Calculate the Euclidean distance of K number of neighbors.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of data points in each
category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.
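
A from-scratch K-NN sketch following the steps above, assuming NumPy; the training points and the query points are made-up 2-D values used only for illustration:

import numpy as np

X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)               # Step-2: Euclidean distances
    nearest = np.argsort(distances)[:k]                               # Step-3: K nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)  # Step-4: count per category
    return labels[np.argmax(counts)]                                  # Step-5: majority category

print(knn_predict(np.array([2, 1])))   # expected "A": close to the first cluster
print(knn_predict(np.array([7, 6])))   # expected "B": close to the second cluster
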

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between
the data points for all the training samples.
Gradient Descent

Gradient Descent is known as one of the most used optimization algorithms to


train machine learning models by means of minimizing errors between actual and
expected results. Further, gradient descent is also used to train Neural Networks.

Gradient Descent is defined as one of the most used iterative optimization


algorithms of machine learning to train the machine learning and deep
learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:

o If we move towards the negative gradient, i.e., away from the gradient of the function at the current point, we will reach the local minimum of that function.
o Whenever we move towards the positive gradient, i.e., towards the gradient of the function at the current point, we will reach the local maximum of that function.

Moving towards the positive gradient to find a maximum is known as gradient ascent, while moving towards the negative gradient to find a minimum is known as gradient descent (also called steepest descent). The main objective of using a gradient descent algorithm is to minimize the cost function by iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient, or slope, of that function.
o Move in the direction opposite to the gradient, stepping away from the current point by alpha times the gradient, where alpha is defined as the Learning Rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps. (A small worked sketch appears after this list.)
o The cost function is defined as the measurement of the difference, or error, between actual values and predicted values at the current position, expressed as a single real number. It helps to improve machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum. The algorithm continuously iterates along the direction of the negative gradient until the cost function stops decreasing; at that point, the model stops learning further.
o Although cost function and loss function are considered synonymous, there
is also a minor difference between them. The slight difference between the
loss function and the cost function is about the error within the training of
machine learning models, as loss function refers to the error of one training
example, while a cost function calculates the average error across an entire
training set.
o The cost function is calculated after making a hypothesis with initial
parameters and modifying these parameters using gradient descent
algorithms over known data to reduce the cost function.
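
A minimal gradient descent sketch in plain Python on a one-parameter cost function J(w) = (w - 3)^2, whose minimum is clearly at w = 3; the learning rate and the starting point are arbitrary illustrative choices:

def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)            # first-order derivative of the cost

w = 0.0                           # initial guess
alpha = 0.1                       # learning rate (step-size tuning parameter)

for step in range(50):
    w = w - alpha * gradient(w)   # move against the gradient

print("estimated minimum at w =", round(w, 4), "with cost", round(cost(w), 6))
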
UNIT-3

Probabilistic Models, Unsupervised Learning, Reinforcement


Learning
Probabilistic models: Statistical Learning, learning with complete
data, Naïve Bayes Classifier, Learning with hidden variables: The E-
M algorithm.
Unsupervised Learning: Concept of Unsupervised Learning,
Association Rule Mining.
Reinforcement learning: Concept of Reinforcement learning, Q-
Learning, Hidden Markov Model.

Statistical Learning

Statistical machine learning involves using statistical techniques to


develop models that can learn from data and make predictions or
decisions.

Statistics is the science that allows us to collect, analyze, interpret, present, and
organize data. It provides a robust set of tools for understanding patterns and
trends, and making inferences and predictions based on data. When we're dealing
with large datasets, statistics help us understand and summarize the data, allowing
us to make sense of complex phenomena.
Machine learning, on the other hand, is a powerful tool that allows computers to
learn from and make decisions or predictions based on data. The goal of machine
learning is to create models that can adapt and improve over time, as well as
generalize from specific examples to broader cases.
This is where the beauty of the fusion between statistics and machine learning
comes to light. The principles of statistics are the very pillars that uphold the
structure of machine learning.
• Constructing machine learning models. Statistics provides the
methodologies and principles for creating models in machine learning. For
instance, the linear regression model leverages the statistical method of
least squares to estimate the coefficients.
• Interpreting results. Statistical concepts allow us to interpret the results
generated by machine learning models. Measures such as p-value,
confidence intervals, R-squared, and others provide us with a statistical
perspective on the machine learning model’s performance.
• Validating models. Statistical techniques are essential for validating and
refining machine learning models. For instance, techniques like hypothesis
testing, cross-validation, and bootstrapping help us quantify the
performance of models and avoid problems like overfitting.

• Underpinning advanced techniques. Even some of the more complex


machine learning algorithms, such as Neural Networks, have statistical
principles at their core. The optimization techniques, like gradient descent,
used to train these models are based on statistical theory.

Naïve Bayes Classifiers

o Naïve Bayes algorithm is a supervised learning algorithm, which is based


on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
Classification algorithms which helps in building fast machine learning
models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts based on the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed


event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the


evidence.

P(B) is Marginal Probability: Probability of Evidence.
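
A tiny numeric check of Bayes' theorem in plain Python, with made-up spam-filter numbers where A = "message is spam" and B = "message contains the word 'offer'":

p_spam = 0.30             # P(A): prior probability of spam
p_word_given_spam = 0.60  # P(B|A): likelihood of the evidence given spam
p_word = 0.25             # P(B): marginal probability of the evidence

# Posterior P(A|B) via Bayes' theorem.
p_spam_given_word = (p_word_given_spam * p_spam) / p_word
print("P(spam | 'offer') =", p_spam_given_word)   # 0.72
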


Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.

Learning with hidden variables

The E-M Algorithm.

The Expectation-Maximization (EM) algorithm is defined as the combination of


various unsupervised machine learning algorithms, which is used to determine
the local maximum likelihood estimates (MLE) or maximum a posteriori
estimates (MAP) for unobservable variables in statistical models. Further, it is a
technique to find maximum likelihood estimation when the latent variables are
present. It is also referred to as the latent variable model.

A latent variable model consists of both observable and unobservable variables


where observable can be predicted while unobserved are inferred from the
observed variable. These unobservable variables are known as latent variables.

The EM algorithm is the combination of various unsupervised ML algorithms,


such as the k-means clustering algorithm. Being an iterative approach, it
consists of two modes. In the first mode, we estimate the missing or latent
variables. Hence it is referred to as the Expectation/estimation step (E-step).
Further, the other mode is used to optimize the parameters of the models so that it
can explain the data more clearly. The second mode is known as
the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all
missing values in the dataset so that after completing this step, there should
not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data
in the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.

The EM algorithm is completed mainly in 4 steps, which include Initialization


Step, Expectation Step, Maximization Step, and convergence Step. These steps
are explained as follows:
o 1st Step: The very first step is to initialize the parameter values. Further, the
system is provided with incomplete observed data with the assumption that
data is obtained from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to


estimate or guess the values of the missing or incomplete data using the
observed data. Further, E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use
complete data obtained from the 2nd step to update the parameter values.
Further, M-step primarily updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are
converging or not. If it gets "yes", then stop the process; else, repeat the
process from step 2 until the convergence occurs.
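
A hedged EM sketch, assuming NumPy, for a simple latent-variable model: a 1-D mixture of two Gaussians with synthetic data. The latent variable is which component generated each point; the E-step estimates it, the M-step updates the parameters:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1.0, 200), rng.normal(5, 1.5, 300)])

# Initialization step: rough guesses for the parameters.
mu = np.array([1.0, 4.0])
sigma = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for iteration in range(100):
    # E-step: responsibilities = posterior probability of each hidden component
    # for every data point, given the current parameters.
    dens = weight * gaussian_pdf(data[:, None], mu, sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters using the responsibilities.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    weight = nk / len(data)

print("means:", mu, "std devs:", sigma, "weights:", weight)
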

Unsupervised Learning

Unsupervised learning is a machine learning technique in which models are not


supervised using a training dataset. Instead, models themselves find the hidden
patterns and insights from the given data. It can be compared to learning which
takes place in the human brain while learning new things.

Definition: Unsupervised learning is a type of machine learning in which models


are trained using unlabeled dataset and are allowed to act on that data without
any supervision.

The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping the objects into clusters
such that objects with most similarities remain into a group and has less or
no similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which
is used for finding the relationships between variables in the large database.
It determines the set of items that occur together in the dataset. Association
rules make marketing strategy more effective. People who buy X item
(suppose a bread) also tend to purchase Y (Butter/Jam) item. A typical
example of Association rule is Market Basket Analysis.

K-means Clustering

K-Means Clustering is an Unsupervised Learning algorithm which groups the


unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two
clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different


clusters in such a way that each dataset belongs to only one group that has similar
properties.

It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in the unlabeled dataset on its own without the
need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid.


The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative


process.
o Assigns each data point to its closest k-center. Those data points which are
near to the k-center, create a cluster.
o Hence each cluster has datapoints with some commonalities, and it is away
from other clusters.
o The below diagram explains the working of the K-means Clustering
Algorithm:

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input
dataset).

Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassigning each datapoint to the new
closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
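
A minimal K-means sketch, assuming scikit-learn and NumPy; K = 2 and the 2-D points are made up purely to show two obvious groups:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:   ", kmeans.labels_)
print("cluster centroids:", kmeans.cluster_centers_)
print("new point goes to cluster:", kmeans.predict([[0, 0]]))
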

Association Rule Mining

Association rule learning is a type of unsupervised learning technique that checks


for the dependency of one data item on another data item and maps accordingly so
that it can be more profitable. It tries to find some interesting relations or
associations among the variables of dataset. It is based on different rules to
discover the interesting relations between variables in the database.
The association rule learning is one of the very important concepts of machine
learning, and it is employed in Market Basket analysis, Web usage mining,
continuous production, etc. Here market basket analysis is a technique used by
various big retailers to discover the associations between items. We can
understand it by taking the example of a supermarket, as in a supermarket, all
products that are purchased together are put together.

For example, if a customer buys bread, he most likely can also buy butter, eggs, or
milk, so these products are stored within a shelf or mostly nearby.

Association rule learning works on the concept of an If-Then statement, such as "if A, then B."

Here the If element is called the antecedent, and the Then element is called the consequent. These types of relationships, where we can find some association or relation between two items, are known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Support

Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X over a set of transactions T, it can be written as:

Support(X) = Freq(X) / |T|

Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Freq(X ∪ Y) / Freq(X)
Lift

It is the strength of any rule, which can be defined by the formula below:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the expected support if X and Y were independent of each other. It has three possible values:

o If Lift= 1: The probability of occurrence of antecedent and consequent is


independent of each other.
o Lift>1: It determines the degree to which the two itemsets are dependent to
each other.
o Lift<1: It tells us that one item is a substitute for other items, which means
one item has a negative effect on another.
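
A plain-Python sketch computing support, confidence, and lift for the rule {bread} → {butter} over a small made-up list of market-basket transactions:

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(items):
    # Fraction of transactions that contain all the given items.
    return sum(items <= t for t in transactions) / n

sup_x = support({"bread"})
sup_y = support({"butter"})
sup_xy = support({"bread", "butter"})

confidence = sup_xy / sup_x
lift = sup_xy / (sup_x * sup_y)
print(f"support={sup_xy:.2f}  confidence={confidence:.2f}  lift={lift:.2f}")
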

Types of Association Rule Learning

Association rule learning can be divided into three algorithms:

Apriori Algorithm

This algorithm uses frequent datasets to generate association rules. It is designed


to work on databases that contain transactions. This algorithm uses a breadth-first
search and Hash Tree to calculate the itemset efficiently.

It is mainly used for market basket analysis and helps to understand the products
that can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.

Eclat Algorithm

Eclat algorithm stands for Equivalence Class Transformation. This algorithm


uses a depth-first search technique to find frequent itemsets in a transaction
database. It performs faster execution than Apriori Algorithm.

F-P Growth Algorithm

The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:

o Market Basket Analysis: It is one of the popular examples and


applications of association rule mining. This technique is commonly used
by big retailers to determine the association between items.
o Medical Diagnosis: With the help of association rules, patients can be
cured easily, as it helps in identifying the probability of illness for a
particular disease.
o Protein Sequence: The association rules help in determining the synthesis
of artificial Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many
more other applications.

Reinforcement Learning

Definition:

"Reinforcement learning is a type of machine learning method where an


intelligent agent (computer program) interacts with the environment and learns to
act within that."

Reinforcement Learning is a feedback-based Machine learning technique in


which an agent learns to behave in an environment by performing the actions and
seeing the results of actions. For each good action, the agent gets positive
feedback, and for each bad action, the agent gets negative feedback or a penalty.

o In Reinforcement Learning, the agent learns automatically using feedback


without any labelled data, unlike supervised learning.
o Since there is no labelled data, the agent is bound to learn by its
experience only.
o RL solves a specific type of problem where decision making is sequential,
and the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve
performance by getting the maximum positive rewards.

Example: The problem is as follows: We have an agent and a reward, with


many hurdles in between. The agent is supposed to find the best possible
path to reach the reward. The following problem explains the problem
more easily.
The above image shows the robot, diamond, and fire. The goal of the robot is to get the reward, which is the diamond, and avoid the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final goal, that is, the diamond.

Terms used in Reinforcement Learning

o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: A state is the situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
o Policy: A policy is the strategy applied by the agent to decide the next action based on the current state.
o Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value: Mostly similar to the value, but it takes one additional parameter, the current action a.
Markov Decision Process

Markov Decision Process or MDP, is used to formalize the reinforcement


learning problems. If the environment is completely observable, then its dynamic
can be modelled as a Markov Process. In MDP, the agent constantly interacts
with the environment and performs actions; at each action, the environment
responds and generates a new state.

MDP is used to describe the environment for the RL, and almost all the RL
problems can be formalized using MDP.

MDP contains a tuple of four elements (S, A, Pa, Ra):

o A set of finite states S
o A set of finite actions A
o A reward function Ra, the reward received after transitioning from state S to state S' due to action a
o A transition probability Pa, the probability of moving from state S to state S' due to action a

MDP uses Markov property, and to better understand the MDP, we need to learn
about it.

Markov Property:

It says that "If the agent is present in the current state S1, performs an action a1
and move to the state s2, then the state transition from s1 to s2 only depends on
the current state and future action and states do not depend on past actions,
rewards, or states."

Or, in other words, as per Markov Property, the current state transition does not
depend on any past action or state. Hence, MDP is an RL problem that satisfies
the Markov property. Such as in a Chess game, the players only focus on the
current state and do not need to remember past actions or states.
Finite MDP:

A finite MDP is when there are finite states, finite rewards, and finite actions. In
RL, we consider only the finite MDP.
Applications of Reinforcement Learning
• Robotics: RL is used in Robot navigation, Robo-soccer, walking, juggling,
etc.
• Control: RL can be used for adaptive control, such as factory processes, admission control in telecommunication, and helicopter piloting.
• Game Playing: RL can be used in Game playing such as tic-tac-toe, chess,
etc.
• Chemistry: RL can be used for optimizing the chemical reactions.
• Business: RL is now used for business strategy planning.
• Manufacturing: In various automobile manufacturing companies, the
robots use deep reinforcement learning to pick goods and put them in some
containers.
• Finance Sector: The RL is currently used in the finance sector for
evaluating trading strategies.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement-learning in ML, which


are:

1. Value-based:
The value-based approach is about finding the optimal value function,
which is the maximum value at a state under any policy. Therefore, the
agent expects a long-term return at any state(s) under policy π.
2. Policy-based:
A policy-based approach is to find the optimal policy for the maximum
future rewards without using the value function. In this approach, the
agent tries to apply such a policy that the action performed in each step
helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any
state.
o Stochastic: In this policy, probability determines the produced
action.
3. Model-based: In the model-based approach, a virtual model is created for
the environment, and the agent explores that environment to learn it. There
is no solution or algorithm for this approach because the model
representation is different for each environment.

Working Process of Reinforcement Learning

To understand the working process of the RL, we need to consider two main
things:

o Environment: It can be anything such as a room, maze, football ground,


etc.
o Agent: An intelligent agent such as an AI robot.

Let's take an example of a maze environment that the agent needs to explore.
Consider the below image:

In the above image, the agent is at the very first block of the maze. The maze
consists of an S6 block, which is a wall, S8 a fire pit, and S4 a diamond block.

The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.

The agent can take any path to reach the final point, but it needs to do so in the fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.

The agent will try to remember the preceding steps that it has taken to reach the
final step. To memorize the steps, it assigns 1 value to each previous step.
Consider the below step:
Now, the agent has successfully stored the previous steps, assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block which has blocks of value 1 on both sides? Consider the below diagram:

It will be a difficult condition for the agent whether he should go up or down as


each block has the same value. So, the above approach is not suitable for the agent
to reach the destination. Hence to solve the problem, we will use the Bellman
equation, which is the main concept behind reinforcement learning.
Bellman Equation

The Bellman equation was introduced by the Mathematician Richard Ernest


Bellman in the year 1953, and hence it is called a Bellman equation. It is
associated with dynamic programming and used to calculate the values of a
decision problem at a certain point by including the values of previous states.

It is a way of calculating the value functions in dynamic programming or


environment that leads to modern reinforcement learning.

The key-elements used in Bellman equations are:


o Action performed by the agent is referred to as "a"
o The state that occurred by performing the action is "s."
o The reward/feedback obtained for each good and bad action is "R."
o A discount factor is Gamma "γ."

The Bellman equation can be written as:

V(s) = max [R(s,a) + γV(s')]

Where,

V(s) = value calculated at a particular state.

R(s,a) = reward obtained at state s by performing action a.

γ = discount factor.

V(s') = the value of the next state.

In the above equation, we are taking the max of the complete values because the
agent tries to find the optimal solution always.

So now, using the Bellman equation, we will find value at each state of the given
environment. We will start from the block, which is next to the target block.

For 1st block:

V(s3) = max [R(s,a) + γV(s`)], here V(s')= 0 because there is no further state to
move.

V(s3) = max[R(s,a)]=> V(s3)= max[1]=> V(s3)= 1.

For 2nd block:

V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.

V(s2) = max [0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

For 3rd block:

V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.

V(s1) = max [0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

For 4th block:

V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at this state either.

V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

For 5th block:

V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at this state either.

V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66

Consider the below image:

Now, we will move further to the 6th block, and here the agent may change the
route because it always tries to find the optimal path. So now, let's consider from
the block next to the fire pit.

Now, the agent has three options to move: if it moves to the blue box, it will feel a bump, and if it moves to the fire pit, it will get the -1 reward. But here we are considering only positive rewards, so it will move upwards only. The complete block values can be calculated using this formula. Consider the below image:
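
A plain-Python sketch reproducing the worked Bellman values above for the chain of states S3, S2, S1, S5, S9 (ordered backwards from the goal), with reward +1 only at the goal and γ = 0.9:

gamma = 0.9
states = ["s3", "s2", "s1", "s5", "s9"]   # ordered from the goal backwards

values = {}
next_value = 0.0                           # no state beyond the goal
for i, s in enumerate(states):
    reward = 1.0 if i == 0 else 0.0        # +1 reward only when entering the goal
    values[s] = reward + gamma * next_value   # V(s) = max[R(s,a) + gamma * V(s')]
    next_value = values[s]

for s in states:
    print(s, "=", round(values[s], 2))     # 1.0, 0.9, 0.81, 0.73, 0.66
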
Reinforcement Learning: Active vs Passive

Both active and passive reinforcement learning are types of RL. In the case of
passive RL, the agent’s policy is fixed which means that it is told what to do. In
contrast to this, in active RL, an agent needs to decide what to do as there’s no
fixed policy that it can act on. Therefore, the goal of a passive RL agent is to
execute a fixed policy (sequence of actions) and evaluate it while that of an active
RL agent is to act and learn an optimal policy.
Passive Learning

As the goal of the agent is to evaluate how good the fixed policy is, the agent needs to learn the expected utility Uπ(s) for each state s. This can be done in three ways.

1. Direct Utility Estimation

In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches the terminal state). Each trial gives a sample value, and the agent estimates the utility of a state as the running average of the sample values observed for that state.

The main drawback is that this method makes the wrong assumption that state utilities are independent, while in fact they are Markovian. It is also slow to converge.

2. Adaptive Dynamic Programming (ADP)

ADP uses dynamic programming to solve Markov decision problems. ADP is a


smarter method than Direct Utility Estimation, as it runs trials to learn the model of the environment, estimating the utility of a state as the sum of the reward for being in that state and the expected discounted utility of the next state:

Uπ(s) = R(s) + γ Σs' P(s'|s, π(s)) Uπ(s')

where R(s) = reward for being in state s, P(s'|s, π(s)) = transition model, γ = discount factor, and Uπ(s') = utility of being in state s'.

It can be solved using the value-iteration algorithm. The algorithm converges fast
but can become quite costly to compute for large state spaces. ADP is a model-
based approach and requires the transition model of the environment.

3. Temporal Difference Learning (TD)

TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates states that are directly affected:

Uπ(s) ← Uπ(s) + α [R(s) + γ Uπ(s') − Uπ(s)]

where α = the learning rate, which determines the convergence to true utilities.

While ADP adjusts the utility of s with respect to all of its successor states, TD learning adjusts it using only a single successor state s'. TD is slower in convergence but much simpler in terms of computation.

Active Reinforcement Learning

The active learning agent learns the complete model with outcome probabilities
for all actions. In this case the agent has a choice of action. They use Bellman
equations to calculate utilities to define optimal policy. After obtaining the
optimal utility function, the agent can just extract an optimal action by looking
ahead one step and the expected utility can be maximized.

1. ADP with exploration function


As the goal of an active agent is to learn an optimal policy, the agent needs to
learn the expected utility of each state and update its policy.

This can be done using a passive ADP agent and then using value or policy iteration to learn optimal actions. But this approach results in a greedy agent.

Hence, we use an approach that gives higher weights to unexplored actions and
lower weights to actions with lower utilities.

2. Learning an Action-Utility Function (Q-Learning)

Q-learning is a TD learning method which does not require the agent to learn the transition model; instead, it learns the action-utility function Q(s, a). Q-values are directly related to the utility value of a state.

Q-values can be updated using the following equation:

Q(s,a) ← Q(s,a) + α [R(s) + γ max_a' Q(s',a') − Q(s,a)]

The next action can then be selected using the following policy:

a = argmax_a' Q(s, a')

▪ Q learning is a model free method as it does not need any model for learning
or for action selection.
▪ The Q-learning agent is an active learner. It learns the value Q(s,a) of each action in every situation.
▪ The use of exploration function is same as that of ADP but it does not learn the
transition model as in case of other agents, rather it uses the Q value which can
be directly related to its neighbors.
▪ This is simpler to compute but slower than ADP.
▪ Also, it need not pay any attention to the policy being followed, because of the Q-value usage. For the same reason, it is also called an "off-policy" learning algorithm.
▪ It can behave better even if the policies are random or adversarial.

Due to these properties, Q-learning is one of the most successful methods, even in games such as checkers and chess, where the evaluation function is learned by a model based on Q-learning.
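
A tabular Q-learning sketch, assuming NumPy, on a made-up 5-state corridor where state 4 is the goal (reward +1) and the actions are 0 = left and 1 = right. The behaviour policy here is purely random, which is fine because Q-learning is off-policy: it still learns the optimal greedy policy from the experience:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    # Deterministic corridor dynamics: move left or right, reward only at the goal.
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions)        # random behaviour policy (off-policy learning)
        s_next, r = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print("learned greedy policy (0=left, 1=right):", np.argmax(Q, axis=1))
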

Hidden Markov Model

Hidden Markov Models (HMMs) are a type of probabilistic model that is commonly
used in machine learning for tasks such as speech recognition, natural language
processing, and bioinformatics. They are a popular choice for modelling sequences of
data because they can effectively capture the underlying structure of the data, even
when the data is noisy or incomplete.

Definition: A Hidden Markov Model (HMM) is a probabilistic model that consists of


a sequence of hidden states, each of which generates an observation. The hidden
states are usually not directly observable, and the goal of HMM is to estimate the
sequence of hidden states based on a sequence of observations. An HMM is defined by
the following components:
o A set of N hidden states, S = {s1, s2, ..., sN}.
o A set of M observations, O = {o1, o2, ..., oM}.
o An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies the probability of starting in each hidden state.
o A transition probability matrix, A = [aij], defines the probability of moving from
one hidden state to another.
o An emission probability matrix, B = [bjk], defines the probability of emitting an
observation from a given hidden state.

The basic idea behind an HMM is that the hidden states generate the observations, and
the observed data is used to estimate the hidden state sequence. This estimation is typically carried out with the forward-backward algorithm.
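
A small sketch of the forward pass for a toy 2-state HMM, assuming NumPy; the two hidden states, three observation symbols, and all probabilities are made up purely for illustration:

import numpy as np

pi = np.array([0.6, 0.4])                  # initial state distribution
A = np.array([[0.7, 0.3],                  # transition probability matrix a_ij
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],             # emission probability matrix b_jk
              [0.6, 0.3, 0.1]])
observations = [0, 1, 2]                   # an observed sequence of symbol indices

# alpha[i] = P(observations so far, hidden state i at the current time step)
alpha = pi * B[:, observations[0]]
for o in observations[1:]:
    alpha = (alpha @ A) * B[:, o]

print("P(observation sequence) =", alpha.sum())
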
