NLP Lab Manual


IPS ACADEMY INDORE

INSTITUTE OF ENGINEERING & SCIENCE


COMPUTER SCIENCE & ENGINEERING – AIML
DEPARTMENT

LAB MANUAL
(2023-24)

Natural Language Processing

Name :- …………………………………………….

Year :- 3rd Semester:- 6th

Class Roll No.:- …… Enrollment No.:- ……………….


IPS ACADEMY, INSTITUTE OF ENGINEERING AND SCIENCE- INDORE

CONTENTS

1. Vision & Mission of the Institute
2. Vision & Mission of the Department
3. PEOs
4. POs
5. COs
6. Laboratory Regulations and Safety Rules
7. Index
8. Experiments

6th SEM Natural Language Processing 2023-24



Vision of the Institute

To be the fountainhead of novel ideas and innovations in science and technology, and
to persist as a foundation of pride for all Indians.

Mission of the Institute

• To provide value-based, broad education in Engineering, Technology and Science, in
which students are urged to develop their professional skills.
• To inculcate dedication, hard work, sincerity, integrity and ethics in building up the
overall professional personality of our students and faculty.
• To inculcate a spirit of entrepreneurship and innovation in graduating students.
• To instill sensitivity amongst the youth towards the community and environment.
• To promote sponsored research and provide consultancy services in technical,
educational and industrial areas.


Vision of the Department

Attaining global recognition in computer science and engineering education, research
and training to meet the growing needs of the industry and society.

Mission of the Department

M1: Provide quality education in both the theoretical and applied foundations of
computer science, and train students to effectively apply this education to solve
real-world problems.
M2: Amplify students' potential for lifelong, high-quality careers.


Program Educational Objectives (PEOs)

PEO 1:
To prepare students for successful careers in the software industry that meet the needs
of Indian and multinational companies.
PEO 2:
To provide students with a solid foundation in mathematical, scientific and engineering
fundamentals to solve engineering problems and also to pursue higher studies.
PEO 3:
To develop the ability to work with the core competences of computer science &
engineering, i.e. software engineering, hardware structure & networking concepts, so
that one can find feasible solutions to real-world problems.
PEO 4:
To instill in students effective communication skills, team work, a multidisciplinary
approach, and an ability to relate engineering issues to a broader social context.
PEO 5:
To motivate students to persevere in lifelong learning and to introduce them to
professional ethics and codes of professional practice.


Program Outcomes (POs)

PO1. Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.

PO2. Problem analysis: Identify, formulate, research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.

PO3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.

PO4. Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments, analysis and
interpretation of data, and synthesis of the information to provide valid conclusions.

PO5. Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and modeling
to complex engineering activities with an understanding of the limitations.

PO6. The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.


PO7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.

PO8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.

PO9. Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.

PO10. Communication: Communicate effectively on complex engineering
activities with the engineering community and with society at large, such as, being
able to comprehend and write effective reports and design documentation, make
effective presentations, and give and receive clear instructions.
PO11. Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply these to
one’s own work, as a member and leader in a team, to manage projects and in
multidisciplinary environments.

PO12. Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.


Course Outcomes (COs)

CO1 To tag a given text with basic language features.

CO2 To design an innovative application using NLP components.

CO3 To implement a rule based system to tackle morphology/syntax of a language.

CO4 To design a tag set to be used for statistical processing for real-time applications.

CO5 To compare and contrast the use of different statistical approaches for
different types of NLP applications.


Laboratory Regulations and Safety Rules

1. Do not enter the laboratory without prior permission.

2. Students should wear their ID cards while entering the lab.

3. Students should come in proper uniform.

4. Students should not use mobile phones inside the laboratory.

5. Students should sign the LOGIN REGISTER before entering the
laboratory.

6. Students should bring their observation and record notebooks to the laboratory.

7. Do not change any computer settings.

8. Students should maintain silence inside the laboratory.

9. After completing the laboratory exercise, make sure to SHUT DOWN the
system properly.


Enrollment No. : ………………………

INDEX

S. No. Experiment Name Date Grade Signature


1. Write a program on Word Analysis

2. Write a program on Word Generation

3. Write a program on Morphology

4. Write a program on N-Grams

5. Write a program on Part-of-Speech Tagging

6. Write a program for Hidden Markov Model

7. Write a program for Viterbi Algorithm


Experiment No. 1
Write a program on Word Analysis

import nltk
from nltk.corpus import stopwords
from collections import Counter

txt = """We hold these truths to be self-evident, that all men are created equal, that they are
endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and
the pursuit of Happiness. That to secure these rights, Governments are instituted among Men,
deriving their just powers from the consent of the governed, That whenever any Form of
Government becomes destructive of these ends, it is the Right of the People to alter or to abolish
it, and to institute new Government, laying its foundation on such principles and organizing its
powers in such form, as to them shall seem most likely to effect their Safety and Happiness.
Prudence, indeed, will dictate that Governments long established should not be changed for light
and transient causes; and accordingly all experience hath shewn, that mankind are more disposed
to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they
are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same
Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty,
to throw off such Government, and to provide new Guards for their future security. Such has been
the patient sufferance of these Colonies; and such is now the necessity which constrains them to
alter their former Systems of Government. The history of the present King of Great Britain is a
history of repeated injuries and usurpations, all having in direct object the establishment of an
absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world."""

print("Number of characters (including spaces): ", len(txt))
print("Number of characters (excluding spaces): ", len(txt.replace(" ", "")))
print("Number of sentences: ", len(nltk.sent_tokenize(txt)))
print("Number of words: ", len(nltk.word_tokenize(txt)))
stop_words = set(stopwords.words("english"))
word_tokens = nltk.word_tokenize(txt)
filtered_text = [word for word in word_tokens if word not in stop_words]
print("Number of words (excluding stopwords): ",len(filtered_text))
n = Counter(filtered_text)
print("Number of unique words (excluding stopwords): ",len(n))
print("Top 5 repeated words(excluding stopwords): ",n.most_common(5))


Output:

Number of characters (including spaces): 1624

Number of characters (excluding spaces): 1356

Number of sentences: 7

Number of words: 301

Number of words (excluding stopwords): 164

Number of unique words (excluding stopwords): 120

Top 5 repeated words(excluding stopwords): [(',', 23), ('.', 7), ('Government', 4), ('among', 2),
('Happiness', 2)]
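Note that the tokenizer treats punctuation as tokens, which is why ',' and '.' top the frequency list above. A small variant that also drops punctuation is sketched below; it is an illustrative refinement (not part of the original listing) and uses a plain whitespace tokenizer instead of nltk, so its counts would differ slightly from the output above.

```python
import string
from collections import Counter

# Sketch: count word frequencies while skipping punctuation, so
# tokens such as ',' and '.' no longer dominate the list.
def word_frequencies(text, stop_words=()):
    # Strip leading/trailing punctuation from each whitespace token.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return Counter(w for w in words if w and w not in stop_words)

freqs = word_frequencies("We hold these truths, to be self-evident.",
                         stop_words={"we", "to", "be"})
print(freqs.most_common(3))
```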


Viva Questions

Q1. What is NLP?

Ans: Natural Language Processing (NLP) is the process of manipulating or understanding text
or speech by software or a machine. An analogy is that humans interact, understand one
another’s views, and respond with the appropriate answer. In NLP, this interaction, understanding,
and response are produced by a computer instead of a human.
Q2. What is NLTK?
Ans: NLTK (Natural Language Toolkit) is a suite that contains libraries and programs for statistical
language processing. It is one of the most powerful NLP libraries, containing packages that help
machines understand human language and reply to it with an appropriate response.

Experiment No.2
Write a program on Word Generation

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

print("Original Text: ", text := 'data science uses scientific methods algorithms and many types of processes')

# STEMMING
word_tokens = word_tokenize(text)
porter = PorterStemmer()
print("\nPorter Stemmer: ", stems := [porter.stem(word) for word in word_tokens])
lancaster = LancasterStemmer()
print("Lancaster Stemmer: ", stems := [lancaster.stem(word) for word in word_tokens])
snowball = SnowballStemmer("english")  # language passed as a parameter
print("Snowball Stemmer: ", stems := [snowball.stem(word) for word in word_tokens])

# LEMMATIZATION
lemmatizer = WordNetLemmatizer()
word_tokens = word_tokenize(text)

# provide context, i.e. part-of-speech
print("\n Word Net Lemmatizer: ", lemmas := [lemmatizer.lemmatize(word, pos='v') for word in word_tokens])
sent = TextBlob(text)
print("Textblob Lemmatizer: ", [w.lemmatize() for w in sent.words])


Output:

Original Text: data science uses scientific methods algorithms and many types of processes

Porter Stemmer: ['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of',
'process']
Lancaster Stemmer: ['dat', 'sci', 'us', 'sci', 'method', 'algorithm', 'and', 'many', 'typ', 'of', 'process']
Snowball Stemmer: ['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of',
'process']

Word Net Lemmatizer: ['data', 'science', 'use', 'scientific', 'methods', 'algorithms', 'and', 'many',
'type', 'of', 'process']
Textblob Lemmatizer: ['data', 'science', 'us', 'scientific', 'method', 'algorithm', 'and', 'many', 'type',
'of', 'process']


Viva Questions
Q1. What is Stemming?

Ans: Stemming is the process of reducing a word to its word stem by stripping prefixes and
suffixes. The resulting stem need not be a valid dictionary word. Stemming is important in
natural language understanding (NLU) and natural language processing (NLP).

Q2. What is Lemmatization ?

Ans: Lemmatization is the process of grouping together the different inflected forms of a word
so they can be analysed as a single item. Lemmatization is similar to stemming but it brings
context to the words. So it links words with similar meaning to one word.
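The contrast can be made concrete with a toy sketch: a stemmer strips suffixes blindly, so its output need not be a dictionary word, while a lemmatizer maps inflected forms to dictionary words. The suffix rules and lemma table below are illustrative assumptions, not a real stemmer or lemmatizer (the experiment above uses PorterStemmer and WordNetLemmatizer).

```python
# Toy stemmer: strip the first matching suffix, with no regard for
# whether the remainder is a real word.
def toy_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look up known inflected forms, so the result is
# always a dictionary word (including irregular forms like 'went').
LEMMA_TABLE = {"studies": "study", "went": "go", "better": "good"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(toy_stem("studies"))       # 'stud' - not a dictionary word
print(toy_lemmatize("studies"))  # 'study' - a dictionary word
print(toy_lemmatize("went"))     # 'go' - lookup handles irregular forms
```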


Experiment No.3
Write a program on Morphology

Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit. For example:

• बच्चों(bachchoM) consists of two morphemes, बच्चा (bachchaa) has the information of

the root word noun " बच्चा "(bachchaa) and ओ (oM) has the information of plural and

oblique case.
• played has two morphemes play and -ed having information verb "play" and "past tense",
so given word is past tense form of verb "play".

Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis.
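The Add-Delete idea can be sketched in Python as follows. The table entries and feature names below are illustrative assumptions, not a complete paradigm: each row says which suffix to delete, what to add back, and which grammatical features the suffix carries.

```python
# A minimal Add-Delete table for English suffixes.
ADD_DELETE = [
    # (delete-suffix, add-string, features carried by the suffix)
    ("ies", "y", {"pos": "noun", "number": "plural"}),
    ("ed",  "",  {"pos": "verb", "tense": "past"}),
    ("s",   "",  {"pos": "noun", "number": "plural"}),
]

def analyse(word):
    """Return (root, features) candidates for a surface word."""
    candidates = []
    for suffix, add, feats in ADD_DELETE:
        if word.endswith(suffix):
            candidates.append((word[: -len(suffix)] + add, feats))
    return candidates

# 'played' -> root 'play' with past-tense features
print(analyse("played"))
# 'babies' yields two candidates ('baby' and the spurious 'babie');
# a dictionary of known root words would be needed to disambiguate.
print(analyse("babies"))
```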


Output:

Word: नदी

Word: लड़का


Viva Questions

Q1.What is Morphology ?

Ans: Morphology is a branch of linguistics that focuses on the way in which words are formed
from morphemes. There are two types of morphemes, namely lexical morphemes and
grammatical morphemes.

Q2.What are the different types of computational approaches to morphology?

Ans: 1) Dictionary Lookup 2) Finite-State Morphology 3) Unification-Based Morphology
4) Functional Morphology


Experiment No. 4

Write a program on N-Grams

# imports
import string
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters
from nltk import FreqDist
from nltk.util import ngrams
from nltk.corpus import stopwords

# input the reuters sentences
sents = reuters.sents()

# the removal characters: stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '“' + '”' + '-' + '‘' + '’' + '—'
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']

# generate unigrams, bigrams, trigrams
unigram = []
bigram = []
trigram = []
tokenized_text = []
for sentence in sents:
    sentence = list(map(lambda x: x.lower(), sentence))
    for word in sentence:
        if word == '.':
            sentence.remove(word)
        else:
            unigram.append(word)

    tokenized_text.append(sentence)
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# remove the n-grams made up entirely of removable words
def remove_stopwords(x):
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in removal_list:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

unigram = remove_stopwords(unigram)
bigram = remove_stopwords(bigram)
trigram = remove_stopwords(trigram)

# generate frequency of n-grams
from collections import defaultdict
from collections import Counter

freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

d = defaultdict(Counter)


for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[(a, b)][c] += freq_tri[a, b, c]

# Next word prediction
def pick_word(counter):
    "Chooses a random element."
    return random.choice(list(counter.elements()))

prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])
    s = s + ' ' + suffix
    print(s)
    prefix = prefix[1], suffix


Output:

he said
he said ,
he said , has
he said , has unanimously
he said , has unanimously approved
he said , has unanimously approved a
he said , has unanimously approved a bill
he said , has unanimously approved a bill which
he said , has unanimously approved a bill which would
he said , has unanimously approved a bill which would cost
he said , has unanimously approved a bill which would cost an
he said , has unanimously approved a bill which would cost an estimated
he said , has unanimously approved a bill which would cost an estimated 5
he said , has unanimously approved a bill which would cost an estimated 5 6
he said , has unanimously approved a bill which would cost an estimated 5 6 mln
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one ,
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one , rather


Viva Questions
Q1. What are N-Grams?

Ans: N-grams of texts are extensively used in text mining and natural language processing tasks.
They are basically a set of co-occurring words within a given window and when computing the
n-grams you typically move one word forward (although you can move X words forward in more
advanced scenarios).

For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams),
then the n-grams would be:

• the cow
• cow jumps
• jumps over
• over the
• the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to
jumps->over, etc., essentially moving one word forward to generate the next bigram.
If N=3, the n-grams would be:
• the cow jumps
• cow jumps over
• jumps over the
• over the moon
So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is
essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3
this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so
on.
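The sliding-window idea above can be sketched directly. This is a minimal illustration using plain Python (the experiment itself uses nltk.util.ngrams):

```python
# Pair each word with its successor to produce bigrams: a window of
# size 2 that moves one word forward at each step.
def bigrams(sentence):
    words = sentence.lower().split()
    return list(zip(words, words[1:]))

print(bigrams("The cow jumps over the moon"))
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]
```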

Q2. What is TextBlob in Python?


Ans: TextBlob is an open-source Python library for processing textual data. It performs different
operations on textual data such as noun phrase extraction, sentiment analysis, classification,
translation, etc.


Experiment No. 5

Write a program on Part-of-Speech Tagging

import nltk
from nltk.corpus import stopwords
from nltk import RegexpParser
stop_words = set(stopwords.words('english'))  # English language selected for stopwords
#Sentence
txt = "Sukanya, Rajib and Naba are my good friends. "
# Word tokenizers is used to find the words and punctuation in a string
wordsList = nltk.word_tokenize(txt)
# removing stop words from wordList
wordsList = [w for w in wordsList if w not in stop_words]
# Using a Tagger. Which is part-of-speech tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)
#Using a tagset
tagged_universal = nltk.pos_tag(wordsList,tagset='universal')
print("Default : ",*tagged,"\nWith parameter tagset='universal' :",*tagged_universal,end="\n\n")
# Chunking
# grammar pattern rule used for chunking
patterns = """chunk rule: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
# Name of symbol Description
#. Any character except new line
#* Match 0 or more repetitions
#? Match 0 or 1 repetitions
chunker = RegexpParser(patterns)
print("\nAfter RegexParser:",chunker)
output = chunker.parse(tagged)
print("\nAfter Chunking",output)
#output for graph
output.draw()


Output:
Default : ('Sukanya', 'NNP') (',', ',') ('Rajib', 'NNP') ('Naba', 'NNP') ('good', 'JJ') ('friends', 'NNS')
('.', '.')
With parameter tagset='universal' : ('Sukanya', 'NOUN') (',', '.') ('Rajib', 'NOUN') ('Naba',
'NOUN') ('good', 'ADJ') ('friends', 'NOUN') ('.', '.')

After RegexParser: chunk.RegexpParser with 1 stages:


RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>

After Chunking (S
(chunk rule Sukanya/NNP)
,/,
(chunk rule Rajib/NNP Naba/NNP good/JJ)
(chunk rule friends/NNS)
./.)


Viva Questions

Q1. What is Part-of-speech (POS) tagging?

Ans: It is the process of converting a sentence into a list of tuples, where each tuple has the
form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun,
adjective, verb, and so on.

Q2. What are stop words in NLP?

Ans: Stop words are a set of commonly used words in a language. Examples of stop words in
English are “a”, “the”, “is” and “are”. Stop words are commonly removed in Text Mining and
Natural Language Processing (NLP) because they occur so frequently that they carry very little
useful information.
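Stop-word removal can be illustrated with a small hand-picked set. The list below is illustrative only; nltk.corpus.stopwords supplies a much fuller list, as used in the experiments above.

```python
# A tiny illustrative stop-word list; real applications use a fuller
# set such as nltk.corpus.stopwords.words('english').
STOP_WORDS = {"a", "the", "is", "are", "and", "of", "to"}

def remove_stop_words(text):
    # Lowercase before comparing, since the list is lowercase.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The cow jumps over the moon"))
# ['cow', 'jumps', 'over', 'moon']
```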


Experiment No. 6
Write a program for Hidden Markov Model

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from hmmlearn import hmm

# Define the state space
states = ["Silence", "Word1", "Word2", "Word3"]
n_states = len(states)

# Define the observation space
observations = ["Loud", "Soft"]
n_observations = len(observations)

# Define the initial state distribution
start_probability = np.array([0.8, 0.1, 0.1, 0.0])

# Define the state transition probabilities
transition_probability = np.array([[0.7, 0.2, 0.1, 0.0],
                                   [0.0, 0.6, 0.4, 0.0],
                                   [0.0, 0.0, 0.6, 0.4],
                                   [0.0, 0.0, 0.0, 1.0]])

# Define the observation likelihoods
emission_probability = np.array([[0.7, 0.3],
                                 [0.4, 0.6],
                                 [0.6, 0.4],
                                 [0.3, 0.7]])

# Build the model with the specified parameters
model = hmm.CategoricalHMM(n_components=n_states)
model.startprob_ = start_probability
model.transmat_ = transition_probability
model.emissionprob_ = emission_probability

# Define the sequence of observations
observations_sequence = np.array([0, 1, 0, 0, 1, 1, 0, 1]).reshape(-1, 1)

# Predict the most likely hidden states
hidden_states = model.predict(observations_sequence)
print("Most likely hidden states:", hidden_states)

# Plot the results
sns.set_style("darkgrid")
plt.plot(hidden_states, '-o', label="Hidden State")
plt.legend()
plt.show()


Output:


Viva Questions

Q1. What is HMM in NLP?

Ans: A Hidden Markov Model (HMM) is “a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states”.

Q2. How can HMMs be applied to part-of-speech tagging or speech recognition tasks?
Ans: Part-of-Speech Tagging:

• Model Structure: Words as observable symbols, part-of-speech tags as hidden states.

• Training: Learn transitions between tags and associations between words and tags.

• Inference: Viterbi algorithm for the most likely tag sequence.

Speech Recognition:

• Model Structure: HMMs model phonetic units, with acoustic features as observables.

• Training: Capture transitions between phonetic units and model acoustic features.

• Inference: Viterbi or other decoding algorithms for the most likely phonetic unit sequence
based on observed features.


Experiment No. 7
Write a program for Viterbi Algorithm

states = ('Healthy', 'Fever')

observations = ('normal', 'cold', 'dizzy')

start_probability = {'Healthy': 0.6, 'Fever': 0.4}

transition_probability = {
'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
'Fever' : {'Healthy': 0.4, 'Fever': 0.6}
}

emission_probability = {
'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}
}

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]
    path = {}

    # Initialize base cases (t == 0)
    for y in states:
        V[0][y] = start_p[y] * emit_p[y][obs[0]]
        path[y] = [y]

    # Run Viterbi for t > 0
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}

        for y in states:
            (prob, state) = max((V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0)
                                for y0 in states)
            V[t][y] = prob
            newpath[y] = path[state] + [y]

        # Don't need to remember the old paths
        path = newpath

    n = 0  # if only one element is observed, max is sought in the initialization values
    if len(obs) != 1:
        n = t
    # print_dptable(V)
    (prob, state) = max((V[n][y], y) for y in states)
    return (prob, path[state])

# Don't study this, it just prints a table of the steps.
def print_dptable(V):
    s = "    " + " ".join(("%7d" % i) for i in range(len(V))) + "\n"
    for y in V[0]:
        s += "%.5s: " % y
        s += " ".join("%.7s" % ("%f" % v[y]) for v in V)
        s += "\n"
    print(s)

print(viterbi(observations, states, start_probability,
              transition_probability, emission_probability))

OUTPUT:

(0.01512, ['Healthy', 'Healthy', 'Fever'])


Viva Questions

Q1. What is the Viterbi algorithm?


Ans: The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a
posteriori probability estimate of the most likely sequence of hidden states called the Viterbi path
that results in a sequence of observed events, especially in the context of Markov information
sources and hidden Markov models (HMM).

Q2. Why is the Viterbi algorithm considered a dynamic programming approach?


Ans: The Viterbi algorithm is considered a dynamic programming approach because it efficiently
solves problems by breaking them into smaller overlapping subproblems, storing solutions to
avoid redundancy. It exhibits optimal substructure, uses memoization to store results, and relies
on recursion to calculate probabilities for the most likely sequence of hidden states in Hidden
Markov Models (HMMs).



Knowledge, Skills, Values

IPS ACADEMY
16 Colleges, 71 Courses, 51 Acre Campus

ISO 9001: 2008 Certified

Knowledge Village
Rajendra Nagar
A. B. Road Indore
452012(M.P.) India
Ph: 0731-4014601-604 Mo: 07746000161
E Mail: [email protected]
Website: ies.ipsacademy.org & www.ipsacademy.org
