NLP Lab Manual Updated
LAB MANUAL
(2023-24)
Name :- …………………………………………….
CONTENTS
To be the fountainhead of novel ideas & innovations in science & technology, and to persist
as a foundation of pride for all Indians.
M1: Provide quality education, in both the theoretical and applied foundations of
computer science and train students to effectively apply this education to solve real-
world problems.
M2: Amplify students' potential for lifelong, high-quality careers.
PEO 1:
To prepare students for successful careers in the software industry that meet the needs of
Indian and multinational companies.
PEO 2:
To provide students with a solid foundation in mathematical, scientific and engineering
fundamentals, required both to solve engineering problems and to pursue higher studies.
PEO 3:
To develop the ability to work with the core competences of computer science &
engineering, i.e. software engineering, hardware structure & networking concepts, so
that one can find feasible solutions to real-world problems.
PEO 4:
To inculcate in students effective communication skills, teamwork, a multidisciplinary
approach, and an ability to relate engineering issues to a broader social context.
PEO 5:
To motivate students to persevere in lifelong learning and to introduce them to
professional ethics and codes of professional practice.
PO5. Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and modeling
to complex engineering activities with an understanding of the limitations.
PO6. The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
PO8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO12. Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.
CO4 To design a tag set to be used for statistical processing for real-time applications.
CO5 To compare and contrast the use of different statistical approaches for
different types of NLP applications.
2. Students should wear their ID cards while entering the LAB.
5. Students should sign in the LOGIN REGISTER before entering the
laboratory.
6. Students should bring their observation and record notebooks to the laboratory.
INDEX
Experiment No. 1
Write a program on Word Analysis
import nltk
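The listing above shows only the import; a minimal sketch of the analysis step that would produce the output below, assuming the input paragraph is stored in a variable text (the original passage is not reproduced in the manual):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
nltk.download('punkt')
nltk.download('stopwords')

text = "..."  # placeholder: the paragraph to analyse is not shown in the manual

# count sentences
sentences = sent_tokenize(text)
print("Number of sentences:", len(sentences))

# most frequent tokens once stopwords are removed (punctuation is kept,
# which matches the commas and full stops counted in the output below)
stop_words = set(stopwords.words('english'))
words = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print("Top 5 repeated words(excluding stopwords):", Counter(words).most_common(5))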
Output:
Number of sentences: 7
Top 5 repeated words(excluding stopwords): [(',', 23), ('.', 7), ('Government', 4), ('among', 2),
('Happiness', 2)]
Viva Questions
Experiment No.2
Write a program on Word Generation
#STEMMING
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import Word

text = "data science uses scientific methods algorithms and many types of processes"
print("Original Text:", text)
word_tokens = word_tokenize(text)

porter = PorterStemmer()
print("\nPorter Stemmer: ", [porter.stem(word) for word in word_tokens])

lancaster = LancasterStemmer()
print("Lancaster Stemmer: ", [lancaster.stem(word) for word in word_tokens])

snowball = SnowballStemmer("english") #Language passed as a parameter
print("Snowball Stemmer: ", [snowball.stem(word) for word in word_tokens])

#LEMMATIZATION
lemmatizer = WordNetLemmatizer()
print("\nWord Net Lemmatizer: ", [lemmatizer.lemmatize(word) for word in word_tokens])
print("Textblob Lemmatizer: ", [Word(word).lemmatize() for word in word_tokens])
Output:
Original Text: data science uses scientific methods algorithms and many types of processes
Porter Stemmer: ['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of',
'process']
Lancaster Stemmer: ['dat', 'sci', 'us', 'sci', 'method', 'algorithm', 'and', 'many', 'typ', 'of', 'process']
Snowball Stemmer: ['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of',
'process']
Word Net Lemmatizer: ['data', 'science', 'use', 'scientific', 'methods', 'algorithms', 'and', 'many',
'type', 'of', 'process']
Textblob Lemmatizer: ['data', 'science', 'us', 'scientific', 'method', 'algorithm', 'and', 'many', 'type',
'of', 'process']
Viva Questions
Q1. What is Stemming ?
Ans: Stemming is the process of reducing a word to its word stem by stripping affixes (prefixes
and suffixes), so that different inflected forms map to a common root. Stemming is important in
natural language understanding (NLU) and natural language processing (NLP).
Q2. What is Lemmatization ?
Ans: Lemmatization is the process of grouping together the different inflected forms of a word
so that they can be analysed as a single item. Lemmatization is similar to stemming, but it brings
context to the words, linking words with similar meanings to one base word (the lemma).
Experiment No.3
Write a program on Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit. For example:
• बच्चों (bachchoM) has two morphemes: the root noun बच्चा (bachchaa) and the suffix ओं (oM),
which carries the information of plural number and oblique case.
• played has two morphemes, play and -ed, carrying the information verb "play" and "past tense",
so the given word is the past-tense form of the verb "play".
Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis, as sketched below.
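A minimal sketch of such an Add-Delete table in Python (the paradigm entries below are illustrative assumptions, not taken from the manual):

# each entry: (suffix to delete from the root, suffix to add, grammatical features)
paradigm = {
    "लड़का": [("ा", "े", "masculine plural direct / singular oblique"),
              ("ा", "ों", "masculine plural oblique")],
    "नदी": [("ी", "ियाँ", "feminine plural direct"),
            ("ी", "ियों", "feminine plural oblique")],
}

def analyse(root):
    print("Word:", root)
    for delete, add, features in paradigm[root]:
        stem = root[:len(root) - len(delete)] if delete else root
        print("  " + stem + add, "(" + features + ")")

for root in ("नदी", "लड़का"):
    analyse(root)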
Output:
Word: नदी
Word: लड़का
Viva Questions
Q1.What is Morphology ?
Ans: Morphology is a branch of linguistics that focuses on the way in which words are formed
from morphemes. There are two types of morphemes, namely lexical morphemes and
grammatical morphemes.
Experiment No. 4
Write a program on N-Grams
# imports
import string
import random
import nltk
from collections import defaultdict, Counter
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters
from nltk import FreqDist
from nltk.util import ngrams
from nltk.corpus import stopwords

# input the reuters sentences
sents = reuters.sents()

# collect padded bigrams and trigrams from every sentence
tokenized_text = []
bigram = []
trigram = []
for sentence in sents:
    sentence = [word.lower() for word in sentence]
    tokenized_text.append(sentence)
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# count trigram frequencies, then build conditional counts d[(a, b)][c]
freq_tri = FreqDist(trigram)
d = defaultdict(Counter)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[(a, b)][c] += freq_tri[a, b, c]
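The listing stops before the generation step that produces the output below; a minimal sketch of that step, built on the conditional counts d above (the seed bigram and the sentence length are assumptions):

# grow a sentence from a seed bigram, sampling each next word in
# proportion to its trigram count, and print the sentence as it grows
prefix = ("he", "said")        # assumed seed bigram
sentence = list(prefix)
print(" ".join(sentence))
for _ in range(20):
    followers = d[tuple(sentence[-2:])]
    if not followers:          # no continuation seen in the corpus
        break
    next_word = random.choices(list(followers.keys()),
                               weights=list(followers.values()))[0]
    if next_word is None:      # reached end-of-sentence padding
        break
    sentence.append(next_word)
    print(" ".join(sentence))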
Output:
he said
he said ,
he said , has
he said , has unanimously
he said , has unanimously approved
he said , has unanimously approved a
he said , has unanimously approved a bill
he said , has unanimously approved a bill which
he said , has unanimously approved a bill which would
he said , has unanimously approved a bill which would cost
he said , has unanimously approved a bill which would cost an
he said , has unanimously approved a bill which would cost an estimated
he said , has unanimously approved a bill which would cost an estimated 5
he said , has unanimously approved a bill which would cost an estimated 5 6
he said , has unanimously approved a bill which would cost an estimated 5 6 mln
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one ,
he said , has unanimously approved a bill which would cost an estimated 5 6 mln of band one , rather
Viva Questions
Q1. What are N-Grams?
Ans: N-grams of texts are extensively used in text mining and natural language processing tasks.
They are basically a set of co-occurring words within a given window and when computing the
n-grams you typically move one word forward (although you can move X words forward in more
advanced scenarios).
For example, consider the sentence “The cow jumps over the moon”. If N=2 (known as bigrams),
then the n-grams would be:
• the cow
• cow jumps
• jumps over
• over the
• the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to
jumps->over, etc., essentially moving one word forward to generate the next bigram.
If N=3, the n-grams would be:
• the cow jumps
• cow jumps over
• jumps over the
• over the moon
So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is
essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3
this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so
on.
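This sliding-window behaviour is exactly what nltk.util.ngrams implements; a quick check with the example sentence above:

from nltk.util import ngrams
print(list(ngrams("the cow jumps over the moon".split(), 2)))
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]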
Experiment No. 5
Write a program on POS Tagging and Chunking
import nltk
from nltk.corpus import stopwords
from nltk import RegexpParser
stop_words = set(stopwords.words('english')) #English language selected for stopwords
#Sentence
txt = "Sukanya, Rajib and Naba are my good friends. "
# Word tokenizers is used to find the words and punctuation in a string
wordsList = nltk.word_tokenize(txt)
# removing stop words from wordList
wordsList = [w for w in wordsList if w not in stop_words]
# Using a Tagger. Which is part-of-speech tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)
#Using a tagset
tagged_universal = nltk.pos_tag(wordsList,tagset='universal')
print("Default : ",*tagged,"\nWith parameter tagset='universal' :",*tagged_universal,end="\n\n")
#Chunking
#grammar pattern rule using for chunking
patterns= """chunk rule: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
# Name of symbol Description
#. Any character except new line
#* Match 0 or more repetitions
#? Match 0 or 1 repetitions
chunker = RegexpParser(patterns)
print("\nAfter RegexParser:",chunker)
output = chunker.parse(tagged)
print("\nAfter Chunking",output)
#output for graph
output.draw()
Output:
Default : ('Sukanya', 'NNP') (',', ',') ('Rajib', 'NNP') ('Naba', 'NNP') ('good', 'JJ') ('friends', 'NNS')
('.', '.')
With parameter tagset='universal' : ('Sukanya', 'NOUN') (',', '.') ('Rajib', 'NOUN') ('Naba',
'NOUN') ('good', 'ADJ') ('friends', 'NOUN') ('.', '.')
After Chunking (S
(chunk rule Sukanya/NNP)
,/,
(chunk rule Rajib/NNP Naba/NNP good/JJ)
(chunk rule friends/NNS)
./.)
Viva Questions
Experiment No. 6
Write a program for Hidden Markov Model
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from hmmlearn import hmm
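The listing ends with the imports; a minimal sketch of what could follow, using a 2-state Gaussian HMM with hand-set, illustrative parameters (assumptions, not the manual's own example):

np.random.seed(42)

# a 2-state HMM over 1-D Gaussian observations (all parameters assumed)
model = hmm.GaussianHMM(n_components=2, covariance_type="full")
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])
model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.tile(np.identity(1), (2, 1, 1))

# sample an observation sequence, then decode the hidden states (Viterbi)
X, Z = model.sample(100)
states = model.predict(X)

sns.scatterplot(x=np.arange(len(X)), y=X[:, 0], hue=states)
plt.title("Observations coloured by decoded hidden state")
plt.show()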
Output:
Viva Questions
Q1. How can HMMs be applied to part-of-speech tagging or speech recognition tasks?
Ans: Part-of-Speech Tagging:
• Model Structure: HMMs treat tags as hidden states, with words as observables.
• Training: Learn transitions between tags and associations between words and tags.
• Inference: Viterbi or other decoding algorithms for the most likely tag sequence based on
the observed words.
Speech Recognition:
• Model Structure: HMMs model phonetic units, with acoustic features as observables.
• Training: Capture transitions between phonetic units and model acoustic features.
• Inference: Viterbi or other decoding algorithms for the most likely phonetic unit sequence
based on observed features.
Experiment No. 7
Write a program for Viterbi Algorithm
transition_probability = {
'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
'Fever' : {'Healthy': 0.4, 'Fever': 0.6}
}
emission_probability = {
'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}
}
for y in states:
(prob, state) = max((V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0) for y0 in states)
V[t][y] = prob
newpath[y] = path[state] + [y]
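The listing shows only the probability tables and the inner recursion; a complete sketch built around them follows. The states, observation sequence, and start probabilities are the classic Healthy/Fever example values, assumed here since the manual does not show them:

states = ('Healthy', 'Fever')
observations = ('normal', 'cold', 'dizzy')
start_probability = {'Healthy': 0.6, 'Fever': 0.4}  # assumed

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]   # V[t][y] = probability of the best path ending in state y at time t
    path = {}
    # initialisation with the start probabilities
    for y in states:
        V[0][y] = start_p[y] * emit_p[y][obs[0]]
        path[y] = [y]
    # recursion (the inner loop shown in the listing above)
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            (prob, state) = max((V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0) for y0 in states)
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    # termination: pick the best final state and return its path
    (prob, state) = max((V[len(obs) - 1][y], y) for y in states)
    return (prob, path[state])

print(viterbi(observations, states, start_probability,
              transition_probability, emission_probability))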
OUTPUT:
Viva Questions
IPS ACADEMY
16 Colleges, 71 Courses, 51-Acre Campus
Knowledge Village
Rajendra Nagar
A. B. Road Indore
452012(M.P.) India
Ph: 0731-4014601-604 Mo: 07746000161
E Mail: [email protected]
Website: ies.ipsacademy.org & www.ipsacademy.org