NLP Lab Manual


Experiment -1

Aim: Demonstrate noise removal for any textual data and remove regular expression patterns such as hashtags from textual data.

Description: Text cleaning or Text preprocessing or data preprocessing


A text is unstructured data and may be full of inconsistencies and ambiguity.
Text preprocessing is a method in NLP that involves cleaning text data and making it ready for model building. Raw text (a text corpus), i.e. human-written text data collected from one or many sources such as websites, spoken language, and voice-recognition systems, may contain misspelled words, short forms, special symbols, emojis, etc.
Cleaning noisy text data is an essential step before building any model in Natural Language Processing. After cleaning, further text preprocessing is required to reshape the data into a form that can be fed directly to the model. If preprocessing is not done properly, the data will be as good as garbage, and the NLP model produced from it will be equally poor.
Natural Language Processing (NLP) therefore requires different methods to clean the data (text) and to handle special symbols or punctuation in the text data.
The most common text preprocessing steps are:
1. Lowercasing
2. Removing numbers
3. Removing extra whitespace
4. Removal of HTML tags
5. Removal of URLs
6. Removal of hashtags
7. Removing punctuation
8. Tokenization
9. Spelling correction
10. Stopword removal
11. Removal of frequent words
12. Stemming
13. Lemmatization
1. Lowercasing: Lowercasing is an important text preprocessing step in which we convert all text into the same casing, preferably lowercase. It is helpful in text featurization techniques like term frequency and TF-IDF, since it prevents duplication of the same word with different casing.
2. Converting numbers into words or removing numbers: Remove numbers if they are not relevant to your analysis. Usually, regular expressions are used to remove numbers.
3. Removing whitespace: Remove leading and trailing spaces.
4. Removal of HTML tags: If you scrape data from a website, removing HTML tags becomes an essential preprocessing step. Use a Python regular expression to find all the unwanted tags, define a custom function remove_tag() which strips the HTML tags from the text, and finally apply this function to the Twitter dataframe.
5. Removal of URLs: Remove URLs present in the data. If we are doing a Twitter analysis, there is a good chance that a tweet will contain a URL, and it needs to be removed for further analysis.
6. Extraction of hashtags: A hashtag is a keyword or phrase preceded by the hash symbol (#), written within a post or comment to highlight it and make it searchable, e.g. #like, #gfg, #selfie.
7. Removing punctuation: Remove the set of symbols [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].
8. Spelling correction: Typos are common in text data, and we may want to correct those spelling mistakes before analysis. This ensures we get better results from our models.
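The code below demonstrates steps 1-7 on a sample string. Spelling correction (step 8) is not included there; a minimal sketch using the TextBlob library's correct() method (TextBlob is also used in Experiment 8) could look like this, where the misspelled sentence is only an illustration:

# Sketch: spelling correction with TextBlob (assumes: pip install textblob)
from textblob import TextBlob

noisy = "I havv a speling mistke in this sentense"
corrected = TextBlob(noisy).correct()   # returns a corrected TextBlob object
print(str(corrected))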

import string
import re

input_text = ' \t #sample <HTML> <H1> #Greetings our college offering Computer science courses in {B.Tech} with 3 sPecializations & [M.TEch] with 1 specializations #ComputerScience->totalcourses:? https://www.aec.edu.in is our coLLege website !!! </H1> .... \t '

# 1. Lowercasing
input_text = input_text.lower()
print(input_text)

# 2. Removing numbers
input_text = re.sub(r'\d+', '', input_text)
print(input_text)

# 3. Removing leading/trailing whitespace
input_text = input_text.strip()
print(input_text)

# 4. Removal of HTML tags
html_pattern = re.compile('<.*?>')
input_text = html_pattern.sub(r'', input_text)
print(input_text)

# 5. Removal of URLs
url_pattern = re.compile(r'https?://\S+|www\.\S+')
input_text = url_pattern.sub(r'', input_text)
print(input_text)

# 6. Removal of hashtags
hash_pattern = re.compile(r'#[a-z]+')
input_text = hash_pattern.sub(r' ', input_text)
print(input_text)

# 7. Removing punctuation
for punc in string.punctuation:
    if punc in input_text:
        input_text = input_text.replace(punc, '')
print(input_text)

# Removing a custom pattern with regular expressions
x = re.findall(r"is our \S+ \S+", input_text)
input_text = re.sub(r"is our \S+ \S+", "", input_text)
print(input_text)

tweet = """if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"""
x = re.sub('[^a-zA-Z0-9 ]+', '', tweet)
print(x)

Output: #sample <html> <h1> #greetings our college offering computer science courses in
{b.tech} with 3 specializations & [m.tech] with 1 specializations #computerscience-
>totalcourses:? https://www.aec.edu.in is our college website !!! </h1> ....
#sample <html> <h> #greetings our college offering computer science courses in {b.tech}
with specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! </h> ....
#sample <html> <h> #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! </h> ....

#sample #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! ....

#sample #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:? is our college
website !!! ....

our college offering computer science courses in {b.tech} with specializations & [m.tech] with
specializations ->totalcourses:? is our college website !!! ....

our college offering computer science courses in btech with specializations mtech with
specializations totalcourses is our college website

our college offering computer science courses in btech with specializations mtech with
specializations totalcourses

if you hold an empty gatorade bottle up to your ear you can hear the sports 100
Experiment -2
Aim: Perform lemmatization and stemming using the Python library NLTK

Description: A data set containing news articles is a corpus, and a collection of tweets is a corpus. A corpus consists of documents, documents comprise paragraphs, paragraphs comprise sentences, and sentences comprise smaller units called tokens.
Tokens can be words, phrases, or n-grams, where an n-gram is defined as a group of n words taken together.

Sentence: "I like my iphone". For this sentence, the different n-grams are as follows:
Uni-grams (n=1): I, like, my, iphone. Bi-grams (n=2): I like, like my, my iphone. Tri-grams (n=3): I like my, like my iphone. So uni-grams represent one word, bi-grams represent two words together, and tri-grams represent three words together.
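As a small illustration (not part of the experiment code), NLTK's ngrams utility generates these sequences from a tokenized sentence:

# Sketch: uni-, bi- and tri-grams with nltk.util.ngrams
from nltk.util import ngrams

tokens = "I like my iphone".split()
print(list(ngrams(tokens, 1)))   # uni-grams
print(list(ngrams(tokens, 2)))   # bi-grams
print(list(ngrams(tokens, 3)))   # tri-grams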
Text object: A text object is a sentence, a phrase, a word, or an article.
Morpheme: In NLP, a morpheme is defined as the base form of a word. A token is generally made up of two components:
Morphemes: the base form of the word, and
Inflectional forms: the suffixes and prefixes added to the morpheme.
Let us now discuss tokenization:

Tokenization: Tokenization is the process of splitting a text object into smaller units, which are called tokens. Examples of tokens include words, numbers, n-grams, or even symbols. The most commonly used tokenization process is white-space tokenization.
Different types of tokenization:
i) White-space tokenization ii) Regular-expression tokenization

i) White-space tokenization: Also known as unigram tokenization. In this process, we split the entire text into words by splitting on white space.
Sentence: "I went to New-York to play football". Tokens generated: "I", "went", "to", "New-York", "to", "play", "football". Notice that "New-York" is not split further because the tokenization process is based on whitespace only.
ii) Regular-expression tokenization: In this type of tokenization, a regular-expression pattern is used to get the tokens. Consider a string containing multiple delimiters such as comma, semicolon, and white space.
Sentence = "Basketball, Hockey; Golf Tennis". Using re.split(r'[;,\s]+', Sentence), the tokens generated are: "Basketball", "Hockey", "Golf", "Tennis".
Tokenization can be performed at the sentence level, at the word level, or even at the character level, as shown in the sketch below:
1. Sentence Tokenization 2. Word Tokenization
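A minimal sketch of these tokenization levels with NLTK (the library used in the experiment code below); the sample text is only an illustration:

# Sketch: sentence-, word- and regular-expression tokenization with NLTK
import nltk
nltk.download('punkt')   # tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

text = "Basketball, Hockey; Golf Tennis. I went to New-York to play football."
print(sent_tokenize(text))                      # sentence tokenization
print(word_tokenize(text))                      # word tokenization
print(RegexpTokenizer(r'\w+').tokenize(text))   # regular-expression tokenization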
Stemming is a kind of normalization for words. It is a technique in which the words in a sentence are converted to a shortened base form to simplify lookup. Words which have the same meaning but vary according to the context or sentence are normalized. Stemming is hence a way to find the root word from variations of the word. NLTK provides the inbuilt stemmers Porter Stemmer, Snowball Stemmer, and Lancaster Stemmer.
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word, which is known as the lemma. Lemmatization can be done with or without a POS tag. A POS (part-of-speech) tag assigns a tag to each word and hence increases the accuracy of the lemma in the context of the dataset. For example, the word 'leaves' without a POS tag would get lemmatized to the word 'leaf', but with a verb tag its lemma would become 'leave'.
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
import nltk
nltk.download('punkt')  # needed for word_tokenize below

# Snowball Stemmer has language as a parameter.
words = ["Walking", "Swimming", "Computer", "Computing", "Language", "Natual", "Education", "easy", "irrational", "relation"]

# Create instances of both stemmers, and stem the words using them.
stemmer_ps = PorterStemmer()             # an instance of Porter Stemmer
stemmed_words_ps = [stemmer_ps.stem(word) for word in words]
print("Porter stemmed words: ", stemmed_words_ps)

stemmer_ss = SnowballStemmer("english")  # an instance of Snowball Stemmer
stemmed_words_ss = [stemmer_ss.stem(word) for word in words]
print("Snowball stemmed words: ", stemmed_words_ss)

sentence = "I was wonder when I walk in Indian roads because everybody using computers to understand the language so they forgot their mother language it is natural because people are edicted to computer it is irritating me"
token_words = nltk.word_tokenize(sentence)  # tokenize the sentence, or else stemming will return the entire sentence as is
stem_sentence = []
for word in token_words:
    stem_sentence.append(stemmer_ps.stem(word))
print("The Porter stemmed sentence is: ", stem_sentence)

Output:
Porter stemmed words: ['walk', 'swim', 'comput', 'comput', 'languag', 'natual', 'educ', 'easi', 'irrat',
'relat']
Snowball stemmed words: ['walk', 'swim', 'comput', 'comput', 'languag', 'natual', 'educ', 'easi',
'irrat', 'relat']
The Porter stemmed sentence is: ['i', 'wa', 'wonder', 'when', 'i', 'walk', 'in', 'indian', 'road', 'becaus',
'everybodi', 'use', 'comput', 'to', 'understand', 'the', 'languag', 'so', 'they', 'forgot', 'their', 'mother',
'languag', 'it', 'is', 'natur', 'becaus', 'peopl', 'are', 'edict', 'to', 'comput', 'it', 'is', 'irrit', 'me']
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

words = ["grows", "leaves", "fairly", "cats", "trouble", "running", "friendships", "easily", "was", "relational", "has"]

lemmatizer = WordNetLemmatizer()  # an instance of WordNet Lemmatizer

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("The lemmatized words: ", lemmatized_words)  # prints the lemmatized words

lemmatized_words_pos = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("The lemmatized words using a POS tag: ", lemmatized_words_pos)  # prints POS-tagged lemmatized words

sentence = "I was wonder when I walk in Indian roads because everybody using computers to understand the language so they forgot their mother language it is natural because people are edicted to computer it is irritating me"
token_words = nltk.word_tokenize(sentence)  # tokenize the sentence, or else lemmatizing will return the entire sentence as is
lemma_sentence = []
for word in token_words:
    lemma_sentence.append(lemmatizer.lemmatize(word))
print("The lemmatized sentence is: ", lemma_sentence)

Output:
The lemmatized words: ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa',
'relational', 'ha']
The lemmatized words using a POS tag: ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships',
'easily', 'be', 'relational', 'have']
The lemmatized sentence is: ['I', 'wa', 'wonder', 'when', 'I', 'walk', 'in', 'Indian', 'road', 'because',
'everybody', 'using', 'computer', 'to', 'understand', 'the', 'language', 'so', 'they', 'forgot', 'their',
'mother', 'language', 'it', 'is', 'natural', 'because', 'people', 'are', 'edicted', 'to', 'computer', 'it', 'is',
'irritating', 'me']
Experiment -3
Aim : Demonstrate object standardization such as replace social media slangs from a text

Description: NLP is used for chatbots, summaries of articles or texts, language translation, and similar tasks. NLP includes steps such as preprocessing, entity extraction, and word-frequency measurement. During noise reduction, operations are performed on connectors such as "and, or, but".

Object standardization: Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models. Examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed; the code below uses a dictionary-lookup method to replace social media slang in a text. Other types of text preprocessing include encoding/decoding noise, grammar checking, and spelling correction.
The dictionary can also hold normalizations of words from the same root, such as "I do, I did, I will do". Object standardization is a preprocessing technique that can be applied to abbreviations such as "rt → retweet, dm → direct message".
After preprocessing, entity extraction and entity selection are performed; at this stage the relevant topics are extracted from the text. One of the techniques used for topic modeling is Latent Dirichlet Allocation (LDA).

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love", "...": " "}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]  # replace slang with its standard form
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

print(lookup_words("RT this is a retweeted dm message tweet by Shivam Bansal"))

print(lookup_dict.keys())
print(lookup_dict.values())
Output:
Message : RT this is a retweeted dm message tweet by Shivam Bansal
Converted Message : Retweet this is a retweeted direct message message tweet by Shivam Bansal
dict_keys(['rt', 'dm', 'awsm', 'luv', '...'])
dict_values(['Retweet', 'direct message', 'awesome', 'love', ' '])

Experiment -4
Aim: Perform part of speech tagging on any textual data.

Description: One of the more powerful aspects of the NLTK module is the part-of-speech tagging that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs, etc. Even more impressively, it also labels by tense, and more. Here is a list of some of the tags and what they mean:

Part-of-speech (POS) tagging is just what it sounds like: the process goes through the words in your corpus and tags them with metadata indicating whether those words are nouns, verbs, adjectives, etc.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project (Number, Tag, Description):
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
The sentence tokenizer used here, PunktSentenceTokenizer, is capable of unsupervised machine learning, so you can train it on any body of text.
Create the training and testing data. The data sets used are i) the State of the Union address from 2005 and ii) the State of the Union address from 2006 of President George W. Bush.
Train the Punkt tokenizer on the 2005 address.
Finish the part-of-speech tagging script by creating a function that runs through and tags all of the parts of speech, sentence by sentence.

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# train the Punkt sentence tokenizer on the 2005 address, then tokenize the 2006 address
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()

Output :
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'),
('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF',
'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'),
('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'),
('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'),
(',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'),
('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',',
','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':',
':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',',
','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'),
('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'),
('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope',
'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband',
'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','),
('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life',
'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'),
('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union',
'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',',
','), ('Jan', 'NNP'), ('.', '.')]

Experiment -5
Aim : Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
Description : Latent Dirichlet Allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set. It is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
Process of LDA:

Input: M, the number of documents; N, the number of words; K, the number of topics.

The model trains to output:
psi, the distribution of words for each topic k
phi, the distribution of topics for each document i

Required Python packages:

i. NLTK (Natural Language Toolkit)
ii. stop_words, a Python package containing stop words
iii. gensim, a topic modeling package containing our LDA model.
Steps involved
1) Loading data
2) Data cleaning
3) Exploratory analysis
4) Preparing data for LDA analysis
5) LDA model training
6) Analyzing LDA model results
Data Cleaning methods :
Tokenizing: converting a document to its atomic elements.
Stopping: removing meaningless words.
Stemming: merging words that are equivalent in meaning.

Constructing a document-term matrix : The result of the cleaning stage is texts: for each document, a tokenized, stopped, and stemmed list of words. We loop through all our documents and append each one to texts, so texts is a list of lists, one list for each of our original documents. To generate an LDA model, we need to understand how frequently each term occurs within each document.

Construct a document-term matrix with a package called gensim : The Dictionary() function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics. The dictionary must then be converted into a bag-of-words representation with doc2bow().

Applying the LDA model : corpus is a document-term matrix, and now we are ready to generate an LDA model. The LdaModel class is described in detail in the gensim documentation. Parameters used in our example:
num_topics: required. An LDA model requires the user to determine how many topics should be generated. Our document set is small, so we only ask for a few topics (two, and then four, in the code below).
id2word: required. The LdaModel class requires our previous dictionary to map ids to strings.
passes: optional. The number of passes the model will take through the corpus. The greater the number of passes, the more accurate the model will be, but many passes can be slow on a very large corpus.
Examining the results : the LDA model is now stored as ldamodel, and its topics can be inspected with the print_topic and print_topics methods.
LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution, like the ones in our walkthrough model. In other words, LDA assumes a document is made from the following steps:
1. Determine the number of words in the document. Let us say our document has 6 words.
2. Determine the mixture of topics in that document. For example, the document might contain 1/2 the topic "health" and 1/2 the topic "vegetables."
3. Using each topic's multinomial distribution, output words to fill the document's word slots. In our example, the "health" topic is 1/2 our document, or 3 words. The "health" topic might have the word "diet" at 20% probability or "exercise" at 15%, so it fills the document's word slots based on those probabilities.
Given this assumption of how documents are created, LDA backtracks and tries to figure out what topics would create those documents in the first place.

!pip install stop_words

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

print(ldamodel.num_terms)
print(ldamodel.num_topics)
print(ldamodel.get_topics())
ldamodel.print_topics()
print(ldamodel.print_topics(num_topics=3, num_words=3))

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=4, num_words=8))

Output : 32
2
[[0.08628245 0.07171855 0.07067294 0.08623768 0.04238176 0.07171956
0.04237814 0.04237782 0.04220941 0.04237783 0.04237831 0.04237794
0.04237753 0.01423728 0.01423811 0.01423659 0.01520271 0.0142376
0.01423843 0.01425746 0.01423829 0.01423937 0.01425364 0.01425225
0.01425186 0.0142531 0.01425185 0.0142523 0.01425273 0.01425207
0.01568182 0.01568263]
[0.02228727 0.03435219 0.01177719 0.02232435 0.01177278 0.03435134
0.01177578 0.01177604 0.05879794 0.01177604 0.01177564 0.01177595
0.01177629 0.03508804 0.03508734 0.03508861 0.08117064 0.03508777
0.03508708 0.05851251 0.0350872 0.0350863 0.03507448 0.03507564
0.03507596 0.03507493 0.03507596 0.03507559 0.03507524 0.03507578
0.03389136 0.03389069]]
[(0, '0.086*"brocolli" + 0.086*"good" + 0.072*"mother"'), (1, '0.081*"health" + 0.059*"drive" +
0.059*"pressur"')]
[(0, '0.161*"health" + 0.089*"profession" + 0.089*"say" + 0.089*"good" + 0.089*"brocolli" +
0.018*"mother" + 0.018*"brother" + 0.018*"pressur"'), (1, '0.063*"brother" + 0.063*"mother" +
0.063*"pressur" + 0.062*"never" + 0.062*"feel" + 0.062*"seem" + 0.062*"well" +
0.062*"perform"'), (2, '0.132*"brocolli" + 0.132*"good" + 0.132*"eat" + 0.074*"brother" +
0.074*"mother" + 0.073*"like" + 0.015*"health" + 0.015*"drive"'), (3, '0.083*"drive" +
0.046*"blood" + 0.046*"suggest" + 0.046*"increas" + 0.046*"tension" + 0.046*"expert" +
0.046*"may" + 0.046*"caus"')]
Experiment -6
Aim : Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using python
Description :
TF-IDF is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents. Words within a text document are transformed into importance numbers by a text vectorization process. TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with its Inverse Document Frequency (IDF).
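As a quick worked example (illustrative numbers, not taken from the code below): if a word appears 2 times in a 10-word document, its TF is 2/10; if it appears in 2 of the 3 documents in a corpus, its IDF is log10(3/2); the TF-IDF score is their product. The same arithmetic in Python:

# Worked TF-IDF arithmetic for a single word (base-10 log, matching the example code below)
import numpy as np

tf = 2 / 10              # occurrences of the word in the document / words in the document
idf = np.log10(3 / 2)    # log10(documents in corpus / documents containing the word)
print(tf * idf)          # 0.2 * 0.17609... ≈ 0.0352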

TF-IDF is useful in many natural language processing applications, for example:

1. Search engines, to rank the relevance of a document for a query.
2. Text classification, text summarization, and topic modelling.

Example
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# collect the set of unique words across the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

n_docs = len(corpus)          # Number of documents in the corpus
n_words_set = len(words_set)  # Number of unique words in the corpus

df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # Words in the document
    for w in words:
        df_tf[w][i] = df_tf[w][i] + (1 / len(words))

df_tf
# The dataframe shows the frequency of each word in each document:
# a column for each word and a row for each document.

# Compute Inverse Document Frequency (IDF)
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0  # number of documents in the corpus that contain this word
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')

# Compute TF-IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf[w][i] = df_tf[w][i] * idf[w]

df_tf_idf
# "data" has an IDF of 0 because it appears in every document,
# so it is not considered an important term in this corpus.

Output :
Number of words in the corpus: 14
The words in the corpus:
{'science', 'is', 'best', 'scientists', 'analyze', 'most', 'courses', 'this', 'one', 'fields', 'data', 'the',
'important', 'of'}
IDF of:
science: 0.17609125905568124
is: 0.17609125905568124
best: 0.47712125471966244
scientists: 0.47712125471966244
analyze: 0.47712125471966244
most: 0.47712125471966244
courses: 0.47712125471966244
this: 0.47712125471966244
one: 0.17609125905568124
fields: 0.47712125471966244
data: 0.0
the: 0.17609125905568124
important: 0.47712125471966244
of: 0.17609125905568124

    science        is      best  scientists   analyze      most   courses      this       one  \
0  0.032017  0.016008  0.000000    0.000000  0.000000  0.043375  0.000000  0.000000  0.016008
1  0.019566  0.019566  0.053013    0.000000  0.000000  0.000000  0.053013  0.053013  0.019566
2  0.000000  0.000000  0.000000    0.119280  0.119280  0.000000  0.000000  0.000000  0.000000

     fields  data       the  important        of
0  0.043375   0.0  0.016008   0.043375  0.032017
1  0.000000   0.0  0.019566   0.000000  0.019566
2  0.000000   0.0  0.000000   0.000000  0.000000

Experiment -7
Aim : Demonstrate word embeddings using word2vec
Description : Word embedding is one of the most important techniques in natural language
processing(NLP), where words are mapped to vectors of real numbers. Word embedding is capable
of capturing the meaning of a word in a document, semantic and syntactic similarity, relation with
other words. It also has been widely used for recommender systems and text classification.
Word2vec is one of the most popular techniques for learning word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec makes natural language computer-readable; mathematical operations on the resulting word vectors can then be used to detect word similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.

gensim is an open-source Python library for natural language processing, developed and maintained by a Czech natural language processing researcher. The Gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or skip-gram algorithm.

Train the gensim word2vec model

model = Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)

The hyperparameters of this model:

size: The number of dimensions of the embeddings; the default is 100. (In gensim 4.x this parameter has been renamed to vector_size.)
window: The maximum distance between a target word and the words around it. The default window is 5.
min_count: The minimum count of words to consider when training the model; words with fewer occurrences than this count are ignored. The default for min_count is 5.
workers: The number of worker threads used during training; the default is 3.
sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.

After training the word2vec model, obtain the word embedding from the training model.
Finally print the model.
from gensim.models import Word2Vec

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary (gensim 3.x API; with gensim >= 4.0 use list(model.wv.index_to_key))
words = list(model.wv.vocab)
print(words)

# access vector for one word (with gensim >= 4.0 use model.wv['sentence'])
print(model['sentence'])

# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Output :
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[ 5.9599371e-04 3.6903401e-03 2.2744297e-03 5.7322328e-04
-4.7999555e-03 4.1460539e-03 3.6190548e-03 4.4815554e-03
-9.4492309e-04 -2.3332548e-03 -7.7754230e-04 -2.0325035e-03
-4.9208495e-05 -3.8984963e-03 2.2744499e-03 1.9393873e-03
1.0208354e-03 2.7080898e-03 1.9608904e-03 1.0961948e-03
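Once the model is trained, word similarities can be queried. A small optional sketch (not in the manual; with the tiny corpus above the numbers are not meaningful):

# Sketch: similarity queries on the trained word2vec model
print(model.wv.most_similar('sentence', topn=3))   # nearest neighbours of 'sentence'
print(model.wv.similarity('this', 'is'))           # cosine similarity between two words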

Experiment - 8
Aim : Implement text classification using the Naive Bayes classifier and the TextBlob library.
Description :
Text classifiers are systems that classify your texts and divide them into different classes.
TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.

Step-1 Install textblob.
Step-2 Download the data files that textblob uses for its functionality and for nltk.
Step-3 Train the classifier based on the Naive Bayes classifier.
Step-4 Test the data using the classifier to get your text classified.
Step-5 Calculate the accuracy of the classifier.

!pip install textblob


import nltk
nltk.download('punkt')

train = [
('What an amazing weather.', 'pos'),
('this is an amazing idea!', 'pos'),
('I feel very good about these ideas.', 'pos'),
('this is my best performance.', 'pos'),
("what an awesome view", 'pos'),
('I do not like this place', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with all this tension", 'neg'),
('he is my sworn enemy!', 'neg'),
('my friends is horrible.', 'neg')
]
test = [
('the food was great.', 'pos'),
('I do not want to live anymore', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Ramesh is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
print(cl.classify("This is an amazing library!"))
# Lets test the accuracy of the classifier
print(cl.accuracy(test))
# print(cl.classify("my friends is tension"))
# print(cl.accuracy(test))
cl.show_informative_features(4)

Output :
pos
0.8333333333333334
Most Informative Features
contains(I) = True neg : pos = 2.3 : 1.0
contains(an) = False neg : pos = 2.2 : 1.0
contains(I) = False pos : neg = 1.8 : 1.0
contains(my) = True neg : pos = 1.7 : 1.0
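Besides a single label, TextBlob's Naive Bayes classifier can also report label probabilities via prob_classify. An optional sketch, assuming the classifier cl trained above:

# Sketch: probability distribution over labels for a new sentence
prob_dist = cl.prob_classify("I feel amazing about this idea")
print(prob_dist.max())                                   # most likely label
print(round(prob_dist.prob("pos"), 2), round(prob_dist.prob("neg"), 2))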
Experiment -9
Aim : Apply support vector machine for text classification

Description : Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used for classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyper-plane that differentiates the two classes very well. Support vectors are the coordinates of individual observations. The SVM classifier is a frontier (hyper-plane/line) that best segregates the two classes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=1, gamma='auto').fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()

# repeat with an RBF kernel
svc = svm.SVC(kernel='rbf', C=1, gamma='auto').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with RBF kernel')
plt.show()
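The demonstration above uses the iris data set so the decision regions can be plotted. Since the aim is text classification, a minimal sketch of an SVM text classifier (TF-IDF features plus scikit-learn's LinearSVC, on a tiny made-up training set) might look like this:

# Sketch: SVM for text classification with TF-IDF features (toy data, illustrative only)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["what an amazing idea", "this is my best performance",
               "I do not like this place", "I am tired of this stuff"]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())   # vectorize, then classify
clf.fit(train_texts, train_labels)
print(clf.predict(["what an awesome view", "I can't deal with this tension"]))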
Experiment -10
Aim: Convert text to vectors (using term frequency) and apply cosine similarity to provide
closeness among two texts.

Description: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that "measures the cosine of the angle between them".
Cosine similarity tends to determine how similar two words or sentences are. It can be used for sentiment analysis and text comparison, and is used by many popular packages such as word2vec.

The dot product is also called the scalar product because the dot product of two vectors gives a scalar result.
For example, Vector(A) = [5, 0, 2] and Vector(B) = [2, 5, 0].
The dot product Vector(A) . Vector(B) = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10.
The smaller the angle between two documents, the more similar they are; the cosine of the angle increases as the angle decreases, since cos 0 = 1 and cos 90 = 0.
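A quick check of that arithmetic with NumPy (using the example vectors above):

# Sketch: cosine similarity of the two example vectors
import numpy as np

A = np.array([5, 0, 2])
B = np.array([2, 5, 0])
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)   # 10 / (sqrt(29) * sqrt(29)) = 10/29 ≈ 0.345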

First, calculate the cosine similarity between the documents. Convert the documents/sentences/words into feature vectors first.
Useful methods for feature extraction: i) Bag of Words ii) TF-IDF.

Bag of Words counts the unique words in the documents and the frequency of each word. Scikit-learn's CountVectorizer extracts the Bag-of-Words features.

The TF-IDF score of a word ranks its importance in a document:

tfidf score of a word w = tf(w) * idf(w)
tf(w) = number of times the word appears in a document / total number of words in the document
idf(w) = log(number of documents / number of documents that contain word w)

Use the Scikit-learn cosine_similarity function to compare the first document, i.e. Document 0, with the other documents in the corpus.

The result gives the cosine similarities of Document 0 compared with the other documents in the corpus. The first element in the array is 1, which means Document 0 compared with Document 0; the next elements, 0.08619387, 0, 0, are Document 0 compared with Documents 1, 2 and 3.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

count_vect = CountVectorizer()

Document1 = "Aditya Engineering College situated at Surampalem"
Document2 = "Engineering Colleges offer computer science courses in MCA AIML CSE IT departments"
Document3 = "Computer science students have opprtunities in IT sector"
Document4 = "IT sector hire students with skills in computer science"
corpus = [Document1, Document2, Document3, Document4]

# Bag-of-Words (term frequency) features
X_train_counts = count_vect.fit_transform(corpus)
df1 = pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document2', 'Document3'])

# TF-IDF features
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)
df2 = pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

print(df1)
print(df2)

# cosine similarity of Documents 0, 1 and 2 against all documents
cosine_similarity(trsfm[0:3], trsfm)
Output :
aditya aiml at college colleges computer courses cse \
Document 0 1 0 1 1 0 0 0 0
Document 1 0 1 0 0 1 1 1 1
Document2 0 0 0 0 0 1 0 0
Document3 0 0 0 0 0 1 0 0

departments engineering ... mca offer opprtunities science \


Document 0 0 1 ... 0 0 0 0
Document 1 1 1 ... 1 1 0 1
Document2 0 0 ... 0 0 1 1
Document3 0 0 ... 0 0 0 1

sector situated skills students surampalem with


Document 0 0 1 0 0 1 0
Document 1 0 0 0 0 0 0
Document2 1 0 0 1 0 0
Document3 1 0 1 1 0 1

[4 rows x 24 columns]
aditya aiml at college colleges computer \
Document 0 0.421765 0.000000 0.421765 0.421765 0.000000 0.000000
Document 1 0.000000 0.328776 0.000000 0.000000 0.328776 0.209853
Document 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.289152
Document 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.263386

courses cse departments engineering ... mca \


Document 0 0.000000 0.000000 0.000000 0.332524 ... 0.000000
Document 1 0.328776 0.328776 0.328776 0.259211 ... 0.328776
Document 2 0.000000 0.000000 0.000000 0.000000 ... 0.000000
Document 3 0.000000 0.000000 0.000000 0.000000 ... 0.000000

offer opprtunities science sector situated skills \


Document 0 0.000000 0.000000 0.000000 0.000000 0.421765 0.000000
Document 1 0.328776 0.000000 0.209853 0.000000 0.000000 0.000000
Document 2 0.000000 0.453012 0.289152 0.357160 0.000000 0.000000
Document 3 0.000000 0.000000 0.263386 0.325334 0.000000 0.412645

students surampalem with


Document 0 0.000000 0.421765 0.000000
Document 1 0.000000 0.000000 0.000000
Document 2 0.357160 0.000000 0.000000
Document 3 0.325334 0.000000 0.412645

[4 rows x 24 columns]
array([[1. , 0.08619387, 0. , 0. ],
[0.08619387, 1. , 0.24271786, 0.22108976],
[0. , 0.24271786, 1. , 0.53702605]])
