NLP Lab Manual

Experiment -1
Aim: Demonstrate noise removal for any textual data and remove regular expression patterns such as hashtags from textual data
import string
import re

input_text = ' \t #sample <HTML> <H1> #Greetings our college offering Computer science courses in {B.Tech} with 3 sPecializations & [M.TEch] with 1 specializations #ComputerScience->totalcourses:? https://www.aec.edu.in is our coLLege website !!! </H1> .... \t '

# convert to lower case
input_text = input_text.lower()
print(input_text)

# remove digits
input_text = re.sub(r'\d+', '', input_text)
print(input_text)

# strip leading and trailing whitespace
input_text = input_text.strip()
print(input_text)

# remove HTML tags
html_pattern = re.compile('<.*?>')
input_text = html_pattern.sub(r'', input_text)
print(input_text)

# remove URLs
url_pattern = re.compile(r'https?://\S+|www\.\S+')
input_text = url_pattern.sub(r'', input_text)
print(input_text)

# remove hashtags
hash_pattern = re.compile(r'#[a-z]+')
input_text = hash_pattern.sub(r' ', input_text)
print(input_text)

# remove punctuation
for punc in string.punctuation:
    if punc in input_text:
        input_text = input_text.replace(punc, '')
print(input_text)

# remove a specific pattern
x = re.findall(r"is our \S+ \S+", input_text)
input_text = re.sub(r"is our \S+ \S+", "", input_text)
print(input_text)

tweet = """if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"""
x = re.sub('[^a-zA-Z0-9 ]+', '', tweet)
print(x)
Output :
#sample <html> <h1> #greetings our college offering computer science courses in
{b.tech} with 3 specializations & [m.tech] with 1 specializations #computerscience-
>totalcourses:? https://www.aec.edu.in is our college website !!! </h1> ....
#sample <html> <h> #greetings our college offering computer science courses in {b.tech}
with specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! </h> ....
#sample <html> <h> #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! </h> ....
#sample #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:?
https://www.aec.edu.in is our college website !!! ....
#sample #greetings our college offering computer science courses in {b.tech} with
specializations & [m.tech] with specializations #computerscience->totalcourses:? is our college
website !!! ....
our college offering computer science courses in {b.tech} with specializations & [m.tech] with
specializations ->totalcourses:? is our college website !!! ....
our college offering computer science courses in btech with specializations mtech with
specializations totalcourses is our college website
our college offering computer science courses in btech with specializations mtech with
specializations totalcourses
if you hold an empty gatorade bottle up to your ear you can hear the sports 100
Experiment -2
Aim: Perform lemmatization and stemming using the Python library NLTK
Description: A data set containing news articles is a corpus, and a collection of tweets is a corpus. A corpus consists of documents, documents comprise paragraphs, paragraphs comprise sentences, and sentences comprise smaller units called tokens.
Tokens can be words, phrases, or n-grams, where an n-gram is defined as a group of n words taken together.
Sentence: I like my iphone. For the above sentence, the different n-grams are as follows:
Uni-grams (n=1): I, like, my, iphone
Bi-grams (n=2): I like, like my, my iphone
Tri-grams (n=3): I like my, like my iphone
So uni-grams represent one word, bi-grams represent two words together, and tri-grams represent three words together.
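A minimal sketch of generating these n-grams with NLTK's ngrams utility (assuming NLTK is installed):

from nltk.util import ngrams

tokens = "I like my iphone".split()
print(list(ngrams(tokens, 1)))  # uni-grams: ('I',), ('like',), ('my',), ('iphone',)
print(list(ngrams(tokens, 2)))  # bi-grams: ('I', 'like'), ('like', 'my'), ('my', 'iphone')
print(list(ngrams(tokens, 3)))  # tri-grams: ('I', 'like', 'my'), ('like', 'my', 'iphone')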
Text object: a text object is a sentence, a phrase, a word or an article.
Morpheme: in the field of NLP, a morpheme is defined as the base form of a word. A token is generally made up of two components:
i) Morphemes: the base form of the word, and
ii) Inflectional forms: the suffixes and prefixes added to morphemes.
Let us discuss the structure of the tokens.
Tokenization: the process of splitting a text object into smaller units, which are also called tokens. Examples of tokens include words, numbers, n-grams, or even symbols. The most commonly used tokenization process is white-space tokenization.
Different types of tokenization:
i) White-space tokenization ii) Regular-expression tokenization
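The stemming code that produced the output below is not printed in the manual. A minimal sketch of how NLTK's Porter and Snowball stemmers are typically applied; the word list here is a placeholder (the original list is not shown), while the sentence is the same one used in the lemmatization code further down:

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
nltk.download('punkt')

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# placeholder word list for illustration (the manual does not show the original list)
words = ["walking", "swimming", "computer", "computing", "languages", "education", "easy", "relational"]
print("Porter stemmed words:", [porter.stem(w) for w in words])
print("Snowball stemmed words:", [snowball.stem(w) for w in words])

sentence = ("I was wonder when I walk in Indian roads because everybody using computers "
            "to understand the language so they forgot their mother language it is natural "
            "because people are edicted to computer it is irritating me")
token_words = nltk.word_tokenize(sentence)
print("The Porter stemmed sentence is:", [porter.stem(word) for word in token_words])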
Output :
Porter stemmed words: ['walk', 'swim', 'comput', 'comput', 'languag', 'natual', 'educ', 'easi', 'irrat',
'relat']
Snowball stemmed words: ['walk', 'swim', 'comput', 'comput', 'languag', 'natual', 'educ', 'easi',
'irrat', 'relat']
The Porter stemmed sentence is: ['i', 'wa', 'wonder', 'when', 'i', 'walk', 'in', 'indian', 'road', 'becaus',
'everybodi', 'use', 'comput', 'to', 'understand', 'the', 'languag', 'so', 'they', 'forgot', 'their', 'mother',
'languag', 'it', 'is', 'natur', 'becaus', 'peopl', 'are', 'edict', 'to', 'comput', 'it', 'is', 'irrit', 'me']
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
words = ["grows", "leaves", "fairly", "cats", "trouble", "running", "friendships", "easily", "was", "relational", "has"]
print("The lemmatized words: ", [lemmatizer.lemmatize(word) for word in words])
# lemmatize again, this time telling the lemmatizer to treat each word as a verb (POS tag 'v')
print("The lemmatized words using a POS tag: ", [lemmatizer.lemmatize(word, pos='v') for word in words])

sentence = "I was wonder when I walk in Indian roads because everybody using computers to understand the language so they forgot their mother language it is natural because people are edicted to computer it is irritating me"
token_words = nltk.word_tokenize(sentence)  # we need to tokenize the sentence, or else lemmatizing will return the entire sentence as is
lemma_sentence = []
for word in token_words:
    lemma_sentence.append(lemmatizer.lemmatize(word))
print("The lemmatized sentence is: ", lemma_sentence)
Output:
The lemmatized words: ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa',
'relational', 'ha']
The lemmatized words using a POS tag: ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships',
'easily', 'be', 'relational', 'have']
The lemmatized sentence is: ['I', 'wa', 'wonder', 'when', 'I', 'walk', 'in', 'Indian', 'road', 'because',
'everybody', 'using', 'computer', 'to', 'understand', 'the', 'language', 'so', 'they', 'forgot', 'their',
'mother', 'language', 'it', 'is', 'natural', 'because', 'people', 'are', 'edicted', 'to', 'computer', 'it', 'is',
'irritating', 'me']
Experiment -3
Aim : Demonstrate object standardization such as replace social media slangs from a text
Description: NLP is used for chat bots, summaries of articles or texts, language translation, and
verbal view description. NLP includes steps such as pre-processing, entity extraction, word
frequency measurements. With noise reduction, operations are performed on connectors such as
“and, or, but”.
Object Standardization : Text data often contains words or phrases which are not present in any
standard lexical dictionaries. These pieces are not recognized by search engines and models.
Examples – acronyms, hashtags with attached words, and colloquial slangs. With the help of
regular expressions and manually prepared data dictionaries, this type of noise can be fixed, the
code below uses a dictionary lookup method to replace social media slangs from a text. Other
types of text preprocessing includes encoding-decoding noise, grammar checker, and spelling
correction
The dictionary contains the normalization process of the words from the same root, such as “I do,
I do, I will do” normalization. Object standardization is pre-processing techniques that can be done
on abbreviations such as “rt → retweet, dm → direct message”.
After preprocessing, entity extraction and entity selection are performed at this stage. At this stage,
the relevant topic is removed from the text. One of the techniques used is Latent Dirichlet
Allocation for Topic Modeling (LDA)
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love", "...": " "}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # replace the word if it appears (in lower case) in the lookup dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text
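A short usage example (the input sentence is illustrative):

# replace slang tokens using the dictionary lookup defined above
print(lookup_words("RT I luv this awsm dm feature"))
# -> Retweet I love this awesome direct message feature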
Experiment -4
Aim: Perform part of speech tagging on any textual data.
Description: One of the more powerful aspects of the NLTK module is the Part of Speech tagging
that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs...etc.
Even more impressive, it also labels by tense, and more. Here's a list of the tags, what they mean,
and some examples:
Part-of-speech (POS) tagging is just what it sounds like: the process goes through the words in
your corpus and tags them with metadata, indicating whether those words are nouns, verbs,
adjectives, etc.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project: Number Tag
Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
The sentence tokenizer PunktSentenceTokenizer is capable of unsupervised machine learning, so it can be trained on any body of text that you use.
Create training and testing data. The data set used is i) the State of the Union address from 2005 and ii) the State of the Union address from 2006 of President George W. Bush.
Train the Punkt tokenizer on the 2005 address, then finish the part-of-speech tagging script by creating a function that runs through and tags all of the parts of speech for each sentence.
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

nltk.download('state_union')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# train the Punkt sentence tokenizer on the 2005 address, then split the 2006 address into sentences
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Output :
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'),
('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF',
'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'),
('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'),
('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'),
(',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'),
('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',',
','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':',
':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',',
','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'),
('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'),
('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope',
'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband',
'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','),
('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life',
'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'),
('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union',
'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',',
','), ('Jan', 'NNP'), ('.', '.')]
Experiment -5
Aim : Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.
Description : Latent Dirichlet allocation (LDA) is a topic model that generates topics based on
word frequency from a set of documents. LDA is particularly useful for finding reasonably
accurate mixtures of topics within a given document set. It is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
Process of LDA:
Constructing a document-term matrix: the result of the cleaning stage is texts, a tokenized, stopped and stemmed list of words from a single document. We loop through all our documents and append each one to texts, so texts becomes a list of lists, one list for each of our original documents. To generate an LDA model, we need to understand how frequently each term occurs within each document.
The document-term matrix is constructed with a package called gensim: the Dictionary() function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics. Each document is then converted into a bag-of-words representation using dictionary.doc2bow().
Applying the LDA model: corpus is a document-term matrix and we are now ready to generate an LDA model. The LdaModel class is described in detail in the gensim documentation. Parameters used in our example:
num_topics (required): an LDA model requires the user to determine how many topics should be generated. Our document set is small, so we only ask for a small number of topics (two, and then four, in the code below).
id2word (required): the LdaModel class requires our previous dictionary to map ids to strings.
passes (optional): the number of passes the model takes through the corpus. The greater the number of passes, the more accurate the model will be, but many passes can be slow on a very large corpus.
Examining the results: the LDA model is now stored as ldamodel; its topics can be inspected with the print_topic and print_topics methods.
LDA assumes documents are produced from a mixture of topics. Those topics then generate words
based on their probability distribution, like the ones in our walkthrough model. In other words,
LDA assumes a document is made from the following steps:
Determine the number of words in a document. Let’s say our document has 6 words. Determine
the mixture of topics in that document. For example, the document might contain 1/2 the topic
“health” and 1/2 the topic “vegetables.” Using each topic’s multinomial distribution, output words
to fill the document’s word slots. In our example, the “health” topic is 1/2 our document, or 3
words. The “health” topic might have the word “diet” at 20% probability or “exercise” at 15%, so
it will fill the document word slots based on those probabilities. Given this assumption of how
documents are created, LDA backtracks and tries to figure out what topics would create those
documents in the first place.
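The snippet below (from the manual) assumes the preprocessing described above has already produced p_stemmer, stopped_tokens, texts, dictionary and corpus. A minimal sketch of that preprocessing, assuming a small five-document example corpus about broccoli, health and driving similar to the one the output below reflects (the exact documents are an assumption):

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim
import nltk
nltk.download('stopwords')

# assumed example documents (the corpus used for the output below is not shown in the manual)
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

tokenizer = RegexpTokenizer(r'\w+')          # regular-expression tokenization
en_stop = set(stopwords.words('english'))    # English stop words
p_stemmer = PorterStemmer()

texts = []  # one tokenized, stopped and stemmed word list per document
for doc in doc_set:
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)                               # tokenize
    stopped_tokens = [t for t in tokens if t not in en_stop]       # remove stop words
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]   # stem tokens
    texts.append(stemmed_tokens)

# document-term matrix: assign an integer id to each token, then convert each document to bag-of-words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]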
# stem tokens (the same stemming step applied inside the preprocessing loop above)
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

# build a two-topic model first and inspect it
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print(ldamodel.num_terms)
print(ldamodel.num_topics)
print(ldamodel.get_topics())
ldamodel.print_topics()
print(ldamodel.print_topics(num_topics=3, num_words=3))

# re-train with four topics and print the top words of each topic
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=4, num_words=8))
Output : 32
2
[[0.08628245 0.07171855 0.07067294 0.08623768 0.04238176 0.07171956
0.04237814 0.04237782 0.04220941 0.04237783 0.04237831 0.04237794
0.04237753 0.01423728 0.01423811 0.01423659 0.01520271 0.0142376
0.01423843 0.01425746 0.01423829 0.01423937 0.01425364 0.01425225
0.01425186 0.0142531 0.01425185 0.0142523 0.01425273 0.01425207
0.01568182 0.01568263]
[0.02228727 0.03435219 0.01177719 0.02232435 0.01177278 0.03435134
0.01177578 0.01177604 0.05879794 0.01177604 0.01177564 0.01177595
0.01177629 0.03508804 0.03508734 0.03508861 0.08117064 0.03508777
0.03508708 0.05851251 0.0350872 0.0350863 0.03507448 0.03507564
0.03507596 0.03507493 0.03507596 0.03507559 0.03507524 0.03507578
0.03389136 0.03389069]]
[(0, '0.086*"brocolli" + 0.086*"good" + 0.072*"mother"'), (1, '0.081*"health" + 0.059*"drive" +
0.059*"pressur"')]
[(0, '0.161*"health" + 0.089*"profession" + 0.089*"say" + 0.089*"good" + 0.089*"brocolli" +
0.018*"mother" + 0.018*"brother" + 0.018*"pressur"'), (1, '0.063*"brother" + 0.063*"mother" +
0.063*"pressur" + 0.062*"never" + 0.062*"feel" + 0.062*"seem" + 0.062*"well" +
0.062*"perform"'), (2, '0.132*"brocolli" + 0.132*"good" + 0.132*"eat" + 0.074*"brother" +
0.074*"mother" + 0.073*"like" + 0.015*"health" + 0.015*"drive"'), (3, '0.083*"drive" +
0.046*"blood" + 0.046*"suggest" + 0.046*"increas" + 0.046*"tension" + 0.046*"expert" +
0.046*"may" + 0.046*"caus"')]
Experiment -6
Aim : Demonstrate Term Frequency – Inverse Document Frequency (TF – IDF) using python
Description :
TF-IDF is a widely used statistical method in natural language processing and information retrieval. It
measures how important a term is within a document relative to a collection of documents. Words
within a text document are transformed into importance numbers by a text vectorization process.
TF-IDF vectorizes/scores a word by multiplying the word’s Term Frequency (TF) with the Inverse
Document Frequency (IDF).
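In the code below, the term frequency of a word t in a document d is computed as TF(t, d) = (number of times t occurs in d) / (total number of words in d), and the inverse document frequency as IDF(t) = log10(N / n_t), where N is the number of documents in the corpus and n_t is the number of documents containing t. For instance, 'data' occurs in all three documents, so IDF('data') = log10(3/3) = 0, while 'scientists' occurs in only one, so IDF('scientists') = log10(3/1) ≈ 0.477. In the third document ('data scientists analyze data', 4 words), TF('scientists') = 1/4 = 0.25, giving a TF-IDF score of 0.25 * 0.477 ≈ 0.119.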
Example
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# collect the set of unique words in the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
n_docs = len(corpus)          # number of documents in the corpus
n_words_set = len(words_set)  # number of unique words
print('Number of words in the corpus:', n_words_set)
print('The words in the corpus:')
print(words_set)

# Term Frequency (TF): word count in a document divided by the document length
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))
for i in range(n_docs):
    words = corpus[i].split(' ')
    for w in words:
        df_tf[w][i] = df_tf[w][i] + (1 / len(words))
df_tf
# The dataframe shows the frequency of each word in each document:
# a column for each word and a row for each document.

# Inverse Document Frequency (IDF)
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0  # number of documents in the corpus that contain this word
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')

# TF-IDF = TF * IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf
# "data" has an IDF of 0 because it appears in every document,
# so it is not considered an important term in this corpus.
Output :
Number of words in the corpus: 14
The words in the corpus:
{'science', 'is', 'best', 'scientists', 'analyze', 'most', 'courses', 'this', 'one', 'fields', 'data', 'the',
'important', 'of'}
IDF of:
science: 0.17609125905568124
is: 0.17609125905568124
best: 0.47712125471966244
scientists: 0.47712125471966244
analyze: 0.47712125471966244
most: 0.47712125471966244
courses: 0.47712125471966244
this: 0.47712125471966244
one: 0.17609125905568124
fields: 0.47712125471966244
data: 0.0
the: 0.17609125905568124
important: 0.47712125471966244
of: 0.17609125905568124
    science        is      best  scientists   analyze      most   courses  \
0  0.032017  0.016008  0.000000    0.000000  0.000000  0.043375  0.000000
1  0.019566  0.019566  0.053013    0.000000  0.000000  0.000000  0.053013
2  0.000000  0.000000  0.000000    0.119280  0.119280  0.000000  0.000000

       this       one    fields  data       the  important        of
0  0.000000  0.016008  0.043375   0.0  0.016008   0.043375  0.032017
1  0.053013  0.019566  0.000000   0.0  0.019566   0.000000  0.019566
2  0.000000  0.000000  0.000000   0.0  0.000000   0.000000  0.000000
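For comparison, the same corpus can be vectorized with scikit-learn's TfidfVectorizer. Note that the resulting values will differ from the table above, because scikit-learn uses the natural logarithm, a smoothed IDF and L2 normalization by default. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# the same three-document corpus used above
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
print(pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out()))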
Experiment -7
Aim : Demonstrate word embeddings using word2vec
Description : Word embedding is one of the most important techniques in natural language processing (NLP), where words are mapped to vectors of real numbers. Word embedding is capable of capturing the meaning of a word in a document, semantic and syntactic similarity, and relations with other words. It has also been widely used for recommender systems and text classification.
Word2vec is one of the most popular techniques for learning word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, and further mathematical operations on words can then be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.
Gensim is an open-source Python library for natural language processing; it was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek. The Gensim library enables us to develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or the skip-gram algorithm.
After training the word2vec model, obtain the word embeddings from the trained model. Finally, print the model.
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
['this', 'is', 'the', 'second', 'sentence'],
['yet', 'another', 'sentence'],
['one', 'more', 'sentence'],
['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.vocab)
print(words)
# access vector for one word
print(model['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
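Note: the code and output above follow the gensim 3.x API. In gensim 4.0 and later, model.wv.vocab and direct indexing of the model were removed, and the size parameter was renamed. A minimal sketch of the equivalent calls under gensim >= 4, reusing the sentences list defined above:

# gensim >= 4: 'size' became 'vector_size'; sg=0 selects CBOW, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=100, min_count=1, sg=0)
print(model.wv.index_to_key)                        # vocabulary (replaces model.wv.vocab)
print(model.wv['sentence'])                         # vector for one word (replaces model['sentence'])
print(model.wv.most_similar('sentence', topn=3))    # nearest words in the embedding space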
Output :
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[ 5.9599371e-04 3.6903401e-03 2.2744297e-03 5.7322328e-04
-4.7999555e-03 4.1460539e-03 3.6190548e-03 4.4815554e-03
-9.4492309e-04 -2.3332548e-03 -7.7754230e-04 -2.0325035e-03
-4.9208495e-05 -3.8984963e-03 2.2744499e-03 1.9393873e-03
1.0208354e-03 2.7080898e-03 1.9608904e-03 1.0961948e-03
Experiment - 8
Aim : Implement Text classification using naïve bayes classifier and text blob library.
Description :
Text classifiers are systems that classify your texts and divide them into different classes.
TextBlob is a Python library for processing textual data. It provides a consistent API for diving
into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase
extraction, sentiment analysis, and more.
train = [
('What an amazing weather.', 'pos'),
('this is an amazing idea!', 'pos'),
('I feel very good about these ideas.', 'pos'),
('this is my best performance.', 'pos'),
("what an awesome view", 'pos'),
('I do not like this place', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with all this tension", 'neg'),
('he is my sworn enemy!', 'neg'),
('my friends is horrible.', 'neg')
]
test = [
('the food was great.', 'pos'),
('I do not want to live anymore', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Ramesh is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
print(cl.classify("This is an amazing library!"))
# Lets test the accuracy of the classifier
print(cl.accuracy(test))
# print(cl.classify("my friends is tension"))
# print(cl.accuracy(test))
cl.show_informative_features(4)
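The classifier can also return a probability distribution over the labels. A minimal sketch using TextBlob's prob_classify (the example sentence is illustrative):

# probability distribution over the labels for a new sentence
prob_dist = cl.prob_classify("This is an amazing library!")
print(prob_dist.max())                  # most likely label
print(round(prob_dist.prob("pos"), 2))  # probability of the 'pos' label
print(round(prob_dist.prob("neg"), 2))  # probability of the 'neg' label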
Output :
pos
0.8333333333333334
Most Informative Features
contains(I) = True neg : pos = 2.3 : 1.0
contains(an) = False neg : pos = 2.2 : 1.0
contains(I) = False pos : neg = 1.8 : 1.0
contains(my) = True neg : pos = 1.7 : 1.0
Experiment -9
Aim : Apply support vector machine for text classification
Description : Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well. Support vectors are the coordinates of individual observations. The SVM classifier is a frontier (hyper-plane/line) that best segregates the two classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features; we could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

# we create an instance of SVM and fit our data; we do not scale the
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C, gamma='auto').fit(X, y)

# build a mesh over the feature space so the decision regions can be plotted
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.show()
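The iris example above illustrates the SVM decision boundary on numeric features. Since the aim mentions text classification, here is a minimal sketch that combines TF-IDF features with scikit-learn's LinearSVC; the mini-corpus and its labels are invented for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# hypothetical mini-corpus with sentiment labels (illustration only)
docs = ["what an amazing movie", "a great and wonderful film",
        "the plot was boring", "terrible acting and a dull story"]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF vectorization followed by a linear SVM
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(docs, labels)
print(text_clf.predict(["a wonderful and amazing film", "what a boring plot"]))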
Experiment -10
Aim : Convert text documents into vectors using TF-IDF and compute the cosine similarity between them.
Description: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that "measures the cosine of the angle between them".
Cosine similarity determines how similar two words, sentences or documents are. It can be used for sentiment analysis and text comparison, and it is used by many popular packages such as word2vec.
The dot product is also called the scalar product because the dot product of two vectors gives a scalar result.
For example, Vector(A) = [5, 0, 2] and Vector(B) = [2, 5, 0].
The dot product Vector(A) . Vector(B) = 5*2 + 0*5 + 2*0 = 10 + 0 + 0 = 10.
The magnitudes are |A| = sqrt(25 + 0 + 4) = sqrt(29) and |B| = sqrt(4 + 25 + 0) = sqrt(29), so the cosine similarity is 10 / (sqrt(29) * sqrt(29)) = 10/29 ≈ 0.345.
The smaller the angle between two documents, the more similar they are; the cosine of the angle increases as the angle decreases, since cos 0 = 1 and cos 90 = 0.
The first step is to convert the documents/sentences/words into feature vectors; the cosine similarity between the documents can then be calculated.
Useful methods for feature extraction: i) Bag of Words ii) TF-IDF.
Bag of Words counts the unique words in the documents and the frequency of each word. Scikit-learn's CountVectorizer extracts the Bag-of-Words features.
The final array shows the cosine similarities of Document 0 compared with the other documents in the corpus. The first element in the array is 1, which means Document 0 compared with itself; the following elements, 0.08619387, 0, 0, are Document 0 compared with Documents 1, 2 and 3.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count_vect = CountVectorizer()
Document1 = "Aditya Engineering College situated at Surampalem"
Document2 = "Engineering Colleges offer computer science courses in MCA AIML CSE IT departments"
Document3 = "Computer science students have opprtunities in IT sector"
Document4 = "IT sector hire students with skills in computer science"
corpus = [Document1, Document2, Document3, Document4]

# Bag-of-Words counts
X_train_counts = count_vect.fit_transform(corpus)
df1 = pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document2', 'Document3'])

# TF-IDF feature vectors
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)
df2 = pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(),
                   index=['Document 0', 'Document 1', 'Document 2', 'Document 3'])

print(df1)
print(df2)

# cosine similarity of the first three documents against every document in the corpus
cosine_similarity(trsfm[0:3], trsfm)
Output :
aditya aiml at college colleges computer courses cse \
Document 0 1 0 1 1 0 0 0 0
Document 1 0 1 0 0 1 1 1 1
Document2 0 0 0 0 0 1 0 0
Document3 0 0 0 0 0 1 0 0
[4 rows x 24 columns]
aditya aiml at college colleges computer \
Document 0 0.421765 0.000000 0.421765 0.421765 0.000000 0.000000
Document 1 0.000000 0.328776 0.000000 0.000000 0.328776 0.209853
Document 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.289152
Document 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.263386
[4 rows x 24 columns]
array([[1. , 0.08619387, 0. , 0. ],
[0.08619387, 1. , 0.24271786, 0.22108976],
[0. , 0.24271786, 1. , 0.53702605]])