Applications of Natural Language Processing
• Once the text has been normalised, it is fed to an NLP-based AI model. Note that in NLP, modelling takes place only after the data pre-processing is done, after which the data is fed to the machine. Depending upon the type of chatbot we are trying to make, there are many AI models available which help us build the foundation of our project.
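• A minimal sketch of such a model is given below. It is not from the original material: the example utterances, the intent labels and the choice of scikit-learn are assumptions made purely for illustration.

    # Minimal sketch: a tiny intent classifier acting as the chatbot's AI model.
    # The training utterances and intent labels below are made up for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    training_sentences = [
        "hello", "hi there", "good morning",
        "what are your opening hours", "when do you open",
        "how can i track my order", "where is my order",
    ]
    training_intents = [
        "greeting", "greeting", "greeting",
        "hours", "hours",
        "order_status", "order_status",
    ]

    # The pipeline converts the normalised text to numbers (TFIDF) and learns to predict the intent.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(training_sentences, training_intents)

    print(model.predict(["when do you open tomorrow"]))  # most likely prints ['hours']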
Evaluation
• The trained model is then evaluated, and its accuracy is measured on the basis of how relevant the answers it gives to the user's queries are. To understand the efficiency of the model, the answers suggested by the chatbot are compared with the actual answers.
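• A minimal sketch of such a comparison (the answers below are made up for illustration):

    # Compare the chatbot's suggested answers with the expected answers
    # and compute a simple accuracy score.
    suggested_answers = ["We open at 9 am", "Your order is on the way", "Hello!"]
    actual_answers    = ["We open at 9 am", "Your order has been delivered", "Hello!"]

    matches = sum(1 for s, a in zip(suggested_answers, actual_answers) if s == a)
    accuracy = matches / len(actual_answers)
    print(f"Accuracy: {accuracy:.2f}")  # 2 of the 3 answers match, so accuracy is 0.67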
• Fig-1: The model's output does not match the true function at all. Hence the model is said to be underfitting and its accuracy is low.
• Fig-2: In the second case, the model's performance matches the true function well, which means the model has optimum accuracy; such a model is called a perfect fit.
• Fig-3: In the third case, the model tries to cover all the data samples, even those that are out of alignment with the true function. This model is said to be overfitting, and it too has lower accuracy.
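• These three situations can also be reproduced in code. The sketch below is an illustration, not part of the original figures: it fits polynomials of increasing degree to noisy data, where a very low degree underfits, a moderate degree stays close to the true function, and a very high degree overfits.

    # Minimal sketch: underfitting, a good fit and overfitting with polynomial models.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)  # true function plus noise

    for degree in (1, 4, 15):  # too simple, reasonable, too flexible
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x, y)
        print(f"degree={degree:2d}  training error={mean_squared_error(y, model.predict(x)):.3f}")
    # Degree 1 underfits (high error even on the training data), degree 15 overfits
    # (very low training error but poor behaviour on new data), degree 4 is a good fit.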
• Once the model is evaluated thoroughly, it is then deployed in the
form of an app which people can use easily.
Chatbots
• Mitsuku Bot: https://www.pandorabots.com/mitsuku/
• CleverBot: https://www.cleverbot.com/
• Jabberwacky: http://www.jabberwacky.com/
• Haptik: https://haptik.ai/contact-us
• Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
• Ochatbot: https://www.ometrics.com/blog/list-of-fun-chatbots/
Let us discuss!
• As you interact with more and more chatbots, you will realise that some of them are scripted, in other words traditional, chatbots while others are AI-powered and have more knowledge. With the help of this experience, we can understand that there are two types of chatbots around us: Script bots and Smart bots.
• Other examples of script bots include the bots deployed in the customer care sections of various companies. Their job is to answer the basic queries they are coded for and to connect the user to a human executive once they are unable to handle the conversation.
• On the other hand, assistants like Google Assistant, Alexa, Cortana, Siri, etc. can be taken as smart bots, as not only can they handle conversations, they can also manage other tasks, which makes them smarter.
Human Language VS Computer Language
• There are rules in human language. There are nouns, verbs, adverbs,
adjectives. A word can be a noun at one time and an adjective some other
time. There are rules to provide structure to a language.
• This is the issue related to the syntax of the language. Syntax refers to the grammatical structure of a sentence. When the structure is present, we can start interpreting the message. Now we also want the computer to do this. One way to do this is to use part-of-speech tagging, which allows the computer to identify the different parts of speech (a small tagging sketch is given at the end of this section).
• Besides the matter of arrangement, there’s also meaning behind the
language we use. Human communication is complex. There are multiple
characteristics of the human language that might be easy for a human to
understand but extremely difficult for a computer to understand.
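• Coming back to part-of-speech tagging, here is a minimal sketch using NLTK as one possible tool; the original text does not prescribe any particular library, and the exact data package names can vary with the NLTK version.

    # Part-of-speech tagging: the computer labels each word as a noun, verb, adjective, etc.
    # Assumes NLTK is installed and its standard tagger data has been downloaded.
    import nltk
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = "The chatbot answers simple questions quickly".split()
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('chatbot', 'NN'), ('answers', 'VBZ'),
    #       ('simple', 'JJ'), ('questions', 'NNS'), ('quickly', 'RB')]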
Converting Text to a Common Case
• After the stopwords removal, we convert the whole text into a similar case, preferably lower case. This ensures that the machine does not treat the same words as different just because of a difference in case.
• Here, in this example, all 6 forms of 'hello' would be converted to lower case and hence would be treated as the same word by the machine.
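• A minimal sketch of these two steps (the stopword list here is a small illustrative one, not an official list):

    # Remove stopwords, then convert every remaining token to lower case so that
    # "Hello", "HELLO" and "hello" are all treated as the same word.
    stopwords = {"the", "is", "and", "to", "a", "this"}

    tokens = ["Hello", "HELLO", "hello", "This", "is", "a", "Greeting"]
    without_stopwords = [t for t in tokens if t.lower() not in stopwords]
    lower_cased = [t.lower() for t in without_stopwords]
    print(lower_cased)  # ['hello', 'hello', 'hello', 'greeting']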
Stemming
• In this step, the remaining words are reduced to their root words. In
other words, stemming is the process in which the affixes of words
are removed and the words are converted to their base form.
• Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful. Here in this example, as you can see, healed, healing and healer were all reduced to heal, but studies was reduced to studi after the affix removal, which is not a meaningful word. Stemming does not take into account whether the stemmed word is meaningful or not; it just removes the affixes, hence it is faster.
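• A minimal sketch using NLTK's Porter stemmer (one common stemmer; the original material does not name a specific tool):

    # Stemming chops off affixes; the result is not always a meaningful word.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["healed", "healing", "studies", "studying"]:
        print(word, "->", stemmer.stem(word))
    # healed -> heal, healing -> heal, studies -> studi, studying -> studi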
Lemmatization
• Lemmatization also removes the affixes of words, but unlike stemming it makes sure that the word we get after affix removal (known as the lemma) is a meaningful word of the language: studies, for example, becomes study rather than studi. Because of this check, lemmatization takes longer to execute than stemming.
Bag of Words
• In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.
• Since the first document contains the words aman, and, anil, are and stressed, all these words get a value of 1 and the rest of the words get a value of 0.
• Step 4: Repeat for all documents
• The same exercise has to be done for all the documents. Hence, the table becomes:
• In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
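• A minimal sketch of building this document vector table in code is given below. The first document is the one mentioned above; the other two documents are assumed purely for illustration (they are consistent with the document frequencies discussed in the TFIDF section).

    # Build the vocabulary and the document vector table (bag of words) by hand.
    documents = [
        "aman and anil are stressed",
        "aman went to a therapist",
        "anil went to download a health chatbot",
    ]

    # Vocabulary: every unique word of the corpus, in order of first appearance.
    vocabulary = []
    for doc in documents:
        for word in doc.split():
            if word not in vocabulary:
                vocabulary.append(word)

    # For each document, count how many times each vocabulary word occurs.
    document_vectors = [[doc.split().count(word) for word in vocabulary] for doc in documents]

    print(vocabulary)
    for row in document_vectors:
        print(row)
    # Each row is one document; a 0 means that word does not occur in that document.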
• Finally, this gives us the document vector table for our corpus. But these numbers only count how often each token occurs; they do not yet tell us how valuable each word is. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document
Frequency
• Suppose you have a book. Which characters or words do you think would occur
the most in it?
• __________________________________________________________________
• The bag of words algorithm gives us the frequency of words in each document we have in our corpus. It gives us the idea that if a word occurs more often in a document, its value is higher for that document. For example, if I have a document on air pollution, air and pollution would be the words which occur many times in it. And these words are valuable too, as they give us some context around the document. But let us suppose we have 10 documents and all of them talk about different issues: one is on women empowerment, another is on unemployment, and so on. Do you think air and pollution would still be among the most occurring words in the whole corpus? If not, then which words do you think would have the highest frequency in all of them?
• 'And', 'this', 'is', 'the', etc. are the words which occur the most in almost all the documents. But these words do not tell us anything about the corpus. Though they are important for humans, as they make the statements understandable to us, for the machine they are a complete waste as they do not provide any information regarding the corpus. Hence, these are termed stopwords and are mostly removed at the pre-processing stage itself.
• Take a look at this graph. It is a plot of occurrence of words versus their value. As
you can see, if the words have highest occurrence in all the documents of the
corpus, they are said to have negligible value hence they are termed as stop
words. These words are mostly removed at the pre-processing stage only. Now as
we move ahead from the stopwords, the occurrence level drops drastically and
the words which have adequate occurrence in the corpus are said to have some
amount of value and are termed as frequent words. These words mostly talk
about the document’s subject and their occurrence is adequate in the corpus.
Then as the occurrence of words drops further, the value of such words rises.
These words are termed as rare or valuable words. These words occur the least
but add the most value to the corpus. Hence, when we look at the text, we take
frequent and rare words into consideration.
• Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF helps us in identifying the value of each word. Let us understand each term one by one.
Term Frequency
• Term frequency is simply the frequency of a word in one document. It can be read off directly from the document vector table above, since that table records how many times each word of the vocabulary occurs in each document.
Inverse Document Frequency
• Now, let us look at the other half of TFIDF, which is Inverse Document Frequency. For this, let us first understand what document frequency means. Document frequency is the number of documents in which a word occurs, irrespective of how many times it has occurred in those documents. The document frequency for the exemplar vocabulary would be:
• Here, you can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they have occurred in two documents. The rest of them occur in just one document, hence the document frequency for them is 1.
• Talking about inverse document frequency, we put the document frequency in the denominator while the total number of documents goes in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:
• Finally, the formula of TFIDF for any word W becomes:
• TFIDF(W) = TF(W) * log( IDF(W) )
• Here, log is to the base 10, and IDF(W) is the ratio of the total number of documents to the number of documents in which W occurs. Don't worry! You don't need to calculate the log values yourself. Simply use the log function on a calculator and find out! (A small code sketch of this whole calculation is given after the summary points below.)
• Now, let's multiply the IDF values with the TF values. Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.
• Here, you can see that the IDF value for 'aman' in each row is the same, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get:
• Finally, the words have been converted to numbers. These numbers are the values of each word for each document. Here, you can see that since we have a small amount of data, words like 'are' and 'and' also have a high value. But as a word's document frequency increases, that is, as it appears in more and more documents, its value decreases. For example:
• Total Number of documents: 10
• Number of documents in which ‘and’ occurs: 10
• Therefore, IDF(and) = 10/10 = 1
• Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.
• On the other hand, number of documents in which ‘pollution’ occurs: 3
• IDF(pollution) = 10/3 = 3.3333…
• Which means: log(3.3333) = 0.522; which shows that the word ‘pollution’ has
considerable value in the corpus.
• Summarising the concept, we can say that:
• 1. Words that occur in all the documents with high term frequencies
have the least values and are considered to be the stopwords.
• 2. For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not a common word across all documents.
• 3. These values help the computer understand which words are to be
considered while processing the natural language. The higher the
value, the more important the word is for a given corpus.
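• To tie this together, here is a minimal code sketch of the whole calculation, using the same three example documents as before (the second and third documents are assumed for illustration):

    # TFIDF(W) = TF(W) * log10( total documents / number of documents containing W )
    import math

    documents = [
        "aman and anil are stressed",
        "aman went to a therapist",
        "anil went to download a health chatbot",
    ]
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({word for doc in tokenised for word in doc})
    total_docs = len(documents)

    for doc in tokenised:
        tfidf_row = {}
        for word in vocabulary:
            tf = doc.count(word)                            # term frequency in this document
            df = sum(1 for d in tokenised if word in d)     # document frequency in the corpus
            tfidf_row[word] = round(tf * math.log10(total_docs / df), 3)
        print(tfidf_row)
    # Words such as 'aman', 'anil', 'went', 'to' and 'a' occur in 2 of the 3 documents,
    # so their values are lower than those of words that occur in only one document.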
Applications of TFIDF