NLP


Identify the mystery animal:

http://bit.ly/iai4yma
Applications of Natural Language Processing

• Automatic Summarization: Information overload is a real problem
when we need to access a specific, important piece of information
from a huge knowledge base. Automatic summarization is relevant
not only for summarizing the meaning of documents and information,
but also to understand the emotional meanings within the
information, such as in collecting data from social media. Automatic
summarization is especially relevant when used to provide an
overview of a news item or blog post, while avoiding redundancy
from multiple sources and maximizing the diversity of content
obtained.
• Sentiment Analysis: The goal of sentiment analysis is to identify
sentiment among several posts or even in the same post where
emotion is not always explicitly expressed. Companies use Natural
Language Processing applications, such as sentiment analysis, to
identify opinions and sentiment online to help them understand what
customers think about their products and services (i.e., “I love the
new iPhone” and, a few lines later “But sometimes it doesn’t work
well” where the person is still talking about the iPhone) and overall
indicators of their reputation. Beyond determining simple polarity,
sentiment analysis understands sentiment in context to help better
understand what’s behind an expressed opinion, which can be
extremely relevant in understanding and driving purchasing decisions.
• Text classification: Text classification makes it possible to assign
predefined categories to a document and organize it to help you find
the information you need or simplify some activities. For example, an
application of text categorization is spam filtering in email.
• Virtual Assistants: Nowadays, Google Assistant, Cortana, Siri, Alexa, etc.
have become an integral part of our lives. Not only can we talk to
them, but they also have the ability to make our lives easier. By
accessing our data, they can help us keep notes of our tasks,
make calls for us, send messages and a lot more. With the help of
speech recognition, these assistants can not only detect our speech
but can also make sense of it. According to recent research, many
more advancements are expected in this field in the near future.
Natural Language Processing: Getting Started

• Natural Language Processing is all about how machines try to
understand and interpret human language and operate accordingly.
But how can Natural Language Processing be used to solve the
problems around us? Let us take a look.
Revisiting the AI Project Cycle

• Let us try to understand how we can develop a project in Natural
Language Processing with the help of an example.
The Scenario

• The world is competitive nowadays. People face competition in even the
tiniest tasks and are expected to give their best at every point in time.
When people are unable to meet these expectations, they get stressed and
could even go into depression. We get to hear a lot of cases where people
are depressed due to reasons like peer pressure, studies, family issues,
relationships, etc. and they eventually get into something that is bad for
them as well as for others. So, to overcome this, cognitive behavioural
therapy (CBT) is considered to be one of the best methods to address stress
as it is easy to implement on people and also gives good results. This
therapy includes understanding the behaviour and mindset of a person in
their normal life. With the help of CBT, therapists help people overcome
their stress and live a happy life.
Problem Scoping

• CBT is a technique used by most therapists to help patients overcome
stress and depression. But it has been observed that people do not
wish to seek the help of a psychiatrist willingly. They try to avoid such
interactions as much as possible. Thus, there is a need to bridge the
gap between a person who needs help and the psychiatrist. Let us
look at various factors around this problem through the 4Ws problem
canvas.
Data Acquisition

• To understand the sentiments of people, we need to collect their
conversational data so the machine can interpret the words that they
use and understand their meaning. Such data can be collected from
various means:
• 1. Surveys
• 2. Observing the therapist’s sessions
• 3. Databases available on the internet
• 4. Interviews, etc.
Data Exploration

• Once the textual data has been collected, it needs to be processed
and cleaned so that a simpler version can be sent to the machine.
Thus, the text is normalised through various steps and reduced to a
minimal vocabulary, since the machine does not need grammatically
correct statements, only their essence.
Modelling

• Once the text has been normalised, it is then fed to an NLP-based AI
model. Note that in NLP, the data must be pre-processed before it is
fed to the machine. Depending upon the type of chatbot we are trying
to build, there are many AI models available which help us build the
foundation of our project.
Evaluation

• The trained model is then evaluated and its accuracy is determined
on the basis of the relevance of the answers which the machine gives
to the user’s responses. To understand the efficiency of the model,
the answers suggested by the chatbot are compared to the actual
answers.
• Fig-1: The model’s output does not match the true function at all.
Hence the model is said to be underfitting and its accuracy is lower.
• Fig-2: The model’s performance matches well with the true function,
which shows that the model has optimum accuracy; such a model is
called a perfect fit.
• Fig-3: The model tries to cover all the data samples, even those
that are out of alignment with the true function. This model is said
to be overfitting and this too has a lower accuracy.
• Once the model is evaluated thoroughly, it is then deployed in the
form of an app which people can use easily.
Chatbots

• As we have seen earlier, one of the most common applications of
Natural Language Processing is a chatbot. There are a lot of chatbots
available and many of them use the same approach as we used in the
scenario above.
• Let us try some of the chatbots and see how they work.

• Mitsuku Bot: https://www.pandorabots.com/mitsuku/
• CleverBot: https://www.cleverbot.com/
• Jabberwacky: http://www.jabberwacky.com/
• Haptik: https://haptik.ai/contact-us
• Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
• Ochatbot: https://www.ometrics.com/blog/list-of-fun-chatbots/
Let us discuss!

• Which chatbot did you try? Name any one.
• What is the purpose of this chatbot?
• How was the interaction with the chatbot?
• Did the chat feel like talking to a human or a robot? Why do you think so?
• Do you feel that the chatbot has a certain personality?

• As you interact with more and more chatbots, you will realise that some
of them are scripted, or in other words traditional chatbots, while others
are AI-powered and have more knowledge. With the help of this
experience, we can understand that there are two types of chatbots around
us: script-bots and smart-bots.
• Other examples of script-bots include the bots which are
deployed in the customer care sections of various companies. Their
job is to answer the basic queries that they are coded for and to
connect the user to a human executive once they are unable to handle
the conversation.
• On the other hand, all the assistants like Google Assistant, Alexa,
Cortana, Siri, etc. can be taken as smart bots as not only can they
handle the conversations but can also manage to do other tasks
which makes them smarter.
Human Language VS Computer Language

• Humans communicate through language, which we process all the
time. Our brain keeps on processing the sounds that it hears around
itself and tries to make sense out of them all the time. Even in the
classroom, as the teacher delivers the session, our brain is
continuously processing everything and storing it in some place. Also,
while this is happening, when your friend whispers something, the
focus of your brain automatically shifts from the teacher’s speech to
your friend’s conversation. So now, the brain is processing both the
sounds but is prioritising the one on which our interest lies.
• The sound reaches the brain through a long channel. As a person
speaks, the sound travels from their mouth to the listener’s
eardrum. The sound striking the eardrum is converted into nerve
impulses, which are transported to the brain and then processed. After
processing the signal, the brain understands its meaning. If it is
clear, the signal gets stored. Otherwise, the listener asks the speaker
for clarification. This is how human languages are processed by humans.
• On the other hand, the computer understands the language of
numbers. Everything that is sent to the machine has to be converted
to numbers. And while typing, if a single mistake is made, the
computer throws an error and does not process that part. The
communications made by the machines are very basic and simple.
• Now, if we want the machine to understand our language, how
should this happen? What are the possible difficulties a machine
would face in processing natural language? Let us take a look at some
of them here:
Arrangement of the words and meaning

• There are rules in human language. There are nouns, verbs, adverbs,
adjectives. A word can be a noun at one time and an adjective some other
time. There are rules to provide structure to a language.
• This is the issue related to the syntax of the language. Syntax refers to the
grammatical structure of a sentence. When the structure is present, we can
start interpreting the message. Now we also want the computer to do
this. One way to do this is part-of-speech tagging, which allows
the computer to identify the different parts of speech in a sentence.
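• As an illustration (not part of the original text), here is a minimal sketch of part-of-speech tagging using the NLTK library; the library choice is an assumption, and the sentence is one used later in this chapter. NLTK and its tagger models must be installed/downloaded first.

    # Part-of-speech tagging with NLTK (assumes nltk is installed).
    import nltk

    nltk.download("punkt", quiet=True)                       # tokenizer models
    nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

    sentence = "The red car zoomed past his nose"
    tokens = nltk.word_tokenize(sentence)    # split the sentence into word tokens
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('red', 'JJ'), ('car', 'NN'), ('zoomed', 'VBD'), ...]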
• Besides the matter of arrangement, there’s also meaning behind the
language we use. Human communication is complex. There are multiple
characteristics of the human language that might be easy for a human to
understand but extremely difficult for a computer to understand.
Analogy with programming language:

• Different syntax, same semantics: 2+3 = 3+2


• Here the way these statements are written is different, but their
meanings are the same that is 5.
• Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)
• Here the statements written have the same syntax but their meanings
are different. In Python 2.7, this statement would result in 1 (integer
division) while in Python 3, it would give an output of 1.5.
• Think of some other examples of different syntax and same semantics
and vice-versa.
Multiple Meanings of a word

• Let’s consider these three sentences:


• His face turned red after he found out that he took the wrong bag
• What does this mean? Is he feeling ashamed because he took another
person’s bag instead of his? Is he feeling angry because he did not manage
to steal the bag that he has been targeting?
• The red car zoomed past his nose
• Probably talking about the color of the car
• His face turns red after consuming the medicine
• Is he having an allergic reaction? Or is he not able to bear the taste of that
medicine?
• Here we can see that context is important. We understand a sentence
almost intuitively, depending on our history of using the language,
and the memories that have been built within. In all three sentences,
the word red has been used in three different ways which according
to the context of the statement changes its meaning completely.
Thus, in natural language, it is important to understand that a word
can have multiple meanings and the meanings fit into the statement
according to the context of it.
• Think of some other words which can have multiple meanings and
use them in sentences.
Perfect Syntax, no Meaning

• Sometimes, a statement can have a perfectly correct syntax but it
does not mean anything. For example, take a look at this statement:
• Chickens feed extravagantly while the moon drinks tea.
• This statement is grammatically correct but does it make any
sense? In human language, a perfect balance of syntax and semantics
is important for proper understanding.
Data Processing
• Humans interact with each other very easily. For us, the natural languages that
we use are so convenient that we speak them easily and understand them well
too. But for computers, our languages are very complex. As you have already
gone through some of the complications in human languages above, now it is
time to see how Natural Language Processing makes it possible for the machines
to understand and speak in the Natural Languages just like humans.
• Since the language of computers is numerical, the very first step
that comes to mind is to convert our language to numbers. This conversion
takes a few steps. The first step is Text Normalisation. Since
human languages are complex, we need to simplify them first in order to
make understanding possible. Text Normalisation helps in cleaning up the
textual data so that its complexity is lower than that of the raw data.
Let us go through Text Normalisation in detail.
Text Normalisation

• In Text Normalisation, we undergo several steps to normalise the text
to a lower level. Before we begin, we need to understand that in this
section we will be working on a collection of written text, that is,
text from multiple documents. The whole textual data from all the
documents taken together is known as the corpus. Not only will we go
through all the steps of Text Normalisation, we will also work them out
on a corpus. Let us take a look at the steps:
Sentence Segmentation

• Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is taken as a separate piece of data, so the
whole corpus gets reduced to a list of sentences.
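• For example, a minimal sketch of sentence segmentation, assuming the NLTK library (not named in the original text) and the three example sentences used later in this chapter:

    # Sentence segmentation: divide the corpus into sentences.
    import nltk
    nltk.download("punkt", quiet=True)   # sentence tokenizer models

    corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
              "Anil went to download a health chatbot.")
    sentences = nltk.sent_tokenize(corpus)
    print(sentences)
    # ['Aman and Anil are stressed.', 'Aman went to a therapist.',
    #  'Anil went to download a health chatbot.']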
Tokenisation

• After segmenting the sentences, each sentence is then further
divided into tokens. A token is any word, number or special
character occurring in a sentence. Under tokenisation, every
word, number and special character is considered separately and
each of them becomes a separate token.
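• A minimal sketch of tokenisation, again assuming NLTK:

    # Tokenisation: every word, number and special character becomes a token.
    import nltk
    nltk.download("punkt", quiet=True)

    sentence = "Aman went to a therapist."
    tokens = nltk.word_tokenize(sentence)
    print(tokens)   # ['Aman', 'went', 'to', 'a', 'therapist', '.']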
Removing Stopwords, Special Characters and
Numbers
• In this step, the tokens which are not necessary are removed from the
token list. What can be the possible words which we might not
require?
• Stopwords are the words which occur very frequently in the corpus
but do not add any value to it. Humans use grammar to make their
sentences meaningful for the other person to understand. But
grammatical words do not add any essence to the information which
is to be transmitted through the statement hence they come under
stopwords. Some examples of stopwords are: a, an, and, are, for, is, of, the, to.
• These words occur the most in any given corpus but talk very little or
nothing about the context or the meaning of it. Hence, to make it
easier for the computer to focus on meaningful terms, these words
are removed.
• Along with these words, a lot of times our corpus might have special
characters and/or numbers. Now it depends on the type of corpus
that we are working on whether we should keep them in it or not. For
example, if you are working on a document containing email IDs, then
you might not want to remove the special characters and numbers
whereas in some other textual data if these characters do not make
sense, then you can remove them along with the stopwords.
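• A minimal sketch of this step, assuming NLTK's English stopword list (the original text does not prescribe a particular list) and a small made-up token list:

    # Removing stopwords, special characters and numbers from the token list.
    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    tokens = ["Aman", "and", "Anil", "are", "stressed", "!", "2024"]

    cleaned = [t for t in tokens
               if t.lower() not in stop_words   # drop stopwords such as 'and', 'are'
               and t.isalpha()]                 # drop special characters and numbers
    print(cleaned)   # ['Aman', 'Anil', 'stressed']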
Converting text to a common case

• After stopword removal, we convert the whole text into a single
case, preferably lower case. This ensures that the machine’s
case-sensitivity does not cause the same word to be treated as
different words just because of different cases.
• Here, for example, all six forms of “hello” would be converted to
lower case and hence would be treated as the same word by the
machine.
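• For instance, six differently-cased spellings (assumed here for illustration) collapse into a single word once the text is lower-cased:

    # Converting text to a common (lower) case.
    tokens = ["Hello", "HELLO", "hello", "HeLLo", "heLLO", "hellO"]
    lowered = [t.lower() for t in tokens]
    print(set(lowered))   # {'hello'} -- all six forms become the same word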
Stemming

• In this step, the remaining words are reduced to their root words. In
other words, stemming is the process in which the affixes of words
are removed and the words are converted to their base form.
• Note that in stemming, the stemmed words (the words we get
after removing the affixes) might not be meaningful. In this
example, as you can see, healed, healing and healer were all reduced
to heal, but studies was reduced to studi after affix removal, which
is not a meaningful word. Stemming does not take into account whether
the stemmed word is meaningful or not; it just removes the affixes,
and hence it is faster.
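• A minimal sketch of stemming, assuming NLTK's PorterStemmer (the original text does not name a specific stemmer):

    # Stemming: strip affixes without checking whether the result is a real word.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["healed", "healing", "studies"]:
        print(word, "->", stemmer.stem(word))
    # healed -> heal, healing -> heal, studies -> studi
    # 'studi' is not a meaningful word: stemming only removes the affix.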
Lemmatization

• Stemming and lemmatization are alternative processes to each
other, as the role of both processes is the same: removal of affixes.
The difference between them is that in lemmatization, the word
we get after affix removal (known as the lemma) is a meaningful
one. Lemmatization makes sure that the lemma is a word with
meaning, and hence it takes longer to execute than stemming.
Difference between stemming and lemmatization
can be summarized by this example:
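• Since the comparison example from the original slides is not reproduced here, a minimal sketch (assuming NLTK's PorterStemmer and WordNetLemmatizer) illustrates the difference:

    # Stemming vs lemmatization on the same word.
    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    word = "studies"
    print(PorterStemmer().stem(word))           # studi -- not a meaningful word
    print(WordNetLemmatizer().lemmatize(word))  # study -- a meaningful lemma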
• With this, we have normalised our text to tokens, which are the
simplest form of the words present in the corpus. Now it is time to
convert the tokens into numbers. For this, we will use the Bag of
Words algorithm.
Bag of Words

• Bag of Words is a Natural Language Processing model which helps in
extracting features out of the text which can be helpful in machine
learning algorithms. In Bag of Words, we get the occurrences of each
word and construct the vocabulary for the corpus.
• To get a brief overview of how Bag of Words works, let us assume that we have
the normalised corpus which we have got after going through all the steps of
text processing. When we put this text into the Bag of Words algorithm, the
algorithm returns the unique words of the corpus and their occurrences in it:
a list of the words appearing in the corpus, with a number against each word
showing how many times that word has occurred in the text body. Thus, we can
say that Bag of Words gives us two things:
• 1. A vocabulary of words for the corpus
• 2. The frequency of these words (number of times it has occurred in the whole
corpus).
• Here calling this algorithm “bag” of words symbolises that the sequence of
sentences or tokens does not matter in this case as all we need are the unique
words and their frequency in it.
• Here is the step-by-step approach to implement bag of words algorithm:
• 1. Text Normalisation: Collect data and pre-process it
• 2. Create Dictionary: Make a list of all the unique words occurring in the
corpus. (Vocabulary)
• 3. Create document vectors: For each document in the corpus, find out
how many times the word from the unique list of words has occurred.
• 4. Create document vectors for all the documents.

• Let us go through all the steps with an example:


Step 1: Collecting data and pre-processing it.

• Document 1: Aman and Anil are stressed


• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot
• Here are three documents having one sentence each. After text
normalisation, the text becomes:
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]
• Note that no tokens have been removed in the stopwords removal step. It
is because we have very little data and since the frequency of all the words
is almost the same, no word can be said to have lesser value than the
other.
• Step 2: Create Dictionary
• Go through all the steps and create a dictionary i.e., list down all the
words which occur in all three documents:
• Dictionary: aman, and, anil, are, stressed, went, to, a, therapist,
download, health, chatbot

• Note that even though some words are repeated in different
documents, they are all written just once, as while creating the
dictionary, we create the list of unique words.
Step 3: Create document vector

• In this step, the vocabulary is written in the top row. Now, for each
word in the document, if it matches with the vocabulary, put a 1
under it. If the same word appears again, increment the previous
value by 1. And if the word does not occur in that document, put a 0
under it.
• In the first document, we have the words aman, and, anil, are and
stressed, so all these words get a value of 1 and the rest of the
words get a value of 0.
• Step 4: Repeat for all documents
• The same exercise has to be done for all the documents. Hence, the
table becomes:
• In this table, the header row contains the vocabulary of the corpus
and three rows correspond to three different documents. Take a look
at this table and analyse the positioning of 0s and 1s in it.
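• The same steps can be worked out in plain Python for the three example documents (a minimal sketch; no external library is assumed):

    # Bag of Words: build the vocabulary and a document vector for each document.
    docs = [
        ["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"],
    ]

    # Step 2: create the dictionary of unique words (the vocabulary).
    vocabulary = []
    for doc in docs:
        for word in doc:
            if word not in vocabulary:
                vocabulary.append(word)
    print(vocabulary)

    # Steps 3 and 4: create a document vector for every document.
    for doc in docs:
        print([doc.count(word) for word in vocabulary])
    # Document 1 -> [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]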
• Finally, this gives us the document vector table for our corpus. But
the tokens have still not been converted to numbers. This leads us to
the final steps of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document
Frequency
• Suppose you have a book. Which characters or words do you think would occur
the most in it?
• ____________________________________________________________
• Bag of words algorithm gives us the frequency of words in each document we
have in our corpus. It gives us an idea that if the word is occurring more in a
document, its value is more for that document. For example, if I have a document
on air pollution, air and pollution would be the words which occur many times in
it. And these words are valuable too as they give us some context around the
document. But let us suppose we have 10 documents and all of them talk about
different issues. One is on women empowerment, the other is on unemployment
and so on. Do you think air and pollution would still be one of the most occurring
words in the whole corpus? If not, then which words do you think would have the
highest frequency in all of them?
• And, this, is, the, etc. are the words which occur the most in almost
all the documents. But these words do not talk about the corpus at
all. Though they are important for humans as they make the
statements understandable to us, for the machine they are a
complete waste as they do not provide us with any information
regarding the corpus. Hence, these are termed as stopwords and are
mostly removed at the pre-processing stage only.
• Take a look at this graph. It is a plot of occurrence of words versus their value. As
you can see, if the words have highest occurrence in all the documents of the
corpus, they are said to have negligible value hence they are termed as stop
words. These words are mostly removed at the pre-processing stage only. Now as
we move ahead from the stopwords, the occurrence level drops drastically and
the words which have adequate occurrence in the corpus are said to have some
amount of value and are termed as frequent words. These words mostly talk
about the document’s subject and their occurrence is adequate in the corpus.
Then as the occurrence of words drops further, the value of such words rises.
These words are termed as rare or valuable words. These words occur the least
but add the most value to the corpus. Hence, when we look at the text, we take
frequent and rare words into consideration.
• Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse
Document Frequency. TFIDF helps us in identifying the value of each word.
Let us understand each term one by one.
Term Frequency

• Term frequency is the frequency of a word in one document. Term
frequency can easily be found from the document vector table, as in
that table we mention the frequency of each word of the vocabulary
in each document.
• Here, you can see that the frequency of each word for each
document has been recorded in the table. These numbers are nothing
but the Term Frequencies!
Inverse Document Frequency

• Now, let us look at the other half of TFIDF, which is Inverse Document
Frequency. For this, let us first understand what document frequency
means. Document frequency is the number of documents in which the
word occurs, irrespective of how many times it has occurred in those
documents. The document frequency for the example vocabulary would be:
• Here, you can see that the document frequency of ‘aman’, ‘anil’,
‘went’, ‘to’ and ‘a’ is 2 as they have occurred in two documents. Rest
of them occurred in just one document hence the document
frequency for them is one.
• For inverse document frequency, we put the document frequency in
the denominator while the total number of documents goes in the
numerator. Here, the total number of documents is 3, hence the
inverse document frequency becomes:
• Finally, the formula of TFIDF for any word W becomes:
• TFIDF(W) = TF(W) * log( IDF(W) )
• Here, log is to the base of 10. Don’t worry! You don’t need to
calculate the log values by yourself. Simply use the log function in the
calculator and find out!
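• For example, the log values for this corpus of 3 documents can be found with Python's math module (a small sketch, using the base-10 log stated above):

    import math

    print(math.log10(3 / 2))   # 0.176... for words occurring in 2 of the 3 documents
    print(math.log10(3 / 1))   # 0.477... for words occurring in only 1 document
    print(math.log10(3 / 3))   # 0.0     for a word occurring in every document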
• Now, let’s multiply the IDF values with the TF values. Note that the TF
values are for each document while the IDF values are for the whole
corpus. Hence, we need to multiply the IDF values with each row of the
document vector table.
• Here, you can see that the IDF value for aman in each row is the
same, and a similar pattern is followed for all the words of the
vocabulary. After calculating all the values, we get:
• Finally, the words have been converted to numbers. These numbers are
the value of each word for each document. Here, you can see that since
we have a small amount of data, words like ‘are’ and ‘and’ also have a
high value. But as a word occurs in more and more documents, its value
decreases. For example:
• Total Number of documents: 10
• Number of documents in which ‘and’ occurs: 10
• Therefore, IDF(and) = 10/10 = 1
• Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.
• On the other hand, number of documents in which ‘pollution’ occurs: 3
• IDF(pollution) = 10/3 = 3.3333…
• Which means: log(3.3333) = 0.522; which shows that the word ‘pollution’ has
considerable value in the corpus.
• Summarising the concept, we can say that:
• 1. Words that occur in all the documents with high term frequencies
have the least values and are considered to be the stopwords.
• 2. For a word to have high TFIDF value, the word needs to have a high
term frequency but less document frequency which shows that the
word is important for one document but is not a common word for all
documents.
• 3. These values help the computer understand which words are to be
considered while processing the natural language. The higher the
value, the more important the word is for a given corpus.
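• The whole calculation can be summarised in a short sketch (plain Python, using the formula given above and the three example documents):

    # TFIDF(W) = TF(W) * log10(total documents / document frequency of W)
    import math

    docs = [
        ["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"],
    ]

    vocabulary = sorted({word for doc in docs for word in doc})
    N = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

    for doc in docs:
        tfidf = {w: doc.count(w) * math.log10(N / df[w]) for w in vocabulary}
        print({w: round(v, 3) for w, v in tfidf.items() if v > 0})
    # A word occurring in every document would get log10(N/N) = 0,
    # i.e. a stopword-like value.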
Applications of TFIDF

• TFIDF is commonly used in the Natural Language Processing domain.
Some of its applications are: document classification, topic modelling,
information retrieval and stopword filtering.
