ccs369-unit-1-summarized-lecture-note
INTRODUCTION
Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of
human behavior research. This integration marks a significant milestone in the data collection and analysis endeavors,
enabling users to unlock deeper insights from spoken language and empower researchers and analysts with enhanced
capabilities for understanding and interpreting human communication. Human interactions are a critical part of many
organizations. Many organizations analyze speech or text via natural language processing (NLP) and link them to insights
and automation such as text categorization, text classification, information extraction, etc.
In business intelligence, speech and text analytics enable us to gain insights into customer-agent conversations through
sentiment analysis, and topic trends. These insights highlight areas of improvement, recognition, and concern, to better
understand and serve customers and employees. Speech and text analytics is a set of features that uses natural language processing (NLP) to provide automated analysis of an interaction’s content across 100% of interactions, giving deep insight into customer-agent conversations. This includes transcribing voice interactions, analysing customer sentiment, spotting topics, and creating meaning from otherwise unstructured data.
FOUNDATIONS OF NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is the field concerned with enabling computers to process and produce meaningful phrases and sentences in natural language. NLP comprises Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU maps language input into useful representations and supports tasks such as information extraction and retrieval and sentiment analysis, while NLG takes structured data as input and produces natural language output. NLP can be thought of as an intersection of Linguistics, Computer Science, and
Artificial Intelligence that helps computers understand, interpret and manipulate human language.
NLP typically involves two main stages:
1. Data Preprocessing
2. Algorithm Development
In Natural Language Processing, machine learning training algorithms study millions of examples of text — words,
sentences, and paragraphs — written by humans. By studying the samples, the training algorithms gain an understanding
of the “context” of human speech, writing, and other modes of communication. This training helps NLP software to
differentiate between the meanings of various texts. The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis. Some well-known application areas of NLP are Optical
Character Recognition (OCR), Speech Recognition, Machine Translation, and Chatbots.
The first phase of NLP is word structure analysis, which is referred to as lexical or morphological analysis. A lexicon is
defined as a collection of words and phrases in a given language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as parameters – paragraphs, phrases, words, or
characters.
Similarly, morphological analysis is the process of identifying the morphemes of a word. A morpheme is a basic unit of
English language construction, which is a small element of a word that carries meaning. These can be either a free
morpheme (e.g. walk) or a bound morpheme (e.g. -ing, -ed); the difference between the two is that the latter
cannot stand on its own to produce a word with meaning and must be attached to a free morpheme to carry meaning.
In search engine optimization (SEO), lexical or morphological analysis helps guide web searching. For instance, when
doing on-page analysis, you can perform lexical and morphological analysis to understand how often the target keywords
are used in their core form (as free morphemes, or when in composition with bound morphemes). This type of analysis
can ensure that you have an accurate understanding of the different variations of the morphemes that are used.
Morphological analysis can also be applied in transcription and translation projects, so can be very useful in content
repurposing projects, and international SEO and linguistic analysis.
Syntax Analysis is the second phase of natural language processing. Syntax analysis, or parsing, is the process of checking
grammar and word arrangement: overall, identifying the relationships between words and whether those relationships make
sense. The process involves examining all words and phrases in a sentence, and the structures between them.
As part of the process, a visualisation of the syntactic relationships is built, referred to as a syntax tree (similar to a
knowledge graph). This process ensures that the structure, order, and grammar of sentences make sense, given the words
and phrases that make up those sentences. Syntax analysis also involves tagging words and phrases with POS tags. There
are two common approaches to constructing the syntax tree, top-down and bottom-up; both check whether the input forms
a valid sentence and reject the input if it does not.
Semantic analysis is the third stage in NLP, when an analysis is performed to understand the meaning in a statement. This
type of analysis is focused on uncovering the definitions of words, phrases, and sentences and identifying whether the way
words are organized in a sentence makes sense semantically.
This task is performed by mapping the syntactic structure, and checking for logic in the presented relationships between
entities, words, phrases, and sentences in the text. There are a couple of important functions of semantic analysis, which
allow for natural language understanding:
To ensure that the data types are used in a way that’s consistent with their definition.
To ensure that the flow of the text is consistent.
Identification of synonyms, antonyms, homonyms, and other lexical items.
Overall word sense disambiguation.
Relationship extraction from the different entities identified from the text.
There are several things you can utilise semantic analysis for in SEO. Here are some examples:
Topic modeling and classification – sort your page content into topics (predefined or modelled by an algorithm).
You can then use this for ML-enabled internal linking, where you link pages together on your website using the
identified topics. Topic modeling can also be used for classifying first-party collected data such as customer
service tickets, or feedback users left on your articles or videos in free form (i.e. comments).
Entity analysis, sentiment analysis, and intent classification – You can use this type of analysis to perform
sentiment analysis and identify intent expressed in the content analysed. Entity identification and sentiment
analysis are separate tasks, and both can be done on things like keywords, titles, meta descriptions, and page content,
but they work best when analysing data like comments, feedback forms, or customer service or social media
interactions. Intent classification can be done on user queries (in keyword research or traffic analysis), but can
also be done in analysis of customer service interactions.
Understanding of the expressed motivations within the text, and its underlying meaning.
Understanding of the relationships between entities and topics mentioned, thematic understanding, and
interactions analysis.
Discourse integration is the fourth phase of NLP: it considers how the meaning of a sentence depends on the sentences that
come before it and shapes the sentences that follow. In SEO, discourse integration and analysis can be used to ensure that
appropriate tense is used, that the relationships expressed in the text make logical sense, and that there is overall coherence
in the text analysed. This can be especially useful for programmatic SEO initiatives or text generation at scale. The analysis
can also be used as part of international SEO localisation, translation, or transcription tasks on large corpora of data.
There are some research efforts to incorporate discourse analysis into systems that detect hate speech (or in the SEO space
for things like content and comment moderation), with this technology being aimed at uncovering intention behind text by
aligning the expression with meaning, derived from other texts. This means that, theoretically, discourse analysis can also
be used for modeling of user intent (e.g search intent or purchase intent) and detection of such notions in texts.
Phase V: Pragmatic analysis
Pragmatic analysis is the fifth and final phase of natural language processing. As the final stage, pragmatic analysis
extrapolates and incorporates the learnings from all other, preceding phases of NLP. Pragmatic analysis involves the
process of abstracting or extracting meaning from the use of language, and translating a text, using the gathered
knowledge from all other NLP steps performed beforehand.
Here are some of the capabilities introduced during this phase:
Information extraction, enabling advanced text understanding functions such as question-answering.
Meaning extraction, which allows for programs to break down definitions or documentation into a more
accessible language.
Understanding of the meaning of the words, and context, in which they are used, which enables conversational
functions between machine and human (e.g. chatbots).
Pragmatic analysis has multiple applications in SEO. One of the most straightforward ones is programmatic SEO and
automated content generation. This type of analysis can also be used for generating FAQ sections on your product, using
textual analysis of product documentation, or even capitalizing on the ‘People Also Ask’ featured snippets by adding an
automatically-generated FAQ section for each page you produce on your site.
LANGUAGE SYNTAX AND STRUCTURE
For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles
governs the way words are combined into phrases, phrases get combined into clauses, and clauses get combined into
sentences. We will be talking specifically about English language syntax and structure in this section. In English,
words usually combine together to form other constituent units. These constituents include words, phrases, clauses, and
sentences. Considering the sentence, “The brown fox is quick and he is jumping over the lazy dog”, it is made up of a
bunch of words, and just looking at the words by themselves doesn’t tell us much.
Adj(ective): Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The
phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ)
beautiful . The POS tag symbol for adjectives is ADJ .
Adv(erb): Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs.
The phrase very beautiful flower has the adverb (ADV) very , which modifies the adjective (ADJ) beautiful ,
indicating the degree to which the flower is beautiful. The POS tag symbol for adverbs is ADV.
Besides the four major categories of parts of speech (nouns, verbs, adjectives, and adverbs), there are other categories that
occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners,
and many others. Furthermore, each POS tag like the noun (N) can be further subdivided into categories like singular
nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).
The process of classifying words and labeling them with POS tags is called parts-of-speech tagging or POS tagging. POS
tags are used to annotate words and depict their POS, which is really helpful for specific analyses, such as narrowing down
on nouns to see which ones are the most prominent, word sense disambiguation, and grammar analysis.
Let us consider both nltk and spacy which usually use the Penn Treebank notation for POS tagging. NLTK and spaCy
are two of the most popular Natural Language Processing (NLP) tools available in Python. You can build chatbots,
automatic summarizers, and entity extraction engines with either of these libraries. While both can theoretically
accomplish any NLP task, each one excels in certain scenarios. The Penn Treebank, or PTB for short, is a dataset
maintained by the University of Pennsylvania.
We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we
see, spacy seems to be doing slightly better than nltk.
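As a rough illustration (not part of the original notes), the following minimal Python sketch tags the example sentence with both libraries; it assumes the NLTK tagger data (punkt, averaged_perceptron_tagger) and the spaCy model en_core_web_sm have already been downloaded.

import nltk
import spacy

sentence = "The brown fox is quick and he is jumping over the lazy dog"

# NLTK: tokenize, then tag with Penn Treebank-style tags
nltk_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
print(nltk_tags)    # e.g. [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ...]

# spaCy: the pipeline assigns both fine-grained (tag_) and coarse (pos_) tags
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
print([(token.text, token.tag_, token.pos_) for token in doc])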
Shallow Parsing or Chunking
Based on the hierarchy we depicted earlier, groups of words make up phrases. There are five major categories of phrases:
Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or
object to a verb.
Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are
two forms of verb phrases. One form has the verb components as well as other entities such as nouns, adjectives,
or adverbs as parts of the object.
Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or
qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that
describe or qualify them.
Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical
components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or
phrases.
Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique of analyzing
the structure of a sentence to break it down into its smallest constituents (which are tokens such as words) and group them
together into higher-level phrases. The output of shallow parsing includes the POS tags as well as the phrases (chunks)
identified in a sentence.
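A minimal shallow-parsing sketch in Python, assuming NLTK and its tagger data are installed; the chunk grammar below is a simple illustrative pattern for noun phrases, not the only possible one.

import nltk

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, and one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)   # NP chunks such as (NP The/DT brown/JJ fox/NN) plus the remaining tags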
Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be
used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their
constituents. Every word usually belongs to a specific lexical category and can form the head word of different phrases.
These phrases are formed based on rules called phrase structure rules.
Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the
hierarchy and ordering of the various constituents in the sentences. These rules cater to two things primarily.
They determine what words are used to construct the phrases or constituents.
They determine how we need to order these constituents together.
The generic representation of a phrase structure rule is S → AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B . While there are several rules (refer to Chapter 1, Page 19:
Text Analytics with Python, if you want to dive deeper), the most important rule describes how to divide a sentence or a
clause. The phrase structure rule denotes a binary division for a sentence or a clause as S → NP VP where S is the
sentence or clause, and it is divided into the subject, denoted by the noun phrase ( NP) and the predicate, denoted by the
verb phrase (VP).
A constituency parser can be built based on such grammars/rules, which are usually collectively available as a context-free
grammar (CFG) or phrase-structure grammar. The parser processes input sentences according to these rules and helps
in building a parse tree.
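To make the S → NP VP idea concrete, here is a minimal sketch using NLTK's chart parser with a toy, hand-written context-free grammar (the grammar itself is illustrative and not taken from the notes).

import nltk

# Toy phrase structure rules, including the core rule S -> NP VP
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> DT JJ NN
VP -> VBZ JJ
DT -> 'The'
JJ -> 'brown' | 'quick'
NN -> 'fox'
VBZ -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The brown fox is quick".split()):
    tree.pretty_print()   # prints the constituency parse tree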
Dependency Parsing
In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic
dependencies and relationships between tokens in a sentence. The basic principle behind a dependency grammar is that in
any sentence in the language, all words except one, have some relationship or dependency on other words in the sentence.
The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most
cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.
Considering the sentence “The brown fox is quick and he is jumping over the lazy dog”, if we wanted to draw the
dependency syntax tree for this, we would have a structure in which every word is linked to its head word by a labelled
dependency and the main verb acts as the root (shown as a dependency tree diagram in the original notes).
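As a minimal sketch (assuming spaCy and its en_core_web_sm model are installed), the dependency relations for this sentence can be printed as follows; the exact labels depend on the model used.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

# Each token is linked to its head word by a labelled dependency;
# the root token is its own head
for token in doc:
    print(f"{token.text:<10} {token.dep_:<10} head: {token.head.text}")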
TOKENIZATION
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP
methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers. As tokens are the
building blocks of Natural Language, the most common way of processing the raw text happens at the token level.
Tokens are the building blocks of Natural Language.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words,
characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-
gram characters) tokenization.
The most common way of forming tokens is based on space. Consider the sentence “Never give up”. Assuming space as a
delimiter, the tokenization of the sentence results in 3 tokens – Never, give, up. As each token is a word, it becomes an
example of Word tokenization.
Similarly, tokens can be either characters or subwords. For example, consider the word “smarter”: its character tokens are
s-m-a-r-t-e-r, while its subword tokens are smart-er.
Here, Tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary.
Vocabulary refers to the set of unique tokens in the corpus. Remember that the vocabulary can be constructed by
considering each unique token in the corpus or by considering only the top K most frequently occurring words.
Creating Vocabulary is the ultimate goal of Tokenization.
One of the simplest hacks to boost the performance of the NLP model is to create a vocabulary out of top K frequently
occurring words.
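A minimal Python sketch of building a top-K vocabulary from word tokens, using a small made-up corpus purely for illustration.

from collections import Counter

corpus = ["never give up", "give it your best", "never stop learning"]

# Word tokenization on whitespace, then keep the K most frequent tokens
tokens = [tok for sentence in corpus for tok in sentence.split()]
K = 5
vocabulary = [word for word, _ in Counter(tokens).most_common(K)]
print(vocabulary)   # e.g. ['never', 'give', 'up', 'it', 'your']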
Now, let’s understand the usage of the vocabulary in Traditional and Advanced Deep Learning-based NLP methods.
Traditional NLP approaches such as Count Vectorizer and TF-IDF use the vocabulary as features: each word in the
vocabulary is treated as a unique feature. Word tokenization, however, struggles with Out Of Vocabulary (OOV) words and
can produce very large vocabularies. Character tokenization avoids both issues:
• It handles OOV words, since any word can be spelled out from the character set.
• It also limits the size of the vocabulary. Want to take a guess at the size of the vocabulary? 26, since the
vocabulary contains only the unique set of (lowercase English) characters.
Drawbacks of Character Tokenization
Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are
representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between
the characters to form meaningful words. This brings us to another type of tokenization, known as Subword Tokenization,
which lies in between Word and Character tokenization.
Subword Tokenization
Subword Tokenization splits the piece of text into subwords (or n-gram characters). For example, words like lower can be
segmented as low-er, smartest as smart-est, and so on.
Transformer-based models – the SOTA in NLP – rely on Subword Tokenization algorithms for preparing vocabulary.
Now, we will discuss one of the most popular Subword Tokenization algorithms, known as Byte Pair Encoding (BPE).
Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-based models. BPE addresses the
issues of Word and Character Tokenizers:
• BPE tackles OOV effectively. It segments OOV as subwords and represents the word in terms of these subwords
• The length of input and output sentences after BPE are shorter compared to character tokenization
BPE is a word segmentation algorithm that merges the most frequently occurring character or character sequences
iteratively. Here is a step by step guide to learn BPE.
Steps to learn BPE
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
1a) Append the end-of-word symbol (say </w>) to every word in the corpus and split each word into its characters. Then, in
each iteration, compute the frequency of every character pair or character sequence (step 3), merge the most frequent pair
(step 4), and save the best pair to the vocabulary (step 5). Repeat steps 3 to 5 for the chosen number of iterations; the
worked frequency tables for iterations 1 and 2 appear as figures in the original notes.
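The merge loop can be sketched in a few lines of Python; the toy word-frequency dictionary below (low, lower, newest, widest) is illustrative and not part of the original notes.

import re, collections

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the chosen pair with the merged symbol
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Steps 1-2: words split into characters with the </w> end-of-word symbol
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

# Steps 3-5 repeated for a fixed number of merge iterations
for i in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"Iteration {i + 1}: merged {best}")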
STEMMING
Stemming is the process of reducing the morphological variants of a word to a common root/base word. Stemming
programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words
“chocolates”, “chocolatey”, and “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, and “retrieves” to the
stem “retrieve”. Stemming is
an important part of the pipelining process in Natural language processing. The input to the stemmer is tokenized words.
How do we get these tokenized words? Well, tokenization involves breaking down the document into different words.
Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the
root form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text
pre-processing, and it is commonly used in information retrieval and text mining applications. There are several different
algorithms for stemming as follows:
Porter stemmer
Snowball stemmer
Lancaster stemmer.
The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove
common suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer,
but it also supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer
and it is less accurate than the Porter stemmer and Snowball stemmer.
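A minimal sketch comparing the three stemmers via NLTK (the word list is illustrative); the exact outputs depend on the NLTK version.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["chocolates", "retrieval", "fishing", "likely", "better"]:
    # Each stemmer applies its own suffix-stripping rules
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))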
Stemming can be useful for several natural language processing tasks such as text classification, information retrieval, and
text summarization. However, stemming can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word. It is important to note that stemming is different from
Lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming, it takes into
account the context of the word, and it produces a valid word, unlike stemming which can produce a non-word as the root
form.
Some more examples of words that stem to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
Errors in Stemming:
There are mainly two errors in stemming: over-stemming, where two words with different meanings are stemmed to the
same root, and under-stemming, where two words that should be stemmed to the same root are not.
Applications of stemming:
Stemming is used in information retrieval systems like search engines. It is used to determine domain vocabularies in
domain analysis, and to index documents so that search results can be displayed and documents can be mapped to common
subjects by their stems. Sentiment Analysis, which examines reviews and comments made by different users about
anything, is frequently used for product analysis, such as for online retail stores; stemming is applied as a text-preparation
step before the text is interpreted.
Document clustering (also known as text clustering) is a method of cluster analysis applied to textual materials.
Important uses of it include subject extraction, automatic document organisation, and quick information retrieval.
Fun Fact: Google search adopted word stemming in 2003. Previously a search for “fish” would not have returned
“fishing” or “fishes”.
N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion
of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes: *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on simple string comparisons and is largely language independent.
Limitation: It requires space to create and index the n-grams, and it is not time efficient.
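A small Python sketch of generating the character n-grams used by an n-gram stemmer, with '*' marking the word boundaries as in the example above.

def char_ngrams(word, n=2):
    # Pad the word with '*' so boundary n-grams like *I and S* are produced
    padded = "*" + word + "*"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("INTRODUCTIONS"))   # ['*I', 'IN', 'NT', ..., 'NS', 'S*']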
Snowball Stemmer:
When compared to the Porter Stemmer, the Snowball Stemmer can handle non-English words too. Since it supports other
languages, the Snowball Stemmer can be called a multi-lingual stemmer. The Snowball stemmers are also imported from
the nltk package. This stemmer is based on a string-processing language called ‘Snowball’ and is one of the most widely
used stemmers. The Snowball Stemmer is more aggressive than the Porter Stemmer and is also referred to as the Porter2
Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball Stemmer has greater
computational speed.
Lancaster Stemmer:
The Lancaster Stemmer is more aggressive and dynamic compared to the other two stemmers. It is faster, but the
algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball Stemmer. The
Lancaster Stemmer saves its rules externally and basically uses an iterative algorithm. It is straightforward, although it
often produces results with excessive stemming. Over-stemming renders stems non-linguistic or meaningless.
LEMMATIZATION
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word
down to its root meaning to identify similarities. For example, a lemmatization algorithm would reduce the word better to
its root word, or lemma, good. In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of
the word. There are different algorithms used to find out how many characters have to be chopped off, but the algorithms
don’t actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this
knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word
before reducing it to its root word, or lemma. So, a lemmatization algorithm would know that the word better is
derived from the word good, and hence, the lemma is good. But a stemming algorithm wouldn’t be able to do
the same. There could be over-stemming or under-stemming, and the word better could be reduced to either bet,
or bett, or just retained as better. But there is no way stemming can reduce better to its root word good.
This is the difference between stemming and lemmatization.
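A minimal NLTK sketch contrasting the two (it assumes the WordNet data has been downloaded with nltk.download('wordnet')); WordNet's lemmatizer needs a part-of-speech hint to map better to good.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("better"))                    # stemming leaves 'better' (or a chopped form)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' - the adjective lemma
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'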
Lemmatization gives more context to chatbot conversations as it recognizes words based on their exact and contextual
meaning. On the other hand, lemmatization is a time-consuming and slow process. The obvious advantage of
lemmatization is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chat bot
or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this
accuracy comes at a cost. Because lemmatization involves deriving the meaning of a word from something like a
dictionary, it’s very time-consuming. So, most lemmatization algorithms are slower compared to their stemming
counterparts. There is also a computation overhead for lemmatization, however, in most machine-learning problems,
computational resources are rarely a cause of concern.
REMOVING STOP-WORDS
The words which are generally filtered out before processing a natural language text are called stop words. These are
actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add
much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, and “what”. Stop words
are available in abundance in any human language. By removing these words, we remove the low-level information from
our text in order to give more focus to the important information. In other words, the removal of such words generally
does not have negative consequences for the model we train for our task.
Removal of stop words definitely reduces the dataset size and thus reduces the training time due to the fewer tokens
involved in the training.
We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing
and the goal we want to achieve. For example, if we are training a model that can perform the sentiment analysis task, we
might not remove the stop words.
Movie review: “The movie was not good at all.”
Text after removal of stop words: “movie good”
We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review
became positive, which is not the reality. Thus, the removal of stop words can be problematic here.
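A minimal sketch of stop-word removal with NLTK for the movie review above (it assumes the stopwords and punkt data have been downloaded); note that 'not' is on NLTK's default English stop-word list, which is exactly what causes the problem described here.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

review = "The movie was not good at all."
stop_words = set(stopwords.words("english"))

# Keep only alphabetic tokens that are not stop words
filtered = [w for w in word_tokenize(review.lower())
            if w.isalpha() and w not in stop_words]
print(filtered)   # ['movie', 'good'] - the negation is lost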
Tasks like text classification do not generally need stop words as the other words present in the dataset are more important
and give the general idea of the text. So, we generally remove stop words in such tasks.
In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the removal of stop words. So, think
before performing this step. The catch here is that no rule is universal and no stop words list is universal. A list not
conveying any important information to one task can convey a lot of information to the other task.
Word of caution: Before removing stop words, research a bit about your task and the problem you are trying to solve,
and then make your decision.
Also, as the frequency of stop words is very high, removing them from the corpus results in much smaller data in
terms of size. The reduced size results in faster computations on text data, and the text classification model has to
deal with fewer features, resulting in a more robust model.
Advanced Methods
These methods can also be called vectorized methods, as they aim to map a word, sentence, or document to a fixed-length
vector of real numbers. The goal of these methods is to extract semantics from a piece of text, both lexical and distributional.
Lexical semantics is the meaning reflected by the words themselves, whereas distributional semantics refers to finding
meaning based on how words are distributed in a corpus.
Word2Vec
GloVe: Global Vector for word representation
Fig. Word2Vec vs GloVe
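As a minimal sketch of how such vectors are learned in practice, here is a Word2Vec example using the gensim library (version 4.x API assumed); the tiny corpus is made up for illustration and is far too small to produce meaningful embeddings.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["the", "brown", "fox", "is", "quick"],
             ["the", "lazy", "dog", "sleeps"],
             ["the", "fox", "jumps", "over", "the", "lazy", "dog"]]

# Skip-gram (sg=1) model mapping each word to a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["fox"][:5])            # first few dimensions of the word vector
print(model.wv.most_similar("fox"))   # distributionally similar words in the toy corpus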
BAG OF WORDS MODEL
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval
(IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding
grammar and even word order but keeping multiplicity. A bag-of-words model, or BoW for short, is a way of extracting
features from text for use in modelling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two
things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded.
The model is only concerned with whether known words occur in the document, not where in the document. The intuition
is that documents are similar if they have similar content. Further, that from the content alone we can learn something
about the meaning of the document. The bag-of-words can be as simple or complex as you like. The complexity comes
both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known
words.
One of the biggest problems with text is that it is messy and unstructured, whereas machine learning algorithms prefer
structured, well-defined, fixed-length inputs; by using the Bag-of-Words technique we can convert variable-length texts
into fixed-length vectors.
Also, at a more granular level, machine learning models work with numerical data rather than textual data. So, to be
more specific, by using the bag-of-words (BoW) technique, we convert a text into its equivalent vector of numbers.
Let us see an example of how the bag of words technique converts text into vectors
Example (1) without preprocessing:
Sentence 1: “Welcome to Great Learning, Now start learning”
Sentence 2: “Learning is a good practice”
The tokens in each sentence are:
Sentence 1: Welcome, to, Great, Learning, ",", Now, start, learning
Sentence 2: Learning, is, a, good, practice
Step 1: Go through all the words in the above text and make a list of all of the words in the model vocabulary.
Welcome
To
Great
Learning
,
Now
start
learning
is
a
good
practice
Note that the words ‘Learning’ and ‘learning’ are not the same here because of the difference in their cases and hence are
repeated. Also, note that a comma ‘,’ is also taken in the list. Because we know the vocabulary has 12 words, we can use
a fixed-length document representation of 12, with one position in the vector to score each word.
The scoring method we use here is a simple frequency count: count the occurrences of each vocabulary word in the
sentence and mark 0 for absence. This scoring method is commonly used.
The scoring of sentence 1 would look as follows:
Word Frequency
Welcome 1
to 1
Great 1
Learning 1
, 1
Now 1
start 1
learning 1
is 0
a 0
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,1,1,1,1,1,1,0,0,0,0 ]
Now for sentence 2, the scoring would look like this:
Word Frequency
Welcome 0
to 0
Great 0
Learning 1
, 0
Now 0
start 0
learning 0
is 1
a 1
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,0,1,0,0,0,0,1,1,1,1 ]
Sentence Welcome to Great Learning , Now start learning is a good practice
Sentence1 1 1 1 1 1 1 1 1 0 0 0 0
Sentence2 0 0 0 1 0 0 0 0 1 1 1 1
But is this the best way to perform a bag of words? The above example was not the best illustration of how to use a
bag of words. The words Learning and learning, although having the same meaning, are taken twice. Also, a
comma “,”, which does not convey any information, is included in the vocabulary.
Let us make some changes and see how we can use bag of words in a more effective way.
Step 1: Convert the above sentences in lower case as the case of the word does not hold any information.
Step 2: Remove special characters and stopwords from the text. Stopwords are words that do not contain much
information about the text, like ‘is’, ‘a’, ‘the’, and many more.
After these two steps, the sentences become “welcome great learning now start learning” and “learning good practice”.
Although these sentences do not make much sense, the maximum information is contained in these words only.
Step 3: Go through all the words in the above text and make a list of all of the words in our model vocabulary.
welcome
great
learning
now
start
good
practice
Now as the vocabulary has only 7 words, we can use a fixed-length document-representation of 7, with one position in the
vector to score each word.
The scoring method we use here is the same as used in the previous example. For sentence 1, the count of words is as
follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Writing the above frequencies in the vector
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
For sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Similarly, writing the above frequencies in the vector form
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
Sentence welcome great learning now start good practice
Sentence1 1 1 2 1 1 0 0
Sentence2 0 0 1 0 0 1 1
The approach used in example two is the one that is generally used in the Bag-of-Words technique, the reason being that
the datasets used in machine learning are tremendously large and can contain a vocabulary of a few thousand or even
millions of words. Hence, preprocessing the text before using bag-of-words is the better way to go. There are various other
preprocessing steps that can further increase the performance of Bag-of-Words.
In the examples above we use all the words from vocabulary to form a vector, which is neither a practical way nor the best
way to implement the BoW model. In practice, only a few words from the vocabulary, more preferably the most common
words are used to form the vector.
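In practice the whole pipeline (lowercasing, stop-word removal, vocabulary building, scoring, and keeping only the top K words) is usually delegated to a library. A minimal sketch with scikit-learn's CountVectorizer follows; its built-in tokenizer and stop-word list differ slightly from the manual example above, so the resulting vocabulary may not match it exactly.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Welcome to Great Learning, Now start learning",
          "Learning is a good practice"]

# Lowercasing happens by default; max_features keeps only the most frequent words
vectorizer = CountVectorizer(stop_words="english", max_features=7)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count vector per sentence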
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on
your specific text data, and it has been used with great success on prediction problems like language modeling and
document classification. However, it also has limitations: the vocabulary requires careful design to control its size, the
resulting vectors are sparse, and discarding word order throws away context and meaning. One way to retain some local
word-order information is the bag-of-n-grams model, in which each token is a sequence of n consecutive words (an
n-gram) rather than a single word.
For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”
becomes
<start>James
James is
is the
the best
best person
person ever.
ever.<end>
In a typical bag-of-n-grams model, these 7 bigrams would be a sample from a large number of bigrams observed in a
corpus. And then James is the best person ever. would be encoded in a representation showing which of the corpus’s
bigrams were observed in the sentence. A bag-of-n-grams model has the simplicity of the bag-of-words model but allows
the preservation of more word locality information.
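A minimal sketch of a bag-of-bigrams representation with scikit-learn; note that CountVectorizer lowercases and strips punctuation and does not add the <start>/<end> padding used above, so it yields 5 bigrams for this sentence.

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) builds features from word bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(["James is the best person ever."])

print(vectorizer.get_feature_names_out())
# ['best person' 'is the' 'james is' 'person ever' 'the best']
print(X.toarray())   # each bigram occurs once in this single sentence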
TF-IDF MODEL
TF-IDF stands for Term Frequency Inverse Document Frequency of records. It can be defined as the calculation of how
relevant a word in a series or corpus is to a text. The meaning increases proportionally to the number of times in the text a
word appears but is compensated by the word frequency in the corpus (data-set).
Terminologies:
Term Frequency: In a document d, the term frequency tf(t, d) represents the number of instances of a given word t.
Therefore, a word becomes more relevant to a document the more often it appears in it, which is rational. Since the
ordering of terms is not significant, we can use a vector to describe the text in the bag-of-words model. For each specific
term in the document, there is an entry with the value being the term frequency.
The weight of a term that occurs in a document is simply proportional to the term frequency, often normalised by the
document length:
tf(t, d) = (number of occurrences of t in d) / (total number of words in d)
Document Frequency: This measures the importance of a term across the whole corpus, and is very similar to TF.
The only difference is that TF is the frequency counter for a term t in a document d, whereas DF counts the
occurrences of the term t across the document set N. In other words, DF is the number of documents in which the
word is present:
df(t) = number of documents containing the term t
Inverse Document Frequency: Mainly, this tests how relevant the word is. The key aim of a search is to locate the
appropriate records that fit the query. Since TF considers all terms equally significant, term frequencies alone
cannot be used to measure the weight of a term in a document. First, find the document frequency of a term t by
counting the number of documents containing that term, as defined above.
Term frequency is the number of instances of a term in a single document only, whereas document frequency is the
number of separate documents in which the term appears, so it depends on the entire corpus. Now let’s look at the
definition of inverse document frequency. The IDF of a term is the number of documents in the corpus divided by the
document frequency of the term:
idf(t) = N / df(t)
This makes a more common word less significant, but this raw ratio is usually too harsh, so we take the logarithm
(commonly with base 2 or base 10) of the inverse document frequency. The idf of the term t then becomes:
idf(t) = log(N / df(t))
Computation: TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a
corpus. TF-IDF is a weighting system that assigns a weight to each word in a document based on its term
frequency (TF) and inverse document frequency (IDF). The words with higher weights are deemed to be more
significant. The score is simply the product of the two quantities:
tf-idf(t, d) = tf(t, d) × idf(t)
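A minimal sketch with scikit-learn's TfidfVectorizer on a toy corpus; note that scikit-learn uses a smoothed, natural-log IDF and L2-normalises each row, so its scores differ numerically from the simple log(N / df) formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the brown fox is quick",
        "the lazy dog is sleeping",
        "the quick fox jumps over the lazy dog"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the corpus vocabulary
print(X.toarray().round(2))                 # one TF-IDF weighted vector per document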
Numerical Example
Imagine a term t appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of t can be
calculated as follows:
TF = 20 / 100 = 0.2
Assume a collection of related documents contains 10,000 documents. If 100 documents out of the 10,000 contain the
term t, the Inverse Document Frequency (IDF) of t can be calculated as follows (using a base-10 logarithm):
IDF = log(10,000 / 100) = log(100) = 2
Using these two quantities, we can calculate the TF-IDF score of the term t for the document:
TF-IDF = 0.2 × 2 = 0.4
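The same arithmetic as a tiny Python check (base-10 logarithm assumed, matching the example above):

import math

tf = 20 / 100                       # term appears 20 times in a 100-word document
idf = math.log10(10_000 / 100)      # 100 of the 10,000 documents contain the term
print(tf, idf, tf * idf)            # 0.2 2.0 0.4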
**********