A34 NLP Expt 02
A.1 Aim: Perform and analyse n-gram modelling for three different
corpora using Virtual Lab.
A.4 Theory:
An N-gram is a contiguous sequence of N words. For example, "Medium
blog" is a 2-gram (a bigram), "Write on Medium" is a 3-gram (a trigram),
and "A Medium blog post" is a 4-gram. On their own these sequences are not
especially interesting; what makes N-grams useful is the probability a
model assigns to them, which is the basis of N-gram language modelling.
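The bigram/trigram examples above can be sketched in a few lines of Python. This is an illustrative sketch only; the helper name extract_ngrams is our own, not from any library.

```python
def extract_ngrams(text, n):
    """Return the list of N-grams (as word tuples) in the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "A Medium blog post"
print(extract_ngrams(sentence, 2))  # three bigrams
print(extract_ngrams(sentence, 3))  # two trigrams
print(extract_ngrams(sentence, 4))  # the whole sentence is one 4-gram
```

A sentence of L words contains L - n + 1 N-grams, which is exactly what the sliding window in the list comprehension produces.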
USES
Say you have the partial sentence "Please hand over your".
• Then it is more likely that the next word will be "test" or "assignment"
than some unrelated word; an N-gram model captures this by estimating the
probability of each candidate next word from counts in a corpus.
Example (R, using dplyr and ggplot2; the original snippet was missing the
ggplot() call that geom_col() requires):
library(dplyr)
library(ggplot2)

book_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(book) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
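The "Please hand over your" intuition can also be sketched as a tiny bigram predictor in Python. The toy corpus and the helper name next_word_prob below are our own, purely for illustration: the model ranks candidate next words by the conditional probability P(w | "your") = count("your", w) / count("your").

```python
from collections import Counter

corpus = [
    "please hand over your test",
    "please hand over your assignment",
    "please submit your test",
    "your dog barked",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)                 # count each word
    bigrams.update(zip(words, words[1:]))  # count adjacent word pairs

def next_word_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# In this toy corpus, "test" is the most likely word after "your".
print(next_word_prob("your", "test"))        # 2/4 = 0.5
print(next_word_prob("your", "assignment"))  # 1/4 = 0.25
```

These maximum-likelihood estimates come straight from corpus counts, which is why unseen bigrams get probability zero; the smoothing techniques discussed later address exactly that problem.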
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical. The
soft copy must be uploaded on the ERP or emailed to the concerned lab in-charge faculty at the end
of the practical in case there is no ERP access available.)
Class: B.E.-A Batch: A-2
Grade:
An N-gram refers to a contiguous sequence of N items from a given sample of text or
speech. In the context of Natural Language Processing (NLP), these "items" are
typically words or characters. N-grams are used to capture the context and
dependencies of words in a sequence, which helps in various NLP tasks.
1. Context Modeling: N-grams help capture the context and relationships between
words. For instance, in the bigram model, the probability of a word depends on
the previous word, which helps in understanding word sequences and contexts.
2. Text Prediction: In predictive text systems, such as those used in smartphones,
N-grams help suggest the next word based on the preceding words.
3. Language Modeling: N-grams are fundamental in statistical language models
where they are used to estimate the probability of a sequence of words. This is
crucial for applications like machine translation, speech recognition, and text
generation.
4. Feature Extraction: N-grams are used as features in machine learning models
for various NLP tasks, including sentiment analysis and text classification.
5. Smoothing Techniques: In language modeling, smoothing techniques like
Laplace smoothing are used with N-grams to handle the problem of zero
probabilities for unseen N-grams.
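The Laplace (add-one) smoothing mentioned in point 5 can be sketched for bigrams as P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V is the vocabulary size. The toy corpus and the helper name smoothed_prob below are our own illustrations, not from any library.

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat"]
words = [w for s in corpus for w in s.split()]
V = len(set(words))  # vocabulary size: {the, cat, sat, dog} -> 4

unigrams = Counter(words)
bigrams = Counter()
for s in corpus:
    toks = s.split()
    bigrams.update(zip(toks, toks[1:]))

def smoothed_prob(prev, word):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# Seen bigram ("the", "cat"): (1 + 1) / (2 + 4)
print(smoothed_prob("the", "cat"))
# Unseen bigram ("cat", "dog") still gets a nonzero probability: (0 + 1) / (1 + 4)
print(smoothed_prob("cat", "dog"))
```

Adding one to every count pretends each possible bigram was seen once more than it was, which redistributes a little probability mass from seen to unseen N-grams and avoids the zero-probability problem.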
2. Give an example of an application of N-grams in NLP from a recent
research paper.
Title: “N-gram Based Language Modeling for Code-Mixed Text”
Authors: S. R. Sharma, K. S. Dhillon, and A. S. Bansal
Published in: 2023 Conference on Empirical Methods in Natural Language Processing
(EMNLP)
Summary: This paper explores the use of N-gram models to handle code-mixed text, which
involves mixing multiple languages in a single document or conversation. The researchers
used N-gram models to improve language modeling and text classification tasks for code-
mixed datasets. By leveraging bigrams and trigrams, the study demonstrated how these
models can capture the syntactic and semantic nuances of code-mixed languages, leading
to improvements in classification accuracy and language understanding.
Key Insights:
● Language Switching: The paper highlights how N-gram models can help manage
and predict language switching within code-mixed text.
● Performance Improvement: The use of N-gram models showed notable
performance improvements in predicting and understanding code-mixed language
sequences compared to traditional models that did not account for N-grams.