CL Unit 1
1] What is NLP?
Linguistics: is concerned with language, its formation, syntax, meaning, and different kinds of
phrases (e.g., noun and verb phrases).
Computer Science: is concerned with applying linguistic knowledge by transforming it into
computer programs, with the help of sub-fields such as Artificial Intelligence (Machine Learning
and Deep Learning).
Natural Language Processing (NLP) lies at the intersection of computer science, linguistics, and
machine learning.
It is an aspect of Artificial Intelligence that helps computers understand, interpret, and utilize
human languages.
The field focuses on communication between computers and humans in natural language:
NLP is all about making computers understand and generate human language.
Advantages of NLP
NLP helps users ask questions about any subject and get a direct response within seconds.
NLP helps computers communicate with humans in their own languages.
Many companies use NLP to improve the efficiency and accuracy of documentation processes
and to identify information in large databases.
NLP is used to process raw, unstructured data from online sources.
It helps build a deep understanding of broad natural language.
It helps computers understand the semantics (meaning) of tokens used in natural languages.
It provides an easy way of communicating via language translation and generation of new text.
NLP also provides computers with the ability to read text, hear speech, and interpret it.
So, for building NLP systems, it’s important to include all of a word’s possible meanings and all
possible synonyms.
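As an illustration of enumerating a word's senses and synonyms, here is a minimal sketch using WordNet through NLTK (assuming the WordNet corpus has been downloaded; the word 'bank' is just an example):
import nltk
from nltk.corpus import wordnet as wn
# Requires the WordNet data; download once if missing:
# nltk.download('wordnet')
# Each synset is one possible meaning (sense) of the word.
for synset in wn.synsets('bank'):
    # Lemma names within a synset are synonyms for that sense.
    print(synset.name(), '->', [lemma.name() for lemma in synset.lemmas()])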
3. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations. For example, in "I saw the man with the telescope", the telescope may belong to
either the speaker or the man.
5. Language differences
In the United States, most people speak English, but if you’re thinking of reaching an international
and/or multicultural audience, you’ll need to provide support for multiple languages.
6. Training data
At its core, NLP is all about analyzing language to better understand it. A human being must be
immersed in a language constantly for a period of years to become fluent in it; even the best AI
must also spend a significant amount of time reading, listening to, and utilizing a language. The
abilities of an NLP system depend on the training data provided to it. If you feed the system bad or
questionable data, it’s going to learn the wrong things, or learn in an inefficient way.
P, a set of productions of the form a -> b, where a is a non-terminal and b is a sequence of one or
more symbols from T ∪ V (T is the set of terminal symbols and V is the set of variables, also
called non-terminal symbols).
Example:
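A minimal sketch of such productions, written with NLTK's CFG class (the toy grammar below is illustrative, not from the notes):
import nltk
# Each line is a production a -> b: S, NP, VP, Det, N, V are non-terminals,
# and the quoted strings are terminal symbols from T.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'ate'
""")
# Parse a sentence with the grammar to check that the productions derive it.
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'cat', 'ate', 'the', 'mouse']):
    print(tree)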
Q10] Discuss the need for Regular expressions in NLP with an example.
Many linguistic processing tasks involve pattern matching. For example, we can find
words ending with ed using endswith('ed').
Regular expressions give us a more powerful and flexible method for describing
the character patterns we are interested in.
To use regular expressions in Python, we need to import the re library:
>>> import re
Assuming wordlist holds a list of lowercase English words (for example, built from the NLTK
words corpus), we can list the words ending in ed:
>>> [w for w in wordlist if re.search('ed$', w)]
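The same search can be written with endswith('ed'), but regular expressions also express patterns that simple string methods cannot; for instance (continuing with the same wordlist, pattern chosen for illustration):
>>> [w for w in wordlist if re.search('^..j..t..$', w)]
This finds all eight-letter words whose third letter is j and whose sixth letter is t.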
Example:
Extracting Word Pieces
The re.findall() (“find all”) method finds all (non-overlapping) matches of the given
regular expression. Let’s find all the vowels in a word, then count them:
>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16
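findall also works with quantified patterns; for example, runs of two or more vowels in the same word:
>>> re.findall(r'[aeiou]{2,}', word)
['ia', 'iou']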
Q11] Discuss N-grams and their variations such as Bigrams and Trigrams.
N-grams are contiguous sequences of words, symbols, or tokens in a document, i.e., neighboring
sequences of items in the text.
They are used most prominently in NLP (Natural Language Processing) tasks that deal with text
data.
N-gram models are widely used in statistical natural language processing, speech recognition
(over phonemes and phoneme sequences), machine translation, predictive text input, and many
other tasks for which the modeling inputs are n-gram distributions.
N-grams are defined as the contiguous sequence of n items that can be extracted from a given
sample of text or speech.
The items can be letters, words, or base pairs, according to the application.
The n-grams are typically collected from a text or speech corpus (usually a large text corpus).
N-grams can also be seen as sets of co-occurring words within a given window, computed by
moving the window forward k words at a time (k can be 1 or more).
The co-occurring words are called "n-grams", where "n" says how many words are considered in
constructing each n-gram.
Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four
words, 5-grams are five words, etc.
Example:
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
# word_tokenize relies on the Punkt tokenizer models; download once if missing:
# nltk.download('punkt')
sentence = "The big cat ate the little mouse who was after fresh cheese"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# Generate trigrams
trigrams = list(ngrams(tokens, 3))
print(trigrams)
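Since the modeling inputs mentioned above are n-gram distributions, here is a minimal sketch of turning the n-grams into a frequency distribution with collections.Counter (self-contained, reusing the same example sentence):
from collections import Counter
from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = "The big cat ate the little mouse who was after fresh cheese"
tokens = word_tokenize(sentence)
# An n-gram distribution is simply a frequency count over the n-grams.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))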