Natural Language Processing


Module 4

Natural Language Processing


Contents
Natural Language Processing: Introduction,
Syntactic Processing,
Semantic Analysis,
Discourse and Pragmatic Processing,
Statistical NLP,
Spell checking.
Introduction
• Language is meant for communicating about the world.
• Human linguistic communication occurs as speech as well as in writing.
• Processing written language is easier than processing speech.
• Therefore, the language processing problem is divided into two parts:
• Processing written text, using lexical, syntactic and semantic knowledge
of the language as well as the required relevant world information.
• Processing spoken language, using all of the above information plus
additional knowledge about phonology and about noise, and the ability to
handle audio signals.
What is Natural Language Processing
NLP (Natural Language Processing) is the technology used by
machines to understand, analyse, manipulate, and interpret
human languages.

The two processes in NLP are:

NL Understanding (NLU): the process of reading and interpreting language.
It is used to map the given input into a useful representation.
NL Generation (NLG): the process of writing or generating language.
It converts computerized data into a natural language representation.
Components of Natural Language Processing
Morphological Analysis: This is also called lexical analysis. It
involves identifying and analyzing the structure of words. The lexicon of a
language is the collection of words and phrases in that language. Lexical
analysis divides the whole chunk of text into paragraphs, sentences, and
words.
Syntactic Analysis: It involves analysing the words in a sentence for
grammar and arranging them in a manner that shows the relationships
among the words. A sentence such as “The school goes to boy” is rejected
by an English syntactic analyzer.
Semantic Analysis
Semantic analysis assigns meanings to the structures created by the
syntactic analyzer. This component transforms linear sequences of
words into structures and shows how the words are associated with
each other.
Semantics focuses only on the literal meaning of words, phrases,
and sentences. It abstracts only the dictionary meaning, the real
meaning, from the given context.
E.g., “colorless green idea” would be rejected by the semantic
analysis, since “colorless green” does not make any sense.
Discourse Integration: The meaning of an individual sentence may
depend on the sentences that precede it and may influence the
meanings of the sentences that follow it.
Example: “John wanted it. He always had.” Here, it refers to
something in the previous sentence, and John influences the meaning
of the later sentence.
Pragmatic Analysis:
The structure representing what was said is reinterpreted to
determine what was actually meant.
Example: “Do you know what time it is?” This sentence is interpreted
as a request to be told the time.
Syntactic Processing: Grammar & Parsers
There are a number of algorithms researchers have developed for
syntactic analysis. Here the focus is on two methods:

• Grammar: A grammar consists of rewrite rules with a single symbol on
the left-hand side of each rule.

• Top-Down Parser: Begin with the start symbol and apply the
grammar rules forward until the symbols at the terminals of the
tree correspond to the components of the sentence being parsed.
Consider the sentence
“The bird pecks the grains”
Example:
Articles (DET) − a | an | the
Nouns − bird | birds | grain | grains
Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun
= DET N | DET ADJ N
Verbs − pecks | pecking | pecked
Verb Phrase (VP) − NP V | V NP
Adjectives (ADJ) − beautiful | small | chirping
The parse tree breaks down the sentence into structured parts so
that the computer can easily understand and process it by following
certain rules.
According to the grammar rules, if there are two strings, a Noun
Phrase (NP) and a Verb Phrase (VP), then the string formed by an NP
followed by a VP is a sentence.
S → NP VP
NP → DET N | DET ADJ N
VP → V NP

Lexicon −
DET → a | the
ADJ → beautiful | perching
N → bird | birds | grain | grains
V → peck | pecks | pecking
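To make the top-down parsing idea concrete, here is a minimal recursive-descent sketch in Python (a sketch only; the function and variable names are illustrative, not from the slides). It starts from the start symbol S and applies the rewrite rules forward until the words of “the bird pecks the grains” are matched:

# A minimal top-down (recursive-descent) parser for the toy grammar above.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N":   {"bird", "birds", "grain", "grains"},
    "V":   {"peck", "pecks", "pecking"},
}

def parse(symbol, words, pos):
    """Try to expand `symbol` starting at words[pos].
    Return (subtree, next_pos) on success, or None on failure."""
    if symbol in LEXICON:                      # terminal category: match one word
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            return (symbol, words[pos]), pos + 1
        return None
    for rule in GRAMMAR[symbol]:               # non-terminal: try each rewrite rule
        children, p = [], pos
        for part in rule:
            result = parse(part, words, p)
            if result is None:
                break
            subtree, p = result
            children.append(subtree)
        else:                                  # every part of the rule matched
            return (symbol, children), p
    return None

words = "the bird pecks the grains".split()
result = parse("S", words, 0)
if result and result[1] == len(words):
    print(result[0])   # nested tuples forming the parse tree
else:
    print("rejected")

Note that this sketch commits to the first expansion that succeeds at each step; a full top-down parser would backtrack over alternative expansions when a later rule fails.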
Semantic Analysis
• Semantic analysis is the process of understanding the meaning of
natural language, or a given text, while taking into account the context,
the logical structuring of sentences and grammatical roles.
The most important task of semantic analysis is to get the proper
meaning of the sentence.
For example, analyze the sentence “Ram is great.” In this sentence,
the speaker is talking either about Lord Ram or about a person
whose name is Ram.
Therefore, the semantic analyzer is important.
Parts of Semantic Analysis
Semantic Analysis of Natural Language can be classified into two
broad parts:
1. Lexical Semantic Analysis: Lexical semantic analysis involves
understanding the meaning of each word of the text individually. It
basically refers to fetching the dictionary meaning that a word in the
text is intended to carry.
2. Sentence-level Semantic Analysis: Although knowing the
meaning of each word of the text is essential, at the same time
understanding the meaning of the complete text is very important.
Importance of sentence-level semantic analysis:
Consider two sentences:
1. Students love college.
2. College loves students.
• Although sentences 1 and 2 use the same set of root words,
they convey entirely different meanings.
• Hence, under sentence-level semantic analysis, how combinations of
individual words form the meaning of the text is analysed.
Tasks involved in Semantic Analysis
In order to understand the meaning of a sentence, the following are
the major processes involved in semantic analysis:

1. Word Sense Disambiguation
2. Relationship Extraction
Word Sense Disambiguation
• In natural language, the meaning of a word may vary as per its usage in
sentences and the context of the text. Word Sense Disambiguation involves
interpreting the meaning of a word based upon the context of its occurrence in
a text.
For example,
the word ‘bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer of a
tree’;
the word ‘rock’ may mean ‘a stone’ or ‘a genre of music’ – hence, the accurate
meaning of the word is highly dependent upon its context and usage in the text.
• Thus, the ability of a machine to overcome the ambiguity involved in identifying
the meaning of a word based on its usage and context is called Word Sense
Disambiguation.
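As a concrete illustration, below is a minimal WSD sketch using the Lesk algorithm that ships with NLTK (this assumes nltk is installed and the WordNet corpus has been downloaded with nltk.download("wordnet"); the example sentences are my own):

# Word Sense Disambiguation with NLTK's Lesk implementation.
from nltk.wsd import lesk

sent1 = "the dog let out a loud bark at the stranger".split()
sent2 = "the bark of the old tree was rough and cracked".split()

# lesk() picks the WordNet sense whose dictionary definition overlaps
# most with the context words, so the two calls can return different senses.
print(lesk(sent1, "bark"))
print(lesk(sent2, "bark"))

Lesk is a simple overlap heuristic, so the chosen sense is not guaranteed to match intuition on every input.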
Relationship Extraction
• Relationship Extraction involves identifying the various entities present
in a sentence and then extracting the relationships between
those entities, as in the sketch below.
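A minimal sketch of this idea uses spaCy's dependency parse to pull out rough subject–verb–object style triples (assuming spaCy and its small English model en_core_web_sm are installed; the sentences and the chosen dependency labels are illustrative):

# Extract rough (subject, relation, object) triples from a dependency parse.
# Setup assumed: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Udemy is a Learning Portal. Bangalore is a City.")

for token in doc:
    if token.dep_ == "ROOT":   # the main verb of each sentence
        subjects = [w.text for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w.text for w in token.rights if w.dep_ in ("dobj", "attr")]
        for s in subjects:
            for o in objects:
                print((s, token.text, o))   # e.g. ('Udemy', 'is', 'Portal')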
Basic Units of Semantic System
• Entity: An entity refers to a particular unit or individual, such
as a person or a location. For example: Udemy, Bangalore,
Nithya Patil, etc.
• Concept: A concept may be understood as a generalization of entities.
It refers to a broad class of individual units. For example: Learning
Portals, City, Students.
• Relations: Relations help establish relationships between various
entities and concepts. For example: ‘Udemy is a Learning Portal’,
‘Bangalore is a City’, ‘Nithya Patil is a student’, etc.
• Predicate: Predicates represent the verb structures of the sentences.
In Meaning Representation, all these basic units are used to represent
textual information.
Approaches to Meaning Representations
• First-order predicate logic (FOPL)
• Semantic Nets
• Frames
• Conceptual Dependency (CD)
• Rule-based architecture
• Case Grammar
• Conceptual Graphs
Conceptual Dependency (CD) / Conceptual Parsing
• CD provides the structures and the primitives from which the
representation of the language can be built.
• The CD representation of a sentence is built not from the words but
from conceptual primitives, which help in understanding the
intended meaning of the words.
CD Primitives
ATRANS is used to represent a transfer such as "give" or "take".
PTRANS is used to act on locations such as "move" or "go".
MTRANS represents mental acts such as "tell", etc.
MOVE represents the movement of body parts by their owner, such as
“kick”.
PROPEL represents the application of physical force to an object such as
“push”
MBUILD represents making new information such as “decide”
INGEST represents taking something inside, such as “eat”.

Example: A sentence such as "John gave a book to Mary" is then
represented as the action of an ATRANS on two real-world objects, John and
Mary.
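As a rough illustration (not from the slides), such an ATRANS structure could be encoded as a small data structure; the field names below are my own:

# A sketch of encoding a CD structure for "John gave a book to Mary".
from dataclasses import dataclass

@dataclass
class CDAct:
    primitive: str   # a CD primitive such as "ATRANS"
    actor: str       # the entity performing the act
    obj: str         # the object the act operates on
    source: str      # where the object starts (the donor)
    recipient: str   # where the object ends up

gave = CDAct(primitive="ATRANS", actor="John", obj="book",
             source="John", recipient="Mary")
print(gave)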
CD categories
ACT – Actions
PP – Picture producers (objects)
AA – Modifiers of actions
PA – Modifiers of PP’s
Spelling Errors
There are three causes of spelling errors:
1. Insertion: An extra letter is inserted while typing.
Example: maximum typed as maxiimum. The extra i causes the error.
2. Deletion: A case of a letter missing or not typed in a word.
Example: Netwrk instead of network.
3. Substitution: Typing a letter in place of the correct one, as in
intellugence, where the letter i has been wrongly substituted by u.
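These three causes map directly onto candidate generation in classic spell correctors (in the style of Norvig's well-known corrector; the function name below is illustrative). The sketch generates every string one edit away from a typo, so the correct word is recovered by undoing an insertion, deletion or substitution:

# Generate all candidate strings one edit away from `word`.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}                    # undo an insertion
    inserts = {L + c + R for L, R in splits for c in letters}        # undo a deletion
    subs = {L + c + R[1:] for L, R in splits if R for c in letters}  # undo a substitution
    return deletes | inserts | subs

print("maximum" in edits1("maxiimum"))           # True: insertion error
print("network" in edits1("netwrk"))             # True: deletion error
print("intelligence" in edits1("intellugence"))  # True: substitution error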
Classification of spelling errors
1. Typographic errors: These errors are caused by mistakes committed
while typing.
Example: Netwrk instead of network
2. Orthographic errors: These errors are due to a lack of understanding of
the language on the part of the user.
Example: arithmatic, wellcome and accomodation.
3. Phonetic errors: These result from poor cognition on the part of the listener.
Example: The word rough could be written as ruff, or listen as lisen.
Spell checking techniques
Spell checking techniques can be broadly classified into three categories:
(a) Non-word error detection: A non-word error occurs due to a spelling
error, where the word itself is not in the dictionary and is not a known word.
For example, mistakenly spelling “apple” as “appll” is a non-word error
because “appll” is not in our dictionary.
The techniques used to detect such errors are
N-gram analysis and
Dictionary look-up.
N-gram analysis:
• An N-gram language model predicts the probability of a given N-gram
within any sequence of words in the language. A good N-gram model can
predict the next word in a sentence, i.e., the value of P(w|h).
• Examples of N-grams are unigrams (“This”, “article”, “is”, “on”,
“NLP”) and bigrams (‘This article’, ‘article is’, ‘is on’, ‘on NLP’).
• For example, suppose we want to calculate the probability of the
last word being “NLP” given the previous words:
• P(NLP | this article is on). If the probability is above 0.5, then the last
word is predicted to be NLP; otherwise it is not.
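A minimal sketch of how such probabilities can be estimated by counting bigrams over a toy corpus (the corpus and function names are illustrative; real models use large corpora and smoothing):

# Estimate P(next_word | previous_word) from bigram counts.
from collections import Counter, defaultdict

corpus = "this article is on NLP . this article is short .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def prob(prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

print(prob("article", "is"))  # 1.0: "article" is always followed by "is"
print(prob("is", "on"))       # 0.5: "is" is followed by "on" half the time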
Dictionary Lookup Technique:
The dictionary lookup technique checks every word of the input text
for its presence in the dictionary.
If a word is present in the dictionary, it is a correct word.
Otherwise it is put into the list of error words.
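A minimal sketch of dictionary lookup (the toy dictionary is illustrative; a real checker would load a full word list):

# Flag every input word that is absent from the dictionary.
DICTIONARY = {"the", "bird", "pecks", "grains", "apple", "network"}

def find_error_words(text):
    return [w for w in text.lower().split() if w not in DICTIONARY]

print(find_error_words("the bird pecks the grains"))  # []
print(find_error_words("the bird pecks the grians"))  # ['grians']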
(b) Isolated-word error correction: This process focuses on the
correction of an isolated non-word by finding its nearest
meaningful word, and makes an attempt to rectify the error. It
thus transforms the word soper into super without looking at the
context.
The minimum edit distance between two strings str1 and str2 is defined as the minimum number
of insert/delete/substitute operations required to transform str1 into str2.

For example, if str1 = "ab" and str2 = "abc", then an insert operation of the character 'c' on str1
transforms str1 into str2. Therefore, the edit distance between str1 and str2 is 1.

The edit distance can equally be calculated as the number of operations required to transform
str2 into str1. For the above example, if we perform a delete operation of the character 'c' on
str2, it is transformed into str1, resulting in the same edit distance of 1.
Looking at another example, if str1 = "INTENTION" and str2 =
"EXECUTION", then the minimum edit distance between str1 and
str2 turns out to be 5, with all operations performed on str1. The
sketch below reproduces both results.
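A minimal dynamic-programming sketch of minimum edit distance (assuming unit cost for insert, delete and substitute operations):

# Minimum edit distance (Levenshtein distance) by dynamic programming.
def edit_distance(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of str1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of str2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # characters match: no cost
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete
                                   dp[i][j - 1],      # insert
                                   dp[i - 1][j - 1])  # substitute
    return dp[m][n]

print(edit_distance("ab", "abc"))               # 1
print(edit_distance("INTENTION", "EXECUTION"))  # 5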
(c) Context-dependent error detection and correction:
In addition to detecting errors, this process tries to find out whether
the corrected word fits into the context of the sentence.
This process is more complex and requires additional
resources compared with the previous processes.
Discourse and Pragmatic Processing
• Discourse integration:
Discourse is written or spoken communication.
The meaning of any sentence depends upon the meaning of the
sentence just before it.
In addition, it also shapes the meaning of the immediately
succeeding sentence.
Discourse integration deals with how the immediately preceding sentence
can affect the interpretation of the next sentence.
Example: Bill had a red balloon. John wanted it.
Here, the interpretation of it depends upon the prior discourse context.
• Pragmatic analysis:
Pragmatics is the practical usage of language: what a sentence means in
practice. It deals with using and understanding sentences in different
situations and with how the interpretation of the sentence is affected.
It deals with outside world knowledge, which means knowledge that is
external to the documents and/or queries.
What was described is reinterpreted to determine what was actually meant.
It involves deriving those aspects of language which require real-world
knowledge.
Example: Close the window. This sentence should be interpreted as a request
rather than an order.
Reinterpretation is done by applying a set of rules that characterize
cooperative dialogues.
Statistical NLP
Statistical NLP is the process of predicting the next word in a sequence given
the words that precede it.

Statistical modeling helps to:
• Suggest auto-completes
• Recognize handwriting with lexical acquisition, even in a poorly written
text
• Detect and correct spelling errors
• Recognize speech
• Recognize multi-token named entities
• Caption images
Statistical models are used in NLP for two reasons:
• To make algorithms for processing language able to learn from
observations of language (and other contextual clues); this is called
machine learning. There is also an alternative, called expert systems;
however, it does not scale, because it is not feasible for engineers to
write down all of the “rules” for “understanding” text.
• Natural language relies on referential and prototypical context to
disambiguate its precise meaning. In comparison with rule-based
systems, statistical models are best suited to this type of
situation-dependent inference and to corpus-based work.
Applications of Statistical NLP:
• Time-series: Frequency-domain methods and time-domain methods.
• Survival analysis: Analysis of the expected duration of time until one
or more events happen, such as a death in biological organisms and
failure in mechanical systems.
• Market segmentation: Dividing a broad market into groups of
customers with similar characteristics.
• Recommendation systems: The filtering system predicts the ‘rating’
or ‘preference’ that a user would give to an item.
• Scoring: Statistics processing to predict the outcome and assign it a
corresponding score.
Question Bank
1. Define NLP. Explain the components of NLP.
2. Explain grammar and parsers in the syntactic analysis of NLP with an
example.
3. Explain the two processes involved in semantic analysis.
4. Explain Conceptual Dependency, with the primitives and examples used in
semantic analysis.
5. Explain Discourse and Pragmatic Processing in NLP.
6. Explain statistical NLP.
7. Explain the classification of spelling errors and the causes of
spelling errors.
8. Explain the spell checking techniques used in NLP.
