Natural Language Processing


NATURAL LANGUAGE PROCESSING

UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1 txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models
PART 2: Finding the Structure of Documents: (chapter 2 txtbk 1)
1. Introduction
2. Methods
3. Complexity of the Approaches
4. Performances of the Approaches

UNIT - II: Syntax Analysis: (chapter 3 txtbk 1)


1. Parsing Natural Language
2. Treebanks: A Data-Driven Approach to Syntax
3. Representation of Syntactic Structure
4. Parsing Algorithms
5. Models for Ambiguity Resolution in Parsing
6. Multilingual Issues

UNIT - III: Semantic Parsing: (chapter 4 txtbk 1)


1. Introduction
2. Semantic Interpretation
3. System Paradigms
4. Word Sense Systems Software

UNIT - IV: (chapter 4 txtbk 1)


1. Predicate-Argument Structure
2. Meaning Representation
3. Systems Software

UNIT - V:
PART 1: Discourse Processing: (chapter 6 txtbk 2)
1. Cohesion
2. Reference Resolution
3. Discourse Cohesion and Structure
PART 2: Language Modelling: (chapter 5 txtbk 1)
1. Introduction
2. N-Gram Models
3. Language Model Evaluation
4. Parameter Estimation
5. Language Model Adaptation
6. Types of Language Models
7. Language-Specific Modeling Problems
8. Multilingual and Cross-Lingual Language Modeling

TEXTBOOK 1:
Multilingual Natural Language Processing Applications: From Theory to Practice –
Daniel M. Bikel and Imed Zitouni, Pearson Publication



UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1 txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models

Finding the Structure of Words:


In natural language processing (NLP), finding the structure of words involves breaking down
words into their constituent parts and identifying the relationships between those parts. This
process is known as morphological analysis, and it helps NLP systems understand the structure
of language.

There are several ways to find the structure of words in NLP, including:

1. Tokenization: This involves breaking a sentence or document into individual words or tokens, which can then be analysed further.
2. Stemming and Lemmatization: These techniques involve reducing words to their
base or root form, which can help identify patterns and relationships between
words.
3. Part-of-Speech Tagging: This involves labelling each word in a sentence with its part
of speech, such as noun, verb, adjective, or adverb.
4. Parsing: This involves analysing the grammatical structure of a sentence by
identifying its constituent parts, such as subject, object, and predicate.
5. Named Entity Recognition: This involves identifying and classifying named
entities in text, such as people, organisations, and locations.
6. Dependency Parsing: This involves analysing the relationships between words in a
sentence and identifying which words depend on or modify other words.

By finding the structure of words in text, NLP systems can perform a wide range of tasks,
such as machine translation, text classification, sentiment analysis, and information
extraction.

1. Words and Their Components:


In natural language processing (NLP), words are analysed by breaking them down into
smaller units called components or morphemes. The analysis of words and their components
is important for various NLP tasks such as stemming, lemmatization, part-of-speech tagging,
and sentiment analysis.

There are two main types of morphemes:

1. Free Morphemes: These are standalone words that can convey meaning on their own,
such as "book," "dog," or "happy."



2. Bound Morphemes: These are units of meaning that cannot stand alone but must be
attached to a free morpheme to convey meaning. There are two types of bound
morphemes:
● Prefixes: These are morphemes that are attached to the beginning of a free
morpheme, such as "un-" in "unhappy" or "pre-" in "preview."
● Suffixes: These are morphemes that are attached to the end of a free
morpheme, such as "-ness" in "happiness" or "-ed" in "jumped."

For example, the word "unhappily" has three morphemes: "un-" (a prefix meaning "not"),
"happy" (a free morpheme meaning "feeling or showing pleasure or contentment"), and "-ly"
(a suffix that changes the word into an adverb). By analyzing the morphemes in a word, NLP
systems can better understand its meaning and how it relates to other words in a sentence.

In addition to morphemes, words can also be analyzed by their part of speech, such as noun,
verb, adjective, or adverb. By identifying the part of speech of each word in a sentence, NLP
systems can better understand the relationships between words and the structure of the
sentence.

1.1 Tokens:
In natural language processing (NLP), a token refers to a sequence of characters that
represents a meaningful unit of text. This could be a word, punctuation mark, number, or
other entity that serves as a basic unit of analysis in NLP.

For example, in the sentence "The quick brown fox jumps over the lazy dog," the tokens are
"The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," and "dog." Each of these tokens
represents a separate unit of meaning that can be analyzed and processed by an NLP system.

Here are some additional examples of tokens:

● Punctuation marks, such as periods, commas, and semicolons, are tokens that
represent the boundaries between sentences and clauses.
● Numbers, such as "123" or "3.14," are tokens that represent numeric quantities or
measurements.
● Special characters, such as "@" or "#," can be tokens that represent symbols used
in social media or other online contexts.

Tokens are often used as the input for various NLP tasks, such as text classification,
sentiment analysis, and named entity recognition. In these tasks, the NLP system analyzes
the tokens to identify patterns and relationships between them, and uses this information to
make predictions or draw insights about the text.



In order to analyze and process text effectively, NLP systems must be able to identify and
distinguish between different types of tokens, and understand their relationships to one
another. This can involve tasks such as tokenization, where the text is divided into individual
tokens, and part-of-speech tagging, where each token is assigned a grammatical category
(such as noun, verb, or adjective). By accurately identifying and processing tokens, NLP
systems can better understand the meaning and structure of a text.
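To make this concrete, here is a minimal tokenization sketch in Python using only the standard re module; the regular expression and the example sentence are illustrative choices, not a production-quality tokenizer:

import re

def tokenize(text):
    # Keep words (optionally with an internal apostrophe), numbers, and
    # single punctuation marks as separate tokens.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']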

1.2 Lexemes:
In natural language processing (NLP), a lexeme is a unit of vocabulary that represents a
single concept, regardless of its inflected forms or grammatical variations. It can be thought
of as the abstract representation of a word, with all its possible inflections and variations.

For example, the word "run" has many inflected forms, such as "ran," "running," and "runs."
These inflections are not considered separate lexemes because they all represent the same
concept of running or moving quickly on foot.

In contrast, words that have different meanings, even if they are spelled the same way, are
considered separate lexemes. For example, the word "bank" can refer to a financial
institution or the edge of a river. These different meanings are considered separate lexemes
because they represent different concepts.

Here are some additional examples of lexemes:

● "Walk" and "walked" are inflected forms of the same lexeme, representing the
concept of walking.
● "Cat" and "cats" are inflected forms of the same lexeme, representing the
concept of a feline animal.
● "Bank" and "banking" are derived forms of the same lexeme, representing the
concept of finance and financial institutions.

Lexical analysis involves identifying and categorizing lexemes in a text, which is an important step in many NLP tasks, such as text classification, sentiment analysis, and information retrieval. By identifying and categorizing lexemes, NLP systems can better understand the meaning and context of a text.

Lexical analysis is also used to identify and analyze the morphological and syntactical
features of a word, such as its part of speech, inflection, and derivation.
This information is important for tasks such as stemming, lemmatization, and
part-of-speech tagging, which involve reducing words to their base or root forms and
identifying their grammatical functions.



1.3 Morphemes:

In natural language processing (NLP), morphemes are the smallest units of meaning in a
language. A morpheme is a sequence of phonemes (the smallest units of sound in a language)
that carries meaning. Morphemes can be divided into two types: free morphemes and bound
morphemes.

Free morphemes are words that can stand alone and convey meaning. Examples of free
morphemes include "book," "cat," "happy," and "run."

Bound morphemes are units of meaning that cannot stand alone but must be attached to a
free morpheme to convey meaning. Bound morphemes can be further divided into two types:
prefixes and suffixes.

● A prefix is a bound morpheme that is added to the beginning of a word to change its meaning. For example, the prefix "un-" added to the word "happy" creates the word "unhappy," which means not happy.
● A suffix is a bound morpheme that is added to the end of a word to change its
meaning. For example, the suffix "-ed" added to the word "walk" creates the word
"walked," which represents the past tense of "walk."

Here are some examples of words broken down into their morphemes:

● "unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a
manner of")
● "rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix
indicating the act of doing something)
● "cats" = "cat" (free morpheme) + "-s" (suffix indicating plural form)

By analysing the morphemes in a word, NLP systems can better understand its meaning and
how it relates to other words in a sentence. This can be helpful for tasks such as part-of-
speech tagging, sentiment analysis, and language translation.
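A toy affix-stripping sketch in Python illustrates the idea; the prefix and suffix lists are tiny illustrative samples, and real analyzers also repair spelling changes (for example, "happi" back to "happy"):

# Toy morpheme segmentation by stripping known prefixes and suffixes.
# The affix inventories below are small illustrative samples, not a real lexicon.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ment", "ness", "ly", "ed", "s"]

def segment(word):
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[:-len(s)]
            break
    morphemes.append(word)          # the remaining stem (free morpheme)
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unhappily"))     # ['un-', 'happi', '-ly'] (no spelling repair here)
print(segment("cats"))          # ['cat', '-s']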

1.4 Typology:
In natural language processing (NLP), typology refers to the classification of languages
based on their structural and functional features. This can include features such as word
order, morphology, tense and aspect systems, and syntactic structures.

There are many different approaches to typology in NLP, but a common one is the
distinction between analytic and synthetic languages. Analytic languages have a relatively
simple grammatical structure and tend to rely on word order and prepositions to convey meaning. In contrast, synthetic languages have a more complex
grammatical structure and use inflections and conjugations to indicate tense, number,
and other grammatical features.

For example, English is considered to be an analytic language, as it relies heavily on word order and prepositions to convey meaning. In contrast, Russian is a synthetic language, with
a complex system of noun declensions, verb conjugations, and case markings to convey
grammatical information.

Another example of typology in NLP is the distinction between head-initial and head-final languages. In head-initial languages, the head of a phrase (usually a noun) comes
before its modifiers (adjectives or other nouns). In head-final languages, the head comes
after its modifiers. For example, English is a head-initial language, as in the phrase "red
apple," where "apple" is the head and "red" is the modifier. In contrast, Japanese is a head-
final language, as in the phrase "aka-i ringo" (red apple), where "ringo" (apple) is the head
and "aka-i" (red) is the modifier.

By understanding the typology of a language, NLP systems can better model its grammatical
and structural features, and improve their performance in tasks such as language modelling,
parsing, and machine translation.

2. Issues and Challenges:


Finding the structure of words in natural language processing (NLP) can be a challenging
task due to various issues and challenges. Some of these issues and challenges are:

1. Ambiguity: Many words in natural language have multiple meanings, and it can
be difficult to determine the correct meaning of a word in a particular context.
2. Morphology: Many languages have complex morphology, meaning that words can
change their form based on various grammatical features like tense, gender, and
number. This makes it difficult to identify the underlying structure of a word.
3. Word order: The order of words in a sentence can have a significant impact on the
meaning of the sentence, making it important to correctly identify the relationship
between words.
4. Informal language: Informal language, such as slang or colloquialisms, can be
challenging for NLP systems to process since they often deviate from the standard
rules of grammar.
5. Out-of-vocabulary words: NLP systems may not have encountered a word before,
making it difficult to determine its structure and meaning.
6. Named entities: Proper nouns, such as names of people or organizations, can be
challenging to recognize and structure correctly.



7. Language-specific challenges: Different languages have different structures and
rules, making it necessary to develop language-specific approaches for NLP.
8. Domain-specific challenges: NLP systems trained on one domain may not be
effective in another domain, such as medical or legal language.

Overcoming these issues and challenges requires a combination of linguistic knowledge, machine learning techniques, and careful model design and evaluation.

2.1 Irregularity:
Irregularity is a challenge in natural language processing (NLP) because it refers to words
that do not follow regular patterns of formation or inflection. Many languages have irregular
words that are exceptions to the standard rules, making it difficult for NLP systems to
accurately identify and categorize these words.

For example, in English, irregular verbs such as "go," "do," and "have" do not follow the
regular pattern of adding "-ed" to the base form to form the past tense. Instead, they have
their unique past tense forms ("went," "did," "had") that must be memorized.

Similarly, in English, there are many irregular plural nouns, such as "child" and "foot," that do
not follow the standard rule of adding "-s" to form the plural. Instead, these words have their
unique plural forms ("children," "feet") that must be memorized.

Irregularity can also occur in inflectional morphology, where different forms of a word are
created by adding inflectional affixes. For example, in Spanish, the irregular verb "tener" (to
have) has a unique conjugation pattern that does not follow the standard pattern of other
regular verbs in the language.

To address the challenge of irregularity in NLP, researchers have developed various techniques, including creating rule-based systems that incorporate irregular forms into the
standard patterns of word formation or using machine learning algorithms that can learn to
recognize and categorize irregular forms based on the patterns present in large datasets.

However, dealing with irregularity remains an ongoing challenge in NLP, particularly in languages with a high degree of lexical variation and complex morphological systems.
Therefore, NLP researchers are continually working to improve the accuracy of NLP
systems in dealing with irregularity.
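One common way to handle irregularity is an exception dictionary that is consulted before the regular rule is applied. A minimal Python sketch (the word lists and spelling rules are illustrative only):

# Exception dictionary for irregular past-tense forms; regular verbs fall
# back to the default "-ed" rule. The entries below are illustrative only.
IRREGULAR_PAST = {"go": "went", "do": "did", "have": "had", "be": "was"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:          # irregular: look it up
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):              # simple regular spelling rule
        return verb + "d"
    return verb + "ed"

for v in ["go", "walk", "have", "love"]:
    print(v, "->", past_tense(v))
# go -> went, walk -> walked, have -> had, love -> loved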

2.2 Ambiguity:
Ambiguity is a challenge in natural language processing (NLP) because it refers to situations
where a word or phrase can have multiple possible meanings, making it difficult for NLP
systems to accurately identify the intended meaning. Ambiguity can arise in various forms,
such as homonyms, polysemous words, and syntactic ambiguity.



Homonyms are words that have the same spelling and pronunciation but different meanings.
For example, the word "bank" can refer to a financial institution or the side of a river. This
can create ambiguity in NLP tasks, such as named entity recognition, where the system needs
to identify the correct entity based on the context.

Polysemous words are words that have multiple related meanings. For example, the word
"book" can refer to a physical object or the act of reserving something. In this case, the
intended meaning of the word can be difficult to identify without considering the context in
which the word is used.

Syntactic ambiguity occurs when a sentence can be parsed in multiple ways. For example, the
sentence "I saw her duck" can be interpreted as "I saw the bird she owns" or "I saw her lower
her head to avoid something." In this case, the meaning of the sentence can only be
determined by considering the context in which it is used.

Ambiguity can also occur due to cultural or linguistic differences. For example, the phrase
"kick the bucket" means "to die" in English, but its meaning may not be apparent to non-native
speakers or speakers of other languages.

To address ambiguity in NLP, researchers have developed various techniques, including using contextual information, part-of-speech tagging, and syntactic parsing to disambiguate
words and phrases. These techniques involve analyzing the surrounding context of a word to
determine its intended meaning based on the context. Additionally, machine learning
algorithms can be trained on large datasets to learn to disambiguate words and phrases
automatically. However, dealing with ambiguity remains an ongoing challenge in NLP,
particularly in languages with complex grammatical structures and a high degree of lexical
variation.
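A very small Lesk-style sketch in Python shows the idea of using contextual overlap to disambiguate; the two sense glosses for "bank" are made-up examples, not entries from a real sense inventory:

# Naive Lesk-style disambiguation: pick the sense whose gloss shares the
# most words with the context. The glosses below are toy examples.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river":   "the sloping land beside a body of water such as a river",
}

def disambiguate(context):
    ctx = set(context.lower().split())
    return max(SENSES, key=lambda s: len(ctx & set(SENSES[s].split())))

print(disambiguate("she sat on the bank of the river watching the water"))
# bank/river
print(disambiguate("the bank approved the loan and deposits grew"))
# bank/finance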

2.3 Productivity:
Productivity is a challenge in natural language processing (NLP) because it refers to the
ability of a language to generate new words or forms based on existing patterns or rules. This
can create a vast number of possible word forms that may not be present in dictionaries or
training data, which makes it difficult for NLP systems to accurately identify and categorize
words.

For example, in English, new words can be created by combining existing words, such
as "smartphone," "cyberbully," or "workaholic." These words are formed by combining
two or more words to create a new word with a specific meaning.

Another example is the use of prefixes and suffixes to create new words. For instance, in
English, the prefix "un-" can be added to words to create their opposite meaning, such as
"happy" and "unhappy." The suffix "-er" can be added to a verb to create a noun indicating
the person who performs the action, such as "run" and "runner."



Productivity can also occur in inflectional morphology, where different forms of a word are
created by adding inflectional affixes. For example, in English, the verb "walk" can be
inflected to "walked" to indicate the past tense. Similarly, the adjective "big" can be inflected
to "bigger" to indicate a comparative degree.

These examples demonstrate how productivity can create a vast number of possible word
forms, making it challenging for NLP systems to accurately identify and categorize words.
To address this challenge, NLP researchers have developed various techniques, including
morphological analysis algorithms that use statistical models to predict the likely structure of
a word based on its context. Additionally, machine learning algorithms can be trained on
large datasets to learn to recognize and categorize new word forms.

3. Morphological Models:
In natural language processing (NLP), morphological models refer to computational models
that are designed to analyze the morphological structure of words in a language. Morphology
is the study of the internal structure and the forms of words, including their inflectional and
derivational patterns. Morphological models are used in a wide range of NLP applications,
including part-of-speech tagging, named entity recognition, machine translation, and text-to-
speech synthesis.

There are several types of morphological models used in NLP, including rule-based models,
statistical models, and neural models.

Rule-based models rely on a set of handcrafted rules that describe the morphological
structure of words. These rules are based on linguistic knowledge and are manually created
by experts in the language. Rule-based models are often used in languages with relatively
simple morphological systems, such as English.

Statistical models use machine learning algorithms to learn the morphological structure of
words from large datasets of annotated text. These models use probabilistic models, such as
Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), to predict the
morphological features of words. Statistical models are more accurate than rule-based models
and are used in many NLP applications.

Neural models, such as recurrent neural networks (RNNs) and transformers, use deep
learning techniques to learn the morphological structure of words. These models have
achieved state-of-the-art results in many NLP tasks and are particularly effective in languages
with complex morphological systems, such as Arabic and Turkish.

In addition to these models, there are also morphological analyzers, which are tools that can
automatically segment words into their constituent morphemes and provide additional
information about the inflectional and derivational properties of each morpheme.
Morphological analyzers are widely used in machine translation and information retrieval applications, where they can improve the accuracy of these systems by providing more precise linguistic information about the words in a text.

3.1 Dictionary Lookup:


Dictionary lookup is one of the simplest forms of morphological modeling used in NLP. In
this approach, a dictionary or lexicon is used to store information about the words in a
language, including their inflectional and derivational forms, parts of speech, and other
relevant features. When a word is encountered in a text, the dictionary is consulted to retrieve
its properties.

Dictionary lookup is effective for languages with simple morphological systems, such as
English, where most words follow regular patterns of inflection and derivation. However, it
is less effective for languages with complex morphological systems, such as Arabic, Turkish,
or Finnish, where many words have irregular forms and the inflectional and derivational
patterns are highly productive.

To improve the accuracy of dictionary lookup, various techniques have been developed, such
as:

● Lemmatization: This involves reducing inflected words to their base or dictionary form, also known as the lemma. For example, the verb "running" would be
lemmatized to "run". This helps to reduce the size of the dictionary and make it
more manageable.
● Stemming: This involves reducing words to their stem or root form, which is similar
to the lemma but not always identical. For example, the word "jumping" would be
stemmed to "jump". This can help to group related words together and reduce the size
of the dictionary.
● Morphological analysis: This involves analyzing the internal structure of words and
identifying their constituent morphemes, such as prefixes, suffixes, and roots. This
can help to identify the inflectional and derivational patterns of words and make it
easier to store them in the dictionary.

Dictionary lookup is a simple and effective way to handle morphological analysis in NLP
for languages with simple morphological systems. However, for more complex languages,
it may be necessary to use more advanced morphological models, such as rule-based,
statistical, or neural models.
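A miniature dictionary-lookup sketch in Python; the lexicon entries and the UNKNOWN fallback are illustrative assumptions, and a real system would use a much larger curated lexicon:

# A miniature lexicon mapping surface forms to (lemma, part of speech,
# morphological features).
LEXICON = {
    "running": ("run", "VERB", {"form": "gerund"}),
    "ran":     ("run", "VERB", {"tense": "past"}),
    "feet":    ("foot", "NOUN", {"number": "plural"}),
    "cats":    ("cat", "NOUN", {"number": "plural"}),
}

def lookup(word):
    entry = LEXICON.get(word.lower())
    if entry is None:
        return (word, "UNKNOWN", {})    # out-of-vocabulary fallback
    return entry

print(lookup("Feet"))   # ('foot', 'NOUN', {'number': 'plural'})
print(lookup("dogs"))   # ('dogs', 'UNKNOWN', {})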

3.2 Finite-State Morphology:


Finite-state morphology is a type of morphological modeling used in natural language
processing (NLP) that is based on the principles of finite-state automata. It is a rule-based
approach that uses a set of finite-state transducers to generate and recognize words in a
language.



In finite-state morphology, words are modeled as finite-state automata that accept a set of
strings or sequences of symbols, which represent the morphemes that make up the word.
Each morpheme is associated with a set of features that describe its properties, such as its part
of speech, gender, tense, or case.

The finite-state transducers used in finite-state morphology are designed to perform two main
operations: analysis and generation. In analysis, the transducer takes a word as input and
breaks it down into its constituent morphemes, identifying their features and properties. In
generation, the transducer takes a sequence of morphemes and generates a word that
corresponds to that sequence, inflecting it for the appropriate features and properties.

Finite-state morphology is particularly effective for languages with regular and productive
morphological systems, such as Turkish or Finnish, where many words are generated through
inflectional or derivational patterns. It can handle large morphological paradigms with high
productivity, such as the conjugation of verbs or the declension of nouns, by using a set of
cascading transducers that apply different rules and transformations to the input.

One of the main advantages of finite-state morphology is that it is efficient and fast, since it
can handle large vocabularies and morphological paradigms using compact and optimized
finite-state transducers. It is also transparent and interpretable, since the rules and
transformations used by the transducers can be easily inspected and understood by linguists
and language experts.

Finite-state morphology has been used in various NLP applications, such as machine
translation, speech recognition, and information retrieval, and it has been shown to be
effective for many languages and domains. However, it may be less effective for languages
with irregular or non-productive morphological systems, or for languages with complex
syntactic or semantic structures that require more sophisticated linguistic analysis.
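The sketch below imitates the behaviour of a very small finite-state analyzer for the pattern stem + (s | ed | ing); the stem list and feature labels are toy data, and a real system would compile such rules into finite-state transducers rather than loop over split points:

# Toy analyzer equivalent to a small two-state machine: accept a known stem
# followed by one of a fixed set of suffix arcs. All data here is illustrative.
STEMS = {"walk", "jump", "talk"}
SUFFIX_FEATURES = {"": "base form", "s": "3rd person singular present",
                   "ed": "past tense", "ing": "present participle"}

def analyze(word):
    # Try each split point; accept if the left part is a known stem and the
    # right part is a suffix the "machine" has an arc for.
    for i in range(len(word), 0, -1):
        stem, suffix = word[:i], word[i:]
        if stem in STEMS and suffix in SUFFIX_FEATURES:
            return {"stem": stem, "suffix": suffix,
                    "feature": SUFFIX_FEATURES[suffix]}
    return None    # the automaton rejects the word

print(analyze("walked"))   # {'stem': 'walk', 'suffix': 'ed', 'feature': 'past tense'}
print(analyze("walkly"))   # None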

3.3 Unification-Based Morphology:


Unification-based morphology is a type of morphological modeling used in natural
language processing (NLP) that is based on the principles of unification and feature-based
grammar. It is a rule-based approach that uses a set of rules and constraints to generate
and recognize words in a language.

In unification-based morphology, words are modeled as a set of feature structures, which are
hierarchically organized representations of the properties and attributes of a word. Each
feature structure is associated with a set of features and values that describe the word's
morphological and syntactic properties, such as its part of speech, gender, number, tense, or
case.



The rules and constraints used in unification-based morphology are designed to perform two
main operations: analysis and generation. In analysis, the rules and constraints are applied to
the input word and its feature structure, in order to identify its morphemes, their properties,
and their relationships. In generation, the rules and constraints are used to construct a feature
structure that corresponds to a given set of morphemes, inflecting the word for the
appropriate features and properties.

Unification-based morphology is particularly effective for languages with complex and irregular morphological systems, such as Arabic or German, where many words are generated
through complex and idiosyncratic patterns. It can handle rich and detailed morphological
and syntactic structures, by using a set of constraints and agreements that ensure the
consistency and coherence of the generated words.

One of the main advantages of unification-based morphology is that it is flexible and expressive, since it can handle a wide range of linguistic phenomena and constraints, by
using a set of powerful and adaptable rules and constraints. It is also modular and extensible,
since the feature structures and the rules and constraints can be easily combined and reused
for different tasks and domains.

Unification-based morphology has been used in various NLP applications, such as text-to-
speech synthesis, grammar checking, and machine translation, and it has been shown to be
effective for many languages and domains. However, it may be less efficient and scalable
than other morphological models, since the unification and constraint-solving algorithms
can be computationally expensive and complex.
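The core operation can be illustrated with a minimal unification function over flat feature structures represented as Python dictionaries; real systems unify recursive, typed feature structures, so this is only a sketch:

# Minimal unification of flat feature structures: two structures unify if
# they agree on every shared feature; the result is their merge.
def unify(fs1, fs2):
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None                 # feature clash -> unification fails
        result[feature] = value
    return result

noun     = {"cat": "N", "number": "plural", "lemma": "cat"}
plural   = {"number": "plural", "case": "nominative"}
singular = {"number": "singular"}

print(unify(noun, plural))
# {'cat': 'N', 'number': 'plural', 'lemma': 'cat', 'case': 'nominative'}
print(unify(noun, singular))   # None (number clash)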

3.4 Functional Morphology:


Functional morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of functional and cognitive linguistics. It is a usage-based approach that emphasizes the functional and communicative aspects of language, and seeks to model the ways in which words are used and interpreted in context.

In functional morphology, words are modeled as units of meaning, or lexemes, which are associated with a set of functions and communicative contexts. Each lexeme is composed of a set of abstract features that describe its semantic, pragmatic, and discursive properties, such as its thematic roles, discourse status, or information structure.

The functional morphology model seeks to capture the relationship between the form and meaning of words, by analyzing the ways in which the morphological and syntactic structures of words reflect their communicative and discourse functions. It emphasizes the role of context and discourse in the interpretation of words, and seeks to explain the ways in which words are used and modified in response to the communicative needs of the speaker and the listener.



Functional morphology is particularly effective for modeling the ways in which words are inflected, derived, or modified in response to the communicative and discourse context, such as in the case of argument structure alternations or pragmatic marking. It can handle the complexity and variability of natural language, by focusing on the functional and communicative properties of words, and by using a set of flexible and adaptive rules and constraints.

One of the main advantages of functional morphology is that it is usage-based and corpus-driven, since it is based on the analysis of natural language data and usage patterns. It is also compatible with other models of language and cognition, such as construction grammar and cognitive linguistics, and can be integrated with other NLP techniques, such as discourse analysis and sentiment analysis.

Functional morphology has been used in various NLP applications, such as text classification, sentiment analysis, and language generation, and it has been shown to be effective for many languages and domains. However, it may require large amounts of annotated data and computational resources, in order to model the complex and variable patterns of natural language use and interpretation.

3.5 Morphology Induction:


Morphology induction is a type of morphological modeling used in natural language
processing (NLP) that is based on the principles of unsupervised learning and statistical
inference. It is a data-driven approach that seeks to discover the underlying morphological
structure of a language, by analyzing large amounts of raw text data.

In morphology induction, words are analyzed as sequences of characters or sub-word units, which are assumed to represent the basic building blocks of the language's morphology. The task of morphology induction is to group these units into meaningful morphemes, based on their distributional properties and statistical patterns in the data.

Morphology induction can be approached through various unsupervised learning algorithms, such as clustering, probabilistic modeling, or neural networks. These algorithms use a set of
heuristics and metrics to identify the most probable morpheme boundaries and groupings,
based on the frequency, entropy, or coherence of the sub-word units in the data.

Morphology induction is particularly effective for modeling the morphological structure of languages with agglutinative or isolating morphologies, where words are composed of multiple morphemes with clear boundaries and meanings. It can also handle the richness and complexity of the morphology of low-resource and under-studied languages, where annotated data and linguistic resources are scarce.

One of the main advantages of morphology induction is that it is unsupervised and data-
driven, since it does not require explicit linguistic knowledge or annotated data. It can also be
easily adapted to different languages and domains, by using different data sources and
feature representations.

Morphology induction has been used in various NLP applications, such as machine
translation, information retrieval, and language modeling, and it has been shown to be
effective for many languages and domains. However, it may produce less accurate and
interpretable results than other morphological models, since it relies on statistical patterns and
does not capture the full range of morphological and syntactic structures in the language.
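A toy, unsupervised-flavoured sketch of suffix induction in Python: count candidate word endings over a small word list and keep the frequent ones as induced morpheme candidates. The word list and the frequency cut-off are illustrative assumptions:

from collections import Counter

# Count candidate suffixes (lengths 1-3) and keep those occurring at least
# three times as induced morpheme candidates.
words = ["walked", "talked", "jumped", "walking", "talking",
         "jumps", "walks", "talks", "played", "playing"]

suffix_counts = Counter()
for w in words:
    for k in range(1, 4):
        suffix_counts[w[-k:]] += 1

induced = [s for s, c in suffix_counts.items() if c >= 3]
print(sorted(induced, key=len, reverse=True))
# ['ing', 'ed', 'ng', 'd', 'g', 's'] -- real systems add further statistics
# to prune spurious candidates such as 'ng' and single letters.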

PART 2: Finding the Structure of Documents:


1. Introduction
2. Methods
3. Complexity of the Approaches
4. Performances of the Approaches

Finding the Structure of Documents:


1. Introduction:
Finding the structure of documents in natural language processing (NLP) refers to the process
of identifying the different components and sections of a document, and organizing them in a
hierarchical or linear structure. This is a crucial step in many NLP tasks, such as information
retrieval, text classification, and summarization, as it allows for a more accurate and effective
analysis of the document's content and meaning.

There are several approaches to finding the structure of documents in NLP, including:

1. Rule-based methods: These methods rely on a set of predefined rules and heuristics
to identify the different structural elements of a document, such as headings,
paragraphs, and sections. For example, a rule-based method might identify a section
heading based on its font size, position, or formatting.
2. Machine learning methods: These methods use statistical and machine learning
algorithms to automatically learn the structural patterns and features of a document,
based on a training set of annotated data. For example, a machine learning method
might use a support vector machine (SVM) classifier to identify the different sections
of a document based on their linguistic and structural features.
3. Hybrid methods: These methods combine rule-based and machine learning
approaches, in order to leverage the strengths of both. For example, a hybrid
method might use a rule-based algorithm to identify the headings and sections of a document, and then use a machine learning algorithm to classify the content of each section.

Some of the specific techniques and tools used in finding the structure of documents in
NLP include:

1. Named entity recognition: This technique identifies and extracts specific entities,
such as people, places, and organizations, from the document, which can help in
identifying the different sections and topics.
2. Part-of-speech tagging: This technique assigns a part-of-speech tag to each word in the
document, which can help in identifying the syntactic and semantic structure of the
text.
3. Dependency parsing: This technique analyzes the relationships between the words in
a sentence, and can be used to identify the different clauses and phrases in the text.
4. Topic modeling: This technique uses unsupervised learning algorithms to identify
the different topics and themes in the document, which can be used to organize the
content into different sections.

Finding the structure of documents in NLP is a complex and challenging task, as it requires
the analysis of multiple linguistic and non-linguistic cues, as well as the use of domain-
specific knowledge and expertise. However, it is a critical step in many NLP applications,
and can greatly improve the accuracy and effectiveness of the analysis and interpretation of
the document's content.

1.1 Sentence Boundary Detection:


Sentence boundary detection is a subtask of finding the structure of documents in NLP that
involves identifying the boundaries between sentences in a document. This is an important
task, as it is a fundamental step in many NLP applications, such as machine translation, text
summarization, and information retrieval.

Sentence boundary detection is a challenging task due to the presence of ambiguities and
irregularities in natural language, such as abbreviations, acronyms, and names that end with a
period. To address these challenges, several methods and techniques have been developed for
sentence boundary detection, including:

1. Rule-based methods: These methods use a set of pre-defined rules and heuristics to
identify the end of a sentence. For example, a rule-based method may consider a
period followed by a whitespace character as an
end-of-sentence marker, unless the period is part of an abbreviation.
2. Machine learning methods: These methods use statistical and machine learning
algorithms to learn the patterns and features of sentence boundaries based on a training set of annotated data. For example, a machine learning method
may use a support vector machine (SVM) classifier to identify the boundaries
between sentences based on linguistic and contextual features, such as the length of
the sentence, the presence of quotation marks, and the part-of-speech of the last
word.
3. Hybrid methods: These methods combine the strengths of rule-based and machine
learning approaches, in order to leverage the advantages of both. For example, a
hybrid method may use a rule-based algorithm to identify most sentence boundaries,
and then use a machine learning algorithm to correct any errors or exceptions.

Some of the specific techniques and tools used in sentence boundary detection include:

1. Regular expressions: These are patterns that can be used to match specific
character sequences in a text, such as periods followed by whitespace characters,
and can be used to identify the end of a sentence.
2. Hidden Markov Models: These are statistical models that can be used to identify
the most likely sequence of sentence boundaries in a text, based on the probabilities
of different sentence boundary markers.
3. Deep learning models: These are neural network models that can learn complex
patterns and features of sentence boundaries from a large corpus of text, and can be
used to achieve state-of-the-art performance in sentence boundary detection.

Sentence boundary detection is an essential step in many NLP tasks, as it provides the
foundation for analyzing and interpreting the structure and meaning of a document. By
accurately identifying the boundaries between sentences, NLP systems can more effectively
extract information, generate summaries, and perform other language-related tasks.
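A small rule-based splitter in Python illustrates the approach, including the abbreviation problem mentioned above; the regular expression and the abbreviation list are small illustrative samples, not a complete solution:

import re

# Rule-based sentence splitter: split after ., ! or ? followed by whitespace
# and an upper-case letter, unless the period belongs to a known abbreviation.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "U.S."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue                      # not a real boundary, keep scanning
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! The meeting began."))
# ['Dr. Smith arrived.', 'He was late!', 'The meeting began.']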

1.2 Topic Boundary Detection:


Topic boundary detection is another important subtask of finding the structure of documents
in NLP. It involves identifying the points in a document where the topic or theme of the text
shifts. This task is particularly useful for organizing and summarizing large amounts of text,
as it allows for the identification of different topics or subtopics within a document.

Topic boundary detection is a challenging task, as it involves understanding the underlying semantic structure and meaning of the text, rather than simply identifying specific markers or
patterns. As such, there are several methods and techniques that have been developed to
address this challenge, including:



1. Lexical cohesion: This method looks at the patterns of words and phrases that appear
in a text, and identifies changes in the frequency or distribution of these patterns as
potential topic boundaries. For example, if the frequency of a particular keyword or
phrase drops off sharply after a certain point in the text, this could indicate a shift in
topic.
2. Discourse markers: This method looks at the use of discourse markers, such as
"however", "in contrast", and "furthermore", which are often used to signal a change
in topic or subtopic. By identifying these markers in a text, it is possible to locate
potential topic boundaries.
3. Machine learning: This method involves training a machine learning model to
identify patterns and features in a text that are associated with topic boundaries. This
can involve using a variety of linguistic and contextual features, such as sentence
length, word frequency, and part-of-speech tags, to identify potential topic boundaries.

Some of the specific techniques and tools used in topic boundary detection include:

1. Latent Dirichlet Allocation (LDA): This is a probabilistic topic modeling technique that can be used to identify topics within a corpus of text. By analyzing
the distribution of words within a text, LDA can identify the most likely topics
and subtopics within the text, and can be used to locate topic boundaries.
2. TextTiling: This is a technique that involves breaking a text into smaller segments, or
"tiles", based on the frequency and distribution of key words and phrases. By
comparing the tiles to each other, it is possible to identify shifts in topic or subtopic,
and locate potential topic boundaries.
3. Coh-Metrix: This is a text analysis tool that uses a range of linguistic and discourse-
based features to identify different aspects of text complexity, including topic
boundaries. By analyzing the patterns of words, syntax, and discourse in a text, Coh-
Metrix can identify potential topic boundaries, as well as provide insights into the
overall structure and organization of the text.

Topic boundary detection is an important task in NLP, as it enables more effective organization and analysis of large amounts of text. By accurately identifying topic
boundaries, NLP systems can more effectively extract and summarize information, identify
key themes and ideas, and provide more insightful and relevant responses to user queries.
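A TextTiling-flavoured sketch in Python: compare the vocabularies of adjacent sentences with cosine similarity and treat a dip as a candidate topic boundary. The example sentences, the sentence-sized blocks, and the raw word counts are simplifying assumptions:

from collections import Counter
import math

def cosine(c1, c2):
    # Cosine similarity between two word-count vectors stored as Counters.
    shared = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

sentences = [
    "the cat chased the mouse",
    "the cat slept on the mat",
    "stock prices fell sharply today",
    "stock markets fell again amid fears",
]

similarities = [round(cosine(Counter(sentences[i].split()),
                             Counter(sentences[i + 1].split())), 2)
                for i in range(len(sentences) - 1)]

print(similarities)   # [0.67, 0.0, 0.37] -- the dip to 0.0 suggests a topic
                      # boundary between the second and third sentences.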

2. Methods:
There are several methods and techniques used in NLP to find the structure of documents,
which include:



1. Sentence boundary detection: This involves identifying the boundaries between
sentences in a document, which is important for tasks like parsing, machine
translation, and text-to-speech synthesis.
2. Part-of-speech tagging: This involves assigning a part of speech (noun, verb,
adjective, etc.) to each word in a sentence, which is useful for tasks like parsing,
information extraction, and sentiment analysis.
3. Named entity recognition: This involves identifying and classifying named entities
(such as people, organizations, and locations) in a document, which is important for
tasks like information extraction and text categorization.
4. Coreference resolution: This involves identifying all the expressions in a text that
refer to the same entity, which is important for tasks like information extraction
and machine translation.
5. Topic boundary detection: This involves identifying the points in a document where
the topic or theme of the text shifts, which is useful for organizing and summarizing
large amounts of text.
6. Parsing: This involves analyzing the grammatical structure of sentences in a
document, which is important for tasks like machine translation,
text-to-speech synthesis, and information extraction.
7. Sentiment analysis: This involves identifying the sentiment (positive, negative, or
neutral) expressed in a document, which is useful for tasks like brand monitoring,
customer feedback analysis, and market research.

There are several tools and techniques used in NLP to perform these tasks, including
machine learning algorithms, rule-based systems, and statistical models. These tools can be
used in combination to build more complex NLP systems that can accurately analyze and
understand the structure and content of large amounts of text.

2.1 Generative Sequence Classification Methods:

Generative sequence classification methods are a type of NLP method used to find the
structure of documents. These methods involve using probabilistic models to classify
sequences of words into predefined categories or labels.

One popular generative sequence classification method is Hidden Markov Models (HMMs).
HMMs are statistical models that can be used to classify sequences of words by modeling the
probability distribution of the observed words given a set of hidden states. The hidden states
in an HMM can represent different linguistic features, such as part-of-speech tags or named
entities, and the model can be trained using labeled data to learn the most likely sequence of
hidden states for a given sequence of words.

Conditional Random Fields (CRFs) are often discussed alongside HMMs for the same sequence classification tasks, although strictly speaking they are discriminative rather than generative models: they model the conditional probability of a sequence of labels given a sequence of words directly, which makes them more flexible in taking into account complex features and dependencies between labels.

Both HMMs and CRFs can be used for tasks like part-of-speech tagging, named entity
recognition, and chunking, which involve classifying sequences of words into predefined
categories or labels. These methods have been shown to be effective in a variety of NLP
applications and are widely used in industry and academia.
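To illustrate how an HMM assigns labels to a word sequence, here is a tiny Viterbi decoder in Python; the states and all probability tables are made-up illustrative numbers rather than values estimated from a real corpus:

# A tiny Viterbi decoder for a toy HMM POS tagger.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},
           "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
           "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit_p = {"DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
          "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
          "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    # table[t][s] = (best probability of reaching state s at time t, backpointer)
    table = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            prob, prev = max(
                (table[t - 1][p][0] * trans_p[p][s] * emit_p[s][words[t]], p)
                for p in states)
            column[s] = (prob, prev)
        table.append(column)
    # Follow backpointers from the best final state.
    best = max(states, key=lambda s: table[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(table[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']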

2.2 Discriminative Local Classification Methods:


Discriminative local classification methods are another type of NLP method used to find the
structure of documents. These methods involve training a model to classify each individual
word or token in a document based on its features and the context in which it appears.

One popular example of a discriminative local classification method is Conditional Random Fields (CRFs). CRFs are discriminative models: they model the conditional probability of a sequence of labels given a sequence of features, without making assumptions about the underlying distribution of the data. CRFs have been used for tasks such as named entity recognition, part-of-speech tagging, and chunking.

Another example of a discriminative local classification method is Maximum Entropy Markov Models (MEMMs), which are similar to CRFs but use maximum entropy modeling to make
predictions about the next label in a sequence given the current label and features. MEMMs
have been used for tasks such as speech recognition, named entity recognition, and machine
translation.

Other discriminative local classification methods include support vector machines (SVMs),
decision trees, and neural networks. These methods have also been used for tasks such as
sentiment analysis, topic classification, and document categorization.

Overall, discriminative local classification methods are useful for tasks where it is necessary
to classify each individual word or token in a document based on its features and context.
These methods are often used in conjunction with other NLP techniques, such as sentence
boundary detection and parsing, to build more complex NLP systems for document analysis
and understanding.
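A minimal per-token classifier sketch, assuming scikit-learn is installed: each token is described by a few local features and classified independently with logistic regression. The feature set, the tag inventory, and the tiny training sentences are all invented for illustration:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, i):
    # Simple local features for the token at position i.
    return {"word": tokens[i].lower(),
            "is_capitalized": tokens[i][0].isupper(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>"}

train_sents = [(["John", "lives", "in", "Paris"], ["PER", "O", "O", "LOC"]),
               (["Mary", "visited", "London"],    ["PER", "O", "LOC"])]

X, y = [], []
for tokens, labels in train_sents:
    for i, label in enumerate(labels):
        X.append(features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["Anna", "lives", "in", "London"]
X_test = vec.transform([features(test, i) for i in range(len(test))])
print(list(clf.predict(X_test)))
# expected on this toy data: ['PER', 'O', 'O', 'LOC'] (results may vary with
# so little training data)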

2.3 Discriminative Sequence Classification Methods:


Discriminative sequence classification methods are another type of NLP method used to find
the structure of documents. These methods involve training a model to predict the label or
category for a sequence of words in a document, based on the features of the sequence and
the context in which it appears.



One popular example of a discriminative sequence classification method is the Maximum
Entropy Markov Model (MEMM). MEMMs are a type of discriminative model that can
predict the label or category for a sequence of words in a document, based on the features of
the sequence and the context in which it appears. MEMMs have been used for tasks such as
named entity recognition, part-of-speech tagging, and text classification.

Another example of a discriminative sequence classification method is Conditional Random Fields (CRFs), mentioned earlier. CRFs are discriminative models: they model the conditional probability of a sequence of labels given a sequence of features, without making assumptions about the underlying distribution of the data. CRFs have been used for tasks such as named entity recognition, part-of-speech tagging, and chunking.

Hidden Markov Models (HMMs), mentioned earlier as generative models, are also used for the same sequence labeling tasks; with discriminative training they can be made to estimate the probability of a label sequence given the observed features more directly. HMMs have been used for tasks such as speech recognition, named entity recognition, and part-of-speech tagging.

Overall, discriminative sequence classification methods are useful for tasks where it is
necessary to predict the label or category for a sequence of words in a document, based on the
features of the sequence and the context in which it appears. These methods have been shown
to be effective in a variety of NLP applications and are widely used in industry and academia.

2.4 Hybrid Approaches:


Hybrid approaches to finding the structure of documents in NLP combine multiple methods
to achieve better results than any one method alone. For example, a hybrid approach might
combine generative and discriminative models, or combine different types of models with
different types of features.

One example of a hybrid approach is the use of Conditional Random Fields (CRFs) and
Support Vector Machines (SVMs) for named entity recognition. CRFs are used to model the
dependencies between neighboring labels in the sequence, while SVMs are used to model the
relationship between the input features and the labels.

Another example of a hybrid approach is the use of a rule-based system in combination with machine learning models for sentence boundary detection. The rule-based system
might use heuristics to identify common sentence-ending punctuation, while a machine
learning model might be trained on a large corpus of text to identify less common patterns.

Hybrid approaches can also be used to combine different types of features in a model. For
example, a model might use both lexical features (such as the words in the sequence) and syntactic features (such as the part-of-speech tags of the words) to predict
the labels for a sequence.

Overall, hybrid approaches are useful for tasks where a single method may not be sufficient
to achieve high accuracy. By combining multiple methods, hybrid approaches can take
advantage of the strengths of each method and achieve better performance than any one
method alone.

2.5 Extensions for Global Modeling for Sentence Segmentation:


Extensions for global modeling for sentence segmentation in NLP involve using algorithms
that analyze an entire document or corpus of documents to identify sentence boundaries,
rather than analyzing sentences in isolation. These methods can be more effective in situations
where sentence boundaries are not clearly indicated by punctuation, or where there are other
sources of ambiguity.

One example of an extension for global modeling for sentence segmentation is the use of
Hidden Markov Models (HMMs). HMMs are statistical models that can be used to identify
patterns in a sequence of observations. In the case of sentence segmentation, the
observations are the words in the document, and the model tries to identify patterns that
correspond to the beginning and end of sentences. HMMs can take into account context
beyond just the current sentence, which can improve accuracy in cases where sentence
boundaries are not clearly marked.

Another example of an extension for global modeling is the use of clustering algorithms.
Clustering algorithms group similar sentences together based on features such as the
frequency of certain words or the number of common n-grams. Once sentences are clustered
together, the boundaries between the clusters can be used to identify sentence boundaries.

Additionally, there are also neural network-based approaches, such as the use of
convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for sentence
boundary detection. These models can learn to recognize patterns in the text by analyzing
larger contexts, and can be trained on large corpora of text to improve their accuracy.

Overall, extensions for global modeling for sentence segmentation can be more effective than
local models when dealing with more complex or ambiguous text, and can lead to more
accurate results in certain situations.

3. Complexity of the Approaches:



Finding the structure of documents in natural language processing (NLP) can be a complex
task, and there are several approaches with varying degrees of complexity. Here are a few
examples:

1. Rule-based approaches: These approaches use a set of predefined rules to identify the structure of a document. For instance, they might identify headings based on
font size and style or look for bullet points or numbered lists. While these
approaches can be effective in some cases, they are often limited in their ability to
handle complex or ambiguous structures.
2. Statistical approaches: These approaches use machine learning algorithms to identify
the structure of a document based on patterns in the data. For instance, they might use
a classifier to predict whether a given sentence is a heading or a body paragraph.
These approaches can be quite effective, but they require large amounts of labeled
data to train the model.
3. Deep learning approaches: These approaches use deep neural networks to learn the
structure of a document. For instance, they might use a hierarchical attention
network to identify headings and subheadings, or a
sequence-to-sequence model to summarize the document. These approaches can be
very powerful, but they require even larger amounts of labeled data and significant
computational resources to train.

Overall, the complexity of these approaches depends on the level of accuracy and precision
desired, the size and complexity of the documents being analyzed, and the amount of labeled
data available for training. In general, more complex approaches tend to be more accurate but
also require more resources and expertise to implement.

4. Performances of the Approaches:


The performance of different approaches for finding the structure of documents in natural
language processing (NLP) can vary depending on the specific task and the complexity of
the document. Here are some general trends:

1. Rule-based approaches: These approaches can be effective when the document structure is relatively simple and the rules are well-defined. However, they can
struggle with more complex or ambiguous structures, and require a lot of manual
effort to define the rules.
2. Statistical approaches: These approaches can be quite effective when there is a large
amount of labeled data available for training, and the document structure is relatively
consistent across examples. However, they may struggle with identifying new or
unusual structures that are not well-represented in the training data.



3. Deep learning approaches: These approaches can be very effective in identifying
complex and ambiguous document structures, and can even discover new structures
that were not present in the training data. However, they require large amounts of
labeled data and significant computational resources to train, and can be difficult to
interpret.

In general, the performance of these approaches will depend on factors such as the quality
and quantity of the training data, the complexity and variability of the document structure,
and the specific metrics used to evaluate performance (e.g. accuracy, precision, recall, F1-
score). It's also worth noting that different approaches may be better suited for different sub-
tasks within document structure analysis, such as identifying headings, lists, tables, or section
breaks.

UNIT - II: Syntax Analysis: (chapter 3 txtbk 1)


1. Parsing Natural Language
2. Treebanks: A Data-Driven Approach to Syntax
3.Representation of Syntactic Structure
4. Parsing Algorithms
5. Models for Ambiguity Resolution in Parsing
6.Multilingual Issues

Syntax Analysis:
Syntax analysis in natural language processing (NLP) refers to the process of identifying the
structure of a sentence and its component parts, such as phrases and clauses, based on the
rules of the language's syntax.

There are several approaches to syntax analysis in NLP, including:

1. Part-of-speech (POS) tagging: This involves identifying the syntactic category of each
word in a sentence, such as noun, verb, adjective, etc. This can be done using
machine learning algorithms trained on annotated corpora of text.
2. Dependency parsing: This involves identifying the relationships between words in
a sentence, such as subject-verb or object-verb relationships. This can be done
using a dependency parser, which generates a parse tree that represents the
relationships between words.
3. Constituency parsing: This involves identifying the constituent parts of a
sentence, such as phrases and clauses. This can be done using a
phrase-structure parser, which generates a parse tree that represents the structure of the
sentence.



Syntax analysis is important for many NLP tasks, such as named entity recognition, sentiment
analysis, and machine translation. By understanding the syntactic structure of a sentence,
NLP systems can better identify the relationships between words and the overall structure of
the text, which can be used to extract meaning and perform various downstream tasks.

1. Parsing Natural Language:

In natural language processing (NLP), syntax analysis, also known as parsing, refers to the
process of analyzing the grammatical structure of a sentence in order to determine its
constituent parts, their relationships to each other, and their functions within the sentence.
This involves breaking down the sentence into its individual components, such as nouns,
verbs, adjectives, and phrases, and then analyzing how these components are related to each
other.

There are two main approaches to syntax analysis in NLP: rule-based parsing and statistical
parsing. Rule-based parsing involves the use of a set of pre-defined rules that dictate how the
different parts of speech and phrases in a sentence should be structured and related to each
other. Statistical parsing, on the other hand, uses machine learning algorithms to learn
patterns and relationships in large corpora of text in order to generate parse trees for new
sentences.

Here's an example of how syntax analysis works using a simple sentence:

Sentence: "The cat sat on the mat."

Step 1: Tokenization

The first step is to break the sentence down into its individual words, or tokens:
"The", "cat", "sat", "on", "the", "mat", "."

Step 2: Part of Speech Tagging

Next, each token is assigned a part of speech tag, which indicates its grammatical function in
the sentence:

"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), "the" (determiner), "mat"
(noun), "." (punctuation)

Step 3: Dependency Parsing

Finally, the relationships between the words in the sentence are analyzed using a dependency
parser to create a parse tree. In this example, the parse tree might look something like this:



sat
├── cat
│   └── The
└── on
    └── mat
        └── the

This parse tree shows that "cat" is the subject of the verb "sat," and "mat" is the object

of the preposition "on."

Syntax analysis is a crucial component of many NLP tasks, including machine translation,
text-to-speech conversion, and sentiment analysis. By understanding the grammatical structure
of a sentence, NLP models can more accurately interpret its meaning and generate appropriate
responses or translations.
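
As a quick illustration of Steps 1 and 2, here is a minimal sketch using the NLTK toolkit. The
toolkit choice is an assumption (the notes do not prescribe a library), and the resource names
passed to nltk.download() can differ slightly between NLTK versions.

import nltk

# Tokenizer and tagger models used by word_tokenize and pos_tag
# (resource names may vary slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(tokens)
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]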

2. Treebanks: A Data-Driven Approach to Syntax:

Treebanks are a data-driven approach to syntax analysis in natural language processing
(NLP). They consist of a large collection of sentences, each of which has been manually
annotated with a parse tree that shows the syntactic structure of the sentence. Treebanks are
used to train statistical parsers, which can then automatically analyze new sentences and
generate their own parse trees.

A parse tree is a hierarchical structure that represents the syntactic structure of a sentence.
Each node in the tree represents a constituent of the sentence, such as a noun phrase or a verb
phrase. The edges of the tree represent the relationships between these constituents, such as
subject-verb or verb-object relationships.

Here's an example of a parse tree for the sentence "The cat sat on the mat":

S
├── NP
│   ├── The(D)
│   └── cat(N)
└── VP
    ├── sat(V)
    └── PP
        ├── on(PREP)
        └── NP
            ├── the(D)
            └── mat(N)

This parse tree shows that the sentence is composed of a noun phrase ("The cat") and a verb
phrase ("sat on the mat"), with the verb phrase consisting of a verb ("sat") and a
prepositional phrase ("on the mat"). The noun phrase consists of a determiner ("The") and a
noun ("cat"), and the prepositional phrase consists of a preposition ("on") and a noun phrase
("the mat").

Treebanks can be used to train statistical parsers, which can then automatically analyze new
sentences and generate their own parse trees. These parsers work by identifying patterns in
the treebank data and using these patterns to make predictions about the structure of new
sentences. For example, a statistical parser might learn that a noun phrase is usually followed
by a verb phrase and use this pattern to generate a parse tree for a new sentence.

Treebanks are an important resource in NLP, as they allow researchers and developers to
train and test statistical parsers and other models that rely on syntactic analysis. Some well-
known treebanks include the Penn Treebank and the Universal Dependencies treebanks.
These resources are publicly available and have been used in a wide range of NLP research
and applications.
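
Treebank data of this kind can be inspected directly. The sketch below uses the 10% sample of
the Penn Treebank that ships with NLTK; the toolkit choice is an assumption, not something
prescribed by the notes.

import nltk

nltk.download("treebank", quiet=True)
from nltk.corpus import treebank

# Each parsed sentence is an nltk.Tree built from the manual annotation.
tree = treebank.parsed_sents()[0]
print(tree)          # bracketed constituency structure
tree.pretty_print()  # ASCII drawing of the same tree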

3. Representation of Syntactic Structure:


In natural language processing (NLP), the representation of syntactic structure refers to how
the structure of a sentence is represented in a machine-readable form. There are several
different ways to represent syntactic structure, including
constituency-based representations and dependency-based representations.

Constituency-Based Representations:

1. Constituency-based representations, also known as phrase structure trees, represent the
structure of a sentence as a hierarchical tree structure, with each node in the tree
representing a constituent of the sentence. The nodes are labeled with a grammatical
category such as noun phrase (NP) or verb phrase (VP), and the branches represent
the syntactic relationships between the nodes. Constituency-based representations are
often used in rule-based approaches to parsing.

Here's an example of a constituency-based representation of the sentence "The cat sat on
the mat":

(S
(NP (DT The) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat)))))



This representation shows that the sentence is composed of a noun phrase ("The cat") and a
verb phrase ("sat on the mat"), with the verb phrase consisting of a verb ("sat") and a
prepositional phrase ("on the mat"), and the prepositional phrase consisting of a preposition
("on") and a noun phrase ("the mat").

Dependency-Based Representations:

2. Dependency-based representations represent the structure of a sentence as a directed
graph, with each word in the sentence represented as a node in the graph, and the
relationships between the words represented as directed edges. The edges are labeled
with a grammatical function such as subject (SUBJ) or object (OBJ), and the nodes
are labeled with a part-of-speech tag such as noun (N) or verb (V). Dependency-based
representations are often used in statistical approaches to parsing.

Here's an example of a dependency-based representation of the sentence "The cat sat on the
mat":

sat-V
├── cat-N
└── on-PREP
    └── mat-N

This representation shows that the subject "cat" depends on the verb "sat," and that the
object "mat" depends on the preposition "on."

Both constituency-based and dependency-based representations are used in a variety of NLP
tasks, including machine translation, sentiment analysis, and information extraction. The
choice of representation depends on the specific task and the algorithms used to process the
data.

3.1 Syntax Analysis Using Dependency Graphs:


Syntax analysis using dependency graphs is a popular approach in natural language processing
(NLP). Dependency graphs represent the syntactic structure of a sentence as a directed graph,
where each word is a node in the graph and the relationships between words are represented
as directed edges. The nodes in the graph are labeled with the part of speech of the
corresponding word, and the edges are labeled with the grammatical relationship between the
two words.

Here's an example of a dependency graph for the sentence "The cat sat on the mat":



sat ──subj──► cat ──det──► The
sat ──prep──► on  ──pobj──► mat ──det──► the

In this graph, the word "cat" depends on the word "sat" with a subject relationship, and the
word "mat" depends on the word "on" with a prepositional relationship.

Dependency graphs are useful for a variety of NLP tasks, including named entity recognition,
relation extraction, and sentiment analysis. They can also be used for parsing and syntactic
analysis, as they provide a compact and expressive way to represent the structure of a
sentence.

One advantage of dependency graphs is that they are simpler and more efficient than phrase
structure trees, which can be computationally expensive to build and manipulate. Dependency
graphs also provide a more flexible representation of syntactic structure, as they can easily
capture non-projective dependencies and other complex relationships between words.

Here's another example of a dependency graph for the sentence "I saw the man with the
telescope":

saw ──nsubj──► I
saw ──dobj───► man ──det──► the
man ──prep───► with ──pobj──► telescope ──det──► the

This graph shows that the subject "I" depends on the verb "saw," and that the noun phrase
"the man" depends on the verb "saw" with an object relationship. The prepositional phrase
"with the telescope" modifies the noun phrase "the man," with the word "telescope" being the
object of the preposition "with."



In summary, dependency graphs provide a flexible and efficient way to represent the
syntactic structure of a sentence in NLP. They can be used for a variety of tasks and are a
key component of many state-of-the-art NLP models.
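
As a concrete illustration, the sketch below prints the dependency graph produced by the spaCy
library; the toolkit and the en_core_web_sm model name are assumptions here, since the notes do
not prescribe a parser.

import spacy

# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

# Each token points to its head word through a labeled dependency edge.
for token in doc:
    print(f"{token.text:10s} --{token.dep_}--> {token.head.text}")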

3.2 Syntax Analysis Using Phrase Structure Trees:

Syntax analysis, also known as parsing, is the process of analyzing the grammatical structure
of a sentence to identify its constituent parts and the relationships between them. In natural
language processing (NLP), phrase structure trees are often used to represent the syntactic
structure of a sentence.

A phrase structure tree, also known as a parse tree or a syntax tree, is a graphical
representation of the syntactic structure of a sentence. It consists of a hierarchical structure
of nodes, where each node represents a phrase or a constituent of the sentence.

Here's an example of a phrase structure tree for the sentence "The cat sat on the mat":

S
├── NP
│   ├── Det "The"
│   └── N "cat"
└── VP
    ├── V "sat"
    └── PP
        ├── P "on"
        └── NP
            ├── Det "the"
            └── N "mat"

In this tree, the top-level node represents the entire sentence (S), which is divided into two
subparts: the noun phrase (NP) "The cat" and the verb phrase (VP) "sat on
the mat". The NP is further divided into a determiner (Det) "The" and a noun (N) "cat".



The VP is composed of a verb (V) "sat" and a prepositional phrase (PP) "on the mat", which
itself consists of a preposition (P) "on" and another noun phrase (NP) "the mat".

Here's another example of a phrase structure tree for the sentence "John saw the man with
the telescope":

S
├── NP
│   └── N "John"
└── VP
    ├── V "saw"
    ├── NP
    │   ├── Det "the"
    │   └── N "man"
    └── PP
        ├── P "with"
        └── NP
            ├── Det "the"
            └── N "telescope"

In this tree, the top-level node represents the entire sentence (S), which is divided into a noun
phrase (NP) "John" and a verb phrase (VP) "saw the man with the telescope". The NP is simply a
single noun (N) "John". The VP is composed of a verb (V) "saw", a noun phrase (NP) "the man",
and a prepositional phrase (PP) "with the telescope". The noun phrase "the man" consists of a
determiner (Det) "the" and a noun (N) "man", and the prepositional phrase consists of a
preposition (P) "with" and a noun phrase (NP) "the telescope". Note that this sentence is
syntactically ambiguous: the prepositional phrase could also attach to the noun "man" (the man
who has the telescope) rather than to the verb phrase (seeing by means of the telescope); the
tree above shows the verb-phrase attachment.

Phrase structure trees can be used in NLP for a variety of tasks, such as machine translation,
text-to-speech synthesis, and natural language understanding. By identifying the syntactic
structure of a sentence, computers can more accurately understand its meaning and generate
appropriate responses.

4. Parsing Algorithms:
There are several algorithms used in natural language processing (NLP) for syntax analysis
or parsing, each with its own strengths and weaknesses. Here are three common parsing
algorithms and their examples:

1. Recursive descent parsing: This is a top-down parsing algorithm that starts with the
top-level symbol (usually the sentence) and recursively applies production rules to
derive the structure of the sentence. Each production rule corresponds to a non-
terminal symbol in the grammar, which can be expanded into a sequence of other
symbols. The algorithm selects the first production rule that matches the current
input, and recursively applies it to its right-hand side symbols. This process
continues until a match is found for every terminal symbol in the input.

Example: Consider the following context-free grammar for arithmetic expressions:

E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | num

Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent parsing. The
algorithm would start with the top-level symbol E and apply the first production rule E -> E +
T. It would then recursively apply the production rules for E, T, and F until it reaches the
terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". The resulting parse tree would look like
this:
E
├── E
│   └── T
│       └── F
│           └── num "3"
├── "+"
└── T
    ├── T
    │   └── F
    │       └── num "4"
    ├── "*"
    └── F
        ├── "("
        ├── E
        │   ├── E
        │   │   └── T
        │   │       └── F
        │   │           └── num "5"
        │   ├── "-"
        │   └── T
        │       └── F
        │           └── num "2"
        └── ")"

(A small runnable sketch of this algorithm for the same grammar appears after this list.)

2. Shift-reduce parsing: This is a bottom-up parsing algorithm that starts with the input
tokens and constructs a parse tree by repeatedly shifting a token onto a stack and
reducing a group of symbols on the stack to a single symbol based on the production
rules. The algorithm maintains a parse table that specifies which actions to take based
on the current state and the next input symbol.

Example: Consider the following grammar for simple English sentences:

S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> the | a
N -> man | ball | woman
V -> saw | liked
P -> with | in

Suppose we want to parse the sentence "the man saw a woman with a ball" using shift-reduce
parsing. The algorithm shifts tokens onto the stack one at a time and, whenever the top of the
stack matches the right-hand side of a rule, reduces: "Det N" is reduced to NP, "P NP" to PP,
"NP PP" to NP, "V NP" to VP, and finally "NP VP" to S. The resulting parse tree would look like
this:

S
├── NP
│   ├── Det "the"
│   └── N "man"
└── VP
    ├── V "saw"
    └── NP
        ├── NP
        │   ├── Det "a"
        │   └── N "woman"
        └── PP
            ├── P "with"
            └── NP
                ├── Det "a"
                └── N "ball"

3. Earley parsing: This is a chart parsing algorithm that uses dynamic programming to store partial
parses in a chart, which can be combined to form complete parses.
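
Item 1 above traces recursive descent parsing of the arithmetic grammar by hand. Below is a
small, hypothetical Python sketch of the same idea; note the assumption that the left-recursive
rules (E -> E + T, T -> T * F) are rewritten in iterative form, since a plain recursive descent
parser cannot handle left recursion directly.

import re

def tokenize(text):
    # Numbers and the single-character operators/parentheses of the grammar.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r}")
        self.pos += 1
        return tok

    def parse_E(self):                       # E -> T (('+'|'-') T)*
        node = self.parse_T()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = (op, node, self.parse_T())
        return node

    def parse_T(self):                       # T -> F (('*'|'/') F)*
        node = self.parse_F()
        while self.peek() in ("*", "/"):
            op = self.eat()
            node = (op, node, self.parse_F())
        return node

    def parse_F(self):                       # F -> ( E ) | num
        if self.peek() == "(":
            self.eat("(")
            node = self.parse_E()
            self.eat(")")
            return node
        return ("num", self.eat())

print(Parser(tokenize("3 + 4 * (5 - 2)")).parse_E())
# ('+', ('num', '3'), ('*', ('num', '4'), ('-', ('num', '5'), ('num', '2'))))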

4.1 Shift-Reduce Parsing:


Shift-reduce parsing is a bottom-up parsing algorithm commonly used in natural language
processing (NLP) to generate parse trees from input sentences. It works by incrementally
reducing a stack of symbols to a single non-terminal symbol that matches a production rule.

Here is an example of how shift-reduce parsing can be used to parse the sentence "the cat
chased the mouse" using a simple grammar:

S -> NP VP

NP -> Det N

VP -> V NP

Det -> the

N -> cat | mouse



V -> chased

1. Initialization: We start by initializing an empty stack and an input buffer with the
sentence tokens "the", "cat", "chased", "the", and "mouse".
2. Shifting and reducing: We shift the first token "the" onto the stack and reduce it to
Det using the rule Det -> the. The stack now contains [Det].
3. We shift the next token "cat" and reduce it to N (N -> cat). The stack [Det, N] now
matches the right-hand side of the rule NP -> Det N, so we reduce it to NP. The stack
is now [NP].
4. We shift "chased" and reduce it to V (V -> chased). The stack is now [NP, V]. No rule
matches the top of the stack yet, so we continue shifting.
5. We shift the next token "the" and reduce it to Det. The stack is now [NP, V, Det].
6. We shift the last token "mouse" and reduce it to N. The top of the stack, Det N,
matches NP -> Det N, so we reduce those two symbols to NP. The stack is now
[NP, V, NP].
7. The top of the stack, V NP, matches the rule VP -> V NP, so we reduce it to VP. The
stack is now [NP, VP].
8. The stack NP VP matches the rule S -> NP VP, so we reduce it to S.
9. Completion: The stack now contains only the start symbol S and the input buffer is
empty, so the sentence has been parsed successfully. The final parse tree for the
sentence is:



S
├── NP
│   ├── Det "the"
│   └── N "cat"
└── VP
    ├── V "chased"
    └── NP
        ├── Det "the"
        └── N "mouse"

Note that this example uses a simple grammar and a straightforward parsing process, but more
complex grammars and sentences may require additional steps or different strategies to
achieve a successful parse.
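
For comparison, the sketch below runs the same grammar and sentence through NLTK's built-in
shift-reduce parser; the toolkit choice is an assumption, and passing trace=2 to the constructor
would print every shift and reduce action.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

parser = nltk.ShiftReduceParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    tree.pretty_print()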

4.2 Hypergraphs and Chart Parsing:


Hypergraphs and chart parsing are two related concepts used in natural language processing
(NLP) for syntactic parsing.

Hypergraphs represent a generalization of traditional parse trees, allowing for more complex
structures and more efficient parsing algorithms. A hypergraph consists of a set of nodes
(representing words or phrases in the input sentence) and a set of hyperedges, which connect
nodes and represent higher-level structures. A chart, on the other hand, is a data structure
used in chart parsing to efficiently store and manipulate all
possible partial parses of a sentence.

Here is an example of how chart parsing can be used to parse the sentence "the cat chased the
mouse" using a simple grammar:

S -> NP VP

NP -> Det N

VP -> V NP

Det -> the

N -> cat | mouse

V -> chased

1. Initialization: We start by initializing an empty chart with the length of the input
sentence (5 words) and a set of empty cells representing all possible partial
parses.
2. Scanning: We scan each word in the input sentence and add a corresponding
parse to the chart. For example, for the first word "the", we add a parse for the non-
terminal symbol Det (Det -> the). We do this for each word in the sentence.
3. Predicting: We use the grammar rules to predict possible partial parses for each span
of words in the sentence. For example, we can predict a partial parse for the span (1,
2) (i.e., the first two words "the cat") by applying the rule NP -> Det N to the parses
for "the" and "cat". We add this partial parse to the chart cell for the span (1, 2).
4. Scanning again: We scan the input sentence again, this time looking for matches to
predicted partial parses in the chart. For example, if we predicted a partial parse for
the span (1, 2), we look for a parse for the exact same span in the chart. If we find a
match, we can apply a grammar rule to combine the two partial parses into a larger
parse. For example, if we find a parse for (1, 2) that matches the predicted parse for
NP -> Det N, we can combine them to create a parse for the span (1, 3) and the non-
terminal symbol NP.



5. Combining: We continue to combine partial parses in the chart using grammar rules
until we have a complete parse for the entire sentence.
6. Output: The final parse tree for the sentence is represented by the complete parse
in the chart cell for the span (1, 5) and the non-terminal symbol S.

Chart parsing can be more efficient than other parsing algorithms, such as recursive descent
or shift-reduce parsing, because it stores all possible partial parses in the chart and avoids
redundant parsing of the same span multiple times. Hypergraphs can also be used in chart
parsing to represent more complex structures and enable more efficient parsing algorithms.
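
Here is a minimal sketch of chart parsing with NLTK's ChartParser on the same toy grammar
(again an assumption about tooling); the chart object records every edge, i.e. every partial
parse, exactly once.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
tokens = "the cat chased the mouse".split()

# chart_parse returns the filled chart; parses() reads complete parses off it.
chart = parser.chart_parse(tokens)
print(chart.num_edges(), "edges stored in the chart")
for tree in chart.parses(grammar.start()):
    print(tree)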

4.3 Minimum Spanning Trees and Dependency Parsing:

Dependency parsing is a type of syntactic parsing that represents the grammatical structure of
a sentence as a directed acyclic graph (DAG). The nodes of the graph represent the words of
the sentence, and the edges represent the syntactic relationships between the words.

Minimum spanning tree (MST) algorithms are often used for dependency parsing, as they
provide an efficient way to find the most likely parse for a sentence given a set of syntactic
dependencies.

Here's an example of how an MST algorithm can be used for dependency parsing:

Consider the sentence "The cat chased the mouse". We can represent this sentence as a graph
with nodes for each word and edges representing the syntactic dependencies between them:

We can use a MST algorithm to find the most likely parse for this graph. One popular
algorithm for this is the Chu-Liu/Edmonds algorithm:

1. We first remove all self-loops and multiple edges in the graph. This is because a valid
dependency tree must be acyclic and have only one edge between any two nodes.
2. We then choose a node to be the root of the tree. In this example, we can choose
"chased" to be the root since it is the main verb of the sentence.
3. We then compute the scores for each edge in the graph based on a scoring function
that takes into account the probability of each edge being a valid dependency. The
score function can be based on various linguistic features, such as part-of-speech
tags or word embeddings.



4. We use the MST algorithm to find the tree that maximizes the total score of its edges.
The MST algorithm starts with a set of edges that connect the root node to each of its
immediate dependents, and iteratively adds edges that connect other nodes to the tree.
At each iteration, we select the edge with the highest score that does not create a cycle
in the tree.
5. Once the MST algorithm has constructed the tree, we can assign a label to each
edge in the tree based on the type of dependency it represents (e.g., subject,
object, etc.).

The resulting dependency tree for the example sentence is shown below:

chased
├── cat (subject)
│   └── The (determiner)
└── mouse (object)
    └── the (determiner)

In this tree, each node represents a word in the sentence, and each edge represents a syntactic
dependency between two words.

Dependency parsing can be useful for many NLP tasks, such as information extraction, machine
translation, and sentiment analysis. One advantage of dependency parsing is that it captures
more fine-grained syntactic information than phrase-structure parsing, as it represents the
relationships between individual words rather than just the hierarchical structure of phrases.
However, dependency parsing can be more difficult to perform accurately than phrase-structure
parsing, as it requires more sophisticated algorithms and models to capture the nuances of
syntactic dependencies.
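
The sketch below illustrates the MST idea with the networkx library; the library choice and the
edge scores are assumptions made purely for illustration, since a real graph-based parser would
compute the scores with a trained model.

import networkx as nx

# Candidate dependency edges for "The cat chased the mouse", with made-up scores.
# "ROOT" is an artificial root node, as in the Chu-Liu/Edmonds formulation.
scores = {
    ("ROOT", "chased"): 10, ("ROOT", "cat"): 2,
    ("chased", "cat"): 9,   ("chased", "mouse"): 8,
    ("cat", "The"): 7,      ("mouse", "the"): 7,
    ("cat", "mouse"): 1,    ("mouse", "cat"): 1,
}

G = nx.DiGraph()
for (head, dependent), s in scores.items():
    G.add_edge(head, dependent, weight=s)

# The maximum spanning arborescence is the highest-scoring dependency tree.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))
# e.g. [('ROOT', 'chased'), ('cat', 'The'), ('chased', 'cat'), ('chased', 'mouse'), ('mouse', 'the')]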

5. Models for Ambiguity Resolution in Parsing:


Ambiguity resolution is an important problem in natural language processing (NLP) as many
sentences can have multiple valid syntactic parses. This means that the same sentence can be
represented by multiple phrase structure trees or dependency graphs. Resolving ambiguity is
crucial for many NLP applications, such as machine translation, text-to-speech synthesis, and
information retrieval.

Here are some common models for ambiguity resolution in parsing:

1. Rule-based models: Rule-based models use hand-crafted grammars and rules to
disambiguate sentences. These rules can be based on linguistic knowledge or
heuristics, and can help resolve ambiguity by preferring certain syntactic structures
over others. For example, a rule-based model might prefer a noun phrase followed by
a verb phrase as the primary syntactic structure for a given
sentence.
2. Statistical models: Statistical models use machine learning algorithms to learn
from large corpora of text and make predictions about the most likely syntactic
structure for a given sentence. These models can be based on various features,
such as part-of-speech tags, word embeddings, or contextual information. For
example, a statistical model might learn to associate certain word sequences with
specific syntactic structures.
3. Hybrid models: Hybrid models combine both rule-based and statistical approaches to
resolve ambiguity. These models can use rules to guide the parsing process and
statistical models to make more fine-grained predictions. For example, a hybrid model
might use a rule-based approach to identify the main syntactic structures in a
sentence, and then use a statistical model to disambiguate specific substructures.
4. Neural network models: Neural network models use deep learning techniques to learn
from large amounts of text and make predictions about the most likely syntactic
structure for a given sentence. These models can be based on various neural
architectures, such as recurrent neural networks (RNNs) or transformer models. For
example, a neural network model might use an attention mechanism to learn which
words in a sentence are most relevant for predicting the syntactic structure.
5. Ensemble models: Ensemble models combine the predictions of multiple parsing
models to achieve higher accuracy and robustness. These models can be based on
various techniques, such as voting, weighting, or stacking.
For example, an ensemble model might combine the predictions of a
rule-based model, a statistical model, and a neural network model to improve the
overall accuracy of the parsing system.

Overall, there are many models for ambiguity resolution in parsing, each with its own
strengths and weaknesses. The choice of model depends on the specific application and the
available resources, such as training data and computational power.

5.1 Probabilistic Context-Free Grammars:


Probabilistic context-free grammars (PCFGs) are a popular model for ambiguity resolution in
parsing. PCFGs extend context-free grammars (CFGs) by assigning probabilities to each
production rule, representing the likelihood of generating a certain symbol given its parent
symbol.

PCFGs can be used to compute the probability of a parse tree for a given sentence, which can
then be used to select the most likely parse. The probability of a parse tree is computed by
multiplying the probabilities of its constituent production rules, from the root symbol down
to the leaves. The probability of a sentence is computed by summing the probabilities of all
parse trees that generate the sentence.



Here is an example of a PCFG for the sentence "the cat saw the dog":

S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]

In this PCFG, each production rule is annotated with a probability. For example, the rule NP -
> Det N [0.6] has a probability of 0.6, indicating that a noun phrase can be generated by first
generating a determiner, followed by a noun, with a probability of 0.6.

To parse the sentence "the cat saw the dog" using this PCFG, we can use the CKY algorithm
to generate all possible parse trees and compute their probabilities. The algorithm starts by
filling in the table of all possible subtrees for each span of the sentence, and then combines
these subtrees using the production rules of the PCFG. The final cell in the table represents
the probability of the best parse tree for the entire sentence.

Using the probabilities from the PCFG, the CKY algorithm generates the following parse
tree for the sentence "the cat saw the dog":

S
├── NP
│   ├── Det "the"
│   └── N "cat"
└── VP
    ├── V "saw"
    └── NP
        ├── Det "the"
        └── N "dog"

The probability of this parse tree is computed as follows:


P(S -> NP VP) * P(NP -> Det N) * P(Det -> "the") * P(N -> "cat") * P(VP -> V NP) * P(V ->
"saw") * P(NP -> Det N) * P(Det -> "the") * P(N -> "dog")
= 1.0 * 0.6 * 0.9 * 0.5 * 0.8 * 1.0 * 0.6 * 0.9 * 0.5 = 0.05832

Thus, the probability of this parse tree for the sentence "the cat saw the dog" is 0.05832.
This probability can be compared with the probabilities of all other possible parse trees in
order to select the most likely parse for the sentence.
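
The same grammar can be written down and parsed with NLTK's probabilistic tools, as in the
sketch below (an assumption about tooling). ViterbiParser performs a CKY-style
dynamic-programming search and returns the highest-probability tree together with its
probability.

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6] | N [0.4]
VP -> V NP [0.8] | V [0.2]
Det -> 'the' [0.9] | 'a' [0.1]
N -> 'cat' [0.5] | 'dog' [0.5]
V -> 'saw' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)         # best parse, annotated with its probability
    print(tree.prob())  # 0.05832 (up to floating-point rounding)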



5.2 Generative Models for Parsing:
Generative models for parsing are a family of models that generate a sentence's parse
tree by generating each node in the tree according to a set of probabilistic rules. One
such model is the probabilistic Earley parser.

The Earley parser uses a chart data structure to store all possible parse trees for a sentence.
The parser starts with an empty chart, and then adds new parse trees to the chart as it
progresses through the sentence. The parser consists of three main stages: prediction,
scanning, and completion.

In the prediction stage, the parser generates new items in the chart by applying grammar
rules that can generate non-terminal symbols. For example, if the grammar has a rule S -> NP
VP, the parser would predict the presence of an S symbol in the current span of the sentence
by adding a new item to the chart that indicates that an S symbol can be generated by an NP
symbol followed by a VP symbol.

In the scanning stage, the parser checks whether a word in the sentence can be assigned to a
non-terminal symbol in the chart. For example, if the parser has predicted an NP symbol in the
current span of the sentence, and the word "dog" appears in that span, the parser would add a
new item to the chart that indicates that the NP symbol can be generated by the word "dog".

In the completion stage, the parser combines items in the chart that have the same end
position and can be combined according to the grammar rules. For example, if the parser has
added an item to the chart that indicates that an NP symbol can be generated by the word
"dog", and another item that indicates that a VP symbol can be generated by the word "saw"
and an NP symbol, the parser would add a new item to the chart that indicates that an S
symbol can be generated by an NP symbol followed by a VP symbol.

Here is an example of a probabilistic Earley parser applied to the sentence "the cat saw the
dog":

Grammar:
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]

Initial chart:
0: [S -> * NP VP [1.0], 0, 0]



0: [NP -> * Det N [0.6], 0, 0]
0: [NP -> * N [0.4], 0, 0]
0: [VP -> * V NP [0.8], 0, 0]
0: [VP -> * V [0.2], 0, 0]
0: [Det -> * "the" [0.9], 0, 0]
0: [Det -> * "a" [0.1], 0, 0]
0: [N -> * "cat" [0.5], 0, 0]
0: [N -> * "dog" [0.5], 0, 0]
0: [V -> * "saw" [1.0], 0, 0]

Predicting S:
0: [S -> * NP VP [1.0], 0, 0]
1: [NP -> * Det N [0.6], 0, 0]
1: [NP -> * N [0.4], 0, 0]
1: [VP -> * V NP [0.8], 0, 0]

(The chart continues to be filled in through the scanning and completion stages until the
whole sentence has been parsed.)
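
NLTK also ships a (non-probabilistic) Earley chart parser whose trace output shows the
predict/scan/complete cycle described above. The sketch below is an assumption about tooling
and drops the rule probabilities, since this particular parser works on a plain CFG.

import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'saw'
""")

# trace=1 prints the chart edges as they are predicted, scanned, and completed.
parser = EarleyChartParser(grammar, trace=1)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)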

5.3 Discriminative Models for Parsing:


Discriminative models for parsing are a family of models that predict a sentence's parse tree
by learning to discriminate between different possible trees. One such model is the maximum
entropy Markov model.

The maximum entropy Markov model (MEMM) is a discriminative model that models the
conditional probability of a parse tree given a sentence. The model is trained on a corpus of
labeled sentences and their corresponding parse trees. During training, the model learns a set
of feature functions that map the current state of the parser (i.e., the current span of the
sentence and the partial parse tree constructed so far) to a set of binary features that are
indicative of a particular parse tree. The model then learns the weight of each feature function
using maximum likelihood estimation.

During testing, the MEMM uses the learned feature functions and weights to score each
possible parse tree for the input sentence. The model then selects the parse tree with the
highest score as the final parse tree for the sentence.

Here is an example of a MEMM applied to the sentence "the cat saw the dog":

Features:
F1: current word is "the"
F2: current word is "cat"
F3: current word is "saw"
F4: current word is "dog"
F5: current span is "the cat"
F6: current span is "cat saw"
F7: current span is "saw the"
F8: current span is "the dog"
F9: partial parse tree is "S -> NP VP"

Weights:



F1: 1.2
F2: 0.5
F3: 0.9
F4: 1.1
F5: 0.8
F6: 0.6
F7: 0.7
F8: 0.9
F9: 1.5

Possible parse trees and their scores:

S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 5.7

S -> NP VP
- NP -> N
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 4.9

S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V
- - V -> "saw"
- NP -> Det N
- - Det -> "the"
- - N -> "dog"
Score: 3.5

Selected parse tree:

S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N



- - - Det -> "the"
- - - N -> "dog"
Score: 5.7

In this example, the MEMM generates a score for each possible parse tree and selects the
parse tree with the highest score as the final parse tree for the sentence. The selected parse
tree corresponds to the correct parse for the sentence.
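
The scoring step can be pictured as a weighted sum of binary features. The toy sketch below
reuses the hypothetical F1-F9 weights from the example, but the sets of active features per
candidate tree are invented for illustration and are not meant to reproduce the exact scores
shown above.

# Learned feature weights (taken from the hypothetical example above).
weights = {"F1": 1.2, "F2": 0.5, "F3": 0.9, "F4": 1.1, "F5": 0.8,
           "F6": 0.6, "F7": 0.7, "F8": 0.9, "F9": 1.5}

# Hypothetical sets of features fired by each candidate parse tree.
candidates = {
    "tree_1": ["F1", "F2", "F3", "F8", "F9"],
    "tree_2": ["F2", "F3", "F8", "F9"],
    "tree_3": ["F1", "F2", "F3", "F9"],
}

def score(active_features):
    return sum(weights[f] for f in active_features)

print({name: round(score(feats), 2) for name, feats in candidates.items()})
print("selected:", max(candidates, key=lambda name: score(candidates[name])))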

6. Multilingual Issues:

In natural language processing (NLP), a token is a sequence of characters that represents a
single unit of meaning. In other words, it is a word or a piece of a word that has a specific
meaning within a language. The process of splitting a text into individual tokens is called
tokenization.

However, the definition of what constitutes a token can vary depending on the language being
analyzed. This is because different languages have different rules for how words are
constructed, how they are written, and how they are used in context.

For example, in English, words are typically separated by spaces, making it relatively easy to
tokenize a sentence into individual words. However, in some languages, such as Chinese or
Japanese, there are no spaces between words, and the text must be segmented into individual
units of meaning based on other cues, such as syntax or context.

Furthermore, even within a single language, there can be variation in how words are spelled or
written. For example, in English, words can be spelled with or without hyphens or
apostrophes, and there can be differences in spelling between American English and British
English.

Multilingual issues in tokenization arise because different languages can have different
character sets, which means that the same sequence of characters can represent different words
in different languages. Additionally, some languages have complex morphology, which means
that a single word can have many different forms that represent different grammatical features
or meanings.

To address these issues, NLP researchers have developed multilingual tokenization techniques
that take into account the specific linguistic features of different languages. These techniques
can include using language-specific dictionaries, models, or rules to identify the boundaries
between words or units of meaning in different languages.

6.1 Tokenization, Case, and Encoding:


Tokenization, case, and encoding are all important aspects of natural language processing
(NLP) that are used to preprocess text data before it can be analyzed by machine learning
algorithms. Here are some examples of each:



1. Tokenization:
Tokenization is the process of splitting a text into individual tokens or words. In English, this
is typically done by splitting the text on whitespace and punctuation marks. For example, the
sentence "The quick brown fox jumps over the lazy dog." would be tokenized into the following
list of words: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].

2. Case:
Case refers to the use of upper and lower case letters in text. In NLP, it is often important to
standardize the case of words to avoid treating the same word as different simply because it
appears in different case. For example, the words "apple" and "Apple" should be treated as the
same word.

3. Encoding:
Encoding refers to the process of representing text data in a way that can be processed by
machine learning algorithms. One common encoding method used in NLP is Unicode, which is a
character encoding standard that can represent a wide range of characters from different
languages.

Here is an example of how tokenization, case, and encoding might be applied to a sentence
of text:

Text: "The quick brown fox jumps over the lazy dog."

Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Case: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Encoding: [0x74, 0x68, 0x65, 0x20, 0x71, 0x75, 0x69, 0x63, 0x6b, 0x20, 0x62, 0x72, 0x6f,
0x77, 0x6e, 0x20, 0x66, 0x6f, 0x78, 0x20, 0x6a, 0x75, 0x6d, 0x70, 0x73, 0x20,
0x6f, 0x76, 0x65, 0x72, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6c, 0x61, 0x7a, 0x79, 0x20, 0x64,
0x6f, 0x67, 0x2e]

Note that the encoding is represented in hexadecimal to show the underlying bytes that
represent the text.
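
A rough sketch of the three steps in plain Python is shown below. The naive split-based
tokenizer is an assumption made for illustration only; real tokenizers handle punctuation,
clitics, and language-specific rules far more carefully.

text = "The quick brown fox jumps over the lazy dog."

tokens = text.replace(".", " .").split()   # crude whitespace + punctuation split
lowered = [t.lower() for t in tokens]      # case normalization
encoded = text.lower().encode("utf-8")     # bytes of the lowercased text

print(tokens)
print(lowered)
print(encoded.hex(" "))   # hex bytes; compare with the byte list shown above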

6.2 Word Segmentation:



Word segmentation is one of the most basic tasks in Natural Language Processing (NLP),
and it involves identifying the boundaries between words in a sentence.
However, in some languages, such as Chinese and Japanese, there is no clear spacing or
punctuation between words, which makes word segmentation more challenging.

In Chinese, for example, a sentence like "我喜欢中文" (which means "I like Chinese")
could be segmented in different ways, such as "我 / 喜欢 / 中文" or "我喜欢 / 中文".
Similarly, in Japanese, a sentence like "私は日本語が好きです" (which also means "I like
Japanese") could be segmented in different ways, such as "私は / 日本語が / 好きです" or
"私は日本語 / が好きです".

Here are some examples of the challenges of word segmentation in different languages:

● Chinese: In addition to the lack of spacing between words, Chinese also has a large
number of homophones, which are words that sound the same but have different
meanings. For example, the words "他" (he) and "她" (she) are both pronounced "tā" in
Mandarin, but they are written with different characters.
● Japanese: Japanese also has a large number of homophones, but it also has different
writing systems, including kanji (Chinese characters), hiragana, and katakana. Kanji
can often have multiple readings, which makes word segmentation more complex.
● Thai: Thai has no spaces between words, and it also has no capitalization or
punctuation. In addition, Thai has a unique script with many consonants that can be
combined with different vowel signs to form words.
● Vietnamese: Vietnamese uses the Latin alphabet, but it also has many diacritics
(accent marks) that can change the meaning of a word. In addition, Vietnamese
words can be formed by combining smaller words, which makes word segmentation
more complex.

To address these challenges, NLP researchers have developed various techniques for word
segmentation, including rule-based approaches, statistical models, and neural networks.
However, word segmentation is still an active area of research, especially for low-resource
languages where large amounts of annotated data are not available.
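
For Chinese, segmentation is usually done with a dedicated segmenter. The sketch below uses
the third-party jieba library, which is an assumption about tooling; the exact segmentation can
vary with the library version and its dictionary.

import jieba

print(list(jieba.cut("我喜欢中文")))
# e.g. ['我', '喜欢', '中文']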

6.3 Morphology:
Morphology is the study of the structure of words and how they are formed from smaller
units called morphemes. Morphological analysis is important in many natural language
processing tasks, such as machine translation and speech recognition, because it helps to
identify the underlying structure of words and to disambiguate their meanings.



Here are some examples of the challenges of morphology in different languages:

● Turkish: Turkish has a rich morphology, with a complex system of affixes that can
be added to words to convey different meanings. For example, the word "kitap"
(book) can be modified with different suffixes to indicate things like possession,
plurality, or tense.
● Arabic: Arabic also has a rich morphology, with a complex system of prefixes,
suffixes, and infixes that can be added to words to convey different meanings. For
example, the root "k-t-b" (meaning "write") can be modified with different affixes to
form words like "kitab" (book) and "kataba" (he wrote).
● Finnish: Finnish has a complex morphology, with a large number of cases,
suffixes, and vowel harmony rules that can affect the form of a word. For
example, the word "käsi" (hand) can be modified with different suffixes to
indicate things like possession, location, or movement.
● Swahili: Swahili has a complex morphology, with a large number of prefixes and
suffixes that can be added to words to convey different meanings. For example, the
word "kutaka" (to want) can be modified with different prefixes and suffixes to
indicate things like tense, negation, or subject agreement.

To address these challenges, NLP researchers have developed various techniques for
morphological analysis, including rule-based approaches, statistical models, and neural
networks. However, morphological analysis is still an active area of research, especially for
low-resource languages where large amounts of annotated data are not available.
