NLP Unit-1

INTRODUCTION TO

NATURAL LANGUAGE PROCESSING


Syllabus:
Introduction: Natural Language Processing, Why is NLP hard?, Programming languages vs. natural languages, Are natural languages regular?, Finite automata for NLP, Stages of NLP, Challenges and Issues (Open Problems) in NLP
Basics of text processing: Tokenization, Stemming, Lemmatization, Part-of-Speech Tagging
Natural Language Processing
• Natural language processing studies interactions between humans and computers to find ways for computers to process written and spoken words much as humans do. The field blends computer science, linguistics, and machine learning.
• The ultimate goal of NLP is to help computers understand language as well as we do. It is the driving force behind virtual assistants, speech recognition, sentiment analysis, automatic text summarization, machine translation, and much more.
Why Is NLP Hard?
Natural language processing (NLP) is difficult for several reasons:
• Ambiguity: Human language is often ambiguous, with words and phrases having
multiple meanings depending on context. This makes it challenging for NLP systems
to accurately interpret and understand language.
• Syntax and grammar variations: Different languages and dialects have their own
syntax and grammar rules, making it difficult for NLP systems to effectively process
and understand all variations of language.
• Context and semantics: Understanding the true meaning of a sentence or phrase
often requires an understanding of the broader context in which it is used. This can
be challenging for NLP systems to capture accurately.
• Cultural and domain-specific knowledge: Language is deeply intertwined with
cultural and domain-specific knowledge, making it difficult for NLP systems to
accurately interpret language without a deep understanding of these factors.
• Data variability: NLP systems require large and diverse datasets to effectively learn
and generalize language patterns, but obtaining and processing such data can be
challenging.
Programming languages Vs Natural Languages
• Programming languages are (designed to be) easily used by machines, but not
people.
• Natural languages (like English) are easily used by humans, but not machines.
• Programming languages are unambiguous, while natural languages are often multiply ambiguous and require interpretation in context to be fully understood (which is also why it is so hard to get machines to understand them). Natural languages are also creative, allowing poetry, metaphor, and other figurative uses. Programming languages allow some variation in style, but their meaning is not flexible.
• Natural languages have evolved over time and are incredibly flexible, often containing ambiguity, metaphors, idioms, and cultural context. They are complex, with multiple grammar rules, dialects, and regional variations, while programming languages have precise, formalized syntax with specific rules and structures. Programming languages are designed to be unambiguous and strict, following a defined set of instructions and logic understood by computers.
Programming Languages vs. Natural Languages

Parameter | Natural Language Processing | Programming Language
--- | --- | ---
Purpose | Concerned with processing human natural language; a sub-category of AI | A way of writing instructions to the computer
Syntax | Handles human language syntax as naturally generated | Strict syntax for every language
Aim | Enable computers to interact with human language and do all manipulation | Solve tasks and computational problems
Works on | Unstructured text and speech data | Structured data, variables, and program logic
Used by | Data scientists, computational linguists, and NLP experts | Programmers, software developers
Communication | Focuses on processing and understanding human language text | Used for specifying algorithms and data manipulation
Application | Chatbots, language translation, speech recognition, etc. | Developing software, applications, and algorithms
Examples | Machine translation, sentiment analysis | C, C++, Java, Python, etc.
Error management | Uses probabilistic models | Handled through try-catch blocks
Tools | NLTK, TensorFlow | IDEs (Integrated Development Environments), compilers
Are natural languages regular?
• Natural languages, like English or Spanish, aren't strictly regular.
• Some parts, like verb endings, follow rules, but many words break these
patterns.
• For instance, irregular verbs don't behave predictably.
• Sentences are flexible, and word meanings depend on context, leading to
variations.
• Regular languages, like computer code, have clear rules, but natural
languages are messier and more creative.
• People can make new words, and meanings can shift over time.
• This dynamic, evolving nature makes natural languages rich and expressive, but also more complex than the neat rules of regular languages.
Finite automata for NLP
• Finite automata, though simplistic, find application in certain aspects of Natural
Language Processing (NLP).
• In tokenization, automata represent states of the process, transitioning with
each input symbol to recognize and segment words.
• Regular expressions, fundamental in NLP for pattern matching, can be
converted into equivalent finite automata to recognize linguistic patterns.
• Morphological analysis leverages automata to model word forms, transitioning
between states based on character input to understand variations in word
structures.
• Lexical analyzers in NLP use finite automata to recognize and categorize
tokens or words in a language.
• Even in spell-checking, automata can represent possible spelling corrections
through state transitions.
• Despite their usefulness in simpler tasks, the limitations of finite automata
become apparent when facing the intricate structures and dependencies
inherent in natural languages, prompting the use of more advanced models
like context-free grammars and machine learning approaches in
comprehensive NLP applications.
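To make the finite-automaton idea concrete, here is a minimal sketch (not from the original slides) of a hand-built deterministic finite automaton in Python that accepts simple alphabetic word tokens; the state names and sample inputs are illustrative assumptions.

```python
# A toy DFA that accepts non-empty runs of letters (simple word tokens).
# States: START (nothing read yet), IN_WORD (accepting), REJECT (trap).

def accepts_word(text: str) -> bool:
    """Return True if the DFA halts in the accepting IN_WORD state."""
    state = "START"
    for ch in text:
        if state in ("START", "IN_WORD"):
            state = "IN_WORD" if ch.isalpha() else "REJECT"
        else:            # REJECT is a trap state; no input escapes it
            break
    return state == "IN_WORD"

print(accepts_word("fox"))    # True
print(accepts_word("fox42"))  # False
print(accepts_word(""))       # False

# The same language as a regular expression -- regular expressions and
# finite automata recognize exactly the same class of (regular) languages:
import re
print(bool(re.fullmatch(r"[A-Za-z]+", "fox")))  # True
```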
Stages of NLP
• Natural Language Processing (NLP) involves several stages to understand,
interpret, and generate human language. The typical stages in NLP include:
• Tokenization:
• Breaking down a text into smaller units, such as words or phrases (tokens), to facilitate
analysis. Tokenization is a crucial step in various NLP tasks.
• Part-of-Speech Tagging (POS):
• Assigning grammatical categories (e.g., nouns, verbs) to each token. POS tagging helps in
understanding the syntactic structure of sentences.
• Named Entity Recognition (NER):
• Identifying entities in the text, such as names of people, organizations, locations, or dates.
NER is vital for extracting structured information from unstructured text.
• Syntactic Analysis:
• Parsing the text to understand the grammatical structure and relationships between words.
This stage helps in creating parse trees or syntactic structures.
• Semantic Analysis:
• Extracting the meaning of sentences or phrases by considering the context. This stage
involves tasks like semantic role labeling and understanding word sense disambiguation.
• Coreference Resolution:
• Identifying when different words or expressions refer to the same entity. Coreference
resolution is crucial for maintaining context and coherence in text.
• Sentiment Analysis:
• Determining the sentiment or emotional tone expressed in a piece of text. This is often used
to gauge opinions or attitudes in reviews, social media, and other sources.
• Machine Translation:
• Translating text from one language to another. Machine translation systems use various
NLP techniques to understand and generate coherent translations.
• Text Summarization:
• Generating concise and coherent summaries of longer texts. Text summarization is
essential for condensing information while retaining key points.
• Speech Recognition:
• Converting spoken language into written text. Speech recognition systems use acoustic and
language models to transcribe spoken words.
• Question Answering:
• Developing systems that can understand and respond to user questions. This involves
extracting information from a given text to provide relevant answers.
• Dialogue Systems:
• Building conversational agents that can engage in natural language conversations.
Dialogue systems require understanding context and generating coherent responses.
• These stages are often interconnected, and the success of NLP applications
depends on the accurate execution of each stage. Advances in machine
learning, deep learning, and natural language understanding continue to
enhance the capabilities of NLP systems.
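As a small illustration of how the early stages chain together, the sketch below runs tokenization, POS tagging, and named-entity recognition with NLTK (one of the tools mentioned earlier in these notes); the sample sentence and the specific nltk.download resource names are illustrative assumptions that may vary slightly across NLTK versions.

```python
import nltk

# One-time downloads of the models these stages rely on
# (resource names may differ slightly between NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Barack Obama was born in Hawaii."

tokens = nltk.word_tokenize(text)   # Stage 1: Tokenization
tagged = nltk.pos_tag(tokens)       # Stage 2: Part-of-Speech Tagging
entities = nltk.ne_chunk(tagged)    # Stage 3: Named Entity Recognition

print(tokens)    # ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', '.']
print(tagged)    # [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ...]
print(entities)  # a tree whose chunks mark PERSON and GPE entities
```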
Challenges and Issues (Open Problems) in NLP
• Several challenges and open problems persist in the field of Natural Language
Processing (NLP). Some of these include:
• Ambiguity and Polysemy:
• Resolving multiple meanings of words in context remains a challenge, as words often have
different senses depending on the context in which they are used.
• Lack of Common Sense Understanding:
• NLP models often struggle with understanding common sense knowledge, making it
challenging to infer information beyond what is explicitly stated in the text.
• Contextual Ambiguity:
• Capturing and understanding context in a broader sense, especially in situations where
long-term dependencies and world knowledge are required, is a persistent challenge.
• Data Limitations and Bias:
• Models can be biased due to the biases present in training data. Ensuring diverse and
representative datasets is crucial to address biases in NLP systems.
• Dynamic Language Evolution:
• Languages evolve over time, incorporating new words, phrases, and meanings. Adapting
NLP models to such dynamic changes poses a continual challenge.
• Understanding Negation and Irony:
• Recognizing negation and irony in text remains challenging, as these linguistic phenomena
often require a deep understanding of context and speaker intent.
• Ethical Considerations:
• Addressing ethical concerns, such as the potential misuse of NLP models for malicious
purposes, ensuring privacy, and handling sensitive information responsibly.
• Multimodal Understanding:
• Integrating information from multiple modalities (text, images, audio) to create a holistic
understanding of content is an ongoing challenge, especially in the era of multimedia data.
• Explainability and Interpretability:
• Enhancing the transparency and interpretability of NLP models is crucial for building trust
and understanding how models make decisions, especially in critical applications.
• Domain Adaptation:
• Adapting models trained on one domain to perform well in different domains remains a
challenge, as language use and characteristics vary across domains.
• Low-Resource Languages:
• Developing effective NLP solutions for languages with limited labeled data, often referred to
as low-resource languages, is a significant challenge.
• Real-Time Processing:
• Achieving real-time processing for NLP tasks, especially in applications such as chatbots
and virtual assistants, requires addressing latency challenges and maintaining high
accuracy.
• Robustness to Adversarial Attacks:
• Ensuring NLP models are robust to adversarial attacks and intentional manipulations of
input data is a critical concern for deploying models in security-sensitive applications.
• Addressing these challenges requires ongoing research, collaboration, and
innovation in the NLP community to advance the capabilities and reliability of
natural language processing systems.
BASICS OF TEXT PROCESSING
Text processing refers to the analysis, manipulation,
and generation of text.
TOKENIZATION
• Tokenization is the process of dividing text into a set of meaningful pieces. These
pieces are called tokens.

Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks
use this approach, in which words are treated as the basic units of meaning.
Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is
useful for tasks requiring individual sentence analysis or processing.
Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which
can be especially useful when dealing with morphologically rich languages or rare
words.
Character Tokenization:
This process divides the text into individual characters, which can be useful for character-level language modelling.
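A minimal sketch of word, sentence, and character tokenization using NLTK; the sample text is an illustrative assumption, and subword tokenization is only noted in a comment because it normally requires a dedicated tokenizer library.

```python
import nltk
nltk.download("punkt")  # sentence/word tokenizer models (one-time)

text = "NLP is fun. Tokenization splits text into pieces."

words = nltk.word_tokenize(text)      # word tokenization
sentences = nltk.sent_tokenize(text)  # sentence tokenization
chars = list(text)                    # character tokenization

print(words)      # ['NLP', 'is', 'fun', '.', 'Tokenization', ...]
print(sentences)  # ['NLP is fun.', 'Tokenization splits text into pieces.']
print(chars[:5])  # ['N', 'L', 'P', ' ', 'i']

# Subword tokenization (e.g. Byte-Pair Encoding) would split a rare word
# such as "tokenization" into pieces like "token" + "ization"; it is
# usually done with a dedicated library rather than NLTK.
```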
STEMMING
• In natural language processing, stemming is the text preprocessing
normalization task concerned with bluntly removing word affixes
(prefixes and suffixes).
• Stemming in natural language processing reduces words to their base
or root form, aiding in text normalization for easier processing.
• For example, “chocolates” becomes “chocolate” and “retrieval” becomes “retrieve.” Note that, because affixes are removed bluntly, a stem need not be a valid word: the widely used Porter stemmer actually produces “chocol” and “retriev” for these inputs.
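A minimal sketch using NLTK's Porter stemmer; the word list is an illustrative assumption, and the outputs show how blunt suffix removal can yield stems that are not dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem a few example words; note the stems are not always valid words.
for word in ["chocolates", "retrieval", "running", "studies"]:
    print(word, "->", stemmer.stem(word))

# Expected output:
# chocolates -> chocol
# retrieval  -> retriev
# running    -> run
# studies    -> studi
```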
LEMMATIZATION
• Lemmatization is the process of grouping together different inflected forms of the
same word.
• Lemmatization is similar to stemming but it brings context to the words.
Three types of lemmatization techniques are:
• 1. Rule Based Lemmatization : Rule-based lemmatization involves the
application of predefined rules to derive the base or root form of a word.
Example:
Word: “walked”, Rule Application: remove “-ed”, Result: “walk”
• 2. Dictionary-Based Lemmatization : Dictionary-based lemmatization relies on
predefined dictionaries or lookup tables to map words to their corresponding base
forms or lemmas.
• 3. Machine Learning-Based Lemmatization: Machine learning-based
lemmatization leverages computational models to automatically learn the
relationships between words and their base forms.
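A minimal sketch of dictionary-based lemmatization with NLTK's WordNet lemmatizer; the pos argument supplies the grammatical context that distinguishes lemmatization from blunt stemming. The example words are illustrative assumptions.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # WordNet dictionary lookup data (one-time)

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("walked", pos="v"))  # walk
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("mice"))             # mouse (default pos is noun)
```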
PART OF SPEECH TAGGING
• Part-of-Speech (PoS) tagging is the process of assigning each word in a text a grammatical category, such as noun, verb, adjective, or adverb.
• In many NLP applications, including machine translation, sentiment analysis, and
information retrieval, PoS tagging is essential. PoS tagging serves as a link between
language and machine understanding, enabling the creation of complex language
processing systems and serving as the foundation for advanced linguistic analysis.

• Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog”
After performing POS tagging:
“The” as determiner (DT), “quick” as adjective (JJ), “brown” as adjective (JJ), “fox” as noun (NN), “jumps” as verb (VBZ), “over” as preposition (IN), “the” as determiner (DT), “lazy” as adjective (JJ), “dog” as noun (NN)
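The tagged sentence above can be reproduced with NLTK's built-in tagger, as in this minimal sketch; note that the exact tags can differ slightly by tagger version (for instance, some versions tag “brown” as NN rather than JJ).

```python
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
#       ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
#       ('dog', 'NN')]
```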
