Chapter 4


DAT21003

ARTIFICIAL INTELLIGENCE
Natural Language Processing
INTRODUCTION
Natural Language Processing
• Natural language processing (NLP) is a branch of AI
that gives machines the ability to read,
understand and derive meaning from human
languages.
• NLP combines the fields of linguistics and computer
science to decipher language structure and rules,
and to build models that can comprehend, break
down and extract significant details from text and
speech data.
INTRODUCTION
Natural Language Processing
• To make these applications work, computers need to process human
language in the form of text or speech data and understand the full
meaning of the words.
• This becomes a problem because:
• Language can be ambiguous.
• The meaning of a word varies depending on the context.
• When solving NLP problems, it is important to consider how AI is going
to learn the meaning of words and understand potential mistakes.
INTRODUCTION
Natural Language Processing
• Several NLP tasks break down human text and speech data in ways that
help the computer make sense of what it's ingesting:
• Speech recognition, also called speech-to-text, is the task of reliably
converting voice data into text data.
• Part-of-speech tagging, also called grammatical tagging, is the process
of determining the part of speech of a particular word or piece of text
based on its use and context.
• Word sense disambiguation is the selection of the meaning of a word
with multiple meanings through a process of semantic analysis that
determines the meaning that makes the most sense in the given context.
INTRODUCTION
Natural Language Processing

• Several NLP tasks break down human text and voice data in ways that help the
computer make sense of what it's ingesting:
• Named entity recognition identifies words or phrases as useful entities.
• Co-reference resolution is the task of identifying if and when two words
refer to the same entity.
• Sentiment analysis attempts to extract subjective qualities—attitudes,
emotions, sarcasm, confusion, suspicion—from text.
• Natural language generation is sometimes described as the opposite of
speech recognition; it's the task of putting structured information into
human language.
INTRODUCTION
Natural Language Processing use cases
• NLP is the driving force behind machine intelligence in many modern real-world
applications. Here are a few examples:
• Spam detection: The best spam detection technologies use NLP's text classification
capabilities to scan emails for language that often indicates spam or phishing.
• Machine translation: Google Translate is an example of widely available NLP
technology at work. A great way to test any machine translation tool is to
translate text into another language and then back into the original.
• Virtual agents and chatbots: Virtual agents such as Apple's Siri and Amazon's
Alexa use speech recognition to recognize patterns in voice commands and
natural language generation to respond with appropriate action or helpful
comments.
INTRODUCTION
Natural Language Processing use cases
• NLP is the driving force behind machine intelligence in many modern real-world
applications. Here are a few examples:
• Social media sentiment analysis: NLP has become an essential business tool
for uncovering hidden data insights from social media channels.
• Text summarization: Text summarization uses NLP techniques to digest huge
volumes of digital text and create summaries and synopses for indexes,
research databases, or busy readers who don't have time to read full text.
INTRODUCTION
Natural Language Processing
• One of the first steps in NLP is to preprocess the inputs (data) so that
the machine can better understand the intended meaning.
• This chapter focuses on data preprocessing for NLP.
DATA PREPROCESSING

• When we were young, we learned how to separate noise from a language,
and then to find meaning in that language through its most important
words and phrases.
• This process is similar to data preprocessing.
• Data / text preprocessing breaks down a corpus (a collection of written
text) into smaller parts, then extracts the most important information
from those parts.
• The extracted information will then be used by an AI model to derive
meaning from that text.
DATA PREPROCESSING PIPELINE
• The data preprocessing pipeline consists of several stages.
• The stages involved vary, depending on the project and its purposes.
• The most common stages are: Segmentation, Tokenization, Stop words
removal, Stemming, Lemmatization, Speech Tagging, and Named Entity
Tagging.
• Use this text as an example:

London is the capital and most populous city of England and the
United Kingdom. Standing on the River Thames in the southeast of
the island of Great Britain, London has been a major settlement for
two millennia. It was founded by the Romans, who named it Londinium.
DATA PREPROCESSING PIPELINE
Segmentation
• This stage breaks the document or text into separate sentences.
• Assuming each sentence is a separate thought / idea, it will be a lot
easier to write a program to understand a single sentence than to
understand a whole paragraph at once.

London is the capital and most populous city of England and the
United Kingdom.
Standing on the River Thames in the southeast of the island of Great
Britain, London has been a major settlement for two millennia.
It was founded by the Romans, who named it Londinium.
Result of Sentence Segmentation
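A minimal, stdlib-only sketch of this stage (a naive splitter, not a production segmenter such as NLTK's `sent_tokenize`; the regex and function name are illustrative assumptions):

```python
import re

def segment_sentences(text: str) -> list[str]:
    """Naive sentence segmentation: split after '.', '!' or '?'
    followed by whitespace. Real segmenters also handle
    abbreviations such as 'Dr.' or 'U.K.'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("London is the capital and most populous city of England and "
        "the United Kingdom. Standing on the River Thames in the "
        "southeast of the island of Great Britain, London has been a "
        "major settlement for two millennia. It was founded by the "
        "Romans, who named it Londinium.")

for sentence in segment_sentences(text):
    print(sentence)
```

The example paragraph yields three separate sentences, one per thought, matching the result shown above.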
DATA PREPROCESSING PIPELINE
Tokenization

• Tokenization is a mandatory stage and is usually considered one of the
first stages of the pipeline.
• It is a process that splits the sentence into individual words (tokens).
• Each token can contain words, punctuation or special characters (if
that’s the case for the language).

London | is | the | capital | and | most | populous | city | of |
England | and | the | United | Kingdom | .
Result of Tokenization
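A minimal sketch of this stage, assuming a simple regex rule (words are runs of letters/digits; punctuation becomes its own token). Real tokenizers handle contractions, hyphens and language-specific cases:

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer: \w+ matches a word, [^\w\s] matches a
    single punctuation character as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("London is the capital and most populous city of "
                  "England and the United Kingdom.")
print(tokens)
# Note how the final '.' becomes a separate token.
```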
DATA PREPROCESSING PIPELINE
Stop words
• Remove non-relevant words, shrinking the vocabulary down to the words that are
expected to add value to the AI model.
• In many cases it is common to remove stop words (pronouns, conjunctions,
determiners and prepositions).
• Another approach is to remove the most frequent words.
• Be aware that removing words means losing information; therefore, do it at your
own risk.

Result of Stop words − Stop words Are Greyed Out
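A minimal sketch of stop-word removal. The tiny hand-picked stop list below is an assumption for illustration; real projects typically use a library list (e.g. NLTK's or spaCy's):

```python
# Tiny hand-made stop-word list for illustration only.
STOP_WORDS = {"is", "the", "and", "most", "of", "a", "in", "it",
              "was", "who", "on", "for", "has", "been", "by"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stop-word list
    (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["London", "is", "the", "capital", "and", "most", "populous",
          "city", "of", "England", "and", "the", "United", "Kingdom"]
print(remove_stop_words(tokens))
# ['London', 'capital', 'populous', 'city', 'England', 'United', 'Kingdom']
```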


DATA PREPROCESSING PIPELINE
Stemming
• Normalize words into their base form or root form.

• Stemming is a process of reducing each word to its stem / root form.


eating eat
runs run
played play

• Even though stemmers are simple to use and run very fast, there
is a danger of over-stemming.

news new
make mak
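A crude suffix-stripping sketch of stemming, for illustration only (the suffix list is an assumption; real stemmers such as the Porter stemmer apply ordered, conditional rules that avoid many mistakes). It also reproduces the over-stemming problem noted above:

```python
def naive_stem(word: str) -> str:
    """Strip the first matching suffix, but only if at least
    three characters would remain as the stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("eating"))   # eat
print(naive_stem("runs"))     # run
print(naive_stem("played"))   # play
print(naive_stem("news"))     # new  <- over-stemming: 'news' is not 'new'
```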
DATA PREPROCESSING PIPELINE
Lemmatization
• Similar to stemming, as it maps several words into one common root.
• Groups together different inflected forms of a word, called a lemma.
• The output of lemmatization is a proper word.
• Lemmatization is more intensive (hence slower) than stemming, but is
more accurate.
• For example, a lemmatizer should map gone, going -> go

Difference Between Lemmatizing and Stemming
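A minimal lookup-based sketch of lemmatization. The hand-written table below is an assumption for illustration; real lemmatizers (e.g. NLTK's WordNetLemmatizer) use a vocabulary and morphological analysis, which is why they are slower but produce proper words:

```python
# Tiny hand-made lemma table for illustration only.
LEMMA_TABLE = {
    "gone": "go", "going": "go", "went": "go",
    "better": "good", "feet": "foot", "was": "be",
}

def lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words pass
    through unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)

print(lemmatize("gone"))    # go
print(lemmatize("going"))   # go
print(lemmatize("feet"))    # foot -- a stemmer could never produce this
```

Note the contrast with stemming: the output is always a real word from the table, not a chopped-off stem.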


DATA PREPROCESSING PIPELINE
Speech Tagging
• Tagging is a process where we attribute a tag (or more) to each token.
• Usually, this tag is the part-of-speech (noun, verb, adjective, pronoun,
etc.) for the word represented by the token.
• This means that the token will carry more information than just a string
of characters, which can help in future stages of the pipeline.
• Knowing the role of each word in the sentence will help you start to
figure out what the sentence is talking about.

Result of Tagging
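A toy dictionary-based sketch of part-of-speech tagging (the tag table and tag names are assumptions for illustration; real taggers such as `nltk.pos_tag` or spaCy's are statistical models trained on annotated corpora):

```python
# Tiny hand-made word -> part-of-speech table for illustration only.
TAG_TABLE = {
    "london": "NOUN", "capital": "NOUN", "city": "NOUN",
    "is": "VERB", "the": "DET", "and": "CONJ",
    "most": "ADV", "populous": "ADJ", "of": "ADP",
}

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each token with its tag; unknown words get 'UNK'.
    Each token now carries more than just a string of characters."""
    return [(t, TAG_TABLE.get(t.lower(), "UNK")) for t in tokens]

print(tag(["London", "is", "the", "capital"]))
# [('London', 'NOUN'), ('is', 'VERB'), ('the', 'DET'), ('capital', 'NOUN')]
```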
DATA PREPROCESSING PIPELINE
Named Entity Tagging
• Locates named entities in unstructured text data and classifies them
into predefined categories.
• Relates the machine to pop-culture references and everyday names by
flagging names of movies, important personalities, locations, etc. that
may occur in the document.
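A gazetteer-based sketch of named entity tagging: token spans are matched against a small hand-made table of known entities (the table and labels are assumptions for illustration; real NER systems are trained models, not lookup tables):

```python
# Tiny hand-made gazetteer: token span -> entity category.
GAZETTEER = {
    ("London",): "LOCATION",
    ("United", "Kingdom"): "LOCATION",
    ("River", "Thames"): "LOCATION",
    ("Romans",): "GROUP",
}

def find_entities(tokens: list[str]) -> list[tuple[str, str]]:
    """Scan left to right; at each position, try every known
    entity span and skip past a span once it matches."""
    entities = []
    i = 0
    while i < len(tokens):
        for span, label in GAZETTEER.items():
            if tuple(tokens[i:i + len(span)]) == span:
                entities.append((" ".join(span), label))
                i += len(span) - 1
                break
        i += 1
    return entities

tokens = "London is the capital of the United Kingdom".split()
print(find_entities(tokens))
# [('London', 'LOCATION'), ('United Kingdom', 'LOCATION')]
```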
