NLP 9


Protected

Natural Language Processing (NLP)
Krithiga Ramadass
22 October 2022



Introduction
Krithiga Ramadass, Chennai
12 years of overall experience; 8 years in ML and data science.

We are a team of data scientists, engineers, and designers who share the vision of
transforming Toyota from an automotive giant to a mobility company with cutting-
edge technology.
Toyota Connected is enabling improved safety and convenience with a cloud-based
digital connected mobility intelligence platform. We are leveraging vehicle data and
artificial intelligence to change the way people interact with vehicles.

Lead ML Engineer
Natural Language Processing
Conversational AI

What’s in it?

What’s for the lecture: Introduction to Natural Language Processing
- Beginner friendly
- Basic NLP tools & techniques
- NLP applications overview

What are we seeing today
- Core concepts
- Basic NLP tasks
- NLP tools
- How do we approach an NLP problem

What’s for tomorrow
- NLP applications
- What we do at Toyota Connected India (TCIN)

Natural Language Processing (NLP)

• A computer’s ability to understand text and spoken words in much the same way human beings can

• Enables computers to process human language, in the form of text or voice data, and to ‘understand’ its full meaning, complete with the speaker’s or writer’s intent and sentiment

Source: IBM

Natural Language Processing (NLP)


Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language, in particular how to program
computers to process and analyze large amounts of natural language data. The goal is a computer capable of
"understanding" the contents of documents, including the contextual nuances of the language within them.
The technology can then accurately extract information and insights contained in the documents as well as
categorize and organize the documents themselves.

Source: Wikipedia

What is not NLP?

Natural Language Processing (NLP)

Source: O’Reilly

List all the NLP applications you can think of:
https://www.menti.com/alj8bo747g2t

NLP Evolution (1954 – 2022)
- Turing Test; rule-based systems
- Maximum likelihood; Hidden Markov Model methodology
- Primitive acoustic features
- Deep neural networks
- Representation learning



Why NLP?
The same word can mean very different things in context, e.g. “book”:
- You should read this book; it’s a great novel!
- You should book the flights as soon as possible.
- You should close the books by the end of the year.
- You should do everything by the book to avoid potential complications.

Human Language
• When we study human language, we are approaching what some might call the “human essence”: the distinctive qualities of mind that are, so far, unique to humans
• A communication tool
• What does it mean to know a language?
  - To be able to speak and be understood

NLP comprises:
- Natural Language Generation (NLG)
- Natural Language Understanding (NLU)
- Speech Recognition

Natural Language Processing (NLP) & Human Language

Building blocks of language & NLP

Source: O’Reilly

Natural Language Understanding Pyramid
(pyramid levels, top to bottom)
- NLP Applications
- Word Sense Disambiguation
- Parsing
- Morphology: singular/plural, gender, prefix/suffix

Source: Nextiva

PoS Tagging

Source: O’Reilly

Parsing
• Dependency Parsing

• Constituency Parsing

Source: O’Reilly

Tokenization
• Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

• Word tokenization
• Sentence tokenization
• Whitespace tokenization
• Punctuation-based tokenization
• Treebank tokenizer
• Tweet tokenizer
• Multi-word tokenizer
• Limitations:
  • No single tokenizer supports all languages
Source: Stanford NLP
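The whitespace and punctuation-based strategies above can be sketched in plain Python. This is a minimal illustration using the standard library, not the Treebank or tweet tokenizers mentioned above:

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays glued to words."""
    return text.split()

def punctuation_tokenize(text):
    """Keep word characters together and emit each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Don't stop: tokenization isn't trivial!"
print(whitespace_tokenize(sentence))
print(punctuation_tokenize(sentence))
```

Note how the two strategies disagree on contractions like "Don't", which is one reason dedicated tokenizers (e.g. NLTK's) carry language-specific rules.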

Stemming and Lemmatization

Stemming just removes (stems) the last few characters of a word, often leading to incorrect meanings and spellings.
Lemmatization considers the context and converts the word to its meaningful base form, called the lemma.
Sometimes the same word can have multiple lemmas, so we should identify the part-of-speech (POS) tag for the word in that specific context.

1. lemmatize(‘walking’) -> ‘walk’; stem(‘walking’) -> ‘walk’
2. lemmatize(‘Stripes’, pos=verb) -> ‘Strip’; lemmatize(‘Stripes’, pos=noun) -> ‘Stripe’
3. stem(‘Caring’) -> ‘Car’; lemmatize(‘Caring’) -> ‘Care’

Lemmatization is computationally expensive since it involves look-up tables. If you have a large dataset and performance is an issue, go with stemming; remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn’t huge, go with lemmatization.
Source: SO
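The 'Caring' -> 'Car' behaviour above can be reproduced with a toy suffix-stripping stemmer. This is a deliberately naive sketch; real stemmers such as NLTK's PorterStemmer apply many more context-sensitive rules:

```python
# Suffixes are tried longest-first; a minimum stem length of 3
# prevents over-stripping short words.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, leaving at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print(naive_stem("walking"))  # -> 'walk'
print(naive_stem("caring"))   # -> 'car' (a lemmatizer would return 'care')
print(naive_stem("stripes"))  # -> 'strip'
```

The 'caring' -> 'car' output shows why blind suffix removal can change meaning, which is exactly the trade-off the slide describes.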

Stop word removal


The idea is to remove the words that occur commonly across all the documents in the corpus. Typically, articles and pronouns are classified as stop words.
These words have no significance in some NLP tasks, like information retrieval and classification, because they are not very discriminative.

Conversely, in some NLP applications stop-word removal has very little impact. Most of the time, the stop-word list for a given language is a well hand-curated list of the words that occur most commonly across corpora.

Source: O’Reilly
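A minimal stop-word filter can be written in a few lines. The tiny hand-made list below is only for illustration; curated lists (e.g. NLTK's) contain a couple of hundred entries per language:

```python
# A small illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "it", "in", "of", "to", "and"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat sat in the hat".split()
print(remove_stop_words(tokens))  # -> ['cat', 'sat', 'hat']
```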

Coreference Resolution

Coreference resolution links mentions back to the entity they refer to; here “She” and “Her” both refer to “Ana”:

Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the institute. Her hobbies include blogging, dancing and singing.

Source: https://towardsai.net/p/nlp/c-r

Entity Recognition

Source: SpotDraft

Vector Representation

Bag of words

- Representation of words
- One-hot Encoding

Source: PyImageSearch

Document-Term Matrix

Source: PyImageSearch
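The bag-of-words document-term matrix above can be built in plain Python: one row per document, one column per vocabulary word, each cell a raw count. A sketch with a toy two-document corpus:

```python
# Toy corpus; real pipelines would tokenize properly first.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Shared vocabulary across all documents, sorted for stable column order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix of raw counts.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

This is what `sklearn.feature_extraction.text.CountVectorizer` produces (as a sparse matrix) at scale.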

Disadvantages
- We end up counting raw word occurrences; some words appear in one document far more often than in others
- Counts are not normalized

TF-IDF
A TF-IDF score is a decimal number that measures the importance of a word in a document. It gives small values to words that are frequent across all the documents and more weight to words that are scarce across the corpus.

TF (Term Frequency): the number of times the word appears in each document.
IDF (Inverse Document Frequency): an inverse count of the number of documents a word appears in; IDF measures how significant a word is in the whole corpus.

https://github.com/Shubha23/Text-processing-NLP/blob/master/NLP%20-
%20Text%20processing%20pipeline.ipynb
Source: OpenClassRooms
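The TF and IDF definitions above can be combined in a few lines. This sketch uses one common formulation (tf = count / document length, idf = log(N / df)); libraries such as scikit-learn use smoothed variants, so exact scores differ:

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / df): 0 for a term that appears in every document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("the", docs[0], docs))  # 0.0: 'the' occurs in every document
print(tf_idf("cat", docs[0], docs))  # positive: 'cat' is rarer in the corpus
```

The score of exactly 0.0 for "the" shows how TF-IDF downweights corpus-wide words, addressing the normalization problem of raw counts.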

Disadvantages

- Still depends indirectly on word occurrences, relative to the corpus
- Scores vary from document to document
- The matrix becomes large and sparse
- Inability to learn:
  - Grammar
  - Semantics

Embeddings

- Word2Vec: words appearing in similar contexts
- GloVe: word co-occurrences in the corpus

- Captures analogies
- Distance between words is meaningful
- Dense vectors compared to count vectors / TF-IDF
- Constant vector size
- Universal vector representation
- Can be extended to sentences, paragraphs, documents

http://projector.tensorflow.org/
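"Distance between words" is usually measured as cosine similarity between embedding vectors. The 3-dimensional vectors below are made up purely for illustration (real Word2Vec/GloVe embeddings have hundreds of dimensions learned from data):

```python
import math

# Hypothetical toy embeddings, hand-picked so that related words point
# in similar directions.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```

The TensorFlow Embedding Projector linked above visualises exactly these distances for real pretrained embeddings.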

Disadvantages
• Cultural Bias
• Out of Vocabulary words

How do we approach an NLP problem

Machine Learning Pipeline (Automated)

Rule based NLP



Feature* Extraction. Not Future!



Sentiment
Analysis

https://pair-code.github.io/lit/tutorials/sentiment/

Sentiment Analysis - NLP

Machine Translation

Speech Recognition

Natural Language Generation



Question Answering



Virtual Assistants

NLP Tools
• NLTK
• Spacy
• Gensim
• Scikit-learn

Toyota Connected NLP Application
• Intelligent Assistant

Reading Materials
• Linguistics:
https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf
• NLP:
CS224n – NLP with Deep Learning - Christopher Manning
Real-World Natural Language Processing by Masato Hagiwara
• Deep Learning:
https://www.deeplearningbook.org/ - Ian Goodfellow and Yoshua Bengio and Aaron
Courville

http://projector.tensorflow.org/
https://pair-code.github.io/lit/tutorials/sentiment/
https://www.youtube.com/watch?v=kiPysxvkmoU&t=63s
