NLP 9


Protected

Natural Language Processing (NLP)
Krithiga Ramadass
22 October 2022



Introduction
Krithiga Ramadass, Chennai
12 years of overall experience; 8 years in ML and data science.

We are a team of data scientists, engineers, and designers who share the vision of
transforming Toyota from an automotive giant to a mobility company with cutting-
edge technology.
Toyota Connected is enabling improved safety and convenience with a cloud-based
digital connected mobility intelligence platform. We are leveraging vehicle data and
artificial intelligence to change the way people interact with vehicles.

Lead ML Engineer
Natural Language Processing
Conversational AI

What’s in it?

What’s for the lecture: Introduction to Natural Language Processing
- Beginner friendly
- Basic NLP tools & techniques
- NLP applications overview

What are we seeing today
- Core concepts
- Basic NLP tasks
- NLP tools
- How do we approach an NLP problem

What’s for tomorrow
- NLP applications
- What we do at Toyota Connected India (TCIN)

Natural Language Processing (NLP)

• A computer’s ability to understand text and spoken words in much the same way human beings can

• Enables computers to process human language, in the form of text or voice data, and to ‘understand’ its full meaning, complete with the speaker’s or writer’s intent and sentiment

Source: IBM

Natural Language Processing (NLP)


Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language, in particular how to program
computers to process and analyze large amounts of natural language data. The goal is a computer capable of
"understanding" the contents of documents, including the contextual nuances of the language within them.
The technology can then accurately extract information and insights contained in the documents as well as
categorize and organize the documents themselves.

Source: Wikipedia

What is not NLP?

Natural Language Processing (NLP)

Source: O’Reilly

List all the NLP applications you can think of:
https://www.menti.com/alj8bo747g2t

NLP Evolution (1954 – 2022)
- Turing Test; rule-based systems
- Maximum likelihood; Hidden Markov Model methodology
- Primitive acoustic features
- Deep neural networks
- Representation learning



Why NLP?
The same word can mean very different things in context, e.g. “book”:
- You should read this book; it’s a great novel!
- You should book the flights as soon as possible.
- You should close the books by the end of the year.
- You should do everything by the book to avoid potential complications.

Human Language
• When we study human language, we are approaching what some might call the “human essence”: the distinctive qualities of mind that are, so far, unique to humans
• A communication tool
• What does it mean to know a language?
  - To be able to speak and be understood

NLP comprises:
- Natural Language Generation (NLG)
- Natural Language Understanding (NLU)
- Speech Recognition

Natural Language Processing (NLP) & Human Language

Building blocks of language & NLP

Source: O’Reilly

Natural Language Understanding Pyramid
(pyramid levels, top to bottom)
- NLP Applications
- Word Sense Disambiguation
- Parsing
- Morphology: singular/plural, gender, prefix/suffix

Source: Nextiva

PoS Tagging

Source: O’Reilly

Parsing
• Dependency Parsing

• Constituency Parsing

Source: O’Reilly

Tokenization
• Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

• Word tokenization
• Sentence tokenization
• Whitespace tokenization
• Punctuation-based tokenization
• Treebank tokenizer
• Tweet tokenizer
• Multi-word tokenizer
• Limitations:
  • No single tokenizer supports all languages
Source: Stanford NLP
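The whitespace and punctuation-based strategies above can be sketched in plain Python. This is a minimal illustration using the standard library, not the Treebank or tweet tokenizers mentioned above:

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays glued to words."""
    return text.split()

def punctuation_tokenize(text):
    """Keep word characters together and emit each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Don't stop: tokenization isn't trivial!"
print(whitespace_tokenize(sentence))
print(punctuation_tokenize(sentence))
```

Note how the two strategies disagree on contractions like "Don't", which is one reason dedicated tokenizers (e.g. NLTK's) carry language-specific rules.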

Stemming and Lemmatization

Stemming just removes (stems) the last few characters of a word, often leading to incorrect meanings and spellings.
Lemmatization considers the context and converts the word to its meaningful base form, called the lemma.
Sometimes the same word can have multiple lemmas, so we should identify the part-of-speech (POS) tag for the word in that specific context.

1. lemmatize(‘walking’) -> ‘walk’; stem(‘walking’) -> ‘walk’
2. lemmatize(‘Stripes’, pos=verb) -> ‘Strip’; lemmatize(‘Stripes’, pos=noun) -> ‘Stripe’
3. stem(‘Caring’) -> ‘Car’; lemmatize(‘Caring’) -> ‘Care’

Lemmatization is computationally expensive since it involves look-up tables. If you have a large dataset and performance is an issue, go with stemming; remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn’t huge, go with lemmatization.
Source: SO
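The 'Caring' -> 'Car' behaviour above can be reproduced with a toy suffix-stripping stemmer. This is a deliberately naive sketch; real stemmers such as NLTK's PorterStemmer apply many more context-sensitive rules:

```python
# Suffixes are tried longest-first; a minimum stem length of 3
# prevents over-stripping short words.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, leaving at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print(naive_stem("walking"))  # -> 'walk'
print(naive_stem("caring"))   # -> 'car' (a lemmatizer would return 'care')
print(naive_stem("stripes"))  # -> 'strip'
```

The 'caring' -> 'car' output shows why blind suffix removal can change meaning, which is exactly the trade-off the slide describes.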

Stop word removal


The idea is to remove the words that occur commonly across all the documents in the corpus. Typically, articles and pronouns are classified as stop words.
These words have no significance in some NLP tasks, like information retrieval and classification, because they are not very discriminative.

Conversely, in some NLP applications stop-word removal has very little impact. Most of the time, the stop-word list for a given language is a well hand-curated list of the words that occur most commonly across corpora.

Source: O’Reilly
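A minimal stop-word filter can be written in a few lines. The tiny hand-made list below is only for illustration; curated lists (e.g. NLTK's) contain a couple of hundred entries per language:

```python
# A small illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "it", "in", "of", "to", "and"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat sat in the hat".split()
print(remove_stop_words(tokens))  # -> ['cat', 'sat', 'hat']
```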

Coreference Resolution

Coreference resolution links mentions back to the entity they refer to; here “She” and “Her” both refer to “Ana”:

Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the institute. Her hobbies include blogging, dancing and singing.

Source: https://towardsai.net/p/nlp/c-r

Entity Recognition

Source: SpotDraft

Vector Representation

Bag of words

- Representation of words
- One-hot Encoding

Source: PyImageSearch

Document-Term Matrix

Source: PyImageSearch
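The bag-of-words document-term matrix above can be built in plain Python: one row per document, one column per vocabulary word, each cell a raw count. A sketch with a toy two-document corpus:

```python
# Toy corpus; real pipelines would tokenize properly first.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Shared vocabulary across all documents, sorted for stable column order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix of raw counts.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

This is what `sklearn.feature_extraction.text.CountVectorizer` produces (as a sparse matrix) at scale.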

Disadvantages
- We end up counting raw word occurrences; some words appear in one document far more often than in others
- Counts are not normalized

TF-IDF
A TF-IDF score is a decimal number that measures the importance of a word in a document. It gives small values to words that are frequent across all the documents and more weight to words that are scarce across the corpus.

TF (Term Frequency): the number of times the word appears in each document.
IDF (Inverse Document Frequency): an inverse count of the number of documents a word appears in; IDF measures how significant a word is in the whole corpus.

https://github.com/Shubha23/Text-processing-NLP/blob/master/NLP%20-
%20Text%20processing%20pipeline.ipynb
Source: OpenClassRooms
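The TF and IDF definitions above can be combined in a few lines. This sketch uses one common formulation (tf = count / document length, idf = log(N / df)); libraries such as scikit-learn use smoothed variants, so exact scores differ:

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / df): 0 for a term that appears in every document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("the", docs[0], docs))  # 0.0: 'the' occurs in every document
print(tf_idf("cat", docs[0], docs))  # positive: 'cat' is rarer in the corpus
```

The score of exactly 0.0 for "the" shows how TF-IDF downweights corpus-wide words, addressing the normalization problem of raw counts.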

Disadvantages

- Still depends indirectly on word occurrences, relative to the corpus
- Scores vary from document to document
- The matrix becomes large and sparse
- Inability to learn:
  - Grammar
  - Semantics

Embeddings

- Word2Vec: words appearing in similar contexts
- GloVe: word co-occurrences in the corpus

- Captures analogies
- Distance between words is meaningful
- Dense vectors compared to count vectors / TF-IDF
- Constant vector size
- Universal vector representation
- Can be extended to sentences, paragraphs, documents

http://projector.tensorflow.org/
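"Distance between words" is usually measured as cosine similarity between embedding vectors. The 3-dimensional vectors below are made up purely for illustration (real Word2Vec/GloVe embeddings have hundreds of dimensions learned from data):

```python
import math

# Hypothetical toy embeddings, hand-picked so that related words point
# in similar directions.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```

The TensorFlow Embedding Projector linked above visualises exactly these distances for real pretrained embeddings.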

Disadvantages
• Cultural Bias
• Out of Vocabulary words

How do we approach an NLP problem

Machine Learning Pipeline (Automated)

Rule based NLP



Feature* Extraction. Not Future!



Sentiment
Analysis

https://pair-code.github.io/lit/tutorials/sentiment/

Sentiment Analysis - NLP

Machine Translation

Speech Recognition

Natural Language Generation



Question Answering



Virtual Assistants

NLP Tools
• NLTK
• Spacy
• Gensim
• Scikit-learn

Toyota Connected NLP Application
• Intelligent Assistant

Reading Materials
• Linguistics:
https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf
• NLP:
CS224n – NLP with Deep Learning - Christopher Manning
Real-World Natural Language Processing by Masato Hagiwara
• Deep Learning:
https://www.deeplearningbook.org/ - Ian Goodfellow and Yoshua Bengio and Aaron
Courville

http://projector.tensorflow.org/
https://pair-code.github.io/lit/tutorials/sentiment/
https://www.youtube.com/watch?v=kiPysxvkmoU&t=63s
