NLP 9
Introduction
Krithiga Ramadass, Chennai
Overall experience of 12 years.
8 years in ML and DS.
We are a team of data scientists, engineers, and designers who share the vision of
transforming Toyota from an automotive giant to a mobility company with cutting-edge
technology.
Toyota Connected is enabling improved safety and convenience with a cloud-based
digital connected mobility intelligence platform. We are leveraging vehicle data and
artificial intelligence to change the way people interact with vehicles.
Lead ML Engineer
Natural Language Processing
Conversational AI
Protected
What’s in it?
Source: IBM
Source: Wikipedia
What is not NLP?
Natural Language Processing (NLP)
Source: O’Reilly
NLP Evolution
- Representation Learning
- Deep Neural Networks
- Turing Test
- Rule based
Human Language
• When we study human language, we are approaching what some might call the “human
essence”: the distinctive qualities of mind that are, so far, unique to humans
• A communication tool
• What does it mean to know a language?
- To be able to speak and be understood
NLP
Natural Language Processing (NLP)
Human Language
Source: O’Reilly
Parsing
Singular, plural
Gender
Prefix, suffix
Source: Nextiva
PoS Tagging
Source: O’Reilly
Parsing
• Dependency Parsing
• Constituency Parsing
Source: O’Reilly
Tokenization
• Given a character sequence and a defined document unit, tokenization is the task of chopping it up into
pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
• Word Tokenization
• Sentence Tokenization
• White space
• Punctuation based tokenization
• Treebank tokenizer
• Tweet tokenizer
• Multi-word tokenizer
• Limitations:
• Does not support all languages (for example, languages written without spaces between words)
Source: Stanford NLP
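The simplest strategies in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not the Treebank or Tweet tokenizer; the sample sentences and regular expressions are assumptions chosen for the example.

```python
import re

text = "NLP is fun. Let's tokenize this sentence!"

# Whitespace tokenization: split on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Punctuation-based tokenization: runs of word characters and
# individual punctuation marks become separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Naive sentence tokenization: split on whitespace that follows ., ! or ?
# (this breaks on abbreviations such as "Dr.", which real tokenizers handle).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Limitation: text written without spaces is not segmented at all.
unsegmented = "我喜欢自然语言处理".split()

print(whitespace_tokens)
print(punct_tokens)
print(sentences)
print(len(unsegmented))
```

Note how the punctuation-based rule splits "Let's" into three tokens; handling contractions sensibly is one reason dedicated tokenizers exist.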
Lemmatization is computationally expensive because it involves look-up tables and related linguistic resources.
If you have a large dataset and performance is an issue, go with stemming. Remember that you can also add your
own rules to a stemmer. If accuracy is paramount and the dataset is not huge, go with lemmatization.
Source: SO
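The trade-off can be illustrated with a toy sketch. The suffix rules and the lemma table below are hypothetical and far smaller than what a real stemmer (such as Porter's) or lemmatizer uses; they only show why stemming is cheap but crude, while lemmatization depends on a dictionary.

```python
# Hypothetical suffix rules, checked in order; real stemmers use many more.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical look-up table standing in for a lemmatizer's dictionary.
LEMMAS = {"studies": "study", "running": "run", "ran": "run", "better": "good"}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "ran", "better"]:
    print(w, stem(w), lemmatize(w))
```

The stemmer produces truncated forms such as "stud" and "runn" and cannot relate "better" to "good", whereas the lemmatizer returns valid words but only for entries present in its table.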
On the contrary, in some NLP applications stop word removal will have very little impact. Most of the time, the
stop word list for a given language is a hand-curated list of the words that occur most commonly across
corpora.
Source: O’Reilly
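Stop word removal itself is a simple filter. The sketch below assumes a tiny hand-written stop word set; curated lists shipped with NLP libraries are much longer.

```python
# Hypothetical, deliberately small stop word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "of", "and", "to", "she"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "She loves working in Natural Language Processing at the institute".split()
filtered = remove_stop_words(tokens)
print(filtered)
```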
Coreference Resolution
Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the institute. Her
hobbies include blogging, dancing and singing.
Source: https://towardsai.net/p/nlp/c-r
Entity Recognition
Source: SpotDraft
Vector Representation
Bag of words
- Representation of words
- One-hot Encoding
Source: PyImageSearch
Document-Term Matrix
Source: PyImageSearch
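A one-hot encoding and a document-term matrix can be built in a few lines. This is a minimal sketch with two made-up documents and whitespace tokenization, not the implementation from the cited source.

```python
from collections import Counter

# Two toy documents (assumed for illustration).
docs = ["the cat sat on the mat", "the dog sat"]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({w for doc in tokenized for w in doc})

def one_hot(word):
    """Vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

# Document-term matrix: one row per document, one column per vocabulary word,
# each cell holding the raw count of that word in that document.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[w] for w in vocab])

print(vocab)
print(one_hot("dog"))
for row in matrix:
    print(row)
```

The cells are raw counts, so longer documents and very common words such as "the" dominate the representation.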
Disadvantages
- We end up only counting word occurrences. Some words appear in a
document more often than others
- Not normalized
TF-IDF
A tf-idf score is a decimal number that measures how important a word is to a
document within a corpus. It gives small values to words that are frequent across all
the documents and more weight to words that are rare across the corpus.
https://github.com/Shubha23/Text-processing-NLP/blob/master/NLP%20-%20Text%20processing%20pipeline.ipynb
Source: OpenClassRooms
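The score can be computed by hand on a toy corpus. The sketch below assumes the plain textbook definition, term frequency times log(N / document frequency); libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (assumed for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
N = len(docs)

# Document frequency: number of documents that contain each word.
df = Counter(w for doc in docs for w in set(doc))

def tf_idf(word, doc):
    """tf * idf for a word that occurs somewhere in the corpus."""
    tf = doc.count(word) / len(doc)    # share of the document's tokens
    idf = math.log(N / df[word])       # 0 when the word is in every document
    return tf * idf

for word in ["the", "sat", "cat"]:
    print(word, round(tf_idf(word, docs[0]), 3))
```

In the first document, "the" appears in every document and scores zero, "sat" appears in two of three and scores low, and "cat" appears only here and scores highest.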
Disadvantages
Embeddings
- Captures Analogies
- Distance between words
- Dense vectors, compared to the sparse vectors of count vectorization (CV) and TF-IDF
- Constant vector size
- Universal vector Representation
- Can be extended to sentences, paragraphs, documents
http://projector.tensorflow.org/
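Distance between words and analogies can be illustrated with cosine similarity. The 3-dimensional vectors below are hand-made for the example (real embeddings are learned and have hundreds of dimensions), so the result only demonstrates the arithmetic.

```python
import math

# Hypothetical toy embeddings; dimensions loosely mean (royalty, male, female).
EMB = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Analogy: king - man + woman should land near queen.
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]

# Rank the remaining words by similarity to the target vector.
candidates = [w for w in EMB if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(EMB[w], target))
print(best, round(cosine(EMB[best], target), 3))
```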
Disadvantages
• Cultural Bias
• Out of Vocabulary words
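The out-of-vocabulary problem shows up as soon as a word is missing from the embedding table. A common workaround, sketched here with a hypothetical two-dimensional table, is to map every unknown word to a shared `<unk>` vector, which discards whatever meaning the word carried.

```python
# Hypothetical toy embedding table with a reserved entry for unknown words.
EMBEDDINGS = {
    "car": [0.9, 0.1],
    "road": [0.8, 0.3],
    "<unk>": [0.0, 0.0],
}

def lookup(word):
    """Return the word's vector, or the shared <unk> vector if it is unseen."""
    return EMBEDDINGS.get(word.lower(), EMBEDDINGS["<unk>"])

print(lookup("Car"))         # known word
print(lookup("hovercraft"))  # out-of-vocabulary word falls back to <unk>
```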
Sentiment Analysis
https://pair-code.github.io/lit/tutorials/sentiment/
Sentiment Analysis - NLP
Machine Translation
Speech Recognition
Virtual Assistants
NLP Tools
• NLTK
• spaCy
• Gensim
• Scikit-learn
Toyota Connected NLP Application
• Intelligent Assistant
Reading Materials
• Linguistics:
https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf
• NLP:
CS224n – NLP with Deep Learning - Christopher Manning
Real-World Natural Language Processing by Masato Hagiwara
• Deep Learning:
https://www.deeplearningbook.org/ - Ian Goodfellow, Yoshua Bengio, and Aaron Courville
http://projector.tensorflow.org/
https://pair-code.github.io/lit/tutorials/sentiment/
https://www.youtube.com/watch?v=kiPysxvkmoU&t=63s