2020 NLPDeepLearning


SEMINARS OF THE ESCOLA DE INFORMÁTICA & COMPUTAÇÃO

RECENT ADVANCES IN
NATURAL LANGUAGE PROCESSING
WITH DEEP LEARNING

Prof. Eduardo Bezerra


ebezerra@cefet-rj.br

October 8, 2020
Summary
2

• Introduction
• NLP Periods
  • Symbolic-based
  • Corpus-based
  • Neural-based
• Conclusions
3
Introduction
What is NLP?
4

• At the intersection of linguistics, computer science, and artificial intelligence.
• Has to do with processing and analyzing large amounts of natural language data.
• “Processing and analyzing” → extract context and meaning.
NLP is pop, but it is hard!
5

• Homonymy, polysemy, …
  Jaguar is the luxury vehicle brand of Land Rover.
  The jaguar is an animal of the genus Panthera native to the Americas.
• Natural languages are unstructured, redundant, and ambiguous.
  Enraged cow injures farmer with ax.
NLP Tasks/Applications
6

• Text classification, clustering, summarization
• Machine translation
• Conversational chatbots
• Question answering
• Speech synthesis & recognition
• Text generation
• Auto-correcting
7
NLP Periods
NLP periods
8
9
Symbolic-based NLP (1950s-1990s)
Symbolic-based NLP
10

• Georgetown Experiment (1954)
• ELIZA (1964-1966)
• Cyc Project (1984)
• WordNet (1985)

1950s-1990s
Georgetown-IBM experiment
11

• Machine translation: automatic translation of Russian sentences into English.

“The experiment was considered a success and encouraged governments to invest in computational linguistics. The project managers claimed that machine translation would be a reality in three to five years.”
1954
ELIZA
12

• “Natural language” conversation through pattern matching.
  “...sister...” → “Tell me more about your family.”

1966
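To make the pattern-matching idea concrete, here is a minimal ELIZA-style sketch in Python (an illustration of mine, not Weizenbaum's original program): a keyword is matched with a regular expression and mapped to a canned response.

```python
import re

# Illustrative (pattern, response) rules, loosely inspired by ELIZA's keyword scripts.
RULES = [
    (re.compile(r"\b(sister|mother|father|brother)\b", re.I),
     "Tell me more about your family."),
    (re.compile(r"\bI am ([^.!?]+)", re.I),
     "Why do you say you are {0}?"),
]

def reply(utterance: str) -> str:
    for pattern, response in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse any captured text to personalize the canned response.
            return response.format(*match.groups())
    return "Please go on."

print(reply("My sister visited yesterday."))  # Tell me more about your family.
print(reply("I am feeling sad about it."))    # Why do you say you are feeling sad about it?
```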
Cyc Project
13

1984-
WordNet
14

• 155,327 words organized in 175,979 synsets, for a total of 207,016 word-sense pairs

1985-
WordNet – graph fragment
15

[Figure: a fragment of the WordNet-style graph, in which concepts such as chicken, hen, duck, goose, hawk, bird, animal, creature, egg, feather, wing, beak, claw, and leg are connected by relations such as Is_a, Part, Typ_obj, Typ_subj, Purpose, Means, Not_is_a, and Classifier.]
16
Corpus-based NLP (1990s-2010s)
Corpus-based NLP
17

1990s-2010s
Corpus-based NLP (aka ML-based)
18

• Successful applications of ML methods to text data
  • e.g., SVM, HMM

1990s-2010s
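As a concrete illustration of this period (a sketch of mine, not code from the talk), a corpus-based text classifier can be built from TF-IDF features and a linear SVM; the snippet assumes scikit-learn is installed and uses a tiny invented corpus.

```python
# A minimal corpus-based (ML-era) text classifier: TF-IDF features + a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented for illustration.
texts = [
    "the match ended in a draw after extra time",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "stocks fell after the inflation report",
]
labels = ["sports", "sports", "finance", "finance"]

# Vectorize the documents and fit the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["the striker scored in extra time"]))  # ['sports']
```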
Corpus-based NLP (aka ML-based)
19

• Text Mining

1990s-2010s
20
Neural-based NLP (2010s-present)
21
Conception, gestation, …, birth!
22

“There is a moment of conception and a moment of birth, but between them there is a long period of gestation.”

Jonas Salk, 1914-1995


Distributional Hypothesis
23

“The more semantically similar two words are, the more distributionally similar they will be in turn, and thus the more that they will tend to occur in similar linguistic contexts.”

“Words that are similar in meaning occur in similar contexts.”

1950s
Distributional Hypothesis
24

“Words that are similar in meaning occur in similar contexts.”

It would be marvelous to watch a match between Kasparov and Fischer.
It would be fantastic to watch a match between Kasparov and Fischer.
(“marvelous” and “fantastic” are similar words occurring in the same context)

Zellig Harris, 1909-1992

1950s
Vector Space Model (for Information Retrieval)
25

• SMART Information Retrieval System
• Term-document matrix
• First attempt to model text elements as vectors

Gerard Salton, 1927-1995

1960s


Vector Space Model
26

• Similarity between docs (sentences, words)

Gerard Salton, 1927-1995

1960s
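A minimal sketch of the idea (my own toy example, not the SMART system): documents become count vectors over the vocabulary, and similarity is the cosine of the angle between them.

```python
import numpy as np

# Toy term-document count matrix: rows are documents, columns are term counts.
# Vocabulary (invented for illustration): ["jaguar", "vehicle", "animal", "brand"]
docs = np.array([
    [2, 3, 0, 1],   # a document about the car brand
    [3, 0, 4, 0],   # a document about the animal
    [1, 2, 0, 2],   # another document about the car brand
], dtype=float)

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(docs[0], docs[2]))  # high: both documents are about the car brand
print(cosine(docs[0], docs[1]))  # lower: different topics
```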


Distributed Representations
27

1986
Latent semantic analysis (LSA)
28

1988
Latent semantic analysis (LSA)
29

1988
Latent semantic analysis (LSA)
30

• LSA creates context vectors

1988
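A small sketch of how LSA-style context vectors can be obtained (an illustration over my own toy data, not the original LSA software): apply a truncated SVD to the term-document matrix and keep the low-rank factors.

```python
import numpy as np

# Toy term-document count matrix (terms x documents), invented for illustration.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 0, 2],
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]       # each row: a k-dimensional "context vector" for a term
doc_vectors = (Vt[:k, :].T) * s[:k]   # each row: a k-dimensional vector for a document

print(term_vectors.shape, doc_vectors.shape)  # (4, 2) (4, 2)
```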
Distributed representation – an example
31

Image by Garrett Hoffman


Distributed representation – an example
32

Image by Garrett Hoffman


Distributed representation – an example
33

Image by Garrett Hoffman


Conception, gestation, …, birth!
34

• Conception, gestation:
  • Distributional hypothesis
  • Vector Space Model
  • LSA
  • Distributed representations
• Now, for the Deep Learning based NLP birth…
Neural-based NLP (aka Deep Learning based)
35

• Most SOTA results in NLP today are obtained through Deep Learning methods.
• One of the main achievements of this period is building rich distributed representations of text objects through deep neural networks.
2010s-present
word2vec
36

• Efficient Estimation of Word Representations in Vector Space, September 7th, 2013.
• Distributed Representations of Words and Phrases and their Compositionality, October 16th, 2013. (20K+ citations)

Tomas Mikolov

Idea: each word can be represented by a fixed-length numeric vector. Words with similar meanings have similar vectors.

2013
word2vec
37

• In word2vec, a single-hidden-layer NN is trained to perform a certain “fake” task.
  • Skip-gram: predicting surrounding context words given a center word.
  • CBOW: predicting a center word from the surrounding context.
• But this NN is not actually used!
• Instead, the goal is to learn the weights of the hidden layer; these weights are the “word vectors”.
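For a hands-on feel (not part of the slides), word vectors can be trained with the gensim library; the sketch below assumes gensim 4.x and uses a tiny invented corpus, so the resulting vectors will be poor.

```python
# Sketch: training word vectors with gensim (assumes gensim >= 4.0; toy corpus invented here).
from gensim.models import Word2Vec

sentences = [
    ["the", "jaguar", "is", "a", "luxury", "vehicle", "brand"],
    ["the", "jaguar", "is", "an", "animal", "native", "to", "the", "americas"],
    ["the", "hen", "is", "a", "bird"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors (the hidden-layer size)
    window=5,         # context window size
    min_count=1,      # keep every word of the toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["jaguar"]                      # the learned 50-dimensional vector
print(model.wv.most_similar("jaguar", topn=3))
```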
word2vec: skip-gram alternative
38

• The task: given a specific word w in the middle of a sentence (the input word), look at the words nearby and pick one word at random.
• The solution: train an ANN to produce the probability (for every word in the vocabulary) of being nearby w.
• “Nearby” means there is actually a “window size” hyperparameter (typical value: 5).
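A sketch of how the (input word, nearby word) training pairs can be generated with the window-size hyperparameter; this is illustrative code of mine, not the original word2vec implementation.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every word and every neighbor within the window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. the -> quick, the -> brown, quick -> the, quick -> brown, quick -> fox, ...
```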
word2vec
39

• Each word in the vocabulary is represented using one-hot encoding (aka local representation!).

Credits: Marco Bonzanini
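A quick sketch of one-hot (local) encoding over a toy, invented vocabulary:

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]          # toy vocabulary, invented
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index: a local representation.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("brown"))   # [0. 0. 1. 0.]
```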


word2vec
40

Credits: Marco Bonzanini


word2vec
41

Credits: Marco Bonzanini


word2vec
42

Skip-gram NN architecture

The number of neurons in the hidden layer (a hyperparameter) determines the size of the embedding.
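To make the architecture concrete, a numpy sketch of the forward pass (my own illustration, not the original implementation): a V-dimensional one-hot input, a hidden layer of size d with no nonlinearity, and a softmax output over the V words. After training, the rows of the input weight matrix are the word vectors.

```python
import numpy as np

V, d = 10_000, 300                            # vocabulary size and embedding size (hyperparameter)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, d))    # input->hidden weights: one d-dim vector per word
W_out = rng.normal(scale=0.01, size=(d, V))   # hidden->output weights

def forward(center_word_id):
    h = W_in[center_word_id]                  # one-hot input times W_in just selects a row
    scores = h @ W_out                        # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                    # softmax: probability of each word being "nearby"

probs = forward(42)
print(probs.shape, probs.sum())               # (10000,), probabilities summing to ~1
# After training, W_in[i] is the embedding of word i.
```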
word2vec
43
word2vec
44

• word2vec captures context similarity:
  • If words wj and wk have similar contexts, then the model needs to output very similar results for them.
  • One way for the network to do this is to make the word vectors for wj and wk very similar.
  • So, if two words have similar contexts, the network is motivated to learn similar word vectors for them.
word2vec
45

Credits: http://jalammar.github.io/illustrated-word2vec/
Embedding models
46

• Word2Vec
• GloVe
• SkipThoughts
• Paragraph2Vec
• Doc2Vec
• FastText

Currently, the distributional hypothesis, realized through vector embedding models generated by ANNs, is used pervasively in NLP.
Encoder-Decoder models (aka seq2seq models)
47

[Diagram: Encoder → Decoder]

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
“Classical” Encoder-Decoder model
48

“The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector.”

Recurrent architecture

2014
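A compact PyTorch sketch of this classical setup (an assumption of mine; the tutorial linked on the previous slide is a fuller version): one LSTM compresses the input sequence into its final hidden state, and a second LSTM generates the output sequence from that state.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        _, (h, c) = self.lstm(self.embed(src))   # keep only the final state: the "thought vector"
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, state):               # tgt: (batch, tgt_len) token ids
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state           # logits over the target vocabulary

# Toy usage with random token ids (shapes only; no real training here).
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
state = enc(torch.randint(0, 1000, (2, 7)))
logits, _ = dec(torch.randint(0, 1000, (2, 5)), state)
print(logits.shape)                              # torch.Size([2, 5, 1000])
```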


Encoder-Decoder model with Attention
49

Recurrent architecture

2015


Attention models into recurrent NNs
50

Bahdanau et al., 2015


Transformers
51

ATTENTION

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”

Feedforward architecture!

2017
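The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a small numpy sketch of that formula (my own illustration, not the paper's code).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights: each row sums to 1
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```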
Transformers
52

• Transformers are the current SOTA neural architecture when it comes to producing text representations for use in most NLP tasks.

From Vaswani et al. (2017)


Famous Transformer Models
53

• BERT (Bidirectional Encoder Representations from Transformers)
• GPT-2 (Generative Pre-Training)
• GPT-3

2018-2020
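As a usage sketch (my assumption, not part of the slides): with the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint, contextual token representations can be extracted as follows.

```python
# Sketch: extracting contextual token representations from a pretrained BERT
# (assumes the Hugging Face `transformers` library and an internet connection
# to download the checkpoint on first use).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Enraged cow injures farmer with ax.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token; these are the contextual embeddings.
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 10, 768])
```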
54
Conclusions
Takeaway notes
55

• SOTA results in most NLP tasks are currently neural-based.
• Neural-based NLP is recent, but it relies on older ideas.
• The attention mechanism is a novel and very promising idea.
Pretrained models
56

• https://code.google.com/archive/p/word2vec/
• https://github.com/google-research/bert
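A hedged sketch of loading the pretrained word2vec vectors with gensim; the file name below is the usual one from the Google archive above, but treat both it and the gensim 4.x API as assumptions.

```python
# Sketch: loading pretrained word2vec vectors with gensim (assumes gensim >= 4.0
# and that the binary file from the Google archive above has been downloaded).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",  # assumed local path to the downloaded file
    binary=True,
)

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```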
Neural Nets need a Vapnik!
57

The theory behind the generalization properties of ANNs is not yet completely understood.


TODO: Natural Language Understanding
58

• Headlines:
  • Enraged Cow Injures Farmer With Ax
  • Hospitals Are Sued by 7 Foot Doctors
  • Ban on Nude Dancing on Governor’s Desk
  • Iraqi Head Seeks Arms
  • Local HS Dropouts Cut in Half
  • Juvenile Court to Try Shooting Defendant
  • Stolen Painting Found by Tree

Humans use their underlying understanding of the world as context.
Source: CS188
TODO: Common Sense Knowledge
59

“If a mother has a son, then the son is younger than the mother and remains younger for his entire life.”

“If President Trump is in Washington, then his left foot is also in Washington.”
Food for thought
60

“There’ll be a lot of people who argue against it, who say you can’t capture a thought like that. But there’s no reason why not. I think you can capture a thought by a vector.”

Geoff Hinton
These slides are available at
http://eic.cefet-rj.br/~ebezerra/

Eduardo Bezerra (ebezerra@cefet-rj.br)
62
Backup slides
Language Models (Unigrams, Bigrams, etc.)
63

• A model that assigns a probability to a sequence of tokens.
• A good language model gives...
  • ...a high probability to (syntactically and semantically) valid sentences.
  • ...a low probability to nonsense.
Language Models (Unigrams, Bigrams, etc.)
64

• Mathematically, we can apply a LM to any given sequence of n words:

  P(w1, w2, …, wn)
Language Models (Unigrams, Bigrams, etc.)
65

• An example:
  "The quick brown fox jumps over the lazy dog."
• Another example:
  "The quik brown lettuce over jumps the lazy dog."
Language Models (Unigrams, Bigrams, etc.)
66

• Unigram model: P(w1, …, wn) ≈ P(w1) · P(w2) · … · P(wn)
• Bigram model: P(w1, …, wn) ≈ P(w1) · P(w2 | w1) · … · P(wn | wn-1)

But how to learn these probabilities?
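One classical answer (a sketch of mine, not from the slides): estimate them from counts in a corpus, e.g. the maximum-likelihood bigram estimate P(wi | wi-1) = count(wi-1, wi) / count(wi-1).

```python
from collections import Counter

# Tiny invented corpus; real language models are estimated from huge corpora.
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps".split(),
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("lazy", "the"))   # 2 occurrences of "the lazy" out of 3 "the" -> ~0.67
print(p_bigram("dog", "lazy"))   # 1.0: "lazy" is always followed by "dog" in this corpus
```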
Transfer Learning
67
68
Neural Nets
Artificial Neural Net
69

• It is possible to build arbitrarily complex networks using the artificial neuron as the basic component.
Artificial Neural Net
70

Feedforward Neural Network


Training
71

• Given a training set of the form {(x(1), y(1)), …, (x(m), y(m))}, training an ANN corresponds to using this set to adjust the parameters of the network, so that the training error is minimized.
• So, training an ANN is an optimization problem.


Training
72

• The error signal (computed with a cost function) is used during training to gradually change the weights (parameters), so that the predictions are more accurate.

Training loop:
1. Pick a batch of training examples.
2. Propagate them through the layers, from input to output (forward pass).
3. Backpropagate the error signal through the layers, from the output to the input (backward pass).
4. Update the parameters W, b for all hidden layers.
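A minimal numpy sketch of this loop for a tiny one-hidden-layer network trained by plain gradient descent (an illustration over my own toy data, not the slides' code).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # a batch of 64 training examples, 3 features
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # invented binary targets

# One hidden layer: parameters W1, b1, W2, b2.
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for step in range(500):
    # 1) Forward pass: propagate the batch from input to output.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    # 2) Backward pass: backpropagate the error signal from output to input.
    dz2 = (p - y) / len(X)                    # gradient w.r.t. the output pre-activation
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)                   # tanh derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # 3) Update the parameters W, b by a small step against the gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.3f}")     # the training error decreases over the loop
```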
