2020 NLPDeepLearning


SEMINARS OF THE ESCOLA DE INFORMÁTICA & COMPUTAÇÃO

RECENT ADVANCES IN
NATURAL LANGUAGE PROCESSING
WITH DEEP LEARNING

Prof. Eduardo Bezerra


ebezerra@cefet-rj.br

October 8, 2020
Summary
2

• Introduction
• NLP Periods
  • Symbolic-based
  • Corpus-based
  • Neural-based
• Conclusions
3
Introduction
What is NLP?
4

• At the intersection of linguistics, computer science, and artificial intelligence.
• Has to do with processing and analyzing large amounts of natural language data.
• “Processing and analyzing” → extract context and meaning.
NLP is pop, but it is hard!
5

• Homonymy, polysemy, …
  Jaguar is the luxury vehicle brand of Land Rover.
  The jaguar is an animal of the genus Panthera native to the Americas.
• Natural languages are unstructured, redundant, and ambiguous.
  Enraged cow injures farmer with ax.
NLP Tasks/Applications
6

• Text classification, clustering, summarization
• Machine translation
• Conversational chatbots
• Question answering
• Speech synthesis & recognition
• Text generation
• Auto-correcting
7
NLP Periods
NLP periods
8
9
Symbolic-based NLP (1950s-1990s)
Symbolic-based NLP
10

• Georgetown Experiment (1954)
• ELIZA (1964-1966)
• Cyc Project (1984)
• WordNet (1985)

1950s-1990s
Georgetown-IBM experiment
11

• Machine translation: automatic translation of Russian sentences into English.

“The experiment was considered a success and encouraged governments to invest in computational linguistics. The project managers claimed that machine translation would be a reality in three to five years.”
1954
ELIZA
12

• “Natural language” conversation through pattern matching.
  “...sister...” → “Tell me more about your family.”

1966
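To make the pattern-matching idea concrete, here is a minimal ELIZA-style sketch in Python (an illustration of mine, not Weizenbaum's original program): a keyword is matched with a regular expression and mapped to a canned response.

```python
import re

# Illustrative (pattern, response) rules, loosely inspired by ELIZA's keyword scripts.
RULES = [
    (re.compile(r"\b(sister|mother|father|brother)\b", re.I),
     "Tell me more about your family."),
    (re.compile(r"\bI am ([^.!?]+)", re.I),
     "Why do you say you are {0}?"),
]

def reply(utterance: str) -> str:
    for pattern, response in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse any captured text to personalize the canned response.
            return response.format(*match.groups())
    return "Please go on."

print(reply("My sister visited yesterday."))  # Tell me more about your family.
print(reply("I am feeling sad about it."))    # Why do you say you are feeling sad about it?
```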
Cyc Project
13

1984-
WordNet
14

• 155,327 words organized in 175,979 synsets, for a total of 207,016 word-sense pairs

1985-
WordNet – graph fragment
15

[Figure: a fragment of the WordNet-style graph, in which concepts such as chicken, hen, duck, goose, hawk, bird, animal, creature, egg, feather, wing, beak, claw, and leg are connected by relations such as Is_a, Part, Typ_obj, Typ_subj, Purpose, Means, Not_is_a, and Classifier.]
16
Corpus-based NLP (1990s-2010s)
Corpus-based NLP
17

1990s-2010s
Corpus-based NLP (aka ML-based)
18

• Successful applications of ML methods to text data
  • e.g., SVM, HMM

1990s-2010s
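As a concrete illustration of this period (a sketch of mine, not code from the talk), a corpus-based text classifier can be built from TF-IDF features and a linear SVM; the snippet assumes scikit-learn is installed and uses a tiny invented corpus.

```python
# A minimal corpus-based (ML-era) text classifier: TF-IDF features + a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented for illustration.
texts = [
    "the match ended in a draw after extra time",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "stocks fell after the inflation report",
]
labels = ["sports", "sports", "finance", "finance"]

# Vectorize the documents and fit the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["the striker scored in extra time"]))  # ['sports']
```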
Corpus-based NLP (aka ML-based)
19

• Text Mining

1990s-2010s
20
Neural-based NLP (2010s-present)
21
Conception, gestation, …, birth!
22

“There is a moment of conception and a moment of birth, but between them there is a long period of gestation.”

Jonas Salk, 1914-1995


Distributional Hypothesis
23

“The more semantically similar two words are, the more distributionally similar they will be in turn, and thus the more that they will tend to occur in similar linguistic contexts.”

“Words that are similar in meaning occur in similar contexts.”

1950s
Distributional Hypothesis
24

“Words that are similar in meaning occur in similar contexts.”

It would be marvelous to watch a match between Kasparov and Fischer.
It would be fantastic to watch a match between Kasparov and Fischer.
(“marvelous” and “fantastic” are similar words occurring in the same context)

Zellig Harris, 1909-1992

1950s
Vector Space Model (for Information Retrieval)
25

• SMART Information Retrieval System
• Term-document matrix
• First attempt to model text elements as vectors

Gerard Salton, 1927-1995

1960s


Vector Space Model
26

• Similarity between docs (sentences, words)

Gerard Salton, 1927-1995

1960s
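A minimal sketch of the idea (my own toy example, not the SMART system): documents become count vectors over the vocabulary, and similarity is the cosine of the angle between them.

```python
import numpy as np

# Toy term-document count matrix: rows are documents, columns are term counts.
# Vocabulary (invented for illustration): ["jaguar", "vehicle", "animal", "brand"]
docs = np.array([
    [2, 3, 0, 1],   # a document about the car brand
    [3, 0, 4, 0],   # a document about the animal
    [1, 2, 0, 2],   # another document about the car brand
], dtype=float)

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(docs[0], docs[2]))  # high: both documents are about the car brand
print(cosine(docs[0], docs[1]))  # lower: different topics
```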


Distributed Representations
27

1986
Latent semantic analysis (LSA)
28

1988
Latent semantic analysis (LSA)
29

1988
Latent semantic analysis (LSA)
30

• LSA creates context vectors

1988
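A small sketch of how LSA-style context vectors can be obtained (an illustration over my own toy data, not the original LSA software): apply a truncated SVD to the term-document matrix and keep the low-rank factors.

```python
import numpy as np

# Toy term-document count matrix (terms x documents), invented for illustration.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 0, 2],
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]       # each row: a k-dimensional "context vector" for a term
doc_vectors = (Vt[:k, :].T) * s[:k]   # each row: a k-dimensional vector for a document

print(term_vectors.shape, doc_vectors.shape)  # (4, 2) (4, 2)
```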
Distributed representation – an example
31

Image by Garrett Hoffman


Distributed representation – an example
32

Image by Garrett Hoffman


Distributed representation – an example
33

Image by Garrett Hoffman


Conception, gestation, …, birth!
34

• Conception, gestation:
  • Distributional hypothesis
  • Vector Space Model
  • LSA
  • Distributed representations
• Now, for the Deep Learning based NLP birth…
Neural-based NLP (aka Deep Learning based)
35

• Most SOTA results in NLP today are obtained through Deep Learning methods.
• One of the main achievements of this period is building rich distributed representations of text objects through deep neural networks.
2010s-present
word2vec
36

• Efficient Estimation of Word Representations in Vector Space, September 7th, 2013.
• Distributed Representations of Words and Phrases and their Compositionality, October 16th, 2013. (20K+ citations)

Tomas Mikolov

Idea: each word can be represented by a fixed-length numeric vector. Words with similar meanings have similar vectors.

2013
word2vec
37

• In word2vec, a single-hidden-layer NN is trained to perform a certain “fake” task.
  • Skip-gram: predicting surrounding context words given a center word.
  • CBOW: predicting a center word from the surrounding context.
• But this NN is not actually used!
• Instead, the goal is to learn the weights of the hidden layer; these weights are the “word vectors”.
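For a hands-on feel (not part of the slides), word vectors can be trained with the gensim library; the sketch below assumes gensim 4.x and uses a tiny invented corpus, so the resulting vectors will be poor.

```python
# Sketch: training word vectors with gensim (assumes gensim >= 4.0; toy corpus invented here).
from gensim.models import Word2Vec

sentences = [
    ["the", "jaguar", "is", "a", "luxury", "vehicle", "brand"],
    ["the", "jaguar", "is", "an", "animal", "native", "to", "the", "americas"],
    ["the", "hen", "is", "a", "bird"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors (the hidden-layer size)
    window=5,         # context window size
    min_count=1,      # keep every word of the toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["jaguar"]                      # the learned 50-dimensional vector
print(model.wv.most_similar("jaguar", topn=3))
```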
word2vec: skip-gram alternative
38

• The task: given a specific word w in the middle of a sentence (the input word), look at the words nearby and pick one word at random.
• The solution: train an ANN to produce the probability (for every word in the vocabulary) of being nearby w.
• “Nearby” means there is actually a “window size” hyperparameter (typical value: 5).
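A sketch of how the (input word, nearby word) training pairs can be generated with the window-size hyperparameter; this is illustrative code of mine, not the original word2vec implementation.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every word and every neighbor within the window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. the -> quick, the -> brown, quick -> the, quick -> brown, quick -> fox, ...
```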
word2vec
39

• Each word in the vocabulary is represented using one-hot encoding (aka local representation!).

Credits: Marco Bonzanini
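A quick sketch of one-hot (local) encoding over a toy, invented vocabulary:

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]          # toy vocabulary, invented
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index: a local representation.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("brown"))   # [0. 0. 1. 0.]
```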


word2vec
40

Credits: Marco Bonzanini


word2vec
41

Credits: Marco Bonzanini


word2vec
42

Skip-gram NN architecture

The number of neurons in the hidden layer (a hyperparameter) determines the size of the embedding.
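To make the architecture concrete, a numpy sketch of the forward pass (my own illustration, not the original implementation): a V-dimensional one-hot input, a hidden layer of size d with no nonlinearity, and a softmax output over the V words. After training, the rows of the input weight matrix are the word vectors.

```python
import numpy as np

V, d = 10_000, 300                            # vocabulary size and embedding size (hyperparameter)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, d))    # input->hidden weights: one d-dim vector per word
W_out = rng.normal(scale=0.01, size=(d, V))   # hidden->output weights

def forward(center_word_id):
    h = W_in[center_word_id]                  # one-hot input times W_in just selects a row
    scores = h @ W_out                        # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                    # softmax: probability of each word being "nearby"

probs = forward(42)
print(probs.shape, probs.sum())               # (10000,), probabilities summing to ~1
# After training, W_in[i] is the embedding of word i.
```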
word2vec
43
word2vec
44

• word2vec captures context similarity:
  • If words wj and wk have similar contexts, then the model needs to output very similar results for them.
  • One way for the network to do this is to make the word vectors for wj and wk very similar.
  • So, if two words have similar contexts, the network is motivated to learn similar word vectors for them.
word2vec
45

Credits: http://jalammar.github.io/illustrated-word2vec/
Embedding models
46

• Word2Vec
• GloVe
• SkipThoughts
• Paragraph2Vec
• Doc2Vec
• FastText

Currently, the distributional hypothesis, realized through vector embedding models generated by ANNs, is used pervasively in NLP.
Encoder-Decoder models (aka seq2seq models)
47

[Diagram: Encoder → Decoder]

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
“Classical” Encoder-Decoder model
48

“The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector.”

Recurrent architecture

2014
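A compact PyTorch sketch of this classical setup (an assumption of mine; the tutorial linked on the previous slide is a fuller version): one LSTM compresses the input sequence into its final hidden state, and a second LSTM generates the output sequence from that state.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        _, (h, c) = self.lstm(self.embed(src))   # keep only the final state: the "thought vector"
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, state):               # tgt: (batch, tgt_len) token ids
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state           # logits over the target vocabulary

# Toy usage with random token ids (shapes only; no real training here).
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
state = enc(torch.randint(0, 1000, (2, 7)))
logits, _ = dec(torch.randint(0, 1000, (2, 5)), state)
print(logits.shape)                              # torch.Size([2, 5, 1000])
```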


Encoder-Decoder model with Attention
49

Recurrent architecture

2015


Attention models into recurrent NNs
50

Bahdanau et al., 2015


Transformers
51

ATTENTION

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”

Feedforward architecture!

2017
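The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a small numpy sketch of that formula (my own illustration, not the paper's code).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights: each row sums to 1
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```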
Transformers
52

• Transformers are the current SOTA neural architecture when it comes to producing text representations for use in most NLP tasks.

From Vaswani et al. (2017)


Famous Transformer Models
53

• BERT (Bidirectional Encoder Representations from Transformers)
• GPT-2 (Generative Pre-Training)
• GPT-3

2018-2020
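As a usage sketch (my assumption, not part of the slides): with the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint, contextual token representations can be extracted as follows.

```python
# Sketch: extracting contextual token representations from a pretrained BERT
# (assumes the Hugging Face `transformers` library and an internet connection
# to download the checkpoint on first use).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Enraged cow injures farmer with ax.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token; these are the contextual embeddings.
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 10, 768])
```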
54
Conclusions
Takeaway notes
55

• SOTA results in most NLP tasks are currently neural-based.
• Neural-based NLP is recent, but it relies on older ideas.
• The attention mechanism is a novel and very promising idea.
Pretrained models
56

• https://code.google.com/archive/p/word2vec/
• https://github.com/google-research/bert
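A hedged sketch of loading the pretrained word2vec vectors with gensim; the file name below is the usual one from the Google archive above, but treat both it and the gensim 4.x API as assumptions.

```python
# Sketch: loading pretrained word2vec vectors with gensim (assumes gensim >= 4.0
# and that the binary file from the Google archive above has been downloaded).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",  # assumed local path to the downloaded file
    binary=True,
)

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```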
Neural Nets need a Vapnik!
57

The theory behind the generalization properties of ANNs is not yet completely understood.


TODO: Natural Language Understanding
58

• Headlines:
  • Enraged Cow Injures Farmer With Ax
  • Hospitals Are Sued by 7 Foot Doctors
  • Ban on Nude Dancing on Governor’s Desk
  • Iraqi Head Seeks Arms
  • Local HS Dropouts Cut in Half
  • Juvenile Court to Try Shooting Defendant
  • Stolen Painting Found by Tree

Humans use their underlying understanding of the world as context.
Source: CS188
TODO: Common Sense Knowledge
59

“If a mother has a son, then the son is younger than the mother and remains younger for his entire life.”

“If President Trump is in Washington, then his left foot is also in Washington.”
Food for thought
60

“There’ll be a lot of people who argue against it, who say you can’t capture a thought like that. But there’s no reason why not. I think you can capture a thought by a vector.”

Geoff Hinton
These slides are available at
http://eic.cefet-rj.br/~ebezerra/

Eduardo Bezerra (ebezerra@cefet-rj.br)
62
Backup slides
Language Models (Unigrams, Bigrams, etc.)
63

• A model that assigns a probability to a sequence of tokens.
• A good language model gives...
  • ...a high probability to (syntactically and semantically) valid sentences.
  • ...a low probability to nonsense.
Language Models (Unigrams, Bigrams, etc.)
64

• Mathematically, we can apply a LM to any given sequence of n words:

  P(w1, w2, …, wn)
Language Models (Unigrams, Bigrams, etc.)
65

• An example:
  "The quick brown fox jumps over the lazy dog."
• Another example:
  "The quik brown lettuce over jumps the lazy dog."
Language Models (Unigrams, Bigrams, etc.)
66

• Unigram model: P(w1, …, wn) ≈ P(w1) · P(w2) · … · P(wn)
• Bigram model: P(w1, …, wn) ≈ P(w1) · P(w2 | w1) · … · P(wn | wn-1)

But how to learn these probabilities?
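One classical answer (a sketch of mine, not from the slides): estimate them from counts in a corpus, e.g. the maximum-likelihood bigram estimate P(wi | wi-1) = count(wi-1, wi) / count(wi-1).

```python
from collections import Counter

# Tiny invented corpus; real language models are estimated from huge corpora.
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps".split(),
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("lazy", "the"))   # 2 occurrences of "the lazy" out of 3 "the" -> ~0.67
print(p_bigram("dog", "lazy"))   # 1.0: "lazy" is always followed by "dog" in this corpus
```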
Transfer Learning
67
68
Neural Nets
Artificial Neural Net
69

• It is possible to build arbitrarily complex networks using the artificial neuron as the basic component.
Artificial Neural Net
70

Feedforward Neural Network


Training
71

• Given a training set of the form {(x(1), y(1)), …, (x(m), y(m))}, training an ANN corresponds to using this set to adjust the parameters of the network, so that the training error is minimized.
• So, training an ANN is an optimization problem.


Training
72

• The error signal (computed with a cost function) is used during training to gradually change the weights (parameters), so that the predictions are more accurate.

Training loop:
1. Pick a batch of training examples.
2. Propagate them through the layers, from input to output (forward pass).
3. Backpropagate the error signal through the layers, from the output to the input (backward pass).
4. Update the parameters W, b for all hidden layers.
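A minimal numpy sketch of this loop for a tiny one-hidden-layer network trained by plain gradient descent (an illustration over my own toy data, not the slides' code).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # a batch of 64 training examples, 3 features
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # invented binary targets

# One hidden layer: parameters W1, b1, W2, b2.
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for step in range(500):
    # 1) Forward pass: propagate the batch from input to output.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    # 2) Backward pass: backpropagate the error signal from output to input.
    dz2 = (p - y) / len(X)                    # gradient w.r.t. the output pre-activation
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)                   # tanh derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # 3) Update the parameters W, b by a small step against the gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.3f}")     # the training error decreases over the loop
```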
