CS60010: Deep Learning
Spring 2023
Transformer - Part 3: Pre-training (BERT etc.)
Sudeshna Sarkar
16 Mar 2023
Transfer Learning
• Leverage unlabeled data to cut
down on the number of labeled
examples needed.
• Take a network trained on a task for
which it is easy to generate labels,
and adapt it to a different task for
which it is harder.
• Train a really big language model on
billions of words, transfer to every
NLP task!
Contextualized Word Embedding
• Language models only use left context or right context, but language
understanding is bidirectional.
• Variations, misspellings, and novel words such as “taaaasty”, “laern”, and “tarnsformerify” cannot all be listed in a finite word vocabulary; this motivates subword tokenization.
The byte-pair encoding algorithm
1. Start with a vocabulary containing only characters and an “end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.
Common words end up being a part of the subword vocabulary, while rarer words are split into
(sometimes intuitive, sometimes not) components.
In the worst case, words are split into as many subwords as they have characters.
• taa## aaa## sty
• la## ern##
• Transformer## ify
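A minimal sketch of the merge loop described above, in plain Python; the tiny corpus and number of merges are illustrative, and real tokenizers add details (byte-level fallback, the ## continuation marker) that are omitted here:

```python
from collections import Counter

def byte_pair_encoding(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a list of characters plus an end-of-word symbol.
    words = Counter(corpus.split())
    segmented = {w: list(w) + ["</w>"] for w in words}
    vocab = {ch for seg in segmented.values() for ch in seg}
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, freq in words.items():
            seg = segmented[w]
            for a, b in zip(seg, seg[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Replace every occurrence of the pair with the new subword.
        for w, seg in segmented.items():
            out, i = [], 0
            while i < len(seg):
                if i + 1 < len(seg) and seg[i] == a and seg[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seg[i])
                    i += 1
            segmented[w] = out
    return vocab, merges, segmented

vocab, merges, segmented = byte_pair_encoding("low lower lowest new newer", num_merges=10)
print(merges[:5])
print(segmented["lower"])   # e.g. ['low', 'e', 'r', '</w>'] after a few merges
```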
Pretraining
1. Pretrained word embeddings
2. Pretraining whole models
• In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
• representations of language
• parameter initializations for strong NLP models
• probability distributions over language that we can sample from
Pretraining through language modeling
• Recall the language modeling task:
• Model $p_\theta(w_t \mid w_{1:t-1})$, the probability distribution over words given their past contexts.
Step 1: Pretrain (on language modeling): lots of text; learn general things!
Step 2: Finetune (on your task): not many labels; adapt to the task!
[Figure: a decoder with a position-wise softmax predicting the next words (e.g., “goes to make tasty tea END”, “Output: I think therefore I am”), pretrained on text and then finetuned on a labeled task (☺/☹).]
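A sketch of Step 1's objective in PyTorch: teacher-forced next-word prediction trained with cross-entropy on shifted tokens. The `decoder` below is just a stand-in module that maps token ids to per-position logits; any real decoder fits the same interface:

```python
import torch
import torch.nn as nn

# Stand-in decoder: any model mapping token ids (B, T) -> logits (B, T, |V|) works.
vocab_size, d_model = 1000, 64
decoder = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

def lm_loss(tokens):
    """Language-modeling loss: predict w_t from w_{1:t-1} (teacher forcing)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]        # shift by one position
    logits = decoder(inputs)                               # (B, T-1, |V|)
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )

batch = torch.randint(0, vocab_size, (8, 32))              # 8 sequences of 32 token ids
loss = lm_loss(batch)
loss.backward()                                            # Step 1: pretrain on lots of text
```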
Training of BERT
Approach 1: Masked LM: randomly mask some of the input tokens and train BERT to predict them from the bidirectional context.
[Figure: input “潮水 [MASK] 就 知道 ……” (“the tide [MASK], and then you know …”); BERT predicts the masked token “退了” (“went out”).]
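A simplified sketch of the masked-LM corruption (BERT's actual recipe masks about 15% of tokens and only sometimes replaces them with [MASK]; here every selected token becomes [MASK], and the ids are illustrative):

```python
import torch

MASK_ID, VOCAB_SIZE, MASK_PROB = 103, 30522, 0.15   # illustrative ids and sizes
IGNORE_INDEX = -100                                  # positions not scored by the loss

def mask_tokens(token_ids):
    """Simplified masked-LM corruption: hide ~15% of tokens, predict only those."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < MASK_PROB   # choose positions to corrupt
    labels[~mask] = IGNORE_INDEX                     # loss only on the masked positions
    corrupted = token_ids.clone()
    corrupted[mask] = MASK_ID                        # replace chosen tokens with [MASK]
    return corrupted, labels

tokens = torch.randint(0, VOCAB_SIZE, (2, 16))
inputs, labels = mask_tokens(tokens)
# BERT is trained to reconstruct `labels` at the masked positions of `inputs`,
# using bidirectional context (cross-entropy with ignore_index=IGNORE_INDEX).
```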
Training of BERT
Approach 2: Next Sentence Prediction: give BERT two sentences joined as “[CLS] sentence 1 [SEP] sentence 2” and predict (yes/no) whether the second sentence really follows the first.
• [CLS]: the position whose output is used for the classification result.
• Randomly swap the order of the sentences 50% of the time, so the classifier also sees negative examples.
[Figure: input “[CLS] 醒醒 吧 [SEP] 你 沒有 妹妹” (“[CLS] wake up [SEP] you don’t have a sister”); BERT outputs yes/no at the [CLS] position.]
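A sketch of how one training pair for next sentence prediction can be assembled, following the slide's description (50% of the time the two sentences are swapped, and the [CLS] position is asked to predict yes/no); the example sentences are placeholders:

```python
import random

def make_nsp_example(first, second):
    """Build one next-sentence-prediction example from a consecutive sentence pair.
    50% of the time the pair is kept in order (label 1 = "yes"); 50% of the time
    the order is swapped (label 0), as described on the slide."""
    if random.random() < 0.5:
        first, second = second, first
        is_next = 0
    else:
        is_next = 1
    text = "[CLS] " + first + " [SEP] " + second   # the [CLS] output classifies yes/no
    return text, is_next

print(make_nsp_example("the tide went out", "then you knew who was swimming naked"))
```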
Bidirectional Encoder Representations from Transformers (BERT)
• BERT = Encoder of the Transformer
• Learned from a large amount of text without annotation
[Figure: input tokens (e.g., “潮水 退了 就 知道 ……”, “the tide went out, and then you know …”) fed into the BERT encoder, which outputs one contextual embedding per token.]
Sentence / Document Classification with BERT
• Input: [CLS] w1 w2 w3 … (the sentence); Output: a class, predicted from the final [CLS] representation.
• The linear classifier and BERT are fine-tuned together on the labeled data.
• Example: sentiment analysis, document classification.
Tagging with BERT
• Input: [CLS] w1 w2 w3 …; Output: one class for each input token.
Sentence-pair Classification with BERT
• Input: [CLS] w1 w2 [SEP] w3 w4 w5 (Sentence 1, then Sentence 2); Output: a class, predicted from the [CLS] position.
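These setups share one pattern: a pretrained encoder produces per-token hidden states, and a small task head reads them. A minimal PyTorch sketch of the [CLS]-based case (the encoder below is an untrained stand-in; in practice you would load pretrained BERT weights):

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    """Linear classifier on top of the [CLS] token's final hidden state."""
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder                                 # pretrained BERT-style encoder
        self.classifier = nn.Linear(hidden_dim, num_classes)   # trained from scratch

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)     # (B, T, hidden_dim)
        cls_vec = hidden[:, 0]               # position 0 is the [CLS] token
        return self.classifier(cls_vec)      # (B, num_classes)

# Stand-in encoder; in practice, load pretrained weights instead.
encoder = nn.Sequential(
    nn.Embedding(30522, 768),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2))
model = BertClassifier(encoder, hidden_dim=768, num_classes=2)
logits = model(torch.randint(0, 30522, (4, 16)))   # e.g. sentiment: 4 sentences, 2 classes
```

For tagging, apply the linear layer to every position's hidden state instead of only [CLS]; for sentence pairs, concatenate the two sentences with [SEP] and classify the [CLS] vector exactly as above.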
⚫ BERT BASE
⚫ 12 layers, 768-dim per word-piece token
⚫ 12 heads.
⚫ Total parameters = 110M
⚫ BERT LARGE
⚫ 24 layers, 1024-dim per word-piece token
⚫ 16 heads.
⚫ Total parameters = 340M
MultiNLI
Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction
CoLA
Sentence: The wagon rumbled down the road. Label: Acceptable
Sentence: The car honked down the road. Label: Unacceptable
If your task involves generating sequences, consider using a pretrained decoder; BERT and other
pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation
methods.
Example: a masked-LM encoder turns “Iroh goes to [MASK] tasty tea” into “Iroh goes to make tasty tea”; it fills in blanks rather than generating text left to right, one word at a time.
Extensions of BERT
You’ll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task
[Figure: BERT masks individual subword tokens, while SpanBERT masks a contiguous span and predicts it.]
Parameter-efficient finetuning: low-rank adaptation
• Instead of updating every pretrained weight matrix $W \in \mathbb{R}^{d \times d}$ of the network (Transformer, LSTM, ++), keep $W$ frozen and learn a low-rank update.
• Train $A \in \mathbb{R}^{d \times k}$ and $B \in \mathbb{R}^{k \times d}$ with $k \ll d$, so that the finetuned weights are $W + AB$.
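A sketch of this low-rank update applied to a single linear layer in PyTorch (the class name is hypothetical; real libraries implementing the same idea add extra scaling and dropout):

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update AB (a sketch)."""
    def __init__(self, pretrained: nn.Linear, k: int):
        super().__init__()
        d_out, d_in = pretrained.weight.shape
        self.weight = pretrained.weight
        self.weight.requires_grad_(False)                     # W stays frozen
        self.bias = pretrained.bias
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(d_out, k) * 0.01)   # A in R^{d x k}
        self.B = nn.Parameter(torch.zeros(k, d_in))           # B in R^{k x d}; AB starts at 0

    def forward(self, x):
        w = self.weight + self.A @ self.B                     # effective weights: W + AB
        return nn.functional.linear(x, w, self.bias)

layer = LowRankAdaptedLinear(nn.Linear(768, 768), k=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 768 * 8 = 12288
```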
Pretraining decoders
• To use a pretrained decoder as a classifier, take its hidden states and put a linear layer on the last one:
• $h_1, \ldots, h_T = \text{Decoder}(w_1, \ldots, w_T)$
• $y \sim A h_T + b$
• where $A$ and $b$ are randomly initialized and specified by the downstream task (e.g., classifying the sentiment of “… the movie was …”).
• Gradients backpropagate through the whole network. [Note: the linear layer hasn’t been pretrained and must be learned from scratch.]
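A sketch of this setup, with an LSTM standing in for the pretrained decoder: the hidden states $h_1, \ldots, h_T$ come from the decoder, a randomly initialized linear layer $A, b$ maps $h_T$ to the label space, and gradients flow through the whole network:

```python
import torch
import torch.nn as nn

d_model, vocab_size, num_classes = 64, 1000, 2
emb = nn.Embedding(vocab_size, d_model)
decoder = nn.LSTM(d_model, d_model, batch_first=True)   # stand-in for a pretrained decoder
A = nn.Linear(d_model, num_classes)                     # randomly initialized, learned from scratch

tokens = torch.randint(0, vocab_size, (4, 12))          # e.g. "... the movie was ..."
hidden, _ = decoder(emb(tokens))                        # h_1, ..., h_T   (B, T, d)
logits = A(hidden[:, -1])                               # y ~ A h_T + b
loss = nn.functional.cross_entropy(logits, torch.tensor([1, 0, 1, 1]))
loss.backward()                                         # gradients reach the whole network
```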
Pretraining decoders
• It’s natural to pretrain decoders as language models and then use them as generators, finetuning their $p_\theta(w_t \mid w_{1:t-1})$.
• $h_1, \ldots, h_T = \text{Decoder}(w_1, \ldots, w_T)$; $w_t \sim A h_{t-1} + b$, where $A$ and $b$ were pretrained as part of the language model.
• This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)
• Example input format used by GPT for sentence-pair tasks: [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
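A sketch of using the finetuned decoder as a generator, producing $w_t$ from $A h_{t-1} + b$ one token at a time (greedy decoding shown); the embedding, LSTM decoder, and output layer below are untrained stand-ins:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
emb = nn.Embedding(vocab_size, d_model)
lstm = nn.LSTM(d_model, d_model, batch_first=True)   # stand-in pretrained decoder
output_layer = nn.Linear(d_model, vocab_size)        # A, b: pretrained with the LM

def generate(prefix_ids, max_new_tokens=10):
    """Greedy autoregressive generation: w_t = argmax over A h_{t-1} + b."""
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        x = emb(torch.tensor([ids]))                  # (1, T, d)
        h, _ = lstm(x)                                # h_1, ..., h_T
        next_id = output_layer(h[:, -1]).argmax(-1).item()
        ids.append(next_id)                           # feed the sampled token back in
    return ids

print(generate([1, 2, 3]))   # e.g. a dialogue history or document as the prefix
```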
GPT-3, in-context learning, and very large models
• Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
• GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters; GPT-3 has 175 billion parameters.
• The in-context examples seem to specify the task to be performed, and the conditional distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
“ thanks -> merci
hello -> bonjour
mint -> menthe
otter -> ”
Output (conditional generations):
loutre…”
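A sketch of how such a prompt is assembled: the in-context examples and the query are concatenated into one prefix string for the decoder; the `few_shot_prompt` helper is hypothetical, and a real model's sampling routine would generate the continuation:

```python
def few_shot_prompt(examples, query):
    """Concatenate in-context examples and the query into one decoder prefix."""
    lines = [f"{src} -> {tgt}" for src, tgt in examples]
    lines.append(f"{query} ->")          # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")], "otter")
print(prompt)
# The conditional generation after this prefix ("loutre ...") performs the task
# with no gradient updates.
```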
The prefix as task specification and scratch pad: chain-of-thought
GPT et al.
• Any one-directional (masked attention) model can do it; GPT (GPT-2, GPT-3, etc.) is a classic example of this.
[Figure: decoder-only architecture: a position-wise encoder, then N repeated blocks of masked self-attention followed by a position-wise nonlinear network, with a position-wise softmax on top.]
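The “masked self-attention” in the figure is what keeps the model one-directional: position t may only attend to positions ≤ t, implemented by adding -inf above the diagonal of the attention scores before the softmax. A minimal single-head sketch (names and shapes are illustrative):

```python
import torch

def causal_attention(q, k, v):
    """Single-head masked (causal) self-attention over (B, T, d) sequences."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (B, T, T) attention scores
    t = scores.size(-1)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))     # block attention to future positions
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 5, 16)
out = causal_attention(x, x, x)   # each position sees only itself and earlier positions
```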
Pretrained language models summary