Rui Feng
XLNET: GENERALIZED AUTOREGRESSIVE PRETRAINING FOR LANGUAGE UNDERSTANDING
ZHILIN YANG, ZIHANG DAI, YIMING YANG, JAIME CARBONELL, RUSLAN SALAKHUTDINOV, QUOC V. LE
AR and AE
Autoregressive (AR) language modeling:
An autoregressive model's output h_t at time t depends not just on x_t, but also on all x_s from previous time steps.
Examples:
GPT, ELMo (a simple combination of two AR models)
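As a rough illustration of the forward factorization an AR model maximizes, log p(x) = sum_t log p(x_t | x_<t), here is a minimal Python sketch; cond_prob is a hypothetical stand-in for a trained model such as GPT (a uniform toy distribution here so the snippet runs on its own):

import math

VOCAB = ["the", "cat", "sat", "down"]

def cond_prob(token, prefix):
    # Hypothetical model p(token | prefix); a real AR model would
    # condition on `prefix`, this toy distribution ignores it.
    return 1.0 / len(VOCAB)

def ar_log_likelihood(tokens):
    # Forward factorization: each token is predicted only from the
    # tokens to its left.
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(cond_prob(token, tokens[:t]))
    return total

print(ar_log_likelihood(["the", "cat", "sat", "down"]))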
AR and AE
Autoencoding (AE) language modeling:
The AE language model aims to reconstruct the original data from a corrupted input.
Example:
BERT
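For contrast, a minimal sketch of the AE setup in the spirit of BERT's masked-LM pretraining; the 15% masking rate is BERT's, but the corruption step is simplified (real BERT also replaces some selected tokens with random or unchanged tokens):

import random

def corrupt(tokens, mask_prob=0.15):
    # Replace some tokens with [MASK]; the model is trained to
    # reconstruct the original tokens at those positions.
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = corrupt(["the", "cat", "sat", "down"])
print(corrupted, targets)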
Pros and Cons of AR and AE
AR:
Advantages: good at generative NLP tasks. (When generating text, the context is usually processed in one direction, e.g. left to right.)
Disadvantages: it only models one direction (forward or backward).

AE:
Advantages: bi-directional. Downstream language understanding tasks often require bidirectional context information.
Disadvantages:
1. The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
2. It assumes the predicted (masked) tokens are independent of each other given the unmasked tokens.
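To make point 2 concrete, the two objectives can be contrasted as follows (notation roughly follows the XLNet paper: \hat{x} is the corrupted input, \bar{x} the set of masked tokens, and m_t = 1 iff x_t is masked):

BERT (AE, independence assumption):
\log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \, \log p_\theta(x_t \mid \hat{x})

AR (exact chain-rule factorization):
\log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})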
Can we combine the two methods so that we keep their advantages and avoid their disadvantages?
Yes!
XLNet, a generalized autoregressive
method that leverages the best of both AR
language modeling and AE while avoiding
their limitations.
XLNet
Recall: In AR methods, we maximize the likelihood in a fixed
forward or backward factorization order
Idea:
XLNet maximizes the expected log likelihood of a sequence
w.r.t. all possible permutations of the factorization order.
Permutations:
For example, we have a sentence with 4 tokens [x1 x2 x3 x4], and
we want to predict x3.
Then we have 4! (= 24) permutations:
[x1 x2 x3 x4], [x1 x3 x2 x4], [x1 x4 x3 x2], [x2 x1 x3 x4], ...
Every other token can appear before x3 in some permutation, so applying the forward maximum-likelihood objective across these orders lets the prediction of x3 condition on all tokens in the sentence.
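A small Python sketch of this counting argument, enumerating all 24 orders and listing the tokens that x3 would be conditioned on in each (names are illustrative):

from itertools import permutations

tokens = ["x1", "x2", "x3", "x4"]
for order in permutations(tokens):
    # Tokens that precede x3 in this factorization order form the
    # context x3 is predicted from.
    context = order[:order.index("x3")]
    print(order, "-> context for x3:", context)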
XLNet
Maximize the expected log-likelihood over all permutations of the factorization order
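Written out (following the paper's notation, where \mathcal{Z}_T is the set of all permutations of a length-T index sequence, and z_t, z_{<t} are the t-th element and the first t-1 elements of a permutation z):

\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right) \right]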
Sample a random factorization order: [x3, x2, x4, x1]
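A minimal sketch of this step, reusing the toy cond_prob idea from the AR sketch above (the model and probabilities are placeholders, not XLNet itself):

import math, random

VOCAB = ["x1", "x2", "x3", "x4"]

def cond_prob(token, context):
    # Hypothetical stand-in for p(token | context); uniform so it runs.
    return 1.0 / len(VOCAB)

def perm_log_likelihood(tokens, order):
    # Log-likelihood under one factorization order z:
    # sum_t log p(x_{z_t} | x_{z_<t})
    total = 0.0
    for t, idx in enumerate(order):
        context = [tokens[j] for j in order[:t]]
        total += math.log(cond_prob(tokens[idx], context))
    return total

tokens = ["x1", "x2", "x3", "x4"]
order = list(range(len(tokens)))
random.shuffle(order)  # e.g. [2, 1, 3, 0], i.e. [x3, x2, x4, x1]
print(order, perm_log_likelihood(tokens, order))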