Rui Feng


XLNET: GENERALIZED AUTOREGRESSIVE PRETRAINING FOR LANGUAGE UNDERSTANDING
ZHILIN YANG, ZIHANG DAI, YIMING YANG, JAIME CARBONELL, RUSLAN SALAKHUTDINOV, QUOC V. LE
AR and AE

Autoregressive (AR) language modeling:
Given a text sequence x = (x1, · · · , xT), AR language modeling factorizes the likelihood into a forward product.
An autoregressive model's output h_t at time t depends not just on x_t, but also on all x_s from previous time steps.

Examples:
GPT, ELMo (the simple combination of two AR models)
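Written out, the forward product mentioned above is the standard AR objective:

\max_{\theta} \; \log p_{\theta}(x) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})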
AR and AE

Autoencoding (AE) language modeling:
The AE language model aims to reconstruct the original data from corrupted input.
Corrupted input: the corrupted input means we use [MASK] to replace the original token.

Example:
BERT
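For comparison, BERT's reconstruction objective can be written as below, where \hat{x} is the corrupted (masked) input and m_t = 1 when position t is masked (this m_t indicator comes back on a later slide):

\max_{\theta} \; \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{x})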
Pros and Cons of AR and AE

AR:
Advantages: good at generative NLP tasks. (When generating context, it's usually forward.)
Disadvantages: it only concerns one direction (forward or backward).

AE:
Advantages: bi-directional. Downstream language understanding tasks often require bidirectional context information.
Disadvantages:
1. The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
2. It assumes the predicted (masked) tokens are independent of each other given the unmasked tokens.
Can we combine the 2 methods so that
we can take their pros and avoid their
cons?
Yes!
XLNet, a generalized autoregressive
method that leverages the best of both AR
language modeling and AE while avoiding
their limitations.
XLNet
Recall: in AR methods, we maximize the likelihood under a fixed forward or backward factorization order.

Idea:
XLNet maximizes the expected log likelihood of a sequence
w.r.t. all possible permutations of the factorization order.

Permutations:
For example, we have a sentence with 4 tokens [x1 x2 x3 x4], and we want to predict x3.
Then we have 4! (24) permutations:
[x1 x2 x3 x4], [x1 x3 x2 x4], [x1 x4 x3 x2], [x2 x1 x3 x4], …

Every other token can appear before x3 in some permutation, so if we apply the forward maximum-likelihood objective across permutations, the prediction of x3 ends up conditioning on all the other tokens in the sentence.
XLNet
Maximize the expected log-likelihood over a factorization order sampled from the set of all permutations.

Also, XLNet does not rely on data corruption
(which means there are no masks in XLNet).

FYI: for BERT, since BERT introduces masks, m_t indicates whether position t is masked.
Problems:

Two requirements are contradictory in a standard Transformer architecture:
1. To predict the token x_t, the model should only see the position of x_t, not the content of x_t.
2. To predict the token x_t, the model should encode the content of all tokens before x_t.
(In Transformers, word embedding and position information are combined into one hidden state, so the same representation of a token has to serve as "position only" when that token is predicted and as "full content" when later tokens are predicted.)

Solution: two-stream self-attention

The model will only encounter text sequences in the natural order during finetuning, which means we cannot change the sentences themselves; we have to implement the permutation inside the encoder (a sketch of the mask idea follows below).

Solution: attention mask
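A minimal sketch of the attention-mask idea (illustrative NumPy code, not the authors' implementation): the input keeps its natural order, and the sampled factorization order only determines which positions each token may attend to.

```python
import numpy as np

# Toy example: 4 tokens kept in natural order, with a sampled
# factorization order [x3, x2, x4, x1] (0-indexed: [2, 1, 3, 0]).
order = [2, 1, 3, 0]
T = len(order)

# step[i] = position of token i within the factorization order
step = np.empty(T, dtype=int)
for t, tok in enumerate(order):
    step[tok] = t

# Content-stream mask: token i may attend to token j if j comes no
# later than i in the factorization order (a token sees its own content).
content_mask = step[None, :] <= step[:, None]

# Query-stream mask: strictly earlier only, so x_t never sees its own content.
query_mask = step[None, :] < step[:, None]

print(content_mask.astype(int))  # row for x1 attends to all four tokens
print(query_mask.astype(int))    # row for x1 attends to x2, x3, x4 only
```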


Two-Stream Self-Attention

• Content stream attention: the standard self-attention in Transformers.
• Query stream attention: it is used for predicting x_t.
Original sequence order:
[x1, x2, x3, x4]

Sample a random factorization order:
[x3, x2, x4, x1]

Calculate content stream attention (here for x1, the last token in the factorization order):
KV = [h1, h2, h3, h4], Q = h1

Calculate query stream attention (for x1):
KV = [h2, h3, h4], Q = g1
Recall: the initial value of h_i is e(x_i), and the initial value of g_i is a shared learnable vector w.

In this graph, the other parts of the encoder are omitted; the actual model uses the same structure as Transformer-XL.

The prediction of x_t is made from the last layer's query representation g.
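A toy single-layer sketch of the two streams (illustrative NumPy code with made-up dimensions, omitting multi-head attention, residuals, positional encodings, and the Transformer-XL memory): h starts from the token embeddings e(x_i), g starts from a shared learnable vector w, and the two masks built above decide what each stream may attend to.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # 4 tokens, toy hidden size (made-up)
e = rng.normal(size=(T, d))      # stands in for the token embeddings e(x_i)
w = rng.normal(size=(d,))        # shared trainable start vector for the query stream

h = e.copy()                     # content stream: h_i = e(x_i)
g = np.tile(w, (T, 1))           # query stream:   g_i = w

def masked_attention(q, kv, mask):
    """Single-head attention where row i may only attend where mask[i, j] is True."""
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Masks for the factorization order [x3, x2, x4, x1], as in the earlier sketch.
step = np.array([3, 1, 0, 2])
content_mask = step[None, :] <= step[:, None]
query_mask = step[None, :] < step[:, None]

h_next = masked_attention(h, h, content_mask)  # content stream: itself + earlier tokens in z
g_next = masked_attention(g, h, query_mask)    # query stream: earlier tokens' content only

# The first token in the order (x3) has nothing to attend to in this toy example;
# in the full model it would attend to the Transformer-XL memory, so keep its start value.
visible = query_mask.any(axis=1)
g_next = np.where(visible[:, None], g_next, g)

# For x1 (last in the order) this matches the example above:
# content stream uses KV = [h1, h2, h3, h4] with Q = h1,
# query stream uses KV = [h2, h3, h4] with Q = g1.
```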
Partial prediction

Optimizing over all permutations is expensive, since N! is really large, and it causes slow convergence.

Formally, we split z into a non-target subsequence z≤c and a target subsequence z>c, where c is the cutting point, and we only predict the tokens after c.

Generally, only 1/K of the tokens are selected for prediction (K = 6 or 7 is recommended).
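With this split, the permutation objective on the earlier slide becomes:

\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \Big[ \log p_{\theta}(x_{z_{>c}} \mid x_{z_{\le c}}) \Big] = \mathbb{E}_{z \sim \mathcal{Z}_T} \Big[ \sum_{t=c+1}^{|z|} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \Big]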

Some methods from Transformer-XL are incorporated, such as the relative positional encoding scheme and the segment recurrence mechanism. (They are helpful for long sequences.)
Results:

XLNet outperforms BERT.


RoBERTa was released after XLNet, and it's hard to tell which one is better; XLNet may be better on reading comprehension tasks (especially for longer documents).
References:
https://arxiv.org/pdf/1906.08237.pdf
https://eigenfoo.xyz/deep-autoregressive-models/
https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335
