Rui Feng


XLNET: GENERALIZED AUTOREGRESSIVE PRETRAINING FOR LANGUAGE UNDERSTANDING
ZHILIN YANG, ZIHANG DAI, YIMING YANG, JAIME CARBONELL, RUSLAN SALAKHUTDINOV, QUOC V. LE
AR and AE

Autoregressive (AR) language modeling:
Given a text sequence x = (x1, · · · , xT), AR language modeling factorizes the likelihood into a forward product.
An autoregressive model's output h_t at time t depends not just on x_t, but also on all x_s from previous time steps.

Examples:
GPT, ELMo (the simple combination of two AR models)
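Written out, the forward product mentioned above is the standard AR objective:

\max_{\theta} \; \log p_{\theta}(x) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})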
AR and AE

Autoencoding (AE) language modeling:
The AE language model aims to reconstruct the original data from corrupted input.
Corrupted input: the corrupted input means we use [MASK] to replace the original token.

Example:
BERT
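For comparison, BERT's reconstruction objective can be written as below, where \hat{x} is the corrupted (masked) input and m_t = 1 when position t is masked (this m_t indicator comes back on a later slide):

\max_{\theta} \; \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{x})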
Pros and Cons of AR and AE

AR:
Advantages: good at generative NLP tasks. (When generating context, it's usually forward.)
Disadvantages: it only concerns one direction (forward or backward).

AE:
Advantages: bi-directional. Downstream language understanding tasks often require bidirectional context information.
Disadvantages:
1. The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
2. It assumes the predicted (masked) tokens are independent of each other given the unmasked tokens.
Can we combine the 2 methods so that
we can take their pros and avoid their
cons?
Yes!
XLNet, a generalized autoregressive
method that leverages the best of both AR
language modeling and AE while avoiding
their limitations.
XLNet
Recall: in AR methods, we maximize the likelihood under a fixed forward or backward factorization order.

Idea:
XLNet maximizes the expected log likelihood of a sequence
w.r.t. all possible permutations of the factorization order.

Permutations:
For example, we have a sentence with 4 tokens [x1 x2 x3 x4], and we want to predict x3.
Then we have 4! (24) permutations:
[x1 x2 x3 x4], [x1 x3 x2 x4], [x1 x4 x3 x2], [x2 x1 x3 x4], …

Every other token can appear before x3 in some permutation, so if we apply the forward maximum-likelihood objective across permutations, the prediction of x3 ends up conditioning on all the other tokens in the sentence.
XLNet
Maximize the expected log-likelihood over a factorization order sampled from the set of all permutations.

Also, XLNet does not rely on data corruption
(which means there are no masks in XLNet).

FYI: for BERT, since BERT introduces masks, m_t indicates whether position t is masked.
Problems:

Two requirements are contradictory in a standard Transformer architecture:
1. To predict the token x_t, the model should only see the position of x_t, not the content of x_t.
2. To predict the token x_t, the model should encode the content of all tokens before x_t.
(In Transformers, word embedding and position information are combined into one hidden state, so the same representation of a token has to serve as "position only" when that token is predicted and as "full content" when later tokens are predicted.)

Solution: two-stream self-attention

The model will only encounter text sequences in the natural order during finetuning, which means we cannot change the sentences themselves; we have to implement the permutation inside the encoder (a sketch of the mask idea follows below).

Solution: attention mask
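A minimal sketch of the attention-mask idea (illustrative NumPy code, not the authors' implementation): the input keeps its natural order, and the sampled factorization order only determines which positions each token may attend to.

```python
import numpy as np

# Toy example: 4 tokens kept in natural order, with a sampled
# factorization order [x3, x2, x4, x1] (0-indexed: [2, 1, 3, 0]).
order = [2, 1, 3, 0]
T = len(order)

# step[i] = position of token i within the factorization order
step = np.empty(T, dtype=int)
for t, tok in enumerate(order):
    step[tok] = t

# Content-stream mask: token i may attend to token j if j comes no
# later than i in the factorization order (a token sees its own content).
content_mask = step[None, :] <= step[:, None]

# Query-stream mask: strictly earlier only, so x_t never sees its own content.
query_mask = step[None, :] < step[:, None]

print(content_mask.astype(int))  # row for x1 attends to all four tokens
print(query_mask.astype(int))    # row for x1 attends to x2, x3, x4 only
```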


Two-Stream Self-Attention

• Content stream attention: the standard self-attention in Transformers.
• Query stream attention: it is used for predicting x_t.
Original sequence order:
[x1, x2, x3, x4]

Sample a random factorization order:
[x3, x2, x4, x1]

Calculate content stream attention (here for x1, the last token in the factorization order):
KV = [h1, h2, h3, h4], Q = h1

Calculate query stream attention (for x1):
KV = [h2, h3, h4], Q = g1
Recall: the initial value of h_i is e(x_i), and the initial value of g_i is a shared learnable vector w.

In this graph, the other parts of the encoder are omitted; the actual model uses the same structure as Transformer-XL.

The prediction of x_t is made from the last layer's query representation g.
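A toy single-layer sketch of the two streams (illustrative NumPy code with made-up dimensions, omitting multi-head attention, residuals, positional encodings, and the Transformer-XL memory): h starts from the token embeddings e(x_i), g starts from a shared learnable vector w, and the two masks built above decide what each stream may attend to.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # 4 tokens, toy hidden size (made-up)
e = rng.normal(size=(T, d))      # stands in for the token embeddings e(x_i)
w = rng.normal(size=(d,))        # shared trainable start vector for the query stream

h = e.copy()                     # content stream: h_i = e(x_i)
g = np.tile(w, (T, 1))           # query stream:   g_i = w

def masked_attention(q, kv, mask):
    """Single-head attention where row i may only attend where mask[i, j] is True."""
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Masks for the factorization order [x3, x2, x4, x1], as in the earlier sketch.
step = np.array([3, 1, 0, 2])
content_mask = step[None, :] <= step[:, None]
query_mask = step[None, :] < step[:, None]

h_next = masked_attention(h, h, content_mask)  # content stream: itself + earlier tokens in z
g_next = masked_attention(g, h, query_mask)    # query stream: earlier tokens' content only

# The first token in the order (x3) has nothing to attend to in this toy example;
# in the full model it would attend to the Transformer-XL memory, so keep its start value.
visible = query_mask.any(axis=1)
g_next = np.where(visible[:, None], g_next, g)

# For x1 (last in the order) this matches the example above:
# content stream uses KV = [h1, h2, h3, h4] with Q = h1,
# query stream uses KV = [h2, h3, h4] with Q = g1.
```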
Partial prediction

Optimizing over all permutations is expensive, since N! is really large, and it causes slow convergence.

Formally, we split z into a non-target subsequence z≤c and a target subsequence z>c, where c is the cutting point, and we only predict the tokens after c.

Generally, only 1/K of the tokens are selected for prediction (K = 6 or 7 is recommended).
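With this split, the permutation objective on the earlier slide becomes:

\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \Big[ \log p_{\theta}(x_{z_{>c}} \mid x_{z_{\le c}}) \Big] = \mathbb{E}_{z \sim \mathcal{Z}_T} \Big[ \sum_{t=c+1}^{|z|} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \Big]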

Some methods from Transformer-XL are incorporated, such as the relative positional encoding scheme and the segment recurrence mechanism. (They are helpful for long sequences.)
Results:

XLNet outperforms BERT.


RoBERTa was released after XLNet, and it's hard to tell which one is better; XLNet may be better on reading comprehension tasks (especially for longer documents).
References:
https://arxiv.org/pdf/1906.08237.pdf
https://eigenfoo.xyz/deep-autoregressive-models/
https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335
