
CSE 495

Natural Language Processing

Lecture 10
Attention is all you need
Self Attention
# Stefania Christina (2022) The Attention Mechanism from Scratch [Source code].
# https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

from numpy import array
from numpy import random
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)
Attention is all you need
• On the left, the inputs enter the encoder side of the
Transformer through an attention sublayer and a
feedforward sublayer.
• On the right, the target outputs go into the decoder
side of the Transformer through two attention
sublayers and a feedforward network sublayer.
• The attention mechanism is a “word to word”
operation.
• It is actually a token-to-token operation, but we will keep it
at the word level to keep the explanation simple.
• The attention mechanism will find how each word is
related to all other words in a sequence, including the
word being analyzed itself.
The Encoder Stack
• The original encoder layer structure remains the same for all N=6 layers of
the Transformer model.
• Each layer contains two main sublayers: a multi-headed attention
mechanism and a fully connected position-wise feedforward network.
• Notice that a residual connection surrounds each main sublayer,
sublayer(x), in the Transformer model. These connections transport the
unprocessed input x of a sublayer to a layer normalization function. This
way, we are certain that key information such as positional encoding is not
lost on the way. The normalized output of each layer is thus:
• LayerNormalization (x + Sublayer(x))
• Though the structure of each of the N=6 layers of the encoder is identical,
the content of each layer is not strictly identical to the previous layer.
• For example, the embedding sublayer is only present at the bottom level
of the stack. The other five layers do not contain an embedding layer, and
this guarantees that the encoded input is stable through all the layers.
The Encoder Stack
• The multi-head attention mechanisms perform the same functions from layer 1
to 6.
• But these layers do not learn the same things.
• Each layer adds more knowledge to the input it receives from the previous layer
by exploring different ways of associating the tokens in the sequence.
• It looks for various associations of words, just like we look for different associations of
letters and words when we solve a crossword puzzle.
• The output of every sublayer of the model has a constant dimension, including
the embedding layer and the residual connections. This dimension is dmodel and
can be set to another value depending on the goals. In the original Transformer
architecture, dmodel = 512.
• dmodel has a powerful consequence.
• We can add as many layers as we want or have GPUs available.
• This global view of the encoder shows the highly optimized architecture of the
Transformer.
Input Embedding
• The embedding sublayer works like other standard transduction
models. A tokenizer will transform a sentence into tokens. Each
tokenizer has its own method, such as BPE, WordPiece, or SentencePiece
tokenization.
• The goals are similar, and the choice depends on the strategy
chosen. For example, a tokenizer applied to the sequence "the
Transformer is an innovative NLP model!" will produce the following
tokens in one type of model:

• You will notice that this tokenizer normalized the string to lowercase
and split it into subword parts. A tokenizer will generally provide an
integer representation that will be used for the embedding process.
For example:
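A small sketch of how such tokens and their integer IDs can be obtained, assuming the Hugging Face transformers library and its bert-base-uncased WordPiece vocabulary (the exact subword pieces and IDs depend on that vocabulary):

from transformers import AutoTokenizer

# WordPiece tokenizer shipped with the (assumed) bert-base-uncased checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "the Transformer is an innovative NLP model!"

# the string is lowercased and split into subword pieces; out-of-vocabulary
# words are broken into pieces prefixed with ##
tokens = tokenizer.tokenize(sentence)
print(tokens)

# each token is mapped to an integer ID used by the embedding layer
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)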
Positional Encoding
• Positional embeddings are based on a simple, yet very effective idea: augment the token
embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is
characteristic for each position, the attention heads and feed-forward layers in each stack can
learn to incorporate positional information into their transformations.
• There are several ways to achieve this, and one of the most popular approaches is to use a
learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly
the same way as the token embeddings, but using the position index instead of the token ID as
input. With that approach, an efficient way of encoding the positions of tokens is learned
during pretraining.
• Absolute positional representations: Transformer models can use static patterns consisting of
modulated sine and cosine signals to encode the positions of the tokens. This works especially
well when there are not large volumes of data available.
• Relative positional representations: Although absolute positions are important, one can argue
that when computing an embedding, the surrounding tokens are most important. Relative
positional representations follow that intuition and encode the relative positions between
tokens. This cannot be set up by just introducing a new relative embedding layer at the
beginning, since the relative embedding changes for each token depending on where from the
sequence we are attending to it. Instead, the attention mechanism itself is modified with
additional terms that take the relative position between tokens into account. Models such as
DeBERTa use such representations.
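For the absolute variant, here is a minimal numpy sketch following the sine/cosine formulation of the original Transformer paper; the sequence length chosen below is an arbitrary illustration.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]               # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)    # (50, 512), added element-wise to the token embeddings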
Self Attention
• Attention is a mechanism that allows neural networks to assign a different
amount of weight or “attention” to each element in a sequence
• The main idea behind self-attention is that instead of using a fixed
embedding for each token, we can use the whole sequence to compute a
weighted average of each embedding.
• Another way to formulate this is to say that given a sequence of token
embeddings x1, ..., xn, self-attention produces a sequence of new
embeddings x1′, ..., xn′ where each x′i is a linear combination of all the xj:

• x′i = ∑j wji xj
• The coefficients wji are called attention weights and are normalized so that
∑j wji = 1 (see the short sketch below).
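As a minimal numpy illustration of this weighted-average view; the embeddings and weights below are made-up toy numbers.

import numpy as np

# token embeddings x1, ..., xn stacked as rows (n = 3, embedding dim = 4)
x = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 1.],
              [1., 1., 1., 0.]])

# attention weights (entry [i, j] plays the role of wji), each row summing to 1
w = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# x'_i = sum_j wji * xj  ->  every new embedding mixes the whole sequence
x_new = w @ x
print(x_new)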
Scaled Dot-Product Attention

• There are several ways to implement a self-attention layer, but the most common one is scaled
dot-product attention, from the paper introducing the Transformer architecture.
• There are four main steps required to implement this mechanism:
1. Project each token embedding into three vectors called query, key, and value.
2. Compute attention scores. We determine how much the query and key vectors relate to each other
using a similarity function. The similarity function for scaled dot-product attention is the dot product,
computed efficiently using matrix multiplication of the embeddings. Queries and keys that are similar
will have a large dot product, while those that don’t share much in common will have little to no
overlap. The outputs from this step are called the attention scores, and for a sequence with n input
tokens there is a corresponding n × n matrix of attention scores.
3. Compute attention weights. Dot products can in general produce arbitrarily large numbers, which can
destabilize the training process. To handle this, the attention scores are first multiplied by a scaling
factor to normalize their variance and then normalized with a softmax to ensure all the column values
sum to 1. The resulting n × n matrix now contains all the attention weights, wji.
4. Update the token embeddings. Once the attention weights are computed, we multiply them by the
value vectors v1, ..., vn to obtain an updated representation for each embedding: x′i = ∑j wji vj (see the numpy sketch below).
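These four steps map onto a few lines of numpy. The sketch below draws Q, K, and V at random instead of projecting real token embeddings, so it illustrates steps 2 to 4 only.

import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # step 2: attention scores, an n x n matrix of query-key dot products
    scores = Q @ K.T
    # step 3: scale by 1/sqrt(d_k), then softmax so the weights for each query sum to 1
    weights = softmax(scores / np.sqrt(d_k), axis=-1)
    # step 4: weighted sum of the value vectors
    return weights @ V, weights

# step 1 (the query/key/value projections) is skipped here for brevity
rng = np.random.default_rng(42)
n, d_k = 4, 8                       # 4 tokens, head dimension 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(outputs.shape)                # (4, 8)
print(weights.sum(axis=-1))         # the weights for each query sum to 1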
Key Query Value
• The notion of query, key, and value vectors may seem a bit cryptic the first
time.
• Their names were inspired by information retrieval systems, but we can
motivate their meaning with a simple analogy.
• Imagine that you’re at the supermarket buying all the ingredients you
need for your dinner.
• You have the dish’s recipe, and each of the required ingredients can be
thought of as a query.
• As you scan the shelves, you look at the labels (keys) and check whether
they match an ingredient on your list (similarity function). If you have a
match, then you take the item (value) from the shelf.
Multi-head Attention
• The Transformer uses multiple sets of linear
projections, each one representing a
so-called attention head.
• Why do we need more than one
attention head?
• The reason is that the softmax of one head
tends to focus on mostly one aspect of
similarity.
• Having several heads allows the model
to focus on several aspects at once.
• For instance, one head can focus on
subject-verb interaction, whereas
another finds nearby adjectives.
• We don’t handcraft these relations into
the model, and they are fully learned
from the data.
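A compact numpy sketch of this idea: d_model is split across h heads, each head runs scaled dot-product attention with its own projections, and the head outputs are concatenated and projected back to d_model. The random projection matrices below stand in for learned weights.

import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    # scaled dot-product attention for a single head
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8           # 4 tokens, model width 512, 8 heads
d_head = d_model // h               # each head works in a 64-dimensional space

x = rng.normal(size=(n, d_model))   # token representations entering the sublayer

head_outputs = []
for _ in range(h):
    # every head has its own (here random) query/key/value projections,
    # so every head can attend to a different kind of association
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    head_outputs.append(attention(x @ W_Q, x @ W_K, x @ W_V))

# concatenate the heads and project back to d_model with an output matrix
W_O = rng.normal(size=(d_model, d_model))
multi_head = np.concatenate(head_outputs, axis=-1) @ W_O
print(multi_head.shape)             # (4, 512)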
A Transformer Encoder: BERT

• BERT is based on the Transformer encoder


• Two versions of BERT
• BERT-BASE
• N=12, d=768, h=12, #parameters=110M
• BERT-LARGE
• N=24, d=1024, h=16, #parameters=340M
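As a point of reference, the two configurations can be written down with the Hugging Face transformers library's BertConfig (this assumes the library is available; N maps to num_hidden_layers, d to hidden_size, and h to num_attention_heads):

from transformers import BertConfig

# BERT-BASE: N=12 layers, d=768, h=12 heads (feedforward size 3072 = 4*d)
bert_base = BertConfig(hidden_size=768, num_hidden_layers=12,
                       num_attention_heads=12, intermediate_size=3072)

# BERT-LARGE: N=24 layers, d=1024, h=16 heads (feedforward size 4096 = 4*d)
bert_large = BertConfig(hidden_size=1024, num_hidden_layers=24,
                        num_attention_heads=16, intermediate_size=4096)

print(bert_base.num_hidden_layers, bert_base.hidden_size)     # 12 768
print(bert_large.num_hidden_layers, bert_large.hidden_size)   # 24 1024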
Parameter Count for BERT (1)
• Embedding Matrices:
• Word Embedding Matrix size [Vocabulary size, embedding dimension] = [30522,
768] = 23440896
• BERT uses the WordPiece tokenizer, which determines the vocabulary size (30522)
• The WordPiece tokenizer follows the subword tokenization scheme
• Consider the following sentence:
• "Let us start pretraining the model.”
• Now, if we tokenize the sentence using the WordPiece tokenizer, then we obtain the tokens
• tokens = [let, us, start, pre, ##train, ##ing, the, model]
• Position embedding matrix size, [Maximum sequence length, embedding
dimension] = [512, 768] = 393216
• Token Type Embedding matrix size [2, 768] = 1536
• Embedding Layer Normalization, weight and Bias [768] + [768] = 1536
• Total Embedding parameters = 23837184 ≈ 24M
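These numbers can be reproduced with a few lines of Python; the sizes are the ones quoted above (30522-token WordPiece vocabulary, 512 maximum positions, 768-dimensional embeddings):

vocab_size, max_len, d_model = 30522, 512, 768

word_embeddings = vocab_size * d_model        # 23,440,896
position_embeddings = max_len * d_model       # 393,216
token_type_embeddings = 2 * d_model           # 1,536
embedding_layer_norm = d_model + d_model      # weight + bias = 1,536

total_embedding = (word_embeddings + position_embeddings
                   + token_type_embeddings + embedding_layer_norm)
print(total_embedding)                        # 23,837,184, roughly 24M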
Parameter Count for BERT (2)
• Attention Head:
• Query Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
• Key Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
• Value Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
• Total parameters for one layer attention with 12 heads = 12∗(3 ∗(49152+768)) =
1797120
• After concatenation of the 12 heads (first fully connected layer) [768, 768] = 589824 and Bias [768]
= 768, (589824+768 = 590592)
• Then, layer Normalization weight and Bias [768], [768] = 1536
• Position-wise feedforward network (one hidden and one output layer), weight matrices and biases: [3072,
768] = 2359296, [3072] = 3072 and [768, 3072] = 2359296, [768] = 768, (2359296 + 3072 +
2359296 + 768 = 4722432)
• Another layer Normalization weight and Bias [768], [768] = 1536
• Total parameters for one complete attention layer (1797120 + 590592 + 1536 +
4722432 + 1536 = 7113216 ≈ 7M)
• Total parameters for 12 layers of attention (12 ∗ 7113216 = 85358592 ≈ 85M)
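The per-layer and 12-layer totals above can be checked the same way; this sketch follows the slide's convention of a [768, 64] weight plus a [768] bias for each of Q, K, V in every head:

d_model, d_head, heads, d_ff = 768, 64, 12, 3072

# query, key and value projections for all 12 heads (slide's convention:
# a [768, 64] weight plus a [768] bias for each of Q, K, V in every head)
qkv = heads * 3 * (d_model * d_head + d_model)              # 1,797,120

# output projection after concatenating the heads, plus its bias
concat_projection = d_model * d_model + d_model             # 590,592

# two layer normalizations (weight + bias each)
layer_norms = 2 * (d_model + d_model)                       # 3,072

# position-wise feedforward network: d_model -> d_ff -> d_model
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # 4,722,432

per_layer = qkv + concat_projection + layer_norms + ffn
print(per_layer)            # 7,113,216, roughly 7M
print(12 * per_layer)       # 85,358,592, roughly 85M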
Parameter Count for BERT (3)

• Output layer of BERT Encoder:


• Dense Weight Matrix and Bias [768, 768] = 589824, [768] = 768, (589824 + 768 =
590592)
• Total Parameters in BERT-BASE = 23837184 (embedding) + 85358592
(attention) + 590592 (classification) = 109786368 ≈ 110M
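Putting the three pieces together in Python, with an optional cross-check against a randomly initialized Hugging Face BertModel (this assumes transformers and PyTorch are installed; its exact total can differ slightly from the slide's count depending on how the per-head biases are accounted for):

embedding = 23_837_184
attention_layers = 85_358_592
output_dense = 768 * 768 + 768      # 590,592

total = embedding + attention_layers + output_dense
print(total)                        # 109,786,368, roughly 110M

# optional cross-check: count the parameters of a randomly initialized model,
# whose pooler plays the role of the dense output layer counted above
from transformers import BertConfig, BertModel
model = BertModel(BertConfig())     # default config corresponds to BERT-BASE
print(sum(p.numel() for p in model.parameters()))   # also close to 110M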
