Lecture 10
Attention is all you need
Self Attention
# Stefania Cristina (2022) The Attention Mechanism from Scratch [Source code].
# https://machinelearningmastery.com/the-attention-mechanism-from-scratch/
from numpy import array
from numpy import random
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices (random integers here, as in the source tutorial)
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V
• You will notice that this tokenizer normalized the string to lowercase and split it into subword parts. A tokenizer will generally provide an integer representation that will be used for the embedding process. For example:
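Below is a minimal sketch of that behaviour, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named on the slide): the uncased tokenizer lowercases the input, splits it into subword tokens, and maps every token to an integer ID that indexes the embedding table.

from transformers import AutoTokenizer

# load a WordPiece tokenizer that lowercases its input (assumed checkpoint)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Attention Is All You Need"
print(tokenizer.tokenize(text))      # lowercased subword tokens
print(tokenizer(text)["input_ids"])  # integer IDs (plus special tokens) for the embedding layer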
Positional Encoding
• Positional embeddings are based on a simple, yet very effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.
• There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but using the position index instead of the token ID as input. With that approach, an efficient way of encoding the positions of tokens is learned during pretraining.
• Absolute positional representations: Transformer models can use static patterns consisting of modulated sine and cosine signals to encode the positions of the tokens. This works especially well when there are not large volumes of data available (a small sketch follows this list).
• Relative positional representations: Although absolute positions are important, one can argue that when computing an embedding, the surrounding tokens are most important. Relative positional representations follow that intuition and encode the relative positions between tokens. This cannot be set up by just introducing a new relative embedding layer at the beginning, since the relative embedding changes for each token depending on where in the sequence we are attending to it from. Instead, the attention mechanism itself is modified with additional terms that take the relative position between tokens into account. Models such as DeBERTa use such representations.
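A small NumPy sketch of the sinusoidal (absolute) variant described above; the sequence length and model size are illustrative, and a learnable pattern would simply replace this fixed table with a trained embedding matrix indexed by position.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # one row per position; even columns hold sine signals and odd columns
    # cosine signals, with frequencies that decrease across the dimensions
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model // 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# the pattern is simply added to the token embeddings before the first layer
seq_len, d_model = 4, 8
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)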
Positional Encoding
Self Attention
• Attention is a mechanism that allows neural networks to assign a different
amount of weight or “attention” to each element in a sequence
• The main idea behind self-attention is that instead of using a fixed
embedding for each token, we can use the whole sequence to compute a
weighted average of each embedding.
• Another way to formulate this is to say that given a sequence of token embeddings x1, ..., xn, self-attention produces a sequence of new embeddings x1′, ..., xn′ where each x′i is a linear combination of all the xj:
x′i = ∑j wji xj
• The coefficients wji are called attention weights and are normalized so that ∑j wji = 1.
Scaled Dot-Product Attention
• There are several ways to implement a self-attention layer, but the most common one is scaled dot-
product attention, from the paper introducing the Transformer architecture.
• There are four main steps required to implement this mechanism:
1. Project each token embedding into three vectors called query, key, and value.
2. Compute attention scores. We determine how much the query and key vectors relate to each other
using a similarity function. The similarity function for scaled dot-product attention is the dot product,
computed efficiently using matrix multiplication of the embeddings. Queries and keys that are similar
will have a large dot product, while those that don’t share much in common will have little to no
overlap. The outputs from this step are called the attention scores, and for a sequence with n input
tokens there is a corresponding n × n matrix of attention scores.
3. Compute attention weights. Dot products can in general produce arbitrarily large numbers, which can
destabilize the training process. To handle this, the attention scores are first multiplied by a scaling
factor to normalize their variance and then normalized with a softmax to ensure all the column values
sum to 1. The resulting n × n matrix now contains all the attention weights, wji.
4. Update the token embeddings. Once the attention weights are computed, we multiply them by the value vectors v1, ..., vn to obtain an updated representation for embedding x′i = ∑j wji vj (a sketch of all four steps in code follows this list).
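The four steps can be written as one compact function; this is a sketch in the same NumPy/SciPy style as the earlier slide, with random embeddings and projection matrices standing in for learned ones.

import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    # 1. project each token embedding into query, key and value vectors
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # 2. attention scores: dot product of every query with every key (n x n)
    scores = Q @ K.T
    # 3. attention weights: scale by sqrt(d_k), then softmax so each row sums to 1
    weights = softmax(scores / np.sqrt(K.shape[1]), axis=-1)
    # 4. updated embeddings: weighted sum of the value vectors
    return weights @ V

# illustrative usage: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(scaled_dot_product_attention(X, W_Q, W_K, W_V).shape)  # (4, 8)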
Key Query Value
• The notion of query, key, and value vectors may seem a bit cryptic the first
time.
• Their names were inspired by information retrieval systems, but we can
motivate their meaning with a simple analogy.
• Imagine that you’re at the supermarket buying all the ingredients you need for your dinner.
• You have the dish’s recipe, and each of the required ingredients can be
thought of as a query.
• As you scan the shelves, you look at the labels (keys) and check whether
they match an ingredient on your list (similarity function). If you have a
match, then you take the item (value) from the shelf.
Multi-head Attention
• Transformer uses multiple sets of linear
projections, each one representing a so-
called attention head.
• Why do we need more than one
attention head?
• The reason is that the softmax of one head
tends to focus on mostly one aspect of
similarity.
• Having several heads allows the model
to focus on several aspects at once.
• For instance, one head can focus on
subject-verb interaction, whereas
another finds nearby adjectives.
• We don’t handcraft these relations into the model; they are learned entirely from the data (a small sketch follows).
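A minimal NumPy sketch of the idea: each head has its own projection matrices, the head outputs are concatenated, and a final output projection W_O mixes them (all matrices are random here and would be learned in a real model).

import numpy as np
from scipy.special import softmax

def attention_head(X, W_Q, W_K, W_V):
    # one head: scaled dot-product attention with its own projections
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return weights @ V

def multi_head_attention(X, heads, W_O):
    # run the heads independently, concatenate their outputs, then mix with W_O
    outputs = [attention_head(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

# illustrative usage: 4 tokens, model size 8, 2 heads of size 4
rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, heads, W_O).shape)  # (4, 8)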
A Transformer Encoder: BERT