Transformers
Transformers are a very recent family of architectures that have revolutionized fields like
natural language processing (NLP), image processing, and multi-modal generative AI.
Transformers were originally introduced in the field of NLP in 2017, as an approach
to process and understand human language. Human language is inherently sequential in
nature (e.g., characters form words, words form sentences, and sentences form paragraphs
and documents). Prior to the advent of the transformer architecture, recurrent neural networks (RNNs) briefly dominated the field for their ability to process sequential information (RNNs are described in Appendix C for reference). However, RNNs, like many other architectures, processed sequential information iteratively, whereby each item of a sequence was processed one after another. Transformers offer many advantages over RNNs, including their ability to process all items in a sequence in parallel (as CNNs do).
Like CNNs, transformers factorize the signal processing problem into stages that in-
volve independent and identically processed chunks. However, they also include layers
that mix information across the chunks, called attention layers, so that the full pipeline can
model dependencies between the chunks.
In this chapter, we describe transformers from the bottom up. We start with the idea of embeddings and tokens (Section 8.1), then describe the attention mechanism (Section 8.2), and finally assemble these ideas to arrive at the full transformer architecture (Section 8.3).
each vector being slightly different based on its context in the sentence and story at large.
To measure how similar any two word embeddings are (in terms of their numeric values), it is common to use cosine similarity as the metric:

$$\frac{u^\top v}{\|u\|\,\|v\|} = \cos \angle(u, v), \tag{8.1}$$

where $\|u\|$ and $\|v\|$ are the lengths of the vectors, and $\angle(u, v)$ is the angle between $u$ and $v$.
The cosine similarity is +1 when u = v, zero when the two vectors are perpendicular to
each other, and −1 when the two vectors are diametrically opposed to each other. Thus,
higher values correspond to vectors that are numerically more similar to each other.
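As a concrete illustration, here is a minimal sketch in Python/NumPy of computing cosine similarity between two embedding vectors (the function name and toy vectors are our own, not from any particular library):

    import numpy as np

    def cosine_similarity(u, v):
        # cos(angle between u and v) = (u . v) / (|u| |v|)
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([2.0, 4.0, 6.0])      # points in the same direction as u
    w = np.array([-1.0, -2.0, -3.0])   # diametrically opposed to u

    print(cosine_similarity(u, v))     # 1.0  (numerically very similar)
    print(cosine_similarity(u, w))     # -1.0 (numerically dissimilar)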
While word embeddings, and various approaches to creating them, have existed for decades, the first approach that produced astonishingly effective word embeddings was word2vec in 2013. This revolutionary approach was the first highly successful application of deep learning to NLP, and it enabled all subsequent progress in the field, including transformers. The details of word2vec are beyond the scope of this course, but we note two facts: (1) it created a single word embedding for each distinct word in the training corpus (not on a per-occurrence basis); (2) it produced word embeddings so useful that many relationships between the vectors corresponded to real-world semantic relatedness. For example, using Euclidean distance as the distance metric between two vectors, word2vec produced word embeddings with properties such as (where $v_{\text{word}}$ is the vector for the word "word"):

$$v_{\text{Paris}} - v_{\text{France}} \approx v_{\text{Rome}} - v_{\text{Italy}}.$$

This corresponds with the real-world property that Paris is to France what Rome is to Italy. This incredible finding held not only for geographic words but for all sorts of real-world concepts in the vocabulary. Nevertheless, to some extent, the exact values in each embedding are arbitrary; what matters most is the holistic relation between all embeddings, along with how performant/useful they are for the exact task that we care about.
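As a concrete (toy) illustration of the kind of vector relationship described above, the following sketch uses hand-picked three-dimensional vectors purely for illustration; real word2vec embeddings have hundreds of dimensions and are learned from data:

    import numpy as np

    # Hypothetical toy "embeddings", chosen by hand only to illustrate the analogy.
    v = {
        "Paris":  np.array([0.9, 0.8, 0.1]),
        "France": np.array([0.1, 0.7, 0.1]),
        "Rome":   np.array([0.8, 0.2, 0.7]),
        "Italy":  np.array([0.0, 0.1, 0.7]),
    }

    # The analogy holds when the two difference vectors are close in Euclidean distance.
    diff = (v["Paris"] - v["France"]) - (v["Rome"] - v["Italy"])
    print(np.linalg.norm(diff))   # small value -> Paris : France ~ Rome : Italy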
For example, an embedding may be considered good if it accurately captures the conditional probability for a given word to appear next in a sequence of words; you probably have a good idea of what words might typically fill in the blank at the end of a partially written sentence. Or a model could be built that tries to correctly predict words in the middle of sentences.
The model can be built by minimizing a loss function that penalizes incorrect word guesses and rewards correct ones. This is done by training the model on a very large corpus of written material, such as all of Wikipedia, or even all the accessible digitized written materials produced by humans.
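To make the idea of such a loss concrete, here is a minimal sketch of a cross-entropy loss for next-word prediction over a tiny, hypothetical vocabulary (the vocabulary, scores, and names are made up for illustration; they are not from any particular model):

    import numpy as np

    vocab = ["the", "cat", "sat", "mat"]
    scores = np.array([0.5, 2.0, 0.1, 1.0])   # model's unnormalized scores for the next word
    target = vocab.index("cat")               # the word that actually appears next

    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax -> predicted probabilities
    loss = -np.log(probs[target])             # small when the correct word gets high probability
    print(loss)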
While we will not dive into the full details of tokenization, the high-level idea is straight-
forward: the individual inputs of data that are represented and processed by a model are
referred to as tokens. And, instead of processing each word as a whole, words are typically
split into smaller, meaningful pieces (akin to syllables). Thus, when we refer to tokens,
know that we’re referring to each individual input, and that in practice, nowadays, they
tend to be sub-words (e.g., the word ‘talked’ may be split into two tokens, ‘talk’ and ‘ed’).
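The sketch below shows a toy greedy longest-match sub-word tokenizer; the vocabulary and algorithm are simplified stand-ins for real schemes such as byte-pair encoding, used only to illustrate how a word can be split into sub-word tokens:

    # Toy sub-word vocabulary, purely for illustration.
    vocab = {"talk", "ed", "walk", "ing", "s", "the", "cat"}

    def tokenize(word, vocab):
        tokens, start = [], 0
        while start < len(word):
            # Take the longest vocabulary entry that matches at this position.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # Unknown character: emit it as its own token.
                tokens.append(word[start])
                start += 1
        return tokens

    print(tokenize("talked", vocab))    # ['talk', 'ed']
    print(tokenize("walking", vocab))   # ['walk', 'ing']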
$$A = \begin{bmatrix}
\operatorname{softmax}\!\left(\left[\, q_1^\top k_1 \;\; q_1^\top k_2 \;\; \cdots \;\; q_1^\top k_{n_k} \,\right] / \sqrt{d_k}\right) \\[4pt]
\operatorname{softmax}\!\left(\left[\, q_2^\top k_1 \;\; q_2^\top k_2 \;\; \cdots \;\; q_2^\top k_{n_k} \,\right] / \sqrt{d_k}\right) \\[4pt]
\vdots \\[4pt]
\operatorname{softmax}\!\left(\left[\, q_{n_q}^\top k_1 \;\; q_{n_q}^\top k_2 \;\; \cdots \;\; q_{n_q}^\top k_{n_k} \,\right] / \sqrt{d_k}\right)
\end{bmatrix} \tag{8.4}$$
The scaling by $\sqrt{d_k}$ is done to reduce the magnitude of the dot products, which would otherwise grow undesirably large with increasing $d_k$, making (overall) training difficult.
Let $\alpha_{ij}$ be the entry in the $i$th row and $j$th column of the attention matrix $A$. Then $\alpha_{ij}$ helps answer the question "which tokens $x^{(j)}$ help the most with predicting the corresponding output token $y^{(i)}$?" The attention output is given by a weighted sum over the values:

$$y^{(i)} = \sum_{j=1}^{n_k} \alpha_{ij} v_j$$
Comparing this self-attention matrix with the attention matrix described in Equation
8.2, we notice the only difference lies in the dimensions: since in self-attention, the query,
key, and value all come from the same input, we have nq = nk = nv , and we often denote
all three with a unified n.
The self-attention output is then given by a weighted sum over the values:
$$y^{(i)} = \sum_{j=1}^{n} \alpha_{ij} v_j$$
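To make this computation concrete, here is a minimal NumPy sketch: the rows of the attention matrix are softmaxes of scaled dot products (Equation 8.4), and each output is a weighted sum of the values. The function and variable names are our own choices, not a reference implementation:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); each row is one token's embedding.
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # attention matrix, one softmax per row (Eq. 8.4)
        return A @ V                          # each output is a weighted sum of the values

    # Self-attention: queries, keys, and values all come from the same input X.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 4))               # 5 tokens, embedding dimension 4
    Y = attention(X, X, X)                    # output has the same size as the input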
The diagram below shows (only) the middle input token generating a query that is then
combined with the keys computed with all tokens to generate the attention weights via a
softmax. The output of the softmax is then combined with values computed from all to-
kens, to generate the attention output corresponding to the middle input token. Repeating
this for each input token then generates the output.
Study Question: We have five colored tokens in the diagram above (gray, blue, orange, green, red). Can you read off from the diagram the correspondence between each color and the input, query, key, value, and output?
Note that the size of the output is the same as the size of the input. Also, observe that there is no apparent notion of ordering of the input words in the depicted structure. Positional information can be added by encoding a number for each token (giving, say, the token's position relative to the start of the sequence) into the vector embedding of each token.
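One common way to do this, borrowed from the original transformer paper, is a sinusoidal positional encoding added to the token embeddings; the notes above only require that some position-dependent signal be added, so this is just one illustrative choice:

    import numpy as np

    def positional_encoding(n, d):
        # Sinusoidal encoding of positions 0..n-1 into d-dimensional vectors.
        pos = np.arange(n)[:, None]          # token positions
        dim = np.arange(d)[None, :]          # embedding dimensions
        angles = pos / np.power(10000.0, (2 * (dim // 2)) / d)
        return np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))   # shape (n, d)

    # Hypothetical token embeddings for a 6-token input with embedding dimension 8.
    X = np.zeros((6, 8))
    X_with_positions = X + positional_encoding(6, 8)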
And note that a given query need not pay attention to all other tokens in the input; in this
example, the token used for the query is not used for a key or value.
More generally, a mask may be applied to limit which tokens are used in the attention
computation. For example, one common mask limits the attention computation to tokens
that occur previously in time to the one being used for the query. This prevents the atten-
tion mechanism from “looking ahead” in scenarios where the transformer is being used to
generate one token at a time.
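A minimal sketch of such a causal ("look back only") mask is below: disallowed positions have their scores set to negative infinity before the softmax, so they receive zero attention weight. The names and shapes are our own, chosen for illustration:

    import numpy as np

    def masked_attention(Q, K, V, mask):
        # mask[i, j] is True where query i is allowed to attend to key j.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        scores = np.where(mask, scores, -np.inf)    # disallowed positions get zero weight
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)
        return A @ V

    n, d = 4, 3
    rng = np.random.default_rng(1)
    X = rng.normal(size=(n, d))
    causal = np.tril(np.ones((n, n), dtype=bool))   # token i may attend only to tokens j <= i
    Y = masked_attention(X, X, X, causal)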
Each self-attention stage is trained to have key, value, and query embeddings that lead
it to pay specific attention to some particular feature of the input. We generally want to
pay attention to many different kinds of features in the input; for example, in translation
one feature might be the verbs, and another might be objects or subjects. A transformer
utilizes multiple instances of self-attention, each known as an “attention head,” to allow
combinations of attention paid to many different features.
8.3 Transformers
A transformer is the composition of a number of transformer blocks, each of which has
multiple attention heads. At a very high level, the goal of a transformer block is to output a rich, useful representation for each input token, all in service of performing well on whatever task the model is trained to learn.
Rather than depicting the transformer graphically, it is worth returning to the beauty of
the underlying equations.¹
$$Q = X W_q, \qquad K = X W_k, \qquad V = X W_v$$

This $Q, K, V$ triple can then be used to produce one (self-)attention-layer output. One such layer is called an "attention head."
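As a minimal sketch of these projections feeding one attention head (random weights and arbitrary shapes, purely for illustration):

    import numpy as np

    # Illustrative sizes: n tokens, embedding dimension d, key/query dimension d_k.
    n, d, d_k = 5, 8, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, d))              # rows are the n input token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # Q = XWq, K = XWk, V = XWv

    # One attention head: softmax the rows of QK^T / sqrt(d_k), then weight the values.
    scores = Q @ K.T / np.sqrt(d_k)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    head_output = A @ V                      # shape (n, d_k)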
One can have more than one attention head, in which case the queries, keys, and values are embedded via per-head encoding matrices:

$$Q_h = X W_{h,q}, \qquad K_h = X W_{h,k}, \qquad V_h = X W_{h,v},$$

where $W_{h,q}, W_{h,k}, W_{h,v} \in \mathbb{R}^{d \times d_k}$, $d_k$ is the size of the key/query embedding space, and $h \in \{1, \cdots, H\}$ is an index over "attention heads"; for each attention head $h$, we learn one set of $W_{h,q}, W_{h,k}, W_{h,v}$. We then perform a weighted sum over all the outputs for each head:
$$u'^{(i)} = \sum_{h=1}^{H} W_{h,c}^\top \sum_{j=1}^{n} \alpha_{ij}^{(h)} V_j^{(h)}, \tag{8.9}$$

where $W_{h,c} \in \mathbb{R}^{d_k \times d}$, $u'^{(i)} \in \mathbb{R}^{d \times 1}$, and the indices $i \in \{1, \cdots, n\}$ and $j \in \{1, \cdots, n\}$ are integer indices over tokens. Here $V_j^{(h)}$ is the $d_k \times 1$ value embedding vector that corresponds to the input token $x_j$ for attention head $h$. This is then standardized and combined with $x^{(i)}$ using a LayerNorm function (defined below) to become

$$u^{(i)} = \text{LayerNorm}\!\left(x^{(i)} + u'^{(i)}; \gamma_1, \beta_1\right), \tag{8.10}$$

with parameters $\gamma_1, \beta_1 \in \mathbb{R}^d$.
¹ The presentation here follows the notes by John Thickstun.
To get the final output, we follow the "intermediate output then layer norm" recipe again. In particular, we first get the transformer block output $z'^{(i)}$ given by

$$z'^{(i)} = W_2^\top \,\text{ReLU}\!\left(W_1^\top u^{(i)}\right), \tag{8.11}$$

with weights $W_1 \in \mathbb{R}^{d \times m}$ and $W_2 \in \mathbb{R}^{m \times d}$. This is then standardized and combined with $u^{(i)}$ to give the final output $z^{(i)}$:

$$z^{(i)} = \text{LayerNorm}\!\left(u^{(i)} + z'^{(i)}; \gamma_2, \beta_2\right), \tag{8.12}$$

with parameters $\gamma_2, \beta_2 \in \mathbb{R}^d$. These vectors are then assembled (e.g., through parallel computation) to produce $z \in \mathbb{R}^{n \times d}$.
The LayerNorm function transforms a $d$-dimensional input $z$ with parameters $\gamma, \beta \in \mathbb{R}^d$ into

$$\text{LayerNorm}(z; \gamma, \beta) = \gamma\, \frac{z - \mu_z}{\sigma_z} + \beta, \tag{8.13}$$

where $\mu_z$ is the mean and $\sigma_z$ the standard deviation of $z$:

$$\mu_z = \frac{1}{d} \sum_{i=1}^{d} z_i \tag{8.14}$$

$$\sigma_z = \sqrt{\frac{1}{d} \sum_{i=1}^{d} (z_i - \mu_z)^2}. \tag{8.15}$$
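Putting Equations 8.9 through 8.15 together, the following is a minimal NumPy sketch of one transformer block applied to an input $X \in \mathbb{R}^{n \times d}$, with randomly initialized (untrained) weights. It is purely illustrative, not a reference implementation; the parameter names and the small eps added inside the LayerNorm for numerical stability are our own choices:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def layer_norm(z, gamma, beta, eps=1e-5):
        # Equations 8.13-8.15, applied to each row (token) of z.
        mu = z.mean(axis=-1, keepdims=True)
        sigma = z.std(axis=-1, keepdims=True)
        return gamma * (z - mu) / (sigma + eps) + beta

    def transformer_block(X, params):
        n, d = X.shape
        # Multi-head self-attention, summed over heads (Equation 8.9).
        u_prime = np.zeros((n, d))
        for Wq, Wk, Wv, Wc in params["heads"]:
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            A = softmax(Q @ K.T / np.sqrt(Wq.shape[1]))
            u_prime += (A @ V) @ Wc        # Wc is W_{h,c}, applied in row-vector form
        # Residual connection and LayerNorm (Equation 8.10).
        u = layer_norm(X + u_prime, params["gamma1"], params["beta1"])
        # Position-wise feed-forward network (Equation 8.11), in row-vector form.
        z_prime = np.maximum(0.0, u @ params["W1"]) @ params["W2"]
        # Residual connection and LayerNorm (Equation 8.12).
        return layer_norm(u + z_prime, params["gamma2"], params["beta2"])

    # Tiny example: n=4 tokens, d=8, H=2 heads, d_k=4, hidden width m=16.
    rng = np.random.default_rng(0)
    n, d, d_k, m, H = 4, 8, 4, 16, 2
    params = {
        "heads": [tuple(rng.normal(size=s) for s in [(d, d_k)] * 3 + [(d_k, d)])
                  for _ in range(H)],
        "gamma1": np.ones(d), "beta1": np.zeros(d),
        "gamma2": np.ones(d), "beta2": np.zeros(d),
        "W1": rng.normal(size=(d, m)), "W2": rng.normal(size=(m, d)),
    }
    Z = transformer_block(rng.normal(size=(n, d)), params)
    print(Z.shape)   # (4, 8) -- same size as the input, one row per token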
model was trained to predict the masked words. BERT was also trained on sequences of
sentences, where the model was trained to predict whether two sentences are likely to be
contextually close together or not. The pre-training stage is generally very expensive.
The second “fine-tuning” stage trains the model for a specific task, such as classification
or question answering. This training stage can be relatively inexpensive, but it generally
requires labeled data.