
BERT Architecture


Explain the BERT architecture
Transformer Encoder
The transformer encoder is the core architecture of BERT: a stack of identical layers (12 in BERT-base, 24 in BERT-large). Each layer contains two sublayers, multi-head self-attention and a position-wise feedforward network, each followed by a residual connection and layer normalization. The encoder processes all input tokens simultaneously, allowing the model to capture bidirectional relationships in text.

Example: In the sentence "The bank near the river flooded," the encoder ensures that the token "bank" is processed with full awareness of the entire sentence, using both the preceding ("The") and succeeding ("near the river flooded") tokens. This bidirectional context helps distinguish "bank" as "riverbank."
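
This layer structure can be sketched with PyTorch's built-in encoder modules. The sketch below only approximates BERT-base's dimensions (12 layers, hidden size 768, 12 attention heads, feedforward size 3072); it is not the actual pretrained model.

import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise FFN,
# each wrapped in a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size, as in BERT-base (illustrative)
    nhead=12,              # number of attention heads
    dim_feedforward=3072,  # inner FFN width
    activation="gelu",     # BERT uses GELU rather than ReLU
    batch_first=True,
)

# BERT-base stacks 12 identical layers.
encoder = nn.TransformerEncoder(layer, num_layers=12)

# A batch of 2 sequences with 16 token positions, already embedded.
dummy_embeddings = torch.randn(2, 16, 768)
contextual = encoder(dummy_embeddings)
print(contextual.shape)  # torch.Size([2, 16, 768])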
Multi-Head Self-Attention
Multi-head self-attention computes relationships between tokens using three matrices derived from each input token: Query (Q), Key (K), and Value (V). The attention scores are calculated with scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Multi-head attention applies this process several times in parallel (with different learned projection weights) to capture diverse aspects of the relationships between tokens, such as semantics, syntax, or positional nuances.

Example: In "She gave him the book because it was interesting," one attention head might focus on the link between "it" and "book," while another connects "gave" with "him." By aggregating insights from multiple heads, the model forms a richer understanding of the sentence.
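
A minimal sketch of the scaled dot-product computation for a single head is shown below, using PyTorch; the toy dimensions and weight matrices are made up for illustration and are not taken from BERT itself.

import math
import torch
import torch.nn.functional as F

def single_head_attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for one head.
    x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # attention distribution per token
    return weights @ V                                 # weighted sum of value vectors

# Toy dimensions: 5 tokens, model width 8, head width 4 (illustrative only).
torch.manual_seed(0)
x = torch.randn(5, 8)
W_q, W_k, W_v = (torch.randn(8, 4) for _ in range(3))
out = single_head_attention(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([5, 4])

# Multi-head attention runs several such heads with different learned projections
# and concatenates their outputs before a final linear layer.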
Feedforward Neural Network (FFN)
The FFN is applied to each token independently, after the attention mechanism. It consists of two linear layers with a non-linear activation in between; the original Transformer uses ReLU (max(0, ·)), while BERT uses GELU:

FFN(x) = GELU(xW1 + b1)W2 + b2

This layer increases the model's capacity to learn complex transformations and refine token representations.

Example: After attention resolves the context of "bank" as "riverbank," the FFN enriches the representation, ensuring that subtle nuances, such as associations with "river" or "flooded," are embedded in the token's final vector.
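
As a sketch, the position-wise FFN can be written as a small PyTorch module; the 768/3072 widths match BERT-base, but the module itself is only illustrative.

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers with GELU in between, applied to each token independently."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand: 768 -> 3072
        self.fc2 = nn.Linear(d_ff, d_model)   # project back: 3072 -> 768
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, seq_len, d_model); the same weights are applied at every position.
        return self.fc2(self.act(self.fc1(x)))

ffn = PositionwiseFFN()
tokens = torch.randn(2, 16, 768)
print(ffn(tokens).shape)  # torch.Size([2, 16, 768])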
Token Embeddings
BERT uses WordPiece tokenization, which splits rare or
compound words into smaller, more common
subwords. Each subword is assigned a vector, initialized
randomly and learned during training. These token
embeddings represent the input vocabulary.

Example: For the word "unbelievable," tokenization might produce ["un", "##believ", "##able"], each with its own embedding (the "##" prefix marks a piece that continues the previous subword). Each subword remains a separate token in the sequence; for word-level tasks, their contextual representations can later be pooled (e.g., averaged). This lets BERT generalize effectively to unseen or rare words.
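
WordPiece splits can be inspected with the Hugging Face transformers tokenizer, assuming that library and the bert-base-uncased checkpoint are available; the exact pieces depend on the checkpoint's vocabulary, so they may differ from the illustrative split above.

from transformers import BertTokenizer

# Loads the WordPiece vocabulary shipped with the pretrained checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unbelievable"))                 # subword pieces; "##" marks continuations
print(tokenizer.tokenize("The bank near the river flooded"))

# convert_tokens_to_ids maps each piece to its row in the token embedding matrix.
pieces = tokenizer.tokenize("riverbank")
print(tokenizer.convert_tokens_to_ids(pieces))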
Segment Embeddings
In sentence-pair tasks, segment embeddings are added to token embeddings to indicate which sentence a token belongs to. Segment A (e.g., the question) and Segment B (e.g., the answer) are represented by learned embeddings E_A and E_B. This helps BERT distinguish tokens from the two segments during training and inference.

Example: In a QA task with "Where is the bank?" (Segment A) and "It is near the river." (Segment B), segment embeddings ensure that "bank" and "river" are correctly linked, even though they occur in separate sentences.
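
The segment assignment is visible in the token_type_ids produced by the BERT tokenizer for a sentence pair; this snippet again assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing two texts builds the pair: [CLS] question [SEP] answer [SEP]
encoded = tokenizer("Where is the bank?", "It is near the river.")

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
for token, segment in zip(tokens, encoded["token_type_ids"]):
    # segment 0 -> Segment A embedding E_A, segment 1 -> Segment B embedding E_B
    print(f"{token:12s} segment {segment}")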
Position Embeddings
Transformers lack inherent positional awareness since they process input tokens in parallel. To address this, positional embeddings are added to the token embeddings. These embeddings are either learned (as in BERT) or computed from fixed sinusoidal functions (as in the original Transformer):

PE_(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE_(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This encodes position information into the embeddings, ensuring that word order influences the model's understanding.

Example: In "She is reading a book," position embeddings help distinguish "She is" from "is She." Without positional information, word order could become meaningless, and the model might misinterpret the sentence.
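
The sinusoidal variant can be computed directly from the two formulas above; the NumPy sketch below is illustrative, and BERT itself instead learns its position embeddings as ordinary parameters.

import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # even dimension indices
    angles = positions / np.power(10000, dims / d_model)    # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_position_embeddings(max_len=128, d_model=768)
print(pe.shape)  # (128, 768)
# These vectors are added to the token (and segment) embeddings before the encoder.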
Contextual Embeddings in Action
By combining all these components, BERT generates
contextual embeddings for each token. Unlike static
embeddings (e.g., Word2Vec or GloVe), these embeddings
adapt based on the token's context. For instance:
In “The bank near the river flooded,” the embedding for
"bank" encodes "riverbank."
In “She deposited money at the bank,” the embedding
for "bank" encodes "financial institution."
This dynamic adaptability underpins BERT’s success in a
wide range of NLP tasks.
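
This adaptability can be checked empirically by comparing the contextual vector of "bank" in the two sentences; the sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and the printed similarity value will vary.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river_bank = bank_vector("The bank near the river flooded.")
money_bank = bank_vector("She deposited money at the bank.")

similarity = torch.cosine_similarity(river_bank, money_bank, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")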
