
BERT Architecture


Explain the BERT architecture
Transformer Encoder
The transformer encoder is the core architecture of BERT: a stack of identical layers (12 in BERT-base, 24 in BERT-large). Each layer contains two sublayers, multi-head self-attention and a position-wise feedforward network, each followed by a residual connection and layer normalization. The encoder processes all input tokens simultaneously, allowing the model to capture bidirectional relationships in text.

Example: In the sentence "The bank near the river flooded," the encoder ensures that the token "bank" is processed with full awareness of the entire sentence, using both the preceding ("The") and succeeding ("near the river flooded") tokens. This bidirectional context helps distinguish "bank" as "riverbank."
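
This layer structure can be sketched with PyTorch's built-in encoder modules. The sketch below only approximates BERT-base's dimensions (12 layers, hidden size 768, 12 attention heads, feedforward size 3072); it is not the actual pretrained model.

import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise FFN,
# each wrapped in a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size, as in BERT-base (illustrative)
    nhead=12,              # number of attention heads
    dim_feedforward=3072,  # inner FFN width
    activation="gelu",     # BERT uses GELU rather than ReLU
    batch_first=True,
)

# BERT-base stacks 12 identical layers.
encoder = nn.TransformerEncoder(layer, num_layers=12)

# A batch of 2 sequences with 16 token positions, already embedded.
dummy_embeddings = torch.randn(2, 16, 768)
contextual = encoder(dummy_embeddings)
print(contextual.shape)  # torch.Size([2, 16, 768])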
Multi-Head Self-Attention
Multi-head self-attention computes relationships between tokens using three matrices derived from each input token: Query (Q), Key (K), and Value (V). The attention scores are calculated with scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Multi-head attention applies this process several times in parallel (with different learned projection weights) to capture diverse aspects of the relationships between tokens, such as semantics, syntax, or positional nuances.

Example: In "She gave him the book because it was interesting," one attention head might focus on the link between "it" and "book," while another connects "gave" with "him." By aggregating insights from multiple heads, the model forms a richer understanding of the sentence.
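
A minimal sketch of the scaled dot-product computation for a single head is shown below, using PyTorch; the toy dimensions and weight matrices are made up for illustration and are not taken from BERT itself.

import math
import torch
import torch.nn.functional as F

def single_head_attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for one head.
    x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # attention distribution per token
    return weights @ V                                 # weighted sum of value vectors

# Toy dimensions: 5 tokens, model width 8, head width 4 (illustrative only).
torch.manual_seed(0)
x = torch.randn(5, 8)
W_q, W_k, W_v = (torch.randn(8, 4) for _ in range(3))
out = single_head_attention(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([5, 4])

# Multi-head attention runs several such heads with different learned projections
# and concatenates their outputs before a final linear layer.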
Feedforward Neural Network (FFN)
The FFN is applied to each token independently, after the attention mechanism. It consists of two linear layers with a non-linear activation in between; the original Transformer uses ReLU (max(0, ·)), while BERT uses GELU:

FFN(x) = GELU(xW1 + b1)W2 + b2

This layer increases the model's capacity to learn complex transformations and refine token representations.

Example: After attention resolves the context of "bank" as "riverbank," the FFN enriches the representation, ensuring that subtle nuances, such as associations with "river" or "flooded," are embedded in the token's final vector.
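
As a sketch, the position-wise FFN can be written as a small PyTorch module; the 768/3072 widths match BERT-base, but the module itself is only illustrative.

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear layers with GELU in between, applied to each token independently."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand: 768 -> 3072
        self.fc2 = nn.Linear(d_ff, d_model)   # project back: 3072 -> 768
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, seq_len, d_model); the same weights are applied at every position.
        return self.fc2(self.act(self.fc1(x)))

ffn = PositionwiseFFN()
tokens = torch.randn(2, 16, 768)
print(ffn(tokens).shape)  # torch.Size([2, 16, 768])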
Token Embeddings
BERT uses WordPiece tokenization, which splits rare or
compound words into smaller, more common
subwords. Each subword is assigned a vector, initialized
randomly and learned during training. These token
embeddings represent the input vocabulary.

Example: For the word "unbelievable," tokenization might produce ["un", "##believ", "##able"], each with its own embedding (the "##" prefix marks a piece that continues the previous subword). Each subword remains a separate token in the sequence; for word-level tasks, their contextual representations can later be pooled (e.g., averaged). This lets BERT generalize effectively to unseen or rare words.
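
WordPiece splits can be inspected with the Hugging Face transformers tokenizer, assuming that library and the bert-base-uncased checkpoint are available; the exact pieces depend on the checkpoint's vocabulary, so they may differ from the illustrative split above.

from transformers import BertTokenizer

# Loads the WordPiece vocabulary shipped with the pretrained checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unbelievable"))                 # subword pieces; "##" marks continuations
print(tokenizer.tokenize("The bank near the river flooded"))

# convert_tokens_to_ids maps each piece to its row in the token embedding matrix.
pieces = tokenizer.tokenize("riverbank")
print(tokenizer.convert_tokens_to_ids(pieces))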
Segment Embeddings
In sentence-pair tasks, segment embeddings are added to token embeddings to indicate which sentence a token belongs to. Segment A (e.g., the question) and Segment B (e.g., the answer) are represented by learned embeddings E_A and E_B. This helps BERT distinguish tokens from the two segments during training and inference.

Example: In a QA task with "Where is the bank?" (Segment A) and "It is near the river." (Segment B), segment embeddings ensure that "bank" and "river" are correctly linked, even though they occur in separate sentences.
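
The segment assignment is visible in the token_type_ids produced by the BERT tokenizer for a sentence pair; this snippet again assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing two texts builds the pair: [CLS] question [SEP] answer [SEP]
encoded = tokenizer("Where is the bank?", "It is near the river.")

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
for token, segment in zip(tokens, encoded["token_type_ids"]):
    # segment 0 -> Segment A embedding E_A, segment 1 -> Segment B embedding E_B
    print(f"{token:12s} segment {segment}")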
Position Embeddings
Transformers lack inherent positional awareness since they process input tokens in parallel. To address this, positional embeddings are added to the token embeddings. These embeddings are either learned (as in BERT) or computed from fixed sinusoidal functions (as in the original Transformer):

PE_(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE_(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This encodes position information into the embeddings, ensuring that word order influences the model's understanding.

Example: In "She is reading a book," position embeddings help distinguish "She is" from "is She." Without positional information, word order could become meaningless, and the model might misinterpret the sentence.
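
The sinusoidal variant can be computed directly from the two formulas above; the NumPy sketch below is illustrative, and BERT itself instead learns its position embeddings as ordinary parameters.

import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # even dimension indices
    angles = positions / np.power(10000, dims / d_model)    # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_position_embeddings(max_len=128, d_model=768)
print(pe.shape)  # (128, 768)
# These vectors are added to the token (and segment) embeddings before the encoder.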
Contextual Embeddings in Action
By combining all these components, BERT generates
contextual embeddings for each token. Unlike static
embeddings (e.g., Word2Vec or GloVe), these embeddings
adapt based on the token's context. For instance:
In “The bank near the river flooded,” the embedding for
"bank" encodes "riverbank."
In “She deposited money at the bank,” the embedding
for "bank" encodes "financial institution."
This dynamic adaptability underpins BERT’s success in a
wide range of NLP tasks.
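
This adaptability can be checked empirically by comparing the contextual vector of "bank" in the two sentences; the sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and the printed similarity value will vary.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river_bank = bank_vector("The bank near the river flooded.")
money_bank = bank_vector("She deposited money at the bank.")

similarity = torch.cosine_similarity(river_bank, money_bank, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")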
