Key Components of Transformer Models
Name: Sajid Khursheed Bhat
Roll No: 2021A1R175
Semester: 8th Semester
Department: Computer Science and Engineering
Assignment Objectives:
The assignment aims to deepen students' understanding of sequence modeling and language
generation using modern generative AI techniques. It focuses on the architecture and operational
mechanisms of Transformer models, emphasizing components such as self-attention, multi-head
attention, and positional encoding. Students will analyze the limitations of RNNs, LSTMs, and GRUs in
handling long-term dependencies and compare these with Transformer-based models. The assignment
also explores the design and application of pre-trained models like GPT for conditional text generation.
Additionally, it covers evaluation frameworks for generative models, introducing key metrics like
BLEU, ROUGE, METEOR, and Perplexity to assess model output quality across different NLP tasks.
Assignment Questions:
Question 1: Explain the key components of the Transformer architecture, including
self-attention, multi-head attention, positional encoding, feed-forward layers, and
residual connections with layer normalization.
Ans) The Transformer architecture represents a significant shift in how natural language
processing tasks are performed. Proposed by Vaswani et al. in 2017, it removed the need for
recurrence or convolution, which were previously central to sequence modelling. The
Transformer is composed of an encoder-decoder structure, although models like BERT and
GPT typically use only one of these components. In a standard Transformer, the encoder
processes the input and the decoder generates the output. At the heart of this architecture is the
self-attention mechanism, which enables the model to weigh and relate all words in the input
sequence simultaneously. For each token, the model computes three vectors: Query (Q), Key
(K), and Value (V). The attention score is calculated as the dot product of the Query and Key,
scaled and passed through a softmax function to assign weights. These weights are applied to
the Value vectors to generate a weighted sum. This allows each word to "attend" to other words
based on relevance. For instance, in the sentence “The cat sat on the mat,” the word “sat” will
strongly attend to “cat” and “mat” to understand context.
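As a concrete illustration, the scaled dot-product attention described above can be sketched in a few lines of NumPy. This is a minimal sketch only: the random vectors below stand in for the learned Query, Key, and Value projections of a real trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of Value vectors

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one contextualized vector per token
```

Each row of `w` shows how strongly one token attends to every other token, which is exactly the "weighing and relating" behaviour described above.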
Multiple self-attention layers run in parallel (multi-head attention), allowing the model to learn
various semantic and syntactic relationships. After this, the output passes through a feed-
forward neural network which adds non-linearity and enables the model to learn complex
patterns. Each layer is wrapped with residual connections and layer normalization, which help
in gradient flow and training stability. However, since Transformers do not process input
sequentially like RNNs, they lack an inherent sense of order. This is where positional encoding
comes in. Positional encodings are vectors added to input embeddings, derived from sine and
cosine functions of different frequencies. These encodings help the model understand the
position of each word in a sentence. Without positional encodings, “I love dogs” and “Dogs
love I” would appear similar to the model. Thus, positional encoding is a key enabler for
preserving sequence information. Following the self-attention block is the position-wise
feed-forward layer, a fully connected network applied independently to each position. It
consists of two linear transformations with a ReLU (or sometimes GELU) activation function
in between. While self-attention lets the model exchange information between different
positions in the sequence, the feed-forward network provides a deeper transformation of each
position’s representation, helping the model generalize and learn complex patterns. The
positional encodings themselves may be fixed sinusoidal functions or learned positional
vectors, added element-wise to the word embeddings; the sinusoidal form has the advantage
of extrapolating to longer sequences, while learned positional embeddings may adapt better
to specific tasks. Another critical aspect of the
Transformer architecture is the use of residual connections and layer normalization. Each sub-
layer in the Transformer is wrapped with a residual connection followed by layer
normalization. This means the output of each sub-layer is added to its input, and then
normalized. These techniques help stabilize training, speed up convergence, and allow for the
stacking of many layers without the vanishing gradient problem that typically affects deep
neural networks. To illustrate this architecture in a simple way, a diagram showing the
Transformer encoder layer would be helpful. The diagram would include an input embedding
layer followed by positional encoding, then a self-attention block, followed by a feed-forward
layer, all wrapped in residual connections and layer normalization. For a full Transformer
model, stacking several such layers leads to a powerful architecture capable of handling
complex language tasks.
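Putting these pieces together, one encoder layer can be sketched in NumPy as follows. This is a simplified illustration: the weights are random and untrained, the layer norm omits its learnable gain and bias, and a single attention head stands in for multi-head attention.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    # Self-attention sub-layer, wrapped in a residual connection + layer norm
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn)                 # Add & Norm
    # Position-wise feed-forward: two linear maps with a ReLU in between
    ff = np.maximum(0.0, x @ W1) @ W2
    return layer_norm(x + ff)                # Add & Norm

rng = np.random.default_rng(0)
seq_len, d, d_ff = 5, 8, 32
# Positional encodings are added element-wise to the (here random) embeddings
x = rng.standard_normal((seq_len, d)) + positional_encoding(seq_len, d)
shapes = [(d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]
y = encoder_layer(x, *[rng.standard_normal(s) * 0.1 for s in shapes])
print(y.shape)  # output keeps the input shape, so layers can be stacked
```

Because the output has the same shape as the input, several such layers can be stacked, which is exactly how the full encoder is built.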
Question 2: Compare RNN, LSTM, and GRU in terms of their ability to handle
long-term dependencies. Where do they fall short in modelling context in language
tasks?
Ans) Recurrent Neural Networks (RNNs) represent a foundational class of neural architectures
designed for modelling sequential data, such as natural language text, speech signals, or
financial time series. Unlike feedforward neural networks that treat each input independently,
RNNs are inherently designed to capture temporal dependencies by maintaining a hidden state
vector that evolves with each time step. At every step, the hidden state integrates new input
data with previously retained context, allowing the network to theoretically remember
arbitrarily long sequences. This makes RNNs particularly suitable for tasks where the order and
context of data points matter, such as language modelling, sequence labelling, and time-series
forecasting. However, RNNs face significant challenges during training, primarily due to the
vanishing and exploding gradient problems encountered during backpropagation through
time (BPTT). In long sequences, the gradients used to update network weights can either
decay exponentially (vanish) or grow uncontrollably (explode), making it extremely difficult
for standard RNNs to learn long-range dependencies. This limitation becomes critical in
domains like natural language processing (NLP), where understanding a word's meaning often
depends on context many steps prior. To address these shortcomings, Long Short-Term
Memory (LSTM) networks were introduced by Hochreiter and Schmidhuber in 1997. LSTMs
enhance the standard RNN architecture by incorporating a memory cell and a gating
mechanism that regulates the flow of information. Specifically, LSTMs introduce three gates—
input, forget, and output—each of which performs a learned, element-wise operation to either
store, discard, or expose information. The forget gate selectively removes information from
the previous cell state, the input gate determines what new information to encode, and the
output gate decides what part of the current cell state should be exposed as output. This design
enables LSTMs to maintain stable gradients over long sequences, thereby effectively learning
long-term dependencies and mitigating the vanishing gradient issue. A more computationally
efficient alternative to LSTMs is the Gated Recurrent Unit (GRU), introduced by Cho et al.
GRUs simplify the LSTM architecture by combining the forget and input gates into a single
update gate, and replacing the output gate with a reset gate. The update gate balances the
retention of past information against the incorporation of new inputs, while the reset gate
controls how much of the previous hidden state is forgotten. This streamlined architecture
requires fewer parameters and computations, which makes GRUs attractive in scenarios with
limited computational resources or smaller datasets, while still preserving the capacity to model
dependencies over moderate-length sequences. Despite these improvements, both LSTMs and
GRUs inherently operate in a sequential manner, meaning they process inputs one time step at
a time. This sequential dependency impedes parallelization during training and inference,
making them slower and less scalable compared to modern architectures like Transformers,
which leverage self-attention mechanisms to process sequences in parallel. Furthermore, RNN-
based models are still susceptible to memory compression—a phenomenon where information
from long sequences is compressed into a fixed-size hidden state. This often leads to the loss of
fine-grained contextual nuances, especially in tasks requiring the integration of distant semantic
information.
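The GRU update described above, and the step-by-step loop that blocks parallelism over time, can be sketched as follows. Biases are omitted for brevity and the weights are random and untrained; the point is the gate arithmetic and the forced sequential recurrence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step (biases omitted)."""
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate: old vs new
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate: how much history
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # interpolate old and candidate

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
params = [rng.standard_normal(s) * 0.1 for s in [(d_in, d_h), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for t in range(10):   # note the unavoidable sequential loop: no parallelism over t
    h = gru_step(rng.standard_normal(d_in), h, *params)
print(h.shape)  # (6,): all history is compressed into this fixed-size vector
```

The final `print` makes the memory-compression point concrete: however long the sequence, everything the model remembers must fit into one fixed-size hidden vector.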
While RNNs, LSTMs, and GRUs marked significant advancements in sequence modelling, their
architectural constraints (step-by-step processing, limited ability to capture very
long-range dependencies, and memory compression into a fixed-size hidden state) have
spurred the development of more efficient and context-aware models such as Transformers,
which are now considered state-of-the-art in many sequence-processing tasks.
Fig: LSTM
Question 3: Analyse and summarize how GPT and BERT differ in their model
architecture and training objectives. Highlight practical use cases where each
excels.
Ans) GPT and BERT are derived from the Transformer architecture but are built and trained
with fundamentally different objectives and components. GPT (Generative Pre-trained
Transformer) is a decoder-only architecture trained in an autoregressive manner. This means it
learns to predict the next word in a sentence given the previous words. It processes data from
left to right, making it well-suited for tasks involving generation, such as text completion or
creative writing.
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder part
of the Transformer. It is trained using masked language modelling, where random tokens in a
sentence are masked, and the model learns to predict them using the context from both left and
right.
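The core architectural difference, GPT's causal (left-to-right) attention versus BERT's bidirectional attention, can be illustrated with toy attention masks. This is a schematic sketch, not either model's actual implementation.

```python
import numpy as np

seq_len = 5
# GPT-style causal mask: position i may attend only to positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
# BERT-style full visibility: every position attends to every position
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# Applying the causal mask to a matrix of attention scores: future positions
# are set to -inf so they receive zero weight after softmax.
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))
masked_scores = np.where(causal, scores, -np.inf)
print(masked_scores[0])  # only the first entry is finite for the first token
```

This masking is what makes GPT autoregressive (it cannot peek at future tokens) while BERT's unmasked attention lets every token use context from both directions.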
These architectural differences translate directly into their most effective use cases. GPT
models excel in creative and generative tasks, such as text completion, story and dialogue
generation, and other completion-based tasks. Because GPT is autoregressive and trained to
predict the next token, it is inherently flexible in tasks where text needs to be generated
dynamically based on a given prompt. BERT, on the other hand, shines in structured,
discriminative tasks where the goal is to understand or classify text rather than generate
it. Prominent applications include text classification (e.g., sentiment analysis), named
entity recognition, and extractive question answering.
The training objectives further reinforce their roles. GPT is trained using a causal language
modelling (CLM) objective, predicting the next word in a sequence given all previous words.
This enables GPT to learn sequential dependencies and generate fluent text.
BERT, by contrast, uses a masked language modelling (MLM) objective, where random
tokens in the input are masked and the model learns to predict them. Additionally, it uses a
next sentence prediction (NSP) task to improve understanding of sentence-level relationships.
This bidirectional training allows BERT to generate contextual embeddings that capture fine-
grained semantic relationships.
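A toy sketch of the masking step behind MLM is shown below. This is a simplification: real BERT selects about 15% of tokens and replaces some selections with random or unchanged tokens rather than always using the mask symbol.

```python
import random

random.seed(0)

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a fraction of tokens; the model must predict the hidden originals."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # ground truth the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
print(masked)
print(targets)
```

During pre-training, BERT sees the masked sequence and is trained to predict each entry of `targets` using context from both sides of the gap.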
As a result, GPT generalizes well in open-ended scenarios, especially in later iterations like
GPT-3 and GPT-4, which leverage few-shot, one-shot, and zero-shot learning paradigms
through prompt engineering. BERT, on the other hand, excels in precision tasks where
structured input and output formats are well-defined and where labelled data is available for
fine-tuning.
Another significant distinction lies in how these models are adapted to downstream tasks.
BERT is typically fine-tuned with the addition of task-specific heads. For instance, a
classification head can be attached for sentiment analysis, a token-level head for named
entity recognition, or a span-prediction head for extractive question answering.
This modular fine-tuning strategy, combined with BERT’s robust embeddings, enables high
accuracy across a variety of supervised tasks, even with relatively small labelled datasets.
GPT, particularly in its later versions, is often used in a few-shot or zero-shot setting without
explicit fine-tuning. This is made possible by prompt engineering, where the task instructions
and a few demonstrations are embedded directly into the prompt. For example, the prompt
“Translate 'Hello' to French: Bonjour. Translate 'Thank you' to French:” will naturally
lead GPT to generate “Merci” based on its pretrained knowledge.
To bridge the gap between these two paradigms, hybrid models like T5 (Text-to-Text
Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformer) have
been proposed. These models combine the strengths of both encoder and decoder
architectures: T5 casts every task as a text-to-text problem, while BART pairs a
bidirectional encoder with an autoregressive decoder and is pre-trained as a denoising
autoencoder. Such models reflect a broader shift toward multifunctional architectures that
can flexibly switch between comprehension and generation based on task formulation.
Question 4: Explain conditional text generation with pre-trained models, outlining the
pipeline from prompt construction to decoding and evaluation.
Ans) Conditional text generation refers to the computational paradigm in which a language model
generates coherent, contextually relevant, and goal-directed text in response to an explicit input
—termed the conditioning signal. Unlike unconditional generation, where models produce
text from random or generic seeds, conditional generation is context-sensitive, driven by
prompts, structured queries, or natural language instructions. These conditions act as semantic
anchors, guiding the model toward producing output aligned with the user's intent or the task
at hand.
The pipeline for conditional generation unfolds through several distinct yet interconnected
stages:
1. Task Identification and Prompt Construction: The first step involves defining the
task and the desired form of the output—be it a dialogue turn, a summary, a narrative, or
an email draft. The conditioning input could range from structured data (e.g., tables,
code snippets) to unstructured language (e.g., questions, descriptions, or partial
sentences). This input is then transformed into a natural language prompt that clearly
communicates the intent to the model. The effectiveness of prompt engineering is
critical here. A vague instruction like "Write" lacks semantic guidance, whereas a
structured prompt such as "Story Prompt: A dragon guards a village. Continue the
story:" is rich in context, purpose, and tone, thereby enhancing the model’s ability to
generate relevant and stylistically appropriate text.
2. Tokenization and Numerical Encoding: Once the prompt is finalized, it undergoes
tokenization, where it is segmented into subword units or tokens. These are then
mapped to numerical token IDs using the model’s vocabulary embedding. This
transformation into a dense numerical representation is essential, as transformer models
operate on vectorized inputs within high-dimensional spaces.
3. Autoregressive Text Generation: The input tokens are fed into the transformer’s
encoder-decoder (e.g., T5, BART) or decoder-only (e.g., GPT) architecture, where the
generation process is initiated. This process is autoregressive: the model predicts the
next token based on the previously generated tokens, recursively building the sequence
one token at a time. The self-attention mechanism enables the model to contextualize
each new token with respect to the entire history of preceding tokens, capturing both
local coherence and long-range dependencies.
4. Decoding Strategies: The quality, coherence, and stylistic diversity of the generated
text are heavily influenced by the choice of decoding algorithm:
o Greedy Decoding: Selects the token with the highest probability at each
timestep. While computationally efficient, it often leads to deterministic and
repetitive outputs.
o Beam Search: Explores multiple candidate sequences simultaneously, selecting
the one with the highest cumulative probability. This yields more grammatical
and coherent results, though it may reduce output variability.
o Top-k Sampling: Restricts the sampling pool to the top k most probable tokens
at each step, introducing controlled randomness and creativity.
o Top-p (Nucleus) Sampling: Dynamically selects the smallest subset of tokens
whose cumulative probability exceeds a threshold p. This strategy balances
fluency and diversity and is generally preferred for open-domain text
generation.
o Temperature Scaling: Adjusts the sharpness of the probability distribution. A
high temperature (e.g., >1.0) encourages creative, less predictable outputs,
while a low temperature (e.g., 0.3–0.6) promotes more conservative,
deterministic responses.
5. Detokenization and Post-processing: After the token sequence is generated, it is
detokenized—converted back from token IDs into human-readable text. Post-
processing may involve formatting, correcting punctuation, removing special tokens, or
applying task-specific refinements to ensure the final output is coherent, polished, and
presentation-ready.
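The decoding strategies listed in step 4 can be sketched over a hypothetical five-word vocabulary; the logits below are made up for illustration and both sampling functions accept a temperature parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat", "dog"]
logits = np.array([2.0, 1.5, 0.5, 0.2, -1.0])  # toy next-token scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy(logits):
    # Always pick the single most probable token (deterministic)
    return vocab[int(np.argmax(logits))]

def sample_top_k(logits, k=2, temperature=1.0):
    # Keep only the k highest-scoring tokens, rescale, then sample
    idx = np.argsort(logits)[-k:]
    p = softmax(logits[idx] / temperature)
    return vocab[int(rng.choice(idx, p=p))]

def sample_top_p(logits, p=0.9, temperature=1.0):
    # Smallest set of tokens whose cumulative probability exceeds p
    probs = softmax(logits / temperature)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]
    q = probs[keep] / probs[keep].sum()
    return vocab[int(rng.choice(keep, p=q))]

print(greedy(logits))        # always "the"
print(sample_top_k(logits))  # "the" or "cat"
print(sample_top_p(logits, p=0.9))
```

Raising the temperature flattens the distribution before sampling (more creative output); lowering it sharpens the distribution toward the greedy choice.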
In production environments and research settings, the generated output is evaluated using a
combination of automatic metrics and human evaluation. Traditional evaluation metrics
such as BLEU, ROUGE, and METEOR measure syntactic overlap and are useful for tasks
like summarization or translation. However, they fall short in capturing semantic equivalence
and discourse-level coherence. As a result, more semantic-aware metrics like BERTScore
and BLEURT are increasingly preferred. These leverage contextual embeddings and human-
labelled datasets, respectively, to assess the meaning, factual correctness, and linguistic
quality of generated text. Nevertheless, human-in-the-loop evaluation remains the gold
standard for assessing pragmatic suitability, user satisfaction, and domain-specific
appropriateness.
A well-constructed prompt might look like: “Email: I would like to schedule a meeting next
week regarding the product update. Reply:” Given this prompt, the model could generate:
“Thank you for reaching out. I’d be happy to schedule a meeting. Please share your
availability.”
This response is not only grammatically fluent but also contextually appropriate and
professionally courteous, demonstrating the efficacy of conditional generation in real-world,
goal-oriented text production.
Question 5: List and explain evaluation metrics used for generative text models
(BLEU, ROUGE, METEOR, Perplexity). Which metric would you choose for
summarization vs. dialogue generation tasks and why?
A suite of classical evaluation metrics has been developed to provide quantitative proxies for
textual quality, although they predominantly rely on surface-level n-gram overlaps between
the generated output and one or more human-authored reference texts.
BLEU (Bilingual Evaluation Understudy) is one of the earliest and most widely
adopted metrics, particularly in the domain of machine translation. It employs a
modified n-gram precision, comparing n-grams of the candidate against one or more
references, combined with a brevity penalty that discourages overly short outputs.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) instead emphasizes recall,
measuring how much of the reference content is covered by the candidate, which makes it
the de facto standard for summarization. METEOR extends n-gram matching with stemming
and synonym matching and balances precision with recall, while Perplexity measures how
well a language model predicts a held-out sequence, with lower values indicating a better
fit. Despite its broader applicability, ROUGE still exhibits a bias toward lexical
similarity, often underrepresenting the semantic equivalence between text spans that
diverge in surface form but not in meaning.
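Minimal sketches of two of these metrics are given below: a BLEU-style clipped unigram precision with a brevity penalty, and perplexity computed from per-token probabilities. These are toy versions; real BLEU combines precisions up to 4-grams with a geometric mean.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Toy BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

print(bleu1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(perplexity([0.25, 0.25, 0.25, 0.25]))                       # 4.0
```

A model that assigns each token probability 0.25 is, on average, as uncertain as a uniform choice among four options, which is exactly what the perplexity of 4.0 expresses.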
While metrics such as BLEU, ROUGE, and METEOR provide a baseline for automated
evaluation, they fall short when confronted with tasks that demand semantic depth, pragmatic
consistency, and contextual nuance. These shortcomings have catalysed the development of
semantic-aware evaluation metrics that exploit the representational power of pre-trained deep
learning models.
Both BERTScore and BLEURT are computationally intensive, often requiring GPU
acceleration for efficient evaluation. Their reliance on deep contextual encoders also makes
them less interpretable than simpler metrics, although their alignment with human perception
often justifies the computational overhead in research and production environments.
Given the diverse nature of generative tasks, metric selection should be tailored to task-
specific objectives:
o For summarization, metrics like ROUGE remain a staple due to their emphasis on
content coverage. However, when deeper semantic evaluation is needed, BERTScore
and BLEURT provide richer, more faithful assessments of the summary’s
informativeness and coherence.
o In dialogue generation, traditional metrics often fail to capture subtleties such as
coherence across turns, engagement, or contextual relevance. Here, METEOR may
offer moderate gains, but BLEURT or human evaluations are often preferred due to
their ability to reflect real conversational quality, including the appropriateness of tone,
factual alignment, and user intent.