Key Components of Transformer Models

ASSIGNMENT II (MST II)

Name
Sajid Khursheed Bhat
Roll No
2021A1R175
Semester
8th Semester
Department
Computer Science and Engineering

Model Institute of Engineering & Technology (Autonomous)


(Permanently Affiliated to the University of Jammu, Accredited by NAAC with “A” Grade)
Jammu, India
2024
Assignment: COM-801 MST II

ASSIGNMENT II (MST II)


Subject Code: COM-801 (Generative AI)
Due Date: 22 May 2025

Question Number    Course Outcomes    Blooms’ Level    Maximum Marks    Marks Obtained
Q1                 CO 4               Understanding    4
Q2                 CO 4               Analysing        4
Q3                 CO 5               Evaluating       4
Q4                 CO 5               Creating         4
Q5                 CO 5               Evaluating       4
Total Marks                                            20
Faculty Signature
Email:- [Link]@[Link]

Assignment Objectives:
The assignment aims to deepen students' understanding of sequence modeling and language
generation using modern generative AI techniques. It focuses on the architecture and operational
mechanisms of Transformer models, emphasizing components such as self-attention, multi-head
attention, and positional encoding. Students will analyze the limitations of RNNs, LSTMs, and GRUs in
handling long-term dependencies and compare these with Transformer-based models. The assignment
also explores the design and application of pre-trained models like GPT for conditional text generation.
Additionally, it covers evaluation frameworks for generative models, introducing key metrics like
BLEU, ROUGE, METEOR, and Perplexity to assess model output quality across different NLP tasks.

Assignment Questions:

Q. No. / Questions / BL / CO / Marks / Total Marks

1. Explain the key components of the Transformer architecture, including the role of
   self-attention and feed-forward layers. How does positional encoding contribute to the model?
   BL: Understanding; CO: 4; Marks: 4; Total Marks: 4
2. Compare RNN, LSTM, and GRU in terms of their ability to handle long-term dependencies.
   Where do they fall short in modeling context in language tasks?
   BL: Analyzing; CO: 4; Marks: 4; Total Marks: 4
3. Analyze and summarize how GPT and BERT differ in their model architecture and training
   objectives. Highlight practical use cases where each excels.
   BL: Evaluating; CO: 5; Marks: 4; Total Marks: 4
4. Draft a step-by-step pipeline for a conditional text generation system using a pre-trained
   transformer (like GPT). Mention how input prompts and decoding strategies affect output.
   BL: Creating; CO: 5; Marks: 4; Total Marks: 4
5. List and explain evaluation metrics used for generative text models (BLEU, ROUGE, METEOR,
   Perplexity). Which metric would you choose for summarization vs. dialogue generation tasks
   and why?
   BL: Evaluating; CO: 5; Marks: 4; Total Marks: 4

Model Institute of Engineering and Technology (Autonomous), Jammu


Assignment: COM-801 MST II

Question 1: Explain the key components of the Transformer architecture,
including the role of self-attention and feed-forward layers. How does positional
encoding contribute to the model?

Ans) The Transformer architecture represents a significant shift in how natural language
processing tasks are performed. Proposed by Vaswani et al. in 2017, it removed the need for
recurrence or convolution, which were previously central to sequence modelling. The original
Transformer has an encoder-decoder structure, although models like BERT and GPT use only one
of these components (the encoder and the decoder, respectively). In a standard Transformer, the
encoder processes the input and the decoder generates the output. At the heart of this
architecture is the self-attention mechanism, which lets every token weigh and relate to all
other tokens in the input sequence simultaneously. For each token, the model computes three
vectors: Query (Q), Key (K), and Value (V). The attention score between two tokens is the dot
product of one token's Query with the other's Key, scaled by the square root of the key
dimension and passed through a softmax function to produce weights. These weights are applied
to the Value vectors to produce a weighted sum, i.e. Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
This allows each word to "attend" to other words based on relevance. For instance, in the
sentence “The cat sat on the mat,” the word “sat” may attend strongly to “cat” and “mat” to
capture who is sitting and where.
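
The Query/Key/Value computation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the three 2-dimensional token vectors are made up for the example, and a real model would obtain Q, K, and V from learned projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors, one row per token."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-token example with made-up vectors (assumption, not learned values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.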

Fig: Transformer architecture

Several attention heads run in parallel (multi-head attention), each with its own learned
projections, allowing the model to capture different semantic and syntactic relationships. The
outputs of the heads are concatenated and projected back to the model dimension.

Following the self-attention block is the feed-forward layer, a fully connected neural network
applied independently to each position. It consists of two linear transformations with a ReLU
(or sometimes GELU) activation function in between. While self-attention enables the model to
exchange information between different positions in the sequence, the feed-forward network
transforms each position’s representation on its own, adding non-linearity and helping the
model generalize and learn complex patterns.

One of the challenges faced by the Transformer is that, unlike RNNs, it processes tokens in
parallel and thus lacks a built-in sense of order. This is where positional encoding comes into
play. Positional encodings are vectors added element-wise to the input embeddings; they are
either fixed sine and cosine functions of different frequencies or learned positional vectors.
This addition enables the model to distinguish between tokens based not only on their identity
but also on their position in the sequence. Without positional encodings, “I love dogs” and
“Dogs love I” would appear similar to the model, because attention on its own ignores token
order. With them, the model can tell that in the sentence “He went home,” the word “home” comes
after “went,” which is crucial for understanding meaning. The sinusoidal form was proposed in
part because it may extrapolate to sequences longer than those seen in training, while learned
positional embeddings may adapt better to specific tasks.

Another critical aspect of the Transformer architecture is the use of residual connections and
layer normalization. Each sub-layer in the Transformer is wrapped with a residual connection
followed by layer normalization: the output of the sub-layer is added to its input, and the sum
is then normalized. These techniques help stabilize training, speed up convergence, and allow
many layers to be stacked while mitigating the vanishing gradient problem that typically
affects deep neural networks.

To illustrate this architecture in a simple way, a diagram showing the Transformer encoder
layer would be helpful. The diagram would include an input embedding layer followed by
positional encoding, then a self-attention block, followed by a feed-forward layer, with each
of the two sub-layers wrapped in a residual connection and layer normalization. For a full
Transformer model, stacking several such layers leads to a powerful architecture capable of
handling complex language tasks.
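
As a small illustration of the sinusoidal positional encoding described above, the sketch below computes the sine/cosine vectors and adds them to word embeddings. The 4-dimensional embeddings for "I", "love", and "dogs" are made-up values used only for the example; a real model learns its embeddings.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Hypothetical toy embeddings (assumption: not taken from any trained model).
embeddings = {
    "I":    [0.2, 0.1, 0.0, 0.3],
    "love": [0.5, 0.4, 0.1, 0.0],
    "dogs": [0.9, 0.0, 0.7, 0.2],
}
pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)

def encode(tokens):
    """Add the positional vector for each position to that token's embedding."""
    return [[e + p for e, p in zip(embeddings[tok], pe[pos])]
            for pos, tok in enumerate(tokens)]

a = encode(["I", "love", "dogs"])
b = encode(["dogs", "love", "I"])
print(a[0] != b[2])  # "I" at position 0 differs from "I" at position 2
```

The two sentences contain the same tokens, but after the positional vectors are added, the same word at a different position has a different input representation.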

The Transformer model's key components (self-attention, multi-head attention, feed-forward
layers, and positional encodings) work together to enable powerful sequence modeling
without recurrence. The self-attention mechanism allows the model to focus on relevant words
regardless of their distance, while the feed-forward layers help refine these representations.
Positional encoding ensures the model captures word order, which is vital for syntactic and
semantic understanding. Together, these elements form the foundation of most modern
language models, including BERT, GPT, and many others that power applications ranging
from chatbots to translation systems.
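
To make the feed-forward sub-layer with its residual connection and layer normalization more concrete, here is a minimal sketch for a single position. The weights are arbitrary illustrative numbers, the layer normalization omits the learnable gain and bias, and the "add, then normalize" order follows the description in this answer; other orderings exist in practice.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """W has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the feed-forward network, then layer norm."""
    ff = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + fi for xi, fi in zip(x, ff)])

# Hypothetical weights: model dimension 2, hidden dimension 3.
W1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]], [0.0, 0.0, 0.0]
W2, b2 = [[0.5, 0.0, 0.0], [0.0, 0.5, 1.0]], [0.0, 0.0]
x = [1.0, 2.0]
ff_out = feed_forward(x, W1, b1, W2, b2)
y = ffn_sublayer(x, W1, b1, W2, b2)
print(ff_out, y)
```

The same function would be applied independently to every position in the sequence, which is what "position-wise" means here.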

Question 2: Compare RNN, LSTM, and GRU in terms of their ability to handle
long-term dependencies. Where do they fall short in modelling context in language
tasks?

Ans) Recurrent Neural Networks (RNNs) represent a foundational class of neural architectures
designed for modelling sequential data, such as natural language text, speech signals, or
financial time series. Unlike feedforward neural networks that treat each input independently,
RNNs capture temporal dependencies by maintaining a hidden state vector that evolves with each
time step. At every step, the hidden state integrates the new input with previously retained
context, which in theory allows the network to remember information from arbitrarily far back
in the sequence. This makes RNNs suitable for tasks where the order and context of data points
matter, such as language modelling, sequence labelling, and time-series forecasting. In
practice, however, RNNs are hard to train because of the vanishing and exploding gradient
problems encountered during backpropagation through time (BPTT). Over long sequences, the
gradients used to update network weights can either decay exponentially (vanish) or grow
uncontrollably (explode), making it extremely difficult for standard RNNs to learn long-range
dependencies. This limitation becomes critical in natural language processing (NLP), where
understanding a word's meaning often depends on context many steps earlier.

To address these shortcomings, Long Short-Term Memory (LSTM) networks were introduced by
Hochreiter and Schmidhuber in 1997. LSTMs enhance the standard RNN architecture by
incorporating a memory cell and a gating mechanism that regulates the flow of information.
Specifically, LSTMs introduce three gates (input, forget, and output), each of which performs a
learned, element-wise operation to store, discard, or expose information. The forget gate
selectively removes information from the previous cell state, the input gate determines what
new information to write, and the output gate decides what part of the current cell state
should be exposed as output. Because the cell state is updated additively, gradients remain
more stable over long sequences, which lets LSTMs learn longer-term dependencies and mitigates
(though does not eliminate) the vanishing gradient issue.

A more computationally efficient alternative to LSTMs is the Gated Recurrent Unit (GRU),
introduced by Cho et al. in 2014. GRUs simplify the LSTM architecture by combining the forget
and input gates into a single update gate, merging the cell state with the hidden state, and
dropping the separate output gate in favour of a reset gate. The update gate balances the
retention of past information against the incorporation of new inputs, while the reset gate
controls how much of the previous hidden state is used when computing the new candidate state.
This streamlined architecture requires fewer parameters and computations, which makes GRUs
attractive in scenarios with limited computational resources or smaller datasets, while still
preserving the capacity to model dependencies over moderate-length sequences.

Despite these improvements, both LSTMs and GRUs inherently operate in a sequential manner,
meaning they process inputs one time step at a time. This sequential dependency impedes
parallelization during training, making them slower and less scalable than modern
architectures like Transformers, which use self-attention to process sequences in parallel.
Furthermore, RNN-based models are still susceptible to memory compression, a phenomenon where
information from long sequences is compressed into a fixed-size hidden state. This often leads
to the loss of fine-grained contextual nuances, especially in tasks requiring the integration
of distant semantic information.
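
The vanishing gradient effect described above can be observed numerically with a one-dimensional RNN. The sketch below is an illustration under simple assumptions: a scalar recurrence h_t = tanh(w * h_{t-1} + x) with a hypothetical weight w = 0.9 and zero input, for which the gradient of the final state with respect to the initial state is a product of per-step derivatives.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh^2(...)) = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return abs(grad)

g1 = gradient_through_time(w=0.9, steps=1)
g10 = gradient_through_time(w=0.9, steps=10)
g50 = gradient_through_time(w=0.9, steps=50)
print(g1, g10, g50)
```

Every factor in the product is at most 0.9 in magnitude, so the gradient shrinks at least as fast as 0.9 raised to the number of steps. In a purely linear recurrence with a weight larger than 1, the same product would instead grow exponentially, which is the exploding case.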

While RNNs, LSTMs, and GRUs marked significant advances in sequence modelling, their
architectural constraints have spurred the development of more efficient and context-aware
models such as Transformers, which are now considered state-of-the-art in many sequence-
processing tasks. In summary, all three still process sequences step by step, which makes
them slower to train than parallelizable architectures and limits their ability to capture
very long-range dependencies. All three also squeeze context into a fixed-size hidden state,
which can cause the loss of nuanced information.
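
The LSTM gating described in this answer can be sketched as a single time step on scalar values. The parameter names (`wf`, `uf`, `bf`, and so on) and their values are hypothetical choices for illustration; real LSTMs use weight matrices and vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds input (w), recurrent (u), and bias (b) terms."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old cell state, write part of the new
    h = o * math.tanh(c)     # expose part of the cell state as output
    return h, c

# Hypothetical parameters: forget gate almost fully open, input gate almost closed.
params = {k: 0.0 for k in
          ["wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bo", "bg"]}
params["bf"] = 10.0
params["bi"] = -10.0

h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
print(c)
```

With the forget gate close to 1, the cell state is carried across 100 steps almost unchanged, which is the mechanism that lets an LSTM retain information over long spans.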

Fig: LSTM

Question 3: Analyse and summarize how GPT and BERT differ in their model
architecture and training objectives. Highlight practical use cases where each
excels.

Ans) GPT and BERT are derived from the Transformer architecture but are built and trained
with fundamentally different objectives and components. GPT (Generative Pre-trained
Transformer) is a decoder-only architecture trained in an autoregressive manner. This means it
learns to predict the next word in a sentence given the previous words. It processes data from
left to right, making it well-suited for tasks involving generation, such as text completion or
creative writing.

BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder part
of the Transformer. It is trained using masked language modelling, where random tokens in a
sentence are masked, and the model learns to predict them using the context from both left and
right.

Fig: BERT v GPT

Architectural and Functional Distinctions Between GPT and BERT

At the architectural level, Generative Pre-trained Transformers (GPT) and Bidirectional
Encoder Representations from Transformers (BERT) represent two distinct approaches to
leveraging the Transformer framework introduced by Vaswani et al. While both utilize the core
Transformer design, their structural and functional differences reflect divergent philosophical
and operational goals. GPT employs a decoder-only architecture with masked (causal) self-
attention, meaning that at each position in the sequence, the model can only attend to tokens
that appear earlier (i.e., to the left) in the sequence. This prevents the model from "looking
ahead" and ensures that each token is predicted based solely on past context. This
autoregressive behaviour makes GPT inherently generative, capable of producing coherent
sequences one token at a time, which is ideal for open-ended tasks such as story generation,
dialogue systems, and code synthesis. BERT is based on an encoder-only architecture that
uses bidirectional self-attention, enabling each token to attend to all other tokens in the input
simultaneously, regardless of their position. This full contextualization allows BERT to
generate deep, bidirectional representations of the input, making it well-suited for
discriminative and contextual understanding tasks, such as sentence classification, entity
recognition, and extractive question answering. BERT is not generative in the traditional sense
and is instead optimized for understanding and labelling existing text.
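
The difference between causal (GPT-style) and bidirectional (BERT-style) self-attention comes down to which positions each token is allowed to see. A minimal sketch of the two attention masks, using a hypothetical helper `attention_mask`:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j; 0 means it is blocked."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

gpt_style = attention_mask(4, causal=True)    # lower-triangular: no looking ahead
bert_style = attention_mask(4, causal=False)  # all ones: full bidirectional context

for row in gpt_style:
    print(row)
```

In an actual model, blocked positions are typically given a very large negative score before the softmax so that their attention weight becomes effectively zero.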

Use Cases and Task Alignment

These architectural differences translate directly into their most effective use cases. GPT
models excel in creative and generative tasks, such as:

• Conversational agents and chatbots
• Creative writing and storytelling
• Automatic code generation
• Language translation (especially through prompt engineering)
• Completion-based tasks

Because GPT is autoregressive and trained to predict the next token, it is inherently flexible in
tasks where text needs to be generated dynamically based on a given prompt. BERT, on the
other hand, shines in structured, discriminative tasks where the goal is to understand or
classify text rather than generate it. Prominent applications include:

• Text classification (e.g., sentiment analysis, spam detection)
• Named Entity Recognition (NER)
• Intent recognition in dialog systems
• Extractive question answering (e.g., SQuAD), where the model identifies start and end
  spans of an answer within a passage

Training Objectives and Generalization Capabilities

The training objectives further reinforce their roles. GPT is trained using a causal language
modelling (CLM) objective, predicting the next word in a sequence given all previous words.
This enables GPT to learn sequential dependencies and generate fluent text.

BERT, by contrast, uses a masked language modelling (MLM) objective, where random
tokens in the input are masked and the model learns to predict them. Additionally, it uses a
next sentence prediction (NSP) task to improve understanding of sentence-level relationships.
This bidirectional training allows BERT to generate contextual embeddings that capture fine-
grained semantic relationships.
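
The two training objectives can be contrasted by looking at how training examples are built from the same sentence. The sketch below is a simplification: the masking probability and seed are arbitrary illustration values, and BERT's actual masking procedure has additional details (such as sometimes keeping or randomly replacing the selected token) that are not shown.

```python
import random

def causal_lm_examples(tokens):
    """CLM: each prefix is the context, and the token that follows it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """MLM: hide random tokens; the model must recover them from both sides."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for idx, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[idx] = tok  # remember the original token as the target
        else:
            masked.append(tok)
    return masked, labels

sentence = ["the", "cat", "sat", "on", "the", "mat"]
clm = causal_lm_examples(sentence)
masked, labels = masked_lm_example(sentence, mask_prob=0.5, seed=1)
print(clm[:2])
print(masked, labels)
```

In the causal case every target is predicted from the left context only, whereas in the masked case the unmasked tokens on both sides of a gap are available as context.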

As a result, GPT generalizes well in open-ended scenarios, especially in later iterations like
GPT-3 and GPT-4, which leverage few-shot, one-shot, and zero-shot learning paradigms
through prompt engineering. BERT, on the other hand, excels in precision tasks where
structured input and output formats are well-defined and where labelled data is available for
fine-tuning.

Fine-Tuning Approaches and Practical Deployment

Another significant distinction lies in how these models are adapted to downstream tasks.
BERT is typically fine-tuned with the addition of task-specific heads. For instance:

• In sentence classification, a softmax classification head is added to the [CLS] token
  output.
• For question answering, a linear output layer is appended to predict the start and end
  positions of the answer span in the input.

This modular fine-tuning strategy, combined with BERT’s robust embeddings, enables high
accuracy across a variety of supervised tasks, even with relatively small, labelled datasets.

GPT, particularly in its later versions, is often used in a few-shot or zero-shot setting without
explicit fine-tuning. This is made possible by prompt engineering, where the task instructions
and a few demonstrations are embedded directly into the prompt. For example:

The prompt "Translate 'Hello' to French: Bonjour. Translate 'Thank you' to French:" will
naturally lead GPT to generate "Merci" based on its pretrained knowledge.

This approach is especially advantageous in scenarios where task-specific labelled data is
scarce or where rapid adaptability across domains is needed.

Hybrid Models: Combining the Best of Both Worlds

To bridge the gap between these two paradigms, hybrid models like T5 (Text-to-Text
Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers) have
been proposed. These models combine the strengths of both encoder and decoder architectures:

• T5 frames all NLP tasks, from translation to classification, as text-to-text problems,
  leveraging a unified architecture.
• BART uses a denoising autoencoder structure, employing an encoder to read corrupted
  input text and a decoder to reconstruct the original, making it versatile for both
  generative and discriminative tasks.

These models reflect a broader shift toward multifunctional architectures that can flexibly
switch between comprehension and generation based on task formulation.

Question 4: Draft a step-by-step pipeline for a conditional text generation system
using a pre-trained transformer (like GPT). Mention how input prompts and
decoding strategies affect output.

Ans) Conditional Text Generation: Architecture, Workflow, and Applications

Conditional text generation refers to the computational paradigm in which a language model
generates coherent, contextually relevant, and goal-directed text in response to an explicit input
—termed the conditioning signal. Unlike unconditional generation, where models produce
text from random or generic seeds, conditional generation is context-sensitive, driven by
prompts, structured queries, or natural language instructions. These conditions act as semantic
anchors, guiding the model toward producing output aligned with the user's intent or the task
at hand.

In the modern deep learning landscape, conditional generation is predominantly executed
through pre-trained transformer-based architectures, such as GPT (Generative Pre-trained
Transformer), T5 (Text-to-Text Transfer Transformer), and BART (Bidirectional
and Auto-Regressive Transformers). These models are fine-tuned or prompted to produce
high-quality natural language outputs across a wide array of downstream applications. Among
these, GPT has emerged as a particularly effective architecture for generative tasks due to its
autoregressive nature, expansive training corpus, and impressive fluency in language
synthesis.

Pipeline Overview: From Condition to Output

The pipeline for conditional generation unfolds through several distinct yet interconnected
stages:

1. Task Identification and Prompt Construction: The first step involves defining the
task and the desired form of the output—be it a dialogue turn, a summary, a narrative, or
an email draft. The conditioning input could range from structured data (e.g., tables,
code snippets) to unstructured language (e.g., questions, descriptions, or partial
sentences). This input is then transformed into a natural language prompt that clearly
communicates the intent to the model. The effectiveness of prompt engineering is
critical here. A vague instruction like "Write" lacks semantic guidance, whereas a
structured prompt such as "Story Prompt: A dragon guards a village. Continue the
story:" is rich in context, purpose, and tone, thereby enhancing the model’s ability to
generate relevant and stylistically appropriate text.
2. Tokenization and Numerical Encoding: Once the prompt is finalized, it undergoes
tokenization, where it is segmented into subword units or tokens. These are then
mapped to integer token IDs using the model’s vocabulary, and the model’s embedding
layer converts each ID into a dense vector. This numerical representation is essential,
as transformer models operate on vectorized inputs within high-dimensional spaces.
3. Autoregressive Text Generation: The input tokens are fed into the transformer’s
encoder-decoder (e.g., T5, BART) or decoder-only (e.g., GPT) architecture, where the
generation process is initiated. This process is autoregressive: the model predicts the
next token based on the previously generated tokens, recursively building the sequence
one token at a time. The self-attention mechanism enables the model to contextualize
each new token with respect to the entire history of preceding tokens, capturing both
local coherence and long-range dependencies.
4. Decoding Strategies: The quality, coherence, and stylistic diversity of the generated
text are heavily influenced by the choice of decoding algorithm:
   o Greedy Decoding: Selects the token with the highest probability at each
     timestep. While computationally efficient, it is deterministic and often leads to
     repetitive outputs.
   o Beam Search: Explores multiple candidate sequences simultaneously, selecting
     the one with the highest cumulative probability. This yields more grammatical
     and coherent results, though it may reduce output variability.
   o Top-k Sampling: Restricts the sampling pool to the top k most probable tokens
     at each step, introducing controlled randomness and creativity.
   o Top-p (Nucleus) Sampling: Dynamically selects the smallest subset of tokens
     whose cumulative probability exceeds a threshold p. This strategy balances
     fluency and diversity and is generally preferred for open-domain text
     generation.
   o Temperature Scaling: Adjusts the sharpness of the probability distribution. A
     high temperature (e.g., >1.0) flattens the distribution and encourages creative,
     less predictable outputs, while a low temperature (e.g., 0.3–0.6) sharpens it and
     promotes more conservative, near-deterministic responses. Temperature is
     typically combined with one of the sampling strategies above.
5. Detokenization and Post-processing: After the token sequence is generated, it is
detokenized—converted back from token IDs into human-readable text. Post-
processing may involve formatting, correcting punctuation, removing special tokens, or
applying task-specific refinements to ensure the final output is coherent, polished, and
presentation-ready.
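
The decoding strategies in step 4 can be sketched on a single next-token distribution. This is a toy illustration, not any library's implementation: the five-word vocabulary and the logits are invented, the helper names are hypothetical, and beam search is left out for brevity.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]

def greedy(probs):
    """Index of the single most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for idx in order:
        keep.append(idx)
        cumulative += probs[idx]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up vocabulary and next-token logits (assumptions for the example).
vocab = ["the", "dragon", "village", "slept", "roared"]
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = softmax(logits)
rng = random.Random(0)

greedy_token = vocab[greedy(probs)]
top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.9)
sampled_token = vocab[sample(top_p_probs, rng)]
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
print(greedy_token, sampled_token)
```

Comparing `sharp`, `probs`, and `flat` shows the effect of temperature: the top token's probability rises as the temperature falls and drops as it increases.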

Evaluation and Quality Assessment

In production environments and research settings, the generated output is evaluated using a
combination of automatic metrics and human evaluation. Traditional evaluation metrics
such as BLEU, ROUGE, and METEOR measure surface-level n-gram overlap and are useful for tasks
like summarization or translation. However, they fall short in capturing semantic equivalence
and discourse-level coherence. As a result, more semantic-aware metrics like BERTScore
and BLEURT are increasingly preferred. These leverage contextual embeddings and human-
labelled datasets, respectively, to assess the meaning, factual correctness, and linguistic
quality of generated text. Nevertheless, human-in-the-loop evaluation remains the gold
standard for assessing pragmatic suitability, user satisfaction, and domain-specific
appropriateness.

Applied Example: Automating Email Responses

To illustrate the practical deployment of conditional generation, consider a use case in
enterprise automation: automated email reply generation. Suppose an incoming email reads:
“I would like to schedule a meeting next week regarding the product update.”

A well-constructed prompt might look like: “Email: I would like to schedule a meeting next
week regarding the product update. Reply:” Given this prompt, the model could generate:
“Thank you for reaching out. I’d be happy to schedule a meeting. Please share your
availability.”

This response is not only grammatically fluent but also contextually appropriate and
professionally courteous, demonstrating the efficacy of conditional generation in real-world,
goal-oriented text production.

Conditional text generation epitomizes the convergence of natural language understanding
and generation, enabling AI systems to craft responses that are context-aware, semantically
coherent, and task-specific. Using pre-trained transformer architectures and sophisticated
decoding strategies, models can be guided to produce highly customized outputs from minimal
input signals. As advancements in prompt engineering, evaluation metrics, and transfer
learning continue to mature, conditional generation is poised to play an ever-expanding role in
intelligent automation, creative assistance, and human-AI interaction.

Fig: Anatomy of LLMs

Question 5: List and explain evaluation metrics used for generative text models
(BLEU, ROUGE, METEOR, Perplexity). Which metric would you choose for
summarization vs. dialogue generation tasks and why?

Ans) Evaluating Generative Language Models: Challenges, Metrics, and Advances

The evaluation of generative language models constitutes a fundamental yet intrinsically
complex aspect of natural language processing (NLP), primarily due to the non-deterministic
and open-ended nature of language generation tasks. Unlike traditional classification or
regression problems that have well-defined ground truths, generative outputs are inherently
multi-referential—multiple semantically valid responses can exist for a single input. This
multiplicity challenges the development of standardized evaluation methodologies and
necessitates a hybrid approach, combining both automatic metrics and human judgment to
assess the quality, relevance, and coherence of generated text.

Traditional Surface-Level Evaluation Metrics

A suite of classical evaluation metrics has been developed to provide quantitative proxies for
textual quality, although they predominantly rely on surface-level n-gram overlaps between
the generated output and one or more human-authored reference texts.

• BLEU (Bilingual Evaluation Understudy) is one of the earliest and most widely
  adopted metrics, particularly in the domain of machine translation. It is
  precision-oriented: it calculates the proportion of n-grams in the candidate that also
  appear in the reference, and applies a brevity penalty so that very short candidates
  are not rewarded. BLEU’s strength lies in its simplicity and scalability; however, it
  has significant limitations. Specifically, BLEU can unfairly penalize lexically
  diverse yet semantically equivalent paraphrases, because it relies on exact word or
  phrase matching and does not account for synonymy or variations in sentence structure.
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation), on the other hand, is
  more recall-focused and has gained prominence in summarization tasks, where
  retaining core informational content is paramount. ROUGE variants include:
   o ROUGE-1: Measures unigram overlap.
   o ROUGE-2: Captures bigram co-occurrence.
   o ROUGE-L: Uses the longest common subsequence, which gives some insight
     into fluency and structural alignment.

Despite its broader applicability, ROUGE still exhibits a bias toward lexical similarity, often
underrepresenting the semantic equivalence between text spans that diverge in surface form but
not in meaning.

• METEOR (Metric for Evaluation of Translation with Explicit ORdering) was
  proposed to address BLEU's shortcomings by introducing linguistically motivated
  enhancements. It integrates stemming, synonym matching (via resources like
  WordNet), and word-order penalties, making it more sensitive to paraphrastic and
  semantically diverse expressions. This renders METEOR particularly valuable in tasks
  where semantic flexibility is more critical than syntactic exactness, such as
  paraphrasing and dialogue generation.
• Perplexity, unlike the above metrics, is an intrinsic evaluation measure that quantifies
  how well a probabilistic language model predicts the next token in a sequence. It is
  the exponential of the average negative log-probability the model assigns to each
  actual token, so a lower perplexity indicates that the model finds the text more
  predictable, which is usually read as higher confidence and fluency. However,
  perplexity is inadequate for evaluating semantic relevance or factual accuracy, as it
  operates independently of any reference output and offers no insight into whether the
  generated content is contextually or topically appropriate.
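
The core computations behind the four metrics above can be sketched compactly. These are simplified illustrations under stated assumptions, not the full published metrics: the BLEU-style function is a clipped n-gram precision for one reference without the brevity penalty or the combination across n-gram orders, the ROUGE functions report recall only, and the per-token probabilities passed to `perplexity` are made-up numbers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """BLEU-style clipped precision: share of candidate n-grams found in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: share of reference n-grams recovered by the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence (the basis of ROUGE-L)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each actual token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

p1 = ngram_precision(candidate, reference, 1)
p2 = ngram_precision(candidate, reference, 2)
r1 = rouge_n_recall(candidate, reference, 1)
rouge_l_recall = lcs_length(candidate, reference) / len(reference)
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(p1, p2, r1, rouge_l_recall, ppl)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, meaning it is as uncertain as if it were choosing uniformly among four options at each step.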

Limitations of Traditional Metrics and the Rise of Semantic Evaluation

While metrics such as BLEU, ROUGE, and METEOR provide a baseline for automated
evaluation, they fall short when confronted with tasks that demand semantic depth, pragmatic
consistency, and contextual nuance. These shortcomings have catalysed the development of
semantic-aware evaluation metrics that exploit the representational power of pre-trained deep
learning models.

 BERTScore, for example, leverages contextualized token embeddings from
Transformer-based models like BERT to perform token-level matching based on cosine
similarity in embedding space. Unlike BLEU or ROUGE, which require exact word
matches, BERTScore can reward semantic equivalence even when lexical choices
diverge significantly. This makes it especially useful in applications such as abstractive
summarization, open-ended question answering, and conversational agents, where
semantic fidelity is more crucial than verbatim repetition. For instance, it would assign
a high similarity score to the pair: “The boy is running.” vs. “The child is sprinting.”
even though the two sentences share only the function words “The” and “is” and have no
bigrams in common.
 BLEURT (Bilingual Evaluation Understudy with Representations from
Transformers) represents a further evolution. Unlike heuristic-based methods,
BLEURT is a learned metric: a pre-trained Transformer model fine-tuned on human
quality ratings. It is trained to predict human evaluation scores directly, allowing it
to reflect not just surface alignment but also deeper linguistic attributes such as
fluency, grammar, coherence, and contextual appropriateness. BLEURT has been
reported to correlate strongly with human evaluators, which makes it an attractive
choice for real-world deployment, especially in high-stakes NLP applications like
clinical summarization or policy dialogue generation.
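
The greedy token matching that BERTScore performs can be sketched as follows. This is a minimal illustration only: the four-dimensional vectors in TOY_EMBEDDINGS are hand-made, hypothetical stand-ins for the contextual embeddings a real BERT encoder would produce, and the function names are illustrative. The actual metric additionally offers options such as inverse-document-frequency weighting and baseline rescaling, which are omitted here.

```python
import math

# Hypothetical embeddings invented for illustration; real BERTScore obtains
# context-dependent vectors from a pre-trained Transformer encoder.
TOY_EMBEDDINGS = {
    "the": (1.0, 0.0, 0.0, 0.0),
    "is": (0.0, 1.0, 0.0, 0.0),
    "boy": (0.0, 0.0, 1.0, 0.1),
    "child": (0.0, 0.0, 1.0, 0.2),
    "running": (0.0, 0.0, 0.1, 1.0),
    "sprinting": (0.0, 0.0, 0.2, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference, embed):
    """Precision, recall, and F1 from greedy best-match cosine similarity."""
    cand = [embed[tok] for tok in candidate.lower().split()]
    ref = [embed[tok] for tok in reference.lower().split()]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = greedy_match_score(
    "the child is sprinting", "the boy is running", TOY_EMBEDDINGS
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With these toy vectors the paraphrase pair scores close to 1, whereas an exact-match unigram comparison of the same pair would credit only two of the four tokens. The quality of the result depends entirely on the embeddings, which is also why the real metric inherits the cost of running a deep encoder.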

Both BERTScore and BLEURT are computationally intensive, often requiring GPU
acceleration for efficient evaluation. Their reliance on deep contextual encoders also makes
them less interpretable than simpler metrics, although their alignment with human perception
often justifies the computational overhead in research and production environments.

Task-Specific Metric Preferences

Given the diverse nature of generative tasks, metric selection should be tailored to task-
specific objectives:

 For summarization, metrics like ROUGE remain a staple due to their emphasis on
content coverage. However, when deeper semantic evaluation is needed, BERTScore
and BLEURT provide richer, more faithful assessments of the summary’s
informativeness and coherence.
 In dialogue generation, traditional metrics often fail to capture subtleties such as
coherence across turns, engagement, or contextual relevance. Here, METEOR may
offer moderate gains, but BLEURT or human evaluations are often preferred due to
their ability to reflect real conversational quality, including the appropriateness of tone,
factual alignment, and user intent.

Evaluating generative models remains a multifaceted challenge that transcends simple
token-level comparisons. While traditional metrics such as BLEU, ROUGE, METEOR, and
perplexity offer computational efficiency and historical relevance, they are increasingly
insufficient for capturing the semantic and pragmatic dimensions of language
generation. The emergence of neural evaluation metrics like BERTScore and BLEURT
signifies a paradigm shift toward semantically grounded, human-aligned assessment
frameworks. As generative models continue to evolve in sophistication, the development
and adoption of robust, task-specific, and semantically informed evaluation methodologies
will be critical in driving meaningful progress in natural language generation research and
deployment.

Fig: ROUGE: Decoding the Quality of Machine-Generated Text
