Understanding Transformer Architecture

Transformers are a deep learning architecture introduced in 2017 that revolutionized the processing of sequential data by utilizing self-attention mechanisms, allowing for parallel processing and improved handling of long sequences. They address the bottleneck problem of traditional sequence-to-sequence models by enabling the model to focus on different parts of the input dynamically, thus retaining more information. The architecture has become foundational in natural language processing and other AI applications, leading to significant advancements in the field.

Transformers

Agenda:
1 What are Transformers?

2 Challenges of Sequence-to-Sequence Models and Their Solutions

3 What is Attention?

4 The Birth of Transformers

5 The Building Blocks of the Transformer Architecture

6 Multi-Head Self-Attention
What are Transformers?
Transformers
Transformers are a type of deep learning architecture designed to handle sequential data, such as text,
more efficiently than older models like RNNs (Recurrent Neural Networks), GRUs (Gated Recurrent Units),
and LSTMs (Long Short-Term Memory networks). Transformers were introduced in the landmark paper
"Attention is All You Need" by Vaswani et al. in 2017 and have since revolutionized the field of Natural
Language Processing (NLP) and other domains like computer vision and audio processing.
Transformers
Transformers are at the core of the latest advancements in AI
• Most of the recent developments and big breakthroughs in AI are all based on the
Transformer architecture. It has revolutionized the way models are designed and how they
function.
Transformers
Seq-to-seq (Sequence to Sequence) model
• Originally, the Seq-to-Seq models were used for tasks like translation from one language
to another. However, the Transformer architecture improved these models by allowing for
parallel processing and handling longer sequences more efficiently than previous models
like LSTMs and GRUs.
Transformers
Advanced topics in Transformers
• Despite its simplicity, the Transformer architecture has made massive contributions to the
field and is being used in advanced applications such as:
• Text Generation
• Video Generation
Transformers
Comparison with previous models (LSTM, GRU, RNN)
• Transformers consist of a collection of blocks that function in a parallel manner, unlike
older models like LSTMs and GRUs, which rely on sequential processing (working step by
step). This parallel processing capability makes Transformers much faster and more
efficient in handling large datasets.
Why Transformers Are Important?
• Transformers have become the backbone of many state-of-the-art models in NLP and other areas due to
their ability to handle long sequences and capture intricate relationships in data. Models like BERT
(Bidirectional Encoder Representations from Transformers) and GPT are based on the Transformer
architecture and have achieved groundbreaking results in various tasks like language understanding,
text generation, and even question-answering systems.
Challenges of Sequence-to-Sequence
Models and Their Solutions
Sequence-to-Sequence (Seq2Seq) Architecture
• Seq2Seq models are a common architecture for tasks that involve converting one sequence into
another, such as machine translation, where the input is a sentence in one language, and the output is
the translated sentence in another language.
The architecture typically consists of

• Encoder: The encoder takes the input sequence (e.g., a sentence) and compresses it
into a fixed-size vector representation.

• Decoder: The decoder takes this fixed-size vector from the encoder and generates the
output sequence (e.g., the translated sentence).
The Bottleneck Problem
The image explains a key limitation of the traditional seq2seq architecture—known as the
bottleneck problem. This problem arises because:
The Bottleneck Problem
• The encoder has to compress all information from the input sequence (often very long) into a
fixed-size vector. This vector is then passed to the decoder, which uses it to generate the output.

• If the input sequence is too long or complex, the encoder might fail to capture all the necessary
information in this fixed-size representation. This leads to an information bottleneck because the
decoder depends entirely on this fixed-size vector for generating the output.
The Two Major Issues
The bottleneck problem leads to two specific issues in seq2seq models
1. Encoder Issue
• The encoder is forced to capture all the information about the input sequence
(e.g., every word in a long sentence) in a single fixed-size vector. This becomes
difficult as the sequence gets longer, resulting in the loss of important details.

• In the diagram: The orange text shows how the encoder (represented by the red
blocks) struggles to compress all the information into a fixed-size vector.
The Two Major Issues
The bottleneck problem leads to two specific issues in seq2seq models
2. Decoder Issue
• The decoder relies on this fixed-size vector to reproduce the entire output
sequence. If the vector lacks crucial information from the input, the decoder will
generate inaccurate or incomplete sequences.

• In the diagram: The green text shows how the decoder (represented by the green
blocks) uses the fixed-size vector to generate the output sentence. If important
information is lost during encoding, it leads to poor results.
How Transformers Solve the Bottleneck Problem?
To overcome the limitations of traditional seq2seq architectures built on RNNs, the Transformer
architecture was introduced. Here's how Transformers address the bottleneck problem:
• Self-Attention Mechanism: Instead of compressing the entire input sequence into a
single fixed-size vector, Transformers use a self-attention mechanism to allow the model
to focus on different parts of the input sequence at different times. This avoids the need
for a fixed-size vector and allows the model to handle longer sequences more effectively.
How Transformers Solve the Bottleneck Problem?
To overcome the limitations of traditional seq2seq architectures built on RNNs, the Transformer
architecture was introduced. Here's how Transformers address the bottleneck problem:
• Parallel Processing: Transformers process the entire sequence in parallel, unlike RNNs,
which process one word at a time sequentially. This parallelism speeds up computation
and improves performance on longer sequences.
Summary of the Key Points
• Bottleneck Problem: The encoder compresses all input information into a fixed-size vector,
which can lead to information loss, especially with long or complex sequences. This results in
poorer performance for the decoder.

• Seq2Seq with RNNs: Traditional seq2seq models using RNNs are susceptible to the bottleneck
issue because they rely on a fixed-size vector representation of the input.

• Transformers: Overcome this issue by using self-attention mechanisms and parallel processing,
allowing them to capture more context and handle longer sequences effectively.
What is Attention?
What is Attention?
Attention Mechanism
• The Attention mechanism was introduced to solve the bottleneck problem present in
sequence-to-sequence models. Here are the key points
• Problem: The bottleneck problem occurs because the encoder compresses the
entire input sequence into a fixed-size vector. This means that when the input
sequence is long, important information may be lost.

• Solution via Attention: The Attention mechanism helps by directly connecting
the decoder to the encoder at each step. This allows the model to focus on
different parts of the input sequence depending on the task at hand.

• Core Idea: In each step of the decoder, attention assigns importance scores (or
weights) to different parts of the input sequence. The decoder then uses these
scores to focus on the relevant information from the input while generating the
output. This helps overcome the limitations of compressing the input into a fixed-
size vector.
What is Attention?
How Attention Works?
• Step 1: The decoder output is combined with the hidden states of the encoder via a dot
product operation to calculate attention scores.

• Step 2: These attention scores are normalized using the softmax function to turn them
into a probability distribution. This distribution helps the model determine which part of
the input sequence to focus on.

• Illustration: In the diagram, at each step, the decoder is focusing on a specific part of the
encoder's hidden state. For example, at one timestep, the model is focusing more on the
word "he."
What is Attention?
How Attention Works?
Handling Sequential Data Using Convolutions vs. RNNs
Two major methods are used for handling sequential data
1-D Convolutions
• Word Embeddings: Inputs are first converted into word embeddings, which turn
each word into a vector representation.

• Convolutions: A convolutional filter is applied to these word vectors to extract
important features.

• Non-linearity and Activation: After convolution, non-linearity is introduced to
capture complex patterns. The output is passed through an activation function.

• Problem: While this approach can work in parallel and capture local patterns, it
struggles with long sequences because deeper convolution layers are required,
increasing the computational cost.
Handling Sequential Data Using Convolutions vs. RNNs
Two major methods are used for handling sequential data
RNNs
• RNNs process inputs sequentially, handling variable-length inputs. However, they rely on
hidden states to capture long-term dependencies, which makes them slow for long
sequences. Also, they suffer from issues like vanishing gradients over long sequences.
RNN-Based Sequence-to-Sequence Model
How traditional RNN-based seq2seq models work, and their limitations:
1. Many-to-One and One-to-Many Mapping
• RNNs are typically used for converting a sequence of inputs into a single output (many-
to-one) or generating a sequence of outputs from a single input (one-to-many).
2. Context Vector
• In traditional RNNs, the entire input sequence is encoded into a single context vector. The
problem is that as the sequence length increases, the context vector becomes less
effective at retaining all the necessary information (bottleneck problem).
RNN-Based Sequence-to-Sequence Model
How traditional RNN-based seq2seq models work, and their limitations:
3. Limitations
• RNNs cannot process long sequences efficiently due to the fixed-size context vector.

• They are also difficult to parallelize, making them slower and harder to scale.

• Vanishing gradients: When backpropagating through long sequences, RNNs suffer
from vanishing gradients, which makes it difficult to learn long-term dependencies.
Organized Summary
1. What is Attention
• Attention solves the bottleneck problem by allowing the decoder to focus on
different parts of the input sequence.
2. How Attention Works
• Attention scores are computed by taking the dot product between the decoder's
current output and the encoder's hidden states. These scores are converted into a
probability distribution using softmax, guiding the model's focus.
Organized Summary
3. Handling Sequential Data
• 1-D Convolutions: Work in parallel and are computationally efficient for short
sequences, but struggle with longer sequences.
• RNNs: Can process long sequences but suffer from slow processing and the
vanishing gradient problem.
4. Limitations of RNNs in Seq2Seq Models
• RNNs struggle with long sequences due to the bottleneck issue with the fixed-size
context vector.
• They are difficult to parallelize and scale for large datasets.
• They also face the vanishing gradient problem, which affects their ability to capture
long-term dependencies.
The Birth of Transformers
The Birth of Transformers
Origins of Transformers
Attention Mechanism Preceding Transformers
• Before the advent of Transformers, the attention mechanism was first
introduced to improve sequence-to-sequence (seq2seq) models. The attention
mechanism allowed models to focus on different parts of the input sequence
dynamically rather than relying on a fixed-size context vector, solving the
bottleneck problem in seq2seq architectures.
The Birth of Transformers
Origins of Transformers
How Attention Works with Transformers
• Transformers fully adopted the attention mechanism as a core part of their
architecture. Unlike previous models that used attention as an add-on,
Transformers made attention a fundamental building block for processing
sequences, using self-attention to handle long dependencies in data more
effectively.
The Birth of Transformers
How Attention Solved the Bottleneck Problem
• In traditional seq2seq models, the encoder compresses all the information into a
single vector, leading to information loss, especially in long sequences.

• Attention improves this by using a weighted sum of hidden states from the encoder.
Instead of relying solely on the final hidden state, attention dynamically computes
importance scores (weights) for each hidden state, allowing the decoder to focus on the
most relevant parts of the input sequence at each timestep.

• This allows the model to retain more information, preventing the bottleneck issue and
enabling the model to better capture long-range dependencies in the input sequence.
The Birth of Transformers
How Attention Solved the Bottleneck Problem
The Birth of Transformers
Key Advantages of Transformers
Parallelization of the Encoder and Decoder
• Traditional RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term
Memory) rely on sequential processing, which makes them slow and
inefficient for long sequences. Transformers, by contrast, process input
sequences in parallel through self-attention, drastically speeding up
training and inference.
The Birth of Transformers
Key Advantages of Transformers
Handling Very Long Sequences
• RNN-based models struggle with very long sequences due to the vanishing
gradient problem and their inability to retain important information over
long distances. The self-attention mechanism in Transformers allows the
model to focus on all parts of the input sequence simultaneously, making
them much more effective for handling long sequences.
The Birth of Transformers
The Breakthrough of the "Attention is All You Need" Paper
Introduction of Transformers
• The paper titled "Attention is All You Need", published in December 2017,
introduced the Transformer architecture. It fundamentally changed the way
sequence data is processed by replacing recurrent layers (RNNs, LSTMs) with
self-attention mechanisms. This allowed for parallel processing and much faster
computation times.
Summary
• Transformers emerged as a groundbreaking architecture because they fully
integrated the self-attention mechanism, solving bottlenecks in sequence-
to-sequence models.

• By parallelizing the encoder and decoder and utilizing self-attention,
Transformers significantly improved the handling of long sequences and
reduced training time.

• The introduction of the "Attention is All You Need" paper in 2017 was a major
milestone in the field of deep learning, leading to significant advancements in
NLP (Natural Language Processing) and beyond.
The Building Blocks of the Transformer
Architecture
Building Blocks of Transformers
Encoder and Decoder Structure
• The Transformer consists of two main components
• Encoder: Processes the input sequence and outputs a context vector.

• Decoder: Takes the context vector and generates the output sequence. In
tasks like translation, the decoder produces the translated text word by
word, using the context information from the encoder.
• Context Vector and Attention
• The context vector represents the semantic meaning of the input sequence.

• Attention Mechanism: Helps the decoder focus on relevant parts of the input
sequence by assigning weights to different encoder hidden states. This allows for
capturing relationships between words even when they are far apart in the
sequence.
Building Blocks of Transformers
Encoder and Decoder Structure
Parallel Processing and Positional Encoding
Parallelism in Transformers
• Unlike traditional models like RNNs, which process sequences word-by-word,
Transformers operate in parallel. This means that all words in the input are processed
simultaneously, which speeds up the computation. However, parallel processing doesn't
naturally capture the order of words, which is why positional encoding is needed.
Positional Encoding
• Since Transformers don't inherently understand word order (due to parallel processing),
positional encoding is added to represent the position of each word in the sequence. This
ensures that the model can capture the correct order and relationships between words.
Parallel Processing and Positional Encoding
Input Embedding
1. Tokenization
• Before feeding the input into the Transformer, the text is broken down into smaller pieces
called tokens. These tokens can be words or even subwords. Each token is then converted
into a numerical representation (ID) using a dictionary.
2. Embedding
• Once tokenized, the tokens are transformed into word embeddings, which are continuous-
valued vectors. These vectors represent each word in a way that captures its meaning and
relationships to other words (e.g., similar words have similar vectors). Popular techniques
for generating embeddings include Word2Vec and GloVe.
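The tokenization-then-embedding pipeline described above can be sketched as follows. The vocabulary, token IDs, and embedding dimension here are hypothetical toy values; real models use learned embedding tables over much larger vocabularies.

```python
import numpy as np

# Hypothetical toy vocabulary mapping tokens to integer IDs.
vocab = {"i": 0, "love": 1, "tennis": 2}
d_model = 8

# Embedding table: one vector per token ID. Randomly initialized here;
# in practice these are learned, or loaded from Word2Vec/GloVe.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    # Step 1: tokenize and convert tokens to IDs via the dictionary.
    token_ids = [vocab[w] for w in sentence.lower().split()]
    # Step 2: look up each ID's continuous-valued vector.
    return embedding_table[token_ids]   # shape: (seq_len, d_model)

x = embed("I love tennis")
print(x.shape)   # (3, 8)
```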
Input Embedding
Positional Encoding
Purpose of Positional Encoding
• Since Transformers process all words in parallel, they require a way to understand the order
of the words. Positional encoding gives each word a unique vector based on its position in
the sequence. This allows the model to differentiate between, for example, "The mouse ran
up the clock" and "The clock ran up the mouse."
Calculation of Positional Encoding
• Positional encodings are generated using sinusoidal functions. The formula involves a
combination of sine and cosine functions applied to the position of each word and the
dimensions of the model.

• The encoding vectors are unique for each position and help the model preserve word order
while still benefiting from parallel processing.
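Written out, the sinusoidal scheme from the "Attention is All You Need" paper is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the word's position in the sequence, i indexes the dimension pairs of the encoding vector, and d_model is the model's dimension size.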
Positional Encoding
How the Positional Vector is Computed
Sin and Cos Functions
• Positional vectors are computed using sine and cosine functions that incorporate the
word's position (pos) and the model's dimension size (d_model). The values alternate
between sine and cosine functions for different dimensions of the vector.
Large Number Division

• Each position index is divided by powers of 10,000 to normalize the values and ensure
that the positional encodings differ for different positions. This unique encoding helps
the Transformer distinguish between words based on their positions, even when their
embeddings are otherwise similar.
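The computation described above (sine and cosine alternating across dimensions, with positions divided by powers of 10,000) can be sketched in NumPy. The sequence length and dimension below are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]         # word positions, shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # dimension-pair indices, shape (1, d_model/2)
    # Divide each position by powers of 10,000 to get the angle arguments.
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)     # (10, 16)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

Each row is a unique vector for its position, which is added to the corresponding word embedding before entering the encoder.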
Positional Encoding
How the Positional Vector is Computed
Encoder Block in the Transformer
The Encoder Block is a crucial part of the Transformer architecture, responsible for processing the
input data and transforming it into a context-rich representation for the decoder to use. The encoder
block has several components that work together to achieve this.
Encoder Block in the Transformer
Components of the Encoder Block
1. Positional Encoding
• Since Transformers process the input sequence in parallel (unlike RNNs), they need a way
to maintain the order of the sequence. Positional Encoding is added to each input token
to provide information about its position in the sequence.
2. Multi-Head Attention
• This is the heart of the Transformer model. In multi-head attention, the input is
transformed into three different matrices: Query (Q), Key (K), and Value (V). Each "head"
attends to different parts of the input sequence, allowing the model to capture various
relationships within the data. These heads run attention mechanisms in parallel, and their
results are concatenated before passing to the next layer.
Components of the Encoder Block
3. Layer Normalization
• Layer normalization is applied to stabilize and speed up the training process by
normalizing the input across the batch.
4. Feed-Forward Neural Network
• After multi-head attention, the data passes through a fully connected feed-forward neural
network, which applies additional transformations to enrich the representation.
5. Residual (Skip) Connections
• Residual connections are added around the multi-head attention and feed-forward
layers. These connections help prevent vanishing gradients and improve the flow of
information through the network by allowing the output of one layer to "skip" ahead to
later layers. This is often referred to as a skip connection and ensures that even deep
models can be trained effectively.
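The data flow through one encoder block (attention, then residual connection plus layer normalization, then the feed-forward network with another residual and normalization) can be sketched as below. This is a simplified single-head version with randomly initialized weights, not a full implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to stabilize training.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Trainable weights (random here): Q/K/V projections and the feed-forward layers.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def encoder_block(x):
    # 1) Self-attention over the whole sequence in parallel.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    x = layer_norm(x + attn)             # residual (skip) connection + layer norm
    # 2) Position-wise feed-forward network with ReLU non-linearity.
    ff = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ff)            # residual + layer norm again

x = rng.normal(size=(seq_len, d_model))  # embeddings + positional encodings
out = encoder_block(x)
print(out.shape)   # (5, 16) - same shape in and out, so blocks can be stacked
```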
Multi-Head Attention Process Explanation
The Multi-Head Attention is a crucial part of the Transformer model. It allows the model to attend to
different positions in the sequence at the same time, enabling it to understand relationships
between words in a more dynamic way. Here's how it works
How does it work?
The attention mechanism takes the input and splits it into three matrices
• Q (Query): Represents the current word or position the model is attending to.
• K (Key): Represents the words or positions the model is comparing against.
• V (Value): Contains the information related to each word in the input.
These matrices are multiplied in pairs, and the Scaled Dot-Product between Q and K is used to
calculate how much attention should be given to each word.
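Concretely, this is the scaled dot-product attention operation from the original paper:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where d_k is the dimensionality of the key vectors, and the softmax is applied row-wise so each query's attention weights form a probability distribution over the sequence.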
Scaled Dot-Product Attention
The Scaled Dot-Product Attention is the mathematical operation behind the attention
mechanism. It measures how relevant or "important" each word is to the current word by
calculating the dot product of Q (Query) and K (Key). Here's the process step by step
Steps in Scaled Dot-Product Attention
Steps in Scaled Dot-Product Attention
1. Tokenization and Embedding
• First, the input words (e.g., "I", "love", "tennis") are tokenized into smaller units (tokens).
• After tokenization, each token is converted into embeddings (vector representations of
words) which carry semantic meaning.
• Positional encoding is then added to these embeddings to retain the order of the words
Steps in Scaled Dot-Product Attention
2. Initial Weights for Query, Key, and Value (Q, K, V)
• The Transformer model initializes weights for each of the three matrices: Query (Q), Key
(K), and Value (V).
• These weights are multiplied by the embeddings to form the Q, K, and V vectors for each
word in the sequence.
• The weights are trainable, which means they get updated during training as the model
learns.
Steps in Scaled Dot-Product Attention
3. Dot Product of Q and K
• Once Q, K, and V are created, the first key operation is the dot product of the Q (Query)
and K (Key) vectors for each word in the sequence.
• The dot product measures the similarity between the words. Words that are more related
to each other will have higher dot product values.
Steps in Scaled Dot-Product Attention
4. Scaling the Dot Product
• The dot product results are scaled by dividing them by the square root of the dimensionality
of the keys, usually denoted as √d_k. This step prevents the values from becoming too
large, which would make the gradients too small and slow down learning.
• After scaling, the values are normalized to ensure that the model can handle larger
sequences more efficiently.
Steps in Scaled Dot-Product Attention
5. Softmax Function for Attention Scores
• Next, the scaled dot product results are passed through a softmax function.
• Softmax converts the scores into a probability distribution, where the sum of the scores for
each word adds up to 1. This helps the model determine how much focus (attention) to give
each word in the input sequence.
Steps in Scaled Dot-Product Attention
6. Attention Output Calculation
• After obtaining the attention scores, each score is used to weight the corresponding V
(Value) vector for each word.
• The weighted sum of the value vectors gives the attention output, which contains the final
contextual representation of each word in the input sequence.
• This final attention output is then passed to the next layer in the Transformer model.
Steps in Scaled Dot-Product Attention
1. Vector of Vectors (Step 1)
• You have a sentence with 4 words, each represented by a
vector. To calculate attention, create copies of the
vectors (query, key, and value vectors).
2. Dot Product & Softmax (Step 2)
• Perform the dot product between the query and each key
vector. Multiply the result with the corresponding value
vector.
• Pass the results through softmax to turn them into a
probability distribution. This ensures the scores sum to
1, as shown in the matrix.
Steps in Scaled Dot-Product Attention
3. Multiply Weights with Vectors (Step 3)
• You have 4 weights for each word in the sentence. These weights are the attention
scores calculated for the word you're focusing on (in this case, 𝑋2).
• Multiply these weights by the value (V) vectors. The result of this multiplication gives
the weighted vectors.
Sum of Weighted Vectors

• After multiplying the weights with the value vectors, take the sum of these weighted vectors.
• This sum gives the final weighted sum, which captures the attention-based context for the
word 𝑋2 in relation to all the other words in the sequence.
Final Output
• This weighted sum is then used as the output for the current word, which incorporates
information from the rest of the sentence, highlighting how 𝑋2 relates to other words
in the sequence.
Cosine Similarity
Cosine similarity is a way to measure the similarity between two vectors by calculating the
cosine of the angle between them. This metric is particularly useful in Natural Language
Processing (NLP) and many other fields to determine the semantic similarity between two
documents or word embeddings.
Cosine Similarity
Why Cosine Similarity is Important

• Cosine similarity measures how similar two vectors are by looking at the angle (θ)
between them, not their magnitude (size).
• It's particularly robust in NLP, as it helps in comparing the direction of vectors without
being affected by their magnitude.
Range
Cosine similarity always produces a value between -1 and 1
• 1 means the vectors are perfectly aligned (the same direction).
• 0 means the vectors are orthogonal (no similarity).
• -1 means the vectors are perfectly opposite (completely dissimilar).
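The three cases in the range above can be checked directly with a few toy vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalized dot product: compares direction, ignores magnitude.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))    # ≈ 1.0  (same direction)
print(cosine_similarity(a, -a))   # ≈ -1.0 (opposite direction)
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))   # ≈ 0.0 (orthogonal)
```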
Cosine Similarity
How Cosine Similarity Works
1. Angle (θ) Between Vectors
• The angle (θ) between two vectors (A and B) represents
how similar they are
• If θ = 0 degrees, the vectors are identical or have high similarity.
• As θ increases, the similarity decreases.
• If θ = 90 degrees, the vectors are perpendicular and have no similarity.
• If θ = 180 degrees, the vectors are completely opposite and have a similarity of -1.
Cosine Similarity
How Cosine Similarity Works
2. Dot Product in Cosine Similarity
• The dot product of two vectors measures how much they
point in the same direction
• Cosine similarity is essentially the normalized dot product

• Here, the dot product of vectors A and B is divided by the product of their
magnitudes (lengths) to give a value between -1 and 1.
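As a formula:

cos(θ) = (A · B) / (‖A‖ × ‖B‖)

where A · B is the dot product and ‖A‖, ‖B‖ are the vectors' magnitudes.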
Cosine Similarity
Examples of Cosine Similarity
Why Did We Explain Cosine Similarity Here in
Relation to NLP and Why Do We Multiply Each
Vector with Another Vector?
1. Cosine Similarity in NLP
• Cosine similarity measures how similar two vectors are by calculating the dot product
between them, giving us an idea of the angle between them. In NLP, we use cosine similarity
to understand how semantically similar two word vectors are.
• Each word is represented as a vector, and cosine similarity helps us compare how closely
related two words are based on their vector representations.
2. Vector Multiplication
• In the context of NLP, multiplying vectors using the dot product is a way of determining how
related or close two words are. For instance, multiplying the vector for "pasta" with other
food-related vectors can help the model understand which words are most semantically
related to "pasta."
Now That We Understand Why We Multiply Vectors,
Let's Understand the Query, Key, and Value and Their
Role in This Example
1. Query, Key, and Value Example
• In this example, the query might be the word "pasta".
• The query (Q) is compared to several key (K) vectors to see which key is most relevant to the query.
• Each key is linked to a corresponding value (V). The attention mechanism works by using the query
to select the key, and the selected key returns the associated value.
2. Softmax to Find the Most Relevant Value
• The dot product between the query and each key produces scores that represent the relevance of
the query to each key.
• These scores are passed through softmax, which normalizes the scores into a probability
distribution. This distribution helps the model pick the most relevant key.
Now That We Understand Why We Multiply Vectors,
Let's Understand the Query, Key, and Value and Their
Role in This Example
3. Returning the Value
• Based on the attention scores, the model retrieves the value associated with the key that has the
highest score. For example, if the key "food" has the highest attention score for the query "pasta," the
associated value (e.g., "rice") would be returned.

4. Purpose of Query-Key-Value Mechanism


• This mechanism allows the model to focus on the most relevant parts of the input. For instance,
given the query "pasta," the model focuses on finding the most relevant key and retrieving its value.
There’s a Strong Relationship Between Multiplying
Vectors in This Example and What We Expect as an
Output Below!
1. Why Do We Need to Train the Weights?
• The weights for Q, K, and V are trainable, meaning they are updated during backpropagation. This is
important for the model to learn which relationships between words are the most meaningful.
• As the model trains, the weights get updated to better focus on the most important relationships
between words in the input sequence.
2. Visualization of the Self-Attention Mechanism
• In the self-attention diagram, you can see how the query, key, and value vectors are used in
practice. After calculating the attention scores (by multiplying vectors), the model produces an
output based on these weighted scores, allowing the model to focus on the relevant parts of the
sequence.
Summary
• Cosine Similarity: This is used in NLP to calculate how similar two word vectors are, based
on the angle between them.

• Query-Key-Value Mechanism: The model uses a query to find the most relevant key, and
based on this key, retrieves the associated value.

• Self-Attention: The model learns how to attend to different parts of the sequence by
training the weights for Q, K, and V. It then returns the weighted output based on the
attention scores.
Understanding Scaling in Self-Attention
Why Do We Use Scaling?
• In self-attention, when we compute the dot product between the query (Q) and key (K) vectors,
the result can become very large, especially when the dimension of the vectors (d) is high.

• If these values become too large, they will cause issues with the softmax function, leading to
very small gradients, which can slow down learning. To prevent this, we scale the dot product by
dividing it by the square root of the dimension √d.
How Scaling Works?
Scaling the Dot Product
• The dot product of Q and K is divided by √d to ensure that the resulting values remain
at a reasonable scale. This avoids making the softmax function too sensitive to large
values, which could cause it to produce a skewed output.

• By scaling, we ensure that the values passed into the softmax function don’t
dominate each other, leading to a smoother probability distribution and more
balanced attention.
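A quick numerical sketch shows the effect. With random vectors (assumed here only for illustration), unscaled dot products grow with the dimension d, and softmax over them becomes peaky; dividing by √d flattens the distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d = 64
rng = np.random.default_rng(0)
q = rng.normal(size=d)            # a random query vector
k = rng.normal(size=(5, d))       # five random key vectors

raw    = k @ q                    # unscaled scores: magnitude grows with d
scaled = raw / np.sqrt(d)         # scaled scores stay in a reasonable range

p_raw    = softmax(raw)           # tends toward a spike on one key
p_scaled = softmax(scaled)        # smoother, more balanced distribution
```

Dividing the logits by √d acts like raising the softmax temperature: the largest probability can only shrink, so attention is spread more evenly across the keys.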
Why is Scaling Important for Softmax?
Effect of Scaling on Softmax
• Without scaling, when Q and K are large, their dot product would be very high,
leading to large values being passed into softmax. This would make softmax assign
very high probabilities to a few elements and very low probabilities to others.

• Scaling keeps the dot product values in a more manageable range, which results in a
more balanced softmax output and better attention distribution across all elements.
Why is Scaling Important for Softmax?
High Scores and Vanishing Gradients
• When the attention scores are very large, softmax saturates and its gradients tend to
vanish, which slows down training. Scaling the dot product helps prevent this problem
by keeping the scores smaller and the gradients more stable.
• For instance, if the embedding dimension is d = 512, we scale the dot product
of Q and K by dividing it by √512, which is approximately 22.6.

• This scaling ensures that the values fed into softmax remain in a reasonable
range, avoiding large probabilities and leading to more stable training.
Summary
• Scaling in self-attention ensures that the dot product between Q and K doesn't result in
overly large values, which would distort the softmax function.

• By dividing the dot product by √d, we keep the attention scores balanced, avoid
vanishing gradients, and stabilize the training process.

• This is an important step in ensuring that the model learns effectively without assigning
too much focus to a few tokens.
Self-Attention Mechanism with Scaling
1. Input Embedding
• Each word in the input sequence (e.g., 𝑋1,𝑋2,𝑋3,𝑋4) is first converted into a vector
embedding. These embeddings represent the semantic meaning of each word.
2. Creating Query, Key, and Value Vectors
• Each word's embedding is passed through three different weight matrices to produce
three vectors:
• Q (Query)
• K (Key)
• V (Value)
• These vectors are used to perform the attention calculation.
3. Dot Product for Similarity
The dot product between Q and K is computed to measure the similarity between each word
in the sequence with respect to the word currently being processed (in this case, 𝑋2).
4. Scaling

• After the dot product is computed, the result is scaled by dividing it by √d_k
(where d_k is the dimensionality of the key vectors). This step ensures that the
values passed to the softmax function are in a manageable range, preventing
large values from dominating the attention weights.
5. Softmax Function
The scaled dot product is passed through the softmax function to convert the similarity
scores into probabilities. These probabilities represent how much attention each word in the
sequence should receive relative to the current word (𝑋2).

6. Weighted Sum

• The resulting attention scores are then used to compute a weighted sum of the
value vectors (V). This weighted sum gives a contextually aware representation
of the word, taking into account the importance of all other words in the
sequence.
7. Final Output
The weighted sum is passed through another linear layer to produce the attention output for
the word being processed. This output is then used in the following layers of the Transformer.
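Steps 2 through 6 of this pipeline can be sketched in NumPy. The inputs and weight matrices below are random toy values (step 1's embeddings and step 7's final linear layer are assumed to come from elsewhere in the model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Project to Q/K/V, score, scale, softmax, and take the weighted sum."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # step 2: Q, K, V vectors
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)         # steps 3-4: similarity + scaling
    w = softmax(scores, axis=-1)              # step 5: attention probabilities
    return w @ V, w                           # step 6: weighted sum of values

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 8))                   # 4 token embeddings (X1..X4), dim 8
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]

out, weights = self_attention(X, Wq, Wk, Wv)  # out: contextual representations
```

Each row of `weights` sums to 1 and says how much the corresponding token attends to every token in the sequence, including itself.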
Multi-Head Self-Attention
What is Multi-Head Self-Attention?
• In self-attention, the model attends to every word in a sequence and computes the
relationships between them to capture context. Multi-head self-attention takes this
one step further by running multiple attention mechanisms in parallel.

• This allows the model to learn different types of relationships in the sequence at the
same time. Each "head" focuses on different parts of the input.
Parallel Processing
• Unlike traditional models (like RNNs) that process sequences word by word, multi-head
self-attention processes all the words in the sequence at the same time, in parallel.

• This parallel processing greatly speeds up computation, especially for long sequences.
Each word in the sequence is passed through the attention mechanism simultaneously.
Key Terminologies in Multi-Head Attention
1. Self-Attention
• Self-attention is a mechanism that allows each word (or token) in a sequence to attend to
every other word. It computes a weighted sum of all the words to understand the
relationships between them.
2. Multiple Heads
• In multi-head attention, the attention mechanism is applied multiple times, with each
attention mechanism called a head. Each head processes the sequence in parallel,
allowing the model to capture different aspects or relationships between the words.
3. Query, Key, and Value

• Each head has its own set of Query (Q), Key (K), and Value (V) matrices, which are used to
calculate attention for the sequence.
4. Output from Each Head
• The input sequence is processed by each head independently. The output of each head
represents a specific relationship between the words in the sequence. Each head generates
its own attention output.
5. Concatenation and Linear Transformation
• After all heads have computed their individual outputs, these outputs are concatenated
(joined together) and passed through a linear transformation to combine them. This allows
the model to integrate the various relationships captured by each head into a final output.
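Points 1 through 5 fit in a short sketch: each head gets its own Q/K/V matrices, the head outputs are concatenated, and a final linear map combines them. All sizes and weights here are made-up toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    outputs = []
    for Wq, Wk, Wv in heads:                      # each head: its own Q, K, V
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = (Q @ K.T) / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores) @ V)       # per-head attention output
    concat = np.concatenate(outputs, axis=-1)     # join the head outputs
    return concat @ Wo                            # final linear transformation

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 8, 2, 4                # narrow: d_head = d_model / n_heads
X = rng.normal(size=(3, d_model))                 # 3 tokens
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(X, heads, Wo)          # back to (3, d_model)
```

The loop runs sequentially here for clarity; in practice all heads are computed in parallel as one batched matrix operation.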
Why Does Each Layer Share the Same Vector and
Output, But It Behaves Differently in Every Layer?
1. Initial Weights in Each Layer
• The weights for each layer are initialized differently. These initial weights change how each layer
interprets the input, even if the input vectors are the same.

2. Concatenation and Architectures

• After processing, the output of the heads is concatenated. However, the model architecture can
differ, leading to two types:

• Wide Architecture: Outputs are wider, meaning more information is retained.


• Narrow Architecture: Outputs are more compressed, focusing on specific, refined
information.
3. Wide Architecture
• In the Wide architecture, multiple layers process the inputs in parallel, allowing for a broader
exploration of relationships between inputs. This leads to a larger size for the linear
transformation after attention. This is useful in tasks like classification, where a large amount
of input detail is required.
4. Narrow Architecture
• In contrast, the Narrow architecture compresses the information from multiple heads. Instead
of having many wide vectors, each head produces compressed vectors, which are
concatenated and passed through a linear transformation.

• For example, if the model dimension is 1000, each head might output a smaller vector
(e.g., of size 100); these are concatenated to form the final output. This helps in reducing
dimensionality and focusing on more refined aspects of the input.
5. Final Output
• The Narrow architecture results in a more refined and compressed output, making it easier to
classify or use in downstream tasks.
What Happens in the Norm Layer
1. Positioning of the Norm Layer
• After the outputs of multi-head attention are concatenated and projected, the model applies
normalization. This layer works together with a residual (skip) connection: the block's input is
added back to its output, and the sum is normalized (the "Add & Norm" step).
2. Normalization Process
• The model applies layer normalization, which differs from methods like standard normalization
or min-max normalization. In layer normalization, the mean and variance are computed across
the layer to stabilize training.
• A learnable scale (and shift) is applied in this normalization step to ensure the output is properly
distributed and easier for the model to process in subsequent layers.
Why is Normalization Important?
1. Faster Training
• Normalization helps speed up training by preventing extremely large or small
values from affecting the learning process.
2. Reduce Bias
• It helps in reducing bias by keeping the activations in a balanced range, ensuring
that no one input dominates the model's learning.
3. Prevents Weight Explosion
• By controlling the range of outputs, normalization prevents the weights from
exploding, which can destabilize the model.
Difference Between Layer Normalization and Batch Normalization
1. Layer Normalization vs. Batch Normalization
• Layer Normalization normalizes across the features for each input independently, while
Batch Normalization normalizes across the batch, meaning it calculates the mean and
variance for a whole batch of inputs rather than just one input at a time.
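The difference is just which axis the statistics are computed over. A minimal sketch, with a tiny made-up batch of two inputs and three features:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],     # rows = inputs in the batch
              [2.0, 4.0, 6.0]])    # columns = features

def layer_norm(X, eps=1e-5):
    # Layer Norm: normalize each input (row) across its own features
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / (sd + eps)

def batch_norm(X, eps=1e-5):
    # Batch Norm: normalize each feature (column) across the batch
    mu = X.mean(axis=0, keepdims=True)
    sd = X.std(axis=0, keepdims=True)
    return (X - mu) / (sd + eps)

ln = layer_norm(X)   # every row now has mean 0, variance ~1
bn = batch_norm(X)   # every column now has mean 0, variance ~1
```

Because Layer Norm depends only on the current input, it works the same regardless of batch size, which is one reason Transformers use it instead of Batch Norm.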
What is the Difference?
1. Batch Normalization
• If we consider the example where the vector contains four words: "Popcorn", "Popped", "Tea",
and "Steeped," Batch Normalization works by normalizing each column independently
across the batch of data.

• For example, in the "Before" column for "Popcorn", the values are 0.14, 0.14, etc., which after
normalization become 0.94, 0.93, etc. (normalized across the batch).
2. Layer Normalization
• In contrast, Layer Normalization works within each block or input, normalizing across
features rather than across the batch. It ensures that each block has a mean of zero and a
variance of one.

• In the "Before" state for "Popcorn," the values are normalized within the block itself to give a
normalized output in the "After" state.
Masked Head Attention
1. Why Masking is Applied in the Decoder
• In the decoder, masking is applied to prevent the model from seeing future words: the
model must not use information from future positions when generating the current word in
the sequence. The encoder does not need this masking step, since it may attend to the
entire input sequence at once.
2. Parallel Processing in the Decoder
• In the decoder, when predicting the next word in a sequence, masking prevents the model
from attending to future words. This makes sure that only past words are considered.
However, the encoder does not apply masking since it processes the entire input sequence
in parallel.
3. Why Softmax with Masking
• When masked, future words are assigned a very low score (negative infinity) so that after
applying softmax, the model gives a probability of zero for those future words. This way, the
model never considers those masked words when making predictions.
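This masking step is easy to show concretely. The score matrix below is a toy example; the causal mask sets every future position to negative infinity before softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores for a 3-token sequence (rows = query position)
scores = np.array([[1.0, 2.0, 3.0],
                   [0.5, 1.5, 2.5],
                   [2.0, 1.0, 0.0]])

# Causal mask: position i may only attend to positions <= i
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
masked = np.where(mask, -np.inf, scores)                # future words -> -infinity
weights = softmax(masked, axis=-1)                      # -inf becomes probability 0
```

The first row of `weights` puts all its probability on the first token, the second row spreads it over the first two tokens, and so on down the triangle.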
How Masked Self-Attention Works
1. Calculating Attention
• Similarity is still calculated using the dot product of the query and key vectors, but in
masked attention the scores for future words are set to negative infinity, so those positions
are effectively excluded once softmax is applied.
2. Scaling
• The attention values are still scaled to prevent large dot product values from causing
imbalanced attention scores. This ensures stable training and prevents some values from
dominating the attention weights.
Translation Scenario Using Transformers
1. Translation Example
• In a translation task, such as converting "I love you very much" (English) into "Ti amo molto"
(Italian), the Transformer model takes the sentence in the source language as input and
produces the sentence in the target language as output.
Training Phase
How Training Works
• During training, both the input (source sentence) and output (target sentence) are
available to the model. The source sentence goes through the encoder, while the target
sentence is passed through the decoder.

• Special tokens such as start and end tokens are used to mark the beginning and end of
the target sentence.

• The model is trained to predict the next word in the target sequence based on the previous
words and the encoded source sentence.
Inference Phase
1. How Inference Works
• During inference, the model does not have access to the entire target sentence. It must generate
the target sentence one word at a time. The process begins with the start token and the model
produces the first word.

• After generating the first word, the model uses that word as input for the next step to generate the
second word, and so on, until it produces the end token.
2. Greedy vs Beam Search
• In Greedy decoding, the model selects the word with the highest probability at each step.

• Beam search is a more advanced technique that considers multiple possible words at each step
and selects the best sequence based on overall probability.
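Greedy decoding can be sketched without any real model. Here `next_word_probs` is a hypothetical stand-in for the decoder's softmax output at each step, hard-coded to walk through the "Ti amo molto" example:

```python
def greedy_decode(next_word_probs, start="<start>", end="<end>", max_len=10):
    sequence = [start]
    for _ in range(max_len):
        probs = next_word_probs(sequence)     # P(next word | words so far)
        word = max(probs, key=probs.get)      # greedy: take the most likely word
        sequence.append(word)                 # feed it back in as input
        if word == end:
            break
    return sequence

# Toy stand-in for the decoder: a fixed distribution at each step
table = {1: {"Ti": 0.9, "Io": 0.1},
         2: {"amo": 0.8, "odio": 0.2},
         3: {"molto": 0.7, "poco": 0.3},
         4: {"<end>": 0.99, "molto": 0.01}}

def next_word_probs(sequence):
    return table[len(sequence)]

result = greedy_decode(next_word_probs)
# result == ["<start>", "Ti", "amo", "molto", "<end>"]
```

Beam search would instead keep the k best partial sequences at each step and rank the completed candidates by total (log-)probability, which can recover sequences that greedy decoding misses.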
Summary
• Training: During training, the model has access to both input and output sequences, and it learns to
predict the next word in the target sequence.

• Inference: During inference, the model generates the target sequence word by word, using either
Greedy or Beam Search for decoding.
