Artificial Intelligence 2
CI 2 - S3
Pr. Hamza Alami
Academic year: 2024/2025
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
2 Hamza Alami CI2-S3 2024/2025
Recap
• When should we use attention in NNs?
• What are the different types of attention we have previously seen?
• In which cases can the ReLU activation cause problems, and what is the
alternative?
• What is a seq2seq model?
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Structured self attention
• How can we use attention when our model is not seq2seq ?
• Example: an encoder reads the sentence "The greater the obstacle, the more glory in overcoming it" (Molière) and produces hidden states
Structured self attention
• As before, the context is a weighted sum of the hidden states:

c_t = ∑_{i=1}^{n} α_i·h_i

α_1, …, α_n = Softmax(g(s_{t−1}, h_1), …, g(s_{t−1}, h_n))

where g is a scoring function and h_1, …, h_n are the hidden states of "The greater the obstacle … it"
• But here we don't have a decoder state s_{t−1}
• Considering the following sentence:
"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
Structured self attention
• The sentence contains multiple components or aspects
• We need multiple hops of attention to focus on the various components of the sentence
"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
How can we make the NN focus on multiple parts effectively?
• Aspect 1: "The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill."
• Aspect 2: "However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
Structured self attention
• The paper entitled "A Structured Self-Attentive Sentence Embedding"
introduced a method to attend to various parts of a sentence and make the
decisions of the NN more interpretable
• Instead of using a vector to represent a sentence, the paper uses a matrix
• Each vector in the matrix attends to a different component (aspect) of
the sentence
Structured self attention
• Single attention hop (vector embedding):

sentence_embedding = ∑_{i=1}^{n} α_i·h_i,  with α_1, …, α_n = Softmax(w_attention·tanh(W·H))

• Multiple attention hops (matrix embedding):

sentence_embedding = A·H,  with A = Softmax(W_attention·tanh(W·H))

where H = [h_1, …, h_n] gathers the hidden states and A is a k×n matrix: row j holds the attention weights α_{1j}, …, α_{nj} of hop j
• What could go wrong?
Structured self attention
• What if the model learns the same attention weights for every hop?
• To ensure that the model learns to attend to different components, they
added a penalization term to the loss:

P = ‖A·Aᵀ − I‖_F²

• Like the L1 and L2 penalization terms, we multiply P by a coefficient and
then add it to the loss
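A NumPy sketch of the structured self-attention matrix A and the penalization term; all dimensions and weight matrices here are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: n tokens, hidden size d, k attention hops.
n, d, k, d_a = 6, 8, 3, 10
rng = np.random.default_rng(0)
H = rng.normal(size=(d, n))              # hidden states h_1..h_n as columns
W = rng.normal(size=(d_a, d))            # first projection
W_attention = rng.normal(size=(k, d_a))  # one scoring row per hop

# A: (k, n) attention matrix, softmax over the n tokens of each hop
A = softmax(W_attention @ np.tanh(W @ H), axis=1)

# Matrix sentence embedding: each of the k rows attends to one aspect
sentence_embedding = A @ H.T             # shape (k, d)

# Penalization P = ||A·Aᵀ − I||_F² pushes the hops to focus on
# different components (rows of A close to orthogonal)
P = np.linalg.norm(A @ A.T - np.eye(k), ord="fro") ** 2
```
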
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Normalisation layers (Modules)
• Usually we normalize data before feeding it to the NN
• How about introducing some normalization inside the NN?
• This is the idea behind:
• Batch normalization
• Layer normalization
Where to apply normalization ?
X → Weighted Sum → Activation → … → Weighted Sum → Activation → Output
Standard neural network architecture
X → Normalization → Weighted Sum → Activation → … → Normalization → Weighted Sum → Activation → Output
Apply normalization at the beginning of the weighted sum layer
X → Weighted Sum → Normalization → Activation → … → Weighted Sum → Normalization → Activation → Output
Apply normalization at the end of the weighted sum layer
Batch Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply batch normalization (over the batch) for a single feature following:

normalized_batch = x̂·γ + β

where γ (scale) and β (shift) are learnable parameters, and

x̂ = (x − μ) / σ

μ = (∑_{i=1}^{batch_dim} x_i) / batch_dim ; and σ² = (∑_{i=1}^{batch_dim} (x_i − μ)²) / batch_dim
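A NumPy sketch of these formulas, with random data and identity scale/shift (γ = 1, β = 0); the small `eps` guards against division by zero, as real implementations do:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0)
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return X_hat * gamma + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # (batch_dim, features_dim)
gamma, beta = np.ones(4), np.zeros(4)             # learnable scale and shift
out = batch_norm(X, gamma, beta)
# Each feature of `out` now has mean ≈ 0 and standard deviation ≈ 1
```
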
Batch Normalization
• Batch normalization is sensitive to batch size
• During inference, we should remove the dependence on the batch: we
should not compute μ and σ as in training
• While training, a running μ and σ are computed; these quantities are used
at inference time:

μ_running = (1 − α)·μ + α·μ_running
σ_running = (1 − α)·σ + α·σ_running
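A small simulation of the running statistics, assuming α = 0.9 and synthetic batches drawn from a fixed distribution; after training, the running values approximate the data's true mean and standard deviation:

```python
import numpy as np

# Exponential moving averages of the batch statistics, as on the slide:
# running = (1 − alpha)·batch_stat + alpha·running
alpha = 0.9
mu_running, sigma_running = 0.0, 1.0

rng = np.random.default_rng(0)
for _ in range(200):                    # simulated training batches
    batch = rng.normal(loc=2.0, scale=0.5, size=64)
    mu_running = (1 - alpha) * batch.mean() + alpha * mu_running
    sigma_running = (1 - alpha) * batch.std() + alpha * sigma_running

# At inference, normalize with mu_running and sigma_running instead of
# the statistics of the (possibly tiny) inference batch.
```
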
Layer Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply layer normalization (over the features) for a single element following:

normalized_element = x̂·γ + β

where γ (scale) and β (shift) are learnable parameters, and

x̂ = (x − μ) / σ

μ = (∑_{i=1}^{features_dim} x_i) / features_dim ; and σ² = (∑_{i=1}^{features_dim} (x_i − μ)²) / features_dim
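The same sketch for layer normalization: the only change from batch normalization is the axis over which μ and σ are computed (features instead of batch), so it works even for tiny batches:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    # Normalize each element over the feature dimension (axis 1)
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return X_hat * gamma + beta

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))             # works even with batch_dim = 2
gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(X, gamma, beta)
# Each row of `out` now has mean ≈ 0 and standard deviation ≈ 1
```
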
Layer Normalization
• Unlike batch normalization, layer normalization is insensitive to batch size
• Layer normalization is preferred over batch normalization when:
• The batch size is small
• An RNN architecture is used
Normalization is mysterious
• Normalization does not impact the capacity of a NN model; however, it
can improve its performance
• For instance, we can have a model with large capacity that performs poorly
on a task. Integrating normalization layers can improve its performance
without impacting its capacity
Normalization is mysterious
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Transformers
• Transformers were proposed in the "Attention Is All You Need"
paper, 2017
• Transformers were proposed to solve the machine translation task
• Transformers are highly parallelizable, which is similar in spirit to the
"Pandemonium: A Paradigm for Learning" paper, 1958
Transformers
• Can we build a model that:
• Catches long-range dependencies without using RNNs (avoiding their problems)
• Applies the attention mechanism
• Is highly parallelizable, to take advantage of recent advances in
processing units
Transformers
• Transformers are seq2seq models: they take a sentence as input and
generate a sentence
<START> كلما كان العائق أكبر كلما كان المجد في التغلب عليه أكبر <END> (the Arabic translation of the input sentence)
Decoder
Encoder
Embeddings
The greater the obstacle the more glory in overcoming it
Transformers (Encoder)
• The encoder projects the input matrix into queries Q, keys K, and values V, then computes scaled dot-product attention:

Attention(Q, K, V) = Softmax(Q·Kᵀ / √d_k)·V

Matrix → (Q, K, V) → Q·Kᵀ → Normalize (÷ √d_k) → Softmax → weighted sum of V
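The encoder's attention computation can be sketched in NumPy; dimensions and projection weights are arbitrary placeholders, and the division by √d_k implements the normalization step:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Q·Kᵀ, normalized by √d_k
    weights = softmax(scores, axis=-1)  # softmax over the keys
    return weights @ V, weights         # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 16
X = rng.normal(size=(n, 32))            # input matrix: n tokens
Wq, Wk, Wv = (rng.normal(size=(32, d)) for d in (d_k, d_k, d_v))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```
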
Transformers (Encoder - Multi-head)
• The input matrix is projected into k sets of queries, keys, and values: (Q_1, K_1, V_1), …, (Q_k, K_k, V_k)
• Each head computes its own attention:

Head_h = Softmax(Q_h·K_hᵀ / √d_k)·V_h

• The head outputs are concatenated:

MHAttention = Concat(Head_1, …, Head_k)
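A NumPy sketch of multi-head attention with k heads whose outputs are concatenated; the head count and dimensions are illustrative, and the random matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    outputs = []
    for Wq, Wk, Wv in heads:            # one (Q, K, V) projection per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores, axis=-1) @ V)
    return np.concatenate(outputs, axis=-1)  # Concat(Head_1, …, Head_k)

rng = np.random.default_rng(0)
n, d_model, k = 5, 32, 4
d_head = d_model // k                   # each head works in a smaller space
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(k)]
out = multi_head_attention(X, heads)    # shape (n, d_model)
```
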
Transformers (Encoder)
• Encoder block, stacked N times:

Matrix → Multi Head Attention → + (residual) → LayerNorm → Feed Forward → LayerNorm → output

Source: "Attention Is All You Need" paper
Transformers (Decoder - Masked Multi-head)
• Same as the encoder's multi-head attention, with a mask applied before the softmax to prevent attending to future positions:

Head_h = Softmax(Mask(Q_h·K_hᵀ / √d_k))·V_h

MHAttention = Concat(Head_1, …, Head_k)
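The masking step can be sketched as follows: positions above the diagonal get −∞ before the softmax, so each token only attends to itself and earlier tokens (the scores are zeros here just to make the resulting uniform weights easy to check):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))                        # stand-in for Q·Kᵀ/√d_k
# Upper-triangular mask: position i may not attend to positions j > i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                           # masked scores vanish in softmax
weights = softmax(scores, axis=-1)
# Row i is uniform over positions 0..i and exactly 0 afterwards
```
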
Transformers (Decoder)
Source: "Attention Is All You Need" paper
Transformers (Positional encoding)
• How about the position of the token? How do we feed the position
information to the encoder and decoder?
• The authors used the following to build the positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/dim_embeddings))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dim_embeddings))

• The positional embeddings are then added to the token embeddings (first layer)
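A NumPy sketch of the sinusoidal positional encoding, assuming `max_len` and `dim_embeddings` values chosen only for illustration:

```python
import numpy as np

def positional_encoding(max_len, dim_embeddings):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(dim_embeddings // 2)[None, :]    # (1, dim/2)
    angles = pos / 10000 ** (2 * i / dim_embeddings)
    pe = np.zeros((max_len, dim_embeddings))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, dim_embeddings=16)
# The encoding is then added to the token embeddings of the first layer:
# embeddings = token_embeddings + pe[:seq_len]
```
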
Transformers (Positional encoding)
source: [Link]
Transformers reshaped the world
• Transformers have achieved SoTA in various tasks: text classification,
question answering, machine translation, text generation, etc.
• Transformers need large amounts of data and heavy computational
resources for training
• In most cases, transfer learning (fine-tuning) methods are used to make
transformer-based models perform well on specific tasks
GPT
• Generative Pre-Training was introduced in the "Improving Natural
Language Understanding by Generative Pre-Training" paper, 2018
• The idea is to train a language model using the Transformer decoder
block to maximize the autoregressive language-model likelihood
• The same model is fine-tuned on other NLU tasks, which achieved
SoTA in various tasks (back then)
BERT
• Bidirectional Encoder Representations from Transformers (BERT) is
based on the Transformer encoder only
• The idea is to create embeddings of text that can perform well on various
natural language understanding tasks
GPT vs BERT architectures
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Transformers Library
• Transformers is an open-source library designed to make transformers-
based models accessible to the broader machine learning community.
source: Transformers: State-of-the-Art Natural Language Processing, paper
Transformers library
• Each model in the library is defined by three blocks:
• Tokenizer
• Transformer
• Head
Transformer model hub
• A huge model hub containing various open-source pretrained transformer
models is available
Loading a model with transformers
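The loading step can be sketched as follows; `prajjwal1/bert-tiny` is only an example checkpoint (any model name from the hub works the same way), and the two-label classification head is a hypothetical choice:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "prajjwal1/bert-tiny"   # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)        # classification head on top of the encoder

# Tokenizer → Transformer → Head, the three blocks of the library
inputs = tokenizer("The greater the obstacle, the more glory.",
                   return_tensors="pt")
outputs = model(**inputs)            # logits with shape (1, num_labels)
```
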
Configure the training process
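A hypothetical configuration sketch with `TrainingArguments`; the values shown are illustrative defaults, not the course's settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=50,
)
```
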
Starting the training process
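A minimal end-to-end sketch of starting training with `Trainer`, assuming the `prajjwal1/bert-tiny` example checkpoint and a toy four-sentence dataset made up purely for illustration (real training uses a full dataset, e.g. from the `datasets` library):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "prajjwal1/bert-tiny"   # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Toy dataset, only to show the wiring
texts = ["great food", "too noisy", "lovely dishes", "bad experience"]
labels = [1, 0, 1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item
    def __len__(self):
        return len(labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", max_steps=2,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=ToyDataset(),
)
trainer.train()   # runs 2 optimization steps on the toy data
```
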