Artificial Intelligence 2
CI 2 - S3
Pr. Hamza Alami
Academic year: 2024/2025
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
2 Hamza Alami CI2-S3 2024/2025
Recap
• When should we use attention in NNs?
• What are the different types of attention we have previously seen?
• In which cases can the ReLU activation cause problems, and what is the
alternative?
• What is a seq2seq model?
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Structured self attention
• How can we use attention when our model is not seq2seq ?
• Example: an encoder reads the sentence "The greater the obstacle, the more glory in overcoming it" (Molière) and produces hidden states
Structured self attention
• As before, the context is a weighted sum of the hidden states:

c_t = ∑_{i=1}^{n} α_i·h_i

α_1, …, α_n = Softmax(g(s_{t−1}, h_1), …, g(s_{t−1}, h_n))

where g is a scoring function and h_1, …, h_n are the hidden states of "The greater the obstacle … it"
• But here we don't have a decoder state s_{t−1}
• Considering the following sentence:
"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
Structured self attention
• The sentence contains multiple components or aspects
• We need multiple hops of attention to focus on the various components of the sentence
"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
How can we make the NN focus on multiple parts effectively?
• Aspect 1: "The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill."
• Aspect 2: "However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
Structured self attention
• The paper entitled "A Structured Self-Attentive Sentence Embedding"
introduced a method to attend to various parts of a sentence and make the
decisions of the NN more interpretable
• Instead of using a vector to represent a sentence, the paper uses a matrix
• Each vector in the matrix attends to a different component (aspect) of
the sentence
Structured self attention
• Single attention hop (vector embedding):

sentence_embedding = ∑_{i=1}^{n} α_i·h_i,  with α_1, …, α_n = Softmax(w_attention·tanh(W·H))

• Multiple attention hops (matrix embedding):

sentence_embedding = A·H,  with A = Softmax(W_attention·tanh(W·H))

where H = [h_1, …, h_n] gathers the hidden states and A is a k×n matrix: row j holds the attention weights α_{1j}, …, α_{nj} of hop j
• What could go wrong?
Structured self attention
• What if the model learns the same attention weights for every hop?
• To ensure that the model learns to attend to different components, they
added a penalization term to the loss:

P = ‖A·Aᵀ − I‖_F²

• Like the L1 and L2 penalization terms, we multiply P by a coefficient and
then add it to the loss
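A NumPy sketch of the structured self-attention matrix A and the penalization term; all dimensions and weight matrices here are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: n tokens, hidden size d, k attention hops.
n, d, k, d_a = 6, 8, 3, 10
rng = np.random.default_rng(0)
H = rng.normal(size=(d, n))              # hidden states h_1..h_n as columns
W = rng.normal(size=(d_a, d))            # first projection
W_attention = rng.normal(size=(k, d_a))  # one scoring row per hop

# A: (k, n) attention matrix, softmax over the n tokens of each hop
A = softmax(W_attention @ np.tanh(W @ H), axis=1)

# Matrix sentence embedding: each of the k rows attends to one aspect
sentence_embedding = A @ H.T             # shape (k, d)

# Penalization P = ||A·Aᵀ − I||_F² pushes the hops to focus on
# different components (rows of A close to orthogonal)
P = np.linalg.norm(A @ A.T - np.eye(k), ord="fro") ** 2
```
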
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Normalisation layers (Modules)
• Usually we normalize data before feeding it to the NN
• How about introducing some normalization inside the NN?
• This is the idea behind:
• Batch normalization
• Layer normalization
Where to apply normalization ?
X → Weighted Sum → Activation → … → Weighted Sum → Activation → Output
Standard neural network architecture
X → Normalization → Weighted Sum → Activation → … → Normalization → Weighted Sum → Activation → Output
Apply normalization at the beginning of the weighted sum layer
X → Weighted Sum → Normalization → Activation → … → Weighted Sum → Normalization → Activation → Output
Apply normalization at the end of the weighted sum layer
Batch Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply batch normalization (over the batch) for a single feature following:

normalized_batch = x̂·γ + β

where γ (scale) and β (shift) are learnable parameters, and

x̂ = (x − μ) / σ

μ = (∑_{i=1}^{batch_dim} x_i) / batch_dim ; and σ² = (∑_{i=1}^{batch_dim} (x_i − μ)²) / batch_dim
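A NumPy sketch of these formulas, with random data and identity scale/shift (γ = 1, β = 0); the small `eps` guards against division by zero, as real implementations do:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0)
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return X_hat * gamma + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # (batch_dim, features_dim)
gamma, beta = np.ones(4), np.zeros(4)             # learnable scale and shift
out = batch_norm(X, gamma, beta)
# Each feature of `out` now has mean ≈ 0 and standard deviation ≈ 1
```
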
Batch Normalization
• Batch normalization is sensitive to batch size
• During inference, we should remove the dependence on the batch: we
should not compute μ and σ as in training
• While training, a running μ and σ are computed; these quantities are used
at inference time:

μ_running = (1 − α)·μ + α·μ_running
σ_running = (1 − α)·σ + α·σ_running
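A small simulation of the running statistics, assuming α = 0.9 and synthetic batches drawn from a fixed distribution; after training, the running values approximate the data's true mean and standard deviation:

```python
import numpy as np

# Exponential moving averages of the batch statistics, as on the slide:
# running = (1 − alpha)·batch_stat + alpha·running
alpha = 0.9
mu_running, sigma_running = 0.0, 1.0

rng = np.random.default_rng(0)
for _ in range(200):                    # simulated training batches
    batch = rng.normal(loc=2.0, scale=0.5, size=64)
    mu_running = (1 - alpha) * batch.mean() + alpha * mu_running
    sigma_running = (1 - alpha) * batch.std() + alpha * sigma_running

# At inference, normalize with mu_running and sigma_running instead of
# the statistics of the (possibly tiny) inference batch.
```
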
Layer Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply layer normalization (over the features) for a single element following:

normalized_element = x̂·γ + β

where γ (scale) and β (shift) are learnable parameters, and

x̂ = (x − μ) / σ

μ = (∑_{i=1}^{features_dim} x_i) / features_dim ; and σ² = (∑_{i=1}^{features_dim} (x_i − μ)²) / features_dim
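The same sketch for layer normalization: the only change from batch normalization is the axis over which μ and σ are computed (features instead of batch), so it works even for tiny batches:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    # Normalize each element over the feature dimension (axis 1)
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return X_hat * gamma + beta

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))             # works even with batch_dim = 2
gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(X, gamma, beta)
# Each row of `out` now has mean ≈ 0 and standard deviation ≈ 1
```
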
Layer Normalization
• Unlike batch normalization, layer normalization is insensitive to batch size
• Layer normalization is preferred over batch normalization when:
• The batch size is small
• An RNN architecture is used
Normalization is mysterious
• Normalization does not impact the capacity of a NN model; however, it
can improve its performance
• For instance, we can have a model with large capacity that performs poorly
on a task. Integrating normalization layers can improve its performance
without impacting its capacity
Normalization is mysterious
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Transformers
• Transformers were proposed in the "Attention Is All You Need"
paper, 2017
• Transformers were proposed to solve the machine translation task
• Transformers are highly parallelizable, which is similar in spirit to the
"Pandemonium: A Paradigm for Learning" paper, 1958
Transformers
• Can we build a model that:
• Catches long-range dependencies without using RNNs (avoiding their problems)
• Applies the attention mechanism
• Is highly parallelizable, to take advantage of recent advances in
processing units
Transformers
• Transformers are seq2seq models: they take a sentence as input and
generate a sentence
<START> كلما كان العائق أكبر كلما كان المجد في التغلب عليه أكبر <END> (the Arabic translation of the input sentence)
Decoder
Encoder
Embeddings
The greater the obstacle the more glory in overcoming it
Transformers (Encoder)
• The encoder projects the input matrix into queries Q, keys K, and values V, then computes scaled dot-product attention:

Attention(Q, K, V) = Softmax(Q·Kᵀ / √d_k)·V

Matrix → (Q, K, V) → Q·Kᵀ → Normalize (÷ √d_k) → Softmax → weighted sum of V
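The encoder's attention computation can be sketched in NumPy; dimensions and projection weights are arbitrary placeholders, and the division by √d_k implements the normalization step:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Q·Kᵀ, normalized by √d_k
    weights = softmax(scores, axis=-1)  # softmax over the keys
    return weights @ V, weights         # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 16
X = rng.normal(size=(n, 32))            # input matrix: n tokens
Wq, Wk, Wv = (rng.normal(size=(32, d)) for d in (d_k, d_k, d_v))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```
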
Transformers (Encoder - Multi-head)
• The input matrix is projected into k sets of queries, keys, and values: (Q_1, K_1, V_1), …, (Q_k, K_k, V_k)
• Each head computes its own attention:

Head_h = Softmax(Q_h·K_hᵀ / √d_k)·V_h

• The head outputs are concatenated:

MHAttention = Concat(Head_1, …, Head_k)
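A NumPy sketch of multi-head attention with k heads whose outputs are concatenated; the head count and dimensions are illustrative, and the random matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    outputs = []
    for Wq, Wk, Wv in heads:            # one (Q, K, V) projection per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores, axis=-1) @ V)
    return np.concatenate(outputs, axis=-1)  # Concat(Head_1, …, Head_k)

rng = np.random.default_rng(0)
n, d_model, k = 5, 32, 4
d_head = d_model // k                   # each head works in a smaller space
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(k)]
out = multi_head_attention(X, heads)    # shape (n, d_model)
```
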
Transformers (Encoder)
• Encoder block, stacked N times:

Matrix → Multi Head Attention → + (residual) → LayerNorm → Feed Forward → LayerNorm → output

Source: "Attention Is All You Need" paper
Transformers (Decoder - Masked Multi-head)
• Same as the encoder's multi-head attention, with a mask applied before the softmax to prevent attending to future positions:

Head_h = Softmax(Mask(Q_h·K_hᵀ / √d_k))·V_h

MHAttention = Concat(Head_1, …, Head_k)
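The masking step can be sketched as follows: positions above the diagonal get −∞ before the softmax, so each token only attends to itself and earlier tokens (the scores are zeros here just to make the resulting uniform weights easy to check):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))                        # stand-in for Q·Kᵀ/√d_k
# Upper-triangular mask: position i may not attend to positions j > i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                           # masked scores vanish in softmax
weights = softmax(scores, axis=-1)
# Row i is uniform over positions 0..i and exactly 0 afterwards
```
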
Transformers (Decoder)
Source: "Attention Is All You Need" paper
Transformers (Positional encoding)
• How about the position of the token? How do we feed the position
information to the encoder and decoder?
• The authors used the following to build the positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/dim_embeddings))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dim_embeddings))

• The positional embeddings are then added to the token embeddings (first layer)
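A NumPy sketch of the sinusoidal positional encoding, assuming `max_len` and `dim_embeddings` values chosen only for illustration:

```python
import numpy as np

def positional_encoding(max_len, dim_embeddings):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(dim_embeddings // 2)[None, :]    # (1, dim/2)
    angles = pos / 10000 ** (2 * i / dim_embeddings)
    pe = np.zeros((max_len, dim_embeddings))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, dim_embeddings=16)
# The encoding is then added to the token embeddings of the first layer:
# embeddings = token_embeddings + pe[:seq_len]
```
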
Transformers (Positional encoding)
source: [Link]
Transformers reshaped the world
• Transformers have achieved SoTA in various tasks: text classification,
question answering, machine translation, text generation, etc.
• Transformers need large amounts of data and heavy computational
resources for training
• In most cases, transfer learning (fine-tuning) methods are used to make
transformer-based models perform well on specific tasks
GPT
• Generative Pre-Training was introduced in the "Improving Natural
Language Understanding by Generative Pre-Training" paper, 2018
• The idea is to train a language model using the Transformer decoder
block to maximize the autoregressive language-model likelihood
• The same model is fine-tuned on other NLU tasks, which achieved
SoTA in various tasks (back then)
BERT
• Bidirectional Encoder Representations from Transformers (BERT) is
based on the Transformer encoder only
• The idea is to create embeddings of text that can perform well on various
natural language understanding tasks
GPT vs BERT architectures
Outline
1. Recap
2. Self Structured Attention
3. Pump your NNs performance
4. Transformers
5. Transformers the python library
Transformers Library
• Transformers is an open-source library designed to make transformers-
based models accessible to the broader machine learning community.
source: Transformers: State-of-the-Art Natural Language Processing, paper
Transformers library
• Each model in the library is defined by three blocks:
• Tokenizer
• Transformer
• Head
Transformer model hub
• A huge model hub containing various open-source pretrained transformer
models is available
Loading a model with transformers
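The loading step can be sketched as follows; `prajjwal1/bert-tiny` is only an example checkpoint (any model name from the hub works the same way), and the two-label classification head is a hypothetical choice:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "prajjwal1/bert-tiny"   # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)        # classification head on top of the encoder

# Tokenizer → Transformer → Head, the three blocks of the library
inputs = tokenizer("The greater the obstacle, the more glory.",
                   return_tensors="pt")
outputs = model(**inputs)            # logits with shape (1, num_labels)
```
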
Configure the training process
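A hypothetical configuration sketch with `TrainingArguments`; the values shown are illustrative defaults, not the course's settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=50,
)
```
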
Starting the training process
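A minimal end-to-end sketch of starting training with `Trainer`, assuming the `prajjwal1/bert-tiny` example checkpoint and a toy four-sentence dataset made up purely for illustration (real training uses a full dataset, e.g. from the `datasets` library):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "prajjwal1/bert-tiny"   # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Toy dataset, only to show the wiring
texts = ["great food", "too noisy", "lovely dishes", "bad experience"]
labels = [1, 0, 1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item
    def __len__(self):
        return len(labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", max_steps=2,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=ToyDataset(),
)
trainer.train()   # runs 2 optimization steps on the toy data
```
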