
Artificial Intelligence 2

CI 2 - S3

Pr. Hamza Alami

Academic year: 2024/2025


Outline
1. Recap

2. Structured Self-Attention

3. Pump up your NNs' performance

4. Transformers

5. Transformers: the Python library


Recap
• When should we use attention in NNs?

• What are the different types of attention we have previously seen?

• In which cases can the ReLU activation cause problems, and what is the alternative?

• What is a seq2seq model?


Outline
1. Recap

2. Structured Self-Attention

3. Pump up your NNs' performance

4. Transformers

5. Transformers: the Python library


Structured self attention
• How can we use attention when our model is not seq2seq?

"The greater the obstacle, the more glory in overcoming it." (Molière)

[Diagram: the sentence is fed to an Encoder]
Structured self attention

$$c_t = \sum_{i=1}^{n} \alpha_i h_i$$

• In seq2seq attention, a scoring function g computes g(s_{t-1}, h_i) for each hidden state h_i, and a Softmax over these scores gives the weights α_1, …, α_n. Here we don't have s_{t-1}.

• Considering the following sentence:
"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."

[Diagram: hidden states h_1, …, h_n produced from "The greater the obstacle … it"]
Structured self attention
• The sentence contains multiple components or aspects

• We need multiple hops of attention to focus on the various components of the sentence:

"The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill. However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."

How can we make the NN focus on multiple parts effectively?

• For instance, one hop could attend to "The food at the restaurant was enjoyable, with a variety of well-prepared and flavorful dishes that showcased the chef's skill." and another to "However, the overall experience was somewhat overshadowed by the high noise levels in the dining area."
Structured self attention
• The paper entitled "A Structured Self-Attentive Sentence Embedding" introduced a method to attend to various parts of a sentence and make the decisions of the NN more interpretable

• Instead of using a vector to represent a sentence, the paper uses a matrix

• Each vector in the matrix attends to different components (aspects) within the sentence
Structured self attention

• Single-hop (vector) sentence embedding:

$$\text{sentence\_embedding} = \sum_{i=1}^{n} \alpha_i h_i, \qquad \alpha = \mathrm{Softmax}\left(w_{attention} \cdot \tanh(WH)\right)$$

• Multi-hop (matrix) sentence embedding:

$$\text{sentence\_embedding} = AH, \qquad A = \mathrm{Softmax}\left(W_{attention} \cdot \tanh(WH)\right)$$

where H stacks the hidden states h_1, …, h_n and A is a k × n matrix of attention weights α_{ij}, one row per hop.

What can get worse?
Structured self attention
• What if the model learned the same attention weights for every hop?

• To ensure that the model learns to attend to different components, the authors added a penalization term to the loss:

$$P = \left\| AA^{T} - I \right\|_F^2$$

• Like the L1 and L2 penalization terms, we multiply P by a coefficient and then add it to the loss term (a minimal implementation sketch follows)
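As a concrete illustration, here is a minimal PyTorch sketch of a structured self-attention layer that returns both the sentence embedding matrix AH and the penalization term P. The class name and constructor arguments are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """Multi-hop attention over hidden states, following
    A = Softmax(W_attention · tanh(W H)) from the slides above."""
    def __init__(self, hidden_dim, attention_dim, n_hops):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attention_dim, bias=False)         # W
        self.W_attention = nn.Linear(attention_dim, n_hops, bias=False)   # W_attention

    def forward(self, H):
        # H: (batch, n, hidden_dim) holds the hidden states h_1 .. h_n
        scores = self.W_attention(torch.tanh(self.W(H)))   # (batch, n, k)
        A = F.softmax(scores, dim=1).transpose(1, 2)       # (batch, k, n), softmax over positions
        M = A @ H                                          # (batch, k, hidden_dim) = AH
        # Penalization P = ||A A^T - I||_F^2 pushes the hops to be diverse
        I = torch.eye(A.size(1), device=A.device)
        P = ((A @ A.transpose(1, 2) - I) ** 2).sum(dim=(1, 2)).mean()
        return M, P
```

The total training objective is then the task loss plus the penalty scaled by a coefficient, e.g. `loss = task_loss + coeff * P`.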
Outline
1. Recap

2. Structured Self-Attention

3. Pump up your NNs' performance

4. Transformers

5. Transformers: the Python library


Normalisation layers (Modules)
• Usually we normalize data before feeding it to the NNs

• How about introducing some normalization inside the NNs?

• This is the idea behind:

• Batch normalization

• Layer normalization
Where to apply normalization?

X → Weighted Sum → Activation → … → Weighted Sum → Activation → Output
(Standard neural network architecture)

X → Normalization → Weighted Sum → Activation → … → Normalization → Weighted Sum → Activation → Output
(Apply normalization at the beginning of the weighted sum layer)

X → Weighted Sum → Normalization → Activation → … → Weighted Sum → Normalization → Activation → Output
(Apply normalization at the end of the weighted sum layer)
Batch Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply batch normalization (over the batch) to each feature as follows:

$$\text{normalized\_batch} = \hat{x} \cdot \gamma + \beta$$

where γ (scale) and β (shift) are learnable parameters, and

$$\hat{x} = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{\sum_{i=1}^{batch\_dim} x_i}{batch\_dim}, \qquad \sigma^2 = \frac{\sum_{i=1}^{batch\_dim} (x_i - \mu)^2}{batch\_dim}$$
Batch Normalization
• Batch normalization is sensitive to the batch size

• At inference time, we should remove the dependence on the batch data: we should not compute μ and σ as in training

• During training, running estimates of μ and σ are computed; these quantities are used at inference time:

$$\mu_{running} = (1 - \alpha) \cdot \mu + \alpha \cdot \mu_{running}$$
$$\sigma_{running} = (1 - \alpha) \cdot \sigma + \alpha \cdot \sigma_{running}$$
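To make the training/inference distinction concrete, here is a minimal NumPy sketch of batch normalization with running statistics. The `eps` term is a standard numerical-stability addition not shown in the formulas above:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mu, running_sigma,
               alpha=0.9, training=True, eps=1e-5):
    """Batch normalization of x: (batch_dim, features_dim).
    gamma/beta are the learnable scale and shift arrays."""
    if training:
        mu = x.mean(axis=0)       # per-feature mean over the batch
        sigma = x.std(axis=0)     # per-feature std over the batch
        # Update the running statistics used at inference time
        running_mu[:] = (1 - alpha) * mu + alpha * running_mu
        running_sigma[:] = (1 - alpha) * sigma + alpha * running_sigma
    else:
        mu, sigma = running_mu, running_sigma  # no dependence on the batch
    x_hat = (x - mu) / (sigma + eps)
    return x_hat * gamma + beta
```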
Layer Normalization
• Considering that we have an input X with size (batch_dim, features_dim)
• We apply layer normalization (over the features) to each element as follows:

$$\text{normalized\_element} = \hat{x} \cdot \gamma + \beta$$

where γ (scale) and β (shift) are learnable parameters, and

$$\hat{x} = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{\sum_{i=1}^{features\_dim} x_i}{features\_dim}, \qquad \sigma^2 = \frac{\sum_{i=1}^{features\_dim} (x_i - \mu)^2}{features\_dim}$$
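For comparison with the batch-norm sketch above, layer normalization only changes the axis the statistics are computed over (per element, across features):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization of x: (batch_dim, features_dim)."""
    mu = x.mean(axis=1, keepdims=True)    # per-element mean over the features
    sigma = x.std(axis=1, keepdims=True)  # per-element std over the features
    return (x - mu) / (sigma + eps) * gamma + beta
```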
Layer Normalization
• Unlike batch normalization, layer normalization is insensitive to the batch size

• Layer normalization is preferred over batch normalization when:

• The batch size is small

• An RNN architecture is used
Normalization is mysterious
• Normalization does not impact the capacity of an NN model; however, it can improve its performance

• For instance, we can have a model with large capacity that performs poorly on a task. Integrating normalization layers can improve its performance without impacting its capacity
Outline
1. Recap

2. Structured Self-Attention

3. Pump up your NNs' performance

4. Transformers

5. Transformers: the Python library


Transformers
• Transformers were proposed in the "Attention Is All You Need" paper, 2017

• Transformers were proposed to solve the machine translation task

• Transformers are highly parallelizable, an idea reminiscent of the "Pandemonium: A Paradigm for Learning" paper, 1958
Transformers
• Can we build a model that:

• Captures long-range dependencies without using RNNs (avoiding their problems)

• Applies the attention mechanism

• Is highly parallelizable, to take advantage of recent advances in processing units
Transformers
• Transformers are seq2seq models: they take a sentence as input and generate a sentence

Input: "The greater the obstacle the more glory in overcoming it" → Embeddings → Encoder → Decoder → Output: <START> كلما كان العائق أكبر كلما كان المجد في التغلب عليه أكبر <END>
Transformers (Encoder)

• Scaled dot-product attention: compute QK^T, normalize (divide by √d_k), apply Softmax, then multiply by V:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

[Diagram: inside the encoder, the input matrix is projected into Q, K, and V]
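A minimal PyTorch sketch of this computation; the optional `mask` argument is an addition used later by the decoder's masked attention:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # block masked positions
    weights = F.softmax(scores, dim=-1)                   # weights sum to 1 over the keys
    return weights @ V
```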
Transformers (Encoder - Multi-Head)

• Each head h computes attention with its own projections Q_h, K_h, V_h:

$$\mathrm{Head}_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\right)V_h$$

• The k heads are concatenated to form the multi-head attention output:

$$\mathrm{MHAttention} = \mathrm{Concat}(\mathrm{Head}_1, \ldots, \mathrm{Head}_k)$$

[Diagram: the input matrix is projected into per-head queries Q_1..Q_k, keys K_1..K_k, and values V_1..V_k]
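A hedged multi-head sketch building on the scaled_dot_product_attention function above; the projection layout (one linear layer per role, split into heads) is one common implementation choice, not the only one:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """k heads, each with its own Q_h, K_h, V_h, concatenated as on the slide."""
    def __init__(self, dim, n_heads):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_k = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, mask=None):
        b, n, _ = x.shape
        # Project, then split the last dimension into heads: (b, heads, n, d_k)
        split = lambda t: t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        heads = scaled_dot_product_attention(Q, K, V, mask)  # defined above
        # Concatenate the heads back into (b, n, dim)
        return self.out(heads.transpose(1, 2).reshape(b, n, self.n_heads * self.d_k))
```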
Transformers (Encoder)

• The encoder block, stacked N times:

Matrix → Multi-Head Attention → Add (+) & LayerNorm → Feed Forward → Add (+) & LayerNorm

Source: "Attention Is All You Need" paper
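A minimal sketch of one encoder block using PyTorch's built-in nn.MultiheadAttention; the feed-forward dimension and activation are illustrative choices:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Multi-Head Attention -> Add & LayerNorm -> Feed Forward -> Add & LayerNorm.
    The full encoder stacks this block N times."""
    def __init__(self, dim, n_heads, ff_dim):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)      # residual connection + LayerNorm
        x = self.norm2(x + self.ff(x))    # feed-forward + residual + LayerNorm
        return x
```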
Transformers (Decoder - Masked Multi-Head)

• Same as the encoder's multi-head attention, except that a mask prevents attending to future positions:

$$\mathrm{Head}_h = \mathrm{Softmax}\left(\mathrm{Mask}\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\right)\right)V_h$$

$$\mathrm{MHAttention} = \mathrm{Concat}(\mathrm{Head}_1, \ldots, \mathrm{Head}_k)$$

[Diagram: the input matrix is projected into per-head Q_1..Q_k, K_1..K_k, V_1..V_k]
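The mask itself can be built as an upper-triangular boolean matrix, for example:

```python
import torch

# Causal mask: position i may not attend to positions j > i (the "future")
n = 5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
# Passing this mask to scaled_dot_product_attention (defined earlier) sets the
# masked scores to -inf, so their Softmax weights become 0
```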
Transformers (Decoder)

[Figure: decoder architecture] Source: "Attention Is All You Need" paper
Transformers (Positional encoding)
• How about the position of the token? How do we feed the position information to the encoder and decoder?

• The authors used the following to build the positional encoding:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/dim\_embeddings}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/dim\_embeddings}\right)$$

• The positional embeddings are then added to the embeddings (first layer)
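A minimal NumPy sketch of these two formulas (assuming an even embedding dimension):

```python
import numpy as np

def positional_encoding(max_len, dim_embeddings):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(dim_embeddings // 2)[None, :]  # (1, dim/2)
    angles = pos / 10000 ** (2 * i / dim_embeddings)
    pe = np.zeros((max_len, dim_embeddings))
    pe[:, 0::2] = np.sin(angles)                 # even indices: sin
    pe[:, 1::2] = np.cos(angles)                 # odd indices: cos
    return pe  # added to the token embeddings at the first layer
```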
Transformers (Positional encoding)

[Figure: positional encoding visualization]
source: [Link]
Transformers reshaped the world
• Transformers have achieved SoTA in various tasks: text classification, question answering, machine translation, text generation, etc.

• Transformers need large amounts of data and heavy computational resources for training

• In most cases, transfer learning (fine-tuning) methods are used to make transformer-based models perform well on specific tasks
GPT
• Generative Pre-Training (GPT) was introduced in the paper "Improving Language Understanding by Generative Pre-Training", 2018

• The idea is to train a language model using the Transformers decoder block to maximize the autoregressive language-modeling likelihood

• The same model is fine-tuned on other NLU tasks, which achieved SoTA in various tasks (back then)
BERT
• Bidirectional Encoder Representations from Transformers (BERT) is based on the Transformers encoder only

• The idea is to create embeddings of text that can perform well on various natural language understanding tasks
GPT vs BERT architectures

[Figure: GPT (decoder-based) vs BERT (encoder-based) architectures]
Outline
1. Recap

2. Structured Self-Attention

3. Pump up your NNs' performance

4. Transformers

5. Transformers: the Python library


Transformers Library
• Transformers is an open-source library designed to make transformer-based models accessible to the broader machine learning community.

source: "Transformers: State-of-the-Art Natural Language Processing" paper
Transformers library
• Each model in the library is defined by three blocks:

• Tokenizer

• Transformer

• Head
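A hedged illustration of the first two blocks using the library's Auto classes (the checkpoint name is just an example; a task-specific head appears in the loading example on the next slide):

```python
from transformers import AutoTokenizer, AutoModel

checkpoint = "bert-base-uncased"  # example checkpoint from the model hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # 1. Tokenizer: text -> token ids
transformer = AutoModel.from_pretrained(checkpoint)    # 2. Transformer: token ids -> hidden states

tokens = tokenizer("Hello world", return_tensors="pt")
hidden_states = transformer(**tokens).last_hidden_state  # (1, n_tokens, hidden_dim)
# 3. Head: a task-specific layer (e.g., a classifier) applied on top of the hidden states
```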
Transformer model hub
• A huge model hub containing various open-source pretrained transformer models is available
Loading a model with transformers
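The original slide showed a code screenshot; as a hedged stand-in, loading a tokenizer and a model with the library's Auto classes looks like this (the checkpoint name and number of labels are examples):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # any model-hub checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(checkpoint)           # Tokenizer block
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)                                   # Transformer + classification head

inputs = tokenizer("The greater the obstacle...", return_tensors="pt")
outputs = model(**inputs)  # logits for the 2 classes
```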

Configure training process
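The original slide showed a screenshot; a hedged sketch of a training configuration with the library's TrainingArguments (all values are illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # where checkpoints are written
    num_train_epochs=3,                # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
```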
Starting the training process
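The original slide showed a screenshot; a hedged sketch using the library's Trainer (train_dataset and eval_dataset are assumed to be already-tokenized datasets):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # from the loading step above
    args=training_args,           # from the configuration step above
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()                   # starts the training loop
```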

