Foundations of machine learning
Large Language Models and Transformers
Maximilian Kasy
Department of Economics, University of Oxford
Fall 2023
Prediction tasks in language processing
The transformer architecture
Generative AI
References
Prediction tasks in language processing
• Suppose the data consist of pairs of sequences of “tokens”:
x = (x_1, . . . , x_n) and y = (y_1, . . . , y_m).
• Various tasks in language processing require estimating models P̂ of
P(Y | X).
• Typical loss function for an observation (x, y): Negative log likelihood,
− log P̂(Y = y | X = x).
Examples:
1. Machine translation:
x is a sentence in the source language.
y is a sentence in the target language.
2. Question answering:
x is a question. y is an answer.
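• A minimal numerical sketch of this loss for one pair (x, y), assuming the fitted model P̂ has already produced a probability for each target token (the probabilities below are made up for illustration):

```python
import numpy as np

# Per-token probabilities P_hat(y_i | x, y_1, ..., y_{i-1}) for one observation;
# made-up numbers standing in for a fitted model's output.
token_probs = np.array([0.7, 0.4, 0.9])

# P_hat(Y = y | X = x) is the product of the per-token probabilities,
# so the negative log likelihood is the sum of per-token -log probabilities.
nll = -np.sum(np.log(token_probs))
print(nll)   # -(log 0.7 + log 0.4 + log 0.9) ≈ 1.38
```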
Self-supervised learning
• These prediction problems require specific data – pairs of x and y.
• There is much greater availability of data on “unlabeled” sequences x.
E.g., all the text on the internet (Wikipedia, Arxiv, Github, ...).
• Self-supervised learning fits models for the distribution of such sequences.
Leading cases:
1. Autoregressive models:
Model P(x_i | x_1, . . . , x_{i−1}), for all i.
2. Masking:
Model P(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n), for all i.
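• A small sketch of how training examples for these two cases can be formed from a single unlabeled sequence (the tokens and the [MASK] placeholder are purely illustrative):

```python
# One unlabeled sequence x, e.g. scraped from the internet.
x = ["the", "cat", "sat", "on", "the", "mat"]

# 1. Autoregressive: predict x_i from the prefix x_1, ..., x_{i-1}.
autoregressive_examples = [(x[:i], x[i]) for i in range(1, len(x))]
print(autoregressive_examples[1])   # (['the', 'cat'], 'sat')

# 2. Masking: hide x_i and predict it from all the other tokens.
i = 2
masked_input = x[:i] + ["[MASK]"] + x[i + 1:]
print((masked_input, x[i]))         # (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')
```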
Masking
Embeddings and pre-training
• Many language models are trained in two steps:
1. Self-supervised learning on a large corpus of sequences x, using masking.
This yields an embedding (representation) of the source data x.
2. Fine-tuning on a task-specific corpus:
using the embeddings from step 1 as predictors for y.
• This is also known as transfer learning.
It yields much better results than simply training on the task-specific corpus.
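• A stylized sketch of the two-step procedure, where a fixed random projection stands in for the pre-trained embedding and the task-specific data are made up; it only illustrates using the embeddings as predictors in step 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stand-in): a "pre-trained" embedding of token sequences.
# Here just a fixed random projection; in practice this comes from
# self-supervised learning on a large corpus.
W_pre = rng.normal(size=(1000, 64))          # one row per vocabulary item

def embed(token_ids):
    # Represent a sequence by the average of its token embeddings.
    return W_pre[token_ids].mean(axis=0)

# Step 2: fine-tuning on a small task-specific corpus (made-up data).
task_x = [rng.integers(0, 1000, size=10) for _ in range(100)]   # token-id sequences
task_y = rng.normal(size=100)                                   # task labels

Z = np.stack([embed(ids) for ids in task_x])           # embeddings as predictors
beta, *_ = np.linalg.lstsq(Z, task_y, rcond=None)      # simple linear head for y

print((Z @ beta)[:3])                                  # fitted values for y
```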
The transformer architecture
• How do we get an embedding for a sequence of tokens?
• What functional form should we choose?
• Leading answer: Transformers.
• Transformers consist of multiple transformer blocks,
each of which includes self-attention layers.
Self-attention layers
• Take as given a sequence of input vectors x_1, . . . , x_n.
• We want to transform it into a sequence of output vectors y_1, . . . , y_n
of the same dimension.
• y_i is supposed to encode the meaning of x_i in the context of the other x_j.
• First step: Take a linear transformation of the x_i:
v_i = W^v · x_i.
• Second step: Take a weighted average of the v_j to get the output y_i:
y_i = ∑_j α_ij · v_j.
Self-attention layers continued
• The weights α_ij capture the importance of x_j as context for x_i.
• But where do the weights come from? Self-attention!
α_ij = exp(score_ij) / ∑_{j′} exp(score_ij′).
Normalizing sum of weights to 1 (aka softmax / multinomial logit).
• score_ij: Relevance of x_j as context for x_i.
score_ij = ⟨q_i, k_j⟩ (inner product)
q_i = W^q · x_i (query)
k_j = W^k · x_j (key)
• Contrast with time-series models, where weights depend only on |i − j|:
⇒ those would not recognize the importance of far-away parts of a sentence as context.
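• A minimal single-head self-attention layer in this notation, with made-up dimensions and random weights (the common 1/√d scaling of the scores is omitted here to match the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # sequence length, vector dimension
X = rng.normal(size=(n, d))        # rows are the input vectors x_1, ..., x_n

W_q = rng.normal(size=(d, d))      # query weights W^q
W_k = rng.normal(size=(d, d))      # key weights W^k
W_v = rng.normal(size=(d, d))      # value weights W^v

Q = X @ W_q                        # q_i = W^q x_i
K = X @ W_k                        # k_j = W^k x_j
V = X @ W_v                        # v_j = W^v x_j

scores = Q @ K.T                   # score_ij = <q_i, k_j>
scores -= scores.max(axis=1, keepdims=True)   # subtract row max (numerical stability only)
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over j

Y = alpha @ V                      # y_i = sum_j alpha_ij v_j
print(Y.shape)                     # (5, 8): one output vector per input vector
```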
Backward-looking and bidirectional self-attention
Transformer blocks
• Self-attention layers are packaged with some additional transformations as
follows:
z = LayerNorm(x + SelfAttention(x))
y = LayerNorm(z + FFN(z))
• LayerNorm(x) normalizes x = (x_1, . . . , x_n)
by subtracting the mean and dividing by the standard deviation.
• The addition of x to SelfAttention(x) is called a “residual connection.”
It preserves raw information from the original input.
• FFN(z) is a standard feed-forward neural network.
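• A sketch of one transformer block built from these pieces (layer widths, weights, and the ReLU feed-forward network are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation, per vector.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    s = q @ k.T
    s = s - s.max(axis=1, keepdims=True)                  # numerical stability
    a = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # softmax weights alpha_ij
    return a @ v

def ffn(z):
    # Standard feed-forward network, applied position by position.
    return np.maximum(z @ W1 + b1, 0) @ W2 + b2

def transformer_block(x):
    z = layer_norm(x + self_attention(x))   # residual connection + LayerNorm
    return layer_norm(z + ffn(z))           # residual connection + LayerNorm

X = rng.normal(size=(n, d))
print(transformer_block(X).shape)           # (5, 8)
```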
Multi-head attention
• Tweak on the transformer block:
Replace the single self-attention layer
by several parallel versions, indexed by b.
• Thus:
y_i^b = ∑_j α_ij^b · [W^{v,b} · x_j],   α_ij^b = exp(score_ij^b) / ∑_{j′} exp(score_ij′^b),
score_ij^b = ⟨[W^{q,b} · x_i], [W^{k,b} · x_j]⟩.
• The rest of the transformer block stays the same.
• Motivation: Context matters in various ways.
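• A sketch with several parallel heads, each with its own W^{q,b}, W^{k,b}, W^{v,b}; how the head outputs are combined afterwards (simple concatenation here) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_heads = 5, 8, 2
X = rng.normal(size=(n, d))

def softmax_rows(s):
    s = s - s.max(axis=1, keepdims=True)
    return np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)

head_outputs = []
for b in range(n_heads):
    # Each head b has its own query, key, and value matrices.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    alpha = softmax_rows((X @ W_q) @ (X @ W_k).T)    # alpha^b_ij
    head_outputs.append(alpha @ (X @ W_v))           # y^b_i = sum_j alpha^b_ij W^{v,b} x_j

# Combine the heads; in practice the outputs are typically concatenated and
# projected back to dimension d before the rest of the transformer block.
Y = np.concatenate(head_outputs, axis=1)
print(Y.shape)                                       # (5, 16)
```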
Generative AI
• Suppose you have fit an autoregressive model, which gives
P̂(y_i | x, y_1, . . . , y_{i−1}).
• Suppose you would like to generate a prediction of y, given an input x.
• That is, you would like to find
ŷ = argmax_y P̂(y | x) = argmax_y ∏_i P̂(y_i | x, y_1, . . . , y_{i−1}).
• Such forecasting with autoregressive models is at the heart of “generative AI.”
Greedy sampling
• Naive idea: Sequentially find the highest-probability prediction, one step at a time:
ŷ_i = argmax_{y_i} P̂(y_i | x, ŷ_1, . . . , ŷ_{i−1}).
• This is known as greedy search.
• Problem:
This does not take into account the impact of the choice of ŷ_i
on the availability of high-probability choices later.
• Dynamic programming problem!
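• A sketch of greedy decoding, with a toy stand-in for the fitted model P̂ over a three-token vocabulary (probabilities made up for illustration):

```python
import numpy as np

vocab = ["a", "b", "<eos>"]

def p_hat(x, prefix):
    # Toy stand-in for P_hat(y_i | x, y_1, ..., y_{i-1}).
    if not prefix:
        return np.array([0.6, 0.3, 0.1])
    if prefix[-1] == "a":
        return np.array([0.2, 0.7, 0.1])
    return np.array([0.1, 0.2, 0.7])

def greedy_decode(x, max_len=10):
    # At each step, pick the single highest-probability next token.
    y = []
    for _ in range(max_len):
        probs = p_hat(x, y)
        y.append(vocab[int(np.argmax(probs))])
        if y[-1] == "<eos>":
            break
    return y

print(greedy_decode(x="some input"))   # ['a', 'b', '<eos>']
```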
Beam search
• Exhaustive search of the tree of possible sequences is too costly.
• Compromise: Beam search.
1. Start with the k highest-probability choices for ŷ_1.
2. For each of these choices separately,
find the k highest-probability choices for ŷ_2.
3. Keep the k sequences (ŷ_1, ŷ_2) with the highest probability,
discard the rest.
4. Iterate.
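• A sketch of beam search with beam width k, reusing the toy P̂ from the greedy-decoding sketch and scoring candidate sequences by their total log probability:

```python
import numpy as np

vocab = ["a", "b", "<eos>"]

def p_hat(x, prefix):
    # Same toy stand-in for P_hat(y_i | x, y_1, ..., y_{i-1}) as above.
    if not prefix:
        return np.array([0.6, 0.3, 0.1])
    if prefix[-1] == "a":
        return np.array([0.2, 0.7, 0.1])
    return np.array([0.1, 0.2, 0.7])

def beam_search(x, k=2, max_len=10):
    beams = [([], 0.0)]                               # (sequence, log probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":            # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            probs = p_hat(x, seq)
            for token, p in zip(vocab, probs):        # extend each beam by every token
                candidates.append((seq + [token], logp + np.log(p)))
        # Keep the k highest-probability sequences, discard the rest.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams

print(beam_search(x="some input"))   # best beam: ['a', 'b', '<eos>']
```
• With k = 1, this reduces to greedy search.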
References
Dan Jurafsky and James H. Martin (2023). Speech and Language Processing, chapters 10–11.