Foundations of machine learning
Large Language Models and Transformers
Maximilian Kasy
Department of Economics, University of Oxford
Fall 2023
Prediction tasks in language processing
The transformer architecture
Generative AI
References
Prediction tasks in language processing
• Suppose the data consist of pairs of sequences of “tokens”:
x = (x_1, . . . , x_n) and y = (y_1, . . . , y_m).
• Various tasks in language processing require estimating models P̂ of
P(Y | X).
• Typical loss function for an observation (x, y): Negative log likelihood,
− log P̂(Y = y | X = x).
Examples:
1. Machine translation:
x is a sentence in the source language.
y is a sentence in the target language.
2. Question answering:
x is a question. y is an answer.
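• A minimal numerical sketch of this loss for one pair (x, y), assuming the fitted model P̂ has already produced a probability for each target token (the probabilities below are made up for illustration):

```python
import numpy as np

# Per-token probabilities P_hat(y_i | x, y_1, ..., y_{i-1}) for one observation;
# made-up numbers standing in for a fitted model's output.
token_probs = np.array([0.7, 0.4, 0.9])

# P_hat(Y = y | X = x) is the product of the per-token probabilities,
# so the negative log likelihood is the sum of per-token -log probabilities.
nll = -np.sum(np.log(token_probs))
print(nll)   # -(log 0.7 + log 0.4 + log 0.9) ≈ 1.38
```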
Self-supervised learning
• These prediction problems require specific data – pairs of x and y.
• There is much greater availability of data on “unlabeled” sequences x.
E.g., all the text on the internet (Wikipedia, Arxiv, Github, ...).
• Self-supervised learning fits models for the distribution of such sequences.
Leading cases:
1. Autoregressive models:
Model P(x_i | x_1, . . . , x_{i−1}), for all i.
2. Masking:
Model P(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n), for all i.
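• A small sketch of how training examples for these two cases can be formed from a single unlabeled sequence (the tokens and the [MASK] placeholder are purely illustrative):

```python
# One unlabeled sequence x, e.g. scraped from the internet.
x = ["the", "cat", "sat", "on", "the", "mat"]

# 1. Autoregressive: predict x_i from the prefix x_1, ..., x_{i-1}.
autoregressive_examples = [(x[:i], x[i]) for i in range(1, len(x))]
print(autoregressive_examples[1])   # (['the', 'cat'], 'sat')

# 2. Masking: hide x_i and predict it from all the other tokens.
i = 2
masked_input = x[:i] + ["[MASK]"] + x[i + 1:]
print((masked_input, x[i]))         # (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')
```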
Masking
Embeddings and pre-training
• Many language models are trained in two steps:
1. Self-supervised learning on a large corpus of sequences x, using masking.
This yields an embedding (representation) of the source data x.
2. Fine-tuning on a task-specific corpus:
using the embeddings from step 1 as predictors for y.
• This is also known as transfer learning.
It yields much better results than simply training on the task-specific corpus.
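• A stylized sketch of the two-step procedure, where a fixed random projection stands in for the pre-trained embedding and the task-specific data are made up; it only illustrates using the embeddings as predictors in step 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stand-in): a "pre-trained" embedding of token sequences.
# Here just a fixed random projection; in practice this comes from
# self-supervised learning on a large corpus.
W_pre = rng.normal(size=(1000, 64))          # one row per vocabulary item

def embed(token_ids):
    # Represent a sequence by the average of its token embeddings.
    return W_pre[token_ids].mean(axis=0)

# Step 2: fine-tuning on a small task-specific corpus (made-up data).
task_x = [rng.integers(0, 1000, size=10) for _ in range(100)]   # token-id sequences
task_y = rng.normal(size=100)                                   # task labels

Z = np.stack([embed(ids) for ids in task_x])           # embeddings as predictors
beta, *_ = np.linalg.lstsq(Z, task_y, rcond=None)      # simple linear head for y

print((Z @ beta)[:3])                                  # fitted values for y
```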
The transformer architecture
• How do we get an embedding for a sequence of tokens?
• What functional form should we choose?
• Leading answer: Transformers.
• Transformers consist of multiple transformer blocks,
each of which includes self-attention layers.
Self-attention layers
• Take as given a sequence of input vectors x_1, . . . , x_n.
• We want to transform it into a sequence of output vectors y_1, . . . , y_n
of the same dimension.
• y_i is supposed to encode the meaning of x_i in the context of the other x_j.
• First step: Take a linear transformation of the x_i:
v_i = W^v · x_i.
• Second step: Take a weighted average of the v_j to get the output y_i:
y_i = ∑_j α_ij · v_j.
Self-attention layers continued
• The weights α_ij capture the importance of x_j as context for x_i.
• But where do the weights come from? Self-attention!
α_ij = exp(score_ij) / ∑_{j′} exp(score_ij′).
Normalizing sum of weights to 1 (aka softmax / multinomial logit).
• score_ij: Relevance of x_j as context for x_i.
score_ij = ⟨q_i, k_j⟩ (inner product)
q_i = W^q · x_i (query)
k_j = W^k · x_j (key)
• Contrast with time-series models, where weights depend only on |i − j|:
⇒ those would not recognize the importance of far-away parts of a sentence as context.
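• A minimal single-head self-attention layer in this notation, with made-up dimensions and random weights (the common 1/√d scaling of the scores is omitted here to match the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # sequence length, vector dimension
X = rng.normal(size=(n, d))        # rows are the input vectors x_1, ..., x_n

W_q = rng.normal(size=(d, d))      # query weights W^q
W_k = rng.normal(size=(d, d))      # key weights W^k
W_v = rng.normal(size=(d, d))      # value weights W^v

Q = X @ W_q                        # q_i = W^q x_i
K = X @ W_k                        # k_j = W^k x_j
V = X @ W_v                        # v_j = W^v x_j

scores = Q @ K.T                   # score_ij = <q_i, k_j>
scores -= scores.max(axis=1, keepdims=True)   # subtract row max (numerical stability only)
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over j

Y = alpha @ V                      # y_i = sum_j alpha_ij v_j
print(Y.shape)                     # (5, 8): one output vector per input vector
```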
Backward-looking and bidirectional self-attention
Transformer blocks
• Self-attention layers are packaged with some additional transformations as
follows:
z = LayerNorm(x + SelfAttention(x))
y = LayerNorm(z + FFN(z))
• LayerNorm(x) normalizes x = (x_1, . . . , x_n)
by subtracting the mean and dividing by the standard deviation.
• The addition of x to SelfAttention(x) is called a “residual connection.”
It preserves raw information from the original input.
• FFN(z) is a standard feed-forward neural network.
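• A sketch of one transformer block built from these pieces (layer widths, weights, and the ReLU feed-forward network are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation, per vector.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    s = q @ k.T
    s = s - s.max(axis=1, keepdims=True)                  # numerical stability
    a = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # softmax weights alpha_ij
    return a @ v

def ffn(z):
    # Standard feed-forward network, applied position by position.
    return np.maximum(z @ W1 + b1, 0) @ W2 + b2

def transformer_block(x):
    z = layer_norm(x + self_attention(x))   # residual connection + LayerNorm
    return layer_norm(z + ffn(z))           # residual connection + LayerNorm

X = rng.normal(size=(n, d))
print(transformer_block(X).shape)           # (5, 8)
```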
Multi-head attention
• Tweak on the transformer block:
Replace the single self-attention layer
by several parallel versions, indexed by b.
• Thus:
y_i^b = ∑_j α_ij^b · [W^{v,b} · x_j],   α_ij^b = exp(score_ij^b) / ∑_{j′} exp(score_ij′^b),
score_ij^b = ⟨[W^{q,b} · x_i], [W^{k,b} · x_j]⟩.
• The rest of the transformer block stays the same.
• Motivation: Context matters in various ways.
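• A sketch with several parallel heads, each with its own W^{q,b}, W^{k,b}, W^{v,b}; how the head outputs are combined afterwards (simple concatenation here) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_heads = 5, 8, 2
X = rng.normal(size=(n, d))

def softmax_rows(s):
    s = s - s.max(axis=1, keepdims=True)
    return np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)

head_outputs = []
for b in range(n_heads):
    # Each head b has its own query, key, and value matrices.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    alpha = softmax_rows((X @ W_q) @ (X @ W_k).T)    # alpha^b_ij
    head_outputs.append(alpha @ (X @ W_v))           # y^b_i = sum_j alpha^b_ij W^{v,b} x_j

# Combine the heads; in practice the outputs are typically concatenated and
# projected back to dimension d before the rest of the transformer block.
Y = np.concatenate(head_outputs, axis=1)
print(Y.shape)                                       # (5, 16)
```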
Generative AI
• Suppose you have fit an autoregressive model, which gives
P̂(y_i | x, y_1, . . . , y_{i−1}).
• Suppose you would like to generate a prediction of y, given an input x.
• That is, you would like to find
ŷ = argmax_y P̂(y | x) = argmax_y ∏_i P̂(y_i | x, y_1, . . . , y_{i−1}).
• Such forecasting with autoregressive models is at the heart of “generative AI.”
Greedy sampling
• Naive idea: Sequentially find the highest-probability prediction, one step at a time:
ŷ_i = argmax_{y_i} P̂(y_i | x, ŷ_1, . . . , ŷ_{i−1}).
• This is known as greedy search.
• Problem:
This does not take into account the impact of the choice of ŷ_i
on the availability of high-probability choices later.
• Dynamic programming problem!
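• A sketch of greedy decoding, with a toy stand-in for the fitted model P̂ over a three-token vocabulary (probabilities made up for illustration):

```python
import numpy as np

vocab = ["a", "b", "<eos>"]

def p_hat(x, prefix):
    # Toy stand-in for P_hat(y_i | x, y_1, ..., y_{i-1}).
    if not prefix:
        return np.array([0.6, 0.3, 0.1])
    if prefix[-1] == "a":
        return np.array([0.2, 0.7, 0.1])
    return np.array([0.1, 0.2, 0.7])

def greedy_decode(x, max_len=10):
    # At each step, pick the single highest-probability next token.
    y = []
    for _ in range(max_len):
        probs = p_hat(x, y)
        y.append(vocab[int(np.argmax(probs))])
        if y[-1] == "<eos>":
            break
    return y

print(greedy_decode(x="some input"))   # ['a', 'b', '<eos>']
```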
Beam search
• Exhaustive search of the tree of possible sequences is too costly.
• Compromise: Beam search.
1. Start with the k highest-probability choices for ŷ_1.
2. For each of these choices separately,
find the k highest-probability choices for ŷ_2.
3. Keep the k sequences (ŷ_1, ŷ_2) with the highest probability,
discard the rest.
4. Iterate.
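• A sketch of beam search with beam width k, reusing the toy P̂ from the greedy-decoding sketch and scoring candidate sequences by their total log probability:

```python
import numpy as np

vocab = ["a", "b", "<eos>"]

def p_hat(x, prefix):
    # Same toy stand-in for P_hat(y_i | x, y_1, ..., y_{i-1}) as above.
    if not prefix:
        return np.array([0.6, 0.3, 0.1])
    if prefix[-1] == "a":
        return np.array([0.2, 0.7, 0.1])
    return np.array([0.1, 0.2, 0.7])

def beam_search(x, k=2, max_len=10):
    beams = [([], 0.0)]                               # (sequence, log probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":            # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            probs = p_hat(x, seq)
            for token, p in zip(vocab, probs):        # extend each beam by every token
                candidates.append((seq + [token], logp + np.log(p)))
        # Keep the k highest-probability sequences, discard the rest.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams

print(beam_search(x="some input"))   # best beam: ['a', 'b', '<eos>']
```
• With k = 1, this reduces to greedy search.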
References
Dan Jurafsky and James H. Martin (2023). Speech and Language Processing, chapters 10–11.