Machine Learning: Transformers Explained

The document discusses the foundations of machine learning with a focus on large language models and transformers. It covers prediction tasks in language processing, self-supervised learning, and the transformer architecture, including self-attention mechanisms and multi-head attention. Additionally, it explores generative AI techniques such as autoregressive models and search strategies like greedy and beam search.


Foundations of machine learning

Large Language Models and Transformers

Maximilian Kasy

Department of Economics, University of Oxford

Fall 2023
Prediction tasks in language processing

The transformer architecture

Generative AI

References
Prediction tasks in language processing
• Suppose the data consist of pairs of sequences of “tokens”:
$x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$.
• Various tasks in language processing require estimating a model $\hat{P}$ of the conditional distribution
$P(Y \mid X)$.
• Typical loss function for an observation $(x, y)$: negative log-likelihood,
$-\log \hat{P}(Y = y \mid X = x)$.

Examples:
1. Machine translation:
x is a sentence in the source language.
y is a sentence in the target language.
2. Question answering:
x is a question. y is an answer.
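
The negative log-likelihood loss above can be sketched in a few lines. This is a minimal illustration, assuming the model factorizes the probability of y into per-token conditional probabilities (as the autoregressive models in the later slides do). The function name and the probability values are hypothetical.

```python
import math

def negative_log_likelihood(token_probs):
    """Return -log P_hat(Y = y | X = x) for one observation (x, y).

    token_probs[i] is the (hypothetical) model probability of the i-th
    target token given x and the preceding target tokens, so the
    probability of the whole sequence y is the product of the entries.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a three-token target y.
probs = [0.5, 0.25, 0.8]
loss = negative_log_likelihood(probs)
print(loss)
```

Summing log probabilities rather than multiplying probabilities avoids numerical underflow for long sequences.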
Self-supervised learning

• These prediction problems require task-specific data: pairs of x and y.

• Far more data is available in the form of “unlabeled” sequences x,
e.g., all the text on the internet (Wikipedia, arXiv, GitHub, ...).

• Self-supervised learning fits models for the distribution of such sequences.

Leading cases:
1. Autoregressive models:
Model $P(x_i \mid x_1, \ldots, x_{i-1})$, for all i.

2. Masking:
Model $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, for all i.
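
The two self-supervised setups differ only in which tokens the model may condition on. The sketch below builds the (context, target) training pairs for each case from one unlabeled sequence. The function names and the `MASK` placeholder symbol are hypothetical choices for illustration.

```python
MASK = "[MASK]"  # hypothetical placeholder marking the hidden position

def autoregressive_examples(tokens):
    """Pairs (context, target): predict x_i from x_1, ..., x_{i-1}."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]

def masking_examples(tokens):
    """Pairs (context, target): predict x_i from all the other tokens."""
    return [
        (tokens[:i] + [MASK] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]

tokens = ["the", "cat", "sat"]
ar = autoregressive_examples(tokens)
mk = masking_examples(tokens)
for context, target in ar + mk:
    print(context, "->", target)
```

In both cases the targets come from the sequence itself, which is why no separate labels y are needed.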

Masking

[Figure: illustration of masking; image not recovered.]
Embeddings and pre-training

• Many language models are trained in two steps:


1. Self-supervised learning on a large corpus of sequences x, using masking.
This yields an embedding (representation) of the source data x.
2. Fine-tuning on a task-specific corpus:
Using the embeddings from 1. as predictors for y.

• This is also known as transfer learning.


It typically yields much better results than training on the task-specific corpus alone.

The transformer architecture

• How do we get an embedding for a sequence of tokens?

• What functional form should we choose?

• Leading answer: Transformers.

• Transformers consist of multiple transformer blocks.

• Each block includes a self-attention layer.

Self-attention layers
• Take as given a sequence of input vectors $x_1, \ldots, x_n$.

• We want to transform it to produce a sequence of output vectors $y_1, \ldots, y_n$
of the same dimension.

• $y_i$ is supposed to encode the meaning of $x_i$ in the context of the other $x_j$.

• First step: Take a linear transformation of the $x_i$:

$$v_i = W^v \cdot x_i.$$

• Second step: Take a weighted average of the $v_j$ to get the output $y_i$:

$$y_i = \sum_j \alpha_{ij} v_j.$$

Self-attention layers continued
• The weights $\alpha_{ij}$ capture the importance of $x_j$ as context for $x_i$.

• But where do the weights come from? Self-attention!

$$\alpha_{ij} = \frac{\exp(\text{score}_{ij})}{\sum_{j'} \exp(\text{score}_{ij'})}.$$

This normalizes the weights to sum to 1 over j (aka softmax / multinomial logit).

• $\text{score}_{ij}$: Relevance of $x_j$ as context for $x_i$.

$$\text{score}_{ij} = \langle q_i, k_j \rangle \quad \text{(inner product)}$$
$$q_i = W^q \cdot x_i \quad \text{(query)}$$
$$k_j = W^k \cdot x_j \quad \text{(key)}$$

• Contrast with time series models, where the weights depend only on $|i - j|$.
⇒ Such models would not recognize the importance of far-away sentence parts for context.
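
The two steps and the softmax weights can be put together in a short sketch of a single self-attention layer. It uses plain Python lists rather than a numerical library; the helper names, the toy input vectors, and the identity weight matrices are assumptions made for illustration only.

```python
import math

def matvec(W, x):
    """Matrix-vector product W · x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Normalize exp(scores) to sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, Wq, Wk, Wv):
    """Return the outputs y_i = sum_j alpha_ij v_j and the weights alpha."""
    qs = [matvec(Wq, x) for x in xs]  # queries q_i = W^q x_i
    ks = [matvec(Wk, x) for x in xs]  # keys    k_j = W^k x_j
    vs = [matvec(Wv, x) for x in xs]  # values  v_j = W^v x_j
    ys, alphas = [], []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]  # <q_i, k_j>
        alpha = softmax(scores)
        y = [sum(a * v[d] for a, v in zip(alpha, vs))
             for d in range(len(vs[0]))]
        ys.append(y)
        alphas.append(alpha)
    return ys, alphas

# Toy inputs: three 2-dimensional vectors and identity weight matrices.
I2 = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, alphas = self_attention(xs, I2, I2, I2)
print(alphas[0])
```

For the first input, the scores are (1, 0, 1), so the first and third positions receive equal weight and the second receives less. The weights depend on the content of the vectors, not on the distance |i − j|.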
Backward looking and bi-directional self-attention

[Figure not recovered.]
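
A minimal sketch of the distinction named in the slide title, assuming that "backward looking" means position i may attend only to positions j ≤ i, while "bi-directional" means it may attend to all positions. Under that assumption the two cases differ only in which scores enter the softmax. The function name and the `backward_looking` flag are hypothetical.

```python
import math

def attention_weights(scores, backward_looking):
    """Softmax each row of scores; optionally restrict row i to j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Positions that position i is allowed to attend to.
        allowed = [j for j in range(len(row))
                   if j <= i or not backward_looking]
        m = max(row[j] for j in allowed)
        exps = {j: math.exp(row[j] - m) for j in allowed}
        total = sum(exps.values())
        # Excluded positions get weight exactly 0.
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights

# Hypothetical scores: all equal, so only the restriction matters.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal = attention_weights(scores, backward_looking=True)
bidir = attention_weights(scores, backward_looking=False)
print(causal)
print(bidir)
```

The backward-looking restriction matches autoregressive models, which predict each token from the preceding ones; the unrestricted version matches masking, which conditions on tokens on both sides.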
Transformer blocks

• Self-attention layers are packaged with some additional transformations as


follows:

z = LayerNorm(x + SelfAttention(x))
y = LayerNorm(z + FFN(z))

• LayerNorm(x) normalizes a vector $x = (x_1, \ldots, x_n)$
by subtracting the mean of its components and dividing by their standard deviation.

• The addition of x to SelfAttention(x) is called a “residual connection.”
This preserves the raw information from the input to the layer.

• FFN(z) is a standard feed-forward neural network.
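
The block structure above can be sketched as follows. The self-attention layer and the feed-forward network are passed in as functions; the stand-ins used here (uniform averaging and a ReLU) are hypothetical placeholders, not the layers of an actual model. The small constant `eps` is an assumption added for numerical stability, and the learned scale and shift parameters that layer normalization usually has are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean of x and divide by its standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Elementwise sum of two vectors (the residual connection)."""
    return [u + v for u, v in zip(a, b)]

def transformer_block(xs, self_attention, ffn):
    """z = LayerNorm(x + SelfAttention(x)); y = LayerNorm(z + FFN(z))."""
    attended = self_attention(xs)  # operates on the whole sequence
    zs = [layer_norm(add(x, a)) for x, a in zip(xs, attended)]
    return [layer_norm(add(z, ffn(z))) for z in zs]  # FFN applied per position

# Hypothetical stand-ins for the two sub-layers.
def mean_attention(xs):
    """Uniform attention: every output is the average of all inputs."""
    avg = [sum(col) / len(xs) for col in zip(*xs)]
    return [avg for _ in xs]

def relu_ffn(z):
    return [max(0.0, v) for v in z]

xs = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
ys = transformer_block(xs, mean_attention, relu_ffn)
print(ys)
```

Because the output has the same shape as the input, blocks of this form can be stacked one after another.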

Multi-head attention

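• Tweak on the transformer block:
Replace the single self-attention layer
by several parallel versions (“heads”), indexed by b.

• Thus:

$$y_i^b = \sum_j \alpha_{ij}^b \cdot \left[ W^{v,b} \cdot x_j \right], \qquad \alpha_{ij}^b = \frac{\exp(\text{score}_{ij}^b)}{\sum_{j'} \exp(\text{score}_{ij'}^b)},$$

$$\text{score}_{ij}^b = \left\langle \left[ W^{q,b} \cdot x_i \right], \left[ W^{k,b} \cdot x_j \right] \right\rangle.$$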

• The rest of the transformer block stays the same.

• Motivation: Context matters in various ways, and each head can capture a different kind of contextual relationship.
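
A minimal sketch of multi-head attention, in which each head b has its own query, key, and value matrices. The slides do not say how the per-head outputs are combined; concatenating them per position, as done here, is an assumption (it is the common choice). All names and the toy matrices are hypothetical.

```python
import math

def matvec(W, x):
    """Matrix-vector product W · x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(xs, Wq, Wk, Wv):
    """One head b: y_i^b = sum_j alpha_ij^b (W^{v,b} x_j)."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for q in qs:
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in ks])
        out.append([sum(a * v[d] for a, v in zip(alpha, vs))
                    for d in range(len(vs[0]))])
    return out

def multi_head_attention(xs, heads):
    """Run every head on xs and concatenate the outputs per position.

    Concatenation is an assumption; the source does not specify it.
    """
    per_head = [attention_head(xs, *W) for W in heads]
    return [sum((head[i] for head in per_head), []) for i in range(len(xs))]

I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
xs = [[1.0, 0.0], [0.0, 1.0]]
# Head 1 scores by inner product; head 2 has zero scores (uniform weights).
heads = [(I2, I2, I2), (Z2, Z2, I2)]
ys = multi_head_attention(xs, heads)
print(ys[0])
```

The two heads assign different weights to the same inputs, which illustrates how parallel heads can pick up different aspects of context.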

Generative AI

• Suppose you have fit an autoregressive model, which gives

$$\hat{P}(y_i \mid x, y_1, \ldots, y_{i-1}).$$

• Suppose you would like to generate a prediction of y, given an input x.

• That is, you would like to find

$$\hat{y} = \arg\max_y \hat{P}(y \mid x) = \arg\max_y \prod_i \hat{P}(y_i \mid x, y_1, \ldots, y_{i-1}).$$

• Such forecasting with autoregressive models is at the heart of “generative AI.”

Greedy search

• Naive idea: Sequentially find the highest-probability prediction, one step at a time:

$$\hat{y}_i = \arg\max_{y_i} \hat{P}(y_i \mid x, \hat{y}_1, \ldots, \hat{y}_{i-1}).$$

• This is known as greedy search.

• Problem:
This does not take into account the impact of the choice of $\hat{y}_i$
on the availability of high-probability choices at later steps.

• Dynamic programming problem!
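
The problem with greedy search can be seen in a toy example. The sketch below uses a hypothetical table of next-token probabilities (the input x is held fixed and left implicit) and compares greedy search with an exhaustive search over all sequences of length two. The table, the token names, and the function names are made up for illustration.

```python
# Hypothetical next-token probabilities, keyed by the prefix generated so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"c": 0.55, "d": 0.45},
    ("B",): {"c": 0.9, "d": 0.1},
}

def greedy_search(model, length):
    """Pick the highest-probability next token at every step."""
    prefix, prob = (), 1.0
    for _ in range(length):
        dist = model[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

def exhaustive_search(model, length):
    """Score every sequence of the given length and return the best one."""
    candidates = [((), 1.0)]
    for _ in range(length):
        candidates = [(p + (t,), pr * q)
                      for p, pr in candidates
                      for t, q in model[p].items()]
    return max(candidates, key=lambda c: c[1])

greedy = greedy_search(MODEL, 2)
best = exhaustive_search(MODEL, 2)
print(greedy)
print(best)
```

Greedy search commits to "A" because it is the most likely first token, and ends with a sequence of probability 0.6 · 0.55 = 0.33. Starting with the less likely "B" leads to a sequence of probability 0.4 · 0.9 = 0.36.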

Beam search

• Exhaustive search of the tree of possible sequences is too costly.

• Compromise: Beam search.


1. Start with the k highest-probability choices for $\hat{y}_1$.
2. For each of these choices separately,
find the k highest-probability choices for $\hat{y}_2$.
3. Of the resulting candidate sequences $(\hat{y}_1, \hat{y}_2)$, keep the k with the highest joint probability
and discard the rest.
4. Iterate over the remaining positions.
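
The steps above can be sketched as follows, using a hypothetical table of next-token probabilities in which the input x is held fixed and left implicit. The table, the token names, and the function name are assumptions for illustration; the sketch generates sequences of a fixed length rather than stopping at an end-of-sequence token.

```python
# Hypothetical next-token probabilities, keyed by the prefix generated so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"c": 0.55, "d": 0.45},
    ("B",): {"c": 0.9, "d": 0.1},
}

def beam_search(model, length, k):
    """Keep the k most probable partial sequences at every step."""
    beam = [((), 1.0)]
    for _ in range(length):
        expanded = []
        for prefix, prob in beam:
            dist = model[prefix]
            # The k highest-probability continuations of this prefix.
            top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
            expanded += [(prefix + (t,), prob * q) for t, q in top]
        # Keep the k candidates with the highest joint probability.
        beam = sorted(expanded, key=lambda c: c[1], reverse=True)[:k]
    return beam

wide = beam_search(MODEL, 2, k=2)
narrow = beam_search(MODEL, 2, k=1)
print(wide)
print(narrow)
```

With k = 1 the procedure reduces to greedy search and returns ("A", "c"). With k = 2 the prefix "B" survives the first step, so the more probable sequence ("B", "c") is found.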

References

Dan Jurafsky and James H. Martin (2023). Speech and Language Processing, chapters 10-11.
