Machine learning
1 Introduction
1.1 What is ML?
The successes of ML
1.1.1 Algorithmic advances
• New neural network architectures to handle many kinds of data (text, vision, sound, physical data, etc.)
• New training techniques and optimization algorithms
• New recipes for training these models at large scale
1.1.2 Computational advances and open source
• Use of efficient hardware to scale up models: GPUs, which perform many matrix multiplications very quickly.
• Development of open-source tools that are easy to use for researchers who are not coding experts. They make it easy to implement complicated neural networks, scale them, and compute gradients. Examples: PyTorch, JAX.
• You can code a small GPT-style model and train it in about 1k lines of code (cf. nanoGPT).
1.1.3 ML breakthroughs
ML and deep learning have enabled many breakthroughs in the last 10 years:
• Computer vision (image processing, image recognition, segmentation, etc.): most computer vision systems use neural networks. Also, automatic image generation.
• Natural language processing: automatic translation and text generation.
• Chemistry: protein folding prediction (AlphaFold), which was awarded a Nobel Prize this year.
1.2 Two examples of modern machine learning success and their detailed
pipelines
1.2.1 Neural networks on imagenet
• We have a dataset of 1M images; each image belongs to one class: cat, dog, boat, etc.
• Goal: train an algorithm that takes as input an image and outputs a predicted class.
• Mindset: we have an expressive function that takes images and outputs classes. We want to learn its parameters. Everything is learned from data: the only thing the researcher specifies is the architecture of the network.
• SGD training: feed a few images to the network, penalize errors.
• And that’s it.
More formally, we have a dataset of images (x1, . . . , xn), where here n = 10^6.
Q: how do you represent an image in a computer?
Each image xi is represented as a 3-D tensor of dimension 3 × l × h, where l is the length and h the width. This is a raw representation of the image. We call X this input space. Each image also comes with a label y1, . . . , yn, where each yi is the corresponding class of the image: yi ∈ {cat, dog, shoe, boat, . . .}.
There are k different classes. This is a classification problem. The researcher then designs a neural
network, that is a function of two variables
f (x; θ)
where x are inputs belonging to X , θ ∈ Rp are parameters. A very important property of this function
is that it is by design differentiable with respect to θ. How to design such functions will be the topic of
future courses.
Q: what should be the output space of the neural network, having classification in mind?
Classically, the neural network outputs one value per class: f (x; θ) = z ∈ Rk. The idea is that the coordinate zj should be large if the input x belongs to class j, and small otherwise.
Q: how do you make a decision with such a network?
The neural network is trained by finding good parameters θ, so that the expected behavior above happens.
We therefore need a way to measure how well the output z corresponds to the class y of the sample.
Hence, we need to define a cost function ℓ(z, y) ∈ R.
Q: can you think of a way of doing so?
Usually, we use the cross-entropy loss:

ℓ(z, y) = log(Σ_j exp(zj)) − zy
Q: can you show that this is a good loss? I.e., is it ≥ 0? When does it equal 0? Show that it is shift-invariant.
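As a quick numerical check (a minimal NumPy sketch, not part of the course material), the properties above can be verified directly:

```python
import numpy as np

def cross_entropy(z, y):
    """Cross-entropy loss for one sample: log-sum-exp of the scores minus the score of the true class y."""
    z = z - z.max()  # shift invariance lets us subtract the max for numerical stability
    return np.log(np.sum(np.exp(z))) - z[y]

z = np.array([2.0, -1.0, 0.5])
loss = cross_entropy(z, y=0)
assert loss >= 0.0                                   # the loss is nonnegative
assert np.isclose(cross_entropy(z + 10.0, 0), loss)  # adding a constant to all scores changes nothing
```

The loss only approaches 0 in the limit where zy dominates all the other coordinates.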
Hence, we have a way to tell how well the neural net with parameters θ performs on the image xi: it is simply ℓ(f(xi; θ), yi).
Q: how do you measure the performance of the neural network on the full dataset?
We then find the optimal θ by empirical risk minimization:

min_θ L(θ) = (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi).
This is done using classical optimization algorithms like SGD, more on that later. Importantly, we made
the architecture differentiable to be able to compute the gradient of the loss function; how to compute
gradients through a neural network is another important topic that we will discuss later.
Once you have trained the model, i.e. found θ that approximately minimizes the loss, you can use the
model for inference. We test the model on held-out data, computing the test accuracy and test loss.
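To make the pipeline concrete, here is a minimal sketch of empirical risk minimization with SGD, where the deep network is replaced by a simple linear model f(x; θ) = θᵀx on synthetic data (all sizes and names here are illustrative, not the actual ImageNet setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 5, 3                               # tiny synthetic "dataset": n samples, p features, k classes
X = rng.normal(size=(n, p))
y = (X @ rng.normal(size=(p, k))).argmax(axis=1)  # labels produced by a hidden linear model

def loss_and_grad(theta, xb, yb):
    z = xb @ theta                                # scores, shape (batch, k)
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(yb)), yb]).mean()  # average cross-entropy
    probs[np.arange(len(yb)), yb] -= 1.0          # gradient of the cross-entropy w.r.t. the scores
    return loss, xb.T @ probs / len(yb)

theta = np.zeros((p, k))
for step in range(500):
    idx = rng.integers(0, n, size=32)             # feed a few samples (a mini-batch) to the model
    loss, grad = loss_and_grad(theta, X[idx], y[idx])
    theta -= 0.5 * grad                           # SGD step with step size 0.5

train_acc = ((X @ theta).argmax(axis=1) == y).mean()
```

On this toy realizable problem the training accuracy climbs well above chance; on ImageNet, the same loop runs with a deep network and automatic differentiation.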
History of ImageNet: before neural nets, ~25% top-5 error. First neural net (2012): ~15%. Now: around 1%.
1.2.2 Training large language models like GPT
Large language models are trained in a very similar fashion. The goal is to understand how natural language works. To do so, we have access to billions of sentences on the internet.
Q: what is a good supervised task to learn how natural language works?
Usually, these models are trained by doing next-token prediction: the input xi is now a sentence, and the
output target y is the following word – or token – in the sentence.
Q: how can we represent text in the computer? it should be in a vectorized form.
Each sentence is then transformed into a sequence of vectors: xi = [v1, . . . , vq], where each vector corresponds to a word, and the target y is a single word in the vocabulary of size 60k. The length of the sequences varies, so we need an architecture that can take arbitrary-length sequences:
z = f ([v1 , . . . , vq ]; θ)
More on how to build such architectures later.
Then, the network is trained once again by empirical risk minimization with the cross-entropy loss. At inference, the network takes a sequence of inputs [v1, . . . , vq] and then constructs the next token.
Q: how to construct the next token? what if you want randomness?
We usually compute z = f([v1, . . . , vq]; θ) and then sample from the distribution softmax(z/t), where t is a temperature: if t goes to 0, the generation becomes deterministic (the argmax); if t is high, the distribution flattens and the network generates more random tokens.
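A minimal sketch of temperature sampling (NumPy, with illustrative names):

```python
import numpy as np

def sample_next_token(z, t=1.0, rng=None):
    """Sample a token index from softmax(z / t)."""
    rng = rng or np.random.default_rng()
    z = z / t
    z = z - z.max()                        # softmax is shift-invariant; improves stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(z), p=probs)

z = np.array([3.0, 1.0, 0.0])
assert sample_next_token(z, t=1e-6) == 0   # t -> 0: always picks the argmax (deterministic)
```

With a large t the probabilities flatten toward uniform, so any of the tokens can be sampled.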
Q: how to generate long sentences?
We append the generated token vq+1 to the sequence and feed [v1, . . . , vq+1] into the LLM, repeating this process autoregressively.
1.3 Machine learning and optimization
1.3.1 Empirical risk minimization
As we have seen, empirical risk minimization is a classical way to solve a ML problem. It means that we
need to be able to minimize a function. This is called optimization.
We consider a function L : Rp → R, and the problem min_θ L(θ). Usually in ML the function L is differentiable, i.e. it has a gradient ∇L(θ) such that

L(θ + dθ) = L(θ) + ⟨∇L(θ), dθ⟩ + o(∥dθ∥)
A simple idea to minimize L is to proceed iteratively, building a sequence θt that hopefully converges to
the solution.
Q: At θt , in which direction does the function decrease the most?
It is therefore natural to follow the direction opposite to the gradient; this is gradient descent:

θt+1 = θt − ρ∇L(θt)
note: we will later see how to proceed when the function is not necessarily differentiable, e.g. if there is an ℓ1 norm in it.
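The iteration above can be sketched in a few lines (a toy quadratic with a hand-coded gradient, purely for illustration):

```python
import numpy as np

def gradient_descent(grad, theta0, rho=0.1, n_steps=100):
    """Iterate theta_{t+1} = theta_t - rho * grad(theta_t)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - rho * grad(theta)
    return theta

# Minimize L(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3).
theta = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=np.zeros(2))
assert np.allclose(theta, 3.0)  # converges to the minimizer
```

The step size ρ matters: too large and the iterates diverge, too small and convergence is slow.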
1.3.2 The example of least squares
We consider a regression problem, where inputs are x1 , . . . , xn ∈ Rp , and the targets are y1 , . . . , yn ∈ R.
We consider a simple linear model:
y = ⟨θ, x⟩ + ε, ε ∼ N (0, σ 2 )
Q: what is the probability of observing y given θ and x?
We have that y|x, θ ∼ N (⟨θ, x⟩, σ²), i.e. p(y|x, θ) is the density of N (0, σ²) evaluated at y − ⟨θ, x⟩.
In order to find the best parameters, we therefore maximize this probability (maximum likelihood), and find that the corresponding optimization problem is

min_θ (1/(2n)) Σ_{i=1}^n (⟨θ, xi⟩ − yi)²
Q: what kind of function is it? What is the solution?
It is a simple quadratic function, minimized when the gradient vanishes: the solution satisfies the normal equations (Σ_{i=1}^n xi xiᵀ) θ = Σ_{i=1}^n yi xi.
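This can be checked numerically: stacking the inputs into a matrix X, the minimizer solves XᵀXθ = Xᵀy (a NumPy sketch on synthetic data; sizes and the true parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                       # rows are the inputs x1, ..., xn
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.01 * rng.normal(size=n)    # y = <theta, x> + small Gaussian noise

# Setting the gradient of (1/(2n)) sum (<theta, xi> - yi)^2 to zero gives X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(theta_hat, theta_star, atol=0.05)  # recovered up to the noise level
```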
2 Unsupervised learning
Assume that we have access to a set of unlabeled samples x1 , . . . , xn . We do not have access to any label,
yet we want to extract information about this dataset. We therefore want to understand the structure
of this data.
This is very common in practice because annotating data is usually costly. We will explore two classical methods to do so.
2.1 Clustering: k-means
We want to group examples in the dataset: similar points should belong to the same cluster.
Draw pictures in 2d:
• Case where there is an obvious clustering
• Case where there is none
Formally, we want to split the data into k clusters, i.e. build a selection function ϕ : Rp → {1, . . . , k}.
Q: how would you do this?
We take a list of clusters C1 , . . . , Ck , such that ∪l Cl = {1, . . . , n}.
Q: How would you define the quality of a clustering? We want points in the same cluster to be close:

L(C1, . . . , Ck) = Σ_{l=1}^k (1/(2#Cl)) Σ_{i,j∈Cl} ∥xi − xj∥²
Problem: this is a combinatorial problem; solving it requires exploring all the possible partitions... too hard to do.
Idea: have a list of k centroids b1 , . . . , bk ∈ Rp .
Q: if the centroids are fixed, how do you assign a sample x to a cluster?
Simply take ϕ(x) = arg min_i ∥x − bi∥.
Q: how can you rewrite the loss function with the centroids?

L(C1, . . . , Ck) = Σ_{l=1}^k Σ_{i∈Cl} ∥xi − bl∥²
Q: how can you minimize this loss?
Alternately update the clusters and the centroids.
Q: define the function of both centroids and clusters

G(C1, . . . , Ck, b1, . . . , bk) = Σ_{l=1}^k Σ_{i∈Cl} ∥xi − bl∥²
What can you say about L and G?

min_{b1,...,bk} G = L
Q: how do you minimize G wrt. b?
Simply take bl = (1/#Cl) Σ_{i∈Cl} xi
Q: how do you minimize G wrt clusters?
Simply take Cl = {i| ∥xi − bl ∥ is minimal}.
Therefore, we have an alternate minimization scheme for the k-means problem.
Algorithm (Lloyd):
• Initialize centroids b1 , . . . , bk (Q: how??)
• iterate:
– Assign clusters Cl = {i| ∥xi − bl ∥ is minimal}
– Update centroids
Does this algorithm converge? The loss decreases at every step, but the algorithm can get stuck at bad stationary points (e.g. if all the points end up in one cluster...).
Q: what is the complexity of the algorithm? Each iteration costs O(n × k × p).
Q: how to estimate the number of clusters?
Draw an example on the board; look for a knee in the loss function.
Q: why the ℓ2 distance?
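Lloyd's algorithm can be sketched in a few lines of NumPy (the initialization on two chosen samples below is illustrative; in practice random samples or k-means++ are common choices):

```python
import numpy as np

def kmeans(X, b_init, n_iter=20):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    b = b_init.copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (cost n * k * p).
        dists = ((X[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for l in range(len(b)):
            if (labels == l).any():
                b[l] = X[labels == l].mean(axis=0)
    return labels, b

# Two well-separated blobs; initialize one centroid in each.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, b = kmeans(X, b_init=X[[0, -1]])
assert (labels[:50] == labels[0]).all() and (labels[50:] == labels[-1]).all()
```

With a bad initialization the same code can converge to a poor clustering, which is why initialization matters.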
2.2 Principal component analysis
k-means searches for groups in the data; PCA is a way to find a low-rank representation of the data.
We have a dataset x1 , . . . , xn ∈ Rp . For simplicity, assume zero-mean.
We want to find one direction of space u ∈ Rp that explains the dataset as much as possible, i.e., such that there are factors zi ∈ R such that
xi ≃ zi u
Q: is the problem well posed?
There is a scale indeterminacy, so we impose ∥u∥ = 1.
Q: geometrically, what does it mean?
Points are almost on a 1-D line through the origin.
Q: how can you find the coefficients zi given u? how to compute u?
zi = arg min_z ∥xi − zu∥²
and then simply

min_{u,z} (1/n) Σ_{i=1}^n ∥xi − zi u∥²
Q: what is the solution?
zi = ⟨xi, u⟩, so the problem is equivalent to

max_{∥u∥=1} Σ_{i=1}^n ⟨xi, u⟩²

i.e., variance maximization.
In other words,

max_{∥u∥=1} uᵀ C u

where C = (1/n) Σ_{i=1}^n xi xiᵀ is the covariance matrix.
The solution u is the eigenvector of the covariance matrix associated with the largest eigenvalue.
Q: How to extend this to more ranks? Simply consider yi = xi − zi u, and then fit a PCA on this (deflation). This corresponds to the second largest eigenvalue of the covariance.
Q: how would you use this for data visualization?
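The derivation above can be checked numerically: on nearly one-dimensional data, the top eigenvector of the covariance matrix recovers the direction u (a NumPy sketch with a made-up direction and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
u_true = np.array([3.0, 4.0]) / 5.0                          # unit-norm direction to recover
z = rng.normal(size=200)                                     # factors z_i
X = np.outer(z, u_true) + 0.01 * rng.normal(size=(200, 2))   # x_i ~ z_i u + small noise
X = X - X.mean(axis=0)                                       # center the data

C = X.T @ X / len(X)                                         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                         # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                                           # eigenvector of the largest eigenvalue
assert np.isclose(abs(u @ u_true), 1.0, atol=1e-3)           # aligned with u_true, up to sign
```

The sign ambiguity is expected: u and −u span the same direction.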
2.3 SVD
We can write any matrix X ∈ Rn×p, with n ≥ p, as X = U Σ Vᵀ with U ∈ Rn×p s.t. UᵀU = Ip, Σ diagonal with nonnegative entries, and V ∈ Op.
The principal directions are the columns of V, i.e., the rows of Vᵀ.
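The link with PCA can be verified directly: the right singular vectors of the centered data matrix are the eigenvectors of the covariance (a small NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)                               # center the data

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) V^T
C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)                 # ascending eigenvalues
# Top right singular vector = top eigenvector of the covariance, up to sign.
assert np.isclose(abs(Vt[0] @ eigvecs[:, -1]), 1.0)
# Squared singular values / n = eigenvalues of the covariance.
assert np.allclose(np.sort(s**2 / len(X)), eigvals)
```

In practice, computing the SVD of X is the standard, numerically stable way to run PCA.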