Machine learning
1 Introduction
1.1 What is ML?
The successes of ML
1.1.1 Algorithmic advances
• New neural network architectures to handle many kinds of data (text, vision, sound, physical data, etc.)
• New training techniques and optimization algorithms
• New recipes for training these models at large scale
1.1.2 Computational advances and open source
• Use of efficient hardware to scale up models: GPUs, which perform many matrix multiplications very quickly.
• Development of open-source tools that are easy to use for researchers who are not coding experts. They make it easy to implement complicated neural networks, scale them, and compute gradients. Examples: PyTorch, JAX.
• You can code a small GPT-style model and train it in about 1k lines of code (cf. nanoGPT).
1.1.3 ML breakthroughs
ML and deep learning have enabled many breakthroughs in the last 10 years:
• Computer vision (image processing, image recognition, segmentation, etc.): most computer vision systems use neural networks. Also, automatic image generation.
• Natural language processing: automatic translation and text generation.
• Chemistry: protein folding prediction (AlphaFold), which was awarded a Nobel Prize this year.
1.2 Two examples of modern machine learning success and their detailed
pipelines
1.2.1 Neural networks on imagenet
• We have a dataset of 1M images; each image belongs to one class: cat, dog, boat, etc.
• Goal: train an algorithm that takes as input an image and outputs a predicted class.
• Mindset: we have an expressive function that takes images and outputs classes. We want to learn its parameters. Everything is learned from data: the only thing the researcher specifies is the architecture of the network.
• SGD training: feed a few images to the network, penalize errors.
• And that’s it.
More formally, we have a dataset of images (x1, . . . , xn), where here n = 10^6.
Q: how do you represent an image in a computer?
Each image xi is represented as a 3-D tensor of dimension 3 × l × h, where l is the length and h the width. This is a raw representation of the image. We call X this input space. Each image also comes with a label y1, . . . , yn, where each yi is the corresponding class of the image: yi ∈ {cat, dog, shoe, boat, . . .}.
There are k different classes. This is a classification problem. The researcher then designs a neural
network, that is a function of two variables
f (x; θ)
where x are inputs belonging to X , θ ∈ Rp are parameters. A very important property of this function
is that it is by design differentiable with respect to θ. How to design such functions will be the topic of
future courses.
Q: what should be the output space of the neural network, having classification in mind?
Classically, the neural network outputs one value per class: f (x; θ) = z ∈ Rk. The idea is that the coordinate zj should be large if the input x belongs to class j, and small otherwise.
Q: how do you make a decision with such a network?
The neural network is trained by finding good parameters θ, so that the expected behavior above happens.
We therefore need a way to measure how well the output z corresponds to the class y of the sample.
Hence, we need to define a cost function ℓ(z, y) ∈ R.
Q: can you think of a way of doing so?
Usually, we use the cross-entropy loss:

ℓ(z, y) = log(Σ_j exp(zj)) − zy
Q: can you show that this is a good loss? I.e., is it ≥ 0? When does it equal 0? Show that it is shift-invariant.
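As a quick numerical check (a minimal NumPy sketch, not part of the course material), the properties above can be verified directly:

```python
import numpy as np

def cross_entropy(z, y):
    """Cross-entropy loss for one sample: log-sum-exp of the scores minus the score of the true class y."""
    z = z - z.max()  # shift invariance lets us subtract the max for numerical stability
    return np.log(np.sum(np.exp(z))) - z[y]

z = np.array([2.0, -1.0, 0.5])
loss = cross_entropy(z, y=0)
assert loss >= 0.0                                   # the loss is nonnegative
assert np.isclose(cross_entropy(z + 10.0, 0), loss)  # adding a constant to all scores changes nothing
```

The loss only approaches 0 in the limit where zy dominates all the other coordinates.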
Hence, we have a way to tell how well the neural net with parameters θ performs on the image xi: it is simply ℓ(f(xi; θ), yi).
Q: how do you measure the performance of the neural network on the full dataset?
We then find the optimal θ by empirical risk minimization:

min_θ L(θ) = (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi).
This is done using classical optimization algorithms like SGD, more on that later. Importantly, we made
the architecture differentiable to be able to compute the gradient of the loss function; how to compute
gradients through a neural network is another important topic that we will discuss later.
Once you have trained the model, i.e. found θ that approximately minimizes the loss, you can use the
model for inference. We test the model on held-out data, computing the test accuracy and test loss.
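To make the pipeline concrete, here is a minimal sketch of empirical risk minimization with SGD, where the deep network is replaced by a simple linear model f(x; θ) = θᵀx on synthetic data (all sizes and names here are illustrative, not the actual ImageNet setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 5, 3                               # tiny synthetic "dataset": n samples, p features, k classes
X = rng.normal(size=(n, p))
y = (X @ rng.normal(size=(p, k))).argmax(axis=1)  # labels produced by a hidden linear model

def loss_and_grad(theta, xb, yb):
    z = xb @ theta                                # scores, shape (batch, k)
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(yb)), yb]).mean()  # average cross-entropy
    probs[np.arange(len(yb)), yb] -= 1.0          # gradient of the cross-entropy w.r.t. the scores
    return loss, xb.T @ probs / len(yb)

theta = np.zeros((p, k))
for step in range(500):
    idx = rng.integers(0, n, size=32)             # feed a few samples (a mini-batch) to the model
    loss, grad = loss_and_grad(theta, X[idx], y[idx])
    theta -= 0.5 * grad                           # SGD step with step size 0.5

train_acc = ((X @ theta).argmax(axis=1) == y).mean()
```

On this toy realizable problem the training accuracy climbs well above chance; on ImageNet, the same loop runs with a deep network and automatic differentiation.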
History of ImageNet: before neural nets, ~25% top-5 error. First neural net (2012): ~15%. Now: around 1%.
1.2.2 Training large language models like GPT
Large language models are trained in a very similar fashion. The goal is to understand how natural language works. To do so, we have access to billions of sentences on the internet.
Q: what is a good supervised task to learn how natural language works?
Usually, these models are trained by doing next-token prediction: the input xi is now a sentence, and the
output target y is the following word – or token – in the sentence.
Q: how can we represent text in the computer? it should be in a vectorized form.
Each sentence is then transformed into a sequence of vectors: xi = [v1, . . . , vq], where each vector corresponds to a word, and the target y is a single word in the vocabulary of size 60k. The length of the sequences varies, so we need an architecture that can take arbitrary-length sequences:
z = f ([v1 , . . . , vq ]; θ)
More on how to build such architectures later.
Then, the network is trained once again by empirical risk minimization with the cross-entropy loss. At inference, the network takes a sequence of inputs [v1, . . . , vq] and then constructs the next token.
Q: how to construct the next token? what if you want randomness?
We usually compute z = f([v1, . . . , vq]; θ) and then sample from the distribution softmax(z/t), where t is a temperature: if t goes to 0, the generation becomes deterministic (the argmax); if t is high, the distribution flattens and the network generates more random tokens.
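A minimal sketch of temperature sampling (NumPy, with illustrative names):

```python
import numpy as np

def sample_next_token(z, t=1.0, rng=None):
    """Sample a token index from softmax(z / t)."""
    rng = rng or np.random.default_rng()
    z = z / t
    z = z - z.max()                        # softmax is shift-invariant; improves stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(z), p=probs)

z = np.array([3.0, 1.0, 0.0])
assert sample_next_token(z, t=1e-6) == 0   # t -> 0: always picks the argmax (deterministic)
```

With a large t the probabilities flatten toward uniform, so any of the tokens can be sampled.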
Q: how to generate long sentences?
We append the generated token vq+1 to the sequence and feed [v1, . . . , vq+1] into the LLM, repeating this process autoregressively.
1.3 Machine learning and optimization
1.3.1 Empirical risk minimization
As we have seen, empirical risk minimization is a classical way to solve a ML problem. It means that we
need to be able to minimize a function. This is called optimization.
We consider a function L : Rp → R, and the problem min_θ L(θ). Usually in ML the function L is differentiable, i.e. it has a gradient ∇L(θ) such that

L(θ + dθ) = L(θ) + ⟨∇L(θ), dθ⟩ + o(∥dθ∥)
A simple idea to minimize L is to proceed iteratively, building a sequence θt that hopefully converges to
the solution.
Q: At θt , in which direction does the function decrease the most?
It is therefore natural to follow the direction opposite to the gradient; this is gradient descent:

θt+1 = θt − ρ∇L(θt)
note: we will later see how to proceed when the function is not necessarily differentiable, e.g. if there is an ℓ1 norm in it.
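The iteration above can be sketched in a few lines (a toy quadratic with a hand-coded gradient, purely for illustration):

```python
import numpy as np

def gradient_descent(grad, theta0, rho=0.1, n_steps=100):
    """Iterate theta_{t+1} = theta_t - rho * grad(theta_t)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - rho * grad(theta)
    return theta

# Minimize L(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3).
theta = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=np.zeros(2))
assert np.allclose(theta, 3.0)  # converges to the minimizer
```

The step size ρ matters: too large and the iterates diverge, too small and convergence is slow.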
1.3.2 The example of least squares
We consider a regression problem, where inputs are x1 , . . . , xn ∈ Rp , and the targets are y1 , . . . , yn ∈ R.
We consider a simple linear model:
y = ⟨θ, x⟩ + ε, ε ∼ N (0, σ 2 )
Q: what is the probability of observing y given θ and x?
We have that y|x, θ ∼ N (⟨θ, x⟩, σ²), i.e. p(y|x, θ) is the density of N (0, σ²) evaluated at y − ⟨θ, x⟩.
In order to find the best parameters, we therefore maximize this probability (maximum likelihood), and find that the corresponding optimization problem is

min_θ (1/(2n)) Σ_{i=1}^n (⟨θ, xi⟩ − yi)²
Q: what kind of function is it? What is the solution?
It is a simple quadratic function, minimized when the gradient vanishes: the solution satisfies the normal equations (Σ_{i=1}^n xi xiᵀ) θ = Σ_{i=1}^n yi xi.
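This can be checked numerically: stacking the inputs into a matrix X, the minimizer solves XᵀXθ = Xᵀy (a NumPy sketch on synthetic data; sizes and the true parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                       # rows are the inputs x1, ..., xn
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + 0.01 * rng.normal(size=n)    # y = <theta, x> + small Gaussian noise

# Setting the gradient of (1/(2n)) sum (<theta, xi> - yi)^2 to zero gives X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(theta_hat, theta_star, atol=0.05)  # recovered up to the noise level
```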
2 Unsupervised learning
Assume that we have access to a set of unlabeled samples x1 , . . . , xn . We do not have access to any label,
yet we want to extract information about this dataset. We therefore want to understand the structure
of this data.
This is very common in practice because annotating data is usually costly. We will explore two classical methods to do so.
2.1 Clustering: k-means
We want to group examples in the dataset: similar points should belong to the same cluster.
Draw pictures in 2d:
• Case where there is an obvious clustering
• Case where there is none
Formally, we want to split the data into k clusters, i.e. build a selection function ϕ : Rp → {1, . . . , k}.
Q: how would you do this?
We take a list of clusters C1 , . . . , Ck , such that ∪l Cl = {1, . . . , n}.
Q: How would you define the quality of a clustering? We want points in the same cluster to be close:

L(C1, . . . , Ck) = Σ_{l=1}^k (1/(2#Cl)) Σ_{i,j∈Cl} ∥xi − xj∥²
Problem: this is a combinatorial problem; solving it requires exploring all the possible partitions... too hard to do.
Idea: have a list of k centroids b1 , . . . , bk ∈ Rp .
Q: if the centroids are fixed, how do you assign a sample x to a cluster?
Simply take ϕ(x) = arg min_i ∥x − bi∥.
Q: how can you rewrite the loss function with the centroids?

L(C1, . . . , Ck) = Σ_{l=1}^k Σ_{i∈Cl} ∥xi − bl∥²
Q: how can you minimize this loss?
Alternately update the clusters and the centroids.
Q: define the function of both centroids and clusters

G(C1, . . . , Ck, b1, . . . , bk) = Σ_{l=1}^k Σ_{i∈Cl} ∥xi − bl∥²
What can you say about L and G?

min_{b1,...,bk} G = L
Q: how do you minimize G wrt. b?
Simply take bl = (1/#Cl) Σ_{i∈Cl} xi
Q: how do you minimize G wrt clusters?
Simply take Cl = {i| ∥xi − bl ∥ is minimal}.
Therefore, we have an alternate minimization scheme for the k-means problem.
Algorithm (Lloyd):
• Initialize centroids b1 , . . . , bk (Q: how??)
• iterate:
– Assign clusters Cl = {i| ∥xi − bl ∥ is minimal}
– Update centroids
Does this algorithm converge? The loss decreases at every step, but the algorithm can get stuck at bad stationary points (e.g. if all the points end up in one cluster...).
Q: what is the complexity of the algorithm? Each iteration costs O(n × k × p).
Q: how to estimate the number of clusters?
Draw an example on the board; look for a knee in the loss function.
Q: why the ℓ2 distance?
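Lloyd's algorithm can be sketched in a few lines of NumPy (the initialization on two chosen samples below is illustrative; in practice random samples or k-means++ are common choices):

```python
import numpy as np

def kmeans(X, b_init, n_iter=20):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    b = b_init.copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (cost n * k * p).
        dists = ((X[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for l in range(len(b)):
            if (labels == l).any():
                b[l] = X[labels == l].mean(axis=0)
    return labels, b

# Two well-separated blobs; initialize one centroid in each.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, b = kmeans(X, b_init=X[[0, -1]])
assert (labels[:50] == labels[0]).all() and (labels[50:] == labels[-1]).all()
```

With a bad initialization the same code can converge to a poor clustering, which is why initialization matters.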
2.2 Principal component analysis
k-means searches for groups in the data; PCA is a way to find a low-rank representation of the data.
We have a dataset x1 , . . . , xn ∈ Rp . For simplicity, assume zero-mean.
We want to find one direction of space u ∈ Rp that explains the dataset as much as possible, i.e., such that there are factors zi ∈ R such that
xi ≃ zi u
Q: is the problem well posed?
There is a scale indeterminacy, so we impose ∥u∥ = 1.
Q: geometrically, what does it mean?
Points are almost on a 1-D line through the origin.
Q: how can you find the coefficients zi given u? how to compute u?
zi = arg min_z ∥xi − zu∥²
and then simply

min_{u,z} (1/n) Σ_{i=1}^n ∥xi − zi u∥²
Q: what is the solution?
zi = ⟨xi, u⟩, so the problem is equivalent to

max_{∥u∥=1} Σ_{i=1}^n ⟨xi, u⟩²

i.e., variance maximization.
In other words,

max_{∥u∥=1} uᵀ C u

where C = (1/n) Σ_{i=1}^n xi xiᵀ is the covariance matrix.
The solution u is the eigenvector of the covariance matrix associated with the largest eigenvalue.
Q: How to extend this to more ranks? Simply consider yi = xi − zi u, and then fit a PCA on this (deflation). This corresponds to the second largest eigenvalue of the covariance.
Q: how would you use this for data visualization?
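The derivation above can be checked numerically: on nearly one-dimensional data, the top eigenvector of the covariance matrix recovers the direction u (a NumPy sketch with a made-up direction and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
u_true = np.array([3.0, 4.0]) / 5.0                          # unit-norm direction to recover
z = rng.normal(size=200)                                     # factors z_i
X = np.outer(z, u_true) + 0.01 * rng.normal(size=(200, 2))   # x_i ~ z_i u + small noise
X = X - X.mean(axis=0)                                       # center the data

C = X.T @ X / len(X)                                         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                         # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                                           # eigenvector of the largest eigenvalue
assert np.isclose(abs(u @ u_true), 1.0, atol=1e-3)           # aligned with u_true, up to sign
```

The sign ambiguity is expected: u and −u span the same direction.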
2.3 SVD
We can write any matrix X ∈ Rn×p, with n ≥ p, as X = U Σ Vᵀ with U ∈ Rn×p s.t. UᵀU = Ip, Σ diagonal with nonnegative entries, and V ∈ Op.
The principal directions are the columns of V, i.e., the rows of Vᵀ.
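The link with PCA can be verified directly: the right singular vectors of the centered data matrix are the eigenvectors of the covariance (a small NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)                               # center the data

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) V^T
C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)                 # ascending eigenvalues
# Top right singular vector = top eigenvector of the covariance, up to sign.
assert np.isclose(abs(Vt[0] @ eigvecs[:, -1]), 1.0)
# Squared singular values / n = eigenvalues of the covariance.
assert np.allclose(np.sort(s**2 / len(X)), eigvals)
```

In practice, computing the SVD of X is the standard, numerically stable way to run PCA.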