Federated Learning: From Theory to Practice

Dipl.-Ing. Alexander Jung∗

December 17, 2024

Abstract

This book discusses theory and algorithms for federated learning (FL). FL exploits similarities between local datasets to collaboratively train local (or personalized) models. Two core mathematical objects of this course are federated learning (FL) networks and generalized total variation minimization (GTVMin). We use an FL network as a mathematical model for clients that generate local datasets and train local models. GTVMin formulates FL as an instance of regularized empirical risk minimization (ERM). As the regularizer, we use quantitative measures for the variation of local model parameters across the edges of the FL network. We can obtain practical FL systems by applying distributed optimization methods to solve GTVMin.


AJ is currently Associate Professor for Machine Learning at Aalto University (Finland).
This work has been partially funded by the Academy of Finland (decision numbers 331197,
349966, 363624) and the European Union (grant number 952410).

Contents
1 Introduction 1
1.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Related Courses . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Main Goal of the Course . . . . . . . . . . . . . . . . . . . . . 7
1.6 Outline of the Course . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Assignments and Grading . . . . . . . . . . . . . . . . . . . . 10
1.8 Student Project . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9 Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.10 Ground Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.11 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.12 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 ML Basics 1
2.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.2 Three Components and a Design Principle . . . . . . . . . . . 1
2.3 Computational Aspects of ERM . . . . . . . . . . . . . . . . . 5
2.4 Statistical Aspects of ERM . . . . . . . . . . . . . . . . . . . . 6
2.5 Validation and Diagnosis of ML . . . . . . . . . . . . . . . . . 10
2.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 18

3 A Design Principle for FL 1


3.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3.2 Graphs and Their Laplacian . . . . . . . . . . . . . . . . . . . 2
3.3 Generalized Total Variation Minimization . . . . . . . . . . . 9
3.3.1 Computational Aspects of GTVMin . . . . . . . . . . . 11
3.3.2 Statistical Aspects of GTVMin . . . . . . . . . . . . . 13
3.4 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 16
3.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5.1 Proof of Proposition 3.1 . . . . . . . . . . . . . . . . . 18

4 Gradient Methods 1
4.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
4.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.3 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.4 When to Stop? . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.5 Perturbed Gradient Step . . . . . . . . . . . . . . . . . . . . . 10
4.6 Handling Constraints - Projected Gradient Descent . . . . . . 12
4.7 Generalizing the Gradient Step . . . . . . . . . . . . . . . . . 14
4.8 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 17

5 FL Algorithms 1
5.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
5.2 Gradient Step for GTVMin . . . . . . . . . . . . . . . . . . . 2
5.3 Message Passing Implementation . . . . . . . . . . . . . . . . 5
5.4 FedSGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.5 FedAvg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.6 Asynchronous Computation . . . . . . . . . . . . . . . . . . . 14
5.7 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 16

5.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.8.1 Proof of Proposition 5.1 . . . . . . . . . . . . . . . . . 18
5.8.2 Proof of Proposition 5.2 . . . . . . . . . . . . . . . . . 19

6 Some Main Flavours of FL 1


6.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2
6.2 Single-Model FL . . . . . . . . . . . . . . . . . . . . . . . . . 3
6.3 Clustered FL . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6.4 Horizontal FL . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6.5 Vertical FL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6.6 Personalized Federated Learning . . . . . . . . . . . . . . . . . 12
6.7 Few-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . 15
6.8 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 16
6.9 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6.9.1 Proof of Proposition 6.1 . . . . . . . . . . . . . . . . . 18

7 Graph Learning 1
7.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2
7.2 Edges are a Design Choice . . . . . . . . . . . . . . . . . . . . 2
7.3 Measuring (Dis-)Similarity Between Datasets . . . . . . . . . . 6
7.4 Graph Learning Methods . . . . . . . . . . . . . . . . . . . . . 9
7.5 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 13

8 Trustworthy FL 1
8.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
8.2 Seven Key Requirements by the EU . . . . . . . . . . . . . . . 2
8.2.1 KR1 - Human Agency and Oversight. . . . . . . . . . . 2

8.2.2 KR2 - Technical Robustness and Safety. . . . . . . . . 3
8.2.3 KR3 - Privacy and Data Governance. . . . . . . . . . . 4
8.2.4 KR4 - Transparency. . . . . . . . . . . . . . . . . . . . 5
8.2.5 KR5 - Diversity, Non-Discrimination and Fairness. . . . 6
8.2.6 KR6 - Societal and Environmental Well-Being. . . . . . 7
8.2.7 KR7 - Accountability. . . . . . . . . . . . . . . . . . . 8
8.3 Technical Robustness of FL Systems . . . . . . . . . . . . . . 8
8.3.1 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . 9
8.3.2 Estimation Error Analysis . . . . . . . . . . . . . . . . 10
8.3.3 Network Resilience . . . . . . . . . . . . . . . . . . . . 13
8.3.4 Stragglers . . . . . . . . . . . . . . . . . . . . . . . . . 14
8.4 Subjective Explainability of FL Systems . . . . . . . . . . . . 17
8.5 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 21

9 Privacy-Protection in FL 1
9.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
9.2 Measuring Privacy Leakage . . . . . . . . . . . . . . . . . . . 2
9.3 Ensuring Differential Privacy . . . . . . . . . . . . . . . . . . . 9
9.4 Private Feature Learning . . . . . . . . . . . . . . . . . . . . . 12
9.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
9.6 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 17
9.6.1 Where Are You? . . . . . . . . . . . . . . . . . . . . . 17
9.6.2 Ensuring Privacy with Pre-Processing . . . . . . . . . . 18
9.6.3 Ensuring Privacy with Post-Processing . . . . . . . . . 18
9.6.4 Private Feature Learning . . . . . . . . . . . . . . . . . 18

10 Data and Model Poisoning in FL 1
10.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2
10.2 Attack Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
10.3 Data Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . 6
10.4 Model Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . 7
10.5 Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
10.5.1 Denial-of-Service Attack . . . . . . . . . . . . . . . . . 8
10.5.2 Backdoor Attack . . . . . . . . . . . . . . . . . . . . . 8

Glossary 1

Lists of Symbols

Sets and Functions

a ∈ A : This statement indicates that the object a is an element of the set A.

a := b : This statement defines a to be shorthand for b.

|A| : The cardinality (number of elements) of a finite set A.

A ⊆ B : A is a subset of B.

A ⊂ B : A is a strict subset of B.

N : The natural numbers 1, 2, . . ..

R : The real numbers x [1].

R+ : The non-negative real numbers x ≥ 0.

R++ : The positive real numbers x > 0.

{0, 1} : The set that consists of the two real numbers 0 and 1.

[0, 1] : The closed interval of real numbers x with 0 ≤ x ≤ 1.

argmin_w f(w) : The set of minimizers for a real-valued function f(w).

S(n) : The set of unit-norm vectors in R^{n+1}.

log a : The logarithm of the positive number a ∈ R++.

h(·) : A → B : a ↦ h(a) : A function (map) that accepts any element a ∈ A from a set A as input and delivers a well-defined element h(a) ∈ B of a set B. The set A is the domain of the function h and the set B is the codomain of h. ML aims at finding (or learning) a function h (hypothesis) that reads in the features x of a data point and delivers a prediction h(x) for its label y.

∇f(w) : The gradient of a differentiable real-valued function f : R^d → R is the vector ∇f(w) = (∂f/∂w_1, . . . , ∂f/∂w_d)^T ∈ R^d [2, Ch. 9].
Matrices and Vectors
I_{l×d} : A generalized identity matrix with l rows and d columns. The entries of I_{l×d} ∈ R^{l×d} are equal to 1 along the main diagonal and equal to 0 otherwise.

I_d, I : A square identity matrix of size d × d. If the size is clear from the context, we drop the subscript.

R^d : The set of vectors x = (x_1, . . . , x_d)^T consisting of d real-valued entries x_1, . . . , x_d ∈ R.

x = (x_1, . . . , x_d)^T : A vector of length d with its j-th entry being x_j.

∥x∥_2 : The Euclidean (or ℓ2) norm of the vector x = (x_1, . . . , x_d)^T ∈ R^d, given as ∥x∥_2 := √(∑_{j=1}^d x_j²).

∥x∥ : Some norm of the vector x ∈ R^d [3]. Unless specified otherwise, we mean the Euclidean norm ∥x∥_2.

x^T : The transpose of a vector x that is considered a single-column matrix. The transpose is a single-row matrix (x_1, . . . , x_d).

X^T : The transpose of a matrix X ∈ R^{m×d}. A square real-valued matrix X ∈ R^{m×m} is called symmetric if X = X^T.

0 = (0, . . . , 0)^T : The vector in R^d with each entry equal to zero.

1 = (1, . . . , 1)^T : The vector in R^d with each entry equal to one.

(v^T, w^T)^T : The vector of length d + d′ obtained by concatenating the entries of vector v ∈ R^d with the entries of w ∈ R^{d′}.

span{B} : The span of a matrix B ∈ R^{a×b}, which is the subspace of all linear combinations of columns of B, span{B} = {Ba : a ∈ R^b} ⊆ R^a.

S^d_+ : The set of all positive semi-definite (psd) matrices of size d × d.

det(C) : The determinant of the matrix C.

A ⊗ B : The Kronecker product of A and B [4].
Probability Theory

The expectation of a function f (z) of a RV z whose prob-


Ep {f (z)} ability distribution is p(z). If the probability distribution
is clear from context we just write E{f (z)}.

A (joint) probability distribution of a RV whose realiza-


p(x, y)
tions are data points with features x and label y.

A conditional probability distribution of a RV x given the


p(x|y)
value of another RV y [5, Sec. 3.5].

A parametrized probability distribution of a RV x. The


probability distribution depends on a parameter vector
w. For example, p(x; w) could be a multivariate normal
p(x; w)
distribution with the parameter vector w given by the
entries
 of the mean vector E{x}
 and the covariance matrix
T
.

E x − E{x} x − E{x}

The probability distribution of a Gaussian RV x ∈ R


N (µ, σ 2 ) with mean (or expectation) µ = E{x} and variance σ 2 =
E (x − µ)2 .


The multivariate normal distribution of a vector-valued


N (µ, C) Gaussian RV x ∈ Rd with mean (or expectation) µ =
T
E{x} and covariance matrix C = E x − µ x − µ .
 

11
Machine Learning

r : An index r = 1, 2, . . . , that enumerates data points.

m : The number of data points in (the size of) a dataset.

D : A dataset D = {z^(1), . . . , z^(m)} is a list of individual data points z^(r), for r = 1, . . . , m.

d : Number of features that characterize a data point.

x_j : The j-th feature of a data point. The first feature of a given data point is denoted x_1, the second feature x_2, and so on.

x : The feature vector x = (x_1, . . . , x_d)^T of a data point whose entries are the individual features of a data point.

X : The feature space X is the set of all possible values that the features x of a data point can take on.

z : Beside the symbol x, we sometimes use z as another symbol to denote a vector whose entries are individual features of a data point. We need two different symbols to distinguish between raw and learnt features [6, Ch. 9].

x^(r) : The feature vector of the r-th data point within a dataset.

x_j^(r) : The j-th feature of the r-th data point within a dataset.

B : A mini-batch (subset) of randomly chosen data points.

B : The size of (the number of data points in) a mini-batch.

y : The label (quantity of interest) of a data point.

y^(r) : The label of the r-th data point.

(x^(r), y^(r)) : The features and label of the r-th data point.

Y : The label space Y of a ML method consists of all potential label values that a data point can carry. The nominal label space might be larger than the set of different label values arising in a given dataset (e.g., a training set). We refer to ML problems (methods) using a numeric label space, such as Y = R or Y = R^3, as regression problems (methods). ML problems (methods) that use a discrete label space, such as Y = {0, 1} or Y = {cat, dog, mouse}, are referred to as classification problems (methods).

η : Learning rate (step-size) used by gradient-based methods.

h(·) : A hypothesis map that reads in features x of a data point and delivers a prediction ŷ = h(x) for its label y.

Y^X : Given two sets X and Y, we denote by Y^X the set of all possible hypothesis maps h : X → Y.

H : A hypothesis space or model used by a ML method. The hypothesis space consists of different hypothesis maps h : X → Y between which the ML method has to choose.

d_eff(H) : The effective dimension of a hypothesis space H.

B² : The squared bias of a learnt hypothesis ĥ delivered by a ML algorithm that is fed with data points which are modelled as realizations of RVs. If data is modelled as realizations of RVs, also the delivered hypothesis ĥ is the realization of a RV.

V : The variance of the (parameters of the) hypothesis delivered by a ML algorithm. If the input data for this algorithm is interpreted as realizations of RVs, so is the delivered hypothesis a realization of a RV.

L((x, y), h) : The loss incurred by predicting the label y of a data point using the prediction ŷ = h(x). The prediction ŷ is obtained from evaluating the hypothesis h ∈ H for the feature vector x of the data point.

E_v : The validation error of a hypothesis h, which is its average loss incurred over a validation set.

L̂(h | D) : The empirical risk or average loss incurred by the hypothesis h on a dataset D.

E_t : The training error of a hypothesis h, which is its average loss incurred over a training set.

t : A discrete-time index t = 0, 1, . . . used to enumerate sequential events (or time instants).

t : An index that enumerates learning tasks within a multi-task learning problem.

α : A regularization parameter that controls the amount of regularization.

λ_j(Q) : The j-th eigenvalue (sorted either ascending or descending) of a psd matrix Q. We also use the shorthand λ_j if the corresponding matrix is clear from context.

σ(·) : The activation function used by an artificial neuron within an artificial neural network (ANN).

R_ŷ : A decision region within a feature space.

w : A parameter vector w = (w_1, . . . , w_d)^T of a model, e.g., the weights of a linear model or in an ANN.

h^(w)(·) : A hypothesis map that involves tunable model parameters w_1, . . . , w_d, stacked into the vector w = (w_1, . . . , w_d)^T.

ϕ(·) : A feature map ϕ : X → X′ : x ↦ x′ := ϕ(x) ∈ X′.
Federated Learning

G = (V, E) : A graph whose nodes i ∈ V carry local datasets and local models.

i ∈ V : A node in the FL network that represents a local dataset and a corresponding local model. It might also be useful to think of node i as a small computer that can collect data and execute computations to train ML models.

G^(C) : The induced sub-graph of G using the nodes in C ⊆ V.

L(G) : The Laplacian matrix of a graph G.

L^(C) : The Laplacian matrix of the induced graph G^(C).

D^(i) : The local dataset D^(i) at node i ∈ V of an FL network.

m_i : The number of data points (sample size) contained in the local dataset D^(i) at node i ∈ V.

N^(i) : The neighbourhood of the node i in a graph G.

d^(i) : The weighted degree d^(i) := ∑_{i′∈N^(i)} A_{i,i′} of node i in a graph G.

d_max^(G) : The maximum weighted node degree of a graph G.

x^(i,r) : The features of the r-th data point in the local dataset D^(i).

y^(i,r) : The label of the r-th data point in the local dataset D^(i).

L^(d)(x, h(x), h′(x)) : The loss incurred by a hypothesis h′ on a data point with features x and label h(x) that is obtained from another hypothesis.

stack{w^(i)}_{i=1}^{n} : The vector ((w^(1))^T, . . . , (w^(n))^T)^T ∈ R^{dn} that is obtained by vertically stacking the local model parameters w^(i) ∈ R^d.
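To make the graph-related symbols concrete, the following sketch (an illustrative example that is not part of the original notes; the toy weight matrix is made up) computes the weighted node degrees d^(i) and the Laplacian matrix of a small FL network, using the standard definition L = D − A with D the diagonal matrix of weighted degrees:

```python
import numpy as np

# Symmetric weight matrix A of a toy FL network with n = 3 nodes;
# A[i, j] > 0 is the weight of edge {i, j}, A[i, j] = 0 means no edge.
A = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Weighted node degrees d^(i): sum of the edge weights at node i.
deg = A.sum(axis=1)          # [2., 3., 1.]

# Laplacian matrix L = D - A, with D = diag(deg).
L = np.diag(deg) - A

# The Laplacian of an undirected graph is psd; its smallest eigenvalue is 0.
eigvals = np.linalg.eigvalsh(L)
print(L)
print(eigvals)
```

The eigenvalues of L (sorted ascending by eigvalsh) reflect the connectivity of the FL network, which is why the Laplacian appears repeatedly in the analysis of GTVMin.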
1 Introduction
Welcome to the course CS-E4740 Federated Learning. While the course can be completed fully remotely, we strongly encourage you to attend the on-site events. Any on-site event will be recorded and made available to students via this YouTube channel.

The basic variant (5 credits) of this course consists of lectures (schedule here) and corresponding coding assignments (schedule here). We test your completion of the coding assignments via quizzes (implemented on the MyCourses page). You can upgrade the course to an extended variant (10 credits) by completing a student project (see Section 1.8).

1.1 Learning Goals

This lecture offers

• an introduction to the course topic and its position in the wider curriculum,

• a discussion of the learning goals, assignments, and student project,

• an overview of the course schedule.

1.2 Introduction

We are surrounded by devices, such as smartphones or wearables, that generate decentralized collections of local datasets [7–11]. These local datasets typically have an intrinsic network structure that arises from functional constraints (connectivity between devices) or statistical similarities.

For example, the management of pandemics uses contact networks to relate local datasets generated by patients. Network medicine relates data about diseases via co-morbidity networks [12]. Social science uses notions of acquaintance to relate data collected from befriended individuals [13]. Another example of network-structured data are the observations collected at Finnish Meteorological Institute (FMI) weather stations. Each FMI station generates a local dataset, and the local datasets of nearby stations tend to have similar statistical properties.
FL is an umbrella term for distributed optimization techniques to train
machine learning (ML) models from decentralized collections of local datasets
[14–18]. The idea is to carry out the computations, such as gradient steps
(see Lecture 4), arising during ML model training at the location of data
generation (such as your smartphone or a heart rate sensor). This design
philosophy is different from the basic ML workflow, which is to (i) collect
data at a single location (computer) and (ii) then train a single ML model on
this data.
It can be beneficial to train different ML models at the locations of actual
data generation [19] for several reasons:

• Privacy. FL methods are appealing for applications involving sensitive


data (such as healthcare) as they do not require the exchange of raw data
but only model parameters (or their updates) [16, 17]. By exchanging
only (updates of) model parameters, FL methods are considered privacy-
friendly in the sense of not leaking (too much) sensitive information
that is contained in the local datasets (see Lecture 9).

• Robustness. By relying on decentralized data and computation, FL methods offer robustness (to some extent) against hardware failures (such as “stragglers”) and cyber attacks (such as data poisoning, discussed in Lecture 10).

Figure 1.1: Left: A basic ML method uses a single dataset to train a single model. Right: Decentralized collection of devices (or clients) with the ability to generate data and train models.

• Parallel Computing. We can interpret a collection of interconnected devices as a parallel computer. One example of such a parallel computer is a mobile network constituted by smartphones that can communicate via radio links. This parallel computer allows us to speed up computational tasks, such as the computation of gradients required to train ML models (see Lecture 4).

• Democratization of ML. FL allows us to combine (or pool) the computational resources of many low-cost devices in order to train high-dimensional ML models (such as large language models (LLMs)). Such models are typically impossible to train on a single device (unless this device is a supercomputer).

• Trading Computation against Communication. Consider an FL application where local datasets are generated by low-complexity devices at remote locations (think of a wildlife camera) that cannot be easily accessed. The cost of communicating raw local datasets to some central unit (which then trains a single global ML model) might be much higher than the computational cost incurred by using the low-complexity devices to (partially) train ML models [20].

• Personalization. FL can be used to train personalized ML models for


collections of local datasets, which might be generated by smartphones
(and their users) [21]. A key challenge for ensuring personalization is the
heterogeneity of local datasets [22, 23]. Indeed, the statistical properties
of different local datasets might vary significantly such that they cannot
be well modelled as independent and identically distributed (i.i.d.).
Each local dataset induces a separate learning task that consists of
learning useful parameter values for a local model. This course discusses
FL methods to train personalized models via combining the information
carried in decentralized and heterogeneous data (see Lecture 6).

1.3 Prerequisites

The main mathematical structure used to study and design FL algorithms


is the Euclidean space Rd . We, therefore, expect some familiarity with the
algebraic and geometric structure of Rd . By algebraic structure, we mean
the (real) vector space obtained from the elements (“vectors”) in Rd , along
with the usual definitions of vector addition and multiplication by scalars in

R [24, 25]. We will heavily use concepts from linear algebra to represent and
manipulate data and ML models.
The metric structure of Rd is an excellent analysis tool for the (convergence)
behaviour of FL algorithms. In particular, we will study FL algorithms that
are obtained as fixed-point iterations of some non-linear operator on Rd .
These operators are defined by the data (distribution) and ML models used
within an FL system. A prime example of such a non-linear operator is the
gradient step of gradient-based methods (see Lecture 4). The computational
properties (such as convergence speed) of these FL algorithms can then be
characterized via the contraction properties of the underlying operator [26].
A main tool for the design of FL algorithms are variants of gradient descent (GD). The common idea of these gradient-based methods is to approximate a function f (x) locally by a linear function. This local linear approximation is determined by the gradient ∇f (x). We, therefore, expect some familiarity with multivariable calculus [2].
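As a concrete illustration of the gradient step viewed as a fixed-point iteration, here is a minimal sketch (with made-up numbers; the quadratic objective is purely illustrative) of plain gradient descent in Python:

```python
import numpy as np

# Toy smooth objective f(w) = (1/2) * ||A w - b||_2^2 with gradient
# grad f(w) = A^T (A w - b); its minimizer solves A w = b.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad_f(w):
    return A.T @ (A @ w - b)

# Gradient descent as a fixed-point iteration of the operator
# w -> w - eta * grad f(w), for a sufficiently small learning rate eta.
eta = 0.1
w = np.zeros(2)
for _ in range(200):
    w = w - eta * grad_f(w)

print(w)  # ≈ [0.5, 1.0], the solution of A w = b
```

For this choice of eta, the gradient-step operator is a contraction, so the iterates converge to the unique fixed point; the contraction factor governs the convergence speed, as discussed in Lecture 4.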

1.4 Related Courses

In what follows, we briefly explain how CS-E4740 relates to selected courses at Aalto University and the University of Helsinki.

• CS-EJ3211 - Machine Learning with Python. Teaches the application of basic ML methods using the Python package (library) scikit-learn [27]. CS-E4740 couples a network of basic ML methods using regularization techniques to obtain tailored (personalized) ML models for local datasets. This coupling is required to adaptively pool local datasets into a sufficiently large training set for each personalized ML model.

• Data Analysis with Python. Can be considered as a substitute for


CS-EJ3211.

• CS-E4510 - Distributed Algorithms. Teaches basic mathematical


tools for the study and design of distributed algorithms that are im-
plemented via distributed systems (computers) [28]. FL is enabled by
distributed algorithms to train ML models from decentralized data (see
Lecture 5).

• CS-C3240 - Machine Learning (spring 2022 edition). Teaches


basic theory of ML models and methods [6]. CS-E4740 combines the
components of basic ML methods, such as data representation and
models, with network models. In particular, instead of a single dataset
and a single model (such as a decision tree), we will study networks of
local datasets and local models.

• ABL-E2606 - Data Protection. Discusses important legal con-


straints (laws), including the European general data protection reg-
ulation (GDPR), for the use of data and, in turn, for the design of
trustworthy FL methods.

• MS-C2105 - Introduction to Optimization. Teaches basic opti-


misation theory and how to model applications as (linear, integer, and
non-linear) optimization problems. CS-E4740 uses optimization theory
and methods to formulate FL problems (see Lecture 3) and design FL
methods (see Lecture 5).

• ELEC-E5424 - Convex Optimization. Teaches advanced optimisa-
tion theory for the important class of convex optimization problems [29].
Convex optimization theory and methods can be used for the study and
design of FL algorithms.

• ELEC-E7120 - Wireless Systems. Teaches the fundamentals of radio


communications used in cellular and wireless systems. These systems
provide important computational infrastructures for the implementation
of FL algorithms (see Lecture 5).

1.5 Main Goal of the Course

The overarching goal of the course is to demonstrate how to apply concepts


from graph theory and mathematical optimization to analyze and design FL
algorithms. Students will learn to formulate a given FL application as an
optimization problem over an undirected graph G = (V, E) whose nodes i ∈ V
represent individual devices of an FL application. Each device i ∈ V generates
a local dataset and trains a local model. We refer to such a graph as the FL
network of a collection of devices (see Lecture 3).
This course uses only undirected FL networks with a finite number n of
nodes, which we identify with the first n positive integers:

V := {1, . . . , n}.

An edge {i, i′ } ∈ E in the FL network G connects two different devices if are


similar in a specific sense.1 We quantify the amount of similarity between
1
For example, two devices are similar if they generate data with similar statistical
properties.

7
two connected nodes i, i′ by a positive edge weight Ai,i′ > 0.
We can formalize an FL application as an optimization problem associated with an FL network,

min_{w^(i)} ∑_{i∈V} L_i(w^(i)) + α ∑_{{i,i′}∈E} A_{i,i′} d(w^(i), w^(i′)).    (1.1)

We refer to this problem as GTV minimization (GTVMin) and devote much of the course to its computational and statistical properties. The optimization variables w^(i) in (1.1) are the local model parameters at the nodes i ∈ V of an FL network. The objective function in (1.1) consists of two components: The first component is the sum, over all nodes i ∈ V, of the loss values L_i(w^(i)) incurred by the local model parameters at each node i. The second component is the sum of the discrepancies d(w^(i), w^(i′)) between local model parameters across the edges {i, i′} ∈ E of the FL network.
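To make the structure of (1.1) concrete, the following Python sketch evaluates the GTVMin objective for a toy FL network with three nodes. This is an illustrative example only: the local datasets are made up, and the choices of squared-error loss L_i and squared Euclidean distance as the discrepancy measure d are assumptions for this illustration.

```python
import numpy as np

# Toy FL network with n = 3 nodes and edges {1,2}, {2,3} (0-indexed),
# each edge carrying a positive similarity weight A_{i,i'}.
edges = [(0, 1, 2.0), (1, 2, 1.0)]   # (i, i', weight)

# Local datasets: each node i holds features/labels for a scalar linear model.
X = [np.array([[1.0], [2.0]]), np.array([[1.5]]), np.array([[3.0]])]
y = [np.array([1.1, 2.1]), np.array([1.4]), np.array([3.2])]

def local_loss(i, w):
    """Average squared-error loss L_i(w^(i)) on the local dataset of node i."""
    return np.mean((X[i] @ w - y[i]) ** 2)

def gtvmin_objective(W, alpha):
    """Objective (1.1): sum of local losses plus the alpha-weighted GTV term.

    W is a list of local parameter vectors w^(i); as discrepancy d we use
    the squared Euclidean distance between connected parameter vectors."""
    loss_term = sum(local_loss(i, W[i]) for i in range(len(W)))
    gtv_term = sum(a * np.sum((W[i] - W[j]) ** 2) for i, j, a in edges)
    return loss_term + alpha * gtv_term

W = [np.array([1.0]), np.array([0.9]), np.array([1.1])]
print(gtvmin_objective(W, alpha=0.5))   # ≈ 0.0525 for these toy parameters
```

For α = 0, each node trains its local model independently; increasing α forces the parameters of well-connected nodes to become similar, which is exactly the coupling mechanism discussed above.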

1.6 Outline of the Course

Our course is roughly divided into three parts:

• Part I: ML Refresher. Lecture 2 introduces data, models, and loss functions as the three main components of ML. This lecture also explains how these components are combined within ERM. We also discuss how regularization of ERM can be achieved via manipulating its three main components. We then explain when and how to solve regularized ERM via simple GD methods in Lecture 4. Overall, this part serves two main purposes: (i) to briefly recap basic concepts of ML in a simple centralized setting and (ii) to highlight ML techniques (such as regularization) that are particularly relevant for the design and analysis of FL methods.

• Part II: FL Theory and Methods. Lecture 3 introduces the FL
network as our main mathematical structure for representing collections
of local datasets and corresponding tailored models. The undirected
and weighted edges of the FL network represent statistical similarities
between local datasets. Lecture 3 also formulates FL as an instance
of regularized empirical risk minimization (RERM) which we refer
to as GTVMin. GTVMin uses the variation of personalized model
parameters across edges in the FL network as regularizer. We will see
that GTVMin couples the training of tailored (or “personalized”) ML
models such that well-connected nodes (clusters) in the FL network will
obtain similar trained models. Lecture 4 discusses variations of gradient
descent as our main algorithmic toolbox for solving GTVMin. Lecture
5 shows how FL algorithms can be obtained in a principled fashion
by applying optimization methods, such as gradient-based methods,
to GTVMin. We will obtain FL algorithms that can be implemented
as iterative message passing methods for the distributed training of
tailored (“personalized”) models. Lecture 6 derives some main flavours
of FL as special cases of GTVMin. The usefulness of GTVMin crucially
depends on the choice for the weighted edges in the FL network. Lecture
7 discusses graph learning methods that determine a useful FL network
via different notions of statistical similarity between local datasets.

• Part III: Trustworthy AI. Lecture 8 enumerates seven key requirements for trustworthy artificial intelligence (AI) that have been put forward by the European Union. These key requirements include the protection of privacy as well as robustness against (intentional) perturbations of data or computation. We then discuss how FL algorithms can ensure privacy protection in Lecture 9. Lecture 10 discusses how to evaluate and ensure the robustness of FL methods against intentional perturbations (poisoning) of local datasets.

1.7 Assignments and Grading

The course includes coding assignments that require you to implement concepts
from the lecture in a Python notebook. We will use MyCourses quizzes to test
your understanding of lecture contents and solutions to the coding assignments.
For each quiz, you can earn around 10 points. We will sum up the points
collected during the quizzes (no minimum requirements for any individual
quiz) and determine your grade according to: grade 1 for 50-59 points;
grade 2 for 60-69; grade 3 for 70-79; grade 4 for 80-89; and the top grade
5 for at least 90 points. Students will be able to review their assignments'
grading at the end of the course.

1.8 Student Project

You can extend the basic variant (which is worth 5 credits) to 10 credits by
completing a student project and peer review. This project requires you to
formulate an application of your choice as a FL problem using the concepts
from this course. You then have to solve this FL problem using the FL
algorithms taught in this course. The main deliverable will be a project
report which must follow the structure indicated in the template. You will
then peer-review the reports of your fellow students by answering a detailed
questionnaire.

1.9 Schedule

The course lectures are held on Mondays and Wednesdays at 16:15, from
28-Feb-2024 until 30-Apr-2024. You can find the detailed schedule and
lecture halls following this link. As the course can be completed fully
remotely, we will record each lecture and add the recording to the YouTube
playlist here.
After each lecture, we release the corresponding assignment at this site.
You have a few days to work on the assignment before we open the corre-
sponding quiz on the MyCourses page of the course (click me).
The assignments and quizzes are somewhat fast-paced to encourage stu-
dents to actively work on the course. We will also be strict about the
deadlines for the quizzes. However, students have the possibility to make
up for lost quiz points during a review meeting at the end of the course.
Moreover, active participation, such as contributing to the discussion
forum or providing feedback on the lecture notes, will be taken into
account.

1.10 Ground Rules

Note that as a student following this course, you must act according to the
Code of Conduct of Aalto University (see here). Two main ground rules for
this course are:

• BE HONEST. This course includes many tasks that require independent
work, including the coding assignments, the work on student projects
and the peer review of student projects. You must not use others'
work inappropriately. For example, it is not allowed to copy others'
solutions to coding assignments. We will randomly choose students
who have to explain their solutions (and corresponding answers to quiz
questions).

• BE RESPECTFUL. My personal wish is that this course provides a
safe space for an enjoyable learning experience. Any form of disrespectful
behaviour, including on any course-related communication platform, will
be sanctioned rigorously (including reporting to university authorities).

1.11 Acknowledgement

The development of this text has benefited from student feedback received
for the course CS-E4740 Federated Learning which has been offered at Aalto
University during 2023. The author is indebted to Olga Kuznetsova, Diana
Pfau and Shamsiiat Abdurakhmanova for providing feedback on early drafts.

1.12 Problems

1.1. Complexity of Matrix Inversion. Choose your favourite computer
architecture (represented by a mathematical model) and think about how
much computation is required, in the worst case, by the most efficient
algorithm that can invert any given invertible matrix Q ∈ R^{d×d}. Try also
to reflect on how practical your chosen computer architecture is, i.e., is it
possible to buy such a computer in your nearest electronics shop?
1.2. Vector Spaces and Euclidean Norm. Consider data points, each
characterized by a feature vector x ∈ Rd with entries x1 , x2 , . . . , xd .

• Show that the set of all feature vectors forms a vector space under
standard addition and scalar multiplication.

• Calculate the Euclidean norm of the vector x = (1, −2, 3)T .

• If x(1) = (1, 2, 3)T and x(2) = (−1, 0, 1)T , compute 3x(1) − 2x(2) .
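The last two parts of this problem can be sanity-checked numerically. The following NumPy sketch is only a verification aid (the array names are illustrative, not part of the problem statement):

```python
import numpy as np

# feature vector from the problem statement
x = np.array([1.0, -2.0, 3.0])

# Euclidean norm: sqrt(1^2 + (-2)^2 + 3^2) = sqrt(14)
norm_x = np.linalg.norm(x)

# linear combination 3 x^(1) - 2 x^(2)
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, 0.0, 1.0])
combo = 3 * x1 - 2 * x2

print(norm_x)  # sqrt(14), approx 3.742
print(combo)   # [5. 6. 7.]
```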

1.3. Matrix Operations in Linear Models. Linear regression methods
learn model parameters ŵ ∈ R^d via solving the optimization problem:

    ŵ = argmin_{w ∈ R^d} ∥y − Xw∥²_2,

with some matrix X ∈ R^{m×d}, and some vector y ∈ R^m.

• Derive a closed-form expression for ŵ that is valid for arbitrary X, y.

• Discuss the conditions under which XᵀX is invertible.

• Compute ŵ for the following dataset:

        ⎡1 2⎤       ⎡7⎤
    X = ⎢3 4⎥ , y = ⎢8⎥ .
        ⎣5 6⎦       ⎣9⎦

• Compute ŵ for the following dataset: The r-th row of X, for r =
1, . . . , 28, is given by the 10-minute temperature recordings during day
r/Mar/2023 at FMI weather station Kustavi Isokari. The r-th row of y
is the maximum daytime temperature during day r + 1/Mar/2023 at
the same weather station.
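For the small numeric dataset above, a hand computation of the closed-form solution can be checked with NumPy. This is a sketch of one possible verification, assuming XᵀX is invertible:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([7.0, 8.0, 9.0])

# closed-form solution w = (X^T X)^{-1} X^T y; np.linalg.solve avoids
# forming the inverse explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check with the built-in least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # True
```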

1.4. Eigenvalues and Positive Semi-Definiteness. The convergence
properties of widely-used ML methods rely on the properties of psd matrices.
Let Q = XᵀX, where X ∈ R^{m×d}.

1. Prove that Q is psd.

2. Compute the eigenvalues of Q for

       X = ⎡1 2⎤ .
           ⎣3 4⎦

3. Compute the eigenvalues of Q for the matrix X used in Problem 1.3
   that is constituted by FMI temperature recordings.
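A numerical sanity check for parts 1 and 2 (a sketch; `eigvalsh` is the appropriate routine here since Q is symmetric):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = X.T @ X  # [[10, 14], [14, 20]]

# all eigenvalues of Q = X^T X must be non-negative (Q is psd)
eigvals = np.linalg.eigvalsh(Q)
print(eigvals.min() >= -1e-10)  # True
```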

2 ML Basics
This chapter covers basic ML techniques instrumental for federated learning
(FL). Content-wise, this chapter is more extensive than the following
chapters. However, it should be considerably easier to follow than the
following chapters, as it mainly refreshes prerequisite knowledge.

2.1 Learning Goals

After completing this lecture, you will

• be familiar with the concept of data points (their features and labels),
model and loss function,

• be familiar with empirical risk minimization (ERM) as a design principle


for ML systems,

• know why and how validation is performed,

• be able to diagnose ML methods by comparing training error with


validation error,

• be able to regularize ERM via modifying data, model and loss.

2.2 Three Components and a Design Principle

Machine Learning (ML) revolves around learning a hypothesis map h out of
a hypothesis space H that allows one to accurately predict the label of a data
point solely from its features. One of the most crucial steps in applying ML
methods to a given application domain is the definition or choice of what
precisely a data point is. Coming up with a good choice or definition of data
points is not trivial, as it influences the overall performance of an ML method
in many different ways.
This course will focus mainly on one specific choice for the data points.
In particular, we will consider data points that represent the daily weather
conditions around Finnish Meteorological Institute (FMI) weather stations.
We denote a specific data point by z. It is characterized by the following
features:

• name of the FMI weather station, e.g., “TurkuRajakari”

• latitude lat and longitude lon of the weather station, e.g., lat := 60.37788,
lon := 22.0964,

• timestamp of the measurement in the format YYYY-MM-DD HH:MM:SS,


e.g., 2023-12-31 [Link]

It is convenient to stack the features into a feature vector x. The label y ∈ R
of such a data point is the maximum daytime temperature in degrees Celsius,
e.g., −20. We indicate the features x and label y of a data point via the
notation z = (x, y).²
We predict the label of a data point with features x by the function value
h(x) of a hypothesis (map) h(·). The prediction will typically not be perfect,
i.e., h(x) ̸= y. ML methods use a loss function L((x, y), h) to measure the
error incurred by using the prediction h(x) as a guess for the true label y. The
choice of loss function crucially influences the statistical and computational
properties of the resulting ML method (see [6, Ch. 2]). In what follows,
unless stated otherwise, we use the squared error loss L(z, h) := (y − h(x))²
to measure the prediction error.

²Strictly speaking, a data point z might not be exactly the same as the pair of features
x and label y. Indeed, the data point might have additional properties that are neither
used as features nor as label. A more precise notation would then be x(z) and y(z).
It seems natural to choose (or learn) a hypothesis that minimizes the average
loss (or empirical risk) on a given set of data points D := {(x^(1), y^(1)), . . . , (x^(m), y^(m))}.
This is known as ERM,

    ĥ ∈ argmin_{h∈H} (1/m) ∑_{r=1}^{m} ( y^(r) − h(x^(r)) )².        (2.1)

As the notation in (2.1) indicates (using the symbol “∈” instead of “:=”),
there might be several different solutions to the optimization problem (2.1).
Unless specified otherwise, ĥ can be used to denote any hypothesis in H that
has minimum average loss over D.
Several important machine learning (ML) methods use a parametrized
model H: Each hypothesis h ∈ H is defined by parameters w ∈ R^d, often
indicated by the notation h^(w). One important example for a parametrized
model is the linear model [6, Sec. 3.1],

    H^(d) := { h : R^d → R : h(x) := w^T x }.        (2.2)

Linear regression methods, for example, learn the parameters of a linear
model by minimizing the average squared error loss. For linear regression,
ERM becomes an optimization over the parameter space R^d,

    ŵ^(LR) ∈ argmin_{w∈R^d} f(w),  with  f(w) := (1/m) ∑_{r=1}^{m} ( y^(r) − w^T x^(r) )².        (2.3)

Note that (2.3) amounts to finding the minimum of a smooth and convex
function

    f(w) = (1/m) ( w^T X^T X w − 2 y^T X w + y^T y )        (2.4)

with the feature matrix

    X := ( x^(1), . . . , x^(m) )^T        (2.5)

and the label vector

    y := ( y^(1), . . . , y^(m) )^T        (2.6)

of the training set D.

Inserting (2.4) into (2.3) allows us to formulate linear regression as

    ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q        (2.7)

    with Q := (1/m) X^T X,  q := −(2/m) X^T y.

The matrix Q ∈ R^{d×d} is positive semi-definite (psd) with corresponding
eigenvalue decomposition (EVD),

    Q = ∑_{j=1}^{d} λ_j(Q) u^(j) ( u^(j) )^T.        (2.8)

The EVD (2.8) involves non-negative eigenvalues

    0 ≤ λ_1(Q) ≤ . . . ≤ λ_d(Q).        (2.9)

To train an ML model H means to solve the ERM (2.1) (or (2.3) for linear
regression); the dataset D is therefore referred to as a training set. The trained
model results in the learnt hypothesis ĥ. We obtain practical ML methods by
applying optimization algorithms to solve (2.1). Two key questions studied
in ML are:

• Computational aspects. How much compute do we need to solve
(2.1)?

Figure 2.1: ERM (2.1) for linear regression minimizes a convex quadratic
function w^T Q w + w^T q.

• Statistical aspects. How useful is the solution ĥ to (2.1) in general,


i.e., how accurate is the prediction ĥ(x) for the label y of an arbitrary
data point with features x?

2.3 Computational Aspects of ERM

ML methods use optimization algorithms to solve (2.1), resulting in the


learnt hypothesis ĥ. Within this course, we use optimization algorithms that
are iterative methods: Starting from an initial choice h(0) , they construct a
sequence
h(0) , h(1) , h(2) , . . . ,

which are hopefully increasingly accurate approximations to a solution ĥ of


(2.1). The computational complexity of such a ML method can be measured
by the number of iterations required to guarantee some prescribed level of
approximation.
For a parametrized model and a smooth loss function, we can solve (2.3)

by gradient-based methods: Starting from an initial parameter vector w^(0),
we iterate the gradient step:

    w^(k) := w^(k−1) − η ∇f( w^(k−1) )
           = w^(k−1) + (2η/m) ∑_{r=1}^{m} x^(r) ( y^(r) − ( w^(k−1) )^T x^(r) ).        (2.10)

How much computation do we need for one iteration of (2.10)? How many
iterations do we need? We will try to answer the latter question in Lecture 4.
The first question can be answered more easily for a typical computational in-
frastructure (e.g., “Python running on a commercial Laptop”). The evaluation
of (2.10) then typically requires around m arithmetic operations (addition,
multiplication).
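The gradient step (2.10) can be implemented in a few lines of NumPy. The following sketch runs it on a small synthetic dataset; the step size η, iteration count, and synthetic data are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = rng.normal(size=(m, d))           # feature matrix, rows x^(r)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noiseless labels, for illustration

w = np.zeros(d)                       # initialization w^(0)
eta = 0.1                             # step size (learning rate)
for _ in range(500):
    # gradient step (2.10): w <- w + (2*eta/m) * sum_r x^(r) (y^(r) - w^T x^(r))
    w = w + (2 * eta / m) * X.T @ (y - X @ w)

print(np.allclose(w, w_true, atol=1e-6))  # True: iterates approach the ERM solution
```

Note that the update is written in vectorized form: `X.T @ (y - X @ w)` computes the sum over r in (2.10) in one matrix-vector product.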
It is instructive to consider the special case of a linear model that does
not use any features, i.e., h(x) = w. For this extreme case, the ERM (2.3)
has a simple closed-form solution:

    ŵ = (1/m) ∑_{r=1}^{m} y^(r).        (2.11)

Thus, for this special case of the linear model, evaluating (2.11) amounts to
summing the m labels y^(1), . . . , y^(m). It seems reasonable to assume that the
amount of computation required to compute (2.11) is proportional to m.

2.4 Statistical Aspects of ERM

We can train a linear model on a given training set via the ERM (2.3). But
how useful is the solution ŵ of (2.3) for predicting the labels of data points
outside the training set? Consider applying the learnt hypothesis h^(ŵ) to an
arbitrary data point not contained in the training set. What can we say about
the resulting prediction error y − h^(ŵ)(x) in general? In other words, how
well does h^(ŵ) generalize beyond the training set?
The most widely used approach to study the generalization of ML methods
is via probabilistic models. Here, we interpret each data point as a realization
of an independent and identically distributed (i.i.d.) RV with probability
distribution p(x, y). Under this i.i.d. assumption, we can evaluate the overall
performance of a hypothesis h ∈ H via the expected loss (or risk)

    E{ L((x, y), h) }.        (2.12)

One example for a probability distribution p(x, y) relates the label y with
the features x of a data point as

    y = w̄^T x + ε  with  x ∼ N(0, I),  ε ∼ N(0, σ²),  E{εx} = 0.        (2.13)

A simple calculation reveals the expected squared error loss of a given linear
hypothesis h(x) = x^T ŵ as³

    E{ (y − h(x))² } = ∥ w̄ − ŵ ∥² + σ².        (2.14)

The component σ² can be interpreted as the intrinsic noise level of the label
y. We cannot hope to find a hypothesis with an expected loss below this level.

³Strictly speaking, the relation (2.14) only applies for constant (deterministic) model
parameters ŵ that do not depend on the RVs whose realizations are the observed data
points (see, e.g., (2.13)). However, the learnt model parameters ŵ might be the output of
an ML method (such as (2.3)) that is applied to a dataset D consisting of i.i.d. realizations
from some underlying probability distribution. In this case, we need to replace the
expectation on the LHS of (2.14) with a conditional expectation E{ (y − h(x))² | D }.

The first component of the RHS in (2.14) is the estimation error ∥ w̄ − ŵ ∥²
of an ML method that reads in the training set and delivers an estimate ŵ
(e.g., via (2.3)) for the parameters of a linear hypothesis.

We next study the estimation error w̄ − ŵ incurred by the specific estimate
ŵ = ŵ^(LR) (2.7) delivered by linear regression methods. To this end, we first
use the probabilistic model (2.13) to decompose the label vector y in (2.6) as

    y = X w̄ + n ,  with  n := ( ε^(1), . . . , ε^(m) )^T.        (2.15)

Inserting (2.15) into (2.7) yields

    ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q′ + w^T e        (2.16)

    with Q := (1/m) X^T X,  q′ := −(2/m) X^T X w̄,  and  e := −(2/m) X^T n.        (2.17)

Figure 2.2 depicts the objective function of (2.16). It is a perturbation of the
convex quadratic function w^T Q w + w^T q′, which is minimized at w = w̄. In
general, the minimizer ŵ^(LR) delivered by linear regression is different from
w̄ due to the perturbation term w^T e in (2.16).

The following result bounds the deviation between ŵ^(LR) and w̄ under
the assumption that the matrix Q = (1/m) X^T X is invertible.⁴

Proposition 2.1. Consider a solution ŵ^(LR) to the ERM instance (2.16) that
is applied to the dataset (2.15). If the matrix Q = (1/m) X^T X is invertible,
with minimum eigenvalue λ_1(Q) > 0,

    ∥ ŵ^(LR) − w̄ ∥²_2 ≤ ∥e∥²_2 / λ²_1  =(2.17)  4 ∥ X^T n ∥²_2 / ( m² λ²_1 ).        (2.18)

⁴Can you think of sufficient conditions on the feature matrix of the training set that
ensure Q = (1/m) X^T X is invertible?

Figure 2.2: The estimation error of linear regression is determined by the
effect of the perturbation term w^T e on the minimizer of the convex quadratic
function w^T Q w + w^T q′.

Proof. Let us rewrite (2.16) as

    ŵ^(LR) ∈ argmin_{w∈R^d} f(w)  with  f(w) := ( w − w̄ )^T Q ( w − w̄ ) + e^T ( w − w̄ ).        (2.19)

Clearly f(w̄) = 0 and, in turn, f(ŵ) = min_{w∈R^d} f(w) ≤ 0. On the other
hand,

    f(w)  =(2.19)  ( w − w̄ )^T Q ( w − w̄ ) + e^T ( w − w̄ )
          ≥(a)  ( w − w̄ )^T Q ( w − w̄ ) − ∥e∥_2 ∥ w − w̄ ∥_2
          ≥(b)  λ_1 ∥ w − w̄ ∥²_2 − ∥e∥_2 ∥ w − w̄ ∥_2.        (2.20)

Step (a) used the Cauchy–Schwarz inequality and (b) used the EVD (2.8) of Q.
Evaluating (2.20) for w = ŵ and combining with f(ŵ) ≤ 0 yields (2.18).
b ≤ 0 yields (2.18).


The bound (2.18) suggests that the estimation error ŵ^(LR) − w̄ is small if
λ_1(Q) is large. This smallest eigenvalue of the matrix Q = (1/m) X^T X could
be controlled by a suitable choice (or transformation) of the features x of a data
point. Trivially, we can increase λ_1(Q) by a factor 100 if we scale each feature
by a factor 10. However, this approach would also scale (by a factor 100) the
error term ∥ X^T n ∥²_2 in (2.18). For some applications, we can find feature
transformations (“whitening”) that increase λ_1(Q) but do not increase
∥ X^T n ∥²_2. We finally note that the error term ∥ X^T n ∥²_2 in (2.18) vanishes if
the noise vector n is orthogonal to the columns of the feature matrix X.
It is instructive to evaluate the bound (2.18) for the special case where
each data point has the same feature x = 1. Here, the probabilistic model
(2.15) reduces to a signal in noise model,

    y^(r) = x^(r) w̄ + ε^(r)  with  x^(r) = 1,        (2.21)

with some true underlying parameter w̄. The noise terms ε^(r), for r = 1, . . . , m,
are realizations of i.i.d. RVs with probability distribution N(0, σ²). The
feature matrix then becomes X = 1 (the all-ones vector) and, in turn, Q = 1,
λ_1(Q) = 1. Inserting these values into (2.18) results in the bound

    ∥ ŵ^(LR) − w̄ ∥² ≤ 4 ∥n∥²_2 / m².        (2.22)

For the labels and features in (2.21), the solution of (2.16) is given by

    ŵ^(LR) = (1/m) ∑_{r=1}^{m} y^(r)  =(2.21)  w̄ + (1/m) ∑_{r=1}^{m} ε^(r).        (2.23)

2.5 Validation and Diagnosis of ML

The above analysis of the generalization error started from postulating a prob-
abilistic model for the generation of data points. However, this probabilistic
model might be wrong, in which case the bound (2.18) does not apply. Thus,
we might want to use a more data-driven approach for assessing the usefulness
of a learnt hypothesis ĥ obtained, e.g., from solving the ERM (2.1).
Loosely speaking, validation tries to find out if a learnt hypothesis ĥ
performs similar inside and outside the training set. In its most basic form,
validation amounts to computing the average loss of a learnt hypothesis ĥ
on some data points not included in the training set. We refer to these data
points as the validation set.
Algorithm 2.1 summarizes a single iteration of a prototypical ML workflow
that consists of model training and validation. The workflow starts with
an initial choice for a dataset D, model H and loss function L (·, ·). We
then repeat Algorithm 2.1 several times. After each repetition, based on the
resulting training error and validation error, we modify (some of) the design
choices for dataset, model and loss function. These design choices, including
the decision of when to stop repeating Algorithm 2.1, are often guided by
simple heuristics [6, Ch.6.6].
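A minimal NumPy sketch of one iteration of this workflow for a linear model with squared error loss follows; the split ratio and the synthetic dataset are illustrative assumptions:

```python
import numpy as np

def train_and_validate(X, y, train_size):
    """One iteration of the train/validate workflow for linear regression."""
    # step 1: split D into a training set and a validation set
    X_tr, y_tr = X[:train_size], y[:train_size]
    X_val, y_val = X[train_size:], y[train_size:]

    # step 2: solve ERM for the linear model (least squares)
    w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

    # steps 3 and 4: average squared error loss on each set
    E_t = np.mean((y_tr - X_tr @ w_hat) ** 2)
    E_v = np.mean((y_val - X_val @ w_hat) ** 2)
    return w_hat, E_t, E_v

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=200)
w_hat, E_t, E_v = train_and_validate(X, y, train_size=100)
```

In practice, the split in step 1 should be randomized; the deterministic slice here keeps the sketch short.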
We can diagnose an ERM-based ML method, such as Algorithm 2.1, by
comparing its training error with its validation error. This diagnosis is further
enabled if we know a baseline E^(ref). One important source for a baseline
E^(ref) are probabilistic models for the data points (see Section 2.4).
Given a probabilistic model p(x, y), we can compute the minimum achiev-
able risk (2.12). Indeed, the minimum achievable risk is precisely the expected
loss of the Bayes estimator ĥ(x) of the label y, given the features x of a
data point. The Bayes estimator ĥ(x) is fully determined by the probability
distribution p(x, y) [30, Chapter 4].
A further potential source for a baseline E (ref) is an existing, but for

Algorithm 2.1 One Iteration of ML Training and Validation
Input: dataset D, model H, loss function L(·, ·)
1: split D into a training set D^(train) and a validation set D^(val)
2: learn a hypothesis via solving ERM

    ĥ ∈ argmin_{h∈H} ∑_{(x,y)∈D^(train)} L((x, y), h)        (2.24)

3: compute resulting training error

    E_t := (1/|D^(train)|) ∑_{(x,y)∈D^(train)} L((x, y), ĥ)

4: compute validation error

    E_v := (1/|D^(val)|) ∑_{(x,y)∈D^(val)} L((x, y), ĥ)

Output: learnt hypothesis (or trained model) ĥ, training error E_t and vali-
dation error E_v

some reason unsuitable, ML method. This existing ML method might be
computationally too expensive to be used for the ML application at hand.
However, we might still use its statistical properties as a baseline.
We can also use the performance of human experts as a baseline. If
we want to develop a ML method that detects skin cancer from images,
a possible baseline is the classification accuracy achieved by experienced
dermatologists [31].
We can diagnose an ML method by comparing the training error E_t with
the validation error E_v and (if available) the baseline E^(ref).

• Et ≈ Ev ≈ E (ref) : The training error is on the same level as the


validation error and the baseline. There seems to be little point in trying
to improve the method further since the validation error is already close
to the baseline. Moreover, the training error is not much smaller than
the validation error which indicates that there is no overfitting.

• Ev ≫ Et : The validation error is significantly larger than the training


error, which hints at overfitting. We can address overfitting either by
reducing the effective dimension of the hypothesis space or by increasing
the size of the training set. To reduce the effective dimension of the
hypothesis space, we can use fewer features (in a linear model), a
smaller maximum depth of decision trees or fewer layers in an artificial
neural network (ANN). Instead of this coarse-grained discrete model
pruning, we can also reduce the effective dimension of a hypothesis
space continuously via regularization (see [6, Ch. 7]).

• Et ≈ Ev ≫ E (ref) : The training error is on the same level as the

validation error and both are significantly larger than the baseline. Thus,
the learnt hypothesis seems to not overfit the training set. However, the
training error achieved by the learnt hypothesis is significantly larger
than the baseline. There can be several reasons for this to happen.
First, it might be that the hypothesis space is too small, i.e., it does not
include a hypothesis that provides a good approximation for the relation
between features and label of a data point. One remedy to this situation
is to use a larger hypothesis space, e.g., by including more features in a
linear model, using higher polynomial degrees in polynomial regression,
using deeper decision trees or ANNs with more hidden layers (deep net).
Second, besides the model being too small, another reason for a large
training error could be that the optimization algorithm used to solve
ERM (2.24) is not working properly (see Lecture 4).

• E_t ≫ E_v: The training error is significantly larger than the validation
error. The idea of ERM (2.24) is to approximate the risk (2.12) of a
hypothesis by its average loss on a training set D = {(x^(r), y^(r))}_{r=1}^{m}.
The mathematical underpinning for this approximation is the law of
large numbers, which characterizes the average of (realizations of) i.i.d.
RVs. The accuracy of this approximation depends on the validity of two
conditions: First, the data points used for computing the average loss
“should behave” like realizations of i.i.d. RVs with a common probability
distribution. Second, the number of data points used for computing the
average loss must be sufficiently large.

Whenever the training set or validation set differs significantly from
realizations of i.i.d. RVs, the interpretation (and comparison) of the
training error and the validation error of a learnt hypothesis becomes
more difficult. As an extreme case, the validation set might consist of
data points for which every hypothesis incurs a small average loss (see
Figure 2.3). Here, we might try to increase the size of the validation set
by collecting more labeled data points or by using data augmentation
(see Section 2.6). If the size of both the training set and the validation
set is large but we still obtain E_t ≫ E_v, one should verify whether the
data points in these sets conform to the i.i.d. assumption. There are
principled statistical tests for the validity of the i.i.d. assumption for a
given dataset (see [32] and references therein).
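The diagnosis rules above can be encoded as a simple lookup. The function below is only a sketch: the relative tolerance used to decide when two errors are "on the same level" is a hypothetical choice, not part of the text:

```python
def diagnose(E_t, E_v, E_ref, tol=1.5):
    """Map training/validation/baseline errors to a diagnosis string.

    Two errors count as "on the same level" if their ratio stays below
    the (hypothetical) tolerance `tol`.
    """
    if E_v > tol * E_t:
        return "overfitting: shrink the model or gather more data"
    if E_t > tol * E_v:
        return "check i.i.d. assumption / validation-set size"
    if E_t > tol * E_ref:
        return "underfitting: enlarge the model or check the optimizer"
    return "errors near baseline: little room for improvement"

print(diagnose(E_t=1.0, E_v=5.0, E_ref=1.0))  # overfitting case
```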

Figure 2.3: Example of an unlucky split into training set and validation set
for the model H := {h^(1), h^(2), h^(3)}.

2.6 Regularization

Consider an ERM-based ML method using a hypothesis space H and dataset
D (we assume all data points are used for training). A key parameter for such
an ML method is the ratio d_eff(H)/|D| between the model size d_eff(H) and
the number |D| of data points. The tendency of the ML method to overfit
increases with the ratio d_eff(H)/|D|.

Regularization techniques decrease the ratio d_eff(H)/|D| via three (essen-
tially equivalent) approaches:

• collect more data points, possibly via data augmentation (see Fig. 10.9),

• add a penalty term αR(h) to the average loss in ERM (2.1) (see Fig. 10.9),

• shrink the hypothesis space, e.g., by adding constraints on the model
parameters such as ∥w∥_2 ≤ 10.

It can be shown that these three perspectives (corresponding to the three
components data, model and loss) on regularization are closely related [6, Ch.
7]. For example, adding a penalty term αR(h) in ERM (2.1) is equivalent
to ERM (2.1) with a pruned hypothesis space H^(α) ⊆ H. Using a larger α
typically results in a smaller H^(α).

One important example of regularization via adding a penalty term to
the average loss is ridge regression. In particular, ridge regression uses the
regularizer R(h) := ∥w∥²_2 for a linear hypothesis h(x) := w^T x. Thus, ridge
regression learns the parameters of a linear hypothesis via solving

    ŵ^(α) ∈ argmin_{w∈R^d} [ (1/m) ∑_{r=1}^{m} ( y^(r) − w^T x^(r) )² + α ∥w∥²_2 ].        (2.25)

The objective function in (2.25) can be interpreted as the objective function
of linear regression applied to a modification of the training set D: We replace
each data point (x, y) ∈ D by a sufficiently large number of i.i.d. realizations
of

    (x + n, y)  with  n ∼ N(0, αI).        (2.26)

Thus, ridge regression (2.25) is equivalent to linear regression applied to
an augmented variant D′ of D. The augmentation D′ is obtained by replacing
each data point (x, y) ∈ D with a sufficiently large number of noisy copies.
Each copy of (x, y) is obtained by adding an i.i.d. realization n of zero-mean
Gaussian noise with covariance matrix αI to the features x (see (2.26)). The
label of each copy of (x, y) is equal to y, i.e., the label is not perturbed.
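This equivalence can be illustrated numerically: fitting plain linear regression to many noisy copies (2.26) of each data point approximately reproduces the ridge solution. A Monte Carlo sketch (the copy count K and the tolerance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, alpha = 5, 2, 0.5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

# ridge solution of (2.25): w = ((1/m) X^T X + alpha I)^{-1} (1/m) X^T y
w_ridge = np.linalg.solve(X.T @ X / m + alpha * np.eye(d), X.T @ y / m)

# augmented dataset: K noisy copies (x + n, y) with n ~ N(0, alpha I)
K = 200_000
X_aug = np.repeat(X, K, axis=0) + np.sqrt(alpha) * rng.normal(size=(m * K, d))
y_aug = np.repeat(y, K)
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.max(np.abs(w_ridge - w_aug)))  # small, and vanishes as K grows
```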

Figure 2.4: Equivalence between data augmentation and loss penalization.

To study the computational aspects of ridge regression, let us rewrite
(2.25) as

    ŵ^(α) ∈ argmin_{w∈R^d} w^T Q w + w^T q        (2.27)

    with Q := (1/m) X^T X + αI,  q := −(2/m) X^T y.

Thus, like linear regression (2.7), ridge regression also minimizes a convex
quadratic function. A main difference between linear regression (2.7) and
ridge regression (for α > 0) is that the matrix Q in (2.27) is guaranteed to
be invertible for any training set D. In contrast, the matrix Q in (2.7) for
linear regression might be singular for some training sets.⁵
The statistical aspects of the solutions to (2.27) (i.e., the parameters learnt
by ridge regression) crucially depend on the value of α. This choice can
be guided by an error analysis using a probabilistic model for the data (see
Proposition 2.1). Instead of using a probabilistic model, we can also compare
the training error and validation error of the hypothesis h(x) = ( ŵ^(α) )^T x
learnt by ridge regression for different values of α.

2.7 Overview of Coding Assignment

Python Notebook. MLBasics_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv
Description. The coding assignment revolves around weather data col-
lected by the FMI and stored in a csv file. This file contains temperature
measurements at different locations in Finland.
Each temperature measurement is a data point, characterized by d = 7
features x = ( x_1, . . . , x_7 )^T and a label y, which is the temperature measurement
itself. The features are (normalized) values of the latitude and longitude of
the FMI station as well as the (normalized) year, month, day, hour, and minute
during which this measurement has been taken. Your tasks include:

• Generate numpy arrays X, y, whose r-th row holds the features x(r) and
label y (r) , respectively, of the r-th data point in the csv file.
⁵Consider the extreme case where all features of each data point in the training set D
are zero.

• Split the dataset into a training set and a validation set. The size of
the training set should be 100.

• Train a linear model, using the LinearRegression class of the scikit-learn


package, on the training set and determine the resulting training error
and validation error

• Train and validate a linear model using feature augmentation via polyno-
mial combinations (see the PolynomialFeatures class). Try out different
choices of the maximal degree of these polynomial combinations.

• Using a fixed value for the polynomial degree for the feature augmenta-
tion step, train and validate a linear model using ridge regression (2.25)
via the Ridge class.

• CAUTION! The input parameter alpha of the Ridge class has a
different meaning than α in (2.25). In particular, the parameter α in
(2.25) corresponds to using the value mα for alpha.

• Determine the resulting training error and validation error for the model
parameters obtained with different values of α in (2.25).
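The CAUTION above follows from comparing objectives: scikit-learn's Ridge (with fit_intercept=False) minimizes ∥y − Xw∥² + alpha·∥w∥², while (2.25) minimizes (1/m)∥y − Xw∥² + α∥w∥²; multiplying the latter objective by m shows that alpha = mα yields the same minimizer. A NumPy-only sketch of this identity (the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, alpha_course = 50, 3, 0.1   # alpha_course is the alpha of (2.25)
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

# minimizer of (1/m)||y - Xw||^2 + alpha_course ||w||^2, via the
# normal equations of (2.25)
w_course = np.linalg.solve(X.T @ X / m + alpha_course * np.eye(d),
                           X.T @ y / m)

# minimizer of ||y - Xw||^2 + alpha_sklearn ||w||^2 (the Ridge-class
# convention) with alpha_sklearn = m * alpha_course
alpha_sklearn = m * alpha_course
w_sklearn = np.linalg.solve(X.T @ X + alpha_sklearn * np.eye(d),
                            X.T @ y)

print(np.allclose(w_course, w_sklearn))  # True: the same minimizer
```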

3 A Design Principle for FL
Lecture 2 reviewed ML methods that use numeric arrays to store data points
(their features and labels) and model parameters. We have also discussed
ERM (and its regularization) as a main design principle for practical ML
systems. This lecture extends the basic ML concepts from a single-dataset
single-model setting to FL applications involving distributed collections of
data and models.
Section 3.2 introduces federated learning (FL) networks as a mathematical
model for collections of devices that generate local datasets and train local
models. Section 3.3 presents our main design principle for FL systems. This
principle, referred to as GTV minimization (GTVMin), is a special case of
ERM using a particular choice of loss function and model. The model used
by GTVMin is an (algebraic) product of local models, one for each node. The
loss function of GTVMin consists of two parts: the sum of the loss functions
at each node and a penalty term that measures the variation of local models
across the edges of the FL network. The precise form of this penalty term
is a design choice and depends on the nature of the local models. If the
local models are parametrized by model parameters, we can use the norm
of their differences across the edges in the FL network. For non-parametric
local models, we can measure the variation between two local models by
comparing their predictions on a specific dataset. We obtain the generalized
total variation (GTV) by summing these variations over all edges in the FL
network.

3.1 Learning Goals

After completing this lecture, you will

• be familiar with the concept of an FL network,

• know how connectivity is related to the spectrum of the Laplacian


matrix,

• know some measures for the variation of local models,

• be able to formulate FL as instances of GTVMin.

3.2 Graphs and Their Laplacian

Consider a collection of devices, indexed by i = 1, . . . , n, generating local
datasets D^(1), . . . , D^(n). Our goal is to train a personalized model H^(i) for
each local dataset D^(i), for i = 1, . . . , n. We represent such a collection of
local datasets and (personal) local models, along with their relations, by an
FL network. Figure 3.1 depicts an example of an FL network.
[Figure 3.1 depicts two nodes of an FL network, carrying local datasets D(i), D(i′) and local model parameters w(i), w(i′), connected by an edge with weight Ai,i′.]

Figure 3.1: Example of an FL network whose nodes i ∈ V carry local datasets D(i) and local models that are parametrized by local model parameters w(i).

The FL network G is an undirected weighted graph G = (V, E) with nodes V := {1, . . . , n}. Each node i ∈ V of the FL network G carries a local dataset

$$\mathcal{D}^{(i)} := \left\{ \left(\mathbf{x}^{(i,1)}, y^{(i,1)}\right), \ldots, \left(\mathbf{x}^{(i,m_i)}, y^{(i,m_i)}\right) \right\}. \tag{3.1}$$

Here, x(i,r) and y(i,r) denote, respectively, the features and the label of the rth data point in the local dataset D(i). Note that the size mi of the local dataset might vary between different nodes i ∈ V.
It is convenient to collect the feature vectors x(i,r) and labels y(i,r) into a feature matrix X(i) and label vector y(i), respectively,

$$\mathbf{X}^{(i)} := \left( \mathbf{x}^{(i,1)}, \ldots, \mathbf{x}^{(i,m_i)} \right)^{T}, \quad \text{and} \quad \mathbf{y}^{(i)} := \left( y^{(i,1)}, \ldots, y^{(i,m_i)} \right)^{T}. \tag{3.2}$$

The local dataset D(i) can then be represented compactly by the feature matrix X(i) ∈ R^{mi×d} and the vector y(i) ∈ R^{mi}.
Besides the local dataset D(i), each node i ∈ V also carries a local model H(i). Our focus is on local models that can be parametrized by local model parameters w(i) ∈ R^d, for i = 1, . . . , n. The usefulness of a specific choice for the local model parameters w(i) is measured by a local loss function Li(w(i)), for i = 1, . . . , n. Note that we might use different local loss functions Li(·) ≠ Li′(·) at different nodes i, i′ ∈ V.


The FL network also contains undirected edges {i, i′} ∈ E between pairs of (different) nodes i, i′ ∈ V. We use an undirected edge {i, i′} ∈ E to couple the training of the corresponding local models H(i), H(i′). The strength of this coupling is determined by a positive edge weight Ai,i′ > 0. The coupling is implemented by penalizing the discrepancy between local model parameters w(i) and w(i′) (see Section 3.3).
There are different ways to measure the discrepancy between the local model parameters w(i), w(i′) at connected nodes {i, i′} ∈ E. For example,

we can use some cost or penalty function φ(w(i) − w(i′)) that satisfies basic requirements such as being a (semi-)norm [33].
The choice of penalty φ(·) has a crucial impact on the computational and statistical properties of the FL methods arising from the design principle introduced in Section 3.3. Unless stated otherwise, we use the choice φ(·) := ∥·∥₂², i.e., we measure the discrepancy between local model parameters across an edge {i, i′} ∈ E by the squared Euclidean distance ∥w(i) − w(i′)∥₂². Summing the discrepancies (weighted by the edge weights) over all edges in the FL network yields the total variation of local model parameters

$$\sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \right\|_{2}^{2}. \tag{3.3}$$

The connectivity of an FL network G can be characterized locally around a node i ∈ V by its weighted node degree

$$d^{(i)} := \sum_{i' \in \mathcal{N}(i)} A_{i,i'}. \tag{3.4}$$

Here, we used the neighbourhood N (i) := {i′ ∈ V : {i, i′ } ∈ E} of node i ∈ V.


A global characterization for the connectivity of G is the maximum weighted node degree

$$d_{\max}^{(\mathcal{G})} := \max_{i \in \mathcal{V}} d^{(i)} \stackrel{(3.4)}{=} \max_{i \in \mathcal{V}} \sum_{i' \in \mathcal{N}(i)} A_{i,i'}. \tag{3.5}$$

Besides inspecting its (maximum) node degrees, we can study the connectivity of G also via the eigenvalues and eigenvectors of its Laplacian matrix L(G) ∈ R^{n×n}.⁶ The Laplacian matrix of an undirected weighted graph G is defined element-wise as (see Fig. 10.7)

⁶The study of graphs via the eigenvalues and eigenvectors of associated matrices is the main subject of spectral graph theory [34, 35].

$$L^{(\mathcal{G})}_{i,i'} := \begin{cases} -A_{i,i'} & \text{for } i \neq i', \{i,i'\} \in \mathcal{E}, \\ \sum_{i'' \neq i} A_{i,i''} & \text{for } i = i', \\ 0 & \text{else.} \end{cases} \tag{3.6}$$

[Figure 3.2 depicts an FL network G with three nodes i = 1, 2, 3, connected by unit-weight edges {1, 2} and {1, 3}, together with its Laplacian matrix]

$$L^{(\mathcal{G})} = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}$$

Figure 3.2: Left: Example for an FL network G with three nodes i = 1, 2, 3. Right: Laplacian matrix L(G) ∈ R^{3×3} of G.

The Laplacian matrix is symmetric and psd, which follows from the identity

$$\mathbf{w}^{T} \left( L^{(\mathcal{G})} \otimes \mathbf{I} \right) \mathbf{w} = \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \right\|_{2}^{2} \quad \text{for any } \mathbf{w} := \left( \left(\mathbf{w}^{(1)}\right)^{T}, \ldots, \left(\mathbf{w}^{(n)}\right)^{T} \right)^{T} =: \operatorname{stack}\left\{ \mathbf{w}^{(i)} \right\}_{i=1}^{n}. \tag{3.7}$$
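To make the definitions (3.6) and (3.7) concrete, the following sketch (using numpy; the variable names and the use of the example graph from Figure 3.2 are our own choices, not part of the lecture material) constructs the Laplacian matrix from the edge weights and verifies the identity (3.7) numerically:

```python
import numpy as np

def laplacian(n, edges):
    """Build the Laplacian matrix (3.6) from the edge weights A_{i,i'}.

    edges: dict mapping a node pair (i, j) (0-based, i < j) to its weight.
    """
    L = np.zeros((n, n))
    for (i, j), a in edges.items():
        L[i, j] -= a          # off-diagonal entries -A_{i,i'}
        L[j, i] -= a
        L[i, i] += a          # diagonal entries: weighted node degrees
        L[j, j] += a
    return L

# FL network of Figure 3.2: three nodes, unit-weight edges {1, 2} and {1, 3}.
edges = {(0, 1): 1.0, (0, 2): 1.0}
L = laplacian(3, edges)

# Check the identity (3.7) for randomly drawn local model parameters in R^2.
d = 2
rng = np.random.default_rng(0)
W = rng.normal(size=(3, d))       # row i holds w^(i)
w = W.reshape(-1)                 # stacked vector stack{w^(i)}
quad = w @ np.kron(L, np.eye(d)) @ w
gtv = sum(a * np.sum((W[i] - W[j]) ** 2) for (i, j), a in edges.items())
print(np.allclose(quad, gtv))
```

The Kronecker product np.kron(L, np.eye(d)) realizes the matrix L(G) ⊗ I acting on the stacked parameter vector.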

As a psd matrix, L(G) possesses an EVD

$$L^{(\mathcal{G})} = \sum_{j=1}^{n} \lambda_j \mathbf{u}^{(j)} \left(\mathbf{u}^{(j)}\right)^{T}, \tag{3.8}$$

with the eigenvalues ordered in increasing order,

$$0 = \lambda_1\left(L^{(\mathcal{G})}\right) \leq \lambda_2\left(L^{(\mathcal{G})}\right) \leq \ldots \leq \lambda_n\left(L^{(\mathcal{G})}\right). \tag{3.9}$$

The ordered eigenvalues λi(L(G)) can be computed via the Courant–Fischer–Weyl min-max characterization (CFW) [3, Thm. 8.1.2.]. Two important special cases of this characterization are

$$\lambda_n\left(L^{(\mathcal{G})}\right) \stackrel{\text{CFW}}{=} \max_{\substack{\mathbf{v} \in \mathbb{R}^n \\ \|\mathbf{v}\|=1}} \mathbf{v}^{T} L^{(\mathcal{G})} \mathbf{v} \stackrel{(3.7)}{=} \max_{\substack{\mathbf{v} \in \mathbb{R}^n \\ \|\mathbf{v}\|=1}} \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left( v_i - v_{i'} \right)^{2} \tag{3.10}$$

and

$$\lambda_2\left(L^{(\mathcal{G})}\right) \stackrel{\text{CFW}}{=} \min_{\substack{\mathbf{v} \in \mathbb{R}^n \\ \mathbf{v}^{T}\mathbf{1}=0, \|\mathbf{v}\|=1}} \mathbf{v}^{T} L^{(\mathcal{G})} \mathbf{v} \stackrel{(3.7)}{=} \min_{\substack{\mathbf{v} \in \mathbb{R}^n \\ \mathbf{v}^{T}\mathbf{1}=0, \|\mathbf{v}\|=1}} \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left( v_i - v_{i'} \right)^{2}. \tag{3.11}$$

Note that (3.11) involves a minimum: λ2 is the smallest value of the quadratic form over unit-norm vectors orthogonal to the constant vector 1, which spans the eigenspace of λ1 = 0.

By (3.7), we can measure the total variation of local model parameters by stacking them into a single vector w ∈ R^{nd} and computing the quadratic form w^T (L(G) ⊗ I) w. Another consequence of (3.7) is that any collection of identical local model parameters, stacked into the vector

$$\mathbf{w} = \operatorname{stack}\{\mathbf{c}\} = \left( \mathbf{c}^{T}, \ldots, \mathbf{c}^{T} \right)^{T}, \quad \text{with some } \mathbf{c} \in \mathbb{R}^{d}, \tag{3.12}$$

is an eigenvector of L(G) ⊗ I with corresponding eigenvalue λ1 = 0 (see (3.9)). Thus, the Laplacian matrix of any FL network is singular (non-invertible). The second eigenvalue λ2 of the Laplacian matrix provides a great deal of information about the connectivity structure of G.⁷
⁷Much of spectral graph theory is devoted to the analysis of λ2 for different graph constructions [34, 35].

• Consider the case λ2 = 0: Here, beside the eigenvector (3.12), we can find at least one additional eigenvector

$$\widetilde{\mathbf{w}} = \operatorname{stack}\left\{ \mathbf{w}^{(i)} \right\}_{i=1}^{n} \quad \text{with } \mathbf{w}^{(i)} \neq \mathbf{w}^{(i')} \text{ for some } i, i' \in \mathcal{V}, \tag{3.13}$$

of L(G) ⊗ I with eigenvalue equal to 0. In this case, the graph G is not connected, i.e., we can find two subsets (components) of nodes that do not have any edge between them. For each connected component C, we can construct such an eigenvector by assigning the same (non-zero) vector c ∈ R^d \ {0} to all nodes i ∈ C and the zero vector 0 to the remaining nodes i ∈ V \ C.

• On the other hand, if λ2 > 0 then G is connected. Moreover, the larger


the value of λ2 , the stronger the connectivity between the nodes in G.
Indeed, adding edges to G can only increase the objective in (3.11) and,
in turn, λ2 .

In what follows, we will make use of the lower bound [36]

$$\sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \right\|_{2}^{2} \geq \lambda_2 \sum_{i=1}^{n} \left\| \mathbf{w}^{(i)} - \operatorname{avg}\left\{ \mathbf{w}^{(i)} \right\} \right\|_{2}^{2}. \tag{3.14}$$

Here, avg{w(i)} := (1/n) ∑ⁿᵢ₌₁ w(i) is the average of all local model parameters. The bound (3.14) follows from (3.7) and the CFW for the eigenvalues of the matrix L(G) ⊗ I.
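The following short numpy sketch (the chain graph used here is a hypothetical example of ours) computes λ2 and checks the bound (3.14) numerically:

```python
import numpy as np

# Laplacian of a connected chain graph 1 - 2 - 3 with unit edge weights.
L = np.array([[1., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 1.]])
lam2 = np.linalg.eigvalsh(L)[1]   # second-smallest eigenvalue (see (3.9))

# Random local model parameters w^(i) in R^2, one row per node.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))

# Left-hand side of (3.14): total variation over the edges {1,2} and {2,3}.
gtv = np.sum((W[0] - W[1]) ** 2) + np.sum((W[1] - W[2]) ** 2)
# Right-hand side of (3.14): lambda_2 times the spread around the average.
rhs = lam2 * np.sum((W - W.mean(axis=0)) ** 2)
print(lam2 > 0.0, gtv >= rhs)
```

Since the chain graph is connected, λ2 > 0 and the lower bound holds for any choice of local model parameters.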
The quantity ∑ⁿᵢ₌₁ ∥w(i) − avg{w(i)}ⁿᵢ₌₁∥₂² on the RHS of (3.14) has an interesting geometric interpretation: It is the squared Euclidean norm of the projection of the stacked local model parameters w := ((w(1))^T, . . . , (w(n))^T)^T on the orthogonal complement of the subspace

$$\mathcal{S} := \left\{ \mathbf{1} \otimes \mathbf{a} : \mathbf{a} \in \mathbb{R}^{d} \right\} = \left\{ \left( \mathbf{a}^{T}, \ldots, \mathbf{a}^{T} \right)^{T}, \text{ for some } \mathbf{a} \in \mathbb{R}^{d} \right\} \subseteq \mathbb{R}^{dn}. \tag{3.15}$$

Indeed, the projection P_S w of w ∈ R^{nd} on S is given explicitly as

$$P_{\mathcal{S}} \mathbf{w} = \left( \mathbf{a}^{T}, \ldots, \mathbf{a}^{T} \right)^{T}, \quad \text{with } \mathbf{a} = \operatorname{avg}\left\{ \mathbf{w}^{(i)} \right\}_{i=1}^{n}. \tag{3.16}$$

The projection on the orthogonal complement S⊥, in turn, is given explicitly as

$$P_{\mathcal{S}^{\perp}} \mathbf{w} = \mathbf{w} - P_{\mathcal{S}} \mathbf{w} = \operatorname{stack}\left\{ \mathbf{w}^{(i)} - \operatorname{avg}\left\{ \mathbf{w}^{(i)} \right\}_{i=1}^{n} \right\}_{i=1}^{n}. \tag{3.17}$$
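In code, the projections (3.16) and (3.17) amount to nothing more than averaging the local model parameters; a minimal numpy check (with arbitrary made-up dimensions) reads:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 3
W = rng.normal(size=(n, d))   # row i holds the local model parameters w^(i)
w = W.reshape(-1)             # stacked vector w in R^{nd}

a = W.mean(axis=0)            # a = avg{w^(i)}, see (3.16)
P_S_w = np.tile(a, n)         # projection of w on S, eq. (3.16)
P_Sperp_w = w - P_S_w         # projection on the orthogonal complement, eq. (3.17)

# The two components are orthogonal and sum up to w.
print(np.isclose(P_S_w @ P_Sperp_w, 0.0), np.allclose(P_S_w + P_Sperp_w, w))
```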

It might be convenient to replace a given FL network G with an equivalent fully connected FL network G′ (see Figure 3.3). The graph G′ has an edge between each pair of different nodes i, i′,

$$\mathcal{E}' = \left\{ \{i, i'\} : i, i' \in \mathcal{V}, i \neq i' \right\}.$$

The edge weights are chosen A′i,i′ = Ai,i′ for any edge {i, i′} ∈ E and A′i,i′ = 0 if the original FL network G does not contain an edge between nodes i, i′.

[Figure 3.3 shows, on the left, an FL network G with n = 4 nodes and, on the right, the corresponding fully connected FL network G′.]

Figure 3.3: Left: Some FL network G consisting of n = 4 nodes. Right: Equivalent fully connected FL network G′ with the same nodes and non-zero edge weights A′i,i′ = Ai,i′ for {i, i′} ∈ E and A′i,i′ = 0 for {i, i′} ∉ E.

Note that the undirected edges E of an FL network encode a symmetric notion of similarity between local datasets: If the local dataset D(i) at node i is similar to the local dataset D(i′) at node i′, i.e., {i, i′} ∈ E, then also the local dataset D(i′) is similar to the local dataset D(i).

3.3 Generalized Total Variation Minimization

Consider some FL network G whose nodes i ∈ V carry local datasets D(i) and local models parametrized by the vectors w(i) ∈ R^d. We learn these local model parameters by minimizing their local loss and, at the same time, enforcing a small total variation. Requiring a small total variation of the learnt local model parameters enforces them to be (approximately) constant over well-connected nodes (“clusters”).
To optimally balance local loss and total variation, we solve generalized total variation (GTV) minimization,

$$\left\{ \widehat{\mathbf{w}}^{(i)} \right\}_{i=1}^{n} \in \operatorname*{argmin}_{\operatorname{stack}\{\mathbf{w}^{(i)}\}_{i=1}^{n}} \; \sum_{i=1}^{n} L_i\left(\mathbf{w}^{(i)}\right) + \alpha \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \right\|_{2}^{2} \quad \text{(GTVMin)}. \tag{3.18}$$
Note that GTVMin is an instance of regularized empirical risk minimization (RERM): The regularizer is the GTV of local model parameters over the weighted edges Ai,i′ of the FL network. Clearly, the FL network is an important design choice for GTVMin-based methods. This choice can be guided by computational and statistical aspects of GTVMin-based FL systems. Some application domains allow us to leverage domain expertise to guess a useful choice for the FL network. If local datasets are generated at different geographic locations, we might use nearest-neighbour graphs based on geodesic distances between data generators (e.g., FMI weather stations). Lecture 7 will also discuss graph learning methods that determine the edge weights Ai,i′ in a data-driven fashion, i.e., directly from the local datasets D(i), D(i′).
Let us now consider the special case of GTVMin with local models being linear models. For each node i ∈ V of the FL network, we want to learn the parameters w(i) of a linear hypothesis h(i)(x) := (w(i))^T x. We measure the quality of the parameters via the average squared error loss

$$L_i\left(\mathbf{w}^{(i)}\right) := (1/m_i) \sum_{r=1}^{m_i} \left( y^{(i,r)} - \left(\mathbf{w}^{(i)}\right)^{T} \mathbf{x}^{(i,r)} \right)^{2} \stackrel{(3.2)}{=} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \right\|_{2}^{2}. \tag{3.19}$$

Inserting (3.19) into (3.18) yields the following instance of GTVMin to train local linear models,

$$\left\{ \widehat{\mathbf{w}}^{(i)} \right\}_{i=1}^{n} \in \operatorname*{argmin}_{\{\mathbf{w}^{(i)}\}_{i=1}^{n}} \; \sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \right\|_{2}^{2} + \alpha \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \mathbf{w}^{(i)} - \mathbf{w}^{(i')} \right\|_{2}^{2}. \tag{3.20}$$

The identity (3.7) allows us to rewrite (3.20) using the Laplacian matrix L(G) as

$$\widehat{\mathbf{w}} \in \operatorname*{argmin}_{\mathbf{w} = \operatorname{stack}\{\mathbf{w}^{(i)}\}_{i \in \mathcal{V}}} \; \sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \mathbf{w}^{(i)} \right\|_{2}^{2} + \alpha \mathbf{w}^{T} \left( L^{(\mathcal{G})} \otimes \mathbf{I}_{d} \right) \mathbf{w}. \tag{3.21}$$

Let us rewrite the objective function in (3.21), up to an additive constant, as

$$\mathbf{w}^{T} \left( \begin{pmatrix} \mathbf{Q}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{(2)} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Q}^{(n)} \end{pmatrix} + \alpha L^{(\mathcal{G})} \otimes \mathbf{I} \right) \mathbf{w} + \left( \left(\mathbf{q}^{(1)}\right)^{T}, \ldots, \left(\mathbf{q}^{(n)}\right)^{T} \right) \mathbf{w}, \tag{3.22}$$

with Q(i) = (1/mi) (X(i))^T X(i), and q(i) := (−2/mi) (X(i))^T y(i).

Thus, like linear regression (2.7) and ridge regression (2.27), also GTVMin (3.21) (for local linear models H(i)) minimizes a convex quadratic function,

$$\widehat{\mathbf{w}} \in \operatorname*{argmin}_{\mathbf{w} = \operatorname{stack}\{\mathbf{w}^{(i)}\}_{i=1}^{n}} \; \mathbf{w}^{T} \mathbf{Q} \mathbf{w} + \mathbf{q}^{T} \mathbf{w}. \tag{3.23}$$

Here, we used the psd matrix

$$\mathbf{Q} := \begin{pmatrix} \mathbf{Q}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{(2)} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Q}^{(n)} \end{pmatrix} + \alpha L^{(\mathcal{G})} \otimes \mathbf{I}, \quad \text{with } \mathbf{Q}^{(i)} := (1/m_i) \left(\mathbf{X}^{(i)}\right)^{T} \mathbf{X}^{(i)}, \tag{3.24}$$

and the vector

$$\mathbf{q} := \left( \left(\mathbf{q}^{(1)}\right)^{T}, \ldots, \left(\mathbf{q}^{(n)}\right)^{T} \right)^{T}, \quad \text{with } \mathbf{q}^{(i)} := (-2/m_i) \left(\mathbf{X}^{(i)}\right)^{T} \mathbf{y}^{(i)}. \tag{3.25}$$

3.3.1 Computational Aspects of GTVMin

Lecture 5 will apply optimization methods to solve GTVMin, resulting in practical FL algorithms. Different instances of GTVMin favour different classes of optimization methods. For example, using a differentiable loss function allows us to apply gradient-based methods (see Lecture 4) to solve GTVMin.
Another important class of loss functions are those for which we can efficiently compute the proximal operator

$$\operatorname{prox}_{L,\rho}(\mathbf{w}) := \operatorname*{argmin}_{\mathbf{w}' \in \mathbb{R}^{d}} L(\mathbf{w}') + (\rho/2) \left\| \mathbf{w} - \mathbf{w}' \right\|_{2}^{2}, \quad \text{for some } \rho > 0.$$

Some authors refer to functions L for which prox_{L,ρ}(w) can be computed easily as simple or proximable [37]. GTVMin with proximable loss functions can be solved quite efficiently via proximal algorithms [38].
Besides influencing the choice of optimization method, the design choices underlying GTVMin also determine the amount of computation that is required by a given optimization method. For example, using an FL network with relatively few edges (“sparse graphs”) typically results in a smaller computational complexity. Indeed, Lecture 5 discusses GTVMin-based algorithms requiring an amount of computation that is proportional to the number of edges in the FL network.
Let us now consider the computational aspects of GTVMin (3.20) to train local linear models. As discussed above, this instance is equivalent to solving (3.23). Any solution ŵ of (3.23) (and, in turn, (3.20)) is characterized by the zero-gradient condition

$$\mathbf{Q} \widehat{\mathbf{w}} = -(1/2) \mathbf{q}, \tag{3.26}$$

with Q, q as defined in (3.24) and (3.25). If the matrix Q in (3.26) is invertible, the solution to (3.26) and, in turn, to the GTVMin instance (3.20) is unique and given by ŵ = −(1/2) Q⁻¹ q.
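For small FL networks, we can solve (3.26) directly. The following sketch (our own illustration; the function name and the toy data are hypothetical) assembles Q and q according to (3.24) and (3.25) and solves the resulting linear system with numpy:

```python
import numpy as np

def gtvmin_linear(Xs, ys, L, alpha):
    """Solve the GTVMin instance (3.20) for local linear models via (3.26).

    Xs, ys: lists of local feature matrices X^(i) and label vectors y^(i).
    L: Laplacian matrix of the FL network; alpha: GTVMin parameter.
    """
    n, d = len(Xs), Xs[0].shape[1]
    Q = alpha * np.kron(L, np.eye(d))        # alpha * (L tensor I), see (3.24)
    q = np.zeros(n * d)
    for i, (X, y) in enumerate(zip(Xs, ys)):
        m = X.shape[0]
        Q[i*d:(i+1)*d, i*d:(i+1)*d] += X.T @ X / m   # block Q^(i) in (3.24)
        q[i*d:(i+1)*d] = -2.0 / m * (X.T @ y)        # block q^(i) in (3.25)
    w_hat = np.linalg.solve(Q, -0.5 * q)     # zero-gradient condition (3.26)
    return w_hat.reshape(n, d)               # row i holds w-hat^(i)

# Toy example: two nodes with identical noiseless data, single-edge graph.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0])
L = np.array([[1., -1.], [-1., 1.]])
W_hat = gtvmin_linear([X, X], [y, y], L, alpha=1.0)
print(np.allclose(W_hat[0], W_hat[1]))
```

With identical local datasets, the GTVMin solution is identical at both nodes and recovers the common parameter vector.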
The size of the matrix Q (see (3.24)) is proportional to the number of nodes in the FL network G, which might be on the order of millions (or even billions) for internet-scale applications. For such large systems, we typically cannot use direct matrix inversion methods (e.g., based on Gaussian elimination) to compute Q⁻¹.⁸ Instead, we typically need to resort to iterative methods [39, 40].
One important family of such iterative methods are the gradient-based methods, which we will discuss in Lecture 4. Starting from an initial choice ŵ₀^(1), . . . , ŵ₀^(n) for the local model parameters, these methods repeat (variants of) the gradient step,

$$\widehat{\mathbf{w}}_{k+1} := \widehat{\mathbf{w}}_{k} - \eta \left( 2 \mathbf{Q} \widehat{\mathbf{w}}_{k} + \mathbf{q} \right) \quad \text{for } k = 0, 1, \ldots.$$

⁸How many arithmetic operations (addition, multiplication) do you think are required to invert an arbitrary matrix Q ∈ R^{d×d}?

The gradient step results in the updated local model parameters ŵₖ₊₁^(i), which we stack into

$$\widehat{\mathbf{w}}_{k+1} := \left( \left(\widehat{\mathbf{w}}_{k+1}^{(1)}\right)^{T}, \ldots, \left(\widehat{\mathbf{w}}_{k+1}^{(n)}\right)^{T} \right)^{T}.$$

We repeat the gradient step for a sufficient number of times, according to some stopping criterion (see Lecture 4).
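As an illustration (a toy example of ours, not part of the lecture material), the repeated gradient step for a convex quadratic objective of the form (3.23) can be sketched as:

```python
import numpy as np

def gradient_steps(Q, q, eta, num_steps):
    """Iterate the gradient step w_{k+1} = w_k - eta*(2 Q w_k + q)."""
    w = np.zeros(q.shape)
    for _ in range(num_steps):
        w = w - eta * (2 * Q @ w + q)
    return w

# Minimize f(w) = w^T Q w + q^T w for a small invertible Q.
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
q = np.array([-4.0, 2.0])
w_hat = gradient_steps(Q, q, eta=0.2, num_steps=200)
# The iterates converge to the solution of (3.26), w = -(1/2) Q^{-1} q.
print(np.allclose(w_hat, -0.5 * np.linalg.solve(Q, q)))
```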

3.3.2 Statistical Aspects of GTVMin

How useful are the solutions of GTVMin (3.18) as a choice for the local model parameters? To answer this question, we use - as for the statistical analysis of ERM in Lecture 2 - a probabilistic model for the local datasets. In particular, we use a variant of an i.i.d. assumption: Each local dataset D(i) consists of data points whose features and labels are realizations of i.i.d. RVs,

$$\mathbf{y}^{(i)} = \underbrace{\left( \mathbf{x}^{(i,1)}, \ldots, \mathbf{x}^{(i,m_i)} \right)^{T}}_{\text{local feature matrix } \mathbf{X}^{(i)}} \overline{\mathbf{w}}^{(i)} + \boldsymbol{\varepsilon}^{(i)}, \quad \text{with } \mathbf{x}^{(i,r)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\mathbf{0}, \mathbf{I}), \; \boldsymbol{\varepsilon}^{(i)} \sim \mathcal{N}(\mathbf{0}, \sigma^{2} \mathbf{I}). \tag{3.27}$$

In contrast to the probabilistic model (2.13) (which we used for the analysis of ERM), the probabilistic model (3.27) allows for different node-specific true parameters w̄(i), for i ∈ V. In particular, the entire dataset obtained from pooling all local datasets does not conform to an i.i.d. assumption.
In what follows, we focus on the GTVMin instance (3.20) to learn the parameters w(i) of a local linear model for each node i ∈ V. For a reasonable choice of FL network, the true parameters w̄(i), w̄(i′) at connected nodes {i, i′} ∈ E should be similar. However, we cannot choose the edge weights based on the parameters w̄(i) as they are unknown. We can only use (noisy) estimates for w̄(i), computed from the features and labels of the data points in the local datasets (see Lecture 7).
Consider an FL network with nodes carrying local datasets generated from the probabilistic model (3.27) with true model parameters w̄(i). For ease of exposition, we assume that

$$\overline{\mathbf{w}}^{(i)} = \mathbf{c}, \quad \text{for some } \mathbf{c} \in \mathbb{R}^{d} \text{ and all } i \in \mathcal{V}. \tag{3.28}$$

To study the deviation between the solutions ŵ(i) of (3.20) and the true underlying parameters w̄(i), we decompose it as

$$\widehat{\mathbf{w}}^{(i)} =: \widetilde{\mathbf{w}}^{(i)} + \widehat{\mathbf{c}}, \quad \text{with } \widehat{\mathbf{c}} := (1/n) \sum_{i'=1}^{n} \widehat{\mathbf{w}}^{(i')}. \tag{3.29}$$

The component ĉ is identical at all nodes i ∈ V and is obtained from the orthogonal projection of ŵ = stack{ŵ(i)}ⁿᵢ₌₁ on the subspace (3.15). The component w̃(i) := ŵ(i) − (1/n) ∑ⁿᵢ′₌₁ ŵ(i′) is the deviation, at each node i, between the GTVMin solution ŵ(i) and the average over all nodes. Trivially, the average of the deviations w̃(i) across all nodes is the zero vector, (1/n) ∑ⁿᵢ₌₁ w̃(i) = 0.

The decomposition (3.29) entails an analogous (orthogonal) decomposition of the error ŵ(i) − w̄(i). Indeed, for identical true underlying model parameters (3.28) (which makes w̄ = stack{w̄(i)}ⁿᵢ₌₁ an element of the subspace (3.15)), we have

$$\sum_{i=1}^{n} \left\| \widehat{\mathbf{w}}^{(i)} - \overline{\mathbf{w}}^{(i)} \right\|_{2}^{2} \stackrel{(3.28),(3.29)}{=} \underbrace{\sum_{i=1}^{n} \left\| \mathbf{c} - \widehat{\mathbf{c}} \right\|_{2}^{2}}_{n \left\| \mathbf{c} - \widehat{\mathbf{c}} \right\|_{2}^{2}} + \sum_{i=1}^{n} \left\| \widetilde{\mathbf{w}}^{(i)} \right\|_{2}^{2}. \tag{3.30}$$

The following proposition provides an upper bound on the second error component in (3.30).

Proposition 3.1. Consider a connected FL network, i.e., λ2 > 0 (see (3.9)), and the solution (3.29) to GTVMin (3.20) for the local datasets (3.27). If the true local model parameters in (3.27) are identical (see (3.28)), we can upper bound the deviation w̃(i) := ŵ(i) − (1/n) ∑ⁿᵢ₌₁ ŵ(i) of the learnt model parameters ŵ(i) from their average, as

$$\sum_{i=1}^{n} \left\| \widetilde{\mathbf{w}}^{(i)} \right\|_{2}^{2} \leq \frac{1}{\lambda_2 \alpha} \sum_{i=1}^{n} (1/m_i) \left\| \boldsymbol{\varepsilon}^{(i)} \right\|_{2}^{2}. \tag{3.31}$$

Proof. See Section 3.5.1.

The upper bound (3.31) involves three components:

• the properties of the local datasets, via the noise terms ε(i) in (3.27),

• the FL network, via the eigenvalue λ2(L(G)) (see (3.9)),

• the GTVMin parameter α.

According to (3.31), we can ensure a small error component w̃(i) of the GTVMin solution by choosing a large value α. Thus, by (3.30), for sufficiently large α (and a connected FL network G where λ2(L(G)) > 0), the solutions ŵ(i) of GTVMin are approximately identical for all nodes i ∈ V. This is desirable for FL applications that use a common model for all nodes [14]. However, some FL applications involve heterogeneous devices that generate local datasets with significantly different statistics. For such applications it is detrimental to use a common model for all nodes (see Lecture 6).

3.4 Overview of Coding Assignment

Python Notebook. FLDesignPrinciple_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv
This assignment revolves around a collection of temperature measurements that we store in the FL network G(FMI). Each node i ∈ V represents an FMI weather station at latitude lat(i) and longitude lon(i), which we stack into the vector z(i) := (lat(i), lon(i))^T ∈ R². The local dataset D(i) contains mi temperature measurements y(i,1), . . . , y(i,mi) recorded at station i.
The edges of G(FMI) are obtained using the Python function add_edges(). Each FMI station i is connected to its nearest neighbours i′, using the Euclidean distance between the corresponding vectors z(i), z(i′). The number of neighbours is controlled by the input parameter numneighbors. All edges {i, i′} ∈ E have the same edge weight Ai,i′ = 1.
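The actual add_edges() is provided in the notebook; the sketch below is only our own assumption about what such a k-nearest-neighbour construction could look like (the station coordinates are made up):

```python
import numpy as np

def add_edges(coords, numneighbors):
    """Sketch of a k-nearest-neighbour FL network (symmetrized, unit weights).

    coords: array of shape (n, 2) with rows z^(i) = (lat^(i), lon^(i)).
    Returns a symmetric weight matrix A with A[i, j] = 1 for edges {i, j}.
    """
    n = coords.shape[0]
    # Pairwise Euclidean distances between the vectors z^(i).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # exclude self-loops
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dist[i])[:numneighbors]:
            A[i, j] = A[j, i] = 1.0          # undirected edge with weight 1
    return A

# Two pairs of nearby (hypothetical) stations.
coords = np.array([[60.2, 24.9], [60.3, 25.0], [65.0, 25.5], [65.1, 25.4]])
A = add_edges(coords, numneighbors=1)
print(A[0, 1] == 1.0 and A[2, 3] == 1.0 and A[0, 2] == 0.0)
```

Symmetrizing the k-nearest-neighbour relation ensures that the resulting edge set is undirected, as required for an FL network.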
For each station i ∈ V, you need to learn the single parameter w(i) ∈ R of a hypothesis h(x) = w(i) that predicts the temperature. We measure the quality of a hypothesis by the average squared error loss Li(w(i)) = (1/mi) ∑ᵐⁱᵣ₌₁ (y(i,r) − w(i))². You should learn the parameters w(i) via balancing the local loss with the total variation of w(i),

$$\left( \widehat{w}^{(1)}, \ldots, \widehat{w}^{(n)} \right)^{T} \in \operatorname*{argmin}_{\left( w^{(1)}, \ldots, w^{(n)} \right)^{T}} \; \sum_{i=1}^{n} L_i\left(w^{(i)}\right) + \alpha \sum_{\{i,i'\} \in \mathcal{E}} \left( w^{(i)} - w^{(i')} \right)^{2}. \tag{3.32}$$

Your tasks involve

• Reformulating (3.32) as (3.20) using a suitable choice for the features x(i,r). Carefully note that this choice needs to be different from the raw features (longitude, latitude, year, month, day, hour, minute) used in the assignment “ML Basics”.

• Computing the solutions ŵ(i) of (3.32) via (3.26), using the Python function [Link]. To this end, you should determine the matrix Q and vector q in terms of the local datasets D(i) and the Laplacian matrix L(FMI) of the FL network G(FMI).

• Studying the effect of varying values of α in (3.32) on the local loss and total variation of the corresponding solutions ŵ(i).

3.5 Proofs

3.5.1 Proof of Proposition 3.1

Let us introduce the shorthand f({w(i)}) for the objective function of the GTVMin instance (3.20). We verify the bound (3.31) by showing that if it does not hold, the choice w(i) := w̄(i) for the local model parameters (see (3.27)) results in a smaller objective function value, f({w̄(i)}) < f({ŵ(i)}). This would contradict the fact that {ŵ(i)} is a solution to (3.20).
First, note that

$$f\left(\left\{\overline{\mathbf{w}}^{(i)}\right\}\right) = \sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \overline{\mathbf{w}}^{(i)} \right\|_{2}^{2} + \alpha \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \overline{\mathbf{w}}^{(i)} - \overline{\mathbf{w}}^{(i')} \right\|_{2}^{2}
\stackrel{(3.28)}{=} \sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \overline{\mathbf{w}}^{(i)} \right\|_{2}^{2}
\stackrel{(3.27)}{=} \sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{X}^{(i)} \overline{\mathbf{w}}^{(i)} + \boldsymbol{\varepsilon}^{(i)} - \mathbf{X}^{(i)} \overline{\mathbf{w}}^{(i)} \right\|_{2}^{2}
= \sum_{i \in \mathcal{V}} (1/m_i) \left\| \boldsymbol{\varepsilon}^{(i)} \right\|_{2}^{2}. \tag{3.33}$$

Inserting (3.29) into (3.20),

$$f\left(\left\{\widehat{\mathbf{w}}^{(i)}\right\}\right) = \underbrace{\sum_{i \in \mathcal{V}} (1/m_i) \left\| \mathbf{y}^{(i)} - \mathbf{X}^{(i)} \widehat{\mathbf{w}}^{(i)} \right\|_{2}^{2}}_{\geq 0} + \alpha \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \underbrace{\left\| \widehat{\mathbf{w}}^{(i)} - \widehat{\mathbf{w}}^{(i')} \right\|_{2}^{2}}_{\stackrel{(3.29)}{=} \left\| \widetilde{\mathbf{w}}^{(i)} - \widetilde{\mathbf{w}}^{(i')} \right\|_{2}^{2}}
\geq \alpha \sum_{\{i,i'\} \in \mathcal{E}} A_{i,i'} \left\| \widetilde{\mathbf{w}}^{(i)} - \widetilde{\mathbf{w}}^{(i')} \right\|_{2}^{2}
\stackrel{(3.14)}{\geq} \alpha \lambda_2 \sum_{i=1}^{n} \left\| \widetilde{\mathbf{w}}^{(i)} \right\|_{2}^{2}. \tag{3.34}$$

Here, the last step uses (3.14) together with the fact that the deviations average to zero, (1/n) ∑ⁿᵢ₌₁ w̃(i) = 0. If the bound (3.31) did not hold, then combining (3.34) and (3.33) would yield f({ŵ(i)}) > f({w̄(i)}). This is a contradiction to the fact that {ŵ(i)} solves (3.20).

4 Gradient Methods
Chapter 3 introduced GTVMin as a central design principle for FL methods.
Many significant instances of GTVMin minimize a smooth objective function
f (w) over the parameter space (typically a subset of Rd ). This chapter
explores gradient-based methods, a widely used family of iterative algorithms
to find the minimum of a smooth function. These methods approximate the
objective function locally using its gradient at the current choice for the model
parameters. Chapter 5 focuses on FL algorithms obtained from applying
gradient-based methods to solve GTVMin.

4.1 Learning Goals

After completing this chapter, you will

• have some intuition about the effect of a gradient step,

• understand the role of the step size (or learning rate),

• know some examples of a stopping criterion,

• be able to analyze the effect of perturbations in the gradient step,

• know about projected gradient descent (GD) to cope with constraints on model parameters.

4.2 Gradient Descent

Gradient-based methods are iterative algorithms for finding the minimum of a differentiable objective function f(w) of a vector-valued argument w (e.g., the model parameters in an ML method). Unless stated otherwise, we consider an objective function of the form

$$f(\mathbf{w}) = \mathbf{w}^{T} \mathbf{Q} \mathbf{w} + \mathbf{q}^{T} \mathbf{w}. \tag{4.1}$$

Note that (4.1) defines an entire family of convex quadratic functions f(w). Each member of this family is specified by a psd matrix Q ∈ R^{d×d} and a vector q ∈ R^d.
We have already encountered some ML and FL methods that minimize
an objective function of the form (4.1): Linear regression (2.3) and ridge
regression (2.27) in Chapter 2 as well as GTVMin (3.20) for local linear
models in Chapter 3. Moreover, (4.1) might serve as a useful approximation
for the objective functions arising from larger classes of ML methods [41–43].
Given model parameters w(k), we want to update them towards a minimum of (4.1). To this end, we use the gradient ∇f(w(k)) to locally approximate f(w) (see Figure 4.1). The gradient ∇f(w(k)) indicates the direction in which the function f(w) maximally increases. Therefore, it seems reasonable to update w(k) in the opposite direction of ∇f(w(k)),

$$\mathbf{w}^{(k+1)} := \mathbf{w}^{(k)} - \eta \nabla f\left(\mathbf{w}^{(k)}\right) \stackrel{(4.1)}{=} \mathbf{w}^{(k)} - \eta \left( 2 \mathbf{Q} \mathbf{w}^{(k)} + \mathbf{q} \right). \tag{4.2}$$

The gradient step (4.2) involves the factor η, which we refer to as the step size or learning rate. Algorithm 4.1 summarizes the most basic instance of gradient-based methods, which simply repeats (iterates) (4.2) until some stopping criterion is met.

[Figure 4.1 illustrates the local linear approximation of f(w) around a point w(k).]

Figure 4.1: We can approximate a differentiable function f(w) locally around a point w(k) ∈ R^d using the linear function f(w(k)) + (w − w(k))^T ∇f(w(k)). Geometrically, we approximate the graph of f(w) by a hyperplane whose normal vector n = ((∇f(w(k)))^T, −1)^T ∈ R^{d+1} is determined by the gradient ∇f(w(k)) [2].

The usefulness of gradient-based methods crucially depends on the computational complexity of evaluating the gradient ∇f(w). Modern software libraries for automatic differentiation enable the efficient evaluation of the gradients arising in widely used ERM-based methods [44]. Besides the actual computation of the gradient, it might already be challenging to gather the required data points which define the objective function f(w) (e.g., being the average loss over a large training set). Indeed, the matrix Q and vector q in (4.1) are constructed from the features and labels of data points in the training set. For example, the gradient of the objective function in ridge regression (2.27) is

$$\nabla f(\mathbf{w}) = -(2/m) \sum_{r=1}^{m} \mathbf{x}^{(r)} \left( y^{(r)} - \mathbf{w}^{T} \mathbf{x}^{(r)} \right) + 2 \alpha \mathbf{w}.$$

Evaluating this gradient requires roughly d × m arithmetic operations (summations and multiplications).

Algorithm 4.1 A blueprint for gradient-based methods

Input: some objective function f(w) (e.g., the average loss of a hypothesis h(w) on a training set); learning rate η > 0; some stopping criterion
Initialize: set w(0) := 0; set iteration counter k := 0
1: repeat
2:   k := k + 1 (increase iteration counter)
3:   w(k) := w(k−1) − η ∇f(w(k−1)) (do a gradient step (4.2))
4: until stopping criterion is met
Output: learnt model parameters ŵ := w(k) (hopefully f(ŵ) ≈ min_w f(w))

Note that Algorithm 4.1, like most other gradient-based methods, involves
two hyper-parameters: (i) the learning rate η used for the gradient step and
(ii) a stopping criterion to decide when to stop repeating the gradient step.
We next discuss how to choose these hyper-parameters.

4.3 Learning Rate

The learning rate must not be too large to avoid moving away from the
optimum by overshooting (see Figure 4.2-(a)). On the other hand, if the
learning rate is chosen too small, the gradient step makes too little progress
towards the solutions of (4.1) (see Figure 4.2-(b)). Note that in practice we
can only afford to repeat the gradient step for a finite number of iterations.

[Figure 4.2 illustrates gradient steps with a too large learning rate in panel (a) and a too small learning rate in panel (b).]

Figure 4.2: Effect of inadequate learning rates η in the gradient step (4.2). (a) If η is too large, the gradient steps might “overshoot” such that the iterates w(k) might diverge from the optimum, i.e., f(w(k+1)) > f(w(k))! (b) If η is too small, the gradient steps make very little progress towards the optimum or even fail to reach the optimum at all.

One approach to choose the learning rate is to start with some initial
value (first guess) and monitor the decrease of the objective function. If
this decrease does not agree with the decrease predicted by the (local linear
approximation using the) gradient, we decrease the learning rate by a constant
factor. After we decrease the learning rate, we re-consider the decrease of the
objective function. We repeat this procedure until a sufficient decrease of the
objective function is achieved [45, Sec 6.1].
Alternatively, we can use a prescribed sequence (schedule) ηk, for k = 1, 2, . . ., of learning rates that vary across successive gradient steps [46]. We can guarantee convergence of (4.2) (under mild assumptions on the objective function f(w)) by using a learning rate schedule that satisfies [45, Sec. 6.1]

$$\lim_{k \to \infty} \eta_k = 0, \quad \sum_{k=1}^{\infty} \eta_k = \infty, \quad \text{and} \quad \sum_{k=1}^{\infty} \eta_k^{2} < \infty. \tag{4.3}$$

The first condition in (4.3) requires that the learning rate eventually becomes sufficiently small to avoid overshooting. The third condition in (4.3) ensures that this required decay of the learning rate does not take “forever”. Note that the first and third condition in (4.3) could be satisfied by the trivial learning rate schedule ηk = 0, which is clearly not useful as the gradient step would have no effect.
The trivial schedule ηk = 0 is ruled out by the middle condition of (4.3). This middle condition ensures that the learning rate ηk is large enough such that the gradient steps make sufficient progress towards a minimizer of the objective function.
We emphasize that the conditions (4.3) do not involve any properties of the matrix Q in (4.1). Note that the matrix Q is determined by data points (see, e.g., (2.3)) whose statistical properties typically can be controlled only to a limited extent (e.g., via data normalization).
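For instance, the schedule ηk = 1/k satisfies all three conditions in (4.3); a quick numerical illustration (the truncation point K is arbitrary):

```python
# The schedule eta_k = 1/k satisfies the conditions (4.3):
# eta_k -> 0, the partial sums of eta_k diverge (harmonic series),
# and the sum of eta_k^2 stays finite (it equals pi^2/6 ~ 1.6449).
K = 100000
etas = [1.0 / k for k in range(1, K + 1)]
print(etas[-1] < 1e-4,                     # eta_k becomes arbitrarily small
      sum(etas) > 10,                      # partial sums keep growing
      sum(e * e for e in etas) < 1.645)    # squared sums remain bounded
```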

4.4 When to Stop?

For the stopping criterion we might use a fixed number kmax of iterations.
This hyper-parameter kmax might be dictated by limited resources (such as
computational time) for implementing GD or tuned via validation techniques.
We can obtain another stopping criterion from monitoring the decrease in the objective function f(w(k)): we stop repeating the gradient step (4.2) whenever f(w(k)) − f(w(k+1)) ≤ ε(tol) for a given tolerance ε(tol). The tolerance level ε(tol) is a hyper-parameter of the resulting ML method which could be chosen via validation techniques (see Chapter 2).
The above techniques for deciding when to stop the gradient steps are convenient as they do not require any information about the psd matrix Q in (4.1). However, they might result in sub-optimal use of computational resources by implementing “useless” additional gradient steps. We can use information about the psd matrix Q in (4.1) to avoid unnecessary computations.⁹ Indeed, the choice for the learning rate η and the stopping criterion can be guided by the eigenvalues

$$0 \leq \lambda_1(\mathbf{Q}) \leq \ldots \leq \lambda_d(\mathbf{Q}). \tag{4.4}$$

Even if we do not know these eigenvalues precisely, we might know (or be able to ensure via feature learning) some upper and lower bounds,

$$0 \leq L \leq \lambda_1(\mathbf{Q}) \leq \ldots \leq \lambda_d(\mathbf{Q}) \leq U. \tag{4.5}$$

In what follows, we assume that Q is invertible and that we know some positive lower bound L > 0 (see (4.5)).¹⁰ The objective function (4.1) then has a unique minimizer ŵ. It turns out that a gradient step (4.2) reduces the distance ∥w(k) − ŵ∥₂ to ŵ by a constant factor [45, Ch. 6],

$$\left\| \mathbf{w}^{(k+1)} - \widehat{\mathbf{w}} \right\|_{2} \leq \kappa^{(\eta_k)}(\mathbf{Q}) \left\| \mathbf{w}^{(k)} - \widehat{\mathbf{w}} \right\|_{2}. \tag{4.6}$$

Here, we used the contraction factor

$$\kappa^{(\eta)}(\mathbf{Q}) := \max\left\{ \left| 1 - \eta 2 \lambda_1 \right|, \left| 1 - \eta 2 \lambda_d \right| \right\}. \tag{4.7}$$

⁹For linear regression (2.7), the matrix Q is determined by the features of the data points in the training set. We can influence the properties of Q to some extent by feature transformation methods. One important example for such a transformation is the normalization of features.
¹⁰What are sufficient conditions on the local datasets and the edge weights used in GTVMin such that Q in (3.24) is invertible?

The contraction factor depends on the learning rate η, which is a hyper-parameter of gradient-based methods that we can control. However, the contraction factor also depends on the eigenvalues of the matrix Q in (4.1). In ML and FL applications, this matrix typically depends on data and can be controlled only to some extent, e.g., using feature transformation [6, Ch. 5]. We can ensure κ(η)(Q) < 1 if we use a positive learning rate ηk < 1/U.
For a contraction factor κ(η)(Q) < 1, a sufficient condition on the number k(ε(tol)) of gradient steps required to ensure an optimization error ∥w(k+1) − ŵ∥₂ ≤ ε(tol) is (see (4.6))

$$k \geq \frac{\log\left( \left\| \mathbf{w}^{(0)} - \widehat{\mathbf{w}} \right\|_{2} / \varepsilon^{(tol)} \right)}{\log\left( 1 / \kappa^{(\eta)}(\mathbf{Q}) \right)}. \tag{4.8}$$
According to (4.6), smaller values of the contraction factor κ(η)(Q) guarantee a faster convergence of (4.2) towards the solution of (4.1). Figure 4.3 illustrates the dependence of κ(η)(Q) on the learning rate η. Thus, choosing a small η (close to 0) will typically result in a larger κ(η)(Q) and, in turn, require more iterations to ensure the optimization error level ε(tol) via (4.6). We can minimize this contraction factor by choosing the learning rate (see Figure 4.3)

$$\eta^{(*)} := \frac{1}{\lambda_1 + \lambda_d}. \tag{4.9}$$
Note that evaluating (4.9) requires knowledge of the extremal eigenvalues λ1, λd of Q. Inserting the optimal learning rate (4.9) into (4.6) yields

$$\left\| \mathbf{w}^{(k+1)} - \widehat{\mathbf{w}} \right\|_{2} \leq \underbrace{\frac{(\lambda_d/\lambda_1) - 1}{(\lambda_d/\lambda_1) + 1}}_{:= \kappa^{*}(\mathbf{Q})} \left\| \mathbf{w}^{(k)} - \widehat{\mathbf{w}} \right\|_{2}. \tag{4.10}$$
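A small numerical check of (4.9), (4.12) and the contraction (4.6), for a diagonal matrix Q of our own choosing:

```python
import numpy as np

Q = np.diag([1.0, 4.0])                  # eigenvalues lambda_1 = 1, lambda_d = 4
q = np.array([2.0, -8.0])
w_hat = -0.5 * np.linalg.solve(Q, q)     # unique minimizer of (4.1)

lam1, lamd = 1.0, 4.0
eta_star = 1.0 / (lam1 + lamd)           # optimal learning rate (4.9)
kappa_star = (lamd / lam1 - 1) / (lamd / lam1 + 1)   # contraction factor (4.12)

# One gradient step (4.2) with the optimal learning rate contracts the
# distance to the minimizer by at most kappa_star, as stated in (4.10).
w = np.array([3.0, 3.0])
w_next = w - eta_star * (2 * Q @ w + q)
lhs = np.linalg.norm(w_next - w_hat)
rhs = kappa_star * np.linalg.norm(w - w_hat)
print(lhs <= rhs + 1e-12)
```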

Carefully note that the formula (4.10) is valid only if the matrix Q in (4.1) is invertible, i.e., if λ1 > 0. If the matrix Q is singular (λ1 = 0), the convergence of (4.2) towards a solution of (4.1) is much slower than the decrease of the bound (4.10). However, we can still ensure the convergence of the gradient steps w(k) by using a fixed learning rate ηk = η that satisfies [47, Thm. 2.1.14]

$$0 < \eta < 1/\lambda_d(\mathbf{Q}). \tag{4.11}$$

[Figure 4.3 plots the two terms |1 − η2λ1| and |1 − η2λd|, whose maximum is the contraction factor κ(η)(Q); its minimum value κ*(Q) = ((λd/λ1) − 1)/((λd/λ1) + 1) is attained at η* = 1/(λ1 + λd).]

Figure 4.3: The contraction factor κ(η)(Q) (4.7), used in the upper bound (4.6), as a function of the learning rate η. Note that κ(η)(Q) also depends on the eigenvalues of the matrix Q in (4.1).

It is interesting to note that for linear regression, the matrix Q depends only
on the features x(r) of the data points in the training set (see (2.17)) but not
on their labels y (r) . Thus, the convergence of gradient steps is only affected
by the features, whereas the labels are irrelevant. The same is true for ridge

regression and GTVMin (using local linear models).
Note that both, the optimal learning rate (4.9) and the optimal contraction
factor
    κ∗(Q) := ( (λd/λ1) − 1 ) / ( (λd/λ1) + 1 )   (4.12)
depend on the eigenvalues of the matrix Q in (4.1). According to (4.10),
the ideal case is when all eigenvalues are identical which leads, in turn, to
a contraction factor κ∗ (Q) = 0. Here, a single gradient step arrives at the
unique solution of (4.1).
In general, we do not have full control over the matrix Q and its eigenvalues.
For example, the matrix Q arising in linear regression (2.7) is determined
by the features of data points in the training set. These features might be
obtained from sensing devices and therefore beyond our control. However,
other applications might allow for some design freedom in the choice of feature
vectors. We might also use feature transformations that nudge the resulting
Q in (2.7) more towards a scaled identity matrix.
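The interplay between the optimal learning rate (4.9), the bound (4.10), and the contraction factor (4.12) is easy to check numerically. The following sketch (a hypothetical random matrix Q and vector q, not tied to any dataset from this book) runs the gradient step (4.2) with η(∗) and verifies the per-step contraction:

```python
import numpy as np

rng = np.random.default_rng(0)

# random convex quadratic f(w) = w^T Q w + q^T w with invertible Q
A = rng.standard_normal((5, 5))
Q = A.T @ A + 0.5 * np.eye(5)      # adding 0.5*I ensures lambda_1 > 0
q = rng.standard_normal(5)

eigvals = np.linalg.eigvalsh(Q)
lam1, lamd = eigvals[0], eigvals[-1]
eta = 1.0 / (lam1 + lamd)                        # optimal learning rate (4.9)
kappa = (lamd / lam1 - 1) / (lamd / lam1 + 1)    # optimal contraction factor (4.12)

w_hat = np.linalg.solve(2 * Q, -q)   # unique minimizer: gradient 2*Q*w + q = 0
w = np.zeros(5)
for _ in range(500):
    w_old = w
    w = w - eta * (2 * Q @ w + q)    # gradient step (4.2)
    # each step contracts the distance to w_hat by at most kappa, cf. (4.10)
    assert np.linalg.norm(w - w_hat) <= kappa * np.linalg.norm(w_old - w_hat) + 1e-12
```

Choosing η well above 1/λd instead would violate (4.11) and make the iteration diverge.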

4.5 Perturbed Gradient Step

Consider the gradient step (4.2) used to find a minimum of (4.1). We again
assume that the matrix Q in (4.1) is invertible (λ1 (Q) > 0) and, in turn, (4.1)
has a unique solution ŵ.
In some applications, it might be difficult to evaluate the gradient ∇f(w) =
2Qw + q of (4.1) exactly. For example, this evaluation might require
gathering data points from distributed storage locations. These storage locations
might become unavailable during the computation of ∇f(w) due to software
or hardware failures (e.g., limited connectivity).

We can model imperfections during the computation of (4.2) as the
perturbed gradient step

    w(k+1) := w(k) − η( ∇f(w(k)) + ε(k) )
            (4.1)
             = w(k) − η( 2Qw(k) + q + ε(k) ).   (4.13)

We can use the contraction factor (4.7) to upper bound the deviation between
w(k) and the optimum ŵ as (see (4.6))

    ∥w(k) − ŵ∥₂ ≤ ( κ(η)(Q) )^k ∥w(0) − ŵ∥₂ + Σ_{k′=1}^{k} ( κ(η)(Q) )^{k′} ∥ε(k−k′)∥₂.   (4.14)
This bound applies for any number of iterations k = 1, 2, . . . of the perturbed
gradient step (4.13).
Finally, note that the perturbed gradient step (4.13) could also be used
as a tool to analyze the (exact) gradient step for an objective function f̃(w)
which does not belong to the family (4.1) of convex quadratic functions.
Indeed, we can write the gradient step for minimizing f̃(w) as

    w(k+1) := w(k) − η∇f̃(w(k))
             = w(k) − η∇f(w(k)) + η( ∇f(w(k)) − ∇f̃(w(k)) ),

with the perturbation ε(k) := ∇f(w(k)) − ∇f̃(w(k)).

The last identity is valid for any choice of surrogate function f(w). In
particular, we can choose f(w) as a convex quadratic function (4.1) that
approximates f̃(w). Note that the perturbation term ε(k) is scaled by the
learning rate η.
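A small numerical sketch (a hypothetical diagonal Q and artificial bounded noise, not from the book) illustrates how the perturbed iteration (4.13) settles in a neighbourhood of ŵ whose radius is governed by the geometric series in (4.14):

```python
import numpy as np

rng = np.random.default_rng(1)

# simple quadratic f(w) = w^T Q w + q^T w with eigenvalues 1, 2, 4
Q = np.diag([1.0, 2.0, 4.0])
q = np.array([1.0, -2.0, 0.5])
w_hat = np.linalg.solve(2 * Q, -q)

eta = 1.0 / (1.0 + 4.0)              # optimal learning rate (4.9)
kappa = (4.0 - 1.0) / (4.0 + 1.0)    # contraction factor, here 0.6
eps_max = 1e-3                       # upper bound on the perturbation norm

w = np.zeros(3)
for _ in range(300):
    eps = rng.uniform(-1.0, 1.0, size=3)
    eps *= eps_max / max(np.linalg.norm(eps), 1.0)   # enforce ||eps|| <= eps_max
    w = w - eta * (2 * Q @ w + q + eps)              # perturbed gradient step (4.13)

# limit of the geometric perturbation series: the error settles at this level
steady = eta * eps_max / (1 - kappa)
```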

4.6 Handling Constraints - Projected Gradient Descent

Many important ML and FL methods amount to the minimization of an


objective function of the form (4.1). The optimization variable w in (4.1)
represents some model parameters.
Sometimes we might require the parameters w to belong to a subset
C ⊂ Rd . One example is regularization via model pruning (see Chapter 2).
Another example is given by FL methods that learn identical local model parameters
w(i) at all nodes i ∈ V of an FL network. This can be implemented by
requiring the stacked local model parameters w = ( (w(1))ᵀ, . . . , (w(n))ᵀ )ᵀ to belong
to the subset

    C = { ( (w(1))ᵀ, . . . , (w(n))ᵀ )ᵀ : w(1) = . . . = w(n) }.

Let us now show how to adapt the gradient step (4.2) to solve the constrained
problem

    f∗ = min_{w∈C} wᵀQw + qᵀw.   (4.15)

We assume that the constraint set C ⊆ Rᵈ is such that we can efficiently
compute the projection

    P_C(w) := argmin_{w′∈C} ∥w − w′∥₂ for any w ∈ Rᵈ.   (4.16)

A suitable modification of the gradient step (4.2) to solve the constrained
variant (4.15) is [45]

    w(k+1) := P_C( w(k) − η∇f(w(k)) )
            (4.1)
             = P_C( w(k) − η( 2Qw(k) + q ) ).   (4.17)
The projected GD step (4.17) amounts to:

1. compute the ordinary gradient step w(k) ↦ w(k) − η∇f(w(k)), and

2. project back onto the constraint set C.

Note that we re-obtain the basic gradient step (4.2) from the projected
gradient step (4.17) for the specific constraint set C = Rd .

Figure 4.4: Projected GD augments a basic gradient step with a projection
P_C(·) back onto the constraint set C.

The approaches for choosing the learning rate η and the stopping criterion for
the basic gradient step (4.2), explained in Sections 4.3 and 4.4, also work for the
projected gradient step (4.17). In particular, the convergence speed of the
projected gradient step is also characterized by (4.6) [45, Ch. 6]. This follows
from the fact that the concatenation of a contraction (such as the gradient
step (4.2) for sufficiently small η) and a projection (such as P_C(·)) results
again in a contraction with the same contraction factor.

Thus, the convergence speed of projected GD, in terms of the number of
iterations required to ensure a given level of optimization error, is essentially
the same as that of basic GD. However, the bound (4.6) only characterizes
the number of projected gradient steps (4.17) required to achieve a guaranteed
level of sub-optimality f(w(k)) − f∗. A single iteration (4.17) of projected GD
might require significantly more computation than the basic gradient step, as
it requires computing the projection (4.16).
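Projected GD is easy to try out for the consensus constraint set C introduced above. The sketch below (hypothetical diagonal blocks Q(i) and random vectors q(i)) applies (4.17) with the block-averaging projection and compares the result with the closed-form constrained solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 2

# hypothetical block-diagonal quadratic: f(w) = sum_i (w^(i))^T Q^(i) w^(i) + (q^(i))^T w^(i)
Q_blocks = [np.diag(rng.uniform(1.0, 3.0, size=d)) for _ in range(n)]
q_blocks = [rng.standard_normal(d) for _ in range(n)]

def grad(w):
    # gradient of f, evaluated block by block
    return np.concatenate([2 * Q_blocks[i] @ w[i*d:(i+1)*d] + q_blocks[i]
                           for i in range(n)])

def project_consensus(w):
    # projection (4.16) onto C = {w : w^(1) = ... = w^(n)}: average the blocks
    v = w.reshape(n, d).mean(axis=0)
    return np.tile(v, n)

eta = 0.05
w = np.zeros(n * d)
for _ in range(500):
    w = project_consensus(w - eta * grad(w))   # projected gradient step (4.17)

# with identical blocks enforced, the constrained optimum solves
# sum_i (2 Q^(i) v + q^(i)) = 0, i.e., v = -(1/2) (sum_i Q^(i))^{-1} sum_i q^(i)
v_star = np.linalg.solve(2 * sum(Q_blocks), -sum(q_blocks))
```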

4.7 Generalizing the Gradient Step

The gradient-based methods discussed so far can be used to learn a hypothesis


from a parametrized model. Let us now sketch one possible generalization of
the gradient step (4.2) for a model H without a parametrization.
We start by rewriting the gradient step (4.2) as the optimization

    w(k+1) = argmin_{w∈Rᵈ} [ (1/(2η)) ∥w − w(k)∥₂² + f(w(k)) + (w − w(k))ᵀ∇f(w(k)) ].   (4.18)

The objective function in (4.18) includes the first-order approximation

    f(w) ≈ f(w(k)) + (w − w(k))ᵀ∇f(w(k))

of the function f(w) around the location w = w(k) (see Figure 4.1).
Let us modify (4.18) by using f(w) itself (instead of its first-order approximation),

    w(k+1) = argmin_{w∈Rᵈ} (1/(2η)) ∥w − w(k)∥₂² + f(w).   (4.19)

Like the gradient step, the update (4.19) maps a given vector w(k) to an updated
vector w(k+1). Note that (4.19) is nothing but the proximal operator of the
function f (w) [38]. Similar to the role of the gradient step as the main
building block of gradient-based methods, the proximal operator (4.19) is the
main building block of proximal algorithms [38].
To obtain a version of (4.19) for a non-parametric model, we need to be
able to evaluate its objective function directly in terms of a hypothesis h

instead of its parameters w. The objective function in (4.19) consists of two
components. The second component f(·), which is the function we want to
minimize, is obtained from the training error incurred by a hypothesis, which
might be parametrized as h(w). Thus, we can evaluate the function f(h) by
computing the training error for a given hypothesis.
The first component of the objective function in (4.19) uses ∥w − w(k)∥₂² to
measure the difference between the hypothesis maps h(w) and h(w(k)). Another
measure for the difference between two hypothesis maps can be obtained by
using some test dataset D′ = { x(1), . . . , x(m′) }: the average squared difference
between their predictions,

    (1/m′) Σ_{r=1}^{m′} ( h(x(r)) − h(k)(x(r)) )²,   (4.20)

is a measure for the difference between h and h(k). Note that the measure
(4.20) does not require any model parameters but only the predictions delivered
by the hypothesis maps on D′.
It is interesting to note that (4.20) coincides with ∥w − w(k)∥₂² for the
linear model h(w)(x) := wᵀx and a specific construction of the dataset D′.
This construction uses the realizations x(1), x(2), . . . of i.i.d. RVs with a
common probability distribution x ∼ N(0, I). Indeed, by the law of large
numbers,

    lim_{m′→∞} (1/m′) Σ_{r=1}^{m′} ( h(w)(x(r)) − h(w(k))(x(r)) )²
     = lim_{m′→∞} (1/m′) Σ_{r=1}^{m′} ( (w − w(k))ᵀ x(r) )²
     = lim_{m′→∞} (1/m′) Σ_{r=1}^{m′} (w − w(k))ᵀ x(r) (x(r))ᵀ (w − w(k))
     = (w − w(k))ᵀ [ lim_{m′→∞} (1/m′) Σ_{r=1}^{m′} x(r) (x(r))ᵀ ] (w − w(k))
     = ∥w − w(k)∥₂².   (4.21)

The last step uses that the limit in brackets equals the identity matrix I.
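The limit (4.21) can be checked empirically: for standard Gaussian features, the average (4.20) concentrates around ∥w − w(k)∥₂². A small sketch with hypothetical parameter vectors and a large m′:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m_prime = 4, 200_000

w = rng.standard_normal(d)      # parameters of hypothesis h^(w)
w_k = rng.standard_normal(d)    # parameters of hypothesis h^(w^(k))

X = rng.standard_normal((m_prime, d))   # test dataset D' with x ~ N(0, I)

# average squared difference of predictions, i.e., the measure (4.20)
avg_sq_diff = np.mean((X @ w - X @ w_k) ** 2)

param_dist = np.linalg.norm(w - w_k) ** 2   # ||w - w^(k)||_2^2
```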

To arrive at our generalization of the gradient step, we replace ∥w − w(k)∥₂²
in (4.19) with (4.20),

    h(k+1) = argmin_{h∈H} [ (1/(2ηm′)) Σ_{r=1}^{m′} ( h(x(r)) − h(k)(x(r)) )² + f(h) ].   (4.22)

We can modify gradient-based methods (such as Algorithm 4.1), by replacing


the gradient step with the update (4.22), to obtain training algorithms for a
non-parametric model (or hypothesis space) H.

4.8 Overview of Coding Assignment

Python Notebook. GradientMethods_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv
Description. This assignment builds on the assignment for Chapter 2
which used ready-made Python implementations of linear regression and ridge
regression. Now you have to implement ridge regression “from scratch” by
applying Algorithm 4.1 to solve the ridge regression problem (2.27).
The features are (normalized) values of the latitude and longitude of the
FMI station as well as the (normalized) year, month, day, hour and minute
during which the measurement has been taken. Your tasks include

• Generate numpy arrays X, y whose r-th rows hold the features x(r) and
the label y (r), respectively, of the r-th data point in the csv file.

• Split the dataset into a training set and validation set. The size of the
training set should be 100.

• Train a linear model, using the Ridge class (sklearn.linear_model.Ridge)
of the scikit-learn package, on the training set and determine the resulting
training error Et and validation error Ev.

• Implement the GD Algorithm 4.1 for the objective function (2.27).


Unless stated otherwise, use the initialization w(0) = 0. Determine the
training error Et and validation error Ev using the model parameters
delivered by Algorithm 4.1.

• To develop some intuition for the behaviour of Algorithm 4.1, you could
try out different choices for the learning rate η and the maximum number
kmax of gradient steps. For each choice, you could monitor and plot the
objective function value f(w(k)) as a function of the iteration counter k
in Algorithm 4.1.

5 FL Algorithms
Chapter 3 introduced GTVMin as a flexible design principle for FL methods
that arise from different design choices for the local models and edge weights
of the FL network. The solutions of GTVMin are local model parameters
that strike a balance between the loss incurred on local datasets and the total
variation. This chapter applies the gradient-based methods from Chapter 4 to
solve GTVMin. The resulting FL algorithms can be implemented by message
passing over the edges of the FL network. The details of how this message
passing is implemented physically (e.g., via short-range wireless technology)
are beyond the scope of this course.

Section 5.2 studies the gradient step for the GTVMin instance obtained for
training local linear models. In particular, we show how the convergence rate
of the gradient step can be characterized by properties of the local datasets
and their FL network. Section 5.3 spells out the gradient step in the form
of message passing over the FL network. Section 5.4 presents an FL algorithm
that is obtained by replacing the exact GD with a stochastic approximation.
Section 5.5 discusses FL algorithms for the single-model setting where we
want to train a single global model in a distributed fashion.

5.1 Learning Goals

After this lecture, you

• can apply gradient-based methods to GTVMin for local linear models,

• can implement a gradient step via message passing over FL networks,

• know some motivation for stochastic gradient descent (SGD),

• know how federated averaging (FedAvg) is obtained from projected GD.

5.2 Gradient Step for GTVMin

Consider a collection of n local datasets represented by the nodes V =


{1, . . . , n} of an FL network G = (V, E). Each undirected edge {i, i′ } ∈ E in
FL network G has a known edge weight Ai,i′ . We want to learn local model
parameters w(i) of a personalized linear model for each node i = 1, . . . , n. To
this end, we solve the GTVMin instance
    { ŵ(i) }_{i=1}^{n} ∈ argmin_{ {w(i)} } Σ_{i∈V} (1/mi) ∥y(i) − X(i)w(i)∥₂² + α Σ_{{i,i′}∈E} A_{i,i′} ∥w(i) − w(i′)∥₂².   (5.1)

Here, Li(w(i)) := (1/mi) ∥y(i) − X(i)w(i)∥₂² is the local loss of node i, and we
denote the entire objective function of (5.1) by f(w).

As discussed in Chapter 3, the objective function in (5.1) - viewed as a
function of the stacked local model parameters w := stack{w(i)}_{i=1}^{n} - is a
quadratic function of the form (4.1),

    wᵀQw + qᵀw with Q := diag( Q(1), . . . , Q(n) ) + αL(G) ⊗ I
    and q := ( (q(1))ᵀ, . . . , (q(n))ᵀ )ᵀ,   (5.2)

with Q(i) = (1/mi)(X(i))ᵀX(i) and q(i) := (−2/mi)(X(i))ᵀy(i).

Therefore, the discussion and analysis of gradient-based methods from Chapter
4 also apply to GTVMin (5.1). In particular, we can use the gradient step

    w(k+1) := w(k) − η∇f(w(k))
            (4.1)
             = w(k) − η( 2Qw(k) + q )   (5.3)

to iteratively compute an approximate solution ŵ to (5.1). This solution
consists of the learnt local model parameters ŵ(i), i.e., ŵ = stack{ŵ(i)}. Section
5.3 will formulate the gradient step (5.3) directly in terms of the local model
parameters, resulting in a message passing over the FL network G.
According to the convergence analysis in Chapter 4, the convergence rate
of the iterations (5.3) is determined by the eigenvalues of the matrix Q in
(5.2). Clearly, the eigenvalues λj(Q) of Q are related to the eigenvalues
λj(Q(i)) and to the eigenvalues λj(L(G)) of the Laplacian matrix of the FL
network G. In particular, we will use the following two summary parameters

    λmax := max_{i=1,...,n} λd( Q(i) ), and λ̄min := λ1( (1/n) Σ_{i=1}^{n} Q(i) ).   (5.4)

We first present an upper bound U (see (4.5)) on the eigenvalues of the
matrix Q in (5.2).

Proposition 5.1. The eigenvalues of Q in (5.2) are upper-bounded as

    λj(Q) ≤ λmax + αλn( L(G) )
          ≤ λmax + α2d(G)max =: U, for j = 1, . . . , dn.   (5.5)

Proof. See Section 5.8.1.

The next result offers a lower bound on the eigenvalues λj (Q).

Proposition 5.2. Consider the matrix Q in (5.2). If λ2(L(G)) > 0 (i.e.,
the FL network in (5.1) is connected) and λ̄min > 0 (i.e., the average of the
matrices Q(i) is non-singular), then the matrix Q is invertible and its smallest
eigenvalue is lower-bounded as

    λ1(Q) ≥ (1/(1 + ρ²)) min{ λ2(L(G))αρ², λ̄min/2 }.   (5.6)

Here, we used the shorthand ρ := λ̄min/(4λmax) (see (5.4)).

Proof. See Section 5.8.2.

Prop. 5.1 and Prop. 5.2 can provide some guidance for the design choices
of GTVMin. According to the convergence analysis of gradient-based methods
in Chapter 4, the eigenvalue λ1(Q) should be close to λdn(Q) to ensure fast
convergence. This favours FL networks G whose eigenvalue λ2(L(G)) and
maximum node degree d(G)max result in a small ratio between the
upper bound (5.5) and the lower bound (5.6).
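Prop. 5.1 and Prop. 5.2 can be checked numerically on a toy FL network. The sketch below (hypothetical random local feature matrices and a four-node ring with unit edge weights, chosen only for illustration) assembles Q from (5.2) and compares its extreme eigenvalues with the bounds:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, alpha = 4, 3, 2.0

# ring graph with unit edge weights; every node has degree 2
Adj = np.zeros((n, n))
for i in range(n):
    Adj[i, (i + 1) % n] = Adj[(i + 1) % n, i] = 1.0
L = np.diag(Adj.sum(axis=1)) - Adj        # graph Laplacian L^(G)
d_max = 2

# local matrices Q^(i) = (1/m_i) (X^(i))^T X^(i) from random feature matrices
Qis = []
for _ in range(n):
    X = rng.standard_normal((20, d))
    Qis.append(X.T @ X / 20)

# assemble Q from (5.2): block-diagonal part plus alpha * (L kron I)
Q = np.zeros((n * d, n * d))
for i, Qi in enumerate(Qis):
    Q[i*d:(i+1)*d, i*d:(i+1)*d] = Qi
Q += alpha * np.kron(L, np.eye(d))

lam = np.linalg.eigvalsh(Q)
lam_max = max(np.linalg.eigvalsh(Qi)[-1] for Qi in Qis)   # lambda_max in (5.4)
upper = lam_max + alpha * 2 * d_max                       # upper bound (5.5)
```

Since the ring is connected and the average of the Q(i) is invertible here, Prop. 5.2 predicts λ1(Q) > 0, which the computed spectrum confirms.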


The bounds in (5.5) and (5.6) also depend on the GTVMin parameter
α. While these bounds might provide some guidance for the choice of α, its
effect on the convergence speed of the gradient step (5.3) is complicated. For
a fixed value of the learning rate in (5.3), using larger values for α might slow
down the convergence of (5.3) for some collection of local datasets but speed
up the convergence of (5.3) for another collection of local datasets.

Exercise. Study the convergence speed of (5.3) for two different collections of local
datasets assigned to the nodes of the FL network G with nodes V = {1, 2} and
(unit weight) edges E = {{1, 2}}. The first collection of local datasets results
in the local loss functions L1(w) := (w + 5)² and L2(w) := 1000(w + 5)².
The second collection of local datasets results in the local loss functions
L1(w) := 1000(w + 5)² and L2(w) := 1000(w − 5)². Use a fixed learning rate
η := 0.5 · 10⁻³ for the iteration (5.3).
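As a starting point for this exercise, the iteration (5.3) can be coded directly for scalar local models; the value α = 1 below is an arbitrary choice, since the exercise leaves α open:

```python
import numpy as np

def run_gd(c1, b1, c2, b2, alpha, eta=0.5e-3, num_steps=5000):
    """Gradient steps (5.3) for f(w) = c1*(w1-b1)^2 + c2*(w2-b2)^2 + alpha*(w1-w2)^2."""
    w = np.zeros(2)
    for _ in range(num_steps):
        grad = np.array([2 * c1 * (w[0] - b1) + 2 * alpha * (w[0] - w[1]),
                         2 * c2 * (w[1] - b2) + 2 * alpha * (w[1] - w[0])])
        w = w - eta * grad
    return w

# first collection: L1(w) = (w+5)^2, L2(w) = 1000*(w+5)^2
w_first = run_gd(1.0, -5.0, 1000.0, -5.0, alpha=1.0)
# second collection: L1(w) = 1000*(w+5)^2, L2(w) = 1000*(w-5)^2
w_second = run_gd(1000.0, -5.0, 1000.0, 5.0, alpha=1.0)
```

Plotting the objective value over the iterations for both collections reveals the very different convergence speeds caused by the different eigenvalues of Q.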

5.3 Message Passing Implementation

We now discuss in more detail the implementation of gradient-based methods


to solve the GTVMin instances with a differentiable objective function f (w).
One example for such an instance is (5.1). The core computational step of
gradient-based methods is the gradient step

    w(k+1) := w(k) − η∇f(w(k)).   (5.7)
The iterate w(k) contains the local model parameters w(i,k),

    w(k) =: stack{ w(i,k) }_{i=1}^{n}.   (5.8)

Inserting (5.1) into (5.7), we obtain the gradient step

    w(i,k+1) := w(i,k) + η[ (2/mi)(X(i))ᵀ( y(i) − X(i)w(i,k) )
                          + 2α Σ_{i′∈V\{i}} A_{i,i′}( w(i′,k) − w(i,k) ) ].   (5.9)

We denote the first term inside the brackets by (I) and the second term by (II).

The update (5.9) consists of two components, denoted (I) and (II). The
component (I) is nothing but the negative gradient −∇Li(w(i,k)) of the local
loss Li(w(i)) := (1/mi)∥y(i) − X(i)w(i)∥₂². Component (I) drives the local
model parameters w(i,k+1) towards the minimum of Li(·), i.e., towards a small
deviation between the labels y(i,r) and the predictions (w(i,k+1))ᵀx(i,r). Note that
we can rewrite the component (I) in (5.9) as

    (2/mi) Σ_{r=1}^{mi} x(i,r)( y(i,r) − (x(i,r))ᵀw(i,k) ).   (5.10)

The purpose of component (II) in (5.9) is to force the local model pa-
rameters to be similar across an edge {i, i′ } with large weight Ai,i′ . We
control the relative importance of (II) and (I) using the GTVMin parameter
α: Choosing a large value for α puts more emphasis on enforcing similar local
model parameters across the edges. Using a smaller α puts more emphasis
on learning local model parameters delivering accurate predictions (incurring
a small loss) on the local dataset.

Figure 5.1: At the beginning of iteration k, node i = 1 collects the current
local model parameters w(2,k) and w(3,k) from its neighbours. Then, it
computes the gradient step (5.9) to obtain the new local model parameters
w(1,k+1). These updated parameters are then used in the next iteration for
the local updates at the neighbours i = 2, 3.

The execution of the gradient step (5.9) requires only local information
at node i. Indeed, the update (5.9) at node i depends only on its current
model parameters w(i,k), the local loss function Li(·), the neighbours' model
parameters w(i′,k), for i′ ∈ N(i), and the corresponding edge weights A_{i,i′} (see
Figure 5.1). In particular, the update (5.9) does not depend on any properties
(model parameters or edge weights) of the FL network beyond the neighbours
N(i).
We obtain Algorithm 5.1 by repeating the gradient step (5.9), simulta-
neously for each node i ∈ V, until a stopping criterion is met. Algorithm
5.1 allows for potentially different learning rates ηk,i at different nodes i and
iterations k (see Section 4.3). It is important to note that Algorithm 5.1

Algorithm 5.1 FedGD for Local Linear Models

Input: FL network G; GTV parameter α; learning rates ηk,i; local dataset
D(i) = { (x(i,1), y(i,1)), . . . , (x(i,mi), y(i,mi)) } for each node i; some
stopping criterion.
Output: linear model parameters ŵ(i) for each node i ∈ V
Initialize: k := 0; w(i,0) := 0

1: while stopping criterion is not satisfied do


2: for all nodes i ∈ V (simultaneously) do
3: share local model parameters w(i,k) with neighbours i′ ∈ N (i)
4: update local model parameters via (5.9)
5: end for
6: increment iteration counter: k := k+1
7: end while
8: ŵ(i) := w(i,k) for all nodes i ∈ V

requires a synchronous (simultaneous) execution of the updates (5.9) at all
nodes i ∈ V [48]. Loosely speaking, all nodes i must use the same "global
clock" that maintains the current iteration counter k [49].
At the beginning of iteration k, each node i ∈ V sends their current model
parameters w(i,k) to their neighbours i′ ∈ N (i) . Then, each node i ∈ V updates

Figure 5.2: Algorithm 5.1 alternates between message passing across the
edges of the FL network and updates of local model parameters.

their model parameters according to (5.9) resulting in the updated model


parameters w(i,k+1) . As soon as these local updates are completed, the global
clock increments the counter k := k + 1 and triggers the next iteration to be
executed by all nodes.
The implementation of Algorithm 5.1 in real-world computational infras-
tructures might incur deviations from the exact synchronous execution of
(5.9). This deviation can be modelled as a perturbation of the gradient step
(5.7) and therefore analyzed using the concepts of Section 4.5 on perturbed
GD. Section 8.3 will also discuss the effect of imperfect computation in the
context of key requirements for trustworthy FL.
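A minimal simulation of Algorithm 5.1 (hypothetical synthetic local datasets on a two-node network, chosen only for illustration) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m, alpha, eta = 2, 2, 50, 1.0, 0.01

# synthetic local datasets; both nodes share similar underlying parameters
w_true = np.array([1.0, -2.0])
X = [rng.standard_normal((m, d)) for _ in range(n)]
y = [X[i] @ w_true + 0.01 * rng.standard_normal(m) for i in range(n)]
A = np.array([[0.0, 1.0], [1.0, 0.0]])      # unit-weight edge {1, 2}

w = [np.zeros(d) for _ in range(n)]
for k in range(3000):
    # message passing: every node shares w^(i,k) with its neighbours,
    # then all nodes update simultaneously via (5.9)
    w_old = [wi.copy() for wi in w]
    for i in range(n):
        grad_loss = (2 / m) * X[i].T @ (y[i] - X[i] @ w_old[i])          # component (I)
        coupling = 2 * alpha * sum(A[i, j] * (w_old[j] - w_old[i])
                                   for j in range(n) if j != i)          # component (II)
        w[i] = w_old[i] + eta * (grad_loss + coupling)
```

Since both local datasets are generated from similar parameters, the learnt local model parameters end up close to each other and to the underlying vector.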

5.4 FedSGD

For some applications, it might be infeasible to compute the sum (5.10)
exactly for each gradient step. For example, local datasets might consist of a
large number of data points which cannot be accessed quickly enough (e.g.,
when stored in the cloud). It might then be useful to approximate the sum (5.10) by

    (2/B) Σ_{r∈B} x(i,r)( y(i,r) − (x(i,r))ᵀw(i,k) ).   (5.11)

The approximation (5.11) uses a subset (a so-called batch)

    B = { ( x(r1), y(r1) ), . . . , ( x(rB), y(rB) ) }

of B randomly chosen data points from D(i). While (5.10) requires summing
over mi data points, the approximation (5.11) only requires summing over B
(typically B ≪ mi) data points.
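The batch approximation (5.11) is easy to sketch in isolation (a hypothetical local dataset, names chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
m, d, B = 10_000, 3, 32

X = rng.standard_normal((m, d))          # local feature matrix X^(i)
y = X @ np.array([1.0, 0.0, -1.0])       # local labels y^(i)
w = np.zeros(d)                          # current iterate w^(i,k)

# exact sum (5.10) over all m data points
full = (2 / m) * X.T @ (y - X @ w)

# batch approximation (5.11): B randomly chosen data points
batch = rng.choice(m, size=B, replace=False)
approx = (2 / B) * X[batch].T @ (y[batch] - X[batch] @ w)
```

The batch estimate fluctuates around the exact sum; averaging over fresh batches across iterations keeps the overall iteration on track.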
Inserting the approximation (5.11) into the gradient step (5.9) yields the
approximate gradient step

    w(i,k+1) := w(i,k) + η[ (2/B) Σ_{r∈B} x(i,r)( y(i,r) − (x(i,r))ᵀw(i,k) )
                          + 2α Σ_{i′∈V\{i}} A_{i,i′}( w(i′,k) − w(i,k) ) ].   (5.12)

We obtain Algorithm 5.2 from Algorithm 5.1 by replacing the gradient


step (5.9) with the approximation (5.12).

Algorithm 5.2 FedSGD for Local Linear Models
Input: FL network G; GTV parameter α; learning rates ηk,i; local datasets
D(i) = { (x(i,1), y(i,1)), . . . , (x(i,mi), y(i,mi)) } for each node i;
batch size B; some stopping criterion.
Output: linear model parameters ŵ(i) at each node i ∈ V
Initialize: k := 0; w(i,0) := 0

1: while stopping criterion is not satisfied do


2: for all nodes i ∈ V (simultaneously) do
3: share local model parameters w(i,k) with all neighbours i′ ∈ N (i)
4: draw fresh batch B (i) := {r1 , . . . , rB }
5: update local model parameters via (5.12)
6: end for
7: increment iteration counter k := k+1
8: end while
9: ŵ(i) := w(i,k) for all nodes i ∈ V

5.5 FedAvg

Consider an FL use case that requires learning model parameters ŵ ∈ Rᵈ for a
single (global) linear model from local datasets D(i), i = 1, . . . , n. How can we
learn ŵ without exchanging local datasets but only some model parameters
(or updates)?
One approach could be to use GTVMin (5.1) and choose α such that its
solutions ŵ(i) are identical for all nodes i ∈ V. We can interpret the local
model parameters delivered by GTVMin as local copies of the global model
parameters. According to our analysis of GTVMin in Chapter 3 (in particular,
the upper bound in Prop. 3.1), we can enforce this agreement by a sufficiently
large parameter α.
Note that the bound in Prop. 3.1 only applies if the FL network (used
in GTVMin) is connected. One example for a connected FL network is the
star as depicted in Figure 5.3. Here, we choose one node i = 1 as a centre
node that is connected by an edge with weight A1,i to the remaining nodes
i = 2, . . . , n. The star graph is distinct in the sense of using the minimum
number of edges required to connect n nodes [50].

Figure 5.3: Star graph G(star) with a centre node representing a server and
peripheral nodes representing clients that generate local datasets.
Instead of using GTVMin with a connected FL network and a large value
of α, we can also enforce identical local copies ŵ(i) via a constraint:

    ŵ ∈ argmin_{w∈C} Σ_{i∈V} (1/mi) ∥y(i) − X(i)w(i)∥₂²
    with C = { w = stack{w(i)}_{i=1}^{n} : w(i) = w(i′) for any i, i′ ∈ V }.   (5.13)

Note that the constraint set C is nothing but the subspace defined in (3.15).
The projection of a given collection of local model parameters w = stack{w(i)}
onto C is given by

    P_C(w) = ( vᵀ, . . . , vᵀ )ᵀ with v := (1/n) Σ_{i∈V} w(i).   (5.14)

We can solve (5.13) using projected GD from Chapter 4 (see Section 4.6).
The resulting projected gradient step for solving (5.13) is

    ŵ(i)k := w(i,k) + ηi,k (2/mi)(X(i))ᵀ( y(i) − X(i)w(i,k) )   (local gradient step),   (5.15)
    w(i,k+1) := (1/n) Σ_{i′∈V} ŵ(i′)k   (projection).   (5.16)

We can implement (5.15) and (5.16) conveniently in a server-client system
with each node i being a client:

• First, each node computes the update (5.15), i.e., a gradient step towards
a minimum of the local loss ∥y(i) − X(i)w∥₂².

• Second, each node i sends the result ŵ(i)k of its local gradient step to a
server.

• Finally, after receiving the updates ŵ(i)k from all nodes i ∈ V, the server
computes the projection step (5.16). This projection results in the new
local model parameters w(i,k+1) that are sent back to each client i.
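The server-client loop (5.15)-(5.16) can be sketched with hypothetical synthetic clients (the dataset and all parameter choices below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, m, eta = 3, 2, 40, 0.05

w_true = np.array([0.5, -1.0])
X = [rng.standard_normal((m, d)) for _ in range(n)]
y = [X[i] @ w_true for i in range(n)]       # noiseless local datasets

w_global = np.zeros(d)
for k in range(500):
    # clients: local gradient step (5.15), starting from the shared parameters
    updates = [w_global + eta * (2 / m) * X[i].T @ (y[i] - X[i] @ w_global)
               for i in range(n)]
    # server: projection (5.16) onto the consensus set, i.e., averaging
    w_global = sum(updates) / n
```

Because every client starts each round from the same shared parameters, the iteration amounts to GD on the sum of the local losses, and the global parameters converge to their common minimizer.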

The averaging step (5.16) might take much longer to execute than the
local update step (5.15). Indeed, (5.16) typically requires the transmission of local
model parameters from every client i ∈ V to a server or central computing
unit. Thus, after the client i ∈ V has computed the local gradient step (5.15),
it must wait until the server (i) has collected the updates ŵ(i)k from all clients
and (ii) sent back their average w(i,k+1) to i ∈ V.

Instead of using a computationally cheap gradient step (5.15)¹¹ and then
being forced to wait for receiving w(i,k+1) back from the server, a client
might "make better use" of its time. For example, the client i could execute
several local gradient steps (5.15) in order to make more progress towards
the optimum. Another option is to use a local minimization of Li(v) :=
(1/mi)∥y(i) − X(i)v∥₂² around w(i,k),

    ŵ(i)k := argmin_{v∈Rᵈ} (1/mi)∥y(i) − X(i)v∥₂² + (1/η)∥v − w(i,k)∥₂².   (5.17)

Note that (5.17) is nothing but the proximal operator of Li(v) (see (4.19)).
We obtain Algorithm 5.3 from (5.15)-(5.16) by replacing the gradient step
(5.15) with the local minimization (5.17).

The local minimization step (5.17) is closely related to ridge regression.
Indeed, (5.17) is obtained from ridge regression (2.25) by replacing the regularizer
R(v) := ∥v∥₂² with the regularizer R(v) := ∥v − w(i,k)∥₂².
2
As the notation in (5.17) indicates, the parameter η plays a role similar
to the learning rate of a gradient step (4.2). It controls the size of the
neighbourhood of w(i,k) over which (5.17) optimizes the local loss function
Li(·). Choosing a small η forces the update (5.17) to not move too far from
the given model parameters w(i,k).

¹¹ For a large local dataset, the local gradient step (5.15) might actually be computationally
expensive and should be replaced by an approximation, e.g., based on the stochastic
gradient approximation (5.11).
We can interpret the update (5.17) as a form of ERM (2.3) for linear
regression. Indeed,

    min_{v∈Rᵈ} (1/mi)∥y(i) − X(i)v∥₂² + (1/η)∥w(i,k) − v∥₂²
     = min_{v∈Rᵈ} (1/mi) Σ_{r=1}^{mi} ( y(i,r) − (x(i,r))ᵀv )² + (1/η)∥w(i,k) − v∥₂²
     = min_{v∈Rᵈ} (1/mi) [ Σ_{r=1}^{mi} ( y(i,r) − (x(i,r))ᵀv )² + (mi/η) Σ_{r=1}^{d} ( w(i,k)r − (e(r))ᵀv )² ].   (5.18)

Here, e(r) denotes the vector obtained from the r-th column of the identity
matrix I. The only difference between (5.18) and (2.3) is the presence of the
sample weights (mi/η).¹²
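The equivalence (5.18) can be verified numerically: the proximal update (5.17) is an ordinary least-squares problem on an augmented dataset in which each coordinate r contributes a pseudo data point (e(r), w(i,k)r) with sample weight mi/η. A sketch with hypothetical data (absorbing the weights as square roots into the augmented rows):

```python
import numpy as np

rng = np.random.default_rng(8)
m, d, eta = 30, 3, 0.5
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)
w_k = rng.standard_normal(d)            # current model parameters w^(i,k)

# closed-form solution of the proximal update (5.17): setting the gradient of
# (1/m)||y - Xv||^2 + (1/eta)||v - w_k||^2 to zero gives
# (X^T X / m + I/eta) v = X^T y / m + w_k / eta
v_prox = np.linalg.solve(X.T @ X / m + np.eye(d) / eta,
                         X.T @ y / m + w_k / eta)

# equivalent weighted ERM (5.18): append the d pseudo data points
# (e^(r), w_k[r]), each carrying sample weight m/eta
X_aug = np.vstack([X, np.sqrt(m / eta) * np.eye(d)])
y_aug = np.concatenate([y, np.sqrt(m / eta) * w_k])
v_erm, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

Both formulations agree, which is exactly the content of (5.18).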

5.6 Asynchronous Computation

The FL methods presented in Sections 5.2 - 5.5 assume a perfectly
synchronous mode of computation. As a case in point, consider the
synchronous mode of operation required by Algorithm 5.1. In particular, step
3 of Algorithm 5.1 has to be carried out simultaneously at all nodes. Only when
step 3 has been completed at all nodes can Algorithm 5.1 continue with
executing the update (5.9). However, what happens if some of the nodes i ∈ V
fail to complete step 3 of Algorithm 5.1?

¹² The .fit() method of the Python class sklearn.linear_model.LinearRegression
allows specifying sample weights via the parameter sample_weight [27].
Algorithm 5.3 FedAvg to train a linear model
Input: client list V
Server. Initialize. k := 0

1: while stopping criterion is not satisfied do


2: receive local model parameters w(i) from all clients i ∈ V
3: update global model parameters

    w[k] := (1/|V|) Σ_{i∈V} w(i).

4: send updated global model parameters w [k] to all clients i ∈ V


5: k := k+1
6: end while

Client. (at some node i ∈ V)

1: while stopping criterion is not satisfied do


2: receive global model parameters w [k] from server
3: update local model parameters by RERM (see (5.17))

    w(i) := argmin_{v∈Rᵈ} (1/mi) Σ_{r=1}^{mi} ( vᵀx(i,r) − y(i,r) )² + (1/η)∥v − w[k]∥₂².

4: send w(i) back to server


5: end while

5.7 Overview of Coding Assignment

Python Notebook. FLAlgorithms_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment builds on the coding assignment "ML Basics" (see
Section 2.7) and the coding assignment "FL Design Principle" (see Section
3.4). In particular, we consider an FL network G(FMI) with each node i ∈ V
being an FMI weather station. The node i ∈ V holds a local dataset D(i) that
consists of mi data points. Each data point is a temperature measurement,
taken at station i, and is characterized by d = 7 features x = (x1, . . . , x7)ᵀ
and a label y which is the temperature measurement itself. The features are
(normalized) values of the latitude and longitude of the FMI station as well
as the (normalized) year, month, day, hour, and minute when the temperature
has been measured.

The edges of G(FMI) are obtained using the Python function add_edges().
Each FMI station i is connected to its nearest neighbours i′, using the
Euclidean distance between the corresponding vectors (lat(i), lon(i))ᵀ ∈ R².
The number of neighbours is controlled by the input parameter numneighbors.
All edges {i, i′} ∈ E have the same edge weight Ai,i′ = 1.
Your tasks involve

• Construct the FL network G (FMI) as a [Link]() object.

• For each node i ∈ V (FMI station), add node attributes that store the
feature matrix (as numpy arrays) X(i) and label vector y(i) . The r-th
row of these holds the features x(i,r) and label y (i,r) , respectively, of the

r-th data point recorded for FMI station i in the csv file.

• For each node i ∈ V (FMI station), add a node attribute that stores
local model parameters w(i) (as a numpy array).

• Split each local dataset into a training set and a validation set.

• Learn local model parameters for a linear model by applying FedGD


and FedSGD to the local training sets. Diagnose the resulting model
parameters by computing the average (over all nodes) training error
and validation error.

• Try to develop some intuition for the effect of choosing different values
for the hyper-parameters such as GTVMin parameter α, learning rate
η or batch size in Algorithms 5.1 and 5.2.

5.8 Proofs

5.8.1 Proof of Proposition 5.1

The first inequality in (5.5) follows from well-known results on the eigenvalues
of a sum of symmetric matrices (see, e.g., [3, Thm 8.1.5]). In particular,

    λmax(Q) ≤ max_{i=1,...,n} λd( Q(i) ) + λmax( αL(G) ⊗ I ) = λmax + αλn( L(G) ),   (5.19)

where we used max_{i=1,...,n} λd(Q(i)) = λmax (see (5.4)).

The second inequality in (5.5) uses the following upper bound on the maximum eigenvalue $\lambda_n\big(L^{(G)}\big)$ of the Laplacian matrix:

$$\begin{aligned}
\lambda_n\big(L^{(G)}\big) &\overset{(a)}{=} \max_{v \in S^{(n-1)}} v^T L^{(G)} v \\
&\overset{(3.7)}{=} \max_{v \in S^{(n-1)}} \sum_{\{i,i'\} \in E} A_{i,i'} \big(v_i - v_{i'}\big)^2 \\
&\overset{(b)}{\le} \max_{v \in S^{(n-1)}} \sum_{\{i,i'\} \in E} A_{i,i'}\, 2\big(v_i^2 + v_{i'}^2\big) \\
&\overset{(c)}{=} \max_{v \in S^{(n-1)}} \sum_{i \in V} 2 v_i^2 \sum_{i' \in N(i)} A_{i,i'} \\
&\overset{(3.5)}{\le} \max_{v \in S^{(n-1)}} \sum_{i \in V} 2 v_i^2\, d_{\max}^{(G)} \\
&= 2 d_{\max}^{(G)}.
\end{aligned} \tag{5.20}$$

Here, step (a) uses the CFW of eigenvalues [3, Thm. 8.1.2.] and step (b) uses the inequality $(u+v)^2 \le 2(u^2 + v^2)$, valid for any $u, v \in \mathbb{R}$. For step (c) we use the identity $\sum_{i \in V} \sum_{i' \in N(i)} f(i,i') = \sum_{\{i,i'\} \in E} \big( f(i,i') + f(i',i) \big)$ (see Figure 5.4).

The bound (5.20) is essentially tight.^{13}


13
Consider an FL network being a chain (or path).


Figure 5.4: Illustration of step (c) in (5.20).

5.8.2 Proof of Proposition 5.2

Similar to the upper bound (5.20), we also start with the CFW for the eigenvalues of Q in (5.2). In particular,

$$\lambda_1 = \min_{\|w\|_2^2 = 1} w^T Q w. \tag{5.21}$$

We next analyze the RHS of (5.21) by partitioning the constraint set $\{w : \|w\|_2^2 = 1\}$ of (5.21) into two complementary regimes for the optimization variable $w = \mathrm{stack}\{w^{(i)}\}$. To define these two regimes, we use the orthogonal decomposition

$$w = \underbrace{P_{S} w}_{=: \overline{w}} + \underbrace{P_{S^{\perp}} w}_{=: \widetilde{w}} \quad \text{for the subspace } S \text{ in (3.15)}. \tag{5.22}$$

Explicit expressions for the orthogonal components $\overline{w}, \widetilde{w}$ are given by (3.16) and (3.17). In particular, the component $\overline{w}$ satisfies

$$\overline{w} = \big(c^T, \dots, c^T\big)^T \quad \text{with } c := \mathrm{avg}\big\{w^{(i)}\big\}_{i=1}^{n}. \tag{5.23}$$

Note that

$$\|w\|_2^2 = \|\overline{w}\|_2^2 + \|\widetilde{w}\|_2^2. \tag{5.24}$$

Regime I. This regime is obtained for $\|\widetilde{w}\|_2 \ge \rho \|\overline{w}\|_2$. Since $\|w\|_2^2 = 1$ and due to (5.24), we have

$$\|\widetilde{w}\|_2^2 \ge \rho^2/(1+\rho^2). \tag{5.25}$$

This implies, in turn, via (3.14) that

$$w^T Q w \overset{(5.2)}{\ge} \alpha\, w^T \big(L^{(G)} \otimes I\big) w \overset{(3.7),(3.14)}{\ge} \alpha \lambda_2\big(L^{(G)}\big) \|\widetilde{w}\|_2^2 \overset{(5.25)}{\ge} \alpha \lambda_2\big(L^{(G)}\big) \rho^2/(1+\rho^2). \tag{5.26}$$

Regime II. This regime is obtained for $\|\widetilde{w}\|_2 < \rho \|\overline{w}\|_2$. Here, we have $\|\overline{w}\|_2^2 > (1/\rho^2)\big(1 - \|\overline{w}\|_2^2\big)$ and, in turn,

$$n \|c\|_2^2 = \|\overline{w}\|_2^2 > 1/(1+\rho^2). \tag{5.27}$$

We next develop the RHS of (5.21) according to

$$\begin{aligned}
w^T Q w &\overset{(5.2)}{\ge} \sum_{i=1}^{n} \big(w^{(i)}\big)^T Q^{(i)} w^{(i)} \\
&\overset{(5.22)}{=} \sum_{i=1}^{n} \big(c + \widetilde{w}^{(i)}\big)^T Q^{(i)} \big(c + \widetilde{w}^{(i)}\big) \\
&\overset{(5.27)}{\ge} \|\overline{w}\|_2^2\, \underbrace{\lambda_1\Big((1/n)\sum_{i=1}^{n} Q^{(i)}\Big)}_{=: \bar{\lambda}_{\min}} + 2\sum_{i=1}^{n} \big(\widetilde{w}^{(i)}\big)^T Q^{(i)} c + \underbrace{\sum_{i=1}^{n} \big(\widetilde{w}^{(i)}\big)^T Q^{(i)} \widetilde{w}^{(i)}}_{\ge 0} \\
&\ge \|\overline{w}\|_2^2\, \bar{\lambda}_{\min} + 2\sum_{i=1}^{n} \big(\widetilde{w}^{(i)}\big)^T Q^{(i)} c.
\end{aligned} \tag{5.28}$$

To develop (5.28) further, we note that

$$\Big| 2\sum_{i=1}^{n} \big(\widetilde{w}^{(i)}\big)^T Q^{(i)} c \Big| \overset{(a)}{\le} 2 \lambda_{\max} \|\widetilde{w}\|_2 \|\overline{w}\|_2 \overset{\|\widetilde{w}\|_2 < \rho \|\overline{w}\|_2}{\le} 2 \lambda_{\max}\, \rho\, \|\overline{w}\|_2^2. \tag{5.29}$$

Here, step (a) follows from $\max_{\|y\|_2 = 1, \|x\|_2 = 1} y^T Q x = \lambda_{\max}$. Inserting (5.29) into (5.28) for $\rho = \bar{\lambda}_{\min}/(4\lambda_{\max})$,

$$w^T Q w \ge \|\overline{w}\|_2^2\, \bar{\lambda}_{\min}/2 \overset{(5.27)}{\ge} \big(1/(1+\rho^2)\big) \bar{\lambda}_{\min}/2. \tag{5.30}$$

For each w with $\|w\|_2^2 = 1$, either (5.26) or (5.30) must hold, which completes the proof.

6 Some Main Flavours of FL
Chapter 3 discussed GTVMin as a main design principle for FL algorithms.
GTVMin learns local model parameters that optimally balance the individual
local loss with their variation across the edges of an FL network. Chapter
5 discussed how to obtain practical FL algorithms. These algorithms solve
GTVMin using distributed optimization methods, such as those from Chapter
4.
This chapter discusses important special cases of GTVMin, known as
“main flavours”. These flavours arise from specific constructions of the local datasets,
choices of local models, measures for their variation and, last but not least, the
weighted edges in the FL network. We next briefly summarize the resulting
main flavours of FL discussed in the following sections.
Section 6.2 discusses single-model FL that learns model parameters of
a single (global) model from local datasets. This single-model flavour can
be obtained from GTVMin using a connected FL network with large edge
weights or, equivalently, a sufficiently large value for the GTVMin parameter.
Section 6.3 discusses how clustered FL is obtained from GTVMin over
FL networks with a clustering structure. CFL exploits the presence of
clusters (subsets of local datasets) which can be approximated using an i.i.d.
assumption. GTVMin captures these clusters if they are well-connected by
many (large weight) edges in the FL network.
Section 6.4 discusses horizontal FL which is obtained from GTVMin over
an FL network whose nodes carry different subsets of a single underlying global
dataset. Loosely speaking, horizontal FL involves local datasets characterized
by the same set of features but obtained from different data points from an

1
underlying dataset.
Section 6.5 discusses vertical FL which is obtained from GTVMin over an
FL network whose nodes carry the same data points but use different features to characterize them. As an example, consider the local datasets at different public institutions (tax authority, social insurance institute, supermarkets), which contain different information about the same underlying population (anybody who has a
Finnish social security number).
Section 6.6 shows how personalized FL can be obtained from GTVMin by
using specific measures for the total variation of local model parameters. For
example, using deep neural networks as local models, we might only use the
model parameters corresponding to the first few input layers to define the
total variation.

6.1 Learning Goals

After this chapter, you will know particular design choices for GTVMin
corresponding to some main flavours of FL:

• single-model FL

• CFL (generalization of clustering methods)

• horizontal FL (relation to semi-supervised learning)

• personalized FL/multi-task learning

• vertical FL.

6.2 Single-Model FL

Some FL use cases require training a single (global) model H from a decentralized collection of local datasets D(i), i = 1, . . . , n [15, 51]. In what follows, we assume that the model H is parametrized by a vector w ∈ R^d. Figure 6.1 depicts a server-client architecture for an iterative FL algorithm that generates a sequence of (global) model parameters w(k), k = 1, 2, . . .. After computing the new model parameters w(k+1), the server broadcasts them to the clients i ∈ V. During the next iteration, each client i uses the current global model parameters w(k) to compute a local update w(i,k) based on its local dataset D(i). The precise implementation of this local update step depends on the choice for the global model H (trained by the server). One example of such a local update has been discussed in Chapter 5 (see (5.17)).
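The server-client iteration described above can be sketched as follows. This is a minimal illustration, not the algorithm of Chapter 5: we assume that the local update is a single gradient step on a linear least-squares loss, and the local datasets below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 3, 2, 0.1   # nr. of clients, model dimension, learning rate (made up)

# Synthetic local datasets that share one underlying (global) linear model.
w_true = np.array([1.0, -2.0])
datasets = []
for _ in range(n):
    X = rng.normal(size=(20, d))
    datasets.append((X, X @ w_true))

w_global = np.zeros(d)
for k in range(200):
    # each client i: local update w^(i,k) via one gradient step on its local loss
    local_updates = []
    for X, y in datasets:
        grad = (2 / len(y)) * X.T @ (X @ w_global - y)
        local_updates.append(w_global - eta * grad)
    # server: aggregate the local updates into new global parameters w^(k+1)
    w_global = np.mean(local_updates, axis=0)
```

Since averaging the local gradient steps amounts to a gradient step on the average of the local losses, the iterate w_global converges to the shared parameters of the (noise-free) local models.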
Chapter 5 already hinted at an alternative to the server-based system in
Figure 6.1. Indeed, we might learn local model parameters w(i) for each client
i using a distributed optimization of GTVMin. We can force the resulting
model parameters w(i) to be (approximately) identical by using a connected
FL network and a sufficiently large GTVMin parameter α.
To minimize the computational complexity of the resulting single-model FL system, we prefer FL networks with a small number of edges, such as the star graph in Figure 5.3 [50]. However, to increase the robustness against node/link failures, we might prefer an FL network that has more edges. This “redundancy” helps to ensure that the FL network remains connected even after removing some of its edges.

Much like the server-based system from Figure 6.1, GTVMin-based methods using a star graph suffer from a single point of failure (the server in Figure 6.1


Figure 6.1: The operation of a server-based (or centralized) FL system during


iteration k. First, the server broadcasts the current global model parameters
w(k) to each client i ∈ V. Each client i then computes the update w(i,k) by
combining the previous model parameters w(k) (received from the server) and
its local dataset D(i) . The updates w(i,k) are then sent back to the server who
aggregates them to obtain the updated global model parameters w(k+1) .

or the centre node in Figure 5.3). Chapter 8 will discuss the robustness of
GTVMin-based FL systems in slightly more detail (see Section 8.3).

6.3 Clustered FL

Single-model FL systems require the local datasets to be well approximated


as i.i.d. realizations from a common underlying probability distribution.
However, requiring homogeneous local datasets, generated from the same
probability distribution, might be overly restrictive. Indeed, the local datasets
might be heterogeneous and need to be modelled using different probability distributions [18, 33].
CFL relaxes the requirement of a common probability distribution under-
lying all local datasets. Instead, we approximate subsets of local datasets as
i.i.d. realizations from a common probability distribution. In other words,
CFL assumes that local datasets form clusters. Each cluster C ⊆ V has a
cluster-specific probability distribution p(C) .
The idea of CFL is to pool the local datasets D(i) in the same cluster C to obtain a training set from which cluster-specific model parameters ŵ(C) are learnt. Each node i ∈ C then uses these learnt model parameters ŵ(C). A main challenge in CFL is that the cluster assignments of the local datasets are unknown in general.

To determine a cluster C, we could apply basic clustering methods, such as k-means or a Gaussian mixture model (GMM), to vector representations of the local datasets [6, Ch. 5]. We can obtain a vector representation for a local dataset D(i) via the learnt model parameters ŵ of some parametric ML model that is trained on D(i).
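The clustering of such vector representations can be sketched as follows. The representations z(i) are least-squares parameter estimates of local linear models, the tiny k-means loop is a stand-in for an off-the-shelf implementation, and all local datasets are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical clusters of nodes; within each cluster the local datasets
# share the same underlying linear model parameters.
w_clusters = [np.array([2.0, 0.0]), np.array([-2.0, 1.0])]
reps = []
for w_c in w_clusters:
    for _ in range(4):
        X = rng.normal(size=(30, 2))
        y = X @ w_c + 0.05 * rng.normal(size=30)
        # vector representation z^(i): least-squares parameter estimate on D^(i)
        reps.append(np.linalg.lstsq(X, y, rcond=None)[0])
Z = np.array(reps)

# minimal k-means (k = 2) on the vector representations
centroids = Z[[0, 4]].copy()
for _ in range(10):
    labels = np.argmin(((Z[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
    centroids = np.array([Z[labels == j].mean(axis=0) for j in range(2)])
```

Because the fitted parameter vectors concentrate around the cluster-specific w, k-means on the representations recovers the (here known, in general unknown) cluster assignment.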
We can also implement CFL via GTVMin with a suitably chosen FL

network. In particular, the FL network should contain many edges (with
large weight) between nodes in the same cluster and few edges (with small
weight) between nodes in different clusters. To fix ideas, consider the FL
network in Figure 6.2, which contains a cluster C = {1, 2, 3}.

Figure 6.2: The solutions of GTVMin (3.18) are local model parameters that
are approximately identical for all nodes in a tight-knit cluster C.

Chapter 3 discussed how the eigenvalues of the Laplacian matrix can be used to measure the connectivity of G. Similarly, we can measure the connectivity of a cluster C via the eigenvalue $\lambda_2\big(L^{(C)}\big)$ of the Laplacian matrix $L^{(C)}$ of the induced sub-graph G (C):^{14} The larger $\lambda_2\big(L^{(C)}\big)$, the better the connectivity among the nodes in C. While $\lambda_2\big(L^{(C)}\big)$ describes the intrinsic connectivity of a cluster C, we also need to characterize its connectivity with the other nodes in the FL network. To this end, we use the cluster boundary

$$|\partial C| := \sum_{\{i,i'\} \in \partial C} A_{i,i'} \quad \text{with } \partial C := \big\{ \{i,i'\} \in E : i \in C,\ i' \notin C \big\}. \tag{6.1}$$

Note that for a single-node cluster C = {i}, the cluster boundary coincides with the node degree, $|\partial C| = d^{(i)}$ (see (3.4)).
14
The graph G (C) consists of the nodes in C and the edges {i, i′ } ∈ E for i, i′ ∈ C.
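Both quantities can be computed numerically from the adjacency matrix. The following sketch uses a small hypothetical FL network, a fully connected triangle cluster attached to one extra node:

```python
import numpy as np

# Hypothetical FL network: triangle cluster C = {0,1,2} attached to node 3.
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

C = [0, 1, 2]
# Laplacian of the induced sub-graph G^(C)
A_C = A[np.ix_(C, C)]
L_C = np.diag(A_C.sum(axis=1)) - A_C
lambda2 = np.sort(np.linalg.eigvalsh(L_C))[1]

# cluster boundary |∂C|: total weight of the edges leaving C
out = [i for i in range(A.shape[0]) if i not in C]
boundary = A[np.ix_(C, out)].sum()
```

For the fully connected triangle, the Laplacian eigenvalues are 0, 3, 3, so λ2 = 3; a single unit-weight edge leaves the cluster, so |∂C| = 1.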

Intuitively, GTVMin tends to deliver (approximately) identical model parameters w(i) for the nodes i ∈ C if $\lambda_2\big(L^{(C)}\big)$ is large and the cluster boundary $|\partial C|$ is small. The following result makes this intuition more precise for the special case of GTVMin (5.1) for local linear models.

Proposition 6.1. Consider an FL network G which contains a cluster C of local datasets with labels y(i) and feature matrices X(i) related via

$$y^{(i)} = X^{(i)} w^{(C)} + \varepsilon^{(i)}, \quad \text{for all } i \in C. \tag{6.2}$$

We learn local model parameters ŵ(i) by solving GTVMin (5.1). If the cluster C is connected, the error component

$$\widetilde{w}^{(i)} := \widehat{w}^{(i)} - (1/|C|) \sum_{i' \in C} \widehat{w}^{(i')} \tag{6.3}$$

is upper bounded as

$$\sum_{i \in C} \big\|\widetilde{w}^{(i)}\big\|_2^2 \le \frac{1}{\alpha \lambda_2\big(L^{(C)}\big)} \Bigg( \sum_{i \in C} \frac{1}{m_i} \big\|\varepsilon^{(i)}\big\|_2^2 + \alpha |\partial C|\, 2\Big( \big\|w^{(C)}\big\|_2^2 + R^2 \Big) \Bigg). \tag{6.4}$$

Here, we used $R := \max_{i' \in V \setminus C} \big\|\widehat{w}^{(i')}\big\|_2$.

Proof. See Sec. 6.9.1.

The bound (6.4) depends on the cluster C (via the eigenvalue $\lambda_2\big(L^{(C)}\big)$ and the boundary $|\partial C|$) and on the GTVMin parameter α. Using a larger cluster C might result in a decreased eigenvalue $\lambda_2\big(L^{(C)}\big)$.^{15} According to (6.4), we


^{15} Consider an FL network (with uniform edge weights) that contains a fully connected cluster C which is connected via a single edge to another node i′ ∈ V \ C (see Figure 6.3). Compare the corresponding eigenvalues $\lambda_2\big(L^{(C)}\big)$ and $\lambda_2\big(L^{(C')}\big)$ of C and the enlarged cluster C′ := C ∪ {i′}.

should then increase α to maintain a small deviation w̃(i) of the learnt local
model parameters from their cluster-wise average. Thus, increasing α in (3.18)
enforces its solutions to be approximately constant over increasingly larger
subsets (clusters) of nodes (see Figure 6.3).
For a connected FL network G, using a sufficiently large α for GTVMin
results in learnt model parameters that are approximately identical for all
nodes V in G. The resulting approximation error is quantified by Prop. 6.1
for the extreme case where the entire FL network forms a single cluster, i.e.,
C = V. Trivially, the cluster boundary is then equal to 0 and the bound (6.4)
specializes to (3.31).
We hasten to add that the bound (6.4) only applies to local datasets that conform with the probabilistic model (6.2). In particular, it assumes that all cluster nodes i ∈ C have identical model parameters w(C). Trivially, this is no restriction if we allow for arbitrary error terms ε(i) in the probabilistic model (6.2). However, as soon as we place additional assumptions on these error terms (such as being realizations of i.i.d. Gaussian RVs), we should verify their validity using principled statistical tests [32, 52]. Finally, we might replace $\|w^{(C)}\|_2^2$ in (6.4) with an upper bound on this quantity.

Figure 6.3: The solutions of GTVMin (3.18) become increasingly clustered for increasing α (panels, from left to right: small α, moderate α, large α).

6.4 Horizontal FL

Horizontal FL uses local datasets D(i) , for i ∈ V, that contain data points
characterized by the same features [53]. As illustrated in Figure 6.4, we can
think of each local dataset D(i) as being a subset (or batch) of an underlying
global dataset

$$D^{(\mathrm{global})} := \big\{ \big(x^{(1)}, y^{(1)}\big), \dots, \big(x^{(m)}, y^{(m)}\big) \big\}.$$

In particular, local dataset D(i) is constituted by the data points of D(global) with indices in $\{r_1, \dots, r_{m_i}\}$,

$$D^{(i)} := \big\{ \big(x^{(r_1)}, y^{(r_1)}\big), \dots, \big(x^{(r_{m_i})}, y^{(r_{m_i})}\big) \big\}.$$

We can interpret horizontal FL as a generalization of semi-supervised


learning (SSL) [54]: For the local datasets at some nodes i ∈ U, we might not have access to the label values of the data points. Still, we can use the features of the data points to construct (the weighted edges of) the FL network. To implement SSL, we can solve GTVMin using the trivial loss function Li(w(i)) = 0 for each unlabelled node i ∈ U. Solving GTVMin delivers model parameters w(i)

for all nodes i ∈ V (including the unlabelled nodes in U). GTVMin-based methods combine the information in the labelled local datasets D(i), for i ∈ V \ U, with their connections (via the edges in G) to the nodes in U (see Figure 6.5).

Figure 6.4: Horizontal FL uses the same features to characterize data points in different local datasets. Different local datasets are constituted by different subsets of an underlying global dataset.

Figure 6.5: Horizontal FL includes SSL as a special case: There is a subset of nodes U for which the local datasets do not contain labels. We can take this into account by using the trivial loss function Li(·) = 0 for each node i ∈ U. However, we can still use the features in D(i) to construct an FL network G.
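The SSL special case can be sketched for scalar local models: with the trivial loss on unlabelled nodes, GTVMin reduces to a small linear system. The chain network and labels below are made up for illustration:

```python
import numpy as np

# Chain FL network 0-1-2-3-4 with scalar local models; only the two end
# nodes are labelled (all numbers are made up for illustration).
n, alpha = 5, 1.0
A = np.diag(np.ones(n - 1), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian of the chain

labeled = np.array([1.0, 0.0, 0.0, 0.0, 1.0])   # indicator of labelled nodes
y = np.array([0.0, 0.0, 0.0, 0.0, 4.0])         # labels (unlabelled entries unused)

# GTVMin with loss (w_i - y_i)^2 on labelled nodes and the trivial loss 0
# otherwise: minimize sum_i labeled_i (w_i - y_i)^2 + alpha * w^T L w.
# Setting the gradient to zero yields the linear system below.
w = np.linalg.solve(np.diag(labeled) + alpha * L, labeled * y)
```

The solution interpolates the two labels monotonically along the chain, i.e., the unlabelled nodes inherit values between those of their labelled neighbours.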

6.5 Vertical FL

Vertical FL uses local datasets that are constituted by the same (identical!) data points. However, each local dataset uses a different choice of features to characterize these data points [55]. Formally, vertical FL applications revolve around an underlying global dataset

$$D^{(\mathrm{global})} := \big\{ \big(x^{(1)}, y^{(1)}\big), \dots, \big(x^{(m)}, y^{(m)}\big) \big\}.$$

Each data point in the global dataset is characterized by d′ features, $x^{(r)} = \big(x_1^{(r)}, \dots, x_{d'}^{(r)}\big)^T$. The global dataset can only be accessed indirectly via local datasets that use different subsets of the feature vectors x(r) (see Figure 6.6). Formally, the local dataset D(i) contains the pairs $\big(x^{(i,r)}, y^{(i,r)}\big)$, for r = 1, . . . , m. The labels y(i,r) are identical to the labels in the global dataset, y(i,r) = y(r). The feature vectors x(i,r) are obtained from a subset $F^{(i)} := \{j_1, \dots, j_d\}$ of the original d′ features in x(r),

$$x^{(i,r)} = \big(x_{j_1}^{(r)}, \dots, x_{j_d}^{(r)}\big)^T.$$


Figure 6.6: Vertical FL uses local datasets that are derived from the same data
points. The local datasets differ in the choice of features used to characterize
the common data points.
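The vertical splitting of a global dataset into feature subsets F(i) can be sketched as follows; the feature subsets and the data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d_prime = 6, 5

# Hypothetical global dataset: m data points with d' features and a label each.
X_global = rng.normal(size=(m, d_prime))
y_global = rng.normal(size=m)

# Vertical FL: every node keeps all m data points but only its feature
# subset F^(i); the subsets below are an arbitrary choice for illustration.
feature_subsets = {0: [0, 1], 1: [2, 3], 2: [4]}
local_datasets = {i: (X_global[:, F], y_global)
                  for i, F in feature_subsets.items()}
```

Note that, in contrast to horizontal FL, every local dataset contains the full label vector of the global dataset; only the feature columns differ.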

6.6 Personalized Federated Learning

Consider GTVMin (3.18) for learning local model parameters ŵ(i) for each local dataset D(i). If the value of α in (3.18) is not too large, the local model parameters ŵ(i) can be different for each i ∈ V. However, the local model parameters are still coupled via the GTV term in (3.18).
For some FL use-cases we should use different coupling strengths for
different components of the local model parameters. For example, if local
models are deep ANNs, we might enforce the parameters of the input layers to be identical, while the parameters of the deeper layers might be different for each local dataset.
The partial parameter sharing for local models can be implemented in many different ways [56, Sec. 4.3.]. One way is to use a choice for the GTV penalty function that is different from $\phi = \|w^{(i)} - w^{(i')}\|_2^2$, which is the main choice in our course. In particular, we could construct the penalty function as a combination of two terms,

$$\phi\big(w^{(i)} - w^{(i')}\big) := \alpha^{(1)} \phi^{(1)}\big(w^{(i)} - w^{(i')}\big) + \alpha^{(2)} \phi^{(2)}\big(w^{(i)} - w^{(i')}\big). \tag{6.5}$$

Each of the components $\phi^{(1)}, \phi^{(2)}$ measures a different part of the variation $w^{(i)} - w^{(i')}$ of the local model parameters at connected nodes {i, i′} ∈ E.
Moreover, we might use different regularization strengths α(1) and α(2)
for different penalty components in (6.5) to enforce different subsets of the
model parameters to be clustered with different granularity (cluster size). For
local models being deep ANNs, we might want to enforce the low-level layers
(closer to the input) to have the same model parameters (weights and bias terms),
while deeper (closer to the output) layers can have different model parameters.
Figure 6.7 illustrates this setting for local models constituted by ANNs with
a single hidden layer. Yet another technique for partial sharing of model
parameters is to train a hyper-model which, in turn, is used to initialize the
training of local models [57].
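A two-component penalty of the form (6.5) can be sketched for parameter vectors whose leading entries play the role of the shared (input-layer) weights; the split point and the values of α(1), α(2) below are arbitrary choices for illustration:

```python
import numpy as np

def gtv_penalty(w_i, w_j, n_shared, alpha1=10.0, alpha2=0.1):
    """Two-component penalty in the spirit of (6.5): the first n_shared
    entries (e.g. input-layer weights) are coupled strongly, the remaining
    (personalized) entries only weakly."""
    shared = np.sum((w_i[:n_shared] - w_j[:n_shared]) ** 2)
    personal = np.sum((w_i[n_shared:] - w_j[n_shared:]) ** 2)
    return alpha1 * shared + alpha2 * personal

# two nodes with identical "input-layer" entries but a different last entry
w_a = np.array([1.0, 1.0, 5.0])
w_b = np.array([1.0, 1.0, -5.0])
penalty = gtv_penalty(w_a, w_b, n_shared=2)
```

With α(1) ≫ α(2), GTVMin with this penalty drives the shared entries toward a common value across the FL network, while the personalized entries are only loosely coupled.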

Figure 6.7: Personalized FL with local models being ANNs with one hidden layer. The ANN h(i) is parametrized by the vector $w^{(i)} = \big( (u^{(i)})^T, (v^{(i)})^T \big)^T$, with the parameters u(i) of the hidden layer and the parameters v(i) of the output layer. We couple the training of the u(i) via GTVMin using the discrepancy measure $\phi = \|u^{(i)} - u^{(i')}\|_2^2$.

6.7 Few-Shot Learning

Some ML applications involve data points belonging to a large number of different categories. A prime example is the detection of a specific object in a given image [58, 59]. Here, the object category is the label y ∈ Y of a data point (image). The label space Y is constituted by the possible object categories and, in turn, can be quite large. Moreover, for some categories we might have only a few example images in the training set.

Few-shot learning leverages similarities between object categories in order to accurately detect objects for which only very few (or even no) training examples are available. One principled approach to few-shot learning is via GTVMin. To this end, we define an FL network G whose nodes i ∈ V represent the elements of the label space Y. The edge weights of G encode known similarities between different object categories.

Each node i of G represents a specific object category. Solving GTVMin delivers, for each node i, model parameters ŵ(i) for an object detector (tailored to the i-th object category).

6.8 Overview of Coding Assignment

Python Notebook. FLFlavors_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment builds on the coding assignment “ML Basics” (see
Section 2.7) and the coding assignment “FL Design Principle” (see Section
3.4). In particular, we consider an FL network G (FMI) with each node i ∈ V
being an FMI weather station. The node i ∈ V holds a local dataset
 
$$D^{(i)} = \big\{ \big(x^{(i,1)}, y^{(i,1)}\big), \dots, \big(x^{(i,m_i)}, y^{(i,m_i)}\big) \big\}$$

that consists of $m_i$ data points. Each data point is a temperature measurement, taken at station i, and characterized by d = 7 features $x = (x_1, \dots, x_7)^T$ and a label y, which is the temperature measurement itself. The features are (normalized) values of the latitude and longitude of the FMI station as well as the (normalized) year, month, day, hour, and minute during which this measurement has been taken.
The edges of G (FMI) are obtained using the Python function add_edges().
Each FMI station i is connected to its nearest neighbours i′, using the Euclidean distance between the corresponding vectors (lat(i), lon(i))^T ∈ R².
The number of neighbours is controlled by the input parameter numneighbors.
All edges {i, i′ } ∈ E have the same edge weight Ai,i′ = 1.
Your tasks involve:

• Construct the FL network G (FMI) as a [Link]() object.

• For each node i ∈ V (FMI station), add node attributes that store the feature matrix X(i) and label vector y(i) (as numpy arrays). The r-th row of these holds the features x(i,r) and label y(i,r), respectively, of the r-th data point recorded for FMI station i in the csv file. Add another node attribute that stores the index of the cluster to which this node belongs.

• Cluster the nodes of G (FMI) using the Python class [Link], which implements the k-means clustering method (see [6, Ch. 8]). This clustering method requires a vector representation z(i) for each local dataset D(i). You should try out different choices for the vector representation z(i) of a local dataset D(i):

  – z(i) ∈ R², with entries being the latitude and longitude of the FMI station i ∈ V.

  – z(i) obtained from stacking the parameters of a GMM fitted to the feature vectors x(i,1), . . . , x(i,mi).

  – $z^{(i)} = \big(u^{(1)}_i, u^{(2)}_i, \dots, u^{(k)}_i\big)^T$, given by the i-th entries of the eigenvectors u(1), u(2), . . . , u(k) of the Laplacian matrix L(FMI). Note that u(j) is an eigenvector corresponding to the j-th smallest eigenvalue λj of L(FMI) (see (3.8)).

• For each clustering obtained in the previous task, compute the cluster-wise average temperature

$$\hat{y}^{(C)} = \frac{1}{\sum_{i' \in C} m_{i'}} \sum_{i' \in C} \sum_{r=1}^{m_{i'}} y^{(i',r)}$$

for each cluster C. Let C(i) denote the cluster to which node i belongs; then $\hat{y}^{(i,r)} := \hat{y}^{(C(i))}$ is a prediction for the actual temperature y(i,r). Compute the average squared error loss

$$\frac{1}{\sum_{i=1}^{n} m_i} \sum_{i \in V} \sum_{r=1}^{m_i} \big( y^{(i,r)} - \hat{y}^{(i,r)} \big)^2.$$
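The cluster-wise average temperature and the resulting average squared error from the last task can be sketched on a toy example; the local label vectors and the cluster assignment below are made up:

```python
import numpy as np

# Hypothetical local label (temperature) vectors and a given clustering.
y_local = {0: np.array([1.0, 3.0]), 1: np.array([2.0]),
           2: np.array([10.0]), 3: np.array([12.0, 14.0])}
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1}

# cluster-wise average temperature y_hat^(C)
clusters = set(cluster_of.values())
y_hat = {C: np.concatenate([y_local[i] for i in cluster_of
                            if cluster_of[i] == C]).mean() for C in clusters}

# average squared error of predicting y_hat^(C(i)) at every node i
m_total = sum(len(y) for y in y_local.values())
err = sum(((y_local[i] - y_hat[cluster_of[i]]) ** 2).sum()
          for i in y_local) / m_total
```

On this toy example, the cluster averages are 2.0 and 12.0 and the average squared error is 10/6.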

6.9 Proofs

6.9.1 Proof of Proposition 6.1

To verify (6.4), we follow a similar argument as used in the proof (see Section 3.5.1) of Prop. 3.1.

First, we decompose the objective function f(w) in (5.1) as follows:

$$f(w) = \underbrace{\sum_{i \in C} (1/m_i)\big\|y^{(i)} - X^{(i)} w^{(i)}\big\|_2^2 + \alpha \Big[ \sum_{\substack{\{i,i'\} \in E \\ i,i' \in C}} A_{i,i'} \big\|w^{(i)} - w^{(i')}\big\|_2^2 + \sum_{\{i,i'\} \in \partial C} A_{i,i'} \big\|w^{(i)} - w^{(i')}\big\|_2^2 \Big]}_{=: f'(w)} + f''(w). \tag{6.6}$$


Note that only the first component f′ depends on the local model parameters w(i) of the cluster nodes i ∈ C. Let us introduce the shorthand f′(w(i)) for the function obtained from f′(w) by varying w(i), i ∈ C, while fixing w(i′) := ŵ(i′) for i′ ∉ C.

We obtain the bound (6.4) via a proof by contradiction: If (6.4) does not hold, the local model parameters w(i) := w(C), for i ∈ C, result in a smaller value f′(w(i)) < f′(ŵ(i)) than the choice ŵ(i), for i ∈ C. This would contradict the fact that ŵ(i) is a solution to (5.1).

First, note that

$$\begin{aligned}
f'\big(w^{(i)}\big) &= \sum_{i \in C} (1/m_i)\big\|y^{(i)} - X^{(i)} w^{(C)}\big\|_2^2 + \alpha \Big[ \sum_{\substack{\{i,i'\} \in E \\ i,i' \in C}} A_{i,i'} \big\|w^{(C)} - w^{(C)}\big\|_2^2 + \sum_{\substack{\{i,i'\} \in E \\ i \in C, i' \notin C}} A_{i,i'} \big\|w^{(C)} - \widehat{w}^{(i')}\big\|_2^2 \Big] \\
&\overset{(6.2)}{=} \sum_{i \in C} (1/m_i)\big\|\varepsilon^{(i)}\big\|_2^2 + \alpha \sum_{\substack{\{i,i'\} \in E \\ i \in C, i' \notin C}} A_{i,i'} \big\|w^{(C)} - \widehat{w}^{(i')}\big\|_2^2 \\
&\overset{(a)}{\le} \sum_{i \in C} (1/m_i)\big\|\varepsilon^{(i)}\big\|_2^2 + \alpha \sum_{\substack{\{i,i'\} \in E \\ i \in C, i' \notin C}} A_{i,i'}\, 2\Big( \big\|w^{(C)}\big\|_2^2 + \big\|\widehat{w}^{(i')}\big\|_2^2 \Big) \\
&\le \sum_{i \in C} (1/m_i)\big\|\varepsilon^{(i)}\big\|_2^2 + \alpha |\partial C|\, 2\Big( \big\|w^{(C)}\big\|_2^2 + R^2 \Big).
\end{aligned} \tag{6.7}$$

Step (a) uses the inequality $\|u + v\|_2^2 \le 2\big(\|u\|_2^2 + \|v\|_2^2\big)$, which is valid for any two vectors $u, v \in \mathbb{R}^d$.

On the other hand,

$$f'\big(\widehat{w}^{(i)}\big) \ge \alpha \sum_{\substack{\{i,i'\} \in E \\ i,i' \in C}} A_{i,i'} \underbrace{\big\|\widehat{w}^{(i)} - \widehat{w}^{(i')}\big\|_2^2}_{\overset{(6.3)}{=}\ \|\widetilde{w}^{(i)} - \widetilde{w}^{(i')}\|_2^2} \overset{(3.14)}{\ge} \alpha \lambda_2\big(L^{(C)}\big) \sum_{i \in C} \big\|\widetilde{w}^{(i)}\big\|_2^2. \tag{6.8}$$

If the bound (6.4) did not hold, then by (6.8) and (6.7) we would obtain $f'\big(\widehat{w}^{(i)}\big) > f'\big(w^{(i)}\big)$, which contradicts the fact that ŵ(i) solves (5.1).
 

7 Graph Learning
Chapter 3 discussed GTVMin as a main design principle for FL algorithms. In particular, Chapter 5 discussed FL algorithms that arise from applying optimization methods, such as the gradient-based methods from Chapter 4, to solve GTVMin.
The computational and statistical properties of these algorithms depend
crucially on the properties of the underlying FL network. For example, the
amount of computation and communication required by FL systems typically
grows with the number of edges in the FL network. Moreover, the connectivity
of the FL network steers the pooling of local datasets into clusters that share
common model parameters.
In some applications, domain expertise can guide the choice for the FL
network. However, it might also be useful to learn the FL network in a more
data-driven fashion. This chapter discusses methods that learn an FL network
solely from a given collection of local datasets and corresponding local loss
functions.
The outline of this chapter is as follows: Section 7.2 discusses how the
computational and statistical properties of Algorithm 5.1 from Chapter 5 can
guide the construction of the FL network. Section 7.3 presents some ideas
for how to measure the discrepancy (lack of similarity) between two local
datasets. The measure for the discrepancy is an important design choice for
the graph learning methods discussed in Section 7.4. We formulate these
methods as the optimization of edge weights, given the discrepancy measure for any pair of local datasets. The formulation as an optimization problem allows us to include connectivity constraints, such as a minimum value for each node degree.

7.1 Learning Goals

After completing this chapter, you will

• know how the computational and statistical properties of GTVMin-


based methods depend on the structure of the FL network,

• know some measures for (dis-)similarity (or discrepancy) between local


datasets,

• be able to learn a graph from given pairwise similarities and structural constraints, such as a prescribed (maximum) node degree.

7.2 Edges are a Design Choice

Consider the GTVMin instance (3.20) for learning the model parameters
of a local linear model for each local dataset D(i) . To solve (3.20), we use
Algorithm 5.1 as a message passing implementation of the basic gradient
step (5.3). Note that GTVMin (3.20) is defined for a given FL network G.
Therefore, the choice for G is critical for the statistical and computational
properties of Algorithm 5.1.
Statistical Properties. The statistical properties of Algorithm 5.1 can
be assessed via a probabilistic model for the local datasets. One important
example for such a probabilistic model is the clustering assumption (6.2)
of CFL (see Section 3.3.1). For CFL, we would like to learn similar model
parameters for nodes in the same cluster.

According to Prop. 6.1, the GTVMin solutions will be approximately constant over C if $\lambda_2\big(L^{(C)}\big)$ is large and the cluster boundary $|\partial C|$ is small. Here, $\lambda_2\big(L^{(C)}\big)$ denotes the smallest non-zero eigenvalue of the Laplacian matrix associated with the induced sub-graph G (C).


Roughly speaking, $\lambda_2\big(L^{(C)}\big)$ will be larger if there are more edges connecting the nodes in C. This informal statement can be made precise using a celebrated result from spectral graph theory, known as Cheeger’s inequality [35, Ch. 21]. Alternatively, we can analyse $\lambda_2\big(L^{(C)}\big)$ by interpreting (or approximating) the induced sub-graph G (C) as a typical realization of an Erdős-Rényi (ER) random graph.^{16}

The ER model for G (C) postulates that two nodes i, i′ ∈ C are connected by an edge with probability $p_e$. The presence of an edge between a given pair of nodes does not depend on the presence of an edge between any other pair of nodes (“edges are i.i.d.”).

Since we model the presence of edges as realizations of RVs, the node degrees d(i) as well as the eigenvalue $\lambda_2\big(L^{(C)}\big)$ also become realizations of RVs. The expected node degree is given by $E\big\{d^{(i)}\big\} = p_e\big(|C| - 1\big)$. With high probability, a realization of an ER graph has maximum node degree

$$d_{\max}^{(G)} \approx E\big\{ d^{(i)} \big\} = p_e\big(|C| - 1\big). \tag{7.1}$$

Thus, increasing the ER parameter $p_e$ results in a larger node degree (i.e., a higher connectivity) of G (C).

We can approximate $\lambda_2\big(L^{(C)}\big)$ by the second-smallest eigenvalue $\lambda_2\big(\bar{L}\big)$ of the expected Laplacian matrix $\bar{L} := E\big\{L^{(C)}\big\} = |C| p_e I - p_e \mathbf{1}\mathbf{1}^T$. A simple calculation reveals that $\lambda_2\big(\bar{L}\big) = |C| p_e$. Thus, we have the approximation

$$\lambda_2\big(L^{(C)}\big) \approx \lambda_2\big(\bar{L}\big) = |C| p_e \overset{(7.1)}{\approx} d_{\max}^{(G)}. \tag{7.2}$$

^{16} This approximation is particularly useful if the FL network G itself is (close to) a typical realization of an ER graph.

The precise quantification of the approximation error in (7.2) is beyond the


scope of this course. We refer the reader to relevant literature on the theory
of random graphs [60, 61].
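The approximation (7.2) can be checked numerically by comparing λ2 of one ER realization with |C| p_e; the values of |C| and p_e below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
size_C, p_e = 300, 0.5   # cluster size |C| and ER edge probability (arbitrary)

# one realization of an ER graph on the cluster C
upper = np.triu(rng.random((size_C, size_C)) < p_e, 1).astype(float)
A = upper + upper.T
L = np.diag(A.sum(axis=1)) - A
lam2 = np.sort(np.linalg.eigvalsh(L))[1]

# λ2 of the expected Laplacian  E{L} = |C| p_e I − p_e 1 1^T  equals |C| p_e
approx = size_C * p_e
```

For a dense ER graph, the realized λ2 lands below, but on the order of, the approximation |C| p_e; the gap shrinks (relatively) as |C| grows.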
Computational Properties. The computational complexity of Algo-
rithm 5.1 depends on the amount of computation required by a single iteration
of its steps (3) - (4). Clearly, this “per-iteration” complexity of Algorithm
5.1 increases with increasing node degrees d(i) . Indeed, the step (3) requires
to communicate local model parameters across each edge of the FL network.
This communication can be implemented by different communication channels
such as a short-range wireless link or an optical fibre cable [62, 63].
To summarize, using an FL network with smaller d(i) translates into a
smaller amount of computation and communication needed during a single
iteration of Algorithm 5.1. Trivially, the per-iteration complexity of Algorithm
5.1 is minimized by d(i) = 0, i.e., an empty FL network without any edges
(E = ∅). However, the overall computational complexity of Algorithm 5.1
also depends on the number of iterations required to achieve an approximate
solution to GTVMin (3.20).
According to (4.6), the convergence speed of the gradient steps (5.9) used in Algorithm 5.1 depends on the condition number $\lambda_{nd}(Q)/\lambda_1(Q)$ of the matrix Q in (5.2), where $\lambda_{nd}(Q) = \lambda_{\max}(Q)$. Algorithm 5.1 tends to require fewer iterations when the ratio between the largest and smallest eigenvalues of Q is small (closer to 1). The condition number of Q tends to be smaller for a smaller ratio between the maximum node degree $d_{\max}^{(G)}$ and the eigenvalue $\lambda_2\big(L^{(G)}\big)$ (see (5.5) and (5.6)).


To summarize, the per-iteration complexity of Algorithm 5.1 increases with the node degrees d(i) of the FL network G. On the other hand, the number of iterations required by Algorithm 5.1 decreases with increasing $\lambda_2\big(L^{(G)}\big)$. Some recent work studies graph constructions that maximize $\lambda_2\big(L^{(G)}\big)$ for a given (prescribed) maximum node degree $d_{\max}^{(G)} = \max_{i \in V} d^{(i)}$ [64, 65].

Spectral graph theory also provides upper bounds on $\lambda_2\big(L^{(G)}\big)$ in terms of the node degrees [35, 36, 66]. These upper bounds can be used as a baseline for practical constructions of the FL network: If some construction results in a value $\lambda_2\big(L^{(G)}\big)$ close to the upper bound, there is little benefit in trying to further improve the construction (in the sense of achieving a higher $\lambda_2\big(L^{(G)}\big)$). The next result provides one example of such an upper bound.

Proposition 7.1. Consider an FL network G with n nodes and associated Laplacian matrix L(G). Then, $\lambda_2\big(L^{(G)}\big)$ cannot exceed the node degree d(i) of any node by more than a factor n/(n − 1). In other words,

$$\lambda_2\big(L^{(G)}\big) \le \frac{n}{n-1}\, d^{(i)}, \quad \text{for every } i = 1, \dots, n. \tag{7.3}$$

Proof. Combine the CFW [3, Thm. 8.1.2.] with the evaluation of the quadratic form $w^T L^{(G)} w$ for the specific vector $\widetilde{w} = \big( \widetilde{w}^{(1)} = -(1/n), \dots, \widetilde{w}^{(i)} = 1 - (1/n), \dots, \widetilde{w}^{(n)} = -(1/n) \big)^T$. Note that $\widetilde{w}^T \mathbf{1} = 0$, i.e., this vector is orthogonal to the subspace (3.15) (for the special case of local models with dimension d = 1).

Alternative (and potentially tighter) upper bounds on λ2(L^{(G)}) can be found in the graph theory literature [34, 35, 61, 67].

Figure 7.1: The per-iteration complexity and the number of iterations required by Algorithm 5.1 depend on the number of edges in the underlying FL network in different manners. (Plot: number of edges on the horizontal axis; per-iteration complexity and number of iterations on the vertical axis.)

7.3 Measuring (Dis-)Similarity Between Datasets

The whole idea of GTVMin is to enforce similar model parameters at two nodes i, i′ that are connected by an edge {i, i′} with a (relatively) large edge weight A_{i,i′}. In general, the edges (and their weights) of the FL network are a design choice. However, placing an edge between two nodes i, i′ will typically be a useful design choice only if (the underlying processes that generate) the local datasets D^{(i)}, D^{(i′)} have similar properties. We next discuss different approaches to measuring the similarity or, equivalently, the discrepancy (the lack of similarity) between two local datasets.
The first approach is based on a probabilistic model, i.e., we interpret the local dataset D^{(i)} as the realization of RVs with some parametrized probability distribution p^{(i)}(D^{(i)}; w^{(i)}). We can then measure the discrepancy between D^{(i)} and D^{(i′)} via the Euclidean distance ∥w^{(i)} − w^{(i′)}∥2 between the parameters w^{(i)}, w^{(i′)} of the probability distributions.
In general, we do not know the parameters of the probability distribution p^{(i)}(D^{(i)}; w^{(i)}) underlying a local dataset.^{17} We might still be able to estimate these parameters, e.g., using variants of maximum likelihood [6, Ch. 3]. Given the estimates ŵ^{(i)}, ŵ^{(i′)} for the model parameters, we can then compute the discrepancy measure d^{(i,i′)} := ∥ŵ^{(i)} − ŵ^{(i′)}∥2.
Example. Consider local datasets, each consisting of a single number y^{(i)}, which is modelled as a noisy observation y^{(i)} = w^{(i)} + n^{(i)} with n^{(i)} ∼ N(0, 1). The maximum likelihood estimator for w^{(i)} is then ŵ^{(i)} = y^{(i)} [30, 68] and, in turn, the resulting discrepancy measure is d^{(i,i′)} := |y^{(i)} − y^{(i′)}| [69].
Example. Consider nodes i ∈ V that generate local datasets D^{(i)}. Each data point in D^{(i)} is characterized by a label value from a finite set Y^{(i)}. We can then measure the similarity between i, i′ by the fraction of data points in D^{(i)} ∪ D^{(i′)} with label values in Y^{(i)} ∩ Y^{(i′)} [70].
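The label-overlap similarity from this example can be sketched in a few lines; the function name and the list-based encoding of a local dataset's labels are illustrative, not from the book.

```python
# Illustrative sketch: label-overlap similarity between two local datasets,
# each given as a list of label values.
def label_overlap_similarity(labels_i, labels_j):
    """Fraction of data points in the union of two local datasets whose
    label value lies in the intersection of the two label sets."""
    shared = set(labels_i) & set(labels_j)      # Y^(i) ∩ Y^(i')
    pooled = list(labels_i) + list(labels_j)    # all points of D^(i) ∪ D^(i')
    return sum(label in shared for label in pooled) / len(pooled)

print(label_overlap_similarity([0, 1, 1, 2], [1, 2, 2, 3]))  # → 0.75
```

A similarity of 1 means the two nodes only use labels from the shared label set; a similarity of 0 means their label sets are disjoint.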
Example. Consider local datasets D^{(i)} constituted by images of handwritten digits 0, 1, …, 9. We model a local dataset using a hierarchical probabilistic model: Each node i ∈ V is assigned a deterministic but unknown distribution α^{(i)} = (α_1^{(i)}, …, α_9^{(i)}). The entry α_j^{(i)} is the fraction of images at node i that show digit j. We interpret the labels y^{(i,1)}, …, y^{(i,m_i)} as realizations of i.i.d. RVs, with values in {0, 1, …, 9} and distributed according to α^{(i)}. The features are interpreted as realizations of RVs with a conditional distribution p(x|y) which is the same for all nodes i ∈ V. We can then estimate the dis-similarity between nodes i, i′ via the distance between (estimates of) the parameters α^{(i)} and α^{(i′)}.

^{17} One exception is when we generate the local dataset by drawing i.i.d. realizations from p^{(i)}(D^{(i)}; w^{(i)}).
The above discrepancy measure construction (using estimates for the parameters of a probabilistic model) is a special case of a more generic two-step approach:

• First, we determine a vector representation z^{(i)} ∈ R^m for each local dataset D^{(i)} [6, 71].

• Second, we measure the discrepancy between local datasets D^{(i)}, D^{(i′)} via the distance between the corresponding representation vectors z^{(i)}, z^{(i′)} ∈ R^m.

One example for constructing a vector representation z^{(i)} ∈ R^m of a local dataset D^{(i)} is, under an i.i.d. assumption, to use the maximum likelihood estimate ŵ^{(i)} for the parameters of a probabilistic model p(x, y; w^{(i)}).
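A minimal sketch of this two-step approach, assuming each local dataset is given as a list of scalar labels and represented by the ML estimates of mean and standard deviation (the choice of representation and the function names are illustrative):

```python
import numpy as np

# Two-step approach (sketch): (1) map each local dataset to a vector
# representation, (2) measure discrepancy as the Euclidean distance
# between the representation vectors.
def representation(local_labels):
    y = np.asarray(local_labels, dtype=float)
    return np.array([y.mean(), y.std()])   # z^(i) in R^2

def discrepancy(z_i, z_j):
    return np.linalg.norm(z_i - z_j)       # d^(i,i') = ||z^(i) - z^(i')||_2

z1 = representation([0.9, 1.1, 1.0])
z2 = representation([3.0, 3.2, 2.8])
print(round(float(discrepancy(z1, z2)), 3))
```

Any other representation (e.g., the ML estimate ŵ^{(i)} mentioned above) can be plugged into the same two-step scheme.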

Let us next discuss a construction for the vector representation z^{(i)} ∈ R^m that is motivated by SGD (see Section 5.4). In particular, we could define the discrepancy between two local datasets by interpreting them as two possible batches used by SGD to train a model. If these two batches have similar statistical properties (in the sense of being useful for the model training), then their corresponding gradient approximations (5.11) should be aligned. This suggests using the gradient ∇f(w′) of the average loss

    f(w) := (1/|D^{(i)}|) Σ_{(x,y)∈D^{(i)}} L((x, y), h^{(w)})

as a vector representation z^{(i)} for D^{(i)}. The properties of this representation depend on the choice of the parametrized model, i.e., on how h^{(w)} depends precisely on the parameters w, and on the model parameters w′ at which the gradient is evaluated.
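A hedged sketch of such a gradient-based representation for a linear model h^{(w)}(x) = w^T x with squared error loss; the common reference point w_ref and the function name are assumptions made for illustration:

```python
import numpy as np

# Sketch (illustrative names): vector representation z^(i) of a local dataset
# (X_i, y_i) given by the gradient of the average squared-error loss of a
# linear model, evaluated at a common reference point w_ref.
def gradient_representation(X_i, y_i, w_ref):
    m = X_i.shape[0]
    residual = X_i @ w_ref - y_i           # prediction errors at w_ref
    return (2.0 / m) * (X_i.T @ residual)  # gradient of (1/m)||y - X w||^2

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(20, 3)), rng.normal(size=20)
z1 = gradient_representation(X1, y1, w_ref=np.zeros(3))
print(z1.shape)   # one representation vector per local dataset
```

Two local datasets with aligned gradient representations would induce similar SGD updates when used as batches.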
We can also use an autoencoder [71, Ch. 14] to learn a vector representation for a local dataset. In particular, we feed the local dataset into an encoder network which has been trained jointly with a decoder network on some learning task. Figure 7.2 illustrates a generic autoencoder setup.

Figure 7.2: An autoencoder consists of an encoder h(·), mapping the input D^{(i)} to a latent vector z^{(i)} ∈ R^m, and a decoder h*(·) which tries to reconstruct the input (delivering D̂^{(i)}) as accurately as possible. The encoder and the decoder are trained jointly by minimizing some quantitative measure (a loss) of the reconstruction error (see [6, Ch. 9]). When using a local dataset as input, we can use the latent vector as its vector representation.

7.4 Graph Learning Methods



Assume we have constructed a useful measure d^{(i,i′)} ∈ R+ for the discrepancy between any two local datasets D^{(i)}, D^{(i′)}. We could then construct an FL network by connecting each node i with its nearest neighbours. The nearest neighbours of i are those other nodes i′ ∈ V \ {i} with the smallest discrepancy d^{(i,i′)}. We next discuss an alternative to the nearest-neighbour graph construction. This alternative approach formulates graph learning as a constrained linear optimization problem.
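The nearest-neighbour construction can be sketched as follows, assuming the discrepancies d^{(i,i′)} are collected in a symmetric matrix (the function name is illustrative):

```python
import numpy as np

# Sketch: connect each node to the k other nodes with the smallest pairwise
# discrepancy d^(i,i'). D is a symmetric discrepancy matrix; the edge set of
# the resulting FL network is returned as a sorted list of undirected pairs.
def nearest_neighbour_edges(D, k):
    n = D.shape[0]
    edges = set()
    for i in range(n):
        order = np.argsort(D[i])                 # nodes sorted by discrepancy
        neighbours = [j for j in order if j != i][:k]
        for j in neighbours:
            edges.add((min(i, j), max(i, j)))    # undirected edge {i, j}
    return sorted(edges)

D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
print(nearest_neighbour_edges(D, k=1))  # → [(0, 1), (1, 2)]
```

Note that the resulting node degrees can exceed k, since a node may also be chosen as a nearest neighbour by other nodes.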
Let us measure the usefulness of a given choice for the edge weights A_{i,i′} ∈ R+ via

    Σ_{i,i′∈V} A_{i,i′} d^{(i,i′)} .   (7.4)

The objective function (7.4) penalizes having a large edge weight A_{i,i′} between two nodes i, i′ with a large discrepancy d^{(i,i′)}. Note that the objective function (7.4) is minimized by the trivial choice A_{i,i′} = 0, i.e., the empty graph without any edges (E = ∅).
As discussed in Section 7.2, the FL network should have a sufficient amount of edges in order to ensure that the GTVMin solutions are useful model parameters. Indeed, the desired pooling effect of GTVMin requires the eigenvalue λ2(L^{(G)}) to be sufficiently large. Ensuring a large λ2(L^{(G)}) requires, in turn, that the FL network G contains a sufficiently large number of edges and correspondingly large node degrees (see (7.2)).

We can enforce the presence of edges (with positive weight) by adding constraints to (7.4). For example, we might require

    A_{i,i} = 0 for all i ∈ V ,  Σ_{i′≠i} A_{i,i′} = d_max^{(G)} for all i ∈ V ,  A_{i,i′} ∈ [0, 1] for all i, i′ ∈ V .   (7.5)

The constraints (7.5) require that each node i is connected with other nodes using a total edge weight (weighted node degree) of Σ_{i′≠i} A_{i,i′} = d_max^{(G)}.

Combining the constraints (7.5) with the objective function (7.4) results in the following graph learning principle,

    Â_{i,i′} ∈ argmin_{A_{i,i′} = A_{i′,i}} Σ_{i,i′∈V} A_{i,i′} d^{(i,i′)}   (7.6)
    subject to  A_{i,i′} ∈ [0, 1] for all i, i′ ∈ V,
                A_{i,i} = 0 for all i ∈ V,
                Σ_{i′≠i} A_{i,i′} = d_max^{(G)} for all i ∈ V.

Note that (7.6) is an instance of the constrained quadratic minimization problem (4.15). We can therefore use (variants of) projected GD from Section 4.6 to compute (approximate) solutions to (7.6).
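A hedged sketch of a projected-GD-style solver for a problem of the form (7.6): gradient steps on the linear objective, followed by an approximate projection onto the constraint set. The alternating projection below is a heuristic approximation of the exact Euclidean projection, and all names are illustrative:

```python
import numpy as np

# Approximate projection onto {A symmetric, zero diagonal, entries in [0,1],
# off-diagonal row sums equal to d_max}, via a few alternating correction
# steps (heuristic, not an exact Euclidean projection).
def approx_project(A, d_max, n_rounds=50):
    n = A.shape[0]
    for _ in range(n_rounds):
        A = (A + A.T) / 2                     # enforce symmetry
        np.fill_diagonal(A, 0.0)              # no self-loops
        A = np.clip(A, 0.0, 1.0)              # box constraint [0, 1]
        row_sums = A.sum(axis=1, keepdims=True)
        shift = (d_max - row_sums) / (n - 1)  # additive row-sum correction
        A = A + shift * (1 - np.eye(n))
    return A

def learn_graph(D, d_max, eta=0.05, n_iter=100):
    n = D.shape[0]
    A = approx_project(np.ones((n, n)), d_max)
    for _ in range(n_iter):
        A = approx_project(A - eta * D, d_max)  # gradient of objective is D
    return A

# Two well-matched pairs (0,1) and (2,3); cross pairs are dissimilar.
D = np.array([[0.0, 1.0, 4.0, 4.0],
              [1.0, 0.0, 4.0, 4.0],
              [4.0, 4.0, 0.0, 1.0],
              [4.0, 4.0, 1.0, 0.0]])
A = learn_graph(D, d_max=1.0)
print(np.round(A, 2))   # large weights on the low-discrepancy pairs
```

For this toy discrepancy matrix, the learnt weights concentrate on the pairs with smallest discrepancy while still satisfying the row-sum (node-degree) constraint.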
The first constraint in (7.6) requires each edge weight to belong to the interval [0, 1]. The second constraint prohibits any self-loops in the resulting FL network. Indeed, adding self-loops to an FL network has no effect on the resulting GTVMin-based method (see (3.18)). The last constraint of the learning principle (7.6) enforces a regular FL network: each node i has the same weighted node degree d^{(i)} = Σ_{i′≠i} A_{i,i′} = d_max^{(G)}.

For some FL applications it might be detrimental to insist on identical node degrees of the FL network. Instead, we might prefer other structural properties, such as a small total number of edges or the presence of a few hub nodes (with exceptionally large node degree) [13, 69].
We can enforce an upper bound E_max on the total number of edges by modifying the last constraint in (7.6),

    Â_{i,i′} ∈ argmin_{A_{i,i′} = A_{i′,i}} Σ_{i,i′∈V} A_{i,i′} d^{(i,i′)}   (7.7)
    subject to  A_{i,i′} ∈ [0, 1] for all i, i′ ∈ V,
                A_{i,i} = 0 for all i ∈ V,
                Σ_{i,i′∈V} A_{i,i′} = E_max .

The problem has a closed-form solution, as explained in [69]: it is obtained by placing the edges between those pairs i, i′ ∈ V that result in the smallest discrepancy d^{(i,i′)}. However, it might still be useful to solve (7.7) via iterative optimization methods, such as the gradient-based methods discussed in Chapter 4. These methods can be implemented in a fully distributed fashion as message passing over an underlying communication network [72]. This communication network might be significantly different from the learnt FL network.^{18}

^{18} Can you think of FL application domains where the connectivity (e.g., via short-range wireless links) of two clients i, i′ ∈ V might also reflect the pair-wise similarities between the probability distributions of the local datasets D^{(i)}, D^{(i′)}?
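The closed-form solution described above (placing edges between the pairs with the smallest discrepancy) can be sketched as follows; the function name and the use of unit edge weights are illustrative:

```python
import numpy as np
from itertools import combinations

# Sketch of the closed-form construction: place (unit-weight) edges between
# the E_max node pairs with the smallest discrepancy d^(i,i').
def budget_graph(D, E_max):
    n = D.shape[0]
    pairs = sorted(combinations(range(n), 2), key=lambda p: D[p])
    A = np.zeros((n, n))
    for i, j in pairs[:E_max]:
        A[i, j] = A[j, i] = 1.0          # symmetric weights, no self-loops
    return A

D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
print(budget_graph(D, E_max=2))
```

With a budget of two edges, the pairs {0, 1} and {1, 2} (discrepancies 1 and 2) are selected, while the high-discrepancy pair {0, 2} is left unconnected.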

7.5 Overview of Coding Assignment

Python Notebook. GraphLearning_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment builds on the previous coding assignments “FL


Algorithms” (see Section 5.7) and “FL Main Flavors” (see Section 6.8). In
particular, we again construct an FL network G with nodes i ∈ V representing
FMI stations. The node i ∈ V holds a local dataset D^{(i)} that consists of m_i data points. Each of these data points is a temperature measurement, taken at station i, and is characterized by d = 7 features x = (x_1, …, x_7)^T and a label y which is the temperature measurement itself. The features are (normalized) values of the latitude and longitude of the FMI station as well as the (normalized) year, month, day, hour, and minute when the temperature has been measured. We then apply some of the graph learning methods from this chapter to construct the edge set of G.
Similar to the coding assignment “FL Main Flavours”, you should connect
each node i to its nearest neighbours using a Python function add_edges().
This function has two input parameters: (i) a networkx graph graph_FMI
that stores an FL network whose node attributes will be used to determine
its edges and (ii) a parameter node_degree that determines the number of
nearest neighbours connected to each node.
The nearest neighbours are determined based on the discrepancy measure d^{(i,i′)} := ∥z^{(i)} − z^{(i′)}∥2 between two nodes i, i′ ∈ V. This measure requires, for each node i ∈ V, some representation vector z^{(i)} which is stored as a node attribute in the input parameter graph_FMI. Your tasks include trying out the following constructions for the representation vectors:

• I: z^{(i)} = (1/m_i) Σ_{r=1}^{m_i} y^{(i,r)} (the average temperature at FMI station i),

• II: the vector z^{(i)} consisting of the parameters of a GMM that is fit to the feature vectors x^{(i,1)}, …, x^{(i,m_i)} in the local dataset D^{(i)},

• III: the vector z^{(i)} := ∇L_i(ŵ), being the gradient of the average squared error loss L_i(w) = (1/m_i) Σ_{r=1}^{m_i} (y^{(i,r)} − x^{(i,r)T} w)^2. Carefully note that the function L_i(w) is determined by the local dataset D^{(i)}. The gradient ∇L_i(ŵ) is evaluated at the model parameters ŵ (of a linear model) obtained from minimizing the average squared error loss over all local datasets,

    ŵ = argmin_{w∈R^7} (1/Σ_{i=1}^n m_i) Σ_{i=1}^n Σ_{r=1}^{m_i} (y^{(i,r)} − x^{(i,r)T} w)^2 .

Note that for each of the above discrepancy measures, we obtain (via connecting nearest neighbours) a potentially different FL network G^{(I)}, G^{(II)}, G^{(III)}. Carefully note that these FL networks also depend on the choice of the input parameter node_degree.
Using each of the above constructions for the FL network, learn local model
parameters for a linear model by applying FedGD Algorithm 5.1. To this
end, you have to split each local dataset into a training set and a validation
set (Algorithm 5.1 is applied to the training sets). Diagnose the resulting
model parameters by computing the average (over all nodes) training error
and validation error.
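A hedged sketch of what the add_edges() helper might look like; the node attribute name "z" used to store the representation vector is an assumption, not fixed by the assignment text:

```python
import networkx as nx
import numpy as np

# Sketch of add_edges(): connect each node of graph_FMI to its node_degree
# nearest neighbours, where distance is the Euclidean distance between the
# representation vectors stored in the (assumed) node attribute "z".
def add_edges(graph_FMI, node_degree):
    nodes = list(graph_FMI.nodes)
    z = {i: np.asarray(graph_FMI.nodes[i]["z"]) for i in nodes}
    for i in nodes:
        dists = sorted(
            (np.linalg.norm(z[i] - z[j]), j) for j in nodes if j != i
        )
        for _, j in dists[:node_degree]:
            graph_FMI.add_edge(i, j)
    return graph_FMI

G = nx.Graph()
G.add_nodes_from([(0, {"z": [0.0]}), (1, {"z": [1.0]}), (2, {"z": [5.0]})])
add_edges(G, node_degree=1)
print(sorted(G.edges))
```

Swapping the node attribute for construction I, II, or III above yields the corresponding FL networks G^{(I)}, G^{(II)}, G^{(III)}.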

8 Trustworthy FL
The Story So Far. We have introduced GTVMin as a main design principle
for FL in Chapter 3. Chapter 5 applied the gradient-based methods from
Chapter 4 to solve GTVMin, resulting in practical FL systems. Our focus
has been on computational and statistical properties of these FL systems. In
this and the following chapters, we shift the focus from technical properties
to the trustworthiness of FL systems.
Section 8.2 reviews key requirements for trustworthy AI, which includes
FL systems, that have been put forward by the European Union [73, 74].
We will also discuss how these requirements guide the design choices for
GTVMin-based methods. Our focus will be on the three design criteria for
trustworthy FL: privacy, robustness, and explainability. This chapter covers robustness and explainability; the leakage and protection of privacy in FL systems is the subject of Chapter 9.
Section 8.3 discusses the robustness of FL systems against perturbations
of local datasets and computations. A special type of perturbation is the
intentional modification or poisoning of local datasets (see Chapter 10).
Section 8.4 introduces a measure for the (subjective) explainability of the
personalized models trained by FL systems.

8.1 Learning Goals

After completing this chapter, you will

• know some key requirements for trustworthiness of FL

• be familiar with quantitative measures of robustness and explainability

• have some intuition about how robustness, privacy, and transparency guide design choices for the local models, loss functions, and FL network in GTVMin

8.2 Seven Key Requirements by the EU

As part of their artificial intelligence (AI) strategy, the European Commission


set up the High-Level Expert Group on Artificial Intelligence (AI HLEG)
in 2018. This group put forward seven key requirements for trustworthy AI [73, 74]. We next discuss, in a step-by-step fashion, how these requirements guide the design choices for GTVMin-based FL systems.

8.2.1 KR1 - Human Agency and Oversight.

“..AI systems should support human autonomy and decision-making, as pre-


scribed by the principle of respect for human autonomy. This requires that AI
systems should both act as enablers to a democratic, flourishing and equitable
society by supporting the user’s agency and foster fundamental rights, and
allow for human oversight...” [74, p.15]
Human Dignity. Learning personalized model parameters for recommender systems can be used to boost addiction or to enable widespread emotional manipulation, in the extreme case resulting in genocide [75–77]. KR1 might rule out certain design choices
for the labels of data points. In particular, we might not use the mental and
psychological characteristics of a user as the label. We might also avoid loss
functions that can be used to train predictors of psychological characteristics.
Using personalized ML models to predict user preferences for products or
susceptibility towards propaganda is also referred to as “micro-targeting” [78].

Simple is Good. Human oversight can be facilitated by relying on simple local models. Examples include linear models with few features or decision trees with small depth. However, we are unaware of a widely accepted definition of when a model is simple. Loosely speaking, a simple model results in a learnt hypothesis that allows humans to understand how the features of a data point relate to the prediction h(x). This notion of simplicity is closely related to the concept of explainability, which we discuss in more detail in Section 8.4.
Continuous Monitoring. In its simplest form, GTVMin-based methods
involve a single training phase, i.e., learning local model parameters by
solving GTVMin. However, this approach is only useful if the data can
be well approximated by an i.i.d. assumption. In particular, this approach
works only if the statistical properties of local datasets do not change over
time. For many FL applications, this assumption is unrealistic (consider a
social network which is exposed to constant change of memberships and user
behaviour). It is then important to continuously compute a validation error
on a timely validation set which is then used, in turn, to diagnose the overall
FL system (see [6, Sec. 6.6]).

8.2.2 KR2 - Technical Robustness and Safety.

“...Technical robustness requires that AI systems be developed with a preventative approach to risks and in a manner such that they reliably behave as intended while minimising unintentional and unexpected harm, and preventing unacceptable harm...” [74, p.16].
Practical FL systems are obtained by implementing FL algorithms in physical computers (e.g., a wireless network of devices with limited computational capability). These computers typically incur imperfections, such as a
temporary lack of connectivity or a device shutting down due to running out
of battery. Moreover, also the data generation processes might be subject to
perturbations such as statistical anomalies (outliers) or intentional modifica-
tions (see Chapter 10). Section 8.3 studies in some detail the robustness of
GTVMin-based systems against different perturbations of data sources and
imperfections of computational infrastructure.

8.2.3 KR3 - Privacy and Data Governance.

“..privacy, a fundamental right particularly affected by AI systems. Prevention


of harm to privacy also necessitates adequate data governance that covers the
quality and integrity of the data used...” [74, p.17].
So far, our discussion of FL systems used a high level of abstraction, relying on mathematical concepts such as GTVMin and its building blocks. However, to obtain actual FL systems we need to implement these mathematical concepts on given physical hardware. These implementations might incur deviations from the (idealized) GTVMin formulation (3.18) and the gradient-based methods (such as Algorithm 5.1) used to solve it. For example, we might have access only to quantized label values, resulting in a quantization error. Moreover, the local datasets might deviate significantly from typical realizations of i.i.d. RVs (such a situation is referred to as statistical bias [79, Sec. 3.3.]).
Also, we might not be able to access all features of a data point due to data processing regulations [80–82]. For example, the general data protection regulation (GDPR) includes a data minimization principle which requires that we only access those features of data points (representing users) that are relevant for predicting the label.
Data Governance. FL systems might use local datasets that are gener-
ated by human users, i.e., personal data. Whenever personal data is used,
special care must be dedicated towards data protection regulations [82]. It
might then be useful (or even compulsory) to designate a data protection
officer and conduct a data protection impact assessment [74].
Privacy. The operation of an FL system must not violate the fundamental human right to privacy [83]. Chapter 9 discusses quantitative measures and methods for privacy protection in GTVMin-based FL systems.

8.2.4 KR4 - Transparency.

Traceability. This key requirement includes the documentation of design


choices (and underlying business models) for a GTVMin-based FL system.
This includes the source for the local datasets, the local models, the local
loss function as well as the construction of the FL network. Moreover, the
documentation should also cover the details of the implemented optimization method used to solve GTVMin. This documentation might also require periodically storing the current model parameters along with a time-stamp (“logging”).
Communication. Depending on the use case, FL systems need to
communicate the capabilities and limitations to their end users (e.g., of a
digital health app running on a smartphone). For example, we can indicate
a measure of uncertainty about the predictions delivered by the trained
local models. Such an uncertainty measure can be obtained naturally from a probabilistic model for the data, e.g., the conditional variance of the label y, given the features x of a random data point. Another example of an uncertainty measure is the validation error of a trained local model.
Explainability. The transparency of a GTVMin-based FL system also
includes the explainability of the trained local models. Section 8.4 discusses
quantitative measures for the subjective explainability of a learnt hypothesis.
We will also use this measure as a regularizer to obtain GTVMin-based
systems that guarantee subjective explainability “by design”.

8.2.5 KR5 - Diversity, Non-Discrimination and Fairness.

“...we must enable inclusion and diversity throughout the entire AI system’s
life cycle...this also entails ensuring equal access through inclusive design
processes as well as equal treatment.” [74, p.18].
The local datasets used for the training of local models should be carefully selected so as not to reinforce existing discrimination. In a health-care application, there might be significantly more training data for patients of a specific gender, resulting in models that perform best for that specific gender at the cost of worse performance for the minority [79, Sec. 3.3.].
important for ML methods used to determine credit score and, in turn, if a
loan should be granted or not [84]. Here, we must ensure that ML methods
do not discriminate customers based on ethnicity or race. To this end, we
could augment data points via modifying any features that mainly reflect the
ethnicity or race of a customer (see Figure 8.1).

Figure 8.1: We can improve the fairness of an ML method by augmenting the training set with perturbations of an irrelevant feature, such as the gender of a person for whom we want to predict the adequate compensation as the label. (Plot: gender x on the horizontal axis, compensation y on the vertical axis, showing the original training set D, the augmented data points, and the learnt hypothesis h(x).)

8.2.6 KR6 - Societal and Environmental Well-Being.

“...Sustainability and ecological responsibility of AI systems should be encour-


aged, and research should be fostered into AI solutions addressing areas of
global concern, such as for instance the Sustainable Development Goals.” [74,
p.19].
Society. FL systems might be used to deliver personalized recommen-
dations to users within a social media application (social network). These
recommendations might be (fake) news used to boost polarization and, in the
extreme case, social unrest [85].
Environment. Chapter 5 discussed FL algorithms that were obtained by applying gradient-based methods to solve GTVMin. These methods require computational resources to compute local updates for the model parameters and to share them across the edges of the FL network. Computation and communication always require energy, which should be generated in an environmentally friendly fashion [86].

8.2.7 KR7 - Accountability.

“...mechanisms be put in place to ensure responsibility and accountability


for AI systems and their outcomes, both before and after their development,
deployment and use.” [74, p. 19].

8.3 Technical Robustness of FL Systems

Consider a GTVMin-based FL system that aims at training a single (global)


linear model in a distributed fashion from a collection of local datasets D^{(i)}, for i = 1, …, n. As discussed in Section 6.2, this single-model FL setting amounts to using GTVMin (3.20) over a connected FL network with a sufficiently large choice for α.
To ensure KR2 we need to understand the effect of perturbations on
a GTVMin-based FL system. These perturbations might be intentional or
non-intentional and affect the local datasets used to evaluate the loss of local
model parameters or the computational infrastructure used to implement a
GTVMin-based method (see Chapter 5). We next explain how to use some
of the theoretic tools from previous chapters to quantify the robustness of
GTVMin-based FL systems.

8.3.1 Sensitivity Analysis

As pointed out in Chapter 3, GTVMin (3.20) can be rewritten as the minimization of a quadratic function,

    min_{w = stack{w^{(i)}}}  w^T Q w + q^T w .   (8.1)

The matrix Q and the vector q are determined by the feature matrices X^{(i)} and label vectors y^{(i)} at the nodes i ∈ V (see (2.27)). We next study the sensitivity of (the solutions of) (8.1) towards external perturbations of the label vector.^{19}

Consider an additive perturbation ỹ^{(i)} := y^{(i)} + ε^{(i)} of the label vector y^{(i)}. Using the perturbed label vector ỹ^{(i)} results also in a “perturbation” of GTVMin (8.1),

    min_{w = stack{w^{(i)}}}  w^T Q w + q^T w + n^T w + c .   (8.2)

An inspection of (2.27) yields that n = ((ε^{(1)})^T X^{(1)}, …, (ε^{(n)})^T X^{(n)})^T. The next result provides an upper bound on the deviation between the solutions of (8.1) and (8.2).

Proposition 8.1. Consider the GTVMin instance (8.1) for learning the local model parameters of a linear model for each node i ∈ V of an FL network G. We assume that the FL network is connected, i.e., λ2(L^{(G)}) > 0, and that the local datasets are such that λ̄min > 0 (see (5.4)). Then, the deviation between the solution ŵ^{(i)} of (8.1) and the solution w̃^{(i)} of the perturbed problem (8.2) is upper bounded as

    Σ_{i=1}^n ∥ŵ^{(i)} − w̃^{(i)}∥2^2 ≤ ( λmax (1 + ρ^2)^2 / ( min{ λ2(L^{(G)}) α ρ^2 , λ̄min/2 } )^2 ) Σ_{i=1}^n ∥ε^{(i)}∥2^2 .   (8.3)

Here, we used the shorthand ρ := λ̄min/(4 λmax) (see (5.4)).

Proof. Left as an exercise to the reader.

^{19} Our study can be generalized to also take into account perturbations of the feature matrices X^{(i)}, for i = 1, …, n.
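A small numerical sketch (not from the book) illustrating this sensitivity analysis: we solve a quadratic problem of the form (8.1) before and after an additive label perturbation and inspect the deviation of the minimizers. The matrix Q below is a generic positive definite stand-in for the GTVMin matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
X = rng.normal(size=(20, d))
y = rng.normal(size=20)
eps = 0.01 * rng.normal(size=20)         # small additive label perturbation

Q = X.T @ X / 20 + 0.1 * np.eye(d)       # positive definite quadratic term
q = -2 * X.T @ y / 20                    # linear term from the labels
q_pert = -2 * X.T @ (y + eps) / 20       # linear term from perturbed labels

# Minimizer of w^T Q w + q^T w solves 2 Q w = -q.
w_hat = np.linalg.solve(2 * Q, -q)
w_tilde = np.linalg.solve(2 * Q, -q_pert)
print(np.linalg.norm(w_hat - w_tilde))   # small deviation for small eps
```

Consistent with Proposition 8.1, the deviation of the minimizers scales with the size of the label perturbation and with the conditioning of Q.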

8.3.2 Estimation Error Analysis

Prop. 8.1 characterizes the sensitivity of GTVMin solutions against “external” perturbations of the local datasets. While this form of robustness is important, it might not suffice for a comprehensive assessment of an FL system. For example, we can trivially achieve perfect robustness (in the sense of minimum sensitivity) by delivering constant model parameters, e.g., ŵ^{(i)} = 0.

Another form of robustness is to ensure a small estimation error of (3.20). To study this form of robustness, we use a variant of the probabilistic model (3.27): We assume that the labels and features of the data points of each local dataset D^{(i)}, for i = 1, …, n, are related via

    y^{(i)} = X^{(i)} w + ε^{(i)} .   (8.4)

In contrast to Chapter 3, we assume that all components of (8.4) are deterministic. In particular, the noise term ε^{(i)} is a deterministic but unknown quantity. This term accommodates any perturbation that might arise from technical imperfections or intrinsic label noise due to random fluctuations in the labelling process.

In the (unlikely) ideal case of no perturbation, we would have ε^{(i)} = 0. However, in general we might only know some upper bound on the size of the perturbation, e.g., on ∥ε^{(i)}∥2^2. We next present upper bounds on the estimation error ŵ^{(i)} − w incurred by the GTVMin solutions ŵ^{(i)}.

This estimation error consists of two components, the first component being avg{ŵ^{(i′)}} − w for each node i ∈ V. Note that this error component is identical for all nodes i ∈ V. The second component of the estimation error is the deviation w̃^{(i)} := ŵ^{(i)} − avg{ŵ^{(i′)}} of the learnt local model parameters ŵ^{(i′)}, for i′ = 1, …, n, from their average avg{ŵ^{(i′)}} = (1/n) Σ_{i′=1}^n ŵ^{(i′)}. As discussed in Section 3.3.2, these two components correspond to two orthogonal subspaces of R^{d·n}.
According to Prop. 3.1, the second error component is upper bounded as

    Σ_{i=1}^n ∥w̃^{(i)}∥2^2 ≤ (1/(λ2 α)) Σ_{i=1}^n (1/m_i) ∥ε^{(i)}∥2^2 .   (8.5)

To bound the first error component c̄ − w, using the shorthand c̄ := avg{ŵ^{(i)}}, we first note that (see (3.20))

    c̄ = argmin_{w∈R^d} Σ_{i∈V} (1/m_i) ∥y^{(i)} − X^{(i)} w − X^{(i)} w̃^{(i)}∥2^2 + α Σ_{{i,i′}∈E} A_{i,i′} ∥w̃^{(i)} − w̃^{(i′)}∥2^2 .   (8.6)

Using a similar argument as in the proof of Prop. 2.1, we obtain

    ∥c̄ − w∥2^2 ≤ ∥ Σ_{i=1}^n (1/m_i) X^{(i)T} (ε^{(i)} + X^{(i)} w̃^{(i)}) ∥2^2 / (n λ̄min)^2 .   (8.7)

Here, λ̄min is the smallest eigenvalue of (1/n) Σ_{i=1}^n Q^{(i)}, i.e., of the average of the matrices Q^{(i)} = (1/m_i) X^{(i)T} X^{(i)} over all nodes i ∈ V.^{20} Note that the bound (8.7) is only valid if λ̄min > 0 which, in turn, implies that the solution to (8.6) is unique.

^{20} We encountered the quantity λ̄min already during our discussion of gradient-based methods for solving the GTVMin instance (3.20) (see (5.4)).
We can develop (8.7) further using

    ∥ Σ_{i=1}^n (1/m_i) X^{(i)T} (ε^{(i)} + X^{(i)} w̃^{(i)}) ∥2
      (a)≤ Σ_{i=1}^n ∥ (1/m_i) X^{(i)T} (ε^{(i)} + X^{(i)} w̃^{(i)}) ∥2
      (b)≤ √n ( Σ_{i=1}^n ∥ (1/m_i) X^{(i)T} (ε^{(i)} + X^{(i)} w̃^{(i)}) ∥2^2 )^{1/2}
      (c)≤ √n ( Σ_{i=1}^n 2 ∥ (1/m_i) X^{(i)T} ε^{(i)} ∥2^2 + 2 ∥ (1/m_i) X^{(i)T} X^{(i)} w̃^{(i)} ∥2^2 )^{1/2}
      (d)≤ √n ( Σ_{i=1}^n (2/m_i) λmax ∥ε^{(i)}∥2^2 + 2 λmax^2 ∥w̃^{(i)}∥2^2 )^{1/2} .   (8.8)

Here, step (a) uses the triangle inequality for the norm ∥·∥2, step (b) uses the Cauchy–Schwarz inequality, step (c) uses the inequality ∥a + b∥2^2 ≤ 2(∥a∥2^2 + ∥b∥2^2), and step (d) uses the maximum eigenvalue λmax := max_{i∈V} λ_d(Q^{(i)}) of the matrices Q^{(i)} = (1/m_i) X^{(i)T} X^{(i)} (see (5.4)).
Inserting (8.8) into (8.7) results in the upper bound

    ∥c̄ − w∥2^2 ≤ (2/(n λ̄min^2)) Σ_{i=1}^n ( (1/m_i) λmax ∥ε^{(i)}∥2^2 + λmax^2 ∥w̃^{(i)}∥2^2 )
               ≤ (2/(n λ̄min^2)) ( λmax + λmax^2/(λ2 α) ) Σ_{i=1}^n (1/m_i) ∥ε^{(i)}∥2^2 ,   (8.9)

where the second inequality follows from (8.5).

The upper bound (8.9) on the estimation error of GTVMin-based methods depends both on the FL network G, via the eigenvalue λ2 of L^{(G)}, and on the feature matrices X^{(i)} of the local datasets (via the quantities λmax and λ̄min as defined in (5.4)). Let us next discuss how the upper bound (8.9) might guide the choice of the FL network G and of the features of the data points in the local datasets.
local datasets.
According to (8.9), we should use an FL network G with a large λ2(L^{(G)}) to ensure a small estimation error of GTVMin-based methods. Note that we came across the same design criterion already when discussing graph learning methods in Chapter 7. In particular, using an FL network with a large λ2(L^{(G)}) also tends to speed up the convergence of gradient-based methods for solving GTVMin (such as Algorithm 5.1).

The upper bound (8.9) suggests using features that result in a small ratio λmax/λ̄min between the quantities λmax and λ̄min (see (5.4)). We might use feature learning methods to optimize this ratio.
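A hedged sketch for computing λmax and λ̄min (and their ratio) from given local feature matrices; the function name is illustrative:

```python
import numpy as np

# Compute lambda_max = max_i lambda_d(Q^(i)) and lambda_bar_min, the smallest
# eigenvalue of the average (1/n) sum_i Q^(i), for Q^(i) = (1/m_i) X^(i)T X^(i).
def eigen_quantities(feature_matrices):
    Qs = [X.T @ X / X.shape[0] for X in feature_matrices]
    lambda_max = max(np.linalg.eigvalsh(Q)[-1] for Q in Qs)   # largest eigenvalue
    Q_avg = sum(Qs) / len(Qs)
    lambda_bar_min = np.linalg.eigvalsh(Q_avg)[0]             # smallest eigenvalue
    return lambda_max, lambda_bar_min

rng = np.random.default_rng(0)
Xs = [rng.normal(size=(30, 3)) for _ in range(4)]
lmax, lbar = eigen_quantities(Xs)
print(lmax / lbar)   # a small ratio is preferable according to (8.9)
```

Note that λmax ≥ λ̄min always holds, so the ratio is at least 1; features yielding a ratio close to 1 make the bound (8.9) tightest.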

8.3.3 Network Resilience

The previous sections studied the robustness of GTVMin-based methods


against perturbations of local datasets (see Prop. 8.1) and in terms of ensuring
a small estimation error (see (8.9)). We also need to ensure that FL systems
are robust against imperfections of the computational infrastructure used to
solve GTVMin. These imperfections include hardware failures, running out
of battery or lack of wireless connectivity.
Chapter 5 showed how to design FL algorithms by applying gradient-
based methods to solve GTVMin (3.20). We obtain practical FL systems by
implementing these algorithms, such as Algorithm 5.1, in a particular compu-
tational infrastructure. Two important examples for such an infrastructure
are mobile networks and wireless sensor networks.
The effect of imperfections in the implementation of the GD-based Algorithm 5.1 can be modelled as perturbed GD (4.13) from Chapter 4. We can
then analyze the robustness of the resulting FL system via the convergence
analysis of perturbed GD (see Section 4.5).
According to (4.14), the performance of the decentralized Algorithm 5.1 degrades gracefully in the presence of imperfections such as missing or faulty communication links. In contrast, the client-server based implementation of FedAvg (Algorithm 5.3) has a single point of failure (the server).

8.3.4 Stragglers

Using perturbed GD to model imperfections in the actual implementation of Algorithm 5.1 is quite flexible. This flexibility in allowing for a wide range of imperfections might come at the cost of a coarse-grained analysis. In particular, the upper bound (4.14) might be too loose (pessimistic) to be of any use. We might then need to take into account the specific nature of the imperfection and the resulting perturbation ε^{(i)}. Let us next focus on a particular type of imperfection that arises from the asynchronous implementation of Algorithm 5.1 at different nodes i ∈ V.

Note that Algorithm 5.1 requires the nodes to operate synchronously: at each iteration, the nodes need to exchange their current model parameters w^{(i,k)} simultaneously with their neighbours in the FL network. This requires, in turn, that the update in step 4 of Algorithm 5.1 is completed at every node i before the global clock ticks (triggering the next iteration).

Some of the nodes i might have limited computational resources and therefore require much longer for the update step 4 of Algorithm 5.1. The literature sometimes refers to such slower nodes as stragglers [87]. Instead of forcing the other (faster) nodes to wait until the slower ones are ready, we could let them continue with their local updates. This results in an asynchronous variant of Algorithm 5.1, which we summarize in Algorithm 8.1.

Note that Algorithm 8.1 still uses a global iteration counter k. However, the role of this counter is fundamentally different from its role in the synchronous Algorithm 5.1. In particular, we merely need the counter k for notational convenience, to denote the (arbitrary) time instants at which the asynchronous local updates (8.10) take place. These updates take place only for a subset of ("active") nodes i ∈ A^{(k)}. All other ("inactive") nodes i ∉ A^{(k)} leave their current model parameters unchanged, w^{(i,k+1)} := w^{(i,k)}.

The asynchronous local update (8.10) at some node i ∈ A^{(k)} uses the "outdated" model parameters w^{(i′,k_{i,i′})} of its neighbours i′ ∈ N^{(i)}. In particular, some of these neighbours might not have been in the active sets A^{(k−1)}, A^{(k−2)}, . . . during the most recent iterations. In this case, we cannot use w^{(i′,k)}, as it is simply not available to node i at "time" k. Rather, we might need to use w^{(i′,k_{i,i′})} obtained at some previous "time" k_{i,i′}.

We can interpret the iteration index k_{i,i′} as the most recent time instant at which node i′ updated its local model parameters and shared them with node i. The difference k − k_{i,i′}, in turn, can be viewed as a measure of the communication delay from node i′ to node i. The robustness of (the convergence of) Algorithm 8.1 against these communication delays is studied in-depth in [48, Ch. 6 and 7].

15
Algorithm 8.1 Asynchronous FedGD for Local Linear Models
Input: FL network G; GTV parameter α; learning rate η; local dataset D^{(i)} = {(x^{(i,1)}, y^{(i,1)}), . . . , (x^{(i,m_i)}, y^{(i,m_i)})} for each i; some stopping criterion.
Output: linear model parameters ŵ^{(i)} for each node i ∈ V
Initialize: k := 0; w^{(i,0)} := 0 for all nodes i ∈ V.

1:  while stopping criterion is not satisfied do
2:    for all active nodes i ∈ A^{(k)} do
3:      update the local model parameters via
          w^{(i,k+1)} := w^{(i,k)} + η [ (2/m_i) (X^{(i)})^T (y^{(i)} − X^{(i)} w^{(i,k)})
                          + 2α Σ_{i′ ∈ N^{(i)}} A_{i,i′} (w^{(i′,k_{i,i′})} − w^{(i,k)}) ]   (8.10)
4:      share the local model parameters w^{(i,k+1)} with the neighbours i′ ∈ N^{(i)}
5:    end for
6:    for all inactive nodes i ∉ A^{(k)} do
7:      keep the model parameters unchanged, w^{(i,k+1)} := w^{(i,k)}
8:    end for
9:    increment the iteration counter: k := k+1
10: end while
11: ŵ^{(i)} := w^{(i,k)} for all nodes i ∈ V
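To make the update rule concrete, the following is a minimal NumPy simulation of the asynchronous update (8.10) for two nodes joined by a single edge with weight A_{1,2} = 1. The dataset, activation probability, and step size are invented for illustration and are not part of Algorithm 8.1 itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal simulation of the asynchronous update (8.10) for two nodes joined
# by an edge with weight A_{1,2} = 1. Data, activation probability, and step
# size are invented for illustration.
n, d, alpha, eta = 2, 3, 0.5, 0.05
X = [rng.normal(size=(20, d)) for _ in range(n)]   # local feature matrices X^{(i)}
w_true = rng.normal(size=d)
y = [Xi @ w_true for Xi in X]                      # noiseless local labels y^{(i)}
w = [np.zeros(d) for _ in range(n)]                # w^{(i,0)} := 0
shared = [wi.copy() for wi in w]                   # last shared w^{(i',k_{i,i'})}

for k in range(500):
    active = [i for i in range(n) if rng.random() < 0.7]   # active set A^{(k)}
    for i in active:
        j = 1 - i                                  # the single neighbour of node i
        grad_step = (2 / len(y[i])) * X[i].T @ (y[i] - X[i] @ w[i])
        coupling = 2 * alpha * (shared[j] - w[i])  # uses possibly stale parameters
        w[i] = w[i] + eta * (grad_step + coupling)
        shared[i] = w[i].copy()                    # step 4: share with neighbours

print(np.linalg.norm(w[0] - w_true))               # small after enough iterations
```

Each node only ever reads the most recently shared parameters of its neighbour, mimicking the stale iterates w^{(i′,k_{i,i′})} in (8.10).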
8.4 Subjective Explainability of FL Systems

Let us now discuss how to ensure key requirement KR4 - Transparency in GTVMin-based FL systems. This key requirement also includes the explainability of the trained personalized models ĥ^{(i)} ∈ H^{(i)} (and their predictions). It is important to note that the explainability of ĥ^{(i)} is subjective: a given learnt hypothesis ĥ^{(i)} might offer a high degree of explainability to one user (say, a graduate student at a university) but a low degree of explainability to another user (a high-school student). We must ensure explainability for the specific user, which we will also denote by i, that "consumes" the predictions at node i ∈ V of the FL network.

The explainability of a trained ML model is closely related to its simulatability [88–90]: how well can a user anticipate (or guess) the prediction ŷ = ĥ^{(i)}(x) delivered by ĥ^{(i)} for a data point with features x? We can then measure the explainability of ĥ^{(i)}(x) to the user at node i by comparing the prediction ĥ^{(i)}(x) with the corresponding "guess" (or "simulation") u^{(i)}(x).
We can enforce (subjective) explainability of FL systems by modifying the local loss functions in GTVMin. For ease of exposition, we will focus on the GTVMin instance (5.1) for training local (personalized) linear models. For each node i ∈ V, we construct a test set D_t^{(i)} and ask user i to deliver a guess u^{(i)}(x) for each data point in D_t^{(i)}.²¹

We measure the (subjective) explainability of a linear hypothesis with model parameters w^{(i)} by

    (1/|D_t^{(i)}|) Σ_{x ∈ D_t^{(i)}} ( u^{(i)}(x) − x^T w^{(i)} )².   (8.11)

²¹ We only use the features of the data points in D_t^{(i)}, i.e., this dataset can be constructed from unlabeled data.

It seems natural to add this measure as a penalty term to the local loss function in (5.1), resulting in the new loss function

    L_i(w^{(i)}) := (1/m_i) ∥y^{(i)} − X^{(i)} w^{(i)}∥²₂ + ρ (1/|D_t^{(i)}|) Σ_{x ∈ D_t^{(i)}} ( u^{(i)}(x) − x^T w^{(i)} )².   (8.12)

The first term in (8.12) is the training error, while the second term measures the (lack of) subjective explainability. The regularization parameter ρ controls the preference for a high subjective explainability of the hypothesis h^{(i)}(x) = (w^{(i)})^T x over a small training error [90]. It can be shown that (8.12) is the average weighted squared error loss of h^{(i)}(x) on an augmented version of D^{(i)}. This augmented version includes the data point (x, u^{(i)}(x)) for each data point x in the test set D_t^{(i)}.
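As a sanity check of (8.11) and (8.12), the following sketch evaluates the penalized local loss on synthetic data and computes its minimizer via the normal equations of the augmented dataset. All numbers, including the user guesses u^{(i)}(x), are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the quantities in (8.12); all values are made up.
m, d, rho = 30, 4, 2.0
X = rng.normal(size=(m, d))          # local feature matrix X^{(i)}
y = X @ rng.normal(size=d)           # local label vector y^{(i)}
X_test = rng.normal(size=(10, d))    # features of the test set D_t^{(i)}
u = X_test @ np.ones(d)              # user guesses u^{(i)}(x)

def local_loss(w):
    training_error = np.mean((y - X @ w) ** 2)
    explainability_penalty = np.mean((u - X_test @ w) ** 2)   # the measure (8.11)
    return training_error + rho * explainability_penalty

# The penalized loss (8.12) is a weighted least-squares problem; its minimizer
# solves the normal equations of the augmented dataset described in the text.
A = X.T @ X / m + rho * X_test.T @ X_test / len(u)
b = X.T @ y / m + rho * X_test.T @ u / len(u)
w_hat = np.linalg.solve(A, b)
print(local_loss(w_hat) <= local_loss(np.zeros(d)))  # → True
```

Increasing ρ pulls the learnt parameters towards the user guesses, trading training error for subjective explainability.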


So far, we have focused on the problem of explaining a trained personalized model to some user. The general idea is to provide partial information, in the form of some explanation, about the learnt hypothesis map ĥ. Explanations should help the user to anticipate the prediction ĥ(x) for any given data point. Instead of explaining a trained model, i.e., a learnt hypothesis ĥ, we might also want to explain an entire FL algorithm.

Mathematically, we can interpret an FL algorithm as a map A that reads in local datasets and delivers learnt hypothesis maps ĥ^{(i)}. Explaining an FL algorithm amounts to providing partial information about this map A. Thus, mathematically speaking, the problem of explaining a learnt hypothesis is essentially the same as explaining an entire FL algorithm: provide partial information (an explanation) about a map such that the user can anticipate the results of applying the map to arbitrary arguments. However, the map A might be much more complicated than a learnt hypothesis (which could be a linear map for linear models). The different levels of complexity of these two families of maps require different forms of explanation. For example, we might explain an FL algorithm using pseudo-code such as Algorithm 5.1. Another form of explanation could be a Python code snippet that illustrates a potential implementation of the algorithm.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy

Figure 8.2: Python code for fitting a decision tree model on the Iris dataset.

8.5 Overview of Coding Assignment

Python Notebook. TrustworthyFL_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment continues from the coding assignment "FL Algorithms" (see Section 5.7). That previous assignment required you to implement Algorithm 5.1 and apply it to an FL network whose nodes are FMI weather stations. We now study the effect of adding perturbations ε^{(i)} to the label vector y^{(i)}, consisting of temperature measurements taken at the FMI station i. In particular, we replace the label vector y^{(i)} with the perturbed label vector y^{(i)} + ε^{(i)}. How much do the local model parameters delivered by Algorithm 5.1 change when adding perturbations of different strength ∥ε^{(i)}∥₂?
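A minimal sketch of this experiment, using a synthetic stand-in for the FMI temperature records: fit least-squares parameters to a clean label vector y and to the perturbed vector y + ε, and record the parameter change for growing perturbation strength.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the experiment: least-squares parameters learnt from
# a clean label vector y versus from the perturbed vector y + eps.
m, d = 50, 3
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

def fit(labels):
    return np.linalg.lstsq(X, labels, rcond=None)[0]

w_clean = fit(y)
diffs = []
for strength in [0.01, 0.1, 1.0]:
    eps = strength * rng.normal(size=m)            # perturbation of the labels
    w_pert = fit(y + eps)
    diffs.append(np.linalg.norm(w_pert - w_clean))
    print(strength, diffs[-1])
```

The parameter change scales roughly linearly with ∥ε^{(i)}∥₂, in line with the graceful degradation suggested by Prop. 8.1.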

9 Privacy-Protection in FL
The core idea of FL is to share information contained in local datasets in order
to improve the training of ML models. Chapter 5 discussed FL algorithms that
share information in the form of model parameters that are computed from
the local loss function. Each node i ∈ V receives current model parameters
from other nodes and, after executing a local update, shares its new model
parameters with other nodes.
Depending on the design choices for GTVMin-based methods, sharing model parameters allows to reconstruct local loss functions and, in turn, to estimate private information about individual data points, such as health-care customers ("patients") [91]. Thus, the bad news is that FL systems will almost inevitably incur some leakage of private information. The good news is, however, that the extent of privacy leakage can be controlled by (i) careful design choices for GTVMin and (ii) applying slight modifications to basic FL algorithms (such as those from Chapter 5).

This chapter revolves around two main questions:

• Q1. How can we measure the privacy leakage of an FL system?

• Q2. How can we control (minimize) the privacy leakage of an FL system?

Section 9.2 addresses Q1, while Sections 9.3 and 9.4 address Q2.

9.1 Learning Goals

After completing this chapter, you will

• be aware of threats to privacy and the need to protect it,

• know some quantitative measures for privacy leakage,

• understand how GTVMin can facilitate privacy protection,

• be able to implement FL algorithms with guaranteed levels of privacy


protection.

9.2 Measuring Privacy Leakage

Consider an FL system that trains a personalized model for the users, indexed by i = 1, . . . , n, of heart rate sensors. Each user i generates a local dataset D^{(i)} that consists of time-stamped heart rate measurements. We define a single data point as a single continuous activity, e.g., a 50-minute run. The features of such a data point (activity) might include the trajectory in the form of a time series of GPS coordinates (e.g., measured every 30 seconds). The label of a data point (activity) could be the average heart rate during the activity. Let us assume that this average heart rate is private information that should not be shared with anybody.²²

Our FL system also exploits the information provided by a fitness expert that determines pair-wise similarities A_{i,i′} between users i, i′ (e.g., based on body weight and height). We then use Algorithm 5.1 to learn, for each user i, the model parameters w^{(i)} of some AI health-care assistant [92]. In what follows, we interpret Algorithm 5.1 as a map A(·) (see Figure 9.1). The map A reads in the dataset D := {D^{(i)}}_{i=1}^n (constituted by the local datasets D^{(i)} for i = 1, . . . , n) and delivers the local model parameters A(D) := stack{ŵ^{(i)}}_{i=1}^n =: ŵ.

²² In particular, we might not want to share our heart rate profiles with a potential future employer who prefers candidates with a long life expectancy.

Figure 9.1: Algorithm 5.1 maps the collection D of local datasets D^{(i)} to the learnt model parameters ŵ^{(i)}, for each node i = 1, . . . , n. These learnt model parameters are (approximate) solutions to the GTVMin instance (3.20).

A privacy-preserving FL system should not allow to infer, solely from the learnt model parameters, the average heart rate y^{(i,r)} during a specific single activity r of a specific user i. Mathematically, we must ensure that the map A is not invertible: the learnt model parameters should not change if we applied the FL algorithm to a perturbed dataset that includes a different value for the average heart rate y^{(i,r)}. Figure 9.2 illustrates an algorithm that is partially invertible, in the sense of allowing to infer the labels of some data points used in the training set.

The mere requirement that an FL algorithm A be non-invertible is not useful in general. Indeed, we can easily make any algorithm A non-invertible by simple pre- or post-processing techniques whose effect is limited to irrelevant regions of the input space (which is the space of all possible datasets). The level of privacy protection offered by A is better characterized by a quantitative measure of its "non-invertibility".

A simple measure of non-invertibility is the sensitivity of the output A(D)
Figure 9.2: Scatterplot of data points r = 1, 2, . . ., each characterized by features x^{(r)} = (x₁^{(r)}, x₂^{(r)})^T and a binary label y^{(r)} ∈ {◦, ×}. The plot also indicates the decision regions of a hypothesis ĥ that has been learnt via ERM. Would you be able to infer the label of data point r = 2 if you knew the decision regions?

Figure 9.3: Two probability distributions p(ŵ; D) and p(ŵ; D′) of the learnt model parameters ŵ = stack{ŵ^{(i)}}_{i=1}^n delivered by some FL algorithm (such as Algorithm 5.1). These two probability distributions correspond to two different choices for the input dataset, denoted by D′ and D. For example, D′ might be obtained from D by changing the value of a private feature of some data point in D. We also indicate an "acceptance region" T that is used to detect whether D (or a neighbouring dataset D′) has been fed into the algorithm.

against varying the heart rate value y^{(i,r)},

    ∥A(D) − A(D′)∥₂ / ε.   (9.1)

Here, D denotes some given collection of local datasets and D′ is a modified dataset. In particular, D′ is obtained by replacing the actual average heart rate y^{(i,r)} with the modified value y^{(i,r)} + ε. The privacy protection of A is higher for smaller values of (9.1), i.e., when the output changes only little upon varying the value of the average heart rate.
Another measure of the non-invertibility of A is referred to as differential privacy (DP). This measure is particularly useful for stochastic algorithms that use some random mechanism. One example of such a random mechanism is the selection of a random subset of data points (a batch) within Algorithm 5.1. Section 9.3 discusses another example of a random mechanism: adding the realization of a RV to the (intermediate) results of an algorithm.

A stochastic algorithm A can be described by a probability distribution p(ŵ; D) over a measurable space that is constituted by the possible values of the learnt model parameters ŵ (see Figure 9.3).²³ This probability distribution is parametrized by the dataset D that is fed as input to the algorithm A.

²³ For more details about the concept of a measurable space, we refer to the literature on probability and measure theory [93–95].

DP measures the non-invertibility of a stochastic algorithm A via the similarity of the probability distributions obtained for two datasets D, D′ that are considered neighbouring or adjacent [79, 96]. Typically, we consider D′ as adjacent to D if it is obtained by modifying the features or label of a single data point in D. As a case in point, consider data points representing physical activities which are characterized by a binary feature x_j ∈ {0, 1}
that indicates an excessively high average heart rate during the activity. We
could then define neighbouring datasets via flipping the feature xj of a single
data point. In general, the notion of neighbouring datasets is a design choice
used in the formal definition of DP.

Definition 1. (from [96]) A stochastic algorithm A is (ε, δ)-DP if, for any measurable set S and any two neighbouring datasets D, D′,

    Prob{A(D) ∈ S} ≤ exp(ε) · Prob{A(D′) ∈ S} + δ.   (9.2)
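To make condition (9.2) concrete, the following sketch checks it for the classical Laplace output perturbation: with noise scale b = ∆/ε, where ∆ bounds |f(D) − f(D′)| over neighbouring datasets, the ratio of the two output densities stays within exp(±ε) everywhere, which implies (9.2) with δ = 0 for every measurable set S. The sensitivity value below is assumed for illustration.

```python
import numpy as np

# Laplace output perturbation: A(D) = f(D) + Lap(b) with b = Delta/eps.
# The density ratio of the two output distributions is bounded by exp(eps),
# which implies condition (9.2) with delta = 0 for every measurable set S.
def laplace_pdf(w, mu, b):
    return np.exp(-np.abs(w - mu) / b) / (2 * b)

eps, Delta = 1.0, 1.0          # privacy parameter and assumed sensitivity
b = Delta / eps                # noise scale calibrated to the sensitivity
w_grid = np.linspace(-10.0, 10.0, 2001)

# outputs of f on two neighbouring datasets differ by at most Delta:
ratio = laplace_pdf(w_grid, 0.0, b) / laplace_pdf(w_grid, Delta, b)
print(ratio.max() <= np.exp(eps) + 1e-9)  # → True
```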

It appears that Definition 1 is the current de-facto standard for measuring the (lack of) privacy protection in FL systems [79, 96]. Nevertheless, there are also other measures of the similarity between the probability distributions p(ŵ; D) and p(ŵ; D′) that might be more useful for practical or theoretical reasons [97]. One such alternative measure is the Rényi divergence of order α > 1,

    D_α( p(ŵ; D) ∥ p(ŵ; D′) ) := (1/(α−1)) log E_{p(ŵ;D′)} { ( dp(ŵ; D) / dp(ŵ; D′) )^α }.   (9.3)

The Rényi divergence allows to define the following variant of DP [97, 98].

Definition 2. (from [96]) A stochastic algorithm A is (α, γ)-RDP if, for any two neighbouring datasets D, D′,

    D_α( p(ŵ; D) ∥ p(ŵ; D′) ) ≤ γ.   (9.4)
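As a numerical illustration of (9.3) and (9.4), the following sketch evaluates the Rényi divergence between two Gaussian output distributions and compares it with the known closed form D_α = α µ²/(2σ²) for equal variances (here σ = 1); the concrete numbers are illustrative, not tied to any particular FL algorithm.

```python
import numpy as np

# Numerical check of (9.3) for Gaussian output distributions p(w; D) = N(0, 1)
# and p(w; D') = N(mu, 1). For equal variances, the closed form is
# D_alpha = alpha * mu^2 / 2 (with sigma = 1), so such a mechanism satisfies
# Definition 2 with gamma = alpha * mu^2 / 2.
alpha, mu = 2.0, 1.0
w = np.linspace(-20.0, 20.0, 200001)
dw = w[1] - w[0]
p = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(w - mu)**2 / 2) / np.sqrt(2 * np.pi)

expectation = np.sum((p / q) ** alpha * q) * dw   # E_{p(w;D')}[(dp/dq)^alpha]
d_alpha = np.log(expectation) / (alpha - 1)
print(d_alpha)  # close to alpha * mu**2 / 2 = 1.0
```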

A recent use-case of (α, γ)-RDP is the analysis of DP guarantees offered


by variants of SGD [97]. This analysis uses the fact that (α, γ)-RDP implies
(ε, δ)-DP for suitable choices of ε, δ [97].
One important property of the DP notions in Definition 1 and Definition
2 is that they are preserved by post-processing:

Proposition 9.1. Consider an FL system A that is applied to some dataset D and some (possibly stochastic) map B that does not depend on D. If A is (ε, δ)-DP (or (α, γ)-RDP), then so is the composition B ◦ A.

Proof. See, e.g., [96, Sec. 2.3].

By Prop. 9.1, we cannot reduce the (level of) DP of A by any post-processing method B that has no access to the raw data itself. It seems almost natural to make this "post-processing immunity" a defining property of any useful notion of DP [98]. However, due to Prop. 9.1, this property is already built into Definition 1 (and Definition 2).
Operational Meaning of DP. The above (mathematically precise) definitions of DP might seem somewhat abstract. It is instructive to interpret them from the perspective of statistical testing: we could use the output of an algorithm A to test (or detect) whether the underlying dataset fed into A was D or whether it actually was a neighbouring dataset D′ [99]. Such a statistical test amounts to specifying a region T and declaring either

• "dataset D seems to be used", if the output of A lies in T, or

• "dataset D′ seems to be used", if the output of A does not lie in T.

The performance of a test T is characterized by two error probabilities:

• The probability of declaring D′ while actually D was fed into A, which is P_{D→D′} := 1 − ∫_T p(ŵ; D) dŵ.

• The probability of declaring D while actually D′ was fed into A, which is P_{D′→D} := ∫_T p(ŵ; D′) dŵ.

For a privacy-preserving algorithm A, there should be no test T for which both P_{D→D′} and P_{D′→D} are small (close to 0). This intuition can be made precise as follows (see, e.g., [100, Thm. 2.1.] or [101]): if an algorithm A is (ε, δ)-DP, then

    exp(ε) · P_{D→D′} + P_{D′→D} ≥ 1 − δ.   (9.5)

Thus, if A is (ε, δ)-DP with small ε, δ (close to 0), then (9.5) implies P_{D→D′} + P_{D′→D} ≈ 1.

9.3 Ensuring Differential Privacy

Depending on the underlying design choices (for data, model, and optimization method), a GTVMin-based method A might already ensure DP by design. For other design choices, the resulting GTVMin-based method A might not ensure DP. However, according to Prop. 9.1, we might then still be able to ensure DP by applying pre- and/or post-processing techniques to the input (local datasets) and output (learnt model parameters) of A. Formally, this means to compose the map A with two (possibly stochastic) maps I and O, resulting in a new algorithm with map A′ := O ◦ A ◦ I. The output of A′ for a given dataset D is obtained by

• first applying the pre-processing, I(D),

• then the original algorithm, A(I(D)),

• and finally the post-processing, O(A(I(D))) =: A′(D).
Post-Processing. Maybe the most widely used post-processing to ensure DP is simply to add some noise [96],

    O(A) := A + n, with noise n = (n₁, . . . , n_{nd})^T, n₁, . . . , n_{nd} i.i.d. ∼ p(n).   (9.6)

Note that the post-processing (9.6) is parametrized by the choice of the probability distribution p(n) of the noise entries. Two important choices are the Laplacian distribution p(n) := (1/(2b)) exp(−|n|/b) and the normal distribution p(n) := (1/√(2πσ²)) exp(−n²/(2σ²)) (i.e., using Gaussian noise n ∼ N(0, σ²)).

When using Gaussian noise n ∼ N(0, σ²) in (9.6), the variance σ² can be chosen based on the sensitivity

    ∆₂(A) := max_{D,D′} ∥A(D) − A(D′)∥₂.   (9.7)

Here, the maximum is over all pairs of neighbouring datasets D, D′. Adding Gaussian noise with standard deviation σ > √(2 ln(1.25/δ)) · ∆₂(A)/ε ensures that A is (ε, δ)-DP [96, Thm. 3.22]. It might be difficult to evaluate the sensitivity (9.7) for a given FL algorithm A [102]. For a GTVMin-based method, i.e., when A(D) is a solution to (3.18), we might obtain upper bounds on ∆₂(A) via a perturbation analysis similar in spirit to the proof of Prop. 8.1.
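A minimal sketch of this calibration; the sensitivity bound ∆₂(A) below is assumed for illustration, not computed from an actual FL algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical numbers: the sensitivity bound Delta_2(A) is assumed, not computed.
eps, delta = 1.0, 1e-5
sensitivity = 0.1

# Noise scale from [96, Thm. 3.22]: sigma > sqrt(2 ln(1.25/delta)) * Delta_2(A) / eps
sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / eps

w_learnt = np.array([0.7, -1.2])                  # stand-in for the output A(D)
w_released = w_learnt + rng.normal(scale=sigma, size=w_learnt.shape)
print(round(float(sigma), 3))  # → 0.484
```

Note how a smaller ε (stronger privacy) or a smaller δ forces a larger noise scale σ, degrading the utility of the released parameters.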


Pre-Processing. Instead of ensuring DP via post-processing the output of an FL algorithm A, we can ensure DP by applying a pre-processing map I(D) to the dataset D. The result of the pre-processing is a new dataset D̂ = I(D), which can be made available (publicly!) to any algorithm A that has no direct access to D. According to Prop. 9.1, as long as the pre-processing map I is (ε, δ)-DP (see Definition 1), so will be the composition A ◦ I.

As for post-processing, one important approach to pre-processing is to "add" or "inject" noise. This results in a stochastic pre-processing map D̂ = I(D) that is characterized by a probability distribution. The noise mechanisms used for pre-processing might differ from simply adding the realization of a RV (see (9.6)):²⁴

• For a classification method with a discrete label space Y = {1, . . . , K}, we can inject noise by replacing the true label of a data point with a randomly selected element of Y [103, Mechanism 1]. The noise injection might also include the replacement of the features of a data point by the realization of a RV whose probability distribution is somehow matched to the dataset D [103, Mechanism 2].

• Another form of noise injection is to construct I(D) by randomly selecting data points from the original (private) dataset D [104]. Note that such a form of noise injection is naturally provided by SGD methods (see, e.g., step 4 of Algorithm 5.2).
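The first mechanism can be sketched as follows; the label space size, the keep probability, and the labels themselves are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Each label is kept with probability p_keep and otherwise replaced by a
# uniform draw from Y = {0, ..., K-1}; all numbers here are illustrative.
K, p_keep = 3, 0.8
labels = rng.integers(0, K, size=1000)           # the true (private) labels

flip = rng.random(labels.size) >= p_keep         # which labels to randomize
noisy = labels.copy()
noisy[flip] = rng.integers(0, K, size=flip.sum())

print(np.mean(noisy == labels))  # roughly p_keep + (1 - p_keep) / K
```

The pre-processed labels can be published and fed into any downstream learner; by Prop. 9.1, the learner inherits whatever DP guarantee the randomization provides.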

How To Be Sure? Consider some algorithm A, possibly obtained via pre- and post-processing techniques, that is claimed to be (ε, δ)-DP. In practice, we might not know the detailed implementation of the algorithm. For example, we might not have access to the noise generation mechanism used in the pre- or post-processing steps. How can we verify a claim about the DP of an algorithm A without having access to the detailed implementation of A? One approach could be to apply the algorithm to synthetic datasets D_syn^{(1)}, . . . , D_syn^{(L)} that differ only in some private attribute of a single data point. We can then try to predict the private attribute s^{(r)} of the dataset D_syn^{(r)} by applying a learnt hypothesis ĥ to the output A(D_syn^{(r)}) delivered by the "algorithm under test" A. The hypothesis ĥ might be learnt by an ERM-based method (see Algorithm 2.1) using a training set consisting of pairs (A(D_syn^{(r)}), s^{(r)}) for some r ∈ {1, . . . , L}.

²⁴ Can you think of a simple pre-processing map that is deterministic and guarantees maximum DP?

9.4 Private Feature Learning

Section 9.3 discussed pre-processing techniques that ensure the DP of an FL algorithm. We next discuss pre-processing techniques that are not directly motivated from a DP perspective. Instead, we cast the privacy-friendly pre-processing of a dataset as a feature learning problem [6, Ch. 9].

Consider a data point characterized by a feature vector x ∈ R^d and a label y ∈ R. Moreover, each data point is characterized by a private attribute s. We want to learn a (potentially stochastic) feature map Φ : R^d → R^{d′} such that the new features z = Φ(x) ∈ R^{d′} do not allow to accurately predict the private attribute s. Trivially, we can make the prediction of s from Φ(x) impossible by using a constant map, e.g., Φ(x) = 0. However, we still want the new features z = Φ(x) to allow an accurate prediction (using a suitable hypothesis) of the label y of a data point.
Privacy Funnel. To quantify the predictability of the private attribute s solely from the transformed features z = Φ(x), we can use the i.i.d. assumption as a simple but useful probabilistic model. Indeed, we can then use the mutual information (MI) I(s; Φ(x)) as a measure of the predictability of s from Φ(x). A small value of I(s; Φ(x)) indicates that it is difficult to predict the private attribute s solely from Φ(x), i.e., a high level of privacy protection.²⁵ Similarly, we can use the MI I(y; Φ(x)) to measure the predictability of the label y from Φ(x). A large value of I(y; Φ(x)) indicates that Φ(x) allows to accurately predict y (which is of course preferable).

²⁵ The relation between MI-based privacy measures and DP has been studied in some detail recently [105].

Figure 9.4: The solutions of the privacy funnel (9.8) trace out (for varying constraint R) a curve in the plane spanned by the values of I(s; Φ(x)) (measuring the privacy leakage) and I(y; Φ(x)) (measuring the usefulness of the transformed features for predicting the label).

It seems natural to use a feature map Φ(x) that optimally balances a small I(s; Φ(x)) (privacy protection) with a sufficiently large I(y; Φ(x)) (allowing to accurately predict y). The mathematically precise formulation of this plan is known as the privacy funnel [106, Eq. (2)],

    min_{Φ(·)} I(s; Φ(x))   such that   I(y; Φ(x)) ≥ R.   (9.8)

Figure 9.4 illustrates the solutions of (9.8) for varying R, i.e., for a varying minimum required value of I(y; Φ(x)).
Optimal Private Linear Transformation. The privacy funnel (9.8) uses the MI I(s; Φ(x)) to quantify the privacy leakage of a feature map Φ(x). An alternative measure of the privacy leakage is the minimum reconstruction error s − ŝ. The reconstruction ŝ is obtained by applying a reconstruction map r(·) to the transformed features Φ(x). If the joint probability distribution p(s, x) is a multivariate normal distribution and Φ(·) is a linear map (of the form Φ(x) := Fx with some matrix F), then the optimal reconstruction map is again linear [30].

We would like to find a linear feature map Φ(x) := Fx such that, for any linear reconstruction map r (resulting in ŝ := r^T Fx), the expected squared error E{(s − ŝ)²} is large. The smallest possible expected squared error loss

    ε(F) := min_{r ∈ R^{d′}} E{(s − r^T Fx)²}   (9.9)

measures the level of privacy protection offered by the new features z = Fx. The larger the value ε(F), the more privacy protection is offered. It can be shown that ε(F) is maximized by any F that is orthogonal to the cross-covariance vector c_{x,s} := E{xs}, i.e., whenever F c_{x,s} = 0. One specific choice for F that satisfies this orthogonality condition is

    F = I − (1/∥c_{x,s}∥²₂) c_{x,s} c_{x,s}^T.   (9.10)

Figure 9.5 illustrates a dataset for which we want to find a linear feature map F such that the new features z = Fx do not allow to accurately predict a private attribute.
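A quick numerical check of the choice (9.10), using a random stand-in for the cross-covariance vector c_{x,s}:

```python
import numpy as np

rng = np.random.default_rng(5)

# A stand-in cross-covariance vector; in practice c_{x,s} is estimated from data.
d = 4
c = rng.normal(size=d)
F = np.eye(d) - np.outer(c, c) / np.dot(c, c)    # the choice (9.10)

print(np.linalg.norm(F @ c))  # numerically zero: F annihilates c_{x,s}
```

Geometrically, F is the orthogonal projection onto the complement of the direction c_{x,s}, so the new features z = Fx retain all components of x except the one (linearly) correlated with s.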

9.5 Problems

9.1. Where is Alice? Consider a device, named Alice, that implements the asynchronous Algorithm 8.1. The local dataset of the device consists of temperature
Figure 9.5: A toy dataset D whose data points represent customers, each characterized by features x = (x₁, x₂)^T. These raw features carry information about a private attribute s (gender) and the label y (food preference) of a person. The scatterplot suggests that we can find a linear feature transformation F := f^T ∈ R^{1×2}, resulting in a new feature z := Fx that does not allow to predict s while still allowing to predict y.

measurements from some FMI weather station. Assume that no other device interacts with Alice except your device, named Bob. Develop software for Bob that interacts with Alice according to Algorithm 8.1 in order to determine at which FMI station we can find Alice.

9.6 Overview of Coding Assignment

Python Notebook. PrivacyProtectionFL_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment builds on the coding assignment “ML Basics”


(see Section 2.7). Similar to this previous coding assignment, we consider
data points, indexed by r = 1, . . . , m, that represent weather measurements
recorded by the FMI. The r-th data point is characterized by (normalized)
coordinates, a time stamp and a temperature measurement.
The overall idea of the assignment is to illustrate a model inversion attack: an attacker tries to determine a private attribute of an individual by evaluating a trained model (learnt hypothesis). Here, the private attribute is the latitude and longitude of a person that shares a snapshot of a thermometer during a journey through Finland. The attacker might then use a trained ML model to predict the latitude and longitude from the disclosed (public) attributes x, consisting of the time stamp and temperature value.

9.6.1 Where Are You?

Consider a social media post of a friend who is travelling across Finland. This post includes a snapshot of a temperature measurement and a clock. Can you guess the latitude and longitude of the location where your friend took this snapshot? We can use ERM to do this: use Algorithm 2.1 to learn a vector-valued hypothesis ĥ for predicting latitude and longitude from the time and value of a temperature measurement. For the training set and validation set, we use the FMI recordings stored in the data file.
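A hedged sketch of this attack on synthetic stand-in data (the linear relation between the public attributes and the location is invented; the actual assignment uses the FMI recordings in the data file):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-in for the FMI data: a made-up linear relation between the
# public attributes (time stamp, temperature) and the location (lat, lon).
m = 200
time_temp = rng.normal(size=(m, 2))
lat_lon = time_temp @ rng.normal(size=(2, 2)) + 0.05 * rng.normal(size=(m, 2))

# ERM with a vector-valued linear hypothesis: least squares on a training split.
W = np.linalg.lstsq(time_temp[:150], lat_lon[:150], rcond=None)[0]
guess = time_temp[150:] @ W

val_error = np.mean((guess - lat_lon[150:]) ** 2)
print(val_error)  # small: the public attributes reveal the location
```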

9.6.2 Ensuring Privacy with Pre-Processing

Repeat the privacy attack described in Section 9.6.1, but this time using a pre-processed version of the raw data. In particular, try out combinations of randomly selecting a subset of the data points in the data file and adding noise to their features and labels. How well can you predict the latitude and longitude from the time and value of a temperature measurement using a hypothesis ĥ learnt from the perturbed data?

9.6.3 Ensuring Privacy with Post-Processing

Repeat the privacy attack described in Section 9.6.1, but this time using a post-processing of the learnt hypothesis ĥ (obtained from Algorithm 2.1 applied to the data file). In particular, study how well you can predict the latitude and longitude from the time and value of a temperature measurement using the noisy prediction ĥ(x) + n. Here, n is a realization drawn from a multivariate normal distribution N(0, σ²I).

9.6.4 Private Feature Learning

We now use the original features x ∈ R⁷ of an FMI weather recording, as used in the "ML Basics" assignment (see Section 2.7). The goal of this task is to learn a linear feature transformation z = Fx such that the new features do not allow to recover the latitude x₁ (which is considered a private attribute s of the data point). In particular, we construct the matrix F according to (9.10) by replacing the exact cross-covariance vector c_{x,s} with an estimate (or approximation) ĉ_{x,s}. This estimate is computed as follows:

1. read in all data points stored in the data file and construct a feature
matrix X ∈ Rm×7 with m being the total number of data points

2. remove the sample means from each feature, resulting in the centered
feature matrix

b := X − (1/m)11T X , 1 := 1, . . . , 1 T ∈ Rm . (9.11)

X

3. extract the private attribute (centred normalized latitude) for each data
point and store it in the vector

s := (x̂1(1), x̂1(2), . . . , x̂1(m))T .   (9.12)

4. compute the approximate cross-covariance vector

ĉx,s := (1/m)X̂T s .   (9.13)
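The four steps above can be sketched as follows; the feature matrix below is a hypothetical stand-in for the data points in the data file:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the data file: m data points with 7 features,
# where the first feature x1 is the (normalized) latitude.
m = 200
X = rng.normal(size=(m, 7))

# Step 2: centre the features, X_hat = X - (1/m) 1 1^T X, cf. (9.11).
X_hat = X - X.mean(axis=0)

# Step 3: stack the centred private attribute x1 into the vector s, cf. (9.12).
s = X_hat[:, 0]

# Step 4: approximate cross-covariance vector, cf. (9.13).
c_hat = (1.0 / m) * X_hat.T @ s
print(c_hat.shape)  # (7,)
```

Note that the first entry of ĉx,s is the (biased) sample variance of the private attribute itself.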

The matrix F, obtained from (9.10) by replacing cx,s with ĉx,s , is then
used to compute the privacy-preserving features z(r) = Fx(r) for r = 1, . . . , m.
To verify if these new features are indeed privacy preserving, we use linear
regression (as implemented by the LinearRegression class of the Python
package scikit-learn) to learn the model parameters of a linear model to
(r)
predict the private attribute s(r) = x1 (the latitude of the FMI station at
which the r-th measurement has been taken) from the features z(r) . (Note
that you have already trained and validated a linear model within the “ML
Basics” assignment.) We can use the validation error of the trained linear
model as a measure for the privacy protection (larger validation error means
higher privacy protection). Beside the privacy protection, the new features

z(r) should still allow an accurate prediction of the label y (r) of a data point.
To this end, we learn the model parameters of a linear model to predict the
label y (r) solely from the features z(r) .

10 Data and Model Poisoning in FL
Every ML method, including ERM or GTVMin, is to some extent at the mercy
of the data generator. Indeed, the model parameters learnt via ERM (for basic
ML) or via GTVMin (for FL) are determined by the statistical properties
of the training set. We must hope (e.g., via an i.i.d. assumption) that the
data points in the training set truthfully reflect the statistical properties of
the underlying data generation process. However, these data points might
have been intentionally perturbed (poisoned).
In general, it is impossible to perfectly detect if data points have been
poisoned. The perfect detection of those perturbed data points requires
complete knowledge of the underlying probability distribution. However, we
typically do not know this probability distribution but can only estimate it
from (possibly perturbed) data points. We can then use this estimate to
detect perturbed data points via outlier detection techniques.
Instead of trying to identify and remove poisoned data points we can also
try to make GTVMin-based FL systems more robust against data poisoning.
We have already discussed the robustness of GTVMin-based methods in
Section 8.3. The level of robustness crucially depends on the design choices
for the local models, local loss functions and FL network in GTVMin.
Besides data poisoning, FL systems can also be subject to model poisoning
attacks. FL systems are implemented with distributed computers such as a
collection of smartphones (clients) that are inter-connected by wireless links
(see Section 5). For some applications, it is not possible (or not desirable) to
enforce sophisticated authentication techniques that determine which clients
can participate in the FL system. The distributed computations might then
be compromised (poisoned) in order to disturb the training of the local model
at a designated (“target”) client.

10.1 Learning Goals

This chapter discusses the robustness of FL systems against data poisoning


and model poisoning attacks. After completing this chapter, you will

1. be aware of data poisoning and model poisoning attacks on FL systems;

2. know how design choices for GTVMin-based methods affect the vulnerability
of FL systems against these attacks.

10.2 Attack Types

Consider an FL system that implements Algorithm 5.1 over a computer network.
Each computer corresponds to one node of the FL network and implements
the local update in step 4 of Algorithm 5.1. The sharing of model parameters
in step 3 of Algorithm 5.1 is implemented over some communication channel
(e.g., short-range wireless links).
An attack on such an FL system could be carried out in different forms,
depending on the level of control that the attacker has over the implementation
of Algorithm 5.1. If the attacker has some control over the communication
links between the nodes, they can directly manipulate the model parameters
shared between nodes (model poisoning). The attacker might instead only
have access to the local datasets of some (vulnerable) nodes W ⊂ V. They
could then manipulate (poison) the local datasets at these vulnerable nodes to
perturb the corresponding local updates (see step 4 of Algorithm 5.1). The

perturbations at the nodes i′ ∈ W then propagate over the edges of the FL
network (via the sharing of model parameters in step 3 of Algorithm 5.1) and
result in perturbed model parameters at other nodes (whose local datasets
have not been poisoned).
Based on the objective of an attack, we distinguish between the following
attack types:

• Denial-of-Service Attacks. The goal of a denial-of-service attack is
to make the learnt hypothesis h̄(i) , at some target node i, useless in the
sense of having unacceptably large prediction errors. We can detect a
denial-of-service attack by continuously monitoring the performance of
the learnt model parameters w(i) . Denial-of-service attacks on GTVMin-based
FL systems can be launched by manipulating the local datasets at
a few nodes in the FL network. These manipulated (poisoned) local
datasets influence the model parameters at the target node i indirectly
via the sharing of model parameters across the edges of the FL network.
Instead of poisoning local datasets, denial-of-service attacks might
manipulate the sharing (communication) of model parameters across
the edges of the FL network. Such model poisoning is possible when the
sharing of model parameters is carried out over insecure communication
links. Figure 10.1 illustrates, for some (target) node i, the result of a
denial-of-service attack on an FL system.

• Backdoor Attacks. A backdoor attack tries to make a target node i
learn a hypothesis h̃(i) that behaves well on the local dataset D(i) but
highly irregularly for a certain range of feature values. The goal of the
attacker is to exploit this irregular behaviour by preparing a data point
with feature values falling in this range such that the hypothesis h̃(i)
delivers a specific prediction (e.g., a prediction that results in granting
access to a restricted area within a building). Figure 10.1 illustrates, for
some (target) node i, the result of a backdoor attack on an FL system.

• Privacy Attacks (see Chapter 9). The goal of a privacy attack is to
determine private attributes of the data points in the local dataset at
some target node i. To this end, an attacker might try to coerce some
other (vulnerable) node i′ into learning a copy of the model parameters
w(i) at node i. For GTVMin-based methods, this could be achieved by
using a trivial local loss function Li′ (·) (e.g., one that is identically zero)
and having edges between i′ and nodes that are in the same cluster
as node i (see Section 6.3). After the attacker has obtained a copy of
the learnt model parameters ŵ(i) , they can try to probe the resulting
hypothesis in order to infer private attributes of the data points in D(i)
(see Figure 9.2).


Figure 10.1: A local dataset D(i) along with three hypothesis maps learnt via
some GTVMin-based method such as Algorithm 5.1. These three maps are
obtained under different attacks on the FL system: The map ĥ(i) is obtained
when no manipulation is applied (no attack). A backdoor attack aims at
nudging the learnt hypothesis h̃(i) to behave similarly to ĥ(i) when applied to
data points in D(i) . However, it behaves very differently for a certain range
of feature values outside of D(i) . This value range can be interpreted as a
backdoor that is used to trigger a malicious prediction. In contrast to backdoor
attacks, a denial-of-service attack tries to enforce a learnt hypothesis h̄(i) that
delivers poor predictions on the local dataset D(i) .

10.3 Data Poisoning

Consider a GTVMin-based FL system that learns local model parameters
for a local (personalized) model for each local dataset D(i) , i = 1, . . . , n. A data
poisoning attack on such an FL system manipulates (poisons) data points
in some of the local datasets.
The manipulation (poisoning) of data points can consist of adding the
realizations of RVs to the features and label of a data point: We poison a
data point by replacing its features x and label y with x̃ := x + ∆x and
ỹ := y + ∆y.
For classification problems, with discrete label spaces, we distinguish
between the following data poisoning strategies [107]:

• Label Poisoning: The attacker manipulates the labels of data points in


the training set.

• Clean-Label Attack: The attacker leaves the labels untouched and only
manipulates the features of data points in the training set.

From a GTVMin perspective, the effect of a data poisoning attack is
that the original local loss functions Li (·) in GTVMin (3.20) are replaced by
perturbed local loss functions L̃i (·). The extent of the perturbation depends on
the fraction of data points that are poisoned as well as on the loss function
used to measure the prediction errors.
Different choices for the underlying loss function offer different levels of
robustness against poisoning. For example, using the absolute error loss yields
higher robustness against perturbations of the label values of a few data points
compared to the squared error loss. Another class of robust loss functions is
obtained by including a penalty term (as in regularization).
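The following sketch illustrates this robustness difference for the simplest possible model, a constant hypothesis: the minimizer of the squared error loss (the mean of the labels) is dragged away by a few poisoned labels, while the minimizer of the absolute error loss (the median) is barely affected. The data and poisoning values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Clean labels scattered around the true value 1.0.
y = 1.0 + 0.1 * rng.normal(size=100)

# Label poisoning: the attacker replaces a few labels with extreme values.
y_poisoned = y.copy()
y_poisoned[:5] = 1000.0

# For a constant hypothesis, the squared error loss is minimized by the
# mean of the labels, the absolute error loss by their median.
w_sq = y_poisoned.mean()
w_abs = np.median(y_poisoned)

print(f"squared-loss estimate:  {w_sq:.2f}")   # dragged far away from 1.0
print(f"absolute-loss estimate: {w_abs:.2f}")  # stays close to 1.0
```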

10.4 Model Poisoning

A model poisoning attack on Algorithm 5.1 manipulates the model parameter
sharing step such that a target node i receives perturbed model parameters
from its neighbours. Note that Algorithm 5.1 is nothing but a message passing
implementation of plain GD (see Section 4.2). Thus, the effect of a model
poisoning attack on Algorithm 5.1 is that it becomes an instance of perturbed
GD. We can use the analysis of perturbed GD (see Section 4.5) to study the
impact of model poisoning on the learnt model parameters w(i) .
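A minimal sketch of this perspective: GD on a least-squares objective where each update is corrupted by an additive perturbation, mimicking the effect of poisoned model parameters entering the update. The problem instance and the perturbation scale are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Perturbed GD on the least-squares objective f(w) = 0.5 * ||X w - y||^2.
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def gd(perturb_scale, steps=200, lr=0.01):
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        # the perturbation models poisoned model parameters in the update
        w -= lr * (grad + perturb_scale * rng.normal(size=3))
    return w

w_clean = gd(0.0)     # plain GD: recovers w_true
w_poisoned = gd(5.0)  # perturbed GD: only reaches a noisy neighbourhood
print(np.linalg.norm(w_clean - w_true), np.linalg.norm(w_poisoned - w_true))
```

As in the analysis of Section 4.5, the perturbed iterates do not converge to the minimizer but hover around it, at a distance governed by the perturbation level.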

10.5 Assignment

Python Notebook. PoisoningFL_CodingAssignment.ipynb


Data File. Assignment_MLBasicsData.csv

This coding assignment builds on the coding assignments “ML Basics”


(see Section 2.7) and “FL Algorithm” (see Section 5.7). As in these previous
assignments, we consider data points, indexed by r = 1, . . . , m, that represent
weather measurements recorded by the FMI. The r-th data point is
characterized by (normalized) coordinates, a time stamp, and a temperature
measurement.
The overall idea of the assignment is to illustrate two different types of
data poisoning attacks: a denial-of-service attack and a backdoor attack.

10.5.1 Denial-of-Service Attack

Construct an FL network of FMI stations and store it as a [Link]()
object. Implement Algorithm 5.1 to learn, for each node i = 1, . . . , n, the
model parameters of a local linear model. Then, implement a denial-of-service
attack by poisoning the local datasets at increasingly many nodes i′ ≠ 1.
The goal of the attack is to increase the validation error of the learnt model
parameters w(1) (at the target node i = 1) by 20%.
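A simplified single-dataset sketch of such an attack (the FL network and Algorithm 5.1 are omitted here; the synthetic data and the sign-flipping poisoning strategy are illustrative choices, not part of the assignment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)

# Hypothetical stand-in for one local dataset: predict the temperature
# from the hour of the measurement with a linear model.
x = rng.uniform(0.0, 24.0, size=(100, 1))
y = 5.0 + 0.2 * x[:, 0] + 0.5 * rng.normal(size=100)
x_train, y_train = x[:80], y[:80]
x_val, y_val = x[80:], y[80:]

def val_error(y_tr):
    model = LinearRegression().fit(x_train, y_tr)
    return mean_squared_error(y_val, model.predict(x_val))

# Denial-of-service style poisoning: flip the sign of more and more labels
# until the validation error exceeds the clean error by at least 20%.
clean_err = val_error(y_train)
for n_poisoned in range(0, 81, 5):
    y_pois = y_train.copy()
    y_pois[:n_poisoned] = -y_pois[:n_poisoned]
    if val_error(y_pois) >= 1.2 * clean_err:
        break

print(f"clean val MSE: {clean_err:.3f}; labels poisoned: {n_poisoned}")
```

In the actual assignment, the poisoned labels sit at other nodes i′ ≠ 1 and reach the target node only through the parameter sharing of Algorithm 5.1.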

10.5.2 Backdoor Attack

We now use a different collection of features for a data point (= temperature
recording). In particular, we replace the numeric feature representing the hour
of the measurement with 24 new features. These new features are the one-hot
encoding of the hour. For example, if the hour is 0, then x′1 = 1, x′2 = 0, . . ..
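The one-hot encoding step can be sketched as follows (the function name is an illustrative choice):

```python
import numpy as np

def one_hot_hour(hour):
    """Replace the numeric hour feature (0, ..., 23) by 24 one-hot features."""
    x = np.zeros(24)
    x[int(hour)] = 1.0
    return x

# For example, hour 0 yields x'_1 = 1 and x'_2 = ... = x'_24 = 0.
print(one_hot_hour(0))
```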

Glossary
k-means The k-means algorithm is a hard clustering method which assigns
each data point of a dataset to precisely one of k different clusters.
The method alternates between updating the cluster assignments (to
the cluster with nearest cluster mean) and, given the updated cluster
assignments, re-calculating the cluster means [6, Ch. 8].. 5, 6, 17, 23

accuracy Consider data points characterized by features x ∈ X and a
categorical label y which takes on values from a finite label space Y.
The accuracy of a hypothesis h : X → Y, when applied to the data
points in a dataset D = {(x(1), y (1)), . . . , (x(m), y (m))}, is then defined
as 1 − (1/m) ∑_{r=1}^{m} L((x(r), y (r)), h), using the 0/1 loss.. 13

activation function Each artificial neuron within an ANN is assigned an
activation function g(·) that maps a weighted combination of the neuron
inputs x1 , . . . , xd to a single output value a = g(w1 x1 + . . . + wd xd ).
Note that each neuron is parameterized by the weights w1 , . . . , wd .. 15

artificial intelligence Artificial intelligence refers to systems that behave
rationally in the sense of maximizing a long-term reward. The ML-based
approach to AI is to train a model that allows the prediction of optimal actions
for a given observed state of the environment. What sets AI applications
apart from more basic ML applications is the choice of loss function.
AI systems rarely have access to a labeled training set that allows
measuring the average loss for a given choice of model parameters. Rather,
AI systems typically use a loss function that can only be estimated from
observed reward signals.. 2, 9

artificial neural network An artificial neural network is a graphical (signal-


flow) representation of a map from features of a data point at its input
to a prediction for the label as its output.. 1, 11–15, 34, 44, 47

autoencoder An autoencoder is a ML method that jointly learns an encoder
map h(·) ∈ H and a decoder map h∗ (·) ∈ H∗ . It is an instance of ERM
using a loss computed from the reconstruction error x − h∗ (h(x)).. 8, 9

baseline Consider some ML method that delivers a learnt hypothesis (or


trained model) ĥ ∈ H. We evaluate the quality of a trained model by
computing the average loss on a test set. But how do we know that the
resulting test set performance is good (enough)? How can we determine
if the trained model performs close to optimal and there is little point in
investing more resources (for data collection or computation) to improve
it? To this end, it is useful to have a reference (or baseline) level against
which we can compare the performance of the trained model. Such a
reference value might be obtained from human performance, e.g., the
misclassification rate of dermatologists who diagnose cancer from visual
inspection of skin. Another source for a baseline is an existing, but
for some reason unsuitable, ML method. For example, the existing
ML method might be computationally too expensive for the intended
ML application. However, we might still use its test set error as a
baseline. Another, somewhat more principled, approach to constructing
a baseline is via a probabilistic model. For a wide range of probabilistic
models p(x, y) we can precisely determine the minimum achievable risk

among any hypothesis (not even required to belong to the hypothesis
space H) [30]. This minimum achievable risk (referred to as the Bayes
risk) is the risk of the Bayes estimator for the label y of a data point,
given its features x. Note that, for a given choice of loss function, the
Bayes estimator (if it exists) is completely determined by the probability
distribution p(x, y) [30, Chapter 4]. However, there are two challenges
to computing the Bayes estimator and the Bayes risk: (i) the probability
distribution p(x, y) is unknown and needs to be estimated, and (ii)
even if we know p(x, y), it might be computationally too expensive to
compute the Bayes risk exactly. A widely used probabilistic model is
the multivariate normal distribution (x, y) ∼ N (µ, Σ) for data points
characterised by numeric features and labels. Here, for the squared
error loss, the Bayes estimator is given by the posterior mean µy|x of
the label y given the features x [30, 108]. The corresponding Bayes risk
is given by the posterior variance σ²y|x (see Figure 10.2).. 5, 11, 13, 14

batch In the context of SGD, a batch refers to a randomly chosen subset
of the overall training set. We use the data points in this subset to
estimate the gradient of the training error and, in turn, to update the model
parameters.. 6, 9, 10, 17

Bayes estimator Consider a probabilistic model with joint probability dis-


tribution p(x, y) for the features x and label y of a data point. For a
given loss function L (·, ·), we refer to a hypothesis h as a Bayes estima-
tor if its risk E{L ((x, y) , h)} is minimum [30]. Note that the property
of a hypothesis being a Bayes estimator depends on the underlying


Figure 10.2: If features and label of a data point are drawn from a multivariate
normal distribution, we can achieve minimum risk (under squared error
loss) by using the Bayes estimator µy|x to predict the label y of a data point
with features x. The corresponding minimum risk is given by the posterior
variance σ²y|x . We can use this quantity as a baseline for the average loss of a
trained model ĥ.

probability distribution and the choice for the loss function L (·, ·).. 3,
4, 11

Bayes risk Consider a probabilistic model with joint probability distribution


p(x, y) for the features x and label y of a data point. The Bayes risk
is the minimum possible risk that can be achieved by any hypothesis
h : X → Y. Any hypothesis that achieves the Bayes risk is referred to
as a Bayes estimator [30].. 3, 14

bias Consider a ML method using a parameterized hypothesis space H. It
learns the model parameters w ∈ Rd using the dataset D = {(x(r), y (r))}_{r=1}^{m} .
To analyze the properties of the ML method, we typically interpret the
data points as realizations of i.i.d. RVs,

y (r) = h(w) (x(r)) + ε(r) ,   r = 1, . . . , m.

We can then interpret the ML method as an estimator ŵ, computed
from D (e.g., by solving ERM). The (squared) bias incurred by the
estimate ŵ is then defined as B² := ∥E{ŵ} − w∥₂² .. 14

classification Classification is the task of determining a discrete-valued label
y of a data point based solely on its features x. The label y belongs
to a finite set, such as y ∈ {−1, 1} or y ∈ {1, . . . , 19}, and represents
a category to which the corresponding data point belongs. Some
classification problems involve a countably infinite label space.. 5, 11,
13, 25, 26

classifier A classifier is a hypothesis (map) h(x) used to predict a label
taking values from a finite label space. We might use the function value
h(x) itself as a prediction ŷ for the label. However, it is customary
to use a map h(·) that delivers a numeric quantity. The prediction is
then obtained by a simple thresholding step. For example, in a binary
classification problem with Y = {−1, 1}, we might use a real-valued
hypothesis map h(x) ∈ R as classifier. A prediction ŷ can then be
obtained via thresholding,

ŷ = 1 for h(x) ≥ 0, and ŷ = −1 otherwise. (10.1)

We can characterize a classifier by its decision regions Ra , for every
possible label value a ∈ Y.. 29, 48

cluster A cluster is a subset of data points that are more similar to each
other than to the data points outside the cluster. The quantitative
measure of similarity between data points is a design choice. If data
points are characterized by Euclidean feature vectors x ∈ Rd , we can
define the similarity between two data points via the Euclidean distance
between their feature vectors.. 1, 5–8, 11, 13, 17, 23, 43

clustered federated learning (CFL) Clustered FL (CFL) assumes that


local datasets form clusters. The local datasets belonging to the same
cluster have similar statistical properties. CFL pools local datasets in
the same cluster to obtain a training set for training a cluster-specific
model. GTVMin implements this pooling implicitly by forcing the local
model parameters to be approximately identical over well-connected
subsets of the FL network.. 1, 2, 5

clustering Clustering methods decompose a given set of data points into


few subsets, which are referred to as clusters. Each cluster consists of
data points that are more similar to each other than to data points
outside the cluster. Different clustering methods use different measures
for the similarity between data points and different forms of cluster
representations. The clustering method k-means uses the average feature
vector (cluster mean) of a cluster as its representative. A popular soft
clustering method based on GMM represents a cluster by a multivariate
normal distribution.. 1, 2, 5, 17, 43

clustering assumption The clustering assumption postulates that data
points in a dataset form a (small) number of groups or clusters. Data
points in the same cluster are more similar to each other than to
those outside the cluster [54]. We obtain different clustering methods
by using different notions of similarity between data points.. 2

computational aspects By computational aspects of a ML method, we
mainly refer to the computational resources required for its implementation.
For example, if a ML method uses iterative optimization
techniques to solve ERM, then its computational aspects include (i) how
many arithmetic operations are needed to implement a single iteration
(gradient step) and (ii) how many iterations are needed to obtain useful
model parameters. One important example of an iterative optimization
technique is GD.. 4, 9, 12, 17, 29, 45

condition number The condition number κ(Q) ≥ 1 of a positive definite


matrix Q ∈ Rd×d is the ratio λmax /λmin between the largest λmax and
the smallest λmin eigenvalue of Q. The condition number is useful for
the analysis of ML methods. The computational complexity of gradient-
based methods for linear regression crucially depends on the condition
number of the matrix Q = XXT , with the feature matrix X of the
training set. Thus, from a computational perspective, we prefer features
of data points such that Q has a condition number close to 1.. 4

connected graph An undirected graph G = (V, E) is connected if it does not
contain a (non-empty) subset V ′ ⊂ V with no edges leaving V ′ .. 7

convex A subset C ⊆ Rd of the Euclidean space Rd is referred to as convex


if it contains the line segment between any two points of that set. We

define a function as convex if its epigraph is a convex set [29].. 2, 4, 5,
8–10, 17, 36, 37, 47

Courant–Fischer–Weyl min-max characterization Consider a psd matrix
Q ∈ Rd×d with EVD (or spectral decomposition),

Q = ∑_{j=1}^{d} λj u(j) (u(j))T .

Here, we used the eigenvalues ordered in increasing fashion,

λ1 ≤ . . . ≤ λd .

The Courant–Fischer–Weyl min-max characterization [3, Thm. 8.1.2.]
amounts to representing the eigenvalues as solutions of optimization
problems.. 5–7, 18, 19

covariance matrix The covariance matrix of a RV x ∈ Rd is defined as
E{(x − E{x})(x − E{x})T }.. 11, 17, 18, 32

data See dataset.. 1–3, 5, 8, 9, 15–17

data augmentation Data augmentation methods add synthetic data points


to an existing set of data points. These synthetic data points are
obtained by perturbations (e.g., adding noise to physical measurements)
or transformations (e.g., rotations of images) of the original data points.
These perturbations and transformations are such that the resulting
synthetic data points should still have the same label. As a case in
point, a rotated cat image is still a cat image even if their feature vectors
(obtained by stacking pixel color intensities) are very different. Data
augmentation can be an efficient form of regularization.. 15–17, 39, 40

data minimization principle European data protection regulation includes
a data minimization principle. This principle requires a data controller
to limit the collection of personal information to what is directly relevant
and necessary to accomplish a specified purpose. The data should be
retained only for as long as necessary to fulfill that purpose [82, Article
5(1)(c)].. 19

data point A data point is any object that conveys information [109]. Data
points might be students, radio signals, trees, forests, images, RVs, real
numbers or proteins. We characterize data points using two types of
properties. One type of property is referred to as a feature. Features
are properties of a data point that can be measured or computed in
an automated fashion. A different kind of property is referred to as
labels. The label of a data point represents some higher-level fact (or
quantity of interest). In contrast to features, determining the label of a
data point typically requires human experts (domain experts). Roughly
speaking, ML aims to predict the label of a data point based solely on
its features.. 1–20, 23–31, 34–36, 38–48

data poisoning Data poisoning refers to the intentional manipulation (or


fabrication) of data points to steer the training of a ML model [110,111].
The protection against data poisoning is particularly important in
distributed ML applications where datasets are de-centralized.. 2, 3

dataset With a slight abuse of language we use the term dataset or set of
data points to refer to an indexed list of data points z(1) , z(2) , . . .. Thus,
there is a first data point z(1) , a second data point z(2) and so on. Strictly

speaking, a dataset is a list and not a set [112]. We need to keep track
of the order of data points in order to cope with several data points
having the same features and labels. Database theory studies formal
languages for defining, structuring, and reasoning about datasets..
1–17, 19, 24, 28–31, 35, 45

decision boundary Consider a hypothesis map h that reads in a feature


vector x ∈ Rd and delivers a value from a finite set Y. The decision
boundary of h is the set of vectors x ∈ Rd that lie between different
decision regions. More precisely, a vector x belongs to the decision
boundary if and only if each neighbourhood {x′ : ∥x − x′ ∥ ≤ ε}, for any
ε > 0, contains at least two vectors with different function values.. 25

decision region Consider a hypothesis map h that delivers values from a


finite set Y. For each label value (category) a ∈ Y, the hypothesis h
determines a subset of feature values x ∈ X that result in the same
output h(x) = a. We refer to this subset as a decision region of the
hypothesis h.. 4, 5, 10, 15, 29

decision tree A decision tree is a flow-chart-like representation of a hypoth-


esis map h. More formally, a decision tree is a directed graph containing
a root node that reads in the feature vector x of a data point. The root
node then forwards the data point to one of its children nodes based
on some elementary test on the features x. If the receiving children
node is not a leaf node, i.e., it has itself children nodes, it represents
another test. Based on the test result, the data point is further pushed
to one of its descendants. This testing and forwarding of the data point

is continued until the data point ends up in a leaf node (having no
children nodes). . 3, 6, 13, 14

deep net A deep net is a ANN with a (relatively) large number of hidden
layers. Deep learning is an umbrella term for ML methods that use a
deep net as their model [71].. 14, 15, 26

degree of belonging A number that indicates the extent by which a data


point belongs to a cluster [6, Ch. 8]. The degree of belonging can be
interpreted as a soft cluster assignment. Soft clustering methods can
encode the degree of belonging by a real number in the interval [0, 1].
Hard clustering is obtained as the extreme case when the degree of
belonging only takes on values 0 or 1.. 43

device Any physical system that can be used to store and process data.
In the context of ML, we typically mean a computer that is able to read
in data points from different sources and, in turn, to train a ML model
using these data points.. 3, 14, 16

differentiable A real-valued function f : Rd → R is differentiable if
it can, at any point, be approximated locally by a linear function. The
local linear approximation at the point x is determined by the gradient
∇f (x) [2].. 2, 3, 8, 11, 19–21, 42

differential privacy Consider some ML method A that reads in a dataset


(e.g., the training set used for ERM) and delivers some output A(D).
The output could be either the learnt model parameters or the predic-
tions for specific data points. Differential privacy is a precise measure of

privacy leakage incurred by revealing the output. Roughly speaking, a
ML method is differentially private if the probability distribution of the
output A(D) does not change too much if the sensitive attribute of one
data point in the training set is changed. Note that differential privacy
builds on a probabilistic model for a ML method, i.e., we interpret its
output A(D) as the realization of a RV. The randomness in the output
can be ensured by intentionally adding the realization of an auxiliary
RV (noise) to the output of the ML method.. 6–12, 35

discrepancy Consider a FL application with networked data represented


by a FL network. FL methods use a discrepancy measure to compare
hypothesis maps from local models at nodes i, i′ connected by an edge
in the FL network.. 1–4, 6–10, 12–14

edge weight Each edge {i, i′ } of a FL network is assigned a non-negative


edge weight Ai,i′ ≥ 0. A zero edge weight Ai,i′ = 0 indicates the absence
of an edge between nodes i, i′ ∈ V.. 1–4, 6–13, 15, 16, 26

effective dimension The effective dimension deff (H) of an infinite hypoth-


esis space H is a measure of its size. Loosely speaking, the effective
dimension is equal to the effective number of independent tunable pa-
rameters of the model. These parameters might be the coefficients used
in a linear map or the weights and bias terms of an ANN.. 13, 14, 38

eigenvalue We refer to a number λ ∈ R as eigenvalue of a square matrix


A ∈ Rd×d if there is a non-zero vector x ∈ Rd \ {0} such that Ax = λx..
3–15, 17–19

eigenvalue decomposition The eigenvalue decomposition of a square matrix
A ∈ Rd×d is a factorization of the form

A = VΛV−1 .

The columns of the matrix V = (v(1), . . . , v(d)) are the eigenvectors of
the matrix A. The diagonal matrix Λ = diag(λ1 , . . . , λd ) contains the
eigenvalues λj corresponding to the eigenvectors v(j) . Note that the
above decomposition exists only if the matrix A is diagonalizable.. 4, 5,
8, 9

eigenvector An eigenvector of a matrix A ∈ Rd×d is a non-zero vector


x ∈ Rd \ {0} such that Ax = λx with some eigenvalue λ.. 4, 6, 7, 13,
17

empirical risk The empirical risk L̂(h|D) of a hypothesis h on a dataset D is
the average loss incurred by h when applied to the data points in D.. 3,
13, 14, 33, 44, 45

empirical risk minimization Empirical risk minimization is the optimization
problem of finding a hypothesis (out of a model) with minimum
average loss (or empirical risk) on a given dataset D (the training set).
Many ML methods are obtained from empirical risk minimization via specific
design choices for the dataset, model and loss [6, Ch. 3].. 1–8, 11–17, 28, 32,
33, 38, 39, 43–46

estimation error Consider data points with feature vectors x and label y. In
some applications we can model the relation between the features and the
label of a data point as y = h̄(x) + ε. Here, we used some true underlying
hypothesis h̄ and a noise term ε which summarizes any modelling or
labelling errors. The estimation error incurred by a ML method that
learns a hypothesis ĥ, e.g., using ERM, is defined as ĥ(x) − h̄(x), for
some feature vector. For a parameterized hypothesis space, consisting
of hypothesis maps that are determined by model parameters w, we
can define the estimation error as ∆w = ŵ − w [68, 113].. 8–10, 12, 13

Euclidean space The Euclidean space Rd of dimension d ∈ N consists of
vectors x = (x1 , . . . , xd )T , with d real-valued entries x1 , . . . , xd ∈ R. Such
a Euclidean space is equipped with a geometric structure defined by the
inner product xT x′ = ∑_{j=1}^{d} xj x′j between any two vectors x, x′ ∈ Rd [2]..
4, 7, 16, 29, 38

expert ML aims to learn a hypothesis h that accurately predicts the label


of a data point based on its features. We measure the prediction error
using some loss function. Ideally we want to find a hypothesis that
incurs minimum loss on any data point. We can make this informal
goal precise via the i.i.d. assumption and using the Bayes risk as the
baseline for the (average) loss of a hypothesis. An alternative approach
to obtain a baseline is to use the hypothesis h′ learnt by an existing
ML method. We refer to this hypothesis h′ as an expert [114]. Regret
minimization methods learn a hypothesis that incurs a loss comparable
to the best expert [114, 115].. 13

explainability We define the (subjective) explainability of a ML method


as the level of simulatability [88] of the predictions delivered by a ML
system to a human user. Quantitative measures for the (subjective)

explainability of a trained model can be constructed by comparing its
predictions with the predictions provided by a user on a test-set [88, 90].
Alternatively, we can use probabilistic models for data and measure
explainability of a trained ML model via the conditional (differential)
entropy of its predictions, given the user predictions [89, 116]. . 1, 3, 6,
17–19

feature A feature of a data point is one of its properties that can be mea-
sured or computed easily without the need for human supervision. For
example, if a data point is a digital image (e.g., stored as a jpeg
file), then we could use the red-green-blue intensities of its pixels as
features. Domain-specific synonyms for the term feature are covariate,
explanatory variable, independent variable, input (variable), predictor
(variable) or regressor [117–119]. 1–20, 23, 24, 27–31, 34, 35, 38, 41–43,
47, 48

feature map A map that transforms the original features of a data point
into new features. The so-obtained new features might be preferable over
the original features for several reasons. For example, the arrangement
of data points might become simpler (or more linear) in the new feature
space, allowing to use linear models in the new features. This idea is
a main driver for the development of kernel methods [120]. Moreover,
the hidden layers of a deep net can be interpreted as a trainable feature
map followed by a linear model in the form of the output layer. Another
reason for learning a feature map could be that learning a small number
of new features helps to avoid overfitting and ensure interpretability [121].
The special case of a feature map delivering two numeric features is
particularly useful for data visualization. Indeed, we can depict data
points in a scatterplot by using two features as the coordinates of a
data point. 14, 15

feature matrix Consider a dataset D with m data points with feature
vectors x(1), . . . , x(m) ∈ Rd. It is convenient to collect the individual
feature vectors into a feature matrix X := (x(1), . . . , x(m))T of size
m × d. 3, 7, 8, 10, 13, 16, 19
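A minimal numpy sketch of this construction (illustrative numbers):

```python
import numpy as np

# m = 3 data points, each with a d = 2 dimensional feature vector.
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

# Stack the feature vectors as rows: X has size m x d.
X = np.stack([x1, x2, x3])
print(X.shape)  # (3, 2)
```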

feature space The feature space of a given ML application or method is


constituted by all potential values that the feature vector of a data
point can take on. A widely used choice for the feature space is the
Euclidean space Rd with dimension d being the number of individual
features of a data point.. 12, 15, 23–25
feature vector A vector x = (x1, . . . , xd)T whose entries are individual
features x1, . . . , xd. Many ML methods use feature vectors that belong
to some finite-dimensional Euclidean space Rd. However, for some ML
methods it is more convenient to work with feature vectors that belong
to an infinite-dimensional vector space (see, e.g., kernel method). 8,
12, 13, 24, 25, 39

federated averaging (FedAvg) Federated averaging is an iterative FL al-
gorithm that alternates between local model training and averaging of the
resulting local model parameters. Different variants of this algorithm
are obtained by different techniques for the model training. The authors
of [14] consider federated averaging methods where the local model
training is implemented by running several GD steps. 2, 14
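The alternation between local training and averaging can be sketched as follows (an illustrative toy setup with synthetic local linear regression datasets; the number of clients, step size and number of local GD steps are arbitrary choices, not taken from [14]):

```python
import numpy as np

# Hypothetical local datasets: each client holds (X, y) for linear regression.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    y = X @ w_true + 0.1 * rng.normal(size=20)
    clients.append((X, y))

w = np.zeros(2)              # shared model parameters
eta, local_steps = 0.05, 5   # illustrative step size and number of local GD steps

for _ in range(50):          # FedAvg rounds
    local_params = []
    for X, y in clients:
        w_i = w.copy()
        for _ in range(local_steps):
            # local GD step on the local empirical risk (average squared error)
            grad = 2 * X.T @ (X @ w_i - y) / len(y)
            w_i -= eta * grad
        local_params.append(w_i)
    w = np.mean(local_params, axis=0)   # average the local model parameters

print(w)  # close to w_true = [1, -2]
```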

federated learning (FL) Federated learning is an umbrella term for ML


methods that train models in a collaborative fashion using decentralized
data and computation.. 1–19, 23, 33, 35, 47

federated learning (FL) network A FL network is an undirected
weighted graph whose nodes represent data generators that aim to train
a local (or personalized) model. Each node in a FL network
represents some device, capable of collecting a local dataset and, in turn,
training a local model. FL methods learn a local hypothesis h(i), for each
node i ∈ V, such that it incurs a small loss on the local datasets. 1–18,
21, 30, 33

Finnish Meteorological Institute The Finnish Meteorological Institute


is a government agency responsible for gathering and reporting weather
data in Finland.. 2, 7–9, 13, 14, 16–19, 21, 42

Gaussian mixture model A Gaussian mixture model (GMM) is a particular
type of probabilistic model for a numeric vector x (e.g., the features of
a data point). Within a GMM, the vector x is drawn from a randomly
selected multivariate normal distribution p(c) = N(µ(c), C(c)), with
c = I. The index I ∈ {1, . . . , k} is a RV with probabilities p(I = c) = pc.
Note that a GMM is parameterized by the probability pc, the mean
vector µ(c) and the covariance matrix C(c) for each c = 1, . . . , k. GMMs
are widely used for clustering, density estimation and as a generative
model. 5, 6, 14, 17, 43
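Sampling from a GMM follows the definition directly: first draw the index I, then draw x from the selected component (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# A GMM with k = 2 components in R^2 (illustrative parameters).
p = [0.3, 0.7]                                      # component probabilities p_c
mu = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]   # mean vectors mu^(c)
C = [np.eye(2), 0.5 * np.eye(2)]                    # covariance matrices C^(c)

def sample_gmm(m):
    """Draw m i.i.d. samples from the GMM."""
    samples = []
    for _ in range(m):
        c = rng.choice(2, p=p)   # draw the component index I
        samples.append(rng.multivariate_normal(mu[c], C[c]))
    return np.array(samples)

x = sample_gmm(1000)
print(x.shape)  # (1000, 2)
```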

Gaussian random variable A standard Gaussian RV is a real-valued ran-
dom variable x with probability density function (pdf) [5, 108, 122]

    p(x) = (1/√(2π)) exp(−x²/2).

Given a standard Gaussian RV x, we can construct a general Gaussian
RV x′ with mean µ and variance σ² via x′ := σx + µ. The probability
distribution of a Gaussian RV is referred to as normal distribution,
denoted N(µ, σ²).
A Gaussian random vector x ∈ Rd with covariance matrix C and mean
µ can be constructed via x := Az + µ. Here, A is any matrix that
satisfies AAT = C and z := (z1, . . . , zd)T is a vector whose entries are
i.i.d. standard Gaussian RVs z1, . . . , zd. Gaussian random processes
generalize Gaussian random vectors by applying linear transformations
to infinite sequences of standard Gaussian RVs [?].
Gaussian RVs are widely used probabilistic models for the statistical
analysis of machine learning methods. Their significance arises partly
from the central limit theorem which states that the sum of many
independent RVs (not necessarily Gaussian themselves) tends to a
Gaussian RV [?]. 39
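The construction x := Az + µ can be checked numerically; the Cholesky factor of C is one valid choice for A (illustrative µ and C):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# Any matrix A with A A^T = C works; the Cholesky factor is one choice.
A = np.linalg.cholesky(C)
assert np.allclose(A @ A.T, C)

# z has i.i.d. standard Gaussian entries; x = A z + mu has mean mu, covariance C.
z = rng.standard_normal((2, 100000))
x = A @ z + mu[:, None]

print(np.cov(x))  # approximately C
```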

General Data Protection Regulation The General Data Protection Reg-


ulation (GDPR) was enacted by the European Union (EU), effective
from May 25, 2018 [82]. It safeguards the privacy and data rights of
individuals in the EU. The GDPR has significant implications for how
data is collected, stored, and used in ML applications. Key provisions
include:

• Data minimization principle: ML systems should use only the
necessary amount of personal data for their purpose.

• Transparency and Explainability: ML systems should enable their


users to understand how they make decisions that impact them.

• Data Subject Rights: Including the rights to access, rectify, and


delete personal data, as well as to object to automated decision-
making and profiling.

• Accountability: Organizations must ensure robust data security


and demonstrate compliance through documentation and regular
audits.

. 4, 6

generalized total variation Generalized total variation measures the changes


of vector-valued node attributes over a weighted undirected graph.. 1,
7, 9, 10, 12, 13, 16, 23, 45

gradient For a real-valued function f : Rd → R : w ↦ f(w), a vector g such
that

    lim_{w→w′} ( f(w) − [ f(w′) + gT(w − w′) ] ) / ∥w − w′∥ = 0

is referred to as the gradient of f at w′. If such a vector exists, it is
denoted ∇f(w′) or ∇f(w)|w′ [2].
1–5, 8, 10, 11, 14, 20, 22, 42–44, 47

gradient descent (GD) Gradient descent is an iterative method for find-
ing the minimum of a differentiable function f(w) of a vector-valued
argument w ∈ Rd. Consider a current guess or approximation w(k)
for the minimum. We would like to find a new (better) vector w(k+1) that
has a smaller objective value f(w(k+1)) < f(w(k)) than the current guess
w(k). We can typically achieve this by using a gradient step

    w(k+1) = w(k) − η∇f(w(k))                         (10.2)

with a sufficiently small step size η > 0. Figure 10.3 illustrates the effect
of a single GD step (10.2). 1, 2, 5–8, 11–14, 17, 20, 22, 28, 36, 43, 44

Figure 10.3: A single gradient step (10.2) towards the minimizer of f(w).
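The gradient step (10.2) can be sketched in a few lines (illustrative objective f(w) = ∥w∥²₂ with gradient 2w):

```python
import numpy as np

def grad_f(w):
    """Gradient of f(w) = ||w||_2^2."""
    return 2 * w

w = np.array([4.0, -2.0])   # initial guess w^(0)
eta = 0.1                   # step size

for _ in range(100):        # repeat the gradient step (10.2)
    w = w - eta * grad_f(w)

print(w)  # approaches the minimizer [0, 0]
```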

gradient step Given a differentiable real-valued function f(·) : Rd → R and
a vector w ∈ Rd, the gradient step updates w by adding the scaled
negative gradient −η∇f(w) to obtain the new vector

    ŵ := w − η∇f(w).                                  (10.3)

Mathematically, the gradient step is a (typically non-linear) operator
T(f,η) that is parametrized by the function f and the step size η. Note
that the gradient step (10.3) optimizes locally (confined to a neigh-
bourhood defined by the step size η) a linear approximation to the
function f(·). A natural generalization of (10.3) is to locally optimize

Figure 10.4: The basic gradient step (10.3) maps a given vector w to the
updated vector ŵ. It defines an operator T(f,η)(·) : Rd → Rd : w ↦ ŵ.

the function itself (instead of its linear approximation),

    ŵ = argmin_{w′∈Rd} f(w′) + (1/η)∥w − w′∥₂².       (10.4)

We intentionally use the same symbol η for the parameter in (10.4) as
we used for the step size in (10.3). The larger we choose η in (10.4), the
more progress the update will make towards reducing the function value
f(ŵ). Note that, much like the gradient step (10.3), also the update
(10.4) defines a (typically non-linear) operator that is parameterized by
the function f(·) and the parameter η. For convex f(·), this operator is
known as the proximal operator of f(·) [38]. 1–14, 16, 18, 36

gradient-based method Gradient-based methods are iterative techniques


for finding the minimum (or maximum) of a differentiable objective
function of the model parameters. These methods construct a sequence
of approximations to an optimal choice for model parameters that

Figure 10.5: A generalized gradient step updates a vector w by minimizing
a penalized version of the function f(·). The penalty term is the squared
Euclidean distance from the vector w.

results in a minimum (or maximum) value of the objective function. As


their name indicates, gradient-based methods use the gradients of the
objective function evaluated during previous iterations to construct new
(hopefully) improved model parameters. One important example for a
gradient-based method is GD.. 1–9, 11–14, 16

graph A graph G = (V, E) is a pair that consists of a node set V and an


edge set E. In its most general form, a graph is specified by a map that
assigns to each edge e ∈ E a pair of nodes [123]. One important family
of graphs are simple undirected graphs. A simple undirected graph is
obtained by identifying each edge e ∈ E with two different nodes {i, i′ }.
Weighted graphs also specify numeric weights Ae for each edge e ∈ E..
2, 4, 7, 9, 10, 17, 19, 26, 33

GTV minimization GTV minimization is an instance of RERM using the
GTV of local model parameters as a regularizer [33]. 1–15, 17, 18

hard clustering Hard clustering refers to the task of partitioning a given


set of data points into (few) non-overlapping clusters. The most widely
used hard clustering method is k-means.. 11
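A minimal numpy sketch of k-means, alternating between cluster assignment and center updates (synthetic data and a simple deterministic initialization, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated point clouds in R^2 (illustrative data).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

k = 2
centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # simple deterministic initialization

for _ in range(10):
    # Assignment step: each data point joins the cluster with the nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: recompute each cluster center as the mean of its data points.
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])

print(centers)  # roughly [[0, 0], [5, 5]]
```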

horizontal FL Horizontal FL uses local datasets that are constituted by dif-


ferent data points but using the same features to characterize them [53].
For example, weather forecasting uses a network of spatially distributed
weather (observation) stations. Each weather station measures the
same quantities such as daily temperature, air pressure and precipita-
tion. However, different weather stations measure the characteristics or
features of different spatio-temporal regions (each such region being a
separate data point).. 1, 2, 9–11

hypothesis A map (or function) h : X → Y from the feature space X to the


label space Y. Given a data point with features x, we use a hypothesis
map h to estimate (or approximate) the label y using the prediction
ŷ = h(x). ML is all about learning (or finding) a hypothesis map h
such that y ≈ h(x) for any data point (having features x and label y)..
1–8, 10–19, 27–34, 38, 40–43, 45, 46

hypothesis space Every practical ML method uses a hypothesis space (or


model) H. The hypothesis space of a ML method is a subset of all
possible maps from the feature space to label space. The design choice
of the hypothesis space should take into account available computational
resources and statistical aspects. If the computational infrastructure

allows for efficient matrix operations, and there is an (approximately)
linear relation between features and label, a useful choice for the hy-
pothesis space might be the linear model. 1, 3, 4, 12–16, 29–31, 34, 40,
45–47

i.i.d. It can be useful to interpret data points z(1), . . . , z(m) as realizations of
independent and identically distributed RVs with a common probability
distribution. If these RVs are continuous-valued, their joint pdf is
p(z(1), . . . , z(m)) = ∏_{r=1}^{m} p(z(r)), with p(z) being the common marginal
pdf of the underlying RVs. 3–5, 7, 8, 10, 13–18, 24, 27, 31, 36, 41, 43

i.i.d. assumption The i.i.d. assumption interprets data points of a dataset


as the realizations of i.i.d. RVs.. 1, 3, 7, 8, 12–15, 41

interpretability A ML method is interpretable for a specific user if they


can well anticipate the predictions delivered by the method. The notion
of interpretability can be made precise using quantitative measures of
the uncertainty about the predictions [89].. 29

kernel Consider data points characterized by a feature vector x ∈ X with a
generic feature space X. A (real-valued) kernel K : X × X → R assigns
each pair of feature vectors x, x′ ∈ X a real number K(x, x′). The value
K(x, x′) is often interpreted as a measure for the similarity between x
and x′. Kernel methods use a kernel to transform the feature vector x to
a new feature vector z = K(x, ·). This new feature vector belongs to a
linear feature space X′ which is (in general) different from the original
feature space X. The feature space X′ has a specific mathematical
structure, i.e., it is a reproducing kernel Hilbert space [120, 124]. 15,
25, 34

kernel method A kernel method is a ML method that uses a kernel K
to map the original (raw) feature vector x of a data point to a new
(transformed) feature vector z = K(x, ·) [120, 124]. The motivation
for transforming the feature vectors is that, using a suitable kernel,
the data points have a more pleasant geometry in the transformed
feature space. For example, in a binary classification problem, using
transformed feature vectors z might allow to use linear models, even if
the data points are not linearly separable in the original feature space
(see Figure 10.6). 16, 24

Figure 10.6: Five data points characterized by feature vectors x(r) and
labels y(r) ∈ {◦, □}, for r = 1, . . . , 5. With these feature vectors, there is no
way to separate the two classes by a straight line (representing the decision
boundary of a linear classifier). In contrast, the transformed feature vectors
z(r) = K(x(r), ·) allow to separate the data points using a linear classifier.


label A higher-level fact or quantity of interest associated with a data point.


For example, if the data point is an image, the label could indicate

whether the image contains a cat or not. Synonyms for label, commonly
used in specific domains, include response variable, output variable, and
target [117–119].. 1–18, 20, 23–31, 34, 35, 38, 41–43, 45, 48

label space Consider a ML application that involves data points charac-


terized by features and labels. The label space is constituted by all
potential values that the label of a data point can take on. Regres-
sion methods, aiming at predicting numeric labels, often use the label
space Y = R. Binary classification methods use a label space that
consists of two different elements, e.g., Y = {−1, 1}, Y = {0, 1} or
Y = {“cat image”, “no cat image”}. 1, 5, 11, 13, 15, 23, 28, 29

Laplacian matrix The structure of a graph G, with nodes i = 1, . . . , n, can
be analyzed using the properties of special matrices that are associated
with G. One such matrix is the graph Laplacian matrix L(G) ∈ Rn×n
which is defined for an undirected and weighted graph [125, 126]. It is
defined element-wise as (see Fig. 10.7)

               ⎧ −A_{i,i′}            for i ̸= i′, {i, i′} ∈ E,
    L(G)_{i,i′} := ⎨ ∑_{i″̸=i} A_{i,i″}   for i = i′,              (10.5)
               ⎩ 0                    else.

Here, A_{i,i′} denotes the edge weight of an edge {i, i′} ∈ E. 2–6, 10,
16–18
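The element-wise definition (10.5) amounts to L(G) = D − A with the degree matrix D; a short numpy sketch, checked on the three-node graph of Fig. 10.7:

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A of a weighted undirected graph with
    symmetric weight matrix A (zero diagonal), following (10.5)."""
    return np.diag(A.sum(axis=1)) - A

# The three-node graph of Fig. 10.7: edges {1,2} and {1,3} with unit weights.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

print(laplacian(A))  # matches the matrix shown in Fig. 10.7
```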

Large Language Model Large Language Model (LLM) is an umbrella


term for ML methods that process and generate human-like text. These
methods typically use deep nets with billions (or even trillions) of

Figure 10.7: Left: Some undirected graph G with three nodes i = 1, 2, 3.
Right: Laplacian matrix L(G) ∈ R3×3 of G,

           ⎛  2  −1  −1 ⎞
    L(G) = ⎜ −1   1   0 ⎟ .
           ⎝ −1   0   1 ⎠

parameters. A widely used choice for the network architecture is referred


to as Transformers [?]. The training of LLMs is often based on the
task of predicting a few words that are intentionally removed from a
large text corpus. Thus, we can construct labelled data points simply
by selecting some words of a text as labels and the remaining words as
features of data points. This construction requires very little human
supervision and allows for generating sufficiently large training sets for
LLMs.. 3

law of large numbers The law of large numbers refers to the convergence
of the average of an increasing (large) number of i.i.d. RVs to the
mean of their common probability distribution. Different instances
of the law of large numbers are obtained using different notions of
convergence [122].. 14, 15

learning rate Consider an iterative method for finding or learning a useful


hypothesis h ∈ H. Such an iterative method repeats similar computa-
tional (update) steps that adjust or modify the current hypothesis to
obtain an improved hypothesis. A prime example of such an iterative

learning method is GD and its variants such as SGD or projected GD.
We refer by learning rate to a parameter of an iterative learning method
that controls the extent by which the current hypothesis can be modified
during a single iteration. A prime example of such a parameter is the
step size used in GD [6, Ch. 5].. 1, 2, 4–11, 13, 16, 17, 43

learning task Consider a dataset D constituted by several data points, each


of them characterized by features x. For example, the dataset D might
be constituted by the images of a particular database. Sometimes it
might be useful to represent a dataset D, along with the choice of
features, by a probability distribution p(x). A learning task associated
with D consists of a specific choice for the label of a data point and
the corresponding label space. Given a choice for the loss function and
model, a learning task gives rise to an instance of ERM. Thus, we could
define a learning task also via an instance of ERM, i.e., via an objective
function. Note that, for the same dataset, we obtain different learning
tasks by using different choices for the features and label of a data point.
These learning tasks are related, as they are based on the same dataset,
and solving them jointly (via multitask learning methods) is typically
preferable over solving them separately [127–129].. 4, 15

least absolute shrinkage and selection operator (Lasso) The least ab-
solute shrinkage and selection operator (Lasso) is an instance of struc-
tural risk minimization (SRM) to learn the weights w of a linear map
h(x) = wT x based on a training set. The Lasso is obtained from linear
regression by adding the scaled ℓ1 -norm α ∥w∥1 to the average squared

error loss incurred on the training set.. 40

linear classifier Consider data points characterized by numeric features x ∈
Rd and a label y ∈ Y from some finite label space Y. A linear classifier
is characterized by having decision regions separated by hyperplanes in
the Euclidean space Rd [6, Ch. 2]. 25

linear model Consider data points, each characterized by a numeric feature
vector x ∈ Rd. A linear model is a hypothesis space which consists of
all linear maps,

    H(d) := { h(x) = wT x : w ∈ Rd }.                  (10.6)

Note that (10.6) defines an entire family of hypothesis spaces, which is
parameterized by the number d of features that are linearly combined
to form the prediction h(x). The design choice of d is guided by
computational aspects (smaller d means less computation), statistical
aspects (increasing d might reduce prediction error) and interpretability.
A linear model using few carefully chosen features tends to be considered
more interpretable [121, 130]. 1–3, 6–17, 19, 20, 24, 25, 47

linear regression Linear regression aims to learn a linear hypothesis map


to predict a numeric label based on numeric features of a data point.
The quality of a linear hypothesis map is measured using the average
squared error loss incurred on a set of labeled data points, which we
refer to as the training set.. 2–5, 7–10, 13, 14, 16–19, 28, 38, 39
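A minimal numpy sketch (synthetic training set; np.linalg.lstsq minimizes the average squared error loss in closed form):

```python
import numpy as np

rng = np.random.default_rng(4)

# Training set: m = 100 data points with d = 2 features and a numeric label.
X = rng.normal(size=(100, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Learn the weights of the linear hypothesis map h(x) = w^T x by
# minimizing the average squared error loss on the training set.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to [2, -1]
```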

local dataset The concept of a local dataset is in-between the concept of a


data point and a dataset. A local dataset consists of several individual

data points which are characterized by features and labels. In contrast
to a single dataset used in basic ML methods, a local dataset is also
related to other local datasets via different notions of similarities. These
similarities might arise from probabilistic models or communication
infrastructure and are encoded in the edges of a FL network.. 1–18, 23,
30, 33, 35, 47

local model Consider a collection of local datasets that are assigned to the
nodes of a FL network. A local model H(i) is a hypothesis space assigned
to a node i ∈ V. Different nodes might be assigned different hypothesis
spaces, i.e., in general H(i) ̸= H(i′) for different nodes i, i′ ∈ V. 1–7, 9,
11–14, 16–18

loss ML methods use a loss function L (z, h) to measure the error incurred
by applying a specific hypothesis to a specific data point. With slight
abuse of notation, we use the term loss for both, the loss function L
itself and for its value L (z, h) for a specific data point z and hypothesis
h.. 1–9, 11–17, 31, 33, 39–41, 43, 45, 46, 48

loss function A loss function is a map

    L : X × Y × H → R+ : ((x, y), h) ↦ L((x, y), h)

which assigns to a pair, consisting of a data point (with features x and
label y) and a hypothesis h ∈ H, the non-negative real number
L((x, y), h). The loss value L((x, y), h) quantifies the discrepancy between
the true label y and the prediction h(x). Lower (closer to zero) values
L((x, y), h) indicate a smaller discrepancy between prediction h(x) and
label y. Figure 10.8 depicts a loss function for a given data point, with
features x and label y, as a function of the hypothesis h ∈ H. 1–9, 11,
12, 14, 17, 18, 28, 30, 31, 41

Figure 10.8: Some loss function L((x, y), h) for a fixed data point, with
feature vector x and label y, and varying hypothesis h. ML methods try to
find (learn) a hypothesis that incurs minimum loss.

maximum likelihood Consider data points D = {z(1), . . . , z(m)} that are
interpreted as realizations of i.i.d. RVs with a common probability
distribution p(z; w) which depends on the model parameters w ∈
W ⊆ Rn. Maximum likelihood methods learn model parameters w
by maximizing the probability (density) p(D; w) = ∏_{r=1}^{m} p(z(r); w) of
observing the dataset. Thus, the maximum likelihood
estimator is a solution to the optimization problem maxw∈W p(D; w).
7, 8, 15, 35
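For instance, for i.i.d. samples from N(w, 1) with unknown mean w, maximizing p(D; w) yields the sample mean (a standard textbook example, sketched here with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)

# m i.i.d. samples z^(r) ~ N(w, 1) with unknown mean w (here w = 2).
z = rng.normal(loc=2.0, scale=1.0, size=1000)

# Maximizing prod_r p(z^(r); w) over w is equivalent to minimizing
# sum_r (z^(r) - w)^2, whose solution is the sample mean.
w_ml = z.mean()
print(w_ml)  # close to the true mean 2
```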

mean The expectation E{x} of a numeric RV x.. 27

model In the context of ML methods, the term model typically refers to the
hypothesis space used by a ML method [6, 131].. 1–6, 8, 9, 11–18, 23,

28, 29, 32, 33, 38, 40, 43, 45

model parameters Model parameters are quantities that are used to select a
specific hypothesis map from a model. We can think of model parameters
as a unique identifier for a hypothesis map, similar to how a social
security number identifies a person in Finland.. 1–11, 13–23, 31, 33, 35,
36, 38, 40, 44, 47

multivariate normal distribution The multivariate normal distribution
N(m, C) is an important family of probability distributions for a
continuous RV x ∈ Rd [5, 108, 132]. This family is parameterized by
the mean m and covariance matrix C of x. If the covariance matrix is
invertible, the probability distribution of x is

    p(x) ∝ exp( −(1/2)(x − m)T C−1 (x − m) ).

3, 4, 6, 11, 17, 18

mutual information The mutual information I(x; y) between two RVs x,
y defined on the same probability space is given by [109]

    I(x; y) := E{ log [ p(x, y) / ( p(x)p(y) ) ] }.

It is a measure for how well we can estimate y based solely on x.
A large value of I(x; y) indicates that y can be well predicted solely
from x. This prediction could be obtained by a hypothesis learnt by an
ERM-based ML method. 12–14, 35

neighbourhood The neighbourhood of a node i ∈ V is the subset of nodes


constituted by the neighbours of i.. 4

neighbours The neighbours of a node i ∈ V within a FL network are those
nodes i′ ∈ V \ {i} that are connected (via an edge) to node i.. 6, 7, 10,
16, 32, 33

networked data Networked data consists of local datasets that are related
by some notion of pair-wise similarity. We can represent networked
data using a graph whose nodes carry local datasets and edges encode
pairwise similarities. One example for networked data arises in FL
applications where local datasets are generated by spatially distributed
devices.. 12, 33

node degree The degree d(i) of a node i ∈ V in an undirected graph is the
number of its neighbours, d(i) := |N(i)|. 2–6, 10, 11

objective function An objective function is a map that assigns each value


of an optimization variable, such as the model parameters w of a
hypothesis h(w) , to an objective value f (w). The objective value f (w)
could be the risk or the empirical risk of a hypothesis h(w) .. 1–8, 10–12,
14–18, 22, 39, 43, 47

overfitting Consider a ML method that uses ERM to learn a hypothesis
with minimum empirical risk on a given training set. Such a method
overfits the training set if it learns a hypothesis with small empirical
risk on the training set but significantly larger loss outside the training
set. 13, 15, 38

parameters The parameters of a ML model are tunable (learnable or ad-
justable) quantities that allow to choose between different hypothesis
maps. For example, the linear model H := {h(w) : h(w)(x) = w1 x + w2}
consists of all hypothesis maps h(w)(x) = w1 x + w2 with a particular
choice for the parameters w = (w1, w2)T ∈ R2. Another example of
parameters is the weights assigned to the connections between neurons
of an ANN. 1–18

polynomial regression Polynomial regression aims at learning a polyno-
mial hypothesis map to predict a numeric label based on numeric
features of a data point. For data points characterized by a sin-
gle numeric feature x, polynomial regression uses the hypothesis space
H(poly)_d := {h(x) = ∑_{j=0}^{d−1} wj x^j}. The quality of a polynomial hypothesis
map is measured using the average squared error loss incurred on a set
of labeled data points (which we refer to as the training set). 14
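A minimal numpy sketch (synthetic data with a single feature; np.vander builds the monomial features x^0, . . . , x^(d−1)):

```python
import numpy as np

rng = np.random.default_rng(6)

# Data points with a single numeric feature x and label y = 1 + 2x - x^2 + noise.
x = rng.uniform(-1, 1, size=200)
y = 1 + 2 * x - x ** 2 + 0.05 * rng.normal(size=200)

# Build the feature matrix with columns x^0, ..., x^(d-1) for d = 3 and
# minimize the average squared error loss on the training set (least squares).
d = 3
X = np.vander(x, N=d, increasing=True)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to [1, 2, -1]
```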

positive semi-definite A (real-valued) symmetric matrix Q = QT ∈ Rd×d
is referred to as positive semi-definite (psd) if xT Qx ≥ 0 for every vector
x ∈ Rd. The property of being psd can be extended from matrices to
(real-valued) symmetric kernel maps K : X × X → R (with K(x, x′) =
K(x′, x)) as follows: For any finite set of feature vectors x(1), . . . , x(m),
the resulting matrix Q ∈ Rm×m with entries Qr,r′ = K(x(r), x(r′)) is
psd [120]. 2, 4–8, 10, 11, 13–15

prediction A prediction is an estimate or approximation for some quantity


of interest. ML revolves around learning or finding a hypothesis map h
that reads in the features x of a data point and delivers a prediction
yb := h(x) for its label y.. 1–8, 11, 14, 15, 17, 18, 23, 24, 29, 30, 32, 34,
35, 41, 48

privacy leakage Consider a (ML or FL) system that processes a local dataset
D(i) and shares data, such as the predictions obtained for new data
points, with other parties. Privacy leakage arises if the shared data
carries information about a private (sensitive) feature of a data point
(which might be a human) of D(i) . The amount of privacy leakage can
be measured via MI using a probabilistic model for the local dataset.
Another quantitative measure for privacy leakage is DP.. 2

privacy protection Consider some ML method A that reads in a dataset D and
delivers some output A(D). The output could be the learnt model
parameters ŵ or the prediction ĥ(x) obtained for a specific data point
with features x. Many important ML applications involve data points
representing humans. Each data point is characterized by features x,
potentially a label y and a sensitive attribute s (e.g., a recent medical
diagnosis). Roughly speaking, privacy protection means that it should
be impossible to infer, from the output A(D), any of the sensitive
attributes of data points in D. Mathematically, privacy protection
requires non-invertibility of the map A(D). In general, just making
A(D) non-invertible is typically insufficient for privacy protection. We
need to make A(D) sufficiently non-invertible.. 2, 35

probabilistic model A probabilistic model interprets data points as realiza-


tions of RVs with a joint probability distribution. This joint probability
distribution typically involves parameters which have to be manually
chosen or learnt via statistical inference methods such as maximum
likelihood [30].. 2–4, 6–8, 10–15, 17, 18, 30, 35, 43

probability density function (pdf) The probability density function (pdf)
p(x) of a real-valued RV x ∈ R is a particular representation of its prob-
ability distribution. If the pdf exists, it can be used to compute the
probability that x takes on a value from a (measurable) set B ⊆ R
via p(x ∈ B) = ∫_B p(x′) dx′ [5, Ch. 3]. The pdf of a vector-valued RV
x ∈ Rd (if it exists) allows to compute the probability that x falls into
a (measurable) region R via p(x ∈ R) = ∫_R p(x′) dx′1 · · · dx′d [5, Ch. 3].
18, 24

probability distribution To analyze ML methods it can be useful to inter-


pret data points as i.i.d. realizations of a RV. The typical properties of
such data points are then governed by the probability distribution of
this RV. The probability distribution of a binary RV y ∈ {0, 1} is fully
specified by the probabilities p(y = 0) and p(y = 1) = 1−p(y = 0). The
probability distribution of a real-valued RV x ∈ R might be specified by
a probability density function p(x) such that p(x ∈ [a, b]) ≈ p(a)|b − a|.
In the most general case, a probability distribution is defined by a
probability measure [95, 108].. 1, 3–7, 10–12, 14, 15, 18, 24, 27, 28, 31,
32, 35–37, 41, 43

projected gradient descent (projected GD) Projected GD extends ba-
sic GD for unconstrained optimization to handle constraints on the
optimization variable (model parameters). A single iteration of pro-
jected GD consists of first taking a gradient step and then projecting
the result back into a constraint set. 1, 2, 11–13, 28
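A minimal sketch with an illustrative constraint set, the unit Euclidean ball, for which the projection has a simple closed form:

```python
import numpy as np

def project_ball(w, radius=1.0):
    """Projection onto the Euclidean ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else radius * w / norm

# Minimize f(w) = ||w - c||_2^2 subject to ||w||_2 <= 1 (illustrative choice).
c = np.array([3.0, 0.0])
w = np.zeros(2)
eta = 0.1

for _ in range(100):
    w = w - eta * 2 * (w - c)   # gradient step
    w = project_ball(w)         # projection back into the constraint set

print(w)  # close to [1, 0], the point of the ball closest to c
```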

proximable Convex functions for which the proximal operator can be
computed efficiently are sometimes referred to as proximable or simple
[37]. 11

proximal operator Given a convex function f and a vector w′, we define the
proximal operator as [38, 133]

    prox_{f,ρ}(w′) := argmin_{w ∈ R^d} ( f(w) + (ρ/2)∥w − w′∥₂² )  with ρ > 0.

Convex functions for which the proximal operator can be computed
efficiently are sometimes referred to as proximable or simple [37].. 11,
13, 14, 21, 36
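As a sanity check (toy values of my own), the proximal operator of the one-dimensional ℓ1-norm f(w) = |w| has a well-known closed form, soft thresholding, which we can compare against a brute-force grid minimization of f(w) + (ρ/2)(w − w′)²:

```python
import numpy as np

# Closed-form prox of f(w) = |w| with parameter rho: soft thresholding.
def prox_l1(w_prime, rho):
    return np.sign(w_prime) * np.maximum(np.abs(w_prime) - 1.0 / rho, 0.0)

w_prime, rho = 0.8, 2.0
closed_form = prox_l1(w_prime, rho)          # 0.8 - 1/2 = 0.3

# Brute-force minimization of |w| + (rho/2)*(w - w_prime)^2 on a fine grid.
grid = np.linspace(-2.0, 2.0, 400_001)       # grid spacing 1e-5
objective = np.abs(grid) + (rho / 2) * (grid - w_prime) ** 2
brute_force = grid[np.argmin(objective)]

print(closed_form, brute_force)  # both approximately 0.3
```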

quadratic function A quadratic function f(w), taking a vector w ∈ R^d
as its argument, is of the form

    f(w) = w^T Q w + q^T w + a,

with some matrix Q ∈ R^{d×d}, vector q ∈ R^d and scalar a ∈ R.. 2, 5,
8–10, 17
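A tiny numerical example (values of my own) of evaluating a quadratic function f(w) = w^T Q w + q^T w + a:

```python
import numpy as np

# Evaluate f(w) = w^T Q w + q^T w + a at a given point w.
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
q = np.array([-1.0, 1.0])
a = 3.0

w = np.array([1.0, 2.0])
f = w @ Q @ w + q @ w + a   # quadratic term 6, linear term 1, constant 3
print(f)  # 10.0
```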

Rényi divergence The Rényi divergence measures the (dis-)similarity between
two probability distributions [?].. 7

random variable (RV) A random variable is a mapping from a probability
space P to a value space [95]. The probability space, whose elements
are elementary events, is equipped with a probability measure that
assigns a probability to subsets of P. A binary random variable maps
elementary events to a set containing two different values, e.g., {−1, 1}
or {cat, no cat}. A real-valued random variable maps elementary events
to real numbers R. A vector-valued random variable maps elementary
events to the Euclidean space R^d. Probability theory uses the concept
of measurable spaces to rigorously define and study the properties of
(large) collections of random variables [95, 108].. 3–15, 17, 18, 24, 27,
31, 32, 35, 36, 38, 39, 41, 47

realization Consider a RV x which maps each element (outcome, or elementary
event) ω ∈ P of a probability space P to an element a of a
measurable space N [2, 94, 95]. A realization of x is any element a′ ∈ N
such that there is an element ω′ ∈ P with x(ω′) = a′.. 3, 5–8, 10–18,
24, 35, 36, 39, 41, 43

regression Regression problems revolve around predicting a numeric label
solely from the features of a data point [6, Ch. 2].. 13

regularization A key challenge of modern ML applications is that they
often use large models, having an effective dimension in the order of
billions. Using basic ERM-based methods to train a high-dimensional
model is prone to overfitting: the learnt hypothesis performs well on the
training set but poorly outside the training set. Regularization refers to
modifications of a given instance of ERM in order to avoid overfitting,
i.e., to ensure that the learnt hypothesis performs not much worse outside
the training set. There are three routes for implementing regularization:

• Model pruning. We prune the original model H to obtain a
smaller model H′. For a parametrized model, the pruning can
be implemented by including constraints on the model parameters
(such as w1 ∈ [0.4, 0.6] for the weight of feature x1 in linear
regression).

• Loss penalization. We modify the objective function of ERM
by adding a penalty term to the training error. The penalty term
estimates how much larger the expected loss (risk) is compared to
the average loss on the training set.

• Data augmentation. We can enlarge the training set D by
adding perturbed copies of the original data points in D. One
example for such a perturbation is to add the realization of a RV
to the feature vector of a data point.

Figure 10.9 illustrates the above routes to regularization. These routes
are often equivalent: data augmentation using Gaussian RVs to perturb
the feature vectors in the training set of linear regression has the same
effect as adding the penalty λ ∥w∥₂² to the training error (which is
nothing but ridge regression).

Figure 10.9: Three approaches to regularization: data augmentation, loss
penalization and model pruning (via constraints on model parameters).

. 1, 5, 7, 8, 12, 13, 15, 16, 18, 41, 45
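The claimed equivalence between Gaussian data augmentation and ridge regression can be checked numerically. The following is a rough Monte Carlo sketch with toy data of my own; it assumes the convention that perturbing each of the m feature vectors with noise of variance σ² corresponds, in expectation, to adding the penalty mσ²∥w∥₂² to the sum of squared errors:

```python
import numpy as np

# Toy demo: fitting linear regression on many noise-perturbed copies of the
# training set approximately reproduces the ridge-regression solution.
rng = np.random.default_rng(0)
m, d = 30, 2
X = rng.normal(size=(m, d))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=m)

sigma, K = 0.5, 2000                       # noise level, copies per data point
X_aug = np.vstack([X + sigma * rng.normal(size=(m, d)) for _ in range(K)])
y_aug = np.tile(y, K)
w_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Ridge regression with the matching penalty m*sigma^2*||w||_2^2:
w_ridge = np.linalg.solve(X.T @ X + m * sigma ** 2 * np.eye(d), X.T @ y)
print(np.round(w_aug, 2), np.round(w_ridge, 2))  # nearly identical
```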

regularized empirical risk minimization (RERM) Synonym for SRM..


9, 15, 22

regularizer A regularizer assigns each hypothesis h from a hypothesis space
H a quantitative measure R(h) for how much its prediction error on a
training set might differ from its prediction errors on data points outside
the training set. Ridge regression uses the regularizer R(h) := ∥w∥₂²
for linear hypothesis maps h^(w)(x) := w^T x [6, Ch. 3]. The least
absolute shrinkage and selection operator (Lasso) uses the regularizer
R(h) := ∥w∥₁ for linear hypothesis maps h^(w)(x) := w^T x [6, Ch. 3]..
1, 6, 9, 13, 16, 23

ridge regression Ridge regression learns the weights w of a linear hypothesis
map h^(w)(x) = w^T x. The quality of a particular choice for the
parameter vector w is measured by the sum of two components. The
first component is the average squared error loss incurred by h^(w) on a
set of labeled data points (the training set). The second component is
the scaled squared Euclidean norm α∥w∥₂² with a regularization parameter
α > 0. It can be shown that the effect of adding α∥w∥₂² to the
average squared error loss is equivalent to replacing the original data
points by an ensemble of realizations of a RV centered around these
data points.. 2, 3, 9, 10, 13, 16–19, 39, 40
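A minimal sketch of ridge regression on toy data of my own, using the closed-form solution under the convention that the objective is the average squared error plus α∥w∥₂² (so the minimizer is (X^T X + mαI)⁻¹X^T y):

```python
import numpy as np

# Toy data: linear labels plus a small amount of noise.
rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

alpha = 0.1
# Minimizer of (1/m)*||y - X w||^2 + alpha*||w||^2:
w_hat = np.linalg.solve(X.T @ X + m * alpha * np.eye(d), X.T @ y)

# alpha -> 0 recovers ordinary least squares; larger alpha shrinks w_hat.
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_hat) < np.linalg.norm(w_ls))  # True: shrinkage
```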

risk Consider a hypothesis h used to predict the label y of a data point based
on its features x. We measure the quality of a particular prediction
using a loss function L((x, y), h). If we interpret data points as the
realizations of i.i.d. RVs, the loss L((x, y), h) also becomes the realization
of a RV. The i.i.d. assumption allows us to define the risk of a hypothesis
as the expected loss E{L((x, y), h)}. Note that the risk of h depends
on both the specific choice for the loss function and the probability
distribution of the data points.. 2–4, 7, 11, 33, 39, 46

scatterplot A visualization technique that depicts data points by markers


in a two-dimensional plane. . 4, 16

semi-supervised learning Semi-supervised learning methods use unlabeled


data points to support the learning of a hypothesis from labeled data
points [54]. This approach is particularly useful for ML applications
that offer a large amount of unlabeled data points, but only a limited
number of labeled data points.. 9, 11

Figure 10.10: A scatterplot of data points that represent daily weather
conditions in Finland. Each data point is characterized by its minimum
daytime temperature x as feature and its maximum daytime temperature
y as the label. The temperatures have been measured at the FMI weather
station Helsinki Kaisaniemi during 1.9.2024 - 28.10.2024.

sensitive attribute ML revolves around learning a hypothesis map that
allows us to predict the label of a data point from its features. In some
applications we must ensure that the output delivered by an ML system
does not allow to infer sensitive attributes of a data point. Which parts
of a data point are considered a sensitive attribute is a design choice
that varies across different application domains.. 12, 35

smooth We refer to a real-valued function as smooth if it is differentiable


and its gradient is continuous [47, 134]. A differentiable function f (w)
is referred to as β-smooth if the gradient ∇f (w) is Lipschitz continuous
with Lipschitz constant β, i.e.,

∥∇f (w) − ∇f (w′ )∥ ≤ β∥w − w′ ∥.

. 1, 4, 5, 47
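For a quadratic f(w) = w^T Q w with symmetric Q, the gradient ∇f(w) = 2Qw is Lipschitz with constant β = 2λmax(Q). A quick numerical check of the β-smoothness inequality, using a toy matrix of my own:

```python
import numpy as np

# For f(w) = w^T Q w with symmetric Q, grad f(w) = 2 Q w, and the Lipschitz
# constant of the gradient is beta = 2 * lambda_max(Q).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
beta = 2 * np.linalg.eigvalsh(Q).max()

# Verify ||grad f(w1) - grad f(w2)|| <= beta * ||w1 - w2|| on random pairs.
rng = np.random.default_rng(0)
for _ in range(1000):
    w1, w2 = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(2 * Q @ w1 - 2 * Q @ w2)
    assert lhs <= beta * np.linalg.norm(w1 - w2) + 1e-12

print("beta-smooth with beta =", round(beta, 3))
```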

soft clustering Soft clustering refers to the task of partitioning a given set of
data points into (few) overlapping clusters. Each data point is assigned
to several different clusters with varying degree of belonging. Soft
clustering methods determine the degree of belonging (or soft cluster
assignment) for each data point and each cluster. A principled approach
to soft clustering is to interpret data points as i.i.d. realizations of
a GMM. We then obtain a natural choice for the degree of belonging
as the conditional probability of a data point belonging to a specific
mixture component.. 6, 11
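A minimal sketch (toy numbers of my own) of soft cluster assignments for a two-component one-dimensional GMM: the degree of belonging is the posterior probability of each mixture component given the data point.

```python
import math

# Gaussian pdf, used as the likelihood of each mixture component.
def gauss(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mus, sigmas, weights = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]  # two components

def responsibilities(x):
    # Posterior probability of each component given the data point x.
    joint = [w * gauss(x, mu, s) for w, mu, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

print([round(r, 3) for r in responsibilities(1.0)])  # mostly component 0
print([round(r, 3) for r in responsibilities(2.0)])  # [0.5, 0.5] at the midpoint
```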

squared error loss The squared error loss measures the prediction error of
a hypothesis h when predicting a numeric label y ∈ R from the features
x of a data point. It is defined as

    L((x, y), h) := (y − h(x))²,

with the prediction ŷ := h(x).
. 3, 4, 7, 10, 14, 16–18, 28, 29, 34, 41

statistical aspects By statistical aspects of an ML method, we refer to
(properties of) the probability distribution of its output under a probabilistic
model for the data fed into the method.. 5, 9, 18, 29, 45

step size See learning rate.. 1, 20

stochastic gradient descent Stochastic GD is obtained from GD by replacing
the gradient of the objective function with a stochastic approximation.
A main application of stochastic gradient descent is to
implement ERM for a parametrized model when the training set D is
either very large or not readily available (e.g., when data points are
stored in a database distributed all over the planet). To evaluate the
gradient of the empirical risk (as a function of the model parameters w),
we need to compute the sum ∑_{r=1}^m ∇_w L(z^(r), w) over all data
points in the training set. We obtain a stochastic approximation to the
gradient by replacing the sum ∑_{r=1}^m ∇_w L(z^(r), w) with a sum
∑_{r∈B} ∇_w L(z^(r), w) over a randomly chosen subset B ⊆ {1, . . . , m}
(see Figure 10.11). We often refer to these randomly chosen data points
as a batch. The batch size |B| is an important parameter of stochastic
GD. Stochastic GD with |B| > 1 is referred to as mini-batch stochastic
GD [?]. . 2, 3, 7, 8, 11, 28

Figure 10.11: Stochastic GD for ERM approximates the gradient
∑_{r=1}^m ∇_w L(z^(r), w) by replacing the sum over all data points in the
training set (indexed by r = 1, . . . , m) with a sum over a randomly chosen
subset B ⊆ {1, . . . , m}.
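A minimal sketch of mini-batch stochastic GD on a toy least-squares problem of my own; each iteration replaces the full-gradient sum with a sum over a random batch B:

```python
import numpy as np

# Toy least-squares problem with noise-free linear labels.
rng = np.random.default_rng(0)
m, d = 200, 2
X = rng.normal(size=(m, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true

w = np.zeros(d)
eta, batch_size = 0.05, 10
for _ in range(2000):
    B = rng.choice(m, size=batch_size, replace=False)   # random batch
    grad = 2 * X[B].T @ (X[B] @ w - y[B]) / batch_size  # stochastic gradient
    w -= eta * grad

print(np.round(w, 2))  # approaches w_true = [ 2. -1.]
```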

stopping criterion Many ML methods use iterative algorithms that construct
a sequence of model parameters (such as the weights of a linear
map or the weights of an ANN) that (hopefully) converge to an optimal
choice for the model parameters. In practice, given finite computational
resources, we need to stop iterating after a finite number of iterations. A
stopping criterion is any well-defined condition required for stopping
iterating.. 1, 3, 4, 6, 7, 10, 13, 15, 16

structural risk minimization Structural risk minimization is the problem
of finding the hypothesis that optimally balances the average loss (or
empirical risk) on a training set with a regularization term. The regularization
term penalizes a hypothesis that is not robust against (small)
perturbations of the data points in the training set.. 28, 40

test set A set of data points that have neither been used to train a model,
e.g., via ERM, nor in a validation set to choose between different models..
2

total variation See GTV.. 1, 2, 4, 9, 16, 17

training error The average loss of a hypothesis when predicting the labels of
data points in a training set. We sometimes also use the term training
error for the minimum average loss incurred on the training set by the
optimal hypothesis from a hypothesis space.. 1, 3, 11–15, 17–19, 39, 45, 46

training set A dataset D, constituted by data points used in ERM
to learn a hypothesis ĥ. The average loss of ĥ on the training set is
referred to as the training error. Comparing the training error
with the validation error of ĥ allows us to diagnose the ML method and
informs how to improve it (e.g., using a different hypothesis space
or collecting more data points) [6, Sec. 6.6.].. 1, 3–19, 27–29, 33, 34,
38–41, 43–46

trustworthiness Besides the computational aspects and statistical aspects,
a third main design aspect for ML methods is their trustworthiness [?].
The European Union has put forward seven key requirements (KRs)
for trustworthy AI (which typically builds on ML methods) [73]: KR1 -
Human Agency and Oversight, KR2 - Technical Robustness
and Safety, KR3 - Privacy and Data Governance, KR4 - Transparency,
KR5 - Diversity, Non-Discrimination and Fairness,
KR6 - Societal and Environmental Well-Being, KR7 - Accountability.
. 1

validation Consider a hypothesis ĥ that has been learnt via some ML method,
e.g., by solving ERM on a training set D. Validation refers to the practice
of evaluating the loss incurred by hypothesis ĥ on a validation set that
consists of data points that are not contained in the training set D.. 1,
6, 11

validation error Consider a hypothesis ĥ which is obtained by some ML


method, e.g., using ERM on a training set. The average loss of ĥ on a
validation set, which is different from the training set, is referred to as
the validation error.. 1, 3, 6, 8, 11–15, 17–19, 45, 46

validation set A set of data points used to estimate the risk of a hypothesis
ĥ that has been learnt by some ML method (e.g., by solving ERM). The
average loss of ĥ on the validation set is referred to as the validation
error and can be used to diagnose an ML method (see [6, Sec. 6.6.]).
The comparison between training error and validation error can inform
directions for improvements of the ML method (such as using a different
hypothesis space).. 3, 11, 12, 14, 15, 17, 19, 45, 46

variance The variance of a real-valued RV x is defined as the expectation
of the squared difference between x and its expectation E{x},

    E{(x − E{x})²}.

We extend this definition to vector-valued RVs x as E{∥x − E{x}∥₂²}.. 3, 4, 14

vertical FL Vertical FL uses local datasets that contain the same
data points but characterize them with different features [55]. For
example, different healthcare providers might all store information
about the same population of patients. However, different healthcare
providers collect different measurements (blood values, electrocardiography,
lung X-ray) for the same patients.. 2, 11, 12

weights Consider a parameterized hypothesis space H. We use the term
weights for numeric model parameters that are used to scale features or
their transformations in order to compute h^(w) ∈ H. A linear model uses
weights w = (w1, . . . , wd)^T to compute the linear combination h^(w)(x) =
w^T x. Weights are also used in ANNs to form linear combinations of
features or the outputs of neurons in hidden layers.. 40

zero-gradient condition Consider the unconstrained optimization problem
min_{w∈R^d} f(w) with a smooth and convex objective function f(w). A
necessary and sufficient condition for a vector ŵ ∈ R^d to solve this
problem is that the gradient ∇f(ŵ) is the zero-vector,

    ∇f(ŵ) = 0 ⇔ f(ŵ) = min_{w∈R^d} f(w).

. 12
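For a smooth convex quadratic, the zero-gradient condition can be solved directly. A toy sketch of my own, using f(w) = ∥Aw − b∥₂², where ∇f(ŵ) = 0 yields the normal equations A^T A ŵ = A^T b:

```python
import numpy as np

# Toy problem: f(w) = ||A w - b||^2 with a tall full-rank matrix A.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

w_hat = np.linalg.solve(A.T @ A, A.T @ b)     # solves grad f(w) = 0
grad_at_w_hat = 2 * A.T @ (A @ w_hat - b)     # gradient of f at w_hat

print(np.allclose(grad_at_w_hat, 0.0))  # True
```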

0/1 loss The 0/1 loss L ((x, y) , h) measures the quality of a classifier h(x)
that delivers a prediction ŷ (e.g., via thresholding (10.1)) for the label
y of a data point with features x. It is equal to 0 if the prediction
is correct, i.e., L ((x, y) , h) = 0 when ŷ = y. It is equal to 1 if the
prediction is wrong, L ((x, y) , h) = 1 when ŷ ̸= y.. 1

References
[1] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-
Hill, 1987.

[2] ——, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-
Hill, 1976.

[3] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Bal-
timore, MD: Johns Hopkins University Press, 2013.

[4] G. Golub and C. van Loan, “An analysis of the total least squares
problem,” SIAM J. Numerical Analysis, vol. 17, no. 6, pp. 883–893, Dec.
1980.

[5] D. Bertsekas and J. Tsitsiklis, Introduction to Probability, 2nd ed.


Athena Scientific, 2008.

[6] A. Jung, Machine Learning: The Basics, 1st ed. Springer Singapore,
Feb. 2022.

[7] M. Wollschlaeger, T. Sauter, and J. Jasperneite, “The future of industrial


communication: Automation networks in the era of the internet of things
and industry 4.0,” IEEE Industrial Electronics Magazine, vol. 11, no. 1,
pp. 17–27, 2017.

[8] M. Satyanarayanan, “The emergence of edge computing,” Computer,


vol. 50, no. 1, pp. 30–39, Jan. 2017. [Online]. Available:
[Link]

[9] H. Ates, A. Yetisen, F. Güder, and C. Dincer, “Wearable devices for
the detection of covid-19,” Nature Electronics, vol. 4, no. 1, pp. 13–14,
2021. [Online]. Available: [Link]

[10] H. Boyes, B. Hallaq, J. Cunningham, and T. Watson, “The industrial


internet of things (iiot): An analysis framework,” Computers in
Industry, vol. 101, pp. 1–12, 2018. [Online]. Available: https:
//[Link]/science/article/pii/S0166361517307285

[11] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, Eds., Big Data over Networks.
Cambridge Univ. Press, 2016.

[12] A. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a


network-based approach to human disease,” Nature Reviews Genetics,
vol. 12, no. 56, 2011.

[13] M. E. J. Newman, Networks: An Introduction. Oxford Univ. Press,


2010.

[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y
Arcas, “Communication-Efficient Learning of Deep Networks from
Decentralized Data,” in Proceedings of the 20th International
Conference on Artificial Intelligence and Statistics, ser. Proceedings
of Machine Learning Research, A. Singh and J. Zhu, Eds.,
vol. 54. PMLR, 20–22 Apr 2017, pp. 1273–1282. [Online]. Available:
[Link]

[15] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:
Challenges, methods, and future directions,” IEEE Signal Processing
Magazine, vol. 37, no. 3, pp. 50–60, May 2020.

[16] Y. Cheng, Y. Liu, T. Chen, and Q. Yang, “Federated learning for


privacy-preserving ai,” Communications of the ACM, vol. 63, no. 12,
pp. 33–36, Dec. 2020.

[17] N. Agarwal, A. Suresh, F. Yu, S. Kumar, and H. McMahan, “cpSGD:


Communication-efficient and differentially-private distributed sgd,” in
Proc. Neural Inf. Proc. Syst. (NIPS), 2018.

[18] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated


Multi-Task Learning,” in Advances in Neural Information Processing
Systems, vol. 30, 2017. [Online]. Available: [Link]
cc/paper/2017/file/[Link]

[19] J. You, J. Wu, X. Jin, and M. Chowdhury, “Ship compute


or ship data? why not both?” in 18th USENIX Symposium
on Networked Systems Design and Implementation (NSDI 21).
USENIX Association, April 2021, pp. 633–651. [Online]. Available:
[Link]

[20] D. Tse and P. Viswanath, Fundamentals of Wireless Communication.


Cambridge University Press, 2005.

[21] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong,


D. Ramage, and F. Beaufays, “Applied federated learning: Improving

google keyboard query suggestions,” 2018. [Online]. Available:
[Link]

[22] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient frame-


work for clustered federated learning,” in 34th Conference on Neural
Information Processing Systems (NeurIPS 2020), Vancouver, Canada,
2020.

[23] F. Sattler, K. Müller, and W. Samek, “Clustered federated learning:


Model-agnostic distributed multitask optimization under privacy con-
straints,” IEEE Transactions on Neural Networks and Learning Systems,
2020.

[24] G. Strang, Computational Science and Engineering. Wellesley-


Cambridge Press, MA, 2007.

[25] ——, Introduction to Linear Algebra, 5th ed. Wellesley-Cambridge


Press, MA, 2016.

[26] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone


Operator Theory in Hilbert Spaces. New York: Springer, 2011.

[27] F. Pedregosa, “Scikit-learn: Machine learning in python,” Journal


of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011.
[Online]. Available: [Link]

[28] J. Hirvonen and J. Suomela. (2023) Distributed algorithms 2020.

[29] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK:


Cambridge Univ. Press, 2004.

[30] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed.
New York: Springer, 1998.

[31] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau,


and S. Thrun, “Dermatologist-level classification of skin cancer with
deep neural networks,” Nature, vol. 542, 2017.

[32] H. Lütkepohl, New Introduction to Multiple Time Series Analysis. New


York: Springer, 2005.

[33] Y. SarcheshmehPour, Y. Tian, L. Zhang, and A. Jung, “Clustered


federated learning via generalized total variation minimization,” IEEE
Transactions on Signal Processing, vol. 71, pp. 4240–4256, 2023.

[34] F. Chung, “Spectral graph theory,” in Regional Conference Series in


Mathematics, 1997, no. 92.

[35] D. Spielman, “Spectral and algebraic graph theory,” 2019.

[36] ——, “Spectral graph theory,” in Combinatorial Scientific Computing,


U. Naumann and O. Schenk, Eds. Chapman and Hall/CRC, 2012.

[37] L. Condat, “A primal–dual splitting method for convex optimization


involving lipschitzian, proximable and linear composite terms,” Journal
of Opt. Th. and App., vol. 158, no. 2, pp. 460–479, Aug. 2013.

[38] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends


in Optimization, vol. 1, no. 3, pp. 123–231, 2013.

[39] R. Peng and D. A. Spielman, “An efficient parallel solver for SDD linear
systems,” in Proc. ACM Symposium on Theory of Computing, New
York, NY, 2014, pp. 333–342.

[40] N. K. Vishnoi, “Lx = b — Laplacian solvers and their algorithmic


applications,” Foundations and Trends in Theoretical Computer
Science, vol. 8, no. 1–2, pp. 1–141, 2012. [Online]. Available:
[Link]

[41] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent ker-


nel: Convergence and generalization in neural networks,” in
Advances in Neural Information Processing Systems, S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online].
Available: [Link]
[Link]

[42] S. S. Du, X. Zhai, B. Póczos, and A. Singh, “Gradient descent provably


optimizes over-parameterized neural networks,” in 7th International
Conference on Learning Representations, ICLR 2019, New Orleans,
LA, USA, May 6-9, 2019. [Link], 2019. [Online]. Available:
[Link]

[43] W. E, C. Ma, and L. Wu, “A comparative analysis of optimization


and generalization properties of two-layer neural network and random
feature models under gradient descent dynamics,” Science China
Mathematics, vol. 63, no. 7, pp. 1235–1258, 2020. [Online]. Available:
[Link]

[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, PyTorch: An Imperative Style, High-
Performance Deep Learning Library. Red Hook, NY, USA: Curran
Associates Inc., 2019.

[45] D. P. Bertsekas, Convex Optimization Algorithms. Athena Scientific,


2015.

[46] T. Schaul, S. Zhang, and Y. LeCun, “No more pesky learning rates,” in
Proc. of the 30th International Conference on Machine Learning, PMLR
28(3), vol. 28, Atlanta, Georgia, June 2013, pp. 343–351.

[47] Y. Nesterov, Introductory lectures on convex optimization, ser.


Applied Optimization. Kluwer Academic Publishers, Boston,
MA, 2004, vol. 87, a basic course. [Online]. Available: http:
//[Link]/10.1007/978-1-4419-8853-9

[48] D. P. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation:


Numerical Methods. Athena Scientific, 2015.

[49] D. Mills, “Internet time synchronization: the network time protocol,”


IEEE Transactions on Communications, vol. 39, no. 10, pp. 1482–1493,
1991.

[50] R. Diestel, Graph Theory. Springer Berlin Heidelberg, 2005.

[51] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, Federated


Learning, 1st ed. Springer, 2022.

[52] D. J. Spiegelhalter, “An omnibus test for normality for small samples,”
Biometrika, vol. 67, no. 2, pp. 493–496, 1980. [Online].
Available: [Link]

[53] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, Horizontal


Federated Learning. Cham: Springer International Publishing, 2020, pp.
49–67. [Online]. Available: [Link]
4

[54] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning.


Cambridge, Massachusetts: The MIT Press, 2006.

[55] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, Vertical


Federated Learning. Cham: Springer International Publishing, 2020, pp.
69–81. [Online]. Available: [Link]
5

[56] H. Ludwig and N. Baracaldo, Eds., Federated Learning: A Comprehen-


sive Overview of Methods and Applications. Springer, 2022.

[57] A. Shamsian, A. Navon, E. Fetaya, and G. Chechik, “Personalized


federated learning using hypernetworks,” in Proceedings of the 38th
International Conference on Machine Learning, ser. Proceedings of
Machine Learning Research, M. Meila and T. Zhang, Eds., vol.
139. PMLR, 18–24 Jul 2021, pp. 9489–9502. [Online]. Available:
[Link]

[58] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales,


“Learning to compare: Relation network for few-shot learning,” in 2018

IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2018, pp. 1199–1208.

[59] V. Satorras and J. Bruna, “Few-shot learning with graph


neural networks.” in ICLR (Poster). [Link], 2018.
[Online]. Available: [Link]
html#SatorrasE18

[60] J. Tropp, “An introduction to matrix concentration inequalities,” Found.


Trends Mach. Learn., May 2015.

[61] B. Bollobas, W. Fulton, A. Katok, F. Kirwan, and P. Sarnak, Random


graphs. Cambridge studies in advanced mathematics., 2001, vol. 73.

[62] G. Keiser, Optical Fiber Communication, 4th ed. New Delhi: Mc-Graw
Hill, 2011.

[63] D. Tse and P. Viswanath, Fundamentals of wireless communication.


USA: Cambridge University Press, 2005.

[64] B. Ying, K. Yuan, Y. Chen, H. Hu, P. Pan, and W. Yin,


“Exponential graph is provably efficient for decentralized deep training,”
in Advances in Neural Information Processing Systems, M. Ranzato,
A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds.,
vol. 34. Curran Associates, Inc., 2021, pp. 13 975–13 987. [Online].
Available: [Link]
[Link]

[65] Y.-T. Chow, W. Shi, T. Wu, and W. Yin, “Expander graph and
communication-efficient decentralized optimization,” in 2016 50th Asilomar
Conference on Signals, Systems and Computers, 2016, pp. 1715–1720.

[66] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslovak Mathemat-


ical Journal,, vol. 23, no. 2, pp. 298–305, 1973.

[67] S. Hoory, N. Linial, and A. Wigderson, “Expander graphs and their


applications,” Bull. Amer. Math. Soc., vol. 43, no. 04, pp. 439–562, Aug.
2006.

[68] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation


Theory. Englewood Cliffs, NJ: Prentice Hall, 1993.

[69] S. Chepuri, S. Liu, G. Leus, and A. Hero, “Learning sparse graphs under
smoothness prior,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech
and Signal Processing, 2017, pp. 6508–6512.

[70] J. Tan, Y. Zhou, G. Liu, J. H. Wang, and S. Yu, “pFedSim: Similarity-


Aware Model Aggregation Towards Personalized Federated Learning,”
arXiv e-prints, p. arXiv:2305.15706, May 2023.

[71] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,


2016.

[72] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed


Optimization and Statistical Learning via the Alternating Direction
Method of Multipliers. Hanover, MA: Now Publishers, 2010, vol. 3,
no. 1.

[73] European Commission, Directorate-General for Communications Networks,
Content and Technology, The Assessment List for Trustworthy Artificial
Intelligence (ALTAI) for self assessment. Publications Office, 2020.

[74] High-Level Expert Group on Artificial Intelligence, “Ethics guidelines
for trustworthy AI,” European Commission, Tech. Rep., April 2019.

[75] D. Kuss and O. Lopez-Fernandez, “Internet addiction and problem-


atic internet use: A systematic review of clinical research.” World J
Psychiatry, vol. 6, no. 1, pp. 143–176, Mar 2016.

[76] L. Munn, “Angry by design: toxic communication and technical


architectures,” Humanities and Social Sciences Communications, vol. 7,
no. 1, p. 53, 2020. [Online]. Available: [Link]
s41599-020-00550-7

[77] P. Mozur, “A genocide incited on facebook, with posts from myanmar’s


military,” The New York Times, 2018.

[78] A. Simchon, M. Edwards, and S. Lewandowsky, “The persuasive effects


of political microtargeting in the age of generative artificial intelligence.”
PNAS Nexus, vol. 3, no. 2, p. pgae035, Feb 2024.

[79] J. Near and D. Darais, “Guidelines for evaluating differential privacy


guarantees,” National Institute of Standards and Technology, Gaithers-
burg, MD, Tech. Rep., 2023.

[80] S. Wachter, “Data protection in the age of big data,” Nature


Electronics, vol. 2, no. 1, pp. 6–7, 2019. [Online]. Available:
[Link]

[81] P. Samarati, “Protecting respondents identities in microdata release,”
IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6,
pp. 1010–1027, 2001.

[82] European Commission, “Regulation (EU) 2016/679 of the European Parliament
and of the Council of 27 April 2016 on the protection of natural persons
with regard to the processing of personal data and on the free movement
of such data, and repealing Directive 95/46/EC (General Data Protection
Regulation) (text with EEA relevance),” no. 119, pp. 1–88, May 2016.

[83] U. N. G. Assembly, The Universal Declaration of Human Rights


(UDHR), New York, 1948.

[84] N. Kozodoi, J. Jacob, and S. Lessmann, “Fairness in credit scoring:


Assessment, implementation and profit implications,” European Journal
of Operational Research, vol. 297, no. 3, pp. 1083–1094, 2022.
[Online]. Available: [Link]
S0377221721005385

[85] J. Gonçalves-Sá and F. Pinheiro, Societal Implications of Recom-


mendation Systems: A Technical Perspective. Cham: Springer
International Publishing, 2024, pp. 47–63. [Online]. Available:
[Link]

[86] A. Abrol and R. Jha, “Power optimization in 5g networks: A step


towards green communication,” IEEE Access, vol. 4, pp. 1355–1374,
2016.

[87] C. Wang, Y. Yang, and P. Zhou, “Towards efficient scheduling of feder-
ated mobile devices under computational and statistical heterogeneity,”
IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 2,
pp. 394–410, 2021.

[88] J. Colin, T. Fel, R. Cadène, and T. Serre, “What I Cannot Predict, I


Do Not Understand: A Human-Centered Evaluation Framework for
Explainability Methods.” Advances in Neural Information Processing
Systems, vol. 35, pp. 2832–2845, 2022.

[89] A. Jung and P. Nardelli, “An information-theoretic approach to person-


alized explainable machine learning,” IEEE Sig. Proc. Lett., vol. 27, pp.
825–829, 2020.

[90] L. Zhang, G. Karakasidis, A. Odnoblyudova, L. Dogruel, Y. Tian, and


A. Jung, “Explainable empirical risk minimization,” Neural Computing
and Applications, vol. 36, no. 8, pp. 3983–3996, 2024. [Online]. Available:
[Link]

[91] M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou,


M. Milchenko, W. Xu, D. Marcus, R. R. Colen, and S. Bakas, “Federated
learning in medicine: facilitating multi-institutional collaborations
without sharing patient data,” Scientific Reports, vol. 10, no. 1, p. 12598,
2020. [Online]. Available: [Link]

[92] P. Amin, N. R. Anikireddypally, S. Khurana, S. Vadakkemadathil, and


W. Wu, “Personalized health monitoring using predictive analytics,”

in 2019 IEEE Fifth International Conference on Big Data Computing
Service and Applications (BigDataService), 2019, pp. 271–278.

[93] R. B. Ash, Probability and Measure Theory, 2nd ed. New York:
Academic Press, 2000.

[94] P. R. Halmos, Measure Theory. New York: Springer, 1974.

[95] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.

[96] C. Dwork and A. Roth, “The algorithmic foundations of differential


privacy,” Foundations and Trends® in Theoretical Computer
Science, vol. 9, no. 3–4, pp. 211–407, 2014. [Online]. Available:
[Link]

[97] S. Asoodeh, J. Liao, F. P. Calmon, O. Kosut, and L. Sankar, “A Better


Bound Gives a Hundred Rounds: Enhanced Privacy Guarantees via
f -Divergences,” arXiv e-prints, p. arXiv:2001.05990, Jan. 2020.

[98] I. Mironov, “Rényi differential privacy,” in 2017 IEEE 30th Computer


Security Foundations Symposium (CSF), 2017, pp. 263–275.

[99] S. M. Kay, Fundamentals of statistical signal processing. Vol. 2., Detec-


tion theory, ser. Prentice-Hall signal processing series. Upper Saddle
River, NJ: Prentice-Hall PTR, 1998.

[100] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem


for differential privacy,” in Proceedings of the 32nd International
Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille,

France: PMLR, 07–09 Jul 2015, pp. 1376–1385. [Online]. Available:
[Link]

[101] Q. Geng and P. Viswanath, “The optimal noise-adding mechanism in differential privacy,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 925–951, 2016.

[102] H. Shu and H. Zhu, “Sensitivity analysis of deep neural networks,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, ser. AAAI’19/IAAI’19/EAAI’19. AAAI Press, 2019. [Online]. Available: [Link]

[103] R. Busa-Fekete, A. Munoz-Medina, U. Syed, and S. Vassilvitskii, “Label differential privacy and private training data release,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 23–29 Jul 2023, pp. 3233–3251. [Online]. Available: [Link]

[104] B. Balle, G. Barthe, and M. Gaboardi, “Privacy amplification by subsampling: tight analyses via couplings and divergences,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 6280–6290.

[105] P. Cuff and L. Yu, “Differential privacy as a mutual information constraint,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 43–54. [Online]. Available: [Link]

[106] A. Makhdoumi, S. Salamatian, N. Fawaz, and M. Médard, “From the information bottleneck to the privacy funnel,” in 2014 IEEE Information Theory Workshop (ITW 2014), 2014, pp. 501–505.

[107] A. Turner, D. Tsipras, and A. Madry, “Clean-label backdoor attacks,” 2019. [Online]. Available: [Link]

[108] R. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed.
New York: Springer, 2009.

[109] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New Jersey: Wiley, 2006.

[110] X. Liu, H. Li, G. Xu, Z. Chen, X. Huang, and R. Lu, “Privacy-enhanced federated learning against poisoning adversaries,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 4574–4588, 2021.

[111] J. Zhang, B. Chen, X. Cheng, H. T. T. Binh, and S. Yu, “PoisonGAN: Generative poisoning attacks against federated learning in edge computing systems,” IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3310–3322, 2021.

[112] P. Halmos, Naive Set Theory. Springer-Verlag, 1974.

[113] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical
Learning, ser. Springer Series in Statistics. New York, NY, USA:
Springer, 2001.

[114] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press, 2006.

[115] E. Hazan, Introduction to Online Convex Optimization. Now Publishers Inc., 2016.

[116] J. Chen, L. Song, M. Wainwright, and M. Jordan, “Learning to explain: An information-theoretic perspective on model interpretation,” in Proc. 35th Int. Conf. on Mach. Learning, Stockholm, Sweden, 2018.

[117] D. Gujarati and D. Porter, Basic Econometrics. Mc-Graw Hill, 2009.

[118] Y. Dodge, The Oxford Dictionary of Statistical Terms. Oxford University Press, 2003.

[119] B. Everitt, Cambridge Dictionary of Statistics. Cambridge University Press, 2002.

[120] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, Dec. 2002.

[121] M. Ribeiro, S. Singh, and C. Guestrin, “ “Why should I trust you?”: Explaining the predictions of any classifier,” in Proc. 22nd ACM SIGKDD, Aug. 2016, pp. 1135–1144.

[122] A. Papoulis and S. U. Pillai, Probability, Random Variables, and Stochas-
tic Processes, 4th ed. New York: Mc-Graw Hill, 2002.

[123] R. T. Rockafellar, Network Flows and Monotropic Optimization. Athena Scientific, Jul. 1998.

[124] C. Lampert, “Kernel methods in computer vision,” Foundations and Trends in Computer Graphics and Vision, 2009.

[125] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007.

[126] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Adv. Neur. Inf. Proc. Syst., 2001.

[127] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997. [Online]. Available: [Link]

[128] A. Jung, G. Hannak, and N. Görtz, “Graphical LASSO Based Model Selection for Time Series,” IEEE Sig. Proc. Letters, vol. 22, no. 10, Oct. 2015.

[129] A. Jung, “Learning the conditional independence structure of stationary time series: A multitask learning approach,” IEEE Trans. Signal Processing, vol. 63, no. 21, Nov. 2015.

[130] C. Rudin, “Stop explaining black box machine learning models for high-
stakes decisions and use interpretable models instead,” Nature Machine
Intelligence, vol. 1, no. 5, pp. 206–215, 2019.

[131] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning
– from Theory to Algorithms. Cambridge University Press, 2014.

[132] A. Lapidoth, A Foundation in Digital Communication. New York: Cambridge University Press, 2009.

[133] H. Bauschke and P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed. New York: Springer, 2017.

[134] S. Bubeck, “Convex optimization: Algorithms and complexity,” in Foundations and Trends in Machine Learning, vol. 8. Now Publishers, 2015.
