Federated Learning: Theory and Practice
Abstract
∗ AJ is currently Associate Professor for Machine Learning at Aalto University (Finland). This work has been partially funded by the Academy of Finland (decision numbers 331197, 349966, 363624) and the European Union (grant number 952410).
Contents
1 Introduction 1
1.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Related Courses . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Main Goal of the Course . . . . . . . . . . . . . . . . . . . . . 7
1.6 Outline of the Course . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Assignments and Grading . . . . . . . . . . . . . . . . . . . . 10
1.8 Student Project . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9 Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.10 Ground Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.11 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.12 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 ML Basics 1
2.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.2 Three Components and a Design Principle . . . . . . . . . . . 1
2.3 Computational Aspects of ERM . . . . . . . . . . . . . . . . . 5
2.4 Statistical Aspects of ERM . . . . . . . . . . . . . . . . . . . . 6
2.5 Validation and Diagnosis of ML . . . . . . . . . . . . . . . . . 10
2.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 18
3 A Design Principle for FL 1
3.2 Graphs and Their Laplacian . . . . . . . . . . . . . . . . . . . 2
3.3 Generalized Total Variation Minimization . . . . . . . . . . . 9
3.3.1 Computational Aspects of GTVMin . . . . . . . . . . . 11
3.3.2 Statistical Aspects of GTVMin . . . . . . . . . . . . . 13
3.4 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 16
3.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5.1 Proof of Proposition 3.1 . . . . . . . . . . . . . . . . . 18
4 Gradient Methods 1
4.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
4.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.3 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.4 When to Stop? . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.5 Perturbed Gradient Step . . . . . . . . . . . . . . . . . . . . . 10
4.6 Handling Constraints - Projected Gradient Descent . . . . . . 12
4.7 Generalizing the Gradient Step . . . . . . . . . . . . . . . . . 14
4.8 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 17
5 FL Algorithms 1
5.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
5.2 Gradient Step for GTVMin . . . . . . . . . . . . . . . . . . . 2
5.3 Message Passing Implementation . . . . . . . . . . . . . . . . 5
5.4 FedSGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.5 FedAvg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.6 Asynchronous Computation . . . . . . . . . . . . . . . . . . . 14
5.7 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 16
5.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.8.1 Proof of Proposition 5.1 . . . . . . . . . . . . . . . . . 18
5.8.2 Proof of Proposition 5.2 . . . . . . . . . . . . . . . . . 19
7 Graph Learning 1
7.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2
7.2 Edges are a Design Choice . . . . . . . . . . . . . . . . . . . . 2
7.3 Measuring (Dis-)Similarity Between Datasets . . . . . . . . . . 6
7.4 Graph Learning Methods . . . . . . . . . . . . . . . . . . . . . 9
7.5 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 13
8 Trustworthy FL 1
8.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
8.2 Seven Key Requirements by the EU . . . . . . . . . . . . . . . 2
8.2.1 KR1 - Human Agency and Oversight. . . . . . . . . . . 2
8.2.2 KR2 - Technical Robustness and Safety. . . . . . . . . 3
8.2.3 KR3 - Privacy and Data Governance. . . . . . . . . . . 4
8.2.4 KR4 - Transparency. . . . . . . . . . . . . . . . . . . . 5
8.2.5 KR5 - Diversity, Non-Discrimination and Fairness. . . . 6
8.2.6 KR6 - Societal and Environmental Well-Being. . . . . . 7
8.2.7 KR7 - Accountability. . . . . . . . . . . . . . . . . . . 8
8.3 Technical Robustness of FL Systems . . . . . . . . . . . . . . 8
8.3.1 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . 9
8.3.2 Estimation Error Analysis . . . . . . . . . . . . . . . . 10
8.3.3 Network Resilience . . . . . . . . . . . . . . . . . . . . 13
8.3.4 Stragglers . . . . . . . . . . . . . . . . . . . . . . . . . 14
8.4 Subjective Explainability of FL Systems . . . . . . . . . . . . 17
8.5 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 21
9 Privacy-Protection in FL 1
9.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 1
9.2 Measuring Privacy Leakage . . . . . . . . . . . . . . . . . . . 2
9.3 Ensuring Differential Privacy . . . . . . . . . . . . . . . . . . . 9
9.4 Private Feature Learning . . . . . . . . . . . . . . . . . . . . . 12
9.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
9.6 Overview of Coding Assignment . . . . . . . . . . . . . . . . . 17
9.6.1 Where Are You? . . . . . . . . . . . . . . . . . . . . . 17
9.6.2 Ensuring Privacy with Pre-Processing . . . . . . . . . . 18
9.6.3 Ensuring Privacy with Post-Processing . . . . . . . . . 18
9.6.4 Private Feature Learning . . . . . . . . . . . . . . . . . 18
10 Data and Model Poisoning in FL 1
10.1 Learning Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2
10.2 Attack Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
10.3 Data Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . 6
10.4 Model Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . 7
10.5 Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
10.5.1 Denial-of-Service Attack . . . . . . . . . . . . . . . . . 8
10.5.2 Backdoor Attack . . . . . . . . . . . . . . . . . . . . . 8
Glossary 1
Lists of Symbols
A⊆B A is a subset of B.
{0, 1} The set that consists of the two real numbers 0 and 1.
Matrices and Vectors
I_{l×d}  A generalized identity matrix with l rows and d columns: the entries of I_{l×d} ∈ R^{l×d} are equal to 1 along the main diagonal and equal to 0 otherwise.
X^T  The transpose of a matrix X ∈ R^{m×d}. A square real-valued matrix X ∈ R^{m×m} is called symmetric if X = X^T.
0 = (0, …, 0)^T  The vector in R^d with each entry equal to zero.
1 = (1, …, 1)^T  The vector in R^d with each entry equal to one.
(v^T, w^T)^T  The vector of length d + d′ obtained by concatenating the entries of vector v ∈ R^d with the entries of w ∈ R^{d′}.
Probability Theory
Machine Learning
x(r) The feature vector of the r-th data point within a dataset.
x_j^(r)  The j-th feature of the r-th data point within a dataset.
B A mini-batch (subset) of randomly chosen data points.
x(r) , y (r) The features and label of the r-th data point.
Y^X  Given two sets X and Y, we denote by Y^X the set of all possible hypothesis maps h : X → Y.
H  A hypothesis space or model used by a ML method. The hypothesis space consists of different hypothesis maps h : X → Y between which the ML method has to choose.
E_t  The training error of a hypothesis h, which is its average loss incurred over a training set.
w  A parameter vector w = (w_1, …, w_d)^T of a model, e.g., the weights of a linear model or of an ANN.
ϕ(·)  A feature map ϕ : X → X′ : x ↦ x′ := ϕ(x) ∈ X′.
Federated Learning
d_max^(G)  The maximum weighted node degree of a graph G.
x^(i,r)  The features of the r-th data point in the local dataset D^(i).
y (i,r) The label of the r-th data point in the local dataset D(i) .
stack{w^(i)}_{i=1}^n = ((w^(1))^T, …, (w^(n))^T)^T  The vector in R^{dn} obtained by vertically stacking the local model parameters w^(i) ∈ R^d.
1 Introduction
Welcome to the course CS-E4740 Federated Learning. While the course can
be completed fully remotely, we strongly encourage you to attend the on-site
events. Every on-site event will be recorded and made available to students
via this YouTube channel.
The basic variant (5 credits) of this course consists of lectures (schedule
here) and corresponding coding assignments (schedule here). We test
your completion of the coding assignments via quizzes (implemented on the
MyCourses page). You can upgrade the course to an extended variant (10
credits) by completing a student project (see Section 1.8).
1.2 Introduction
For example, the management of pandemics uses contact networks to
relate local datasets generated by patients. Network medicine relates data
about diseases via co-morbidity networks [12]. Social science uses notions
of acquaintance to relate data collected from befriended individuals [13].
Another example of network-structured data are the observations collected at
Finnish Meteorological Institute (FMI) weather stations. Each FMI station
generates a local dataset; local datasets of nearby stations tend to have
similar statistical properties.
FL is an umbrella term for distributed optimization techniques to train
machine learning (ML) models from decentralized collections of local datasets
[14–18]. The idea is to carry out the computations, such as gradient steps
(see Lecture 4), arising during ML model training at the location of data
generation (such as your smartphone or a heart rate sensor). This design
philosophy is different from the basic ML workflow, which is to (i) collect
data at a single location (computer) and (ii) then train a single ML model on
this data.
It can be beneficial to train different ML models at the locations of actual
data generation [19] for several reasons:
Figure 1.1: Left: A basic ML method uses a single dataset to train a single
model. Right: Decentralized collection of devices (or clients) with the ability
to generate data and train models.
• Trading Computation against Communication. Consider a FL
application where local datasets are generated by low-complexity devices
at remote locations (think of a wildlife camera) that cannot be easily
accessed. The cost of communicating raw local datasets to some central
unit (which then trains a single global ML model) might be much higher
than the computational cost incurred by using the low-complexity
devices to (partially) train ML models [20].
1.3 Prerequisites
R [24, 25]. We will heavily use concepts from linear algebra to represent and
manipulate data and ML models.
The metric structure of Rd is an excellent analysis tool for the (convergence)
behaviour of FL algorithms. In particular, we will study FL algorithms that
are obtained as fixed-point iterations of some non-linear operator on Rd .
These operators are defined by the data (distribution) and ML models used
within a FL system. A prime example of such a non-linear operator is the
gradient step of gradient-based methods (see Lecture 4). The computational
properties (such as convergence speed) of these FL algorithms can then be
characterized via the contraction properties of the underlying operator [26].
A main tool for the design of FL algorithms are variants of gradient
descent (GD). The common idea of these gradient-based methods is to
approximate a function f(x) locally by a linear function. This local linear
approximation is determined by the gradient ∇f(x). We therefore expect
some familiarity with multivariable calculus [2].
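The role of the gradient as a local linear approximation can be checked numerically. The following toy sketch (the quadratic function, probe point, and step are arbitrary choices for illustration) compares f(w + d) with the first-order approximation f(w) + ∇f(w)^T d:

```python
import numpy as np

# Toy smooth function f(w) = ||A w - b||_2^2 with gradient 2 A^T (A w - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([7.0, 8.0, 9.0])

def f(w):
    return np.sum((A @ w - b) ** 2)

def grad_f(w):
    return 2 * A.T @ (A @ w - b)

# First-order check: f(w + d) ~ f(w) + grad_f(w)^T d for a small step d;
# the approximation error is of order ||d||^2.
w = np.array([0.5, -1.0])
d = 1e-5 * np.array([1.0, 1.0])
lin_approx = f(w) + grad_f(w) @ d
print(abs(f(w + d) - lin_approx))
```

For a step of length 1e-5, the printed approximation error is on the order of 1e-8, which is the quadratic-in-∥d∥ behaviour expected from Taylor's theorem.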
model.
• ELEC-E5424 - Convex Optimization. Teaches advanced optimisation
theory for the important class of convex optimization problems [29].
Convex optimization theory and methods can be used for the study and
design of FL algorithms.
V := {1, . . . , n}.
two connected nodes i, i′ by a positive edge weight A_{i,i′} > 0.
We can formalize a FL application as an optimization problem associated
with an FL network,
(i) (i′ )
X X
Li w(i) + α Ai,i′ d(w ,w ) . (1.1)
min
w(i)
i∈V {i,i′ }∈E
by local model parameters at each node i. The second component is the sum
(i) ,w(i′ ) )
of local model parameters discrepancies d(w across the edges {i, i′ } of
the FL network.
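To make the structure of (1.1) concrete, the following sketch evaluates the objective for a toy FL network with n = 3 nodes and linear local models. The edge weights, the synthetic local datasets, and the choice of the squared Euclidean norm as discrepancy measure d(·, ·) are illustrative assumptions, not prescribed by (1.1):

```python
import numpy as np

# Toy FL network: n = 3 nodes, symmetric edge weights A[i, j] > 0 on edges.
A = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
edges = [(0, 1), (0, 2)]   # undirected edges {i, i'}
alpha = 0.1                # GTV regularization strength (a design choice)

# Synthetic local datasets: X[i] features and y[i] labels of node i.
rng = np.random.default_rng(0)
X = [rng.normal(size=(5, 2)) for _ in range(3)]
y = [rng.normal(size=5) for _ in range(3)]

def local_loss(i, w_i):
    """Local loss L_i(w^(i)): average squared error on node i's dataset."""
    return np.mean((X[i] @ w_i - y[i]) ** 2)

def gtvmin_objective(w):
    """Objective (1.1): sum of local losses plus weighted discrepancies,
    here with d(w, w') chosen as the squared Euclidean norm."""
    loss = sum(local_loss(i, w[i]) for i in range(3))
    gtv = sum(A[i, j] * np.sum((w[i] - w[j]) ** 2) for (i, j) in edges)
    return loss + alpha * gtv

w = [np.zeros(2) for _ in range(3)]
print(gtvmin_objective(w))
```

Note that if all nodes share identical model parameters, the discrepancy term vanishes and the objective reduces to the sum of the local losses.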
• Part II: FL Theory and Methods. Lecture 3 introduces the FL
network as our main mathematical structure for representing collections
of local datasets and corresponding tailored models. The undirected
and weighted edges of the FL network represent statistical similarities
between local datasets. Lecture 3 also formulates FL as an instance
of regularized empirical risk minimization (RERM) which we refer
to as GTVMin. GTVMin uses the variation of personalized model
parameters across the edges of the FL network as a regularizer. We will see
that GTVMin couples the training of tailored (or “personalized”) ML
models such that well-connected nodes (clusters) in the FL network will
obtain similar trained models. Lecture 4 discusses variations of gradient
descent as our main algorithmic toolbox for solving GTVMin. Lecture
5 shows how FL algorithms can be obtained in a principled fashion
by applying optimization methods, such as gradient-based methods,
to GTVMin. We will obtain FL algorithms that can be implemented
as iterative message passing methods for the distributed training of
tailored (“personalized”) models. Lecture 6 derives some main flavours
of FL as special cases of GTVMin. The usefulness of GTVMin crucially
depends on the choice for the weighted edges in the FL network. Lecture
7 discusses graph learning methods that determine a useful FL network
via different notions of statistical similarity between local datasets.
perturbations of data or computation. We then discuss how FL algorithms
can ensure privacy protection in Lecture 9. Lecture 10 discusses how
to evaluate and ensure robustness of FL methods against intentional
perturbations (poisoning) of local datasets.
The course includes coding assignments that require you to implement concepts
from the lecture in a Python notebook. We will use MyCourses quizzes to test
your understanding of lecture contents and solutions to the coding assignments.
For each quiz, you can earn around 10 points. We will sum up the points
collected during the quizzes (no minimum requirements for any individual
quiz) and determine your grade according to: grade 1 for 50-59 points;
grade 2 for 60-69; grade 3 for 70-79; grade 4 for 80-89 and top grade
5 for at least 90 points. Students will be able to review their assignments’
grading at the end of the course.
You can extend the basic variant (which is worth 5 credits) to 10 credits by
completing a student project and peer review. This project requires you to
formulate an application of your choice as a FL problem using the concepts
from this course. You then have to solve this FL problem using the FL
algorithms taught in this course. The main deliverable will be a project
report which must follow the structure indicated in the template. You will
then peer-review the reports of your fellow students by answering a detailed
questionnaire.
1.9 Schedule
The course lectures are held on Mondays and Wednesdays at 16.15, from
28-Feb-2024 until 30-Apr-2024. You can find the detailed schedule and
lecture halls following this link. As the course can be completed fully
remotely, we will record each lecture and add the recording to the YouTube
playlist here.
After each lecture, we release the corresponding assignment at this site.
You have a few days to work on the assignment before we open the
corresponding quiz on the MyCourses page of the course (click me).
The assignments and quizzes are somewhat fast-paced to encourage students
to actively work on the course. We will also be strict about the deadlines
for the quizzes. However, students have the possibility to make up for
lost quiz points during a review meeting at the end of the course. Moreover,
active participation, such as contributing to the discussion forum or
providing feedback on the lecture notes, will be taken into account.
Note that as a student following this course, you must act according to the
Code of Conduct of Aalto University (see here). Two main ground rules for
this course are:
other’s solutions to coding assignments. We will randomly choose stu-
dents who have to explain their solutions (and corresponding answers
to quiz questions).
1.11 Acknowledgement
The development of this text has benefited from student feedback received
for the course CS-E4740 Federated Learning which has been offered at Aalto
University during 2023. The author is indebted to Olga Kuznetsova, Diana
Pfau and Shamsiiat Abdurakhmanova for providing feedback on early drafts.
1.12 Problems
• Show that the set of all feature vectors forms a vector space under
standard addition and scalar multiplication.
• If x(1) = (1, 2, 3)T and x(2) = (−1, 0, 1)T , compute 3x(1) − 2x(2) .
• Compute ŵ for the following dataset:

  X = [1 2; 3 4; 5 6],   y = [7; 8; 9].
• Compute ŵ for the following dataset: The r-th row of X, for r =
1, …, 28, is given by the 10-minute temperature recordings during day
r/Mar/2023 at FMI weather station Kustavi Isokari. The r-th entry of y
is the maximum daytime temperature during day (r+1)/Mar/2023 at
the same weather station.
1. Prove that Q is psd.

2. Compute the eigenvalues of Q for X = [1 2; 3 4].
2 ML Basics
This chapter covers basic ML techniques instrumental for federated learning
(FL). Content-wise, this chapter is more extensive compared to the following
chapters. However, this chapter should be considerably easier to follow than
the following chapters as it mainly refreshes pre-requisite knowledge.
• be familiar with the concept of data points (their features and labels),
model and loss function,
precisely a data point is. Coming up with a good choice or definition of data
points is not trivial as it influences the overall performance of a ML method
in many different ways.
This course will focus mainly on one specific choice for the data points.
In particular, we will consider data points that represent the daily weather
conditions around Finnish Meteorological Institute (FMI) weather stations.
We denote a specific data point by z. It is characterized by the following
features:
• latitude lat and longitude lon of the weather station, e.g., lat := 60.37788,
lon := 22.0964,
choice of loss function crucially influences the statistical and computational
properties of the resulting ML method (see [6, Ch. 2]). In what follows,
unless stated otherwise, we use the squared error loss L(z, h) := (y − h(x))^2
to measure the prediction error.
It seems natural to choose (or learn) a hypothesis that minimizes the average
loss (or empirical risk) on a given set of data points D := {(x^(1), y^(1)), …, (x^(m), y^(m))}.
As the notation in (2.1) indicates (using the symbol “∈” instead of “:=”),
there might be several different solutions to the optimization problem (2.1).
Unless specified otherwise, ĥ can be used to denote any hypothesis in H that
has minimum average loss over D.
Several important machine learning (ML) methods use a parametrized
model H: each hypothesis h ∈ H is defined by parameters w ∈ R^d, often
indicated by the notation h^(w). One important example of a parametrized
model is the linear model [6, Sec. 3.1],
Note that (2.3) amounts to finding the minimum of a smooth and convex
function

f(w) = (1/m) ( w^T X^T X w − 2 y^T X w + y^T y )   (2.4)

with the feature matrix

X := ( x^(1), …, x^(m) )^T   (2.5)

and the label vector

y := ( y^(1), …, y^(m) )^T   (2.6)

of the training set D.

ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q   (2.7)

0 ≤ λ_1(Q) ≤ … ≤ λ_d(Q).   (2.9)
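Reading off (2.4), the quadratic in (2.7) corresponds to Q := (1/m) X^T X and q := −(2/m) X^T y; this identification is implied by (2.4) rather than stated explicitly. The following sketch solves (2.7) for a toy training set and verifies the stationarity condition 2Qw + q = 0 as well as the non-negativity of the eigenvalues of Q:

```python
import numpy as np

# Toy training set: m = 4 data points with d = 2 features each.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 5.0])
m = X.shape[0]

# Matching (2.4): f(w) = (1/m)(w^T X^T X w - 2 y^T X w + y^T y),
# which gives Q = (1/m) X^T X and q = -(2/m) X^T y.
Q = X.T @ X / m
q = -2 * X.T @ y / m

# A minimizer of (2.7) satisfies the stationarity condition 2 Q w + q = 0;
# least squares delivers such a minimizer.
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(2 * Q @ w_hat + q))  # close to 0

# Q is psd, so its eigenvalues satisfy 0 <= lambda_1 <= ... <= lambda_d.
print(np.linalg.eigvalsh(Q))
```

The same stationarity check works for any training set with full-rank feature matrix; for rank-deficient X, (2.7) has several solutions and `lstsq` returns the one with minimum norm.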
To train a ML model H means to solve ERM (2.1) (or (2.3) for linear
regression); the dataset D is therefore referred to as a training set. The trained
model results in the learnt hypothesis ĥ. We obtain practical ML methods by
applying optimization algorithms to solve (2.1). Two key questions studied
in ML are:
Figure 2.1: ERM (2.1) for linear regression minimizes a convex quadratic
function w^T Q w + w^T q.
by gradient-based methods: starting from initial parameters w^(0), we
iterate the gradient step

w^(k) = w^(k−1) + (2η/m) Σ_{r=1}^m x^(r) ( y^(r) − (w^(k−1))^T x^(r) ).   (2.10)
How much computation do we need for one iteration of (2.10)? How many
iterations do we need? We will try to answer the latter question in Lecture 4.
The first question can be answered more easily for a typical computational
infrastructure (e.g., “Python running on a commercial laptop”). The evaluation
of (2.10) then typically requires around m arithmetic operations (addition,
multiplication).
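A minimal implementation of the gradient step (2.10) might look as follows; the synthetic data, the learning rate η, and the number of iterations are hand-picked assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 50, 3
X = rng.normal(size=(m, d))        # feature matrix (2.5)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                     # noiseless labels, for a clean check

eta = 0.1                          # learning rate (hand-picked)
w = np.zeros(d)                    # initialization w^(0)
for _ in range(500):
    # Gradient step (2.10): X.T @ (y - X @ w) = sum_r x^(r) (y^(r) - (w^(k-1))^T x^(r)).
    w = w + (2 * eta / m) * X.T @ (y - X @ w)

print(np.linalg.norm(w - w_true))  # tiny after enough iterations
```

Since the labels here are noiseless, the iterates converge to the true parameters; with noisy labels, they converge to the minimizer of (2.3) instead.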
It is instructive to consider the special case of a linear model that does
not use any feature, i.e., h(x) = w. For this extreme case, the ERM (2.3) has
a simple closed-form solution:

ŵ = (1/m) Σ_{r=1}^m y^(r).   (2.11)

Thus, for this special case of the linear model, evaluating (2.11) amounts to
summing the m numbers y^(1), …, y^(m). It seems reasonable to assume that the
amount of computation required to compute (2.11) is proportional to m.
We can train a linear model on a given training set via ERM (2.3). But how
useful is the solution ŵ of (2.3) for predicting the labels of data points outside
the training set? Consider applying the learnt hypothesis h^(ŵ) to an arbitrary
data point not contained in the training set. What can we say about the
resulting prediction error y − h^(ŵ)(x) in general? In other words, how well
does h^(ŵ) generalize beyond the training set?
The most widely used approach to study the generalization of ML methods
is via probabilistic models. Here, we interpret each data point as a realization
of an independent and identically distributed (i.i.d.) RV with probability
distribution p(x, y). Under this i.i.d. assumption, we can evaluate the overall
performance of a hypothesis h ∈ H via the expected loss (or risk)
One example for a probability distribution p(x, y) relates the label y with
the features x of a data point as

y = x^T w + ε,  with x ∼ N(0, I) and noise ε ∼ N(0, σ^2) independent of x.   (2.13)

A simple calculation reveals the expected squared error loss of a given linear
hypothesis h(x) = x^T ŵ as³

E{(y − h(x))^2} = ∥w − ŵ∥_2^2 + σ^2.   (2.14)
The component σ 2 can be interpreted as the intrinsic noise level of the label
y. We cannot hope to find a hypothesis with an expected loss below this level.
³Strictly speaking, the relation (2.14) only applies for constant (deterministic) model
parameters ŵ that do not depend on the RVs whose realizations are the observed data
points (see, e.g., (2.13)). However, the learnt model parameters ŵ might be the output of a
ML method (such as (2.3)) that is applied to a dataset D consisting of i.i.d. realizations from
some underlying probability distribution. In this case, we need to replace the expectation
on the LHS of (2.14) with a conditional expectation E{(y − h(x))^2 | D}.
The first component of the RHS in (2.14) is the estimation error ∥w − ŵ∥_2^2
of a ML method that reads in the training set and delivers an estimate ŵ
(e.g., via (2.3)) for the parameters of a linear hypothesis.
We next study the estimation error w − ŵ incurred by the specific estimate
ŵ = ŵ^(LR) (2.7) delivered by linear regression methods. To this end, we first
use the probabilistic model (2.13) to decompose the label vector y in (2.6) as

y = Xw + n,  with n := (ε^(1), …, ε^(m))^T.   (2.15)

ŵ^(LR) ∈ argmin_{w∈R^d} w^T Q w + w^T q′ + w^T e   (2.16)
[Figure 2.2: the perturbed quadratic w^T Q w + w^T (q′ + e) compared with the
unperturbed quadratic w^T Q w + w^T q′; the perturbation term w^T e shifts the
minimizer to ŵ^(LR).]

On the other hand,

f(w) = (w − w̄)^T Q (w − w̄) + e^T (w − w̄)   (2.19)
     ≥ (w − w̄)^T Q (w − w̄) − ∥e∥_2 ∥w − w̄∥_2          (a)
     ≥ λ_1 ∥w − w̄∥_2^2 − ∥e∥_2 ∥w − w̄∥_2,   (2.20)    (b)

where w̄ denotes the true parameter vector underlying (2.15). Step (a) used the
Cauchy–Schwarz inequality and step (b) used the EVD (2.8) of Q.
Evaluating (2.20) for w = ŵ and combining with f(ŵ) ≤ 0 yields (2.18).
be controlled by a suitable choice (or transformation) of the features x of a data
point. Trivially, we can increase λ_1(Q) by a factor of 100 if we scale each feature
by a factor of 10. However, this approach would also scale (by a factor of 100) the
error term ∥X^T n∥_2^2 in (2.18). For some applications, we can find feature
transformations (“whitening”) that increase λ_1(Q) but do not increase
∥X^T n∥_2^2. We finally note that the error term ∥X^T n∥_2^2 in (2.18) vanishes if
the noise vector n is orthogonal to the columns of the feature matrix X.
It is instructive to evaluate the bound (2.18) for the special case where
each data point has the same feature x = 1. Here, the probabilistic model
(2.15) reduces to a signal in noise model,

y^(r) = w + ε^(r),  for r = 1, …, m,   (2.21)

with some true underlying parameter w. The noise terms ε^(r), for r = 1, …, m,
are realizations of i.i.d. RVs with probability distribution N(0, σ^2). The
feature matrix then becomes X = 1 and, in turn, Q = 1, λ_1(Q) = 1.
Inserting these values into (2.18) results in the bound

∥ŵ^(LR) − w∥_2^2 ≤ 4∥n∥_2^2/m^2.   (2.22)

For the labels and features in (2.21), the solution of (2.16) is given by

ŵ^(LR) = (1/m) Σ_{r=1}^m y^(r) = w + (1/m) Σ_{r=1}^m ε^(r),   (2.23)

where the second equality used (2.21).
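The decomposition (2.23) suggests that the squared estimation error scales as σ^2/m. A quick Monte Carlo check of this scaling, with arbitrary choices for the true parameter, noise level, and number of trials:

```python
import numpy as np

rng = np.random.default_rng(2)
w_true, sigma, n_trials = 3.0, 1.0, 2000

def mean_sq_error(m):
    """Average squared error of the estimate (2.23) over many trials of the
    signal in noise model y^(r) = w + eps^(r), eps^(r) ~ N(0, sigma^2)."""
    y = w_true + sigma * rng.normal(size=(n_trials, m))
    w_hat = y.mean(axis=1)          # the estimate (2.23)
    return np.mean((w_hat - w_true) ** 2)

# The expected squared error is sigma^2 / m: ten times more data points
# should shrink the error by roughly a factor of ten.
print(mean_sq_error(10), mean_sq_error(100))
```

The observed averages match the prediction σ^2/m closely, illustrating how the estimation error of (2.23) shrinks with a growing training set.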
The above analysis of the generalization error started from postulating a
probabilistic model for the generation of data points. However, this probabilistic
model might be wrong, in which case the bound (2.18) does not apply. Thus, we
might want to use a more data-driven approach for assessing the usefulness of a
learnt hypothesis ĥ obtained, e.g., from solving ERM (2.1).
Loosely speaking, validation tries to find out if a learnt hypothesis ĥ
performs similar inside and outside the training set. In its most basic form,
validation amounts to computing the average loss of a learnt hypothesis ĥ
on some data points not included in the training set. We refer to these data
points as the validation set.
Algorithm 2.1 summarizes a single iteration of a prototypical ML workflow
that consists of model training and validation. The workflow starts with
an initial choice for a dataset D, model H and loss function L (·, ·). We
then repeat Algorithm 2.1 several times. After each repetition, based on the
resulting training error and validation error, we modify (some of) the design
choices for dataset, model and loss function. These design choices, including
the decision of when to stop repeating Algorithm 2.1, are often guided by
simple heuristics [6, Ch.6.6].
We can diagnose an ERM-based ML method, such as Algorithm 2.1, by
comparing its training error with its validation error. This diagnosis is further
enabled if we know a baseline E^(ref). One important source for a baseline
E^(ref) are probabilistic models for the data points (see Section 2.4).
Given a probabilistic model p(x, y), we can compute the minimum achievable
risk (2.12). Indeed, the minimum achievable risk is precisely the expected
loss of the Bayes estimator ĥ(x) of the label y, given the features x of a
data point. The Bayes estimator ĥ(x) is fully determined by the probability
distribution p(x, y) [30, Chapter 4].
A further potential source for a baseline E (ref) is an existing, but for
Algorithm 2.1 One Iteration of ML Training and Validation
Input: dataset D, model H, loss function L(·, ·)
1: split D into a training set D^(train) and a validation set D^(val)
2: learn a hypothesis via solving ERM

ĥ ∈ argmin_{h∈H} Σ_{(x,y)∈D^(train)} L((x, y), h)   (2.24)
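A single iteration of Algorithm 2.1 can be sketched in a few lines. Algorithm 2.1 leaves the dataset, model, and loss abstract; the sketch below instantiates them with synthetic data, the linear model, and the squared error loss, so that ERM (2.24) reduces to least squares:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic dataset D: features x^(r) in R^2, labels from a noisy linear model.
X = rng.normal(size=(120, 2))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=120)

# Step 1: split D into a training set D^(train) and a validation set D^(val).
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

# Step 2: solve ERM (2.24) for the linear model with squared error loss
# (this is just least squares).
w_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Step 3: compute training error E_t and validation error E_v.
E_t = np.mean((X_train @ w_hat - y_train) ** 2)
E_v = np.mean((X_val @ w_hat - y_val) ** 2)
print(E_t, E_v)  # both close to the label noise level (0.1)^2 = 0.01
```

In the full workflow, this iteration is repeated and the design choices (dataset, model, loss) are adjusted between repetitions based on E_t and E_v.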
some reason unsuitable, ML method. This existing ML method might be
computationally too expensive to be used for the ML application at hand.
However, we might still use its statistical properties as a baseline.
We can also use the performance of human experts as a baseline. If
we want to develop a ML method that detects skin cancer from images,
a possible baseline is the classification accuracy achieved by experienced
dermatologists [31].
We can diagnose a ML method by comparing the training error Et with
the validation error Ev and (if available) the baseline E (ref) .
validation error, and both are significantly larger than the baseline. Thus,
the learnt hypothesis does not seem to overfit the training set. However, the
training error achieved by the learnt hypothesis is significantly larger
than the baseline. There can be several reasons for this to happen.
First, it might be that the hypothesis space is too small, i.e., it does not
include a hypothesis that provides a good approximation for the relation
between features and label of a data point. One remedy to this situation
is to use a larger hypothesis space, e.g., by including more features in a
linear model, using higher polynomial degrees in polynomial regression,
using deeper decision trees or ANNs with more hidden layers (deep net).
Second, besides the model being too small, another reason for a large
training error could be that the optimization algorithm used to solve
ERM (2.24) is not working properly (see Lecture 4).
training error and the validation error of a learnt hypothesis becomes
more difficult. As an extreme case, the validation set might consist of
data points for which every hypothesis incurs small average loss (see
Figure 2.3). Here, we might try to increase the size of the validation set
by collecting more labeled data points or by using data augmentation
(see Section 2.6). If the size of both training set and validation set
is large but we still obtain Et ≫ Ev , one should verify if data points
in these sets conform to the i.i.d. assumption. There are principled
statistical tests for the validity of the i.i.d. assumption for a given
dataset (see [32] and references therein).
Figure 2.3: Example for an unlucky split into training set and validation set
for the model H := {h^(1), h^(2), h^(3)}.
2.6 Regularization
a ML method is the ratio deff (H) /|D| between the model size deff (H) and
the number |D| of data points. The tendency of the ML method to overfit
increases with the ratio deff (H) /|D|.
Regularization techniques decrease the ratio deff (H) /|D| via three (essen-
tially equivalent) approaches:
• collect more data points, possibly via data augmentation (see Fig. 10.9),
• add a penalty term αR(h) to the average loss in ERM (2.1) (see Fig. 10.9),
of

(x + n, y)  with  n ∼ N(0, αI).   (2.26)
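Generating augmented data points according to (2.26) is straightforward; in the sketch below, the synthetic dataset and the number of noisy copies per original data point are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.01     # variance of the augmentation noise in (2.26)
n_copies = 5     # noisy copies per data point (arbitrary choice)

# Original training set: m = 10 data points with d = 3 features.
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

# Each original data point (x, y) spawns copies (x + n, y), n ~ N(0, alpha*I);
# the labels are kept unchanged.
X_aug = np.vstack([X + np.sqrt(alpha) * rng.normal(size=X.shape)
                   for _ in range(n_copies)])
y_aug = np.tile(y, n_copies)
print(X_aug.shape, y_aug.shape)  # (50, 3) (50,)
```

Training on the augmented set (X_aug, y_aug) instead of (X, y) decreases the ratio deff(H)/|D| by enlarging |D|.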
[Figure: data augmentation enlarges the original training set D by adding
feature-perturbed copies of its data points; the augmented ERM objective is
(1/m) Σ_{r=1}^m L((x^(r), y^(r)), h) + αR(h).]

ŵ^(α) ∈ argmin_{w∈R^d} w^T Q w + w^T q   (2.27)
Thus, like linear regression (2.7), also ridge regression minimizes a convex
quadratic function. A main difference between linear regression (2.7) and
ridge regression (for α > 0) is that the matrix Q in (2.27) is guaranteed to
be invertible for any training set D. In contrast, the matrix Q in (2.7) for
linear regression might be singular for some training sets.5
The statistical aspects of the solutions to (2.27) (i.e., the parameters learnt
by ridge regression) crucially depend on the value of α. This choice can
be guided by an error analysis using a probabilistic model for the data (see
Proposition 2.1). Instead of using a probabilistic model, we can also compare
the training error and validation error of the hypothesis h(x) = (ŵ^(α))^T x
learnt by ridge regression for different values of α.
• Generate numpy arrays X, y, whose r-th row holds the features x(r) and
label y (r) , respectively, of the r-th data point in the csv file.
⁵Consider the extreme case where all features of each data point in the training set D
are zero.
• Split the dataset into a training set and a validation set. The size of
the training set should be 100.
• Train and validate a linear model using feature augmentation via polynomial
combinations (see the PolynomialFeatures class). Try out different
choices of the maximal degree of these polynomial combinations.
• Using a fixed value for the polynomial degree for the feature augmentation
step, train and validate a linear model using ridge regression (2.25)
via the Ridge class.
• Determine the resulting training error and validation error of the model
parameters obtained for different values of α in (2.25).
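The assignment itself relies on scikit-learn's PolynomialFeatures and Ridge classes. For a self-contained illustration, the following numpy-only sketch mirrors those steps (polynomial feature augmentation, ridge regression in closed form, a training set of size 100) on synthetic data standing in for the csv file; the objective (1/m)∥Xw − y∥_2^2 + α∥w∥_2^2 is an assumption about the precise form of (2.25):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for the csv data: scalar feature, nonlinear label.
x = rng.uniform(-1, 1, size=150)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=150)

def poly_features(x, degree):
    """Feature augmentation via polynomial combinations (powers of x)."""
    return np.vstack([x ** k for k in range(1, degree + 1)]).T

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression for (1/m)||Xw - y||^2 + alpha ||w||^2."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + alpha * np.eye(d), X.T @ y / m)

# Split into a training set of size 100 and a validation set.
X_full = poly_features(x, degree=5)
X_train, y_train = X_full[:100], y[:100]
X_val, y_val = X_full[100:], y[100:]

for alpha in [0.0, 0.01, 1.0]:
    w = ridge_fit(X_train, y_train, alpha)
    E_t = np.mean((X_train @ w - y_train) ** 2)
    E_v = np.mean((X_val @ w - y_val) ** 2)
    print(f"alpha={alpha}: train error {E_t:.3f}, validation error {E_v:.3f}")
```

Increasing α always increases the training error, while the validation error typically passes through a minimum at some intermediate α; this is the comparison the assignment asks for.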
3 A Design Principle for FL
Lecture 2 reviewed ML methods that use numeric arrays to store data points
(their features and labels) and model parameters. We have also discussed
ERM (and its regularization) as a main design principle for practical ML
systems. This lecture extends the basic ML concepts from a single-dataset
single-model setting to FL applications involving distributed collections of
data and models.
Section 3.2 introduces federated learning (FL) networks as a mathematical
model for collections of devices that generate local datasets and train local
models. Section 3.3 presents our main design principle for FL systems. This
principle, referred to as GTV minimization (GTVMin), is a special case of
ERM using a particular choice of loss function and model. The model used
by GTVMin is an (algebraic) product of local models, one for each node. The
loss function of GTVMin consists of two parts: the sum of loss functions at
each node and a penalty term that measures the variation of local models
across the edges of the FL network. The precise form of this penalty term
is a design choice and depends on the nature of the local models. If the
local models are parametrized by model parameters, we can use the norm
of their differences across the edges in the FL network. For non-parametric
local models, we can measure the variation between two local models by
comparing their predictions on a specific dataset. We obtain the generalized
total variation (GTV) by summing their variations over all edges in the FL
network.
3.1 Learning Goals
[Figure: an FL network with edge weights A_{i,i′}; each node i carries a local dataset D^(i) and local model parameters w^(i).]
V := {1, . . . , n}. Each node i ∈ V of the FL network G carries a local dataset
Here, x(i,r) and y (i,r) denote, respectively, the features and the label of the
rth data point in the local dataset D(i) . Note that the size mi of the local
dataset might vary between different nodes i ∈ V.
It is convenient to collect the feature vectors x^(i,r) and labels y^(i,r) into a feature matrix X^(i) and label vector y^(i), respectively,

    X^(i) := (x^(i,1), . . . , x^(i,m_i))^T, and y^(i) := (y^(i,1), . . . , y^(i,m_i))^T. (3.2)

The local dataset D^(i) can then be represented compactly by the feature matrix X^(i) ∈ ℝ^{m_i×d} and the label vector y^(i) ∈ ℝ^{m_i}.
Besides the local dataset D^(i), each node i ∈ V also carries a local model H^(i). Our focus is on local models that can be parametrized by local model parameters w^(i) ∈ ℝ^d, for i = 1, . . . , n. The usefulness of a specific choice for the local model parameters w^(i) is measured by a local loss function L_i(w^(i)), for i = 1, . . . , n. Note that we might use different local loss functions
we can use some cost or penalty function ϕ(w^(i) − w^(i′)) that satisfies basic requirements such as being a (semi-)norm [33].
The choice of penalty ϕ(·) has a crucial impact on the computational and statistical properties of the FL methods arising from the design principle introduced in Section 3.3. Unless stated otherwise, we use the choice ϕ(·) := ∥·∥₂², i.e., we measure the discrepancy between local model parameters across an edge {i, i′} ∈ E by the squared Euclidean distance ∥w^(i) − w^(i′)∥₂². Summing the discrepancies (weighted by the edge weights) over all edges in the FL network yields the total variation of local model parameters

    ∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥₂². (3.3)
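As a quick illustration, the total variation (3.3) can be evaluated in a few lines of Python; the adjacency matrix and local model parameters below are hypothetical toy values:

```python
import numpy as np

# Hypothetical small FL network: nodes 1, 2, 3 (0-based indices), edges {1,2} and {1,3}.
# A is the symmetric weighted adjacency matrix; row i of w holds the local parameters w^(i).
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
w = np.array([[1.0, 2.0],
              [1.0, 0.0],
              [3.0, 2.0]])

def total_variation(A, w):
    """Evaluate (3.3): sum of A[i,j] * ||w[i] - w[j]||^2 over all edges {i,j}."""
    n = A.shape[0]
    tv = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each undirected edge is counted once
            tv += A[i, j] * np.sum((w[i] - w[j]) ** 2)
    return tv

print(total_variation(A, w))  # edge {0,1} contributes 4, edge {0,2} contributes 4
```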
Besides inspecting its (maximum) node degrees, we can also study the connectivity of G via the eigenvalues and eigenvectors of its Laplacian matrix L^(G) ∈ ℝ^{n×n}.6 The Laplacian matrix of an undirected weighted graph G is

6 The study of graphs via the eigenvalues and eigenvectors of associated matrices is the main subject of spectral graph theory [34, 35].
defined element-wise as (see Fig. 10.7)

    L^(G)_{i,i′} := { −A_{i,i′} for i ≠ i′, {i, i′} ∈ E;  ∑_{i′′≠i} A_{i,i′′} for i = i′;  0 else. } (3.6)

For example, the graph with nodes V = {1, 2, 3}, unit edge weights and edges {1, 2}, {1, 3} has the Laplacian matrix

    L^(G) = ( 2 −1 −1; −1 1 0; −1 0 1 ).
The Laplacian matrix is symmetric and psd, which follows from the identity

    w^T (L^(G) ⊗ I) w = ∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥₂²

    for any w := ((w^(1))^T, . . . , (w^(n))^T)^T =: stack{w^(i)}_{i=1}^n. (3.7)
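The identity (3.7) is easy to verify numerically. The sketch below builds the Laplacian (3.6) of a small assumed toy graph (node 1 connected to nodes 2 and 3 with unit weights) and compares both sides of (3.7):

```python
import numpy as np

# Adjacency matrix of a toy graph: node 1 connected to nodes 2 and 3 (unit weights).
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

# Laplacian (3.6): off-diagonal entries -A[i,j], diagonal entries equal to the node degrees.
L = np.diag(A.sum(axis=1)) - A

# Verify the identity (3.7) for local model parameters w^(i) in R^2, stacked into w.
d = 2
W = np.array([[1.0, 2.0], [1.0, 0.0], [3.0, 2.0]])
w = W.reshape(-1)  # stack of w^(1), w^(2), w^(3)

lhs = w @ np.kron(L, np.eye(d)) @ w
rhs = sum(A[i, j] * np.sum((W[i] - W[j]) ** 2)
          for i in range(3) for j in range(i + 1, 3))
print(lhs, rhs)  # both sides equal the total variation (3.3)
```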
The ordered eigenvalues λ_i(L^(G)) can be computed via the Courant–Fischer–Weyl (CFW) min-max characterization. In particular,

    λ_n(L^(G)) (CFW)= max_{v∈ℝ^n, ∥v∥=1} v^T L^(G) v (3.7)= max_{v∈ℝ^n, ∥v∥=1} ∑_{{i,i′}∈E} A_{i,i′} (v_i − v_{i′})², (3.10)

and

    λ_2(L^(G)) (CFW)= min_{v∈ℝ^n, v^T 1=0, ∥v∥=1} v^T L^(G) v (3.7)= min_{v∈ℝ^n, v^T 1=0, ∥v∥=1} ∑_{{i,i′}∈E} A_{i,i′} (v_i − v_{i′})². (3.11)
    w = stack{c} = (c^T, . . . , c^T)^T, with some c ∈ ℝ^d, (3.12)
• Consider the case λ_2 = 0: Here, besides the eigenvector (3.12), we can find at least one additional eigenvector

    w̃ = stack{w^(i)}_{i=1}^n with w^(i) ≠ w^(i′) for some i, i′ ∈ V, (3.13)
Here, avg{w^(i)}_{i=1}^n := (1/n) ∑_{i=1}^n w^(i) is the average of all local model parameters. The bound (3.14) follows from (3.7) and the CFW characterization of the eigenvalues of the matrix L^(G) ⊗ I. The quantity ∑_{i=1}^n ∥w^(i) − avg{w^(i)}_{i=1}^n∥₂² on the RHS of (3.14) has an interesting geometric interpretation: it is the squared Euclidean distance between w and the subspace of identical local model parameters.
Indeed, the projection P_S w of w ∈ ℝ^{nd} on S is given explicitly as

    P_S w = (a^T, . . . , a^T)^T, with a = avg{w^(i)}_{i=1}^n. (3.16)
The edge weights are chosen as A′_{i,i′} = A_{i,i′} for any edge {i, i′} ∈ E and A′_{i,i′} := 0 if the original FL network G does not contain an edge between nodes i, i′.
[Figure: the original FL network G and the modified FL network G′, both over the nodes 1, 2, 3, 4.]
3.3 Generalized Total Variation Minimization
Consider some FL network G whose nodes i ∈ V carry local datasets D^(i) and local models parametrized by the vectors w^(i) ∈ ℝ^d. We learn these local model parameters by minimizing their local loss while, at the same time, enforcing a small total variation. Requiring a small total variation of the learnt local model parameters enforces them to be (approximately) constant over well-connected subsets of nodes ("clusters").
To optimally balance local loss and total variation, we solve the generalized total variation (GTV) minimization problem

    stack{ŵ^(i)}_{i=1}^n ∈ argmin_{stack{w^(i)}_{i∈V}} ∑_{i=1}^n L_i(w^(i)) + α ∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥₂² (GTVMin). (3.18)
Note that GTVMin is an instance of regularized empirical risk minimization
(RERM): The regularizer is the GTV of local model parameters over the
weighted edges Ai,i′ of the FL network. Clearly, the FL network is an important
design choice for GTVMin-based methods. This choice can be guided by
computational aspects and statistical aspects of GTVMin-based FL systems.
Some application domains allow us to leverage domain expertise to guess a useful choice for the FL network. If local datasets are generated at different geographic locations, we might use nearest-neighbour graphs based on geodesic distances between data generators (e.g., FMI weather stations). Lecture 7 will also discuss graph learning methods that determine the edge weights A_{i,i′} in a data-driven fashion, i.e., directly from the local datasets D^(i), D^(i′).
Let us now consider the special case of GTVMin where each local model is a linear model. For each node i ∈ V of the FL network, we want to learn the parameters w^(i) of a linear hypothesis h^(i)(x) := (w^(i))^T x. We measure the
quality of the parameters via the average squared error loss

    L_i(w^(i)) := (1/m_i) ∑_{r=1}^{m_i} (y^(i,r) − (w^(i))^T x^(i,r))² (3.2)= (1/m_i) ∥y^(i) − X^(i) w^(i)∥₂². (3.19)
The identity (3.7) allows us to rewrite (3.20) using the Laplacian matrix L^(G) as

    stack{ŵ^(i)}_{i∈V} ∈ argmin_{w=stack{w^(i)}_{i∈V}} ∑_{i∈V} (1/m_i) ∥y^(i) − X^(i) w^(i)∥₂² + α w^T (L^(G) ⊗ I_d) w, (3.21)

with Q^(i) = (1/m_i) (X^(i))^T X^(i), and q^(i) := (−2/m_i) (X^(i))^T y^(i).
Thus, like linear regression (2.7) and ridge regression (2.27), GTVMin (3.21) (for local linear models H^(i)) also minimizes a convex quadratic function,

    stack{ŵ^(i)}_{i=1}^n ∈ argmin_{w=stack{w^(i)}_{i=1}^n} w^T Q w + q^T w. (3.23)
Here, we used the psd matrix

    Q := diag(Q^(1), Q^(2), . . . , Q^(n)) + α L^(G) ⊗ I, with Q^(i) := (1/m_i) (X^(i))^T X^(i), (3.24)

i.e., the block-diagonal matrix with blocks Q^(1), . . . , Q^(n) plus the scaled Kronecker product α L^(G) ⊗ I, and the vector

    q := ((q^(1))^T, . . . , (q^(n))^T)^T, with q^(i) := (−2/m_i) (X^(i))^T y^(i). (3.25)
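The construction of Q and q in (3.24), (3.25) can be sketched in Python as follows. The local datasets and the chain-shaped FL network below are hypothetical toy data; the final line solves the zero-gradient condition (3.26):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3, 2, 20          # nodes, feature dimension, local sample size (assumed values)
alpha = 0.5

# Hypothetical local datasets and a chain-shaped FL network 1 - 2 - 3.
X_loc = [rng.normal(size=(m, d)) for _ in range(n)]
y_loc = [rng.normal(size=m) for _ in range(n)]
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = np.diag(A.sum(axis=1)) - A   # Laplacian of the FL network

# Assemble Q (3.24): block-diagonal blocks Q^(i) = (1/m_i) X^(i)^T X^(i) plus alpha * L kron I.
Q = np.zeros((n * d, n * d))
for i in range(n):
    Q[i*d:(i+1)*d, i*d:(i+1)*d] = X_loc[i].T @ X_loc[i] / m
Q += alpha * np.kron(L, np.eye(d))

# Assemble q (3.25) from q^(i) = (-2/m_i) X^(i)^T y^(i).
q = np.concatenate([-2.0 / m * X_loc[i].T @ y_loc[i] for i in range(n)])

# Solve the zero-gradient condition (3.26): Q w = -(1/2) q.
w_hat = np.linalg.solve(Q, -0.5 * q)
print(w_hat.reshape(n, d))  # row i holds the learnt local parameters w^(i)
```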
    prox_{L,ρ}(w) := argmin_{w′∈ℝ^d} L(w′) + (ρ/2) ∥w − w′∥₂² for some ρ > 0.
Some authors refer to functions L for which prox_{L,ρ}(w) can be computed easily as simple or proximable [37]. GTVMin with proximable loss functions can be solved quite efficiently via proximal algorithms [38].
Besides influencing the choice of optimization method, the design choices
underlying GTVMin also determine the amount of computation that is re-
quired by a given optimization method. For example, using an FL network
with relatively few edges (“sparse graphs”) typically results in a smaller com-
putational complexity. Indeed, Lecture 5 discusses GTVMin-based algorithms
requiring an amount of computation that is proportional to the number of
edges in the FL network.
Let us now consider the computational aspects of GTVMin (3.20) used to train local linear models. As discussed above, this instance is equivalent to solving (3.23). Any solution ŵ of (3.23) (and, in turn, of (3.20)) is characterized by the zero-gradient condition

    Q ŵ = −(1/2) q. (3.26)

One basic method to solve (3.26) is to repeat the gradient step

    ŵ_{k+1} := ŵ_k − η (2 Q ŵ_k + q) for k = 0, 1, . . . .
8 How many arithmetic operations (addition, multiplication) do you think are required to invert an arbitrary matrix Q ∈ ℝ^{d×d}?
The gradient step results in updated local model parameters ŵ^(i), which we stack into

    ŵ_{k+1} := ((ŵ^(1))^T, . . . , (ŵ^(n))^T)^T.
How useful are the solutions of GTVMin (3.18) as a choice for the local model parameters? To answer this question, we use, as for the statistical analysis of ERM in Lecture 2, a probabilistic model for the local datasets. In particular, we use a variant of an i.i.d. assumption: each local dataset D^(i) consists of data points whose features and labels are realizations of i.i.d. RVs,

    y^(i) = (x^(i,1), . . . , x^(i,m_i))^T w^(i) + ε^(i), with x^(i,r) i.i.d. ∼ N(0, I) and ε^(i) ∼ N(0, σ² I), (3.27)

where (x^(i,1), . . . , x^(i,m_i))^T is the local feature matrix X^(i).
In contrast to the probabilistic model (2.13) (which we used for the analysis
of ERM), the probabilistic model (3.27) allows for different node-specific
parameters w(i) , for i ∈ V. In particular, the entire dataset obtained from
pooling all local datasets does not conform to an i.i.d. assumption.
In what follows, we focus on the GTVMin instance (3.20) to learn the parameters w^(i) of a local linear model for each node i ∈ V. For a reasonable choice of FL network, the parameters w^(i), w^(i′) at connected nodes {i, i′} ∈ E should be similar. However, we cannot choose the edge weights based on the parameters w^(i), as they are unknown. We can only use (noisy) estimates for w^(i), obtained from the features and labels of the data points in the local datasets (see Lecture 7).
Consider an FL network with nodes carrying local datasets generated
from the probabilistic model (3.27) with true model parameters w(i) . For
ease of exposition, we assume that
Proposition 3.1. Consider a connected FL network, i.e., λ_2 > 0 (see (3.9)), and the solution (3.29) to GTVMin (3.20) for the local datasets (3.27). If the true local model parameters in (3.27) are identical (see (3.28)), we can upper bound the deviation w̃^(i) := ŵ^(i) − (1/n) ∑_{i=1}^n ŵ^(i) of the learnt model parameters
• the properties of local datasets, via the noise terms ε(i) in (3.27),
3.4 Overview of Coding Assignment
• The computation of solutions ŵ^(i) to (3.32) via (3.26) and using the Python function [Link]. To this end, you should determine the matrix Q and vector q in terms of the local datasets D^(i) and the Laplacian matrix L^(FMI) of the FL network G^(FMI).

• Studying the effect of varying values of α in (3.32) on the local loss and the total variation of the corresponding solutions ŵ^(i).
3.5 Proofs
Let us introduce the shorthand f({w^(i)}) for the objective function of the GTVMin instance (3.20). Evaluating it at the true local model parameters yields

    f({w^(i)}) = ∑_{i∈V} (1/m_i) ∥y^(i) − X^(i) w^(i)∥₂² + α ∑_{{i,i′}∈E} A_{i,i′} ∥w^(i) − w^(i′)∥₂²
    (3.28)= ∑_{i∈V} (1/m_i) ∥y^(i) − X^(i) w^(i)∥₂²
    (3.27)= ∑_{i∈V} (1/m_i) ∥X^(i) w^(i) + ε^(i) − X^(i) w^(i)∥₂²
    = ∑_{i∈V} (1/m_i) ∥ε^(i)∥₂². (3.33)
On the other hand, evaluating f at the GTVMin solution yields

    f({ŵ^(i)}) = ∑_{i∈V} (1/m_i) ∥y^(i) − X^(i) ŵ^(i)∥₂² + α ∑_{{i,i′}∈E} A_{i,i′} ∥ŵ^(i) − ŵ^(i′)∥₂²
    ≥ α ∑_{{i,i′}∈E} A_{i,i′} ∥w̃^(i) − w̃^(i′)∥₂²   (since the first sum is ≥ 0 and, by (3.29), ŵ^(i) − ŵ^(i′) = w̃^(i) − w̃^(i′))
    (3.14)≥ α λ_2 ∑_{i=1}^n ∥w̃^(i)∥₂². (3.34)
If the bound (3.31) did not hold, then by (3.34) and (3.33) we would obtain f({ŵ^(i)}) > f({w^(i)}). This contradicts the fact that {ŵ^(i)} solves (3.20).
4 Gradient Methods
Chapter 3 introduced GTVMin as a central design principle for FL methods.
Many significant instances of GTVMin minimize a smooth objective function
f (w) over the parameter space (typically a subset of Rd ). This chapter
explores gradient-based methods, a widely used family of iterative algorithms
to find the minimum of a smooth function. These methods approximate the
objective function locally using its gradient at the current choice for the model
parameters. Chapter 5 focuses on FL algorithms obtained from applying
gradient-based methods to solve GTVMin.
4.2 Gradient Descent
    f(w) = w^T Q w + q^T w. (4.1)
Note that (4.1) defines an entire family of convex quadratic functions f (w).
Each member of this family is specified by a psd matrix Q ∈ Rd×d and a
vector q ∈ Rd .
We have already encountered some ML and FL methods that minimize
an objective function of the form (4.1): Linear regression (2.3) and ridge
regression (2.27) in Chapter 2 as well as GTVMin (3.20) for local linear
models in Chapter 3. Moreover, (4.1) might serve as a useful approximation
for the objective functions arising from larger classes of ML methods [41–43].
Given model parameters w^(k), we want to update them towards a minimum of (4.1). To this end, we use the gradient ∇f(w^(k)) to locally approximate f(w) (see Figure 4.1). Since the gradient ∇f(w^(k)) indicates the direction in which f increases most steeply, we take a step in the opposite direction,

    w^(k+1) := w^(k) − η ∇f(w^(k)) (4.1)= w^(k) − η (2 Q w^(k) + q). (4.2)
The gradient step (4.2) involves the factor η, which we refer to as the step size or learning rate. Algorithm 4.1 summarizes the most basic instance of gradient-based methods, which simply repeats (iterates) (4.2) until some stopping criterion is met.
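A minimal Python sketch of this basic gradient descent for a toy instance of (4.1) might look as follows; the matrix Q and vector q are assumed values, and the stopping criterion monitors the decrease of the objective:

```python
import numpy as np

# A concrete instance of (4.1): convex quadratic with a psd matrix Q (assumed toy values).
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
q = np.array([-1.0, 2.0])

def gradient_descent(Q, q, eta=0.1, tol=1e-9, k_max=10_000):
    """Repeat the gradient step (4.2) until the decrease of f falls below tol."""
    f = lambda w: w @ Q @ w + q @ w
    w = np.zeros(len(q))
    for _ in range(k_max):
        w_new = w - eta * (2 * Q @ w + q)   # gradient step (4.2)
        if f(w) - f(w_new) < tol:           # stopping criterion: negligible decrease
            return w_new
        w = w_new
    return w

w_gd = gradient_descent(Q, q)
w_exact = np.linalg.solve(2 * Q, -q)   # zero-gradient condition 2Qw + q = 0
print(w_gd, w_exact)
```

For the toy matrix above, the learning rate η = 0.1 is small enough that the contraction factor is below one, so the iterates converge to the unique minimizer.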
[Figure 4.1: the objective function f(w) and its local linear approximation f(w^(k)) + (w − w^(k))^T ∇f(w^(k)) around w^(k).]
regression (2.27) is

    ∇f(w) = −(2/m) ∑_{r=1}^{m} x^(r) (y^(r) − w^T x^(r)) + 2αw.
Note that Algorithm 4.1, like most other gradient-based methods, involves
two hyper-parameters: (i) the learning rate η used for the gradient step and
(ii) a stopping criterion to decide when to stop repeating the gradient step.
We next discuss how to choose these hyper-parameters.
The learning rate must not be too large to avoid moving away from the
optimum by overshooting (see Figure 4.2-(a)). On the other hand, if the
learning rate is chosen too small, the gradient step makes too little progress
towards the solutions of (4.1) (see Figure 4.2-(b)). Note that in practice we
can only afford to repeat the gradient step for a finite number of iterations.
Figure 4.2: Effect of inadequate learning rates η in the gradient step (4.2). (a)
If η is too large, the gradient steps might “overshoot” such that the iterates
w(k) might diverge from the optimum, i.e., f (w(k+1) ) > f (w(k) )! (b) If η is
too small, the gradient steps make very little progress towards the optimum
or even fail to reach the optimum at all.
One approach to choose the learning rate is to start with some initial
value (first guess) and monitor the decrease of the objective function. If
this decrease does not agree with the decrease predicted by the (local linear
approximation using the) gradient, we decrease the learning rate by a constant
factor. After we decrease the learning rate, we re-consider the decrease of the
objective function. We repeat this procedure until a sufficient decrease of the
objective function is achieved [45, Sec 6.1].
Alternatively, we can use a prescribed sequence (schedule) ηk , for k =
1, 2, . . . , of learning rates that vary across successive gradient steps [46]. We
can guarantee convergence of (4.2) (under mild assumptions on the objective
function f (w)) by using a learning rate schedule that satisfies [45, Sec. 6.1]
    lim_{k→∞} η_k = 0,  ∑_{k=1}^∞ η_k = ∞,  and  ∑_{k=1}^∞ η_k² < ∞. (4.3)
The first condition in (4.3) requires that the learning rate eventually becomes sufficiently small to avoid overshooting. The third condition in (4.3) ensures
that this required decay of the learning rate does not take “forever”. Note
that the first and third condition in (4.3) could be satisfied by the trivial
learning rate schedule ηk = 0 which is clearly not useful as the gradient step
has no effect.
The trivial schedule ηk = 0 is ruled out by the middle condition of (4.3).
This middle condition ensures that the learning rate ηk is large enough such
that the gradient steps make sufficient progress towards a minimizer of the
objective function.
We emphasize that the conditions (4.3) do not involve any properties of the matrix Q in (4.1). Note that the matrix Q is determined by data points (see, e.g., (2.3)) whose statistical properties typically can be controlled only to a limited extent (e.g., via data normalization).
For the stopping criterion we might use a fixed number kmax of iterations.
This hyper-parameter kmax might be dictated by limited resources (such as
computational time) for implementing GD or tuned via validation techniques.
We can obtain another stopping criterion from monitoring the decrease
in the objective function f w(k) : we stop repeating the gradient step (4.2)
in (4.1). However, they might result in sub-optimal use of computational resources by implementing "useless" additional gradient steps. We can use information about the psd matrix Q in (4.1) to avoid unnecessary computations.9 Indeed, the choice for the learning rate η and the stopping criterion can be guided by the eigenvalues
    ∥w^(k+1) − ŵ∥₂ ≤ κ^(η)(Q) ∥w^(k) − ŵ∥₂. (4.6)
9 For linear regression (2.7), the matrix Q is determined by the features of the data points in the training set. We can influence the properties of Q to some extent by feature transformation methods. One important example for such a transformation is the normalization of features.

10 What are sufficient conditions for the local datasets and the edge weights used in GTVMin such that Q in (3.24) is invertible?
The contraction factor depends on the learning rate η, which is a hyper-parameter of gradient-based methods that we can control. However, the contraction factor also depends on the eigenvalues of the matrix Q in (4.1). In ML and FL applications, this matrix typically depends on data and can be controlled only to some extent, e.g., using feature transformations [6, Ch. 5]. We can ensure κ^(η)(Q) < 1 if we use a positive learning rate η_k < 1/U.
For a contraction factor κ^(η)(Q) < 1, a sufficient condition on the number k^(ε^(tol)) of gradient steps required to ensure an optimization error ∥w^(k+1) − ŵ∥₂ ≤ ε^(tol) is (see (4.6))

    k ≥ log(∥w^(0) − ŵ∥₂ / ε^(tol)) / log(1/κ^(η)(Q)). (4.8)
According to (4.6), smaller values of the contraction factor κ^(η)(Q) guarantee faster convergence of (4.2) towards the solution of (4.1). Figure 4.3 illustrates the dependence of κ^(η)(Q) on the learning rate η. Thus, choosing a small η (close to 0) will typically result in a larger κ^(η)(Q) and, in turn, require more iterations to ensure the optimization error level ε^(tol) via (4.6).
We can minimize this contraction factor by choosing the learning rate (see Figure 4.3)

    η^(∗) := 1/(λ_1 + λ_d). (4.9)

Note that evaluating (4.9) requires knowing the extremal eigenvalues λ_1, λ_d of Q. Inserting the optimal learning rate (4.9) into (4.6) yields

    ∥w^(k+1) − ŵ∥₂ ≤ ((λ_d/λ_1) − 1)/((λ_d/λ_1) + 1) ∥w^(k) − ŵ∥₂, with ((λ_d/λ_1) − 1)/((λ_d/λ_1) + 1) =: κ^∗(Q). (4.10)
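The contraction behaviour (4.10) can be checked numerically. The sketch below uses an assumed diagonal matrix Q; the optimal learning rate (4.9) and the contraction factor (4.12) are computed from the eigenvalues of Q:

```python
import numpy as np

# Convex quadratic (4.1) with an invertible psd matrix Q (assumed toy values).
Q = np.diag([0.5, 1.0, 4.0])
q = np.array([1.0, -2.0, 0.5])
w_opt = np.linalg.solve(2 * Q, -q)

lam = np.linalg.eigvalsh(Q)       # eigenvalues in ascending order
lam1, lamd = lam[0], lam[-1]
eta_opt = 1.0 / (lam1 + lamd)                       # optimal learning rate (4.9)
kappa_opt = (lamd / lam1 - 1) / (lamd / lam1 + 1)   # optimal contraction factor (4.12)

# Check (4.10): each gradient step shrinks the error norm by at most kappa_opt.
w = np.array([3.0, 3.0, 3.0])
for _ in range(5):
    err_before = np.linalg.norm(w - w_opt)
    w = w - eta_opt * (2 * Q @ w + q)
    assert np.linalg.norm(w - w_opt) <= kappa_opt * err_before + 1e-12
print(kappa_opt)
```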
Carefully note that the formula (4.10) is valid only if the matrix Q in
(4.1) is invertible, i.e., if λ1 > 0. If the matrix Q is singular (λ1 = 0), the
[Figure 4.3 plots the curves |1 − η2λ_1| and |1 − η2λ_d| over η; their maximum κ^(η)(Q) attains its minimum κ^∗(Q) = ((λ_d/λ_1) − 1)/((λ_d/λ_1) + 1) at η^∗ = 1/(λ_1 + λ_d).]
Figure 4.3: The contraction factor κ(η) (Q) (4.7), used in the upper bound
(4.6), as a function of the learning rate η. Note that κ(η) (Q) also depends on
the eigenvalues of the matrix Q in (4.1).
It is interesting to note that for linear regression, the matrix Q depends only on the features x^(r) of the data points in the training set (see (2.17)) but not on their labels y^(r). Thus, the convergence of the gradient steps is only affected by the features, whereas the labels are irrelevant. The same is true for ridge regression and GTVMin (using local linear models).
Note that both the optimal learning rate (4.9) and the optimal contraction factor

    κ^∗(Q) := ((λ_d/λ_1) − 1)/((λ_d/λ_1) + 1) (4.12)
depend on the eigenvalues of the matrix Q in (4.1). According to (4.10),
the ideal case is when all eigenvalues are identical which leads, in turn, to
a contraction factor κ∗ (Q) = 0. Here, a single gradient step arrives at the
unique solution of (4.1).
In general, we do not have full control over the matrix Q and its eigenvalues.
For example, the matrix Q arising in linear regression (2.7) is determined
by the features of data points in the training set. These features might be
obtained from sensing devices and therefore beyond our control. However,
other applications might allow for some design freedom in the choice of feature
vectors. We might also use feature transformations that nudge the resulting
Q in (2.7) more towards a scaled identity matrix.
Consider the gradient step (4.2) used to find a minimum of (4.1). We again assume that the matrix Q in (4.1) is invertible (λ_1(Q) > 0) and, in turn, that (4.1) has a unique solution ŵ.
In some applications, it might be difficult to evaluate the gradient ∇f(w) = 2Qw + q of (4.1) exactly. For example, this evaluation might require gathering data points from distributed storage locations. These storage locations might become unavailable during the computation of ∇f(w) due to software or hardware failures (e.g., limited connectivity).
We can model imperfections during the computation of (4.2) as the perturbed gradient step

    w^(k+1) (4.1)= w^(k) − η (2 Q w^(k) + q) + ε^(k). (4.13)
We can use the contraction factor (4.7) to upper bound the deviation between w^(k) and the optimum ŵ as (see (4.6))

    ∥w^(k) − ŵ∥₂ ≤ (κ^(η)(Q))^k ∥w^(0) − ŵ∥₂ + ∑_{k′=1}^{k} (κ^(η)(Q))^{k′} ∥ε^(k−k′)∥₂. (4.14)
This bound applies for any number of iterations k = 1, 2, . . . of the perturbed
gradient step (4.13).
Finally, note that the perturbed gradient step (4.13) could also be used
as a tool to analyze the (exact) gradient step for an objective function f˜(w)
which does not belong to the family (4.1) of convex quadratic functions.
Indeed, we can write the gradient step for minimizing f˜(w) as
The last identity is valid for any choice of surrogate function f (w). In
particular, we can choose f (w) as a convex quadratic function (4.1) that
approximates f˜(w). Note that the perturbation term ε(k) is scaled by the
learning rate η.
4.6 Handling Constraints - Projected Gradient Descent
Let us now show how to adapt the gradient step (4.2) to solve the constrained problem

    f^∗ = min_{w∈C} w^T Q w + q^T w. (4.15)
    w^(k+1) (4.1)= P_C(w^(k) − η (2 Q w^(k) + q)). (4.17)
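As a small illustration of the projected gradient step (4.17), the sketch below minimizes a toy quadratic over the Euclidean unit ball, an assumed choice of constraint set C whose projection is easy to compute:

```python
import numpy as np

# Minimize a quadratic of the form (4.15) over the unit ball C = {w : ||w|| <= 1}.
Q = np.eye(2)
q = np.array([-4.0, 0.0])   # unconstrained minimum at (2, 0), which lies outside C

def project_unit_ball(w):
    """Projection P_C onto the unit ball: rescale w if it lies outside."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

w = np.zeros(2)
eta = 0.1
for _ in range(200):
    w = project_unit_ball(w - eta * (2 * Q @ w + q))   # projected gradient step (4.17)
print(w)  # approaches the constrained minimizer (1, 0)
```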
1. compute the ordinary gradient step w^(k) ↦ w^(k) − η ∇f(w^(k)), and
2. project the result onto the constraint set C to obtain w^(k+1) (see (4.17)).
Note that we re-obtain the basic gradient step (4.2) from the projected
gradient step (4.17) for the specific constraint set C = Rd .
[Figure: the projected gradient step: an ordinary gradient step w^(k) ↦ w^(k) − η∇f(w^(k)) is followed by the projection P_C(·) onto the constraint set C, yielding w^(k+1).]
The approaches for choosing the learning rate η and stopping criterion for
basic gradient step (4.2) explained in Sections 4.3 and 4.4 work also for the
projected gradient step (4.17). In particular, the convergence speed of the
projected gradient step is also characterized by (4.6) [45, Ch. 6]. This follows
from the fact that the concatenation of a contraction (such as the gradient
step (4.2) for sufficiently small η) and a projection (such as P_C(·)) again results in a contraction. Note, however, that the projected gradient step might require significantly more computation than the basic gradient step, as it requires computing the projection (4.16).
of the function f (w) around the location w = w(k) (see Figure 4.1).
Let us modify (4.18) by using f(w) itself (instead of an approximation),

    w^(k+1) = argmin_{w∈ℝ^d} (1/(2η)) ∥w − w^(k)∥₂² + f(w). (4.19)
Like the gradient step, (4.19) also maps a given vector w^(k) to an updated vector w^(k+1). Note that (4.19) is nothing but the proximal operator of the function f(w) [38]. Similar to the role of the gradient step as the main building block of gradient-based methods, the proximal operator (4.19) is the main building block of proximal algorithms [38].
To obtain a version of (4.19) for a non-parametric model, we need to be
able to evaluate its objective function directly in terms of a hypothesis h
14
instead of its parameters w. The objective function in (4.19) consists of two components. The second component f(·), which is the function we want to minimize, is obtained from the training error incurred by a hypothesis, which might be parametrized as h^(w). Thus, we can evaluate the function f(h) by computing the training error for a given hypothesis.
The first component of the objective function in (4.19) uses ∥w − w^(k)∥₂² to measure the difference between the hypothesis maps h^(w) and h^(w^(k)). Another measure for the difference between two hypothesis maps can be obtained by using some test dataset D′ = {x^(1), . . . , x^(m′)}: the average squared difference

    (1/m′) ∑_{r=1}^{m′} (h(x^(r)) − h^(k)(x^(r)))² (4.20)

is a measure for the difference between h and h^(k). Note that the measure (4.20) does not require any model parameters but only the predictions delivered by the hypothesis maps on D′.
It is interesting to note that (4.20) coincides with ∥w − w^(k)∥₂² for the linear model h^(w)(x) := w^T x and a specific construction of the dataset D′. This construction uses the realizations x^(1), x^(2), . . . of i.i.d. RVs with a common probability distribution x ∼ N(0, I). Indeed, by the law of large numbers,

    lim_{m′→∞} (1/m′) ∑_{r=1}^{m′} (h^(w)(x^(r)) − h^(w^(k))(x^(r)))²
    = lim_{m′→∞} (1/m′) ∑_{r=1}^{m′} ((w − w^(k))^T x^(r))²
    = lim_{m′→∞} (1/m′) ∑_{r=1}^{m′} (w − w^(k))^T x^(r) (x^(r))^T (w − w^(k))
    = (w − w^(k))^T [lim_{m′→∞} (1/m′) ∑_{r=1}^{m′} x^(r) (x^(r))^T] (w − w^(k))   (the bracketed limit equals I)
    = ∥w − w^(k)∥₂². (4.21)
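The limit (4.21) can be checked by a simple Monte Carlo experiment; the parameter vectors below are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w = np.array([1.0, -2.0, 0.5])      # hypothetical current parameters
w_k = np.array([0.0, 1.0, 1.0])     # hypothetical reference parameters w^(k)

# Draw a large test set D' of i.i.d. standard Gaussian feature vectors.
m_prime = 200_000
X_test = rng.normal(size=(m_prime, d))

# Average squared difference (4.20) between the linear hypotheses h^(w) and h^(w^(k)).
avg_sq_diff = np.mean((X_test @ w - X_test @ w_k) ** 2)
print(avg_sq_diff, np.sum((w - w_k) ** 2))  # close for large m'
```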
To arrive at our generalization of the gradient step, we replace ∥w − w^(k)∥₂² in (4.19) with the measure (4.20).
4.8 Overview of Coding Assignment
• Generate numpy arrays X, y, whose r-th rows hold, respectively, the features x^(r) and the label y^(r) of the r-th data point in the csv file.

• Split the dataset into a training set and a validation set. The size of the training set should be 100.
• To develop some intuition for the behaviour of Algorithm 4.1, you could try out different choices for the learning rate η and the maximum number k_max of gradient steps. For each choice, you could monitor and plot the objective function value f(w^(k)) for the model parameters as a function of the iteration number k.
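The steps above might be sketched as follows; since the csv file is not specified here, synthetic data serves as a stand-in, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the csv data (the actual file is not specified here).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=150)

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=100, random_state=0)
m = X_train.shape[0]

def f(w):  # average squared error loss on the training set
    return np.mean((y_train - X_train @ w) ** 2)

# Algorithm 4.1 with a fixed learning rate eta and k_max gradient steps.
eta, k_max = 0.05, 500
w = np.zeros(3)
objective_values = []
for k in range(k_max):
    grad = -(2.0 / m) * X_train.T @ (y_train - X_train @ w)
    w = w - eta * grad
    objective_values.append(f(w))  # monitor f(w^(k)) across iterations

print(objective_values[-1], np.mean((y_val - X_val @ w) ** 2))
```

Plotting objective_values over k (e.g., with matplotlib) visualizes the effect of different learning rates.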
5 FL Algorithms
Chapter 3 introduced GTVMin as a flexible design principle for FL methods
that arise from different design choices for the local models and edge weights
of the FL network. The solutions of GTVMin are local model parameters
that strike a balance between the loss incurred on local datasets and the total
variation. This chapter applies the gradient-based methods from Chapter 4 to
solve GTVMin. The resulting FL algorithms can be implemented by message passing over the edges of the FL network. The details of how this message passing is implemented physically (e.g., via short-range wireless technology) are beyond the scope of this course.
Section 5.2 studies the gradient step for the GTVMin instance obtained for training local linear models. In particular, we show how the convergence rate of the gradient step can be characterized by properties of the local datasets and their FL network. Section 5.3 spells out the gradient step in the form of message passing over the FL network. Section 5.4 presents an FL algorithm that is obtained by replacing the exact GD with a stochastic approximation. Section 5.5 discusses FL algorithms for the single-model setting where we want to train a single global model in a distributed fashion.
• know some motivation for stochastic gradient descent (SGD),
(5.1)
(5.2)
with Q^(i) = (1/m_i) (X^(i))^T X^(i), and q^(i) := (−2/m_i) (X^(i))^T y^(i).
Therefore, the discussion and analysis of gradient-based methods from Chapter 4 also apply to GTVMin (5.1). In particular, we can use the gradient step

    w^(k+1) (4.1)= w^(k) − η (2 Q w^(k) + q). (5.3)
Proposition 5.2. Consider the matrix Q in (5.2). If λ_2(L^(G)) > 0 (i.e., the FL network in (5.1) is connected) and λ̄_min > 0 (i.e., the average of the matrices Q^(i) is non-singular), then the matrix Q is invertible and its smallest eigenvalue is lower bounded as

    λ_1(Q) ≥ (1/(1 + ρ²)) min{λ_2(L^(G)) α ρ², λ̄_min/2}. (5.6)

Here, we used the shorthand ρ := λ̄_min/(4 λ_max) (see (5.4)).
Prop. 5.2 and Prop. 5.1 can provide some guidance for the design choices of GTVMin. According to the convergence analysis of gradient-based methods in Chapter 4, the eigenvalue λ_1(Q) should be close to λ_{dn}(Q) to ensure fast convergence of the gradient step (5.3).
L_1(w) := 1000 (w + 5)² and L_2(w) := 1000 (w − 5)². Use a fixed learning rate η := 0.5 · 10⁻³ for the iteration (5.3).
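For this two-node toy problem, the iteration (5.3) can be sketched directly; the GTV parameter α and the edge weight are assumed values (α = 1, unit weight):

```python
import numpy as np

# Two-node FL network with unit edge weight; GTV parameter alpha is an assumed value.
alpha, A12 = 1.0, 1.0
eta = 0.5e-3

def grad_L1(w):  # gradient of L1(w) = 1000 (w + 5)^2
    return 2000.0 * (w + 5.0)

def grad_L2(w):  # gradient of L2(w) = 1000 (w - 5)^2
    return 2000.0 * (w - 5.0)

w1, w2 = 0.0, 0.0
for _ in range(1000):
    # gradient step (5.3) on the GTVMin objective L1(w1) + L2(w2) + alpha*A12*(w1 - w2)^2
    g1 = grad_L1(w1) + 2 * alpha * A12 * (w1 - w2)
    g2 = grad_L2(w2) + 2 * alpha * A12 * (w2 - w1)
    w1, w2 = w1 - eta * g1, w2 - eta * g2
print(w1, w2)  # close to -5 and +5: the weak coupling barely pulls the parameters together
```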
    w^(k) =: stack{w^(i,k)}_{i=1}^n. (5.8)
The update (5.9) consists of two components, denoted (I) and (II). The component (I) is nothing but the negative gradient −∇L_i(w^(i,k)) of the local loss L_i(w^(i)) := (1/m_i) ∥y^(i) − X^(i) w^(i)∥₂². Component (I) drives the local model parameters w^(i,k+1) towards the minimum of L_i(·), i.e., towards a small deviation between the labels y^(i,r) and the predictions (w^(i,k+1))^T x^(i,r). Note that we can rewrite the component (I) in (5.9) as

    (2/m_i) ∑_{r=1}^{m_i} x^(i,r) (y^(i,r) − (x^(i,r))^T w^(i,k)). (5.10)
The purpose of component (II) in (5.9) is to force the local model pa-
rameters to be similar across an edge {i, i′ } with large weight Ai,i′ . We
control the relative importance of (II) and (I) using the GTVMin parameter
α: Choosing a large value for α puts more emphasis on enforcing similar local
model parameters across the edges. Using a smaller α puts more emphasis
on learning local model parameters delivering accurate predictions (incurring
a small loss) on the local dataset.
Figure 5.1: At the beginning of iteration k, node i = 1 collects the cur-
rent local model parameters w(2,k) and w(3,k) from its neighbours. Then, it
computes the gradient step (5.9) to obtain the new local model parameters
w(1,k+1) . These updated parameters are then used in the next iteration for
the local updates at the neighbours i = 2, 3.
The execution of the gradient step (5.9) requires only local information at node i. Indeed, the update (5.9) at node i depends only on its current model parameters w^(i,k), the local loss function L_i(·), the neighbours' model parameters w^(i′,k), for i′ ∈ N^(i), and the corresponding edge weights A_{i,i′} (see Figure 5.1). In particular, the update (5.9) does not depend on any properties (model parameters or edge weights) of the FL network beyond the neighbours N^(i).
We obtain Algorithm 5.1 by repeating the gradient step (5.9), simulta-
neously for each node i ∈ V, until a stopping criterion is met. Algorithm
5.1 allows for potentially different learning rates ηk,i at different nodes i and
iterations k (see Section 4.3). It is important to note that Algorithm 5.1
Figure 5.2: Algorithm 5.1 alternates between message passing across the
edges of the FL network and updates of local model parameters.
5.4 FedSGD
of B randomly chosen data points from D^(i). While (5.10) requires summing over m_i data points, the approximation requires summing over only B (typically B ≪ m_i) data points.
Inserting the approximation (5.11) into the gradient step (5.9) yields the approximate gradient step

    w^(i,k+1) := w^(i,k) + η [ (2/B) ∑_{r∈B} x^(i,r) (y^(i,r) − (x^(i,r))^T w^(i,k)) + 2α ∑_{i′∈V∖{i}} A_{i,i′} (w^(i′,k) − w^(i,k)) ], (5.12)

where the first term approximates (5.10).
Algorithm 5.2 FedSGD for Local Linear Models
Input: FL network G; GTV parameter α; learning rates η_{k,i}; local datasets D^(i) = {(x^(i,1), y^(i,1)), . . . , (x^(i,m_i), y^(i,m_i))} for each node i; batch size B; some stopping criterion.
Output: linear model parameters ŵ^(i) at each node i ∈ V
Initialize: k := 0; w^(i,0) := 0
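A minimal Python sketch of Algorithm 5.2 might look as follows; the local datasets and the chain-shaped FL network are hypothetical toy data, and a fixed learning rate with a fixed number of iterations replaces the generic stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m, B = 3, 2, 50, 10     # nodes, dimension, local sample size, batch size (assumed)
alpha, eta = 0.1, 0.01

# Hypothetical local datasets sharing one true parameter vector, and a chain FL network.
w_true = np.array([1.0, -1.0])
X_loc = [rng.normal(size=(m, d)) for _ in range(n)]
y_loc = [X_loc[i] @ w_true + 0.1 * rng.normal(size=m) for i in range(n)]
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

W = np.zeros((n, d))          # row i holds w^(i,k); initialized to 0
for k in range(500):
    W_new = W.copy()
    for i in range(n):
        batch = rng.choice(m, size=B, replace=False)     # B random local data points
        Xb, yb = X_loc[i][batch], y_loc[i][batch]
        grad_loss = (2.0 / B) * Xb.T @ (yb - Xb @ W[i])  # stochastic approximation of (5.10)
        coupling = 2 * alpha * sum(A[i, j] * (W[j] - W[i]) for j in range(n))
        W_new[i] = W[i] + eta * (grad_loss + coupling)   # approximate gradient step (5.12)
    W = W_new
print(W)  # each row should be close to the shared true parameters
```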
5.5 FedAvg
Figure 5.3: Star graph G (star) with centre node representing a server and
peripheral nodes representing clients that generate local datasets.
11
Instead of using GTVMin with a connected FL network and a large value
of $\alpha$, we can also enforce identical local copies $\widehat{\mathbf{w}}^{(i)}$ via a constraint:

$$\widehat{\mathbf{w}} \in \arg\min_{\mathbf{w}\in\mathcal{C}} \sum_{i\in\mathcal{V}} (1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{w}^{(i)}\big\|_2^2 \quad\text{with } \mathcal{C} = \Big\{\mathbf{w} = \mathrm{stack}\big\{\mathbf{w}^{(i)}\big\}_{i=1}^{n} : \mathbf{w}^{(i)} = \mathbf{w}^{(i')} \text{ for any } i, i' \in \mathcal{V}\Big\}. \tag{5.13}$$

Note that the constraint set $\mathcal{C}$ is nothing but the subspace defined in (3.15).
The projection of a given collection of local model parameters $\mathbf{w} = \mathrm{stack}\{\mathbf{w}^{(i)}\}$
onto $\mathcal{C}$ is given by

$$P_{\mathcal{C}}\,\mathbf{w} = \big(\mathbf{v}^{T}, \ldots, \mathbf{v}^{T}\big)^{T} \quad\text{with } \mathbf{v} := (1/n)\sum_{i\in\mathcal{V}} \mathbf{w}^{(i)}. \tag{5.14}$$
We can solve (5.13) using projected GD from Chapter 4 (see Section 4.6).
The resulting projected gradient step for solving (5.13) is

$$\widehat{\mathbf{w}}^{(i)}_{k} := \mathbf{w}^{(i,k)} + \eta_{i,k}\,\underbrace{(2/m_i)\big(\mathbf{X}^{(i)}\big)^{T}\Big(\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{w}^{(i,k)}\Big)}_{\text{(local gradient step)}} \tag{5.15}$$

$$\mathbf{w}^{(i,k+1)} := (1/n)\sum_{i'\in\mathcal{V}} \widehat{\mathbf{w}}^{(i')}_{k} \quad\text{(projection)}. \tag{5.16}$$

• First, each node computes the update (5.15), i.e., a gradient step towards
a minimum of the local loss $\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{w}\big\|_2^2$.
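The alternation of (5.15) and (5.16) can be sketched in a few lines of numpy; the toy datasets, learning rate, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy setting: n nodes share a common true linear model
n, d, m = 4, 3, 20
w_true = np.array([1.0, -2.0, 0.5])
X = [rng.standard_normal((m, d)) for _ in range(n)]
y = [X_i @ w_true for X_i in X]              # noiseless local labels

w = np.zeros((n, d))
eta = 0.05
for k in range(500):
    # local gradient step (5.15) at every node
    w_hat = np.stack([
        w[i] + eta * (2 / m) * X[i].T @ (y[i] - X[i] @ w[i])
        for i in range(n)
    ])
    # projection / averaging step (5.16): all nodes get the average
    w = np.tile(w_hat.mean(axis=0), (n, 1))

print(np.round(w[0], 3))
```

With noiseless data generated from a common model, the iterates converge to the shared parameters `w_true`.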
The averaging step (5.16) might take much longer to execute than the
local update step (5.15). Indeed, (5.16) typically requires transmission of local
model parameters from every client i ∈ V to a server or central computing
unit. Thus, after the client $i \in \mathcal{V}$ has computed the local gradient step (5.15),
it must wait until the server (i) has collected the updates $\widehat{\mathbf{w}}^{(i)}_{k}$ from all clients
and (ii) sent back their average $\mathbf{w}^{(i,k+1)}$ to $i \in \mathcal{V}$.
Instead of using a computationally cheap gradient step (5.15),^{11} and then
being forced to wait for receiving $\mathbf{w}^{(i,k+1)}$ back from the server, a client
might "make better use" of its time. For example, the client $i$ could execute
several local gradient steps (5.15) in order to make more progress towards
the optimum. Another option is a local minimization of $L_i(\mathbf{v}) := (1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{v}\big\|_2^2$
around $\mathbf{w}^{(i,k)}$,

$$\widehat{\mathbf{w}}^{(i)}_{k} := \operatorname*{argmin}_{\mathbf{v}\in\mathbb{R}^{d}} \underbrace{(1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{v}\big\|_2^2}_{=L_i(\mathbf{v})} + (1/\eta)\big\|\mathbf{v} - \mathbf{w}^{(i,k)}\big\|_2^2. \tag{5.17}$$
Note that (5.17) is nothing but the proximal operator of Li (v) (see (4.19)).
We obtain Algorithm 5.3 from (5.15)-(5.16) by replacing the gradient step
(5.15) with the local minimization (5.17).
The local minimization step (5.17) is closely related to ridge regression.
Indeed, (5.17) is obtained from ridge regression (2.25) by replacing the regularizer
$R(\mathbf{v}) := \|\mathbf{v}\|_2^2$ with the regularizer $R(\mathbf{v}) := \big\|\mathbf{v} - \mathbf{w}^{(i,k)}\big\|_2^2$.
As the notation in (5.17) indicates, the parameter η plays a role similar
to the learning rate of a gradient step (4.2). It controls the size of the
^{11} For a large local dataset, the local gradient step (5.15) might actually be computationally
expensive and should be replaced by an approximation, e.g., based on the stochastic
gradient approximation (5.11).
neighbourhood of w(i,k) over which (5.17) optimizes the local loss function
Li (·). Choosing a small η forces the update (5.17) to not move too far from
the given model parameters w(i,k) .
We can interpret the update (5.17) as a form of ERM (2.3) for linear
regression. Indeed,

$$\begin{aligned} &\min_{\mathbf{v}\in\mathbb{R}^{d}}\; (1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{v}\big\|_2^2 + (1/\eta)\big\|\mathbf{w}^{(i,k)} - \mathbf{v}\big\|_2^2 \\ &= \min_{\mathbf{v}\in\mathbb{R}^{d}}\; (1/m_i)\sum_{r=1}^{m_i}\Big(y^{(i,r)} - \big(\mathbf{x}^{(i,r)}\big)^{T}\mathbf{v}\Big)^2 + (1/\eta)\big\|\mathbf{w}^{(i,k)} - \mathbf{v}\big\|_2^2 \\ &= \min_{\mathbf{v}\in\mathbb{R}^{d}}\; (1/m_i)\bigg[\sum_{r=1}^{m_i}\Big(y^{(i,r)} - \big(\mathbf{x}^{(i,r)}\big)^{T}\mathbf{v}\Big)^2 + (m_i/\eta)\sum_{r=1}^{d}\Big(w^{(i,k)}_{r} - \big(\mathbf{e}^{(r)}\big)^{T}\mathbf{v}\Big)^2\bigg]. \end{aligned} \tag{5.18}$$

Here, $\mathbf{e}^{(r)}$ denotes the vector obtained from the $r$-th column of the identity
matrix $\mathbf{I}$. The only difference between (5.18) and (2.3) is the presence of the
sample weights $(m_i/\eta)$.^{12}
^{12} The ".fit()" method of the Python class sklearn.linear_model.LinearRegression
allows specifying sample weights via the parameter sample_weight [27].
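The reformulation (5.18) can be checked numerically: the proximal update (5.17) equals a weighted least-squares fit over the local data points (weight $1$) augmented by the $d$ rows of the identity matrix (weight $m_i/\eta$). The sketch below uses invented toy data and plain numpy in place of sklearn:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, eta = 30, 4, 0.5
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)
w_k = rng.standard_normal(d)        # current local parameters w^(i,k)

# closed-form solution of (5.17):
#   (X^T X / m + I / eta) v = X^T y / m + w_k / eta
v_prox = np.linalg.solve(X.T @ X / m + np.eye(d) / eta,
                         X.T @ y / m + w_k / eta)

# weighted least squares as in (5.18): append the identity rows e^(r)
# with "labels" w_k[r]; their sample weight is m/eta (weight 1 for data).
sqrt_w = np.sqrt(m / eta)
X_aug = np.vstack([X, sqrt_w * np.eye(d)])
y_aug = np.concatenate([y, sqrt_w * w_k])
v_wls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(v_prox, v_wls))  # True
```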
Algorithm 5.3 FedAvg to train a linear model
Input: client list $\mathcal{V}$.
Server. Initialize: $k := 0$;

$$\mathbf{w}[k] := \big(1/|\mathcal{V}|\big) \sum_{i\in\mathcal{V}} \mathbf{w}^{(i)}.$$
5.7 Overview of Coding Assignment
This coding assignment builds on the coding assignment “ML Basics” (see
Section 2.7) and the coding assignment “FL Design Principle” (see Section
3.4). In particular, we consider an FL network $\mathcal{G}^{(\mathrm{FMI})}$ with each node $i \in \mathcal{V}$
being an FMI weather station. The node $i \in \mathcal{V}$ holds a local dataset $\mathcal{D}^{(i)}$ that
consists of $m_i$ data points. Each data point is a temperature measurement,
taken at station $i$, and characterized by $d = 7$ features $\mathbf{x} = \big(x_1, \ldots, x_7\big)^{T}$
and a label $y$ which is the temperature measurement itself. The features are
(normalized) values of the latitude and longitude of the FMI station as well
as the (normalized) year, month, day, hour, and minute when the temperature
was measured.
The edges of $\mathcal{G}^{(\mathrm{FMI})}$ are obtained using the Python function add_edges().
Each FMI station $i$ is connected to its nearest neighbours $i'$, using the
Euclidean distance between the corresponding vectors $\big(\mathrm{lat}^{(i)}, \mathrm{lon}^{(i)}\big)^{T} \in \mathbb{R}^2$.
The number of neighbours is controlled by the input parameter numneighbors.
All edges $\{i, i'\} \in \mathcal{E}$ have the same edge weight $A_{i,i'} = 1$.
Your tasks are the following:
• For each node $i \in \mathcal{V}$ (FMI station), add node attributes that store the
feature matrix $\mathbf{X}^{(i)}$ and label vector $\mathbf{y}^{(i)}$ (as numpy arrays). The $r$-th
row of these holds the features $\mathbf{x}^{(i,r)}$ and label $y^{(i,r)}$, respectively, of the
$r$-th data point recorded for FMI station $i$ in the csv file.
• For each node i ∈ V (FMI station), add a node attribute that stores
local model parameters w(i) (as a numpy array).
• Split each local dataset into a training set and a validation set.
• Try to develop some intuition for the effect of choosing different values
for the hyper-parameters such as GTVMin parameter α, learning rate
η or batch size in Algorithms 5.1 and 5.2.
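A minimal sketch of the first tasks; a plain Python dict stands in for the networkx node-attribute storage used in the assignment, and the toy arrays replace the csv data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for the FMI data; in the assignment these arrays would be
# read from the csv file and stored as networkx node attributes.
num_stations, d = 5, 7
G_FMI = {}
for i in range(num_stations):
    m_i = int(rng.integers(10, 20))               # data points at station i
    X_i = rng.standard_normal((m_i, d))           # rows: features x^(i,r)
    y_i = rng.standard_normal(m_i)                # entries: labels y^(i,r)
    # train/validation split of the local dataset
    perm = rng.permutation(m_i)
    split = int(0.8 * m_i)
    G_FMI[i] = {
        "X": X_i, "y": y_i,
        "w": np.zeros(d),                         # local model parameters w^(i)
        "train": perm[:split], "val": perm[split:],
    }

print(sorted(G_FMI[0].keys()))
```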
5.8 Proofs
The first inequality in (5.5) follows from well-known results on the eigenvalues
of a sum of symmetric matrices (see, e.g., [3, Thm 8.1.5]). In particular,
The second inequality in (5.5) uses the following upper bound on the maximum
eigenvalue $\lambda_n\big(\mathbf{L}^{(G)}\big)$ of the Laplacian matrix:

$$\begin{aligned} \lambda_n\big(\mathbf{L}^{(G)}\big) &\overset{(a)}{=} \max_{\mathbf{v}\in\mathcal{S}^{(n-1)}} \mathbf{v}^{T}\mathbf{L}^{(G)}\mathbf{v} \overset{(3.7)}{=} \max_{\mathbf{v}\in\mathcal{S}^{(n-1)}} \sum_{\{i,i'\}\in\mathcal{E}} A_{i,i'}\big(v_i - v_{i'}\big)^2 \\ &\overset{(b)}{\leq} \max_{\mathbf{v}\in\mathcal{S}^{(n-1)}} \sum_{\{i,i'\}\in\mathcal{E}} A_{i,i'}\, 2\big(v_i^2 + v_{i'}^2\big) \overset{(c)}{=} \max_{\mathbf{v}\in\mathcal{S}^{(n-1)}} \sum_{i\in\mathcal{V}} 2 v_i^2 \sum_{i'\in\mathcal{N}^{(i)}} A_{i,i'} \\ &\overset{(3.5)}{\leq} \max_{\mathbf{v}\in\mathcal{S}^{(n-1)}} \sum_{i\in\mathcal{V}} 2 v_i^2\, d^{(G)}_{\max} = 2\, d^{(G)}_{\max}. \end{aligned} \tag{5.20}$$

Here, step (a) uses the CFW characterization of eigenvalues [3, Thm. 8.1.2.] and step (b) uses
the inequality $(u+v)^2 \leq 2(u^2 + v^2)$ for any $u, v \in \mathbb{R}$. For step (c) we use the
identity $\sum_{i\in\mathcal{V}}\sum_{i'\in\mathcal{N}^{(i)}} f(i,i') = \sum_{\{i,i'\}\in\mathcal{E}} \big(f(i,i') + f(i',i)\big)$.
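The bound (5.20) is easy to check numerically; the following sketch (our own illustration, using a randomly generated weighted graph) verifies $\lambda_n\big(\mathbf{L}^{(G)}\big) \leq 2\, d^{(G)}_{\max}$:

```python
import numpy as np

rng = np.random.default_rng(4)

# random weighted graph on n nodes
n = 8
A = rng.random((n, n))
A = np.triu(A, 1)
A = A + A.T                         # symmetric edge weights A_{i,i'}
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L^(G)

lam_max = np.linalg.eigvalsh(L)[-1]   # eigvalsh returns ascending eigenvalues
d_max = A.sum(axis=1).max()           # maximum (weighted) node degree

print(lam_max <= 2 * d_max)  # True, by (5.20)
```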
Similar to the upper bound (5.20), we also start with the CFW characterization for the
eigenvalues of $\mathbf{Q}$ in (5.2). In particular,

$$\lambda_1 = \min_{\|\mathbf{w}\|_2 = 1} \mathbf{w}^{T}\mathbf{Q}\mathbf{w}. \tag{5.21}$$

$$\overline{\mathbf{w}} = \big(\mathbf{c}^{T}, \ldots, \mathbf{c}^{T}\big)^{T} \quad\text{with } \mathbf{c} := \mathrm{avg}\big\{\mathbf{w}^{(i)}\big\}_{i=1}^{n}. \tag{5.23}$$

Note that

$$\|\mathbf{w}\|_2^2 = \|\overline{\mathbf{w}}\|_2^2 + \|\widetilde{\mathbf{w}}\|_2^2. \tag{5.24}$$
Regime I. This regime is obtained for $\|\widetilde{\mathbf{w}}\|_2 \geq \rho\,\|\overline{\mathbf{w}}\|_2$. Since $\|\mathbf{w}\|_2^2 = 1$,
and due to (5.24), we have

$$\|\widetilde{\mathbf{w}}\|_2^2 \geq \rho^2/(1+\rho^2). \tag{5.25}$$

$$\mathbf{w}^{T}\mathbf{Q}\mathbf{w} \overset{(3.7),(3.14)}{\geq} \alpha\,\lambda_2\big(\mathbf{L}^{(G)}\big)\,\|\widetilde{\mathbf{w}}\|_2^2 \overset{(5.25)}{\geq} \alpha\,\lambda_2\big(\mathbf{L}^{(G)}\big)\,\rho^2/(1+\rho^2). \tag{5.26}$$
Here, step (a) follows from $\max_{\|\mathbf{y}\|_2=1,\,\|\mathbf{x}\|_2=1} \mathbf{y}^{T}\mathbf{Q}\mathbf{x} = \lambda_{\max}$. Inserting (5.29)
into (5.28) for $\rho = \bar{\lambda}_{\min}/(4\lambda_{\max})$,

$$\mathbf{w}^{T}\mathbf{Q}\mathbf{w} \geq \|\overline{\mathbf{w}}\|_2^2\,\bar{\lambda}_{\min}/2 \overset{(5.27)}{\geq} \big(1/(1+\rho^2)\big)\,\bar{\lambda}_{\min}/2. \tag{5.30}$$
6 Some Main Flavours of FL
Chapter 3 discussed GTVMin as a main design principle for FL algorithms.
GTVMin learns local model parameters that optimally balance the individual
local loss with their variation across the edges of an FL network. Chapter
5 discussed how to obtain practical FL algorithms. These algorithms solve
GTVMin using distributed optimization methods, such as those from Chapter
4.
This chapter discusses important special cases of GTVMin, known as
"main flavours". These flavours arise from specific constructions of the local datasets,
choices of local models, measures for their variation and, last but not least, the
weighted edges in the FL network. We next briefly summarize the resulting
main flavours of FL discussed in the following sections.
Section 6.2 discusses single-model FL that learns the model parameters of
a single (global) model from local datasets. This single-model flavour can
be obtained from GTVMin using a connected FL network with large edge
weights or, equivalently, a sufficiently large value for the GTVMin parameter.
Section 6.3 discusses how clustered FL is obtained from GTVMin over
FL networks with a clustering structure. CFL exploits the presence of
clusters (subsets of local datasets) which can be approximated using an i.i.d.
assumption. GTVMin captures these clusters if they are well-connected by
many (large weight) edges in the FL network.
Section 6.4 discusses horizontal FL which is obtained from GTVMin over
an FL network whose nodes carry different subsets of a single underlying global
dataset. Loosely speaking, horizontal FL involves local datasets characterized
by the same set of features but obtained from different data points of an
underlying dataset.
Section 6.5 discusses vertical FL which is obtained from GTVMin over an
FL network whose nodes carry the same data points but use different features.
As an example, consider the local datasets at different public institutions (tax
authority, social insurance institute, supermarkets) which contain different
information about the same underlying population (anybody who has a
Finnish social security number).
Section 6.6 shows how personalized FL can be obtained from GTVMin by
using specific measures for the total variation of local model parameters. For
example, using deep neural networks as local models, we might only use the
model parameters corresponding to the first few input layers to define the
total variation.
After this chapter, you will know particular design choices for GTVMin
corresponding to some main flavours of FL:
• single-model FL
• vertical FL.
6.2 Single-Model FL
Some FL use cases require training a single (global) model $\mathcal{H}$ from a decentralized
collection of local datasets $\mathcal{D}^{(i)}$, $i = 1, \ldots, n$ [15, 51]. In what follows
we assume that the model $\mathcal{H}$ is parametrized by a vector $\mathbf{w} \in \mathbb{R}^{d}$. Figure
6.1 depicts a server-client architecture for an iterative FL algorithm that
generates a sequence of (global) model parameters $\mathbf{w}^{(k)}$, $k = 1, 2, \ldots$. After
computing the new model parameters w(k+1) , the server broadcasts it to the
clients i. During the next iteration, each client i uses the current global model
parameters w(k) to compute a local update w(i,k) based on its local dataset
D(i) . The precise implementation of this local update step depends on the
choice for the global model H (trained by the server). One example for such
a local update has been discussed in Chapter 5 (see (5.17)).
Chapter 5 already hinted at an alternative to the server-based system in
Figure 6.1. Indeed, we might learn local model parameters w(i) for each client
i using a distributed optimization of GTVMin. We can force the resulting
model parameters w(i) to be (approximately) identical by using a connected
FL network and a sufficiently large GTVMin parameter α.
To minimize the computational complexity of the resulting single-model
FL system, we prefer FL networks with a small number of edges, such as the
star graph in Figure 5.3 [50]. However, to increase the robustness against
node/link failures we might prefer using an FL network that has more edges.
This “redundancy” helps to ensure that the FL network is connected even
after removing some of its edges.
Much like the server-based system from Figure 6.1, GTVMin-based methods
using a star graph have a single point of failure (the server in Figure 6.1
Figure 6.1: Server-client architecture for single-model FL: the server maintains the
global model parameters $\mathbf{w}^{(k)}$ at time $k$ and broadcasts $\mathbf{w}^{(k+1)}$ to the clients
$1, 2, 3$, which in turn send their local updates $\mathbf{w}^{(1,k)}, \mathbf{w}^{(2,k)}, \mathbf{w}^{(3,k)}$ back to the server.
or the centre node in Figure 5.3). Chapter 8 will discuss the robustness of
GTVMin-based FL systems in slightly more detail (see Section 8.3).
6.3 Clustered FL
network. In particular, the FL network should contain many edges (with
large weight) between nodes in the same cluster and few edges (with small
weight) between nodes in different clusters. To fix ideas, consider the FL
network in Figure 6.3, which contains a cluster C = {1, 2, 3}.
Figure 6.2: The solutions of GTVMin (3.18) are local model parameters that
are approximately identical for all nodes in a tight-knit cluster $\mathcal{C}$.
$\mathbf{L}^{(C)}$ of the induced sub-graph $\mathcal{G}^{(C)}$:^{14} The larger $\lambda_2\big(\mathbf{L}^{(C)}\big)$, the better the
Note that for a single-node cluster $\mathcal{C} = \{i\}$, the cluster boundary coincides
with the node degree, $|\partial\mathcal{C}| = d^{(i)}$ (see (3.4)).
^{14} The graph $\mathcal{G}^{(C)}$ consists of the nodes in $\mathcal{C}$ and the edges $\{i, i'\} \in \mathcal{E}$ for $i, i' \in \mathcal{C}$.
Intuitively, GTVMin tends to deliver (approximately) identical model
parameters $\mathbf{w}^{(i)}$ for nodes $i \in \mathcal{C}$ if $\lambda_2\big(\mathbf{L}^{(C)}\big)$ is large and the cluster boundary
$|\partial\mathcal{C}|$ is small. The following result makes this intuition more precise for the
special case of GTVMin (5.1) for local linear models.
is upper bounded as

$$\sum_{i\in\mathcal{C}} \big\|\widetilde{\mathbf{w}}^{(i)}\big\|_2^2 \leq \frac{1}{\alpha\,\lambda_2\big(\mathbf{L}^{(C)}\big)} \bigg[\sum_{i\in\mathcal{C}} \frac{1}{m_i}\big\|\boldsymbol{\varepsilon}^{(i)}\big\|_2^2 + \alpha\,|\partial\mathcal{C}|\, 2\Big(\big\|\mathbf{w}^{(C)}\big\|_2^2 + R^2\Big)\bigg]. \tag{6.4}$$

Here, we used $R := \max_{i'\in\mathcal{V}\setminus\mathcal{C}} \big\|\widehat{\mathbf{w}}^{(i')}\big\|_2$.
The bound (6.4) depends on the cluster $\mathcal{C}$ (via the eigenvalue $\lambda_2\big(\mathbf{L}^{(C)}\big)$
and the boundary $|\partial\mathcal{C}|$) and the GTVMin parameter $\alpha$. Using a larger $\mathcal{C}$
might result in a decreased eigenvalue $\lambda_2\big(\mathbf{L}^{(C)}\big)$.^{15} According to (6.4), we
^{15} Consider an FL network (with uniform edge weights) that contains a fully connected
cluster $\mathcal{C}$ which is connected via a single edge with another node $i' \in \mathcal{V} \setminus \mathcal{C}$ (see Figure
6.3). Compare the corresponding eigenvalues $\lambda_2\big(\mathbf{L}^{(C)}\big)$ and $\lambda_2\big(\mathbf{L}^{(C')}\big)$ of $\mathcal{C}$ and the enlarged
cluster $\mathcal{C}' := \mathcal{C} \cup \{i'\}$.
should then increase $\alpha$ to maintain a small deviation $\widetilde{\mathbf{w}}^{(i)}$ of the learnt
model parameters from their cluster-wise average. Thus, increasing α in (3.18)
enforces its solutions to be approximately constant over increasingly larger
subsets (clusters) of nodes (see Figure 6.3).
For a connected FL network G, using a sufficiently large α for GTVMin
results in learnt model parameters that are approximately identical for all
nodes V in G. The resulting approximation error is quantified by Prop. 6.1
for the extreme case where the entire FL network forms a single cluster, i.e.,
C = V. Trivially, the cluster boundary is then equal to 0 and the bound (6.4)
specializes to (3.31).
We hasten to add that the bound (6.4) only applies to local datasets that
conform with the probabilistic model (6.2). In particular, it assumes that all
cluster nodes $i \in \mathcal{C}$ have identical model parameters $\mathbf{w}^{(C)}$. Trivially, this is
no restriction if we allow for arbitrary error terms $\boldsymbol{\varepsilon}^{(i)}$ in the probabilistic
model (6.2). However, as soon as we place additional assumptions on these
error terms (such as being realizations of i.i.d. Gaussian RVs) we should
verify their validity using principled statistical tests [32, 52]. Finally, we might
replace $\big\|\mathbf{w}^{(C)}\big\|_2^2$ in (6.4) with an upper bound for this quantity.
Figure 6.3: The solutions of GTVMin (3.18) become increasingly clustered
for increasing $\alpha$ (depicted for small, moderate, and large $\alpha$).
6.4 Horizontal FL
Horizontal FL uses local datasets D(i) , for i ∈ V, that contain data points
characterized by the same features [53]. As illustrated in Figure 6.4, we can
think of each local dataset D(i) as being a subset (or batch) of an underlying
global dataset
Figure 6.4: Horizontal FL uses the same features to characterize data points
in different local datasets. Different local datasets are constituted by different
subsets of an underlying global dataset.
for all nodes i (including the unlabelled ones U). GTVMin-based methods
combine the information in the labelled local datasets D(i) , for i ∈ V \ U and
their connections (via the edges in G) with nodes in U (see Figure 6.4).
6.5 Vertical FL
Vertical FL uses local datasets that are constituted by the same (identical!)
data points. However, each local dataset uses a different choice of features to
characterize these data points [55]. Formally, vertical FL applications revolve
around an underlying global dataset
$1, \ldots, m$. The labels $y^{(i,r)}$ are identical to the labels in the global dataset,
$y^{(i,r)} = y^{(r)}$. The feature vectors $\mathbf{x}^{(i,r)}$ are obtained from a subset $\mathcal{F}^{(i)} :=
\{j_1, \ldots, j_d\}$ of the original $d'$ features in $\mathbf{x}^{(r)}$,

$$\mathbf{x}^{(i,r)} = \big(x^{(r)}_{j_1}, \ldots, x^{(r)}_{j_d}\big)^{T}.$$
Figure 6.6: Vertical FL uses local datasets that are derived from the same data
points. The local datasets differ in the choice of features used to characterize
the common data points.
identical while the parameters of the deeper layers might be different for each
local dataset.
The partial parameter sharing for local models can be implemented in
many different ways [56, Sec. 4.3.]. One way is to use a choice for the GTV
penalty function that is different from $\phi = \big\|\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big\|_2^2$, which is the main
choice in our course. In particular, we could construct the penalty function
as a combination of two terms,

$$\phi\big(\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big) := \alpha^{(1)}\,\phi^{(1)}\big(\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big) + \alpha^{(2)}\,\phi^{(2)}\big(\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big). \tag{6.5}$$
Figure 6.7: Personalized FL with local models being ANNs with one hidden
layer. The ANN $h^{(i)}$ is parametrized by the vector $\mathbf{w}^{(i)} = \big(\big(\mathbf{u}^{(i)}\big)^{T}, \big(\mathbf{v}^{(i)}\big)^{T}\big)^{T}$
with the parameters $\mathbf{u}^{(i)}$ of the hidden layer and the parameters $\mathbf{v}^{(i)}$ of the output
layer. We couple the training of the $\mathbf{u}^{(i)}$ via GTVMin using the discrepancy
measure $\phi = \big\|\mathbf{u}^{(i)} - \mathbf{u}^{(i')}\big\|_2^2$.
6.7 Few-Shot Learning
6.8 Overview of Coding Assignment
This coding assignment builds on the coding assignment “ML Basics” (see
Section 2.7) and the coding assignment “FL Design Principle” (see Section
3.4). In particular, we consider an FL network $\mathcal{G}^{(\mathrm{FMI})}$ with each node $i \in \mathcal{V}$
being an FMI weather station. The node $i \in \mathcal{V}$ holds a local dataset
$$\mathcal{D}^{(i)} = \Big\{ \big(\mathbf{x}^{(i,1)}, y^{(i,1)}\big), \ldots, \big(\mathbf{x}^{(i,m_i)}, y^{(i,m_i)}\big) \Big\}.$$
• For each node $i \in \mathcal{V}$ (FMI station), add node attributes that store the
feature matrix $\mathbf{X}^{(i)}$ and label vector $\mathbf{y}^{(i)}$ (as numpy arrays). The $r$-th
row of these holds the features $\mathbf{x}^{(i,r)}$ and label $y^{(i,r)}$, respectively, of the
$r$-th data point recorded for FMI station $i$ in the csv file. Add another
node attribute that stores the index of the cluster to which this node
belongs.
– z(i) ∈ R2 with entries being the latitude and longitude of the FMI
station i ∈ V.
• For each clustering obtained in the previous task, compute the cluster-wise
average temperature

$$\hat{y}^{(\mathcal{C})} = \frac{1}{\sum_{i'\in\mathcal{C}} m_{i'}} \sum_{i'\in\mathcal{C}} \sum_{r=1}^{m_{i'}} y^{(i',r)}$$

for each cluster $\mathcal{C}$, and the average squared error loss

$$\frac{1}{\sum_{i=1}^{n} m_i} \sum_{i\in\mathcal{V}} \sum_{r=1}^{m_i} \big(y^{(i,r)} - \hat{y}\big)^2,$$

where $\hat{y}$ denotes the cluster-wise average of the cluster containing node $i$.
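A possible numpy sketch of this task; toy temperatures and an invented cluster assignment stand in for the FMI data:

```python
import numpy as np

rng = np.random.default_rng(6)

# toy data: temperatures y^(i,r) at n stations, each assigned to a cluster
n = 6
cluster = np.array([0, 0, 1, 1, 1, 0])               # cluster index per node
y = [rng.normal(loc=5 * cluster[i], scale=1.0, size=int(rng.integers(5, 10)))
     for i in range(n)]

# cluster-wise average temperature \hat{y}^{(C)}
y_hat = {}
for c in np.unique(cluster):
    vals = np.concatenate([y[i] for i in range(n) if cluster[i] == c])
    y_hat[c] = vals.mean()

# average squared error loss over all measurements
total = sum(len(y[i]) for i in range(n))
loss = sum(((y[i] - y_hat[cluster[i]]) ** 2).sum() for i in range(n)) / total
print(loss >= 0.0)  # True
```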
6.9 Proofs
To verify (6.4), we follow a similar argument as used in the proof (see Section
3.5.1) of Prop. 3.1.
First we decompose the objective function $f(\mathbf{w})$ in (5.1) as follows:

$$f(\mathbf{w}) = \underbrace{\sum_{i\in\mathcal{C}} (1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{w}^{(i)}\big\|_2^2 + \alpha\bigg[\sum_{i,i'\in\mathcal{C}} A_{i,i'}\big\|\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big\|_2^2 + \sum_{\{i,i'\}\in\partial\mathcal{C}} A_{i,i'}\big\|\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big\|_2^2\bigg]}_{=: f'(\mathbf{w})} + f''(\mathbf{w}). \tag{6.6}$$
Note that only the first component $f'$ depends on the local model parameters
$\mathbf{w}^{(i)}$ of cluster nodes $i \in \mathcal{C}$. Let us introduce the shorthand $f'\big(\mathbf{w}^{(i)}\big)$ for the
function obtained from $f'(\mathbf{w})$ by varying $\mathbf{w}^{(i)}$, for $i \in \mathcal{C}$, while fixing $\mathbf{w}^{(i')} := \widehat{\mathbf{w}}^{(i')}$
for $i' \notin \mathcal{C}$.
We obtain the bound (6.4) via a proof by contradiction: if (6.4) did
not hold, the local model parameters $\mathbf{w}^{(i)} := \mathbf{w}^{(C)}$, for $i \in \mathcal{C}$, would result in a
smaller value $f'\big(\mathbf{w}^{(i)}\big) < f'\big(\widehat{\mathbf{w}}^{(i)}\big)$ than the choice $\widehat{\mathbf{w}}^{(i)}$, for $i \in \mathcal{C}$. This would
First, note that

$$\begin{aligned} f'\big(\mathbf{w}^{(i)}\big) &= \sum_{i\in\mathcal{C}} (1/m_i)\big\|\mathbf{y}^{(i)} - \mathbf{X}^{(i)}\mathbf{w}^{(C)}\big\|_2^2 + \alpha\sum_{\substack{\{i,i'\}\in\mathcal{E}\\ i,i'\in\mathcal{C}}} A_{i,i'}\big\|\mathbf{w}^{(C)} - \mathbf{w}^{(C)}\big\|_2^2 + \alpha\sum_{\substack{\{i,i'\}\in\mathcal{E}\\ i\in\mathcal{C},\, i'\notin\mathcal{C}}} A_{i,i'}\big\|\mathbf{w}^{(C)} - \widehat{\mathbf{w}}^{(i')}\big\|_2^2 \\ &\overset{(6.2)}{=} \sum_{i\in\mathcal{C}} (1/m_i)\big\|\boldsymbol{\varepsilon}^{(i)}\big\|_2^2 + \alpha\sum_{\substack{\{i,i'\}\in\mathcal{E}\\ i\in\mathcal{C},\, i'\notin\mathcal{C}}} A_{i,i'}\big\|\mathbf{w}^{(C)} - \widehat{\mathbf{w}}^{(i')}\big\|_2^2 \\ &\overset{(a)}{\leq} \sum_{i\in\mathcal{C}} (1/m_i)\big\|\boldsymbol{\varepsilon}^{(i)}\big\|_2^2 + \alpha\sum_{\substack{\{i,i'\}\in\mathcal{E}\\ i\in\mathcal{C},\, i'\notin\mathcal{C}}} A_{i,i'}\, 2\Big(\big\|\mathbf{w}^{(C)}\big\|_2^2 + \big\|\widehat{\mathbf{w}}^{(i')}\big\|_2^2\Big) \\ &\leq \sum_{i\in\mathcal{C}} (1/m_i)\big\|\boldsymbol{\varepsilon}^{(i)}\big\|_2^2 + \alpha\,|\partial\mathcal{C}|\, 2\Big(\big\|\mathbf{w}^{(C)}\big\|_2^2 + R^2\Big). \end{aligned} \tag{6.7}$$

Step (a) uses the inequality $\|\mathbf{u}+\mathbf{v}\|_2^2 \leq 2\big(\|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2\big)$, which is valid for any
two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{d}$.
On the other hand,

$$f'\big(\widehat{\mathbf{w}}^{(i)}\big) \geq \alpha \sum_{i,i'\in\mathcal{C}} A_{i,i'}\, \underbrace{\big\|\widehat{\mathbf{w}}^{(i)} - \widehat{\mathbf{w}}^{(i')}\big\|_2^2}_{\overset{(6.3)}{=}\, \|\widetilde{\mathbf{w}}^{(i)} - \widetilde{\mathbf{w}}^{(i')}\|_2^2} \overset{(3.14)}{\geq} \alpha\,\lambda_2\big(\mathbf{L}^{(C)}\big) \sum_{i\in\mathcal{C}} \big\|\widetilde{\mathbf{w}}^{(i)}\big\|_2^2. \tag{6.8}$$

If the bound (6.4) did not hold, then by (6.8) and (6.7) we would obtain
$f'\big(\widehat{\mathbf{w}}^{(i)}\big) > f'\big(\mathbf{w}^{(i)}\big)$, which contradicts the fact that $\widehat{\mathbf{w}}^{(i)}$ solves (5.1).
7 Graph Learning
Chapter 3 discussed GTVMin as a main design principle for FL algorithms. In
particular, Chapter 5 discussed FL algorithms that arise from the application
of optimization methods, such as the gradient-based methods from Chapter
4, to solve GTVMin.
The computational and statistical properties of these algorithms depend
crucially on the properties of the underlying FL network. For example, the
amount of computation and communication required by FL systems typically
grows with the number of edges in the FL network. Moreover, the connectivity
of the FL network steers the pooling of local datasets into clusters that share
common model parameters.
In some applications, domain expertise can guide the choice for the FL
network. However, it might also be useful to learn the FL network in a more
data-driven fashion. This chapter discusses methods that learn an FL network
solely from a given collection of local datasets and corresponding local loss
functions.
The outline of this chapter is as follows: Section 7.2 discusses how the
computational and statistical properties of Algorithm 5.1 from Chapter 5 can
guide the construction of the FL network. Section 7.3 presents some ideas
for how to measure the discrepancy (lack of similarity) between two local
datasets. The measure for the discrepancy is an important design choice for
the graph learning methods discussed in Section 7.4. We formulate these
methods as the optimization of edge weights given the discrepancy measure
for any pair of local datasets. The formulation as an optimization problem
allows us to include connectivity constraints, such as a minimum value for each
node degree.
Consider the GTVMin instance (3.20) for learning the model parameters
of a local linear model for each local dataset D(i) . To solve (3.20), we use
Algorithm 5.1 as a message passing implementation of the basic gradient
step (5.3). Note that GTVMin (3.20) is defined for a given FL network G.
Therefore, the choice for G is critical for the statistical and computational
properties of Algorithm 5.1.
Statistical Properties. The statistical properties of Algorithm 5.1 can
be assessed via a probabilistic model for the local datasets. One important
example for such a probabilistic model is the clustering assumption (6.2)
of CFL (see Section 3.3.1). For CFL, we would like to learn similar model
parameters for nodes in the same cluster.
2
According to Prop. 6.1, the GTVMin solutions will be approximately
constant over $\mathcal{C}$, if $\lambda_2\big(\mathbf{L}^{(C)}\big)$ is large and the cluster boundary $|\partial\mathcal{C}|$ is small.
the nodes in C. This informal statement can be made precise using a celebrated
result from spectral graph theory, known as Cheeger’s inequality [35, Ch. 21].
Alternatively, we can analyse $\lambda_2\big(\mathbf{L}^{(C)}\big)$ by interpreting (or approximating)
the cluster as a realization of an Erdős–Rényi (ER) graph with edge probability $p_e$.^{16}
The expected node degree is then given as $\mathbb{E}\big\{d^{(i)}\big\} = p_e\big(|\mathcal{C}| - 1\big)$. With high probability,

$$d^{(G)}_{\max} \approx \mathbb{E}\big\{d^{(i)}\big\} = p_e\big(|\mathcal{C}| - 1\big). \tag{7.1}$$

^{16} This approximation is particularly useful if the FL network $\mathcal{G}$ itself is (close to) a
typical realization of an ER graph.

A direct calculation reveals that $\lambda_2\big(\mathbf{L}\big) = |\mathcal{C}|\,p_e$. Thus, we have the approximation

$$\lambda_2\big(\mathbf{L}^{(C)}\big) \approx \lambda_2\big(\mathbf{L}\big) = |\mathcal{C}|\, p_e \overset{(7.1)}{\approx} d^{(G)}_{\max}. \tag{7.2}$$
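For a fully connected cluster with uniform edge weight $p_e$, all nonzero Laplacian eigenvalues equal $|\mathcal{C}|\,p_e$; a quick numerical check (our own toy illustration):

```python
import numpy as np

# fully connected cluster of |C| nodes, uniform edge weight p_e
C, pe = 6, 0.8
A = pe * (np.ones((C, C)) - np.eye(C))
L = np.diag(A.sum(axis=1)) - A        # Laplacian of the cluster

eigvals = np.sort(np.linalg.eigvalsh(L))
print(np.round(eigvals[1], 6))        # lambda_2 = |C| * p_e = 4.8
```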
the maximum node degree $d^{(G)}_{\max}$ and the eigenvalue $\lambda_2\big(\mathbf{L}^{(G)}\big)$ (see (5.5) and (5.6)).
Recent work studies graph constructions that maximize $\lambda_2\big(\mathbf{L}^{(G)}\big)$ for a given
(prescribed) maximum node degree $d^{(G)}_{\max} = \max_{i\in\mathcal{V}} d^{(i)}$ [64, 65].
Spectral graph theory also provides upper bounds on $\lambda_2\big(\mathbf{L}^{(G)}\big)$ in terms of
the node degrees [35, 36, 66]. These upper bounds can be used as a baseline
for practical constructions of the FL network: if some construction results in
a value $\lambda_2\big(\mathbf{L}^{(G)}\big)$ close to the upper bound, there is little benefit in trying to
improve the construction further (in the sense of achieving a higher $\lambda_2\big(\mathbf{L}^{(G)}\big)$).
The next result provides one example for such an upper bound.
$$\lambda_2\big(\mathbf{L}^{(G)}\big) \leq \frac{n}{n-1}\, d^{(i)}, \quad\text{for every } i = 1, \ldots, n. \tag{7.3}$$
Proof. CFW [3, Thm. 8.1.2.] with the evaluation of the quadratic form
$\mathbf{w}^{T}\mathbf{L}^{(G)}\mathbf{w}$ for the specific vector $\widetilde{\mathbf{w}} = \big(\widetilde{w}^{(1)} = -(1/n), \ldots, \widetilde{w}^{(i)} = 1 - (1/n), \ldots, \widetilde{w}^{(n)} = -(1/n)\big)^{T}$.
Note that $\widetilde{\mathbf{w}}^{T}\mathbf{1} = 0$, i.e., this vector is orthogonal
to the subspace (3.15) (for the special case of local models with
dimension $d = 1$).
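The bound (7.3) can be checked numerically; the following sketch (our own illustration, using a randomly generated unweighted graph) verifies it for every node:

```python
import numpy as np

rng = np.random.default_rng(5)

# random unweighted graph on n nodes (Erdos-Renyi style)
n, p = 10, 0.4
U = rng.random((n, n)) < p
A = np.triu(U, 1).astype(float)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A

lam2 = np.sort(np.linalg.eigvalsh(L))[1]   # second-smallest eigenvalue
degrees = A.sum(axis=1)

# (7.3): lambda_2(L) <= n/(n-1) * d^(i) for every node i
print(bool(np.all(lam2 <= n / (n - 1) * degrees + 1e-9)))  # True
```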
Figure 7.1: The per-iteration complexity and the number of iterations required by
Algorithm 5.1 depend on the number of edges in the underlying FL network
in different manners.
and $\mathcal{D}^{(i')}$ via the Euclidean distance $\big\|\mathbf{w}^{(i)} - \mathbf{w}^{(i')}\big\|_2$ between the parameters
$\mathbf{w}^{(i)}, \mathbf{w}^{(i')}$ of the probability distributions.
In general, we do not know the parameters of the probability distribution
$p^{(i)}\big(\mathcal{D}^{(i)}; \mathbf{w}^{(i)}\big)$ underlying a local dataset.^{17} We might still be able to estimate
these parameters, e.g., using variants of maximum likelihood [6, Ch. 3]. Given
the estimates $\widehat{\mathbf{w}}^{(i)}, \widehat{\mathbf{w}}^{(i')}$ for the model parameters, we can then compute the
discrepancy measure $d^{(i,i')} := \big\|\widehat{\mathbf{w}}^{(i)} - \widehat{\mathbf{w}}^{(i')}\big\|_2$.
Example. Consider local datasets each being a single number $y^{(i)}$ which is
modelled as a noisy observation $y^{(i)} = w^{(i)} + n^{(i)}$ with $n^{(i)} \sim \mathcal{N}(0, 1)$. The
maximum likelihood estimator for $w^{(i)}$ is then obtained as $\widehat{w}^{(i)} = y^{(i)}$ [30, 68]
and, in turn, the resulting discrepancy measure is $d^{(i,i')} := \big|y^{(i)} - y^{(i')}\big|$ [69].
Example. Consider nodes $i \in \mathcal{V}$ that generate local datasets $\mathcal{D}^{(i)}$. Each
data point in $\mathcal{D}^{(i)}$ is characterized by a label value from a finite set $\mathcal{Y}^{(i)}$. We
can then measure the similarity between $i, i'$ by the fraction of data points in
$\mathcal{D}^{(i)} \cup \mathcal{D}^{(i')}$ with label values in $\mathcal{Y}^{(i)} \cap \mathcal{Y}^{(i')}$ [70].
Example. Consider local datasets $\mathcal{D}^{(i)}$ constituted by images of handwritten
digits $0, 1, \ldots, 9$. We model a local dataset using a hierarchical
probabilistic model: Each node $i \in \mathcal{V}$ is assigned a deterministic but unknown
distribution $\boldsymbol{\alpha}^{(i)} = \big(\alpha^{(i)}_{1}, \ldots, \alpha^{(i)}_{9}\big)$. The entry $\alpha^{(i)}_{j}$ is the fraction of images at
node $i$ that show digit $j$. We interpret the labels $y^{(i,1)}, \ldots, y^{(i,m_i)}$ as realizations
of i.i.d. RVs, with values in $\{0, 1, \ldots, 9\}$ and distributed according to
$\boldsymbol{\alpha}^{(i)}$. The features are interpreted as realizations of RVs with conditional distribution
$p(\mathbf{x}|y)$ which is the same for all nodes $i \in \mathcal{V}$. We can then estimate
the dis-similarity between nodes $i, i'$ via the distance between (estimations
of) the parameters $\boldsymbol{\alpha}^{(i)}$ and $\boldsymbol{\alpha}^{(i')}$.

^{17} One exception is when we generate the local dataset by drawing i.i.d. realizations from
$p^{(i)}\big(\mathcal{D}^{(i)}; \mathbf{w}^{(i)}\big)$.
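The digit-histogram discrepancy from this example can be sketched as follows; the node datasets below are synthetic stand-ins for the image labels:

```python
import numpy as np

rng = np.random.default_rng(7)

def digit_histogram(labels):
    """Estimate alpha^(i): fraction of each digit 0..9 at a node."""
    return np.bincount(labels, minlength=10) / len(labels)

# two nodes with similar digit distributions, one with a different one
y1 = rng.integers(0, 5, size=200)     # digits 0-4 only
y2 = rng.integers(0, 5, size=200)
y3 = rng.integers(5, 10, size=200)    # digits 5-9 only

a1, a2, a3 = map(digit_histogram, (y1, y2, y3))
d12 = np.linalg.norm(a1 - a2)         # discrepancy d^(1,2): small
d13 = np.linalg.norm(a1 - a3)         # discrepancy d^(1,3): large
print(d12 < d13)  # True
```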
The above discrepancy measure construction (using estimates for the
parameters of a probabilistic model) is a special case of a more generic
two-step approach:

• First, we determine a vector representation $\mathbf{z}^{(i)} \in \mathbb{R}^{m'}$ for each local
dataset $\mathcal{D}^{(i)}$ [6, 71].

• Second, we measure the discrepancy between local datasets $\mathcal{D}^{(i)}, \mathcal{D}^{(i')}$ via
the distance between the corresponding representation vectors $\mathbf{z}^{(i)}, \mathbf{z}^{(i')} \in \mathbb{R}^{m'}$.

One example for constructing a vector representation $\mathbf{z}^{(i)} \in \mathbb{R}^{m'}$ for a local
dataset $\mathcal{D}^{(i)}$ is, under an i.i.d. assumption, to use the maximum likelihood
estimate $\widehat{\mathbf{w}}^{(i)}$ for the parameters of a probabilistic model $p(\mathbf{x}, y; \mathbf{w}^{(i)})$.
Let us next discuss a construction for the vector representation $\mathbf{z}^{(i)} \in \mathbb{R}^{m'}$
that is motivated by SGD (see Section 5.4). In particular, we could define
the discrepancy between two local datasets by interpreting them as two
possible batches used by SGD to train a model. If these two batches have
similar statistical properties (in the sense of being useful for the model
training), then their corresponding gradient approximations (5.11) should
be aligned. This suggests using the gradient $\nabla f(\mathbf{w}')$ of the average loss
$f(\mathbf{w}) := \big(1/|\mathcal{D}^{(i)}|\big) \sum_{(\mathbf{x},y)\in\mathcal{D}^{(i)}} L\big((\mathbf{x}, y), h^{(\mathbf{w})}\big)$ as a vector representation $\mathbf{z}^{(i)}$
for a local dataset. In particular, we feed it into an encoder network which has
been trained jointly with a decoder network on some learning task. Figure
7.2 illustrates a generic autoencoder setup.

Figure 7.2: A generic autoencoder: an encoder $h(\cdot)$ maps the local dataset $\mathcal{D}^{(i)}$
to its representation $\mathbf{z}^{(i)} \in \mathbb{R}^{m'}$, from which a decoder $h^{*}(\cdot)$ reconstructs an
approximation $\widehat{\mathcal{D}}^{(i)}$.
$A_{i,i'} \in \mathbb{R}_{+}$ via

$$\sum_{i,i'\in\mathcal{V}} A_{i,i'}\, d^{(i,i')}. \tag{7.4}$$

The objective function (7.4) penalizes having a large edge weight $A_{i,i'}$ between
two nodes $i, i'$ with a large discrepancy $d^{(i,i')}$. Note that the objective function
(7.4) is minimized by the trivial choice $A_{i,i'} = 0$, i.e., the empty graph without
any edges, $\mathcal{E} = \emptyset$.
As discussed in Section 7.2, the FL network should have a sufficient
number of edges in order to ensure that the GTVMin solutions are useful
model parameters. Indeed, the desired pooling effect of GTVMin requires
the eigenvalue $\lambda_2\big(\mathbf{L}^{(G)}\big)$ to be sufficiently large. Ensuring a large $\lambda_2\big(\mathbf{L}^{(G)}\big)$
The constraints (7.5) require that each node $i$ is connected with other nodes
using total edge weight (weighted node degree) $\sum_{i'\neq i} A_{i,i'} = d^{(G)}_{\max}$.
Combining the constraints (7.5) with the objective function (7.4) results
in the following graph learning principle,

$$\widehat{A}_{i,i'} \in \operatorname*{argmin}_{A_{i,i'} = A_{i',i}} \sum_{i,i'\in\mathcal{V}} A_{i,i'}\, d^{(i,i')}. \tag{7.6}$$
modifying the last constraint in (7.6),

$$\widehat{A}_{i,i'} \in \operatorname*{argmin}_{A_{i,i'} = A_{i',i}} \sum_{i,i'\in\mathcal{V}} A_{i,i'}\, d^{(i,i')}. \tag{7.7}$$

^{18} Can you think of FL application domains where the connectivity (e.g., via short-range
wireless links) of two clients $i, i' \in \mathcal{V}$ might also reflect the pair-wise similarities between
probability distributions of local datasets $\mathcal{D}^{(i)}, \mathcal{D}^{(i')}$?
7.5 Overview of Coding Assignment
the following constructions for the representation vectors:
• II: the vector z(i) consisting of the parameters of a GMM that is fit to
the feature vectors x(i,1) , . . . , x(i,mi ) in the local dataset D(i) .
Note that for each of the above discrepancy measures, we obtain (via con-
necting nearest neighbours) a potentially different FL network G (I) , G (II) , G (III) .
Carefully note that these FL networks also depend on the choice for the input
parameter node_degree.
Using each of the above constructions for the FL network, learn local model
parameters for a linear model by applying FedGD Algorithm 5.1. To this
end, you have to split each local dataset into a training set and a validation
set (Algorithm 5.1 is applied to the training sets). Diagnose the resulting
model parameters by computing the average (over all nodes) training error
and validation error.
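The nearest-neighbour construction of the FL network can be sketched as follows; `knn_fl_network` and the toy representation vectors are our own stand-ins for the assignment's add_edges() and node_degree machinery:

```python
import numpy as np

rng = np.random.default_rng(8)

def knn_fl_network(Z, node_degree):
    """Connect each node to its `node_degree` nearest neighbours.

    Z : (n, m') array; row i is the representation vector z^(i).
    Returns a symmetric 0/1 adjacency matrix (edge weights A_{i,i'} = 1).
    """
    n = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # no self-loops
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[:node_degree]
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)            # symmetrize: keep edge if either side picked it

Z = rng.standard_normal((10, 3))         # made-up representation vectors
A = knn_fl_network(Z, node_degree=2)
print(bool((A == A.T).all()), int(A.diagonal().sum()))  # True 0
```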
8 Trustworthy FL
The Story So Far. We have introduced GTVMin as a main design principle
for FL in Chapter 3. Chapter 5 applied the gradient-based methods from
Chapter 4 to solve GTVMin, resulting in practical FL systems. Our focus
has been on computational and statistical properties of these FL systems. In
this and the following chapters, we shift the focus from technical properties
to the trustworthiness of FL systems.
Section 8.2 reviews key requirements for trustworthy AI, which includes
FL systems, that have been put forward by the European Union [73, 74].
We will also discuss how these requirements guide the design choices for
GTVMin-based methods. Our focus will be on the three design criteria for
trustworthy FL: privacy, robustness, and explainability. This chapter covers
robustness and explainability; the leakage and protection of privacy in FL
systems is the subject of Chapter 9.
Section 8.3 discusses the robustness of FL systems against perturbations
of local datasets and computations. A special type of perturbation is the
intentional modification or poisoning of local datasets (see Chapter 10).
Section 8.4 introduces a measure for the (subjective) explainability of the
personalized models trained by FL systems.
• have some intuition about how robustness, privacy and transparency
guide design choices for the local models, loss functions and FL network in
GTVMin
Simple is Good. Human oversight can be facilitated by relying on
simple local models. Examples include linear models with few features or
decision trees with small depth. However, we are unaware of a widely accepted
definition of when a model is simple. Loosely speaking, a simple model results
in a learnt hypothesis that allows humans to understand how the features of a
data point relate to the prediction $h(\mathbf{x})$. This notion of simplicity is closely
related to the concept of explainability which we discuss in more detail in
Section 8.4.
Continuous Monitoring. In its simplest form, GTVMin-based methods
involve a single training phase, i.e., learning local model parameters by
solving GTVMin. However, this approach is only useful if the data can
be well approximated by an i.i.d. assumption. In particular, this approach
works only if the statistical properties of local datasets do not change over
time. For many FL applications, this assumption is unrealistic (consider a
social network which is exposed to constant change of memberships and user
behaviour). It is then important to continuously compute a validation error
on an up-to-date validation set, which is then used to diagnose the overall
FL system (see [6, Sec. 6.6]).
physical computers (e.g., a wireless network of devices with limited
computational capability). These computers are typically subject to
imperfections, such as a temporary loss of connectivity or a device shutting
down after running out of battery. Moreover, the data generation processes
themselves might be perturbed by statistical anomalies (outliers) or
intentional modifications (see Chapter 10). Section 8.3 studies in some detail
the robustness of GTVMin-based systems against such perturbations of data
sources and imperfections of the computational infrastructure.
only access those features of data points (representing users) that are
relevant for predicting the label.
Data Governance. FL systems might use local datasets that are generated
by human users, i.e., personal data. Whenever personal data is used,
special care must be taken to comply with data protection regulations [82]. It
might then be useful (or even compulsory) to designate a data protection
officer and conduct a data protection impact assessment [74].
Privacy. The operation of an FL system must not violate the fundamental
human right to privacy [83]. Chapter 9 discusses quantitative measures and
methods for privacy protection in GTVMin-based FL systems.
probabilistic model for the data, e.g., the conditional variance of the label
y, given the features x of a random data point. Another example of an
uncertainty measure is the validation error of a trained local model.
Explainability. The transparency of a GTVMin-based FL system also
includes the explainability of the trained local models. Section 8.4 discusses
quantitative measures for the subjective explainability of a learnt hypothesis.
We will also use this measure as a regularizer to obtain GTVMin-based
systems that guarantee subjective explainability “by design”.
“...we must enable inclusion and diversity throughout the entire AI system’s
life cycle...this also entails ensuring equal access through inclusive design
processes as well as equal treatment.” [74, p.18].
The local datasets used for the training of local models should be carefully
selected so as not to reinforce existing discrimination. In a health-care
application, there might be significantly more training data for patients of a
specific gender, resulting in models that perform best for that gender at the
cost of worse performance for the minority [79, Sec. 3.3.]. Fairness is also
important for ML methods used to determine credit scores and, in turn, whether
a loan should be granted [84]. Here, we must ensure that ML methods do not
discriminate against customers based on ethnicity or race. To this end, we
could augment data points by modifying any features that mainly reflect the
ethnicity or race of a customer (see Figure 8.1).
[Figure 8.1: A training set D with feature x (gender) and label y
(compensation), along with augmented data points and a learnt hypothesis h(x).]
to share them across the edges of the FL network. Computation and
communication always require energy, which should be generated in an
environmentally friendly fashion [86].
8.3.1 Sensitivity Analysis
We study GTVMin for local linear models, which is a quadratic problem,

    min_{w = stack{w^{(i)}}}  w^T Q w + q^T w.   (8.1)

The matrix Q and vector q are determined by the feature matrices X^{(i)} and
label vectors y^{(i)} at the nodes i ∈ V (see (2.27)). We next study the
sensitivity of (the solutions of) (8.1) towards external perturbations of the
label vectors.^19 Consider an additive perturbation ỹ^{(i)} := y^{(i)} + ε^{(i)}
of the label vector y^{(i)}. Using the perturbed label vector ỹ^{(i)} results
also in a "perturbation" of GTVMin (8.1),

    min_{w = stack{w^{(i)}}}  w^T Q w + q^T w + n^T w + c.   (8.2)

An inspection of (2.27) yields that
n = ( (ε^{(1)})^T X^{(1)}, ..., (ε^{(n)})^T X^{(n)} )^T. The next result
provides an upper bound on the deviation between the solutions of (8.1)
and (8.2).
Proposition 8.1. Consider the GTVMin instance (8.1) for learning local
model parameters of a linear model for each node i ∈ V of an FL network G.
We assume that the FL network is connected, i.e., λ₂(L^{(G)}) > 0, and that
the local datasets are such that λ̄_min > 0 (see (5.4)). Then, the deviation
between ŵ^{(i)} and the solution w̃^{(i)} to the perturbed problem (8.2) is
upper bounded as

    ∑_{i=1}^n ‖ŵ^{(i)} − w̃^{(i)}‖₂²
      ≤ ( λ_max (1 + ρ²)² / min{λ₂(L^{(G)}) α ρ², λ̄_min/2}² ) ∑_{i=1}^n ‖ε^{(i)}‖₂².   (8.3)
19. Our study can be generalized to also take into account perturbations of
the feature matrices X^{(i)}, for i = 1, ..., n.
Here, we used the shorthand ρ := λ̄min /(4λmax ) (see (5.4)).
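To build intuition for this sensitivity bound, the following sketch (our own toy experiment, not from the text; the two-node setup, the edge weight A12, and all constants are arbitrary illustrative choices) solves a two-node GTVMin instance of the form (8.1) in closed form, once with the original labels and once with a perturbed label vector, and compares the deviation of the solutions with the perturbation energy:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 50            # feature dimension and samples per node
alpha, A12 = 0.5, 1.0   # GTV parameter and edge weight between the two nodes

# two local datasets for local linear models
X1, X2 = rng.normal(size=(m, d)), rng.normal(size=(m, d))
y1 = X1 @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=m)
y2 = X2 @ np.array([1.1, -0.9]) + 0.1 * rng.normal(size=m)

def gtvmin(y1, y2):
    """Exact solution of the two-node GTVMin instance (a quadratic problem)."""
    Q1, Q2 = X1.T @ X1 / m, X2.T @ X2 / m
    I = np.eye(d)
    # block linear system obtained by setting the gradient to zero
    M = np.block([[Q1 + alpha * A12 * I, -alpha * A12 * I],
                  [-alpha * A12 * I, Q2 + alpha * A12 * I]])
    b = np.concatenate([X1.T @ y1 / m, X2.T @ y2 / m])
    return np.linalg.solve(M, b)

w_hat = gtvmin(y1, y2)                 # solution for the original labels
eps = 0.05 * rng.normal(size=m)        # additive perturbation of y1
w_tilde = gtvmin(y1 + eps, y2)         # solution for the perturbed labels

dev = np.sum((w_hat - w_tilde) ** 2)
print(dev, np.sum(eps ** 2))           # deviation grows with perturbation energy
```

Since the solution map of this quadratic problem is affine in the labels, scaling the perturbation by a factor scales the squared deviation by the square of that factor.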
This estimation error consists of two components. The first component is
avg{ŵ^{(i′)}} − w̄, which is identical for all nodes i ∈ V. The second
component is the deviation w̃^{(i)} := ŵ^{(i)} − avg{ŵ^{(i′)}} of the learnt
local model parameters ŵ^{(i)}, for i = 1, ..., n, from their average
avg{ŵ^{(i′)}} := (1/n) ∑_{i′=1}^n ŵ^{(i′)}. As discussed in Section 3.3.2,
these two components correspond to two orthogonal subspaces of R^{d·n}.
According to Prop. 3.1, the second error component is upper bounded as

    ∑_{i=1}^n ‖w̃^{(i)}‖₂² ≤ (1/(λ₂ α)) ∑_{i=1}^n (1/m_i) ‖ε^{(i)}‖₂².   (8.5)
To bound the first error component c̄ − w̄, using the shorthand
c̄ := avg{ŵ^{(i)}}, we first note that (see (3.20))

    c̄ = argmin_{w ∈ R^d} ∑_{i∈V} (1/m_i) ‖y^{(i)} − X^{(i)} (w + w̃^{(i)})‖₂²
          + α ∑_{{i,i′}∈E} A_{i,i′} ‖w̃^{(i)} − w̃^{(i′)}‖₂².   (8.6)
Here, λ̄_min is the smallest eigenvalue of (1/n) ∑_{i=1}^n Q^{(i)}, i.e., of
the average of the matrices Q^{(i)} = (1/m_i) (X^{(i)})^T X^{(i)} over all
nodes i ∈ V.^20 Note that the bound (8.7) is only valid if λ̄_min > 0 which,
in turn, implies that the solution to (8.6) is unique.
20. We encountered the quantity λ̄_min already during our discussion of
gradient-based methods for solving the GTVMin instance (3.20) (see (5.4)).
We can develop (8.7) further using

    ‖ ∑_{i=1}^n (1/m_i) (X^{(i)})^T ( ε^{(i)} + X^{(i)} w̃^{(i)} ) ‖₂
    (a)≤ ∑_{i=1}^n ‖ (1/m_i) (X^{(i)})^T ( ε^{(i)} + X^{(i)} w̃^{(i)} ) ‖₂
    (b)≤ √n ( ∑_{i=1}^n ‖ (1/m_i) (X^{(i)})^T ( ε^{(i)} + X^{(i)} w̃^{(i)} ) ‖₂² )^{1/2}
    (c)≤ √n ( ∑_{i=1}^n 2 ‖(1/m_i) (X^{(i)})^T ε^{(i)}‖₂² + 2 ‖(1/m_i) (X^{(i)})^T X^{(i)} w̃^{(i)}‖₂² )^{1/2}
    (d)≤ √n ( ∑_{i=1}^n (2/m_i) λ_max ‖ε^{(i)}‖₂² + 2 λ_max² ‖w̃^{(i)}‖₂² )^{1/2}.   (8.8)

Here, step (a) uses the triangle inequality for the norm ‖·‖₂, step (b) uses
the Cauchy-Schwarz inequality, step (c) uses the inequality
‖a + b‖₂² ≤ 2(‖a‖₂² + ‖b‖₂²), and step (d) uses the maximum eigenvalue
λ_max := max_{i∈V} λ_d(Q^{(i)}) of the matrices
Q^{(i)} = (1/m_i) (X^{(i)})^T X^{(i)} (see (5.4)).
Inserting (8.8) into (8.7) results in the upper bound

    ‖c̄ − w̄‖₂² ≤ 2 ( ∑_{i=1}^n (1/m_i) λ_max ‖ε^{(i)}‖₂² + λ_max² ‖w̃^{(i)}‖₂² ) / (n λ̄_min²)
    (8.5)≤ 2 ( λ_max + λ_max²/(λ₂ α) ) ∑_{i=1}^n (1/m_i) ‖ε^{(i)}‖₂² / (n λ̄_min²).   (8.9)
guide the choice for the FL network G and the features of data points in the
local datasets.
According to (8.9), we should use an FL network G with a large λ₂(L^{(G)})
Algorithm 5.1 can be modelled as perturbed GD (4.13) from Chapter 4. We can
then analyze the robustness of the resulting FL system via the convergence
analysis of perturbed GD (see Section 4.5).
According to (4.14), the performance of the decentralized Algorithm 5.1
degrades gracefully in the presence of imperfections such as missing or faulty
communication links. In contrast, the client-server based implementation of
FedAvg in Algorithm 5.3 has a single point of failure (the server).
8.3.4 Stragglers
Instead of forcing the other (faster) nodes to wait until the slower ones are
ready, we could let them continue with their local updates. This results in
an asynchronous variant of Algorithm 5.1, which we summarize in Algorithm 8.1.
Note that Algorithm 8.1 still uses a global iteration counter k. However,
the role of this counter is fundamentally different from its role in the
synchronous Algorithm 5.1. In particular, we merely need the counter k for
notational convenience, to denote the (arbitrary) time instants at which the
asynchronous local updates (8.10) are executed. These updates are executed
only for a subset of ("active") nodes i ∈ A^{(k)}. All other ("inactive")
nodes i ∉ A^{(k)} leave their current model parameters unchanged,
w^{(i,k+1)} := w^{(i,k)}.
The asynchronous local update (8.10) at some node i ∈ A^{(k)} uses the
"outdated" model parameters w^{(i′,k_{i,i′})} of its neighbours i′ ∈ N^{(i)}.
In particular, some of these neighbours might not have been in the active sets
A^{(k−1)}, A^{(k−2)}, ... during the most recent iterations. In this case, we
cannot use w^{(i′,k)} as it is simply not available to node i at "time" k.
Rather, we might need to use w^{(i′,k_{i,i′})} obtained during some previous
"time" k_{i,i′}.
We can interpret the iteration index k_{i,i′} as the most recent time instant
at which node i′ has updated its local model parameters and shared them with
node i. The difference k − k_{i,i′}, in turn, can be viewed as a measure
for the communication delay from node i′ to node i. The robustness of (the
convergence of) Algorithm 8.1 against these communication delays is studied
in-depth in [48, Ch. 6 and 7].
Algorithm 8.1 Asynchronous FedGD for Local Linear Models
Input: FL network G; GTV parameter α; learning rate η; local dataset
D^{(i)} = {(x^{(i,1)}, y^{(i,1)}), ..., (x^{(i,m_i)}, y^{(i,m_i)})} for each
node i ∈ V; some stopping criterion.
Output: linear model parameters ŵ^{(i)} for each node i ∈ V.
Initialize: k := 0; w^{(i,0)} := 0 for all nodes i ∈ V.
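The following sketch (our own toy simulation; the chain network, all constants, and the way staleness is modelled via per-node copies of neighbour parameters are illustrative assumptions) mimics Algorithm 8.1: in each iteration, only a random subset of active nodes performs a local update, using possibly outdated neighbour parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 4, 3, 40          # nodes, feature dimension, samples per node
alpha, eta = 0.5, 0.05      # GTV parameter and learning rate

# chain FL network with edge weights A[i, j]
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# local linear-regression datasets sharing a common true parameter vector
X = rng.normal(size=(n, m, d))
w_true = rng.normal(size=d)
y = np.einsum('imd,d->im', X, w_true) + 0.05 * rng.normal(size=(n, m))

w = np.zeros((n, d))         # current local parameters w^{(i,k)}
stale = np.zeros((n, n, d))  # stale[i, j]: copy of w^{(j)} last received at node i

for k in range(300):
    active = rng.random(n) < 0.5                 # random active set A^{(k)}
    for i in np.flatnonzero(active):
        # gradient of local loss plus GTV coupling to (stale) neighbour copies
        grad = (2 / m) * X[i].T @ (X[i] @ w[i] - y[i])
        grad += 2 * alpha * sum(A[i, j] * (w[i] - stale[i, j])
                                for j in range(n) if A[i, j] > 0)
        w[i] -= eta * grad
    # only active nodes share their new parameters with their neighbours
    for i in np.flatnonzero(active):
        for j in range(n):
            if A[i, j] > 0:
                stale[j, i] = w[i].copy()

print(np.max(np.linalg.norm(w - w_true, axis=1)))  # all nodes end up near w_true
```

Inactive nodes simply keep their current parameters, and the staleness of the neighbour copies plays the role of the communication delays k − k_{i,i′} discussed above.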
8.4 Subjective Explainability of FL Systems
model parameters w^{(i)} by

    (1/|D_t^{(i)}|) ∑_{x ∈ D_t^{(i)}} ( u^{(i)}(x) − x^T w^{(i)} )².   (8.11)

It seems natural to add this measure as a penalty term to the local loss
function in (5.1), resulting in the new loss function

    L_i(w^{(i)}) := (1/m_i) ‖y^{(i)} − X^{(i)} w^{(i)}‖₂²
          + ρ (1/|D_t^{(i)}|) ∑_{x ∈ D_t^{(i)}} ( u^{(i)}(x) − x^T w^{(i)} )².   (8.12)

The first term in (8.12) is the training error; the second term measures the
subjective explainability.
The regularization parameter ρ controls the preference for a high subjective
explainability of the hypothesis h^{(i)}(x) = (w^{(i)})^T x over a small
training error [90]. It can be shown that (8.12) is the average weighted
squared error loss of h^{(i)}(x) on an augmented version of D^{(i)}. This
augmented version includes the data point (x, u^{(i)}(x)) for each data point
x in the test set D_t^{(i)}.
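Since (8.12) is a quadratic function of w^{(i)}, its minimizer can be read off the normal equations of a weighted least-squares problem on the augmented dataset. A minimal sketch (our own toy example; the user feedback u^{(i)}(x) is simulated here by a simple hypothetical reference model that uses only the first feature):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, rho = 30, 2, 2.0

# local training set
X = rng.normal(size=(m, d))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=m)

# simulated user feedback u(x) on a small test set: the user anticipates
# the label using only the first feature (an illustrative assumption)
Xt = rng.normal(size=(10, d))
u = Xt @ np.array([1.0, 0.0])

# minimize (1/m)||y - X w||^2 + rho * (1/|Dt|) * sum_x (u(x) - x^T w)^2
# via the normal equations of the augmented least-squares problem
M = X.T @ X / m + rho * Xt.T @ Xt / len(Xt)
b = X.T @ y / m + rho * Xt.T @ u / len(Xt)
w = np.linalg.solve(M, b)
print(w)   # pulled away from (2, -1) towards the user's simple model (1, 0)
```

For ρ = 0 we recover ordinary least squares on the local dataset; for very large ρ the solution approaches the model anticipated by the user.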
information (explanation) about a map such that a user can anticipate the
results of applying the map to arbitrary arguments. However, the map A
might be much more complicated than a learnt hypothesis (which could be a
linear map for linear models). The different levels of complexity of these
two families of maps require different forms of explanation. For example, we
might explain an FL algorithm using pseudo-code such as Algorithm 5.1. Another
form of explanation could be a Python code snippet that illustrates a
potential implementation of the algorithm.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset and split it into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a decision tree classifier on the training set
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy
Figure 8.2: Python code for fitting a decision tree model on the Iris dataset.
8.5 Overview of Coding Assignment
This coding assignment continues from the coding assignment “FL Algo-
rithms” (see Section 5.7). This previous assignment required you to implement
Algorithm 5.1 and apply it to an FL network whose nodes are FMI weather
stations. We now study the effect of adding perturbations ε(i) to the label
vector y(i) , consisting of temperature measurements taken at the FMI station
i. In particular, we replace the label vector y(i) with the perturbed label
vector y(i) + ε(i) . How much do the local model parameters delivered by
Algorithm 5.1 change when adding perturbations of different strength ε(i) 2 .
9 Privacy-Protection in FL
The core idea of FL is to share information contained in local datasets in order
to improve the training of ML models. Chapter 5 discussed FL algorithms that
share information in the form of model parameters that are computed from
the local loss function. Each node i ∈ V receives current model parameters
from other nodes and, after executing a local update, shares its new model
parameters with other nodes.
Depending on the design choices for GTVMin-based methods, sharing
model parameters makes it possible to reconstruct local loss functions and, in
turn, to estimate private information about individual data points such as
health-care customers ("patients") [91]. Thus, the bad news is that FL systems
will almost inevitably incur some leakage of private information. The good
news is, however, that the extent of privacy leakage can be controlled by (i)
careful design choices for GTVMin and (ii) applying slight modifications to
basic FL algorithms (such as those from Chapter 5).
This chapter revolves around two main questions:
Section 9.2 addresses Q1 while Sections 9.3 and 9.4 address Q2.
• know some quantitative measures for privacy leakage,
Consider an FL system that trains a personalized model for the users, indexed
by i = 1, ..., n, of heart rate sensors. Each user i generates a local dataset
D^{(i)} that consists of time-stamped heart rate measurements. We define a
single data point as a single continuous activity, e.g., a 50-minute run.
The features of such a data point (activity) might include the trajectory in
the form of a time-series of GPS coordinates (e.g., measured every 30 seconds).
The label of a data point (activity) could be the average heart rate during
the activity. Let us assume that this average heart rate is private information
that should not be shared with anybody.22
Our FL system also exploits the information provided by a fitness expert
who determines pair-wise similarities A_{i,i′} between users i, i′ (e.g.,
based on body weight and height). We then use Algorithm 5.1 to learn, for
each user i, the model parameters w^{(i)} of some AI health-care assistant
[92]. In what follows, we interpret Algorithm 5.1 as a map A(·) (see Figure
9.1). The map A reads in the dataset D := {D^{(i)}}_{i=1}^n (constituted by
the local datasets D^{(i)} for i = 1, ..., n) and delivers the local model
parameters A(D) := stack{ŵ^{(i)}}_{i=1}^n =: ŵ.
22. In particular, we might not want to share our heart rate profiles with a
potential future employer who prefers candidates with a long life expectancy.
Figure 9.1: Algorithm 5.1 maps the collection D of local datasets D^{(i)} to
the learnt model parameters ŵ^{(i)}, for each node i = 1, ..., n. These learnt
model parameters are (approximate) solutions to the GTVMin instance (3.20).
Figure 9.2: Scatterplot of data points r = 1, 2, ..., each characterized by
features x^{(r)} = (x₁^{(r)}, x₂^{(r)})^T and a binary label y^{(r)} ∈ {◦, ×}.
The plot also indicates the decision regions of a hypothesis ĥ that has been
learnt via ERM. Would you be able to infer the label of data point r = 2 if
you knew the decision regions?
[Figure 9.3: The probability distributions p(ŵ; D) and p(ŵ; D′) of the model
parameters delivered by a stochastic algorithm for two datasets D and D′,
along with a test threshold T.]
against varying the heart rate value y^{(i,r)},

    ‖A(D) − A(D′)‖₂ / ε.   (9.1)

Here, D denotes some given collection of local datasets and D′ is a modified
dataset. In particular, D′ is obtained by replacing the actual average heart
rate y^{(i,r)} with the modified value y^{(i,r)} + ε. The privacy protection
of A is higher for smaller values of (9.1), i.e., when the output changes only
little as the value of the average heart rate is varied.
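The measure (9.1) can be estimated by simply re-running the learning algorithm on the modified dataset. In the following sketch (our own stand-in, not from the text: a regularized linear regression plays the role of the FL algorithm A, and all constants are arbitrary), a single label is replaced by its perturbed value:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, lam = 40, 3, 0.1

X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def train(y):
    """Stand-in for the FL algorithm A: regularized linear regression."""
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

eps = 1.0
y_mod = y.copy()
y_mod[0] += eps                      # replace y^{(i,r)} by y^{(i,r)} + eps

sensitivity = np.linalg.norm(train(y) - train(y_mod)) / eps
print(sensitivity)   # a small value: output reveals little about one label
```

Because this particular stand-in for A is an affine function of the labels, the ratio (9.1) does not depend on the perturbation strength ε.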
Another measure for the non-invertibility of A is referred to as differential
privacy (DP). This measure is particularly useful for stochastic algorithms
that use some random mechanism. One example for such a random mechanism
is the selection of a random subset of data points (a batch) within Algorithm
5.1. Section 9.3 discusses another example of a random mechanism: adding the
realization of a RV to the (intermediate) results of an algorithm.
A stochastic algorithm A can be described by a probability distribution
p(ŵ; D) over a measurable space that is constituted by the possible values of
the learnt model parameters ŵ (see Figure 9.3).^23 This probability
distribution is parametrized by the dataset D that is fed as input to the
algorithm A.
DP measures the non-invertibility of a stochastic algorithm A via the
similarity of the probability distributions obtained for two datasets D, D′
that are considered neighbouring or adjacent [79, 96]. Typically, we consider
D′ as adjacent to D if it is obtained by modifying the features or label of a
single data point in D. As a case in point, consider data points representing
physical activities which are characterized by a binary feature xj ∈ {0, 1}
23. For more details about the concept of a measurable space, we refer to the
literature on probability and measure theory [93–95].
that indicates an excessively high average heart rate during the activity. We
could then define neighbouring datasets via flipping the feature xj of a single
data point. In general, the notion of neighbouring datasets is a design choice
used in the formal definition of DP.
Definition 2. (from [96]) A stochastic algorithm A is (α, γ)-RDP if, for any
two neighbouring datasets D, D′,

    D_α( p(ŵ; D) ‖ p(ŵ; D′) ) ≤ γ.   (9.4)
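For the Gaussian distributions arising from a Gaussian noise mechanism, the Rényi divergence in (9.4) has the closed form D_α(N(µ₀, σ²) ‖ N(µ₁, σ²)) = α(µ₀ − µ₁)²/(2σ²) [96]. A quick numerical sanity check of this formula (our own illustration, not from the text; the grid-based integration is a crude but sufficient approximation):

```python
import numpy as np

alpha, mu0, mu1, sigma = 2.0, 0.0, 1.0, 1.5

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
p = np.exp(-(x - mu0) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
q = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# D_alpha(p || q) = (1/(alpha-1)) * log integral p(x)^alpha q(x)^(1-alpha) dx
d_num = np.log(np.sum(p ** alpha * q ** (1 - alpha)) * dx) / (alpha - 1)
d_closed = alpha * (mu0 - mu1) ** 2 / (2 * sigma ** 2)
print(d_num, d_closed)   # the two values agree
```

The divergence, and hence the RDP guarantee, improves (decreases) as the noise variance σ² grows.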
Proposition 9.1. Consider an FL system A that is applied to some dataset
D and some (possibly stochastic) map B that does not depend on D. If A is
(ε, δ)-DP (or (α, γ)-RDP), then so is the composition B ◦ A.
For a privacy-preserving algorithm A, there should be no test T for which
both PD→D′ and PD′ →D are small (close to 0). This intuition can be made
precise as follows (see, e.g., [100, Thm. 2.1.] or [101]): If an algorithm A is
(ε, δ)-DP, then
exp(ε)PD→D′ + PD′ →D ≥ 1 − δ. (9.5)
Thus, if A is (ε, δ)-DP with small ε, δ (close to 0), then (9.5) implies PD→D′ +
PD′ →D ≈ 1.
Depending on the underlying design choices (for data, model, and optimization
method), a GTVMin-based method A might already ensure DP by design. For other
design choices, however, the resulting GTVMin-based method A might not ensure
DP. According to Prop. 9.1, we might then still be able to ensure DP by
applying pre- and/or post-processing techniques to the input (local datasets)
and output (learnt model parameters) of A. Formally, this means composing the
map A with two (possibly stochastic) maps I and O, resulting in a new
algorithm with map A′ := O ◦ A ◦ I. The output of A′ for a given dataset D is
obtained by
DP is simply to add some noise [96],

    O(A) := A + n,  with noise  n = (n₁, ..., n_{nd})^T,
    n₁, ..., n_{nd} i.i.d. ∼ p(n).   (9.6)

Note that the post-processing (9.6) is parametrized by the choice of the
probability distribution p(n) for the noise entries. Two important choices
are the Laplacian distribution p(n) := (1/(2b)) exp(−|n|/b) and the normal
distribution p(n) := (1/√(2πσ²)) exp(−n²/(2σ²)) (i.e., using Gaussian noise
n ∼ N(0, σ²)).
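A minimal sketch of the output perturbation (9.6) (our own illustration; the noise scales b and σ are arbitrary here and would, in practice, have to be calibrated to the sensitivity of A):

```python
import numpy as np

rng = np.random.default_rng(4)
n_params = 6                        # number of entries of stack{w_hat^(i)}
w_hat = rng.normal(size=n_params)   # exact output A(D) of the FL algorithm

b, sigma = 0.5, 0.5                 # arbitrary (uncalibrated) noise scales

# Laplacian noise, p(n) = (1/(2b)) exp(-|n|/b)
w_laplace = w_hat + rng.laplace(scale=b, size=n_params)

# Gaussian noise, n ~ N(0, sigma^2)
w_gauss = w_hat + rng.normal(scale=sigma, size=n_params)

print(w_laplace, w_gauss)   # released (noisy) model parameters
```

Only the noisy parameters are released; the exact output A(D) never leaves the system.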
The resulting mechanism is (ε, δ)-DP [96, Thm. 3.22]. It might be difficult
to evaluate the sensitivity (9.7) for a given FL algorithm A [102]. For a
GTVMin-based method, i.e., when A(D) is a solution to (3.18), we might obtain
upper bounds on ∆₂A by a
used for pre-processing might be different from just adding the realization of
a RV (see (9.6)):
"test" A. The hypothesis ĥ might be learnt by an ERM-based method (see
Algorithm 2.1) using a training set consisting of pairs
(A(D_syn^{(r)}), s^{(r)}), for r ∈ {1, ..., L}.
Figure 9.4: The solutions of the privacy funnel (9.8) trace out (for varying
constraint R) a curve in the plane spanned by the values of I(s; Φ(x))
(measuring the privacy leakage) and I(y; Φ(x)) (measuring the usefulness of
the transformed features for predicting the label).
we can use the MI I(y; Φ(x)) to measure the predictability of the label y
from Φ(x). A large value of I(y; Φ(x)) indicates that Φ(x) allows us to
accurately predict y (which is of course preferable).
It seems natural to use a feature map Φ(x) that optimally balances a small
I(s; Φ(x)) (privacy protection) with a sufficiently large I(y; Φ(x)) (allowing
us to accurately predict y). The mathematically precise formulation of this
plan is known as the privacy funnel [106, Eq. (2)],

    min_Φ I(s; Φ(x))  subject to  I(y; Φ(x)) ≥ R.   (9.8)

Figure 9.4 illustrates the solution of (9.8) for varying R, i.e., the minimum
value of I(y; Φ(x)).
Optimal Private Linear Transformation. The privacy funnel (9.8)
uses the MI I(s; Φ(x)) to quantify the privacy leakage of a feature map Φ(x).
An alternative measure of the privacy leakage is the minimum reconstruction
error s − ŝ. The reconstruction ŝ is obtained by applying a reconstruction
map r(·) to the transformed features Φ(x). If the joint probability
distribution p(s, x) is a multivariate normal distribution and Φ(·) is a
linear map (of the form Φ(x) := Fx with some matrix F), then the optimal
reconstruction map is again linear [30].
We would like to find the linear feature map Φ(x) := Fx such that, for any
linear reconstruction map r (resulting in ŝ := r^T Fx), the expected squared
error E{(s − ŝ)²} is large. The smallest possible expected squared error,
denoted ε(F), measures the level of privacy protection offered by the new
features z = Fx. The larger the value ε(F), the more privacy protection is
offered. It can be shown that ε(F) is maximized by any F that is orthogonal
to the cross-covariance vector c_{x,s} := E{xs}, i.e., whenever F c_{x,s} = 0.
One specific choice for F that satisfies this orthogonality condition is
given in (9.10).
Figure 9.5 illustrates a dataset for which we want to find a linear feature map
F such that the new features z = Fx do not allow to accurately predict a
private attribute.
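While (9.10) itself is not reproduced here, any matrix F with F c_{x,s} = 0 satisfies the orthogonality condition. The following sketch (our own toy data and naming) uses the orthogonal projection F = I − c c^T/‖c‖², which is one specific matrix with this property, and checks that the transformed features are empirically uncorrelated with the private attribute:

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 500, 2

s = rng.normal(size=m)    # private attribute
y = rng.normal(size=m)    # label, here generated independently of s
X = np.column_stack([s + 0.1 * rng.normal(size=m),   # x1 mostly reflects s
                     y + 0.1 * rng.normal(size=m)])  # x2 mostly reflects y

c = X.T @ s / m                              # empirical cross-covariance c_{x,s}
F = np.eye(d) - np.outer(c, c) / (c @ c)     # orthogonal projection, F c = 0

Z = X @ F.T                                  # new features z = F x
print(np.abs(Z.T @ s / m))                   # approx 0: z uncorrelated with s
```

The new features remain useful for predicting y, since the projection removes only the direction of the feature space that is correlated with s.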
9.5 Problems
9.1. Where is Alice? Consider a device, named Alice, that implements the
asynchronous Algorithm 8.1. The local dataset of this device consists of
temperature
Figure 9.5: A toy dataset D whose data points represent customers, each
characterized by features x = (x₁, x₂)^T. These raw features carry information
about a private attribute s (gender) and the label y (food preference) of a
person. The scatterplot suggests that we can find a linear feature
transformation F := f^T ∈ R^{1×2} resulting in a new feature z := Fx that does
not allow us to predict s, while still allowing us to predict y.
measurements from some FMI weather station. Assume that no other device
interacts with Alice except your device, named Bob. Develop software for
Bob that interacts with Alice according to Algorithm 8.1 in order to determine
at which FMI station we can find Alice.
9.6 Overview of Coding Assignment
Consider a social media post of a friend who is travelling across Finland.
This post includes a snapshot of a temperature measurement and a clock. Can
you guess the latitude and longitude of the location where your friend took
this snapshot? We can use ERM to do this: use Algorithm 2.1 to learn a
vector-valued hypothesis ĥ for predicting latitude and longitude from the
time and value of a temperature measurement. For the training set and
validation set, we use the FMI recordings stored in the data file.
9.6.2 Ensuring Privacy with Pre-Processing
Repeat the privacy attack described in Section 9.6.1 but this time using a
pre-processed version of the raw data. In particular, try out combinations of
randomly selecting a subset of the data points in the data file and also adding
noise to their features and labels. How well can you predict the latitude and
longitude from the time and value of a temperature measurement using a
hypothesis ĥ learnt from the perturbed data?
Repeat the privacy attack described in Section 9.6.1, but this time using
a post-processing of the learnt hypothesis ĥ (obtained from Algorithm 2.1
applied to the data file). In particular, study how well you can predict the
latitude and longitude from the time and value of a temperature measurement
using a noisy prediction ĥ(x) + n. Here, n is a realization drawn from a
multivariate normal distribution N(0, σ²I).
1. Read in all data points stored in the data file and construct a feature
matrix X ∈ R^{m×7}, with m being the total number of data points.

2. Remove the sample mean from each feature, resulting in the centered
feature matrix

    X̂ := X − (1/m) 1 1^T X,   1 := (1, ..., 1)^T ∈ R^m.   (9.11)

3. Extract the private attribute (centred normalized latitude) of each data
point, store it in the vector s, and compute

    ĉ_{x,s} := (1/m) X̂^T s.   (9.13)
The matrix F, obtained from (9.10) by replacing c_{x,s} with ĉ_{x,s}, is then
used to compute the privacy-preserving features z^{(r)} = F x^{(r)} for
r = 1, ..., m. To verify that these new features are indeed privacy-preserving,
we use linear regression (as implemented by the LinearRegression class of the
Python package scikit-learn) to learn the parameters of a linear model that
predicts the private attribute s^{(r)} = x₁^{(r)} (the latitude of the FMI
station at which the r-th measurement has been taken) from the features
z^{(r)}. (Note that you have already trained and validated a linear model
within the "ML Basics" assignment.) We can use the validation error of the
trained linear model as a measure of the privacy protection (a larger
validation error means higher privacy protection). Besides privacy protection,
the new features z^{(r)} should still allow an accurate prediction of the
label y^{(r)} of a data point. To this end, we learn the parameters of a
linear model that predicts the label y^{(r)} solely from the features z^{(r)}.
10 Data and Model Poisoning in FL
Every ML method, including ERM or GTVMin, is to some extent at the mercy
of the data generator. Indeed, the model parameters learnt via ERM (for basic
ML) or via GTVMin (for FL) are determined by the statistical properties
of the training set. We must hope (e.g., via an i.i.d. assumption) that the
data points in the training set truthfully reflect the statistical properties of
the underlying data generation process. However, these data points might
have been intentionally perturbed (poisoned).
In general, it is impossible to perfectly detect whether data points have been
poisoned. Perfect detection of perturbed data points would require complete
knowledge of the underlying probability distribution. However, we typically do
not know this probability distribution and can only estimate it from (possibly
perturbed) data points. We can then use this estimate to detect perturbed data
points via outlier detection techniques.
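As one simple instance of such an outlier detection technique (our own sketch, not a method prescribed in the text; the MAD-based score and the threshold 3.5 are common defaults), we can flag data points whose labels have a large robust z-score:

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=20.0, scale=1.0, size=100)   # clean labels (e.g., temperatures)
y[:5] += 15.0                                   # five poisoned data points

# robust z-score based on the median and the median absolute deviation (MAD)
med = np.median(y)
mad = np.median(np.abs(y - med))
z = 0.6745 * (y - med) / mad
flagged = np.flatnonzero(np.abs(z) > 3.5)

print(flagged)   # indices of the suspected poisoned data points
```

Median-based statistics are used here because the sample mean and standard deviation are themselves distorted by the poisoned points.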
Instead of trying to identify and remove poisoned data points, we can also
try to make GTVMin-based FL systems more robust against data poisoning.
We have already discussed the robustness of GTVMin-based methods in
Section 8.3. The level of robustness crucially depends on the design choices
for the local models, local loss functions and FL network in GTVMin.
Besides data poisoning, FL systems can also be subject to model poisoning
attacks. FL systems are implemented with distributed computers such as a
collection of smartphones (clients) that are inter-connected by wireless links
(see Section 5). For some applications it is not possible (or desirable) to
enforce sophisticated authentication techniques that determine which clients
can participate in the FL system. The distributed computations might then
be compromised (poisoned) in order to disrupt the training of the local model
at a specific ("target") client.
2. know how design choices for GTVMin-based methods affect the vulner-
ability of FL systems against these attacks.
perturbations at the nodes i′ ∈ W then propagate over the edges of the FL
network (via the model-parameter sharing step 3 in Algorithm 5.1) and
result in perturbed model parameters at other nodes (whose local datasets
have not been poisoned).
Based on the objective of an attack, we distinguish between the following
attack types:
with feature values falling in this range such that the hypothesis h̃(i)
delivers a specific prediction (e.g., a prediction that results in granting
access to a restricted area within a building). Figure 10.1 illustrates, for
some (target) node i, the result of a backdoor attack on a FL system.
Figure 10.1: A local dataset D^{(i)} along with three hypothesis maps learnt
via some GTVMin-based method such as Algorithm 5.1. These three maps are
obtained under different attacks on the FL system: the map ĥ^{(i)} is obtained
when no manipulation is applied (no attack). A backdoor attack aims at nudging
the learnt hypothesis h̃^{(i)} to behave similarly to ĥ^{(i)} when applied to
data points in D^{(i)}, while behaving very differently for a certain range
of feature values outside of D^{(i)}. This value range can be interpreted as a
backdoor that is used to trigger a malicious prediction. In contrast to
backdoor attacks, a denial-of-service attack tries to enforce a learnt
hypothesis h̄^{(i)} that delivers poor predictions on the local dataset D^{(i)}.
10.3 Data Poisoning
• Clean-Label Attack: The attacker leaves the labels untouched and only
manipulates the features of data points in the training set.
compared to the squared error loss. Another class of robust loss functions is
obtained by including a penalty term (as in regularization).
10.5 Assignment
10.5.1 Denial-of-Service Attack
Glossary
k-means The k-means algorithm is a hard clustering method which assigns
each data point of a dataset to precisely one of k different clusters.
The method alternates between updating the cluster assignments (to
the cluster with nearest cluster mean) and, given the updated cluster
assignments, re-calculating the cluster means [6, Ch. 8].. 5, 6, 17, 23
observed reward signals.. 2, 9
among any hypothesis (not even required to belong to the hypothesis
space H) [30]. This minimum achievable risk (referred to as the Bayes
risk) is the risk of the Bayes estimator for the label y of a data point,
given its features x. Note that, for a given choice of loss function, the
Bayes estimator (if it exists) is completely determined by the probability
distribution p(x, y) [30, Chapter 4]. However, there are two challenges to
computing the Bayes estimator and the Bayes risk: (i) the probability
distribution p(x, y) is unknown and needs to be estimated, and (ii) even if we
know p(x, y), it might be computationally too expensive to
compute the Bayes risk exactly. A widely used probabilistic model is
the multivariate normal distribution (x, y) ∼ N (µ, Σ) for data points
characterised by numeric features and labels. Here, for the squared
error loss, the Bayes estimator is given by the posterior mean µy|x of
the label y given the features x [30, 108]. The corresponding Bayes risk is
given by the posterior variance σ²_{y|x} (see Figure 10.2).. 5, 11, 13, 14
Figure 10.2: If the features and label of a data point are drawn from a
multivariate normal distribution, we can achieve the minimum risk (under
squared error loss) by using the Bayes estimator µ_{y|x} to predict the label
y of a data point with features x. The corresponding minimum risk is given by
the posterior variance σ²_{y|x}. We can use this quantity as a baseline for
the average loss of a trained model ĥ.
probability distribution and the choice for the loss function L (·, ·).. 3,
4, 11
data points as realizations of i.i.d. RVs,
cluster A cluster is a subset of data points that are more similar to each
other than to the data points outside the cluster. The quantitative
measure of similarity between data points is a design choice. If data
points are characterized by Euclidean feature vectors x ∈ Rd , we can
define the similarity between two data points via the Euclidean distance
between their feature vectors.. 1, 5–8, 11, 13, 17, 23, 43
points in the same cluster are more similar to each other than to
those outside the cluster [54]. We obtain different clustering methods
by using different notions of similarity between data points.. 2
define a function as convex if its epigraph is a convex set [29].. 2, 4, 5,
8–10, 17, 36, 37, 47
λ1 ≤ . . . ≤ λn .
covariance matrix The covariance matrix of a RV x ∈ Rd is defined as
E{(x − E{x})(x − E{x})T }.. 11, 17, 18, 32
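A minimal sketch of this definition (the covariance matrix below is our own toy choice): the sample version of E{(x − E{x})(x − E{x})T } recovers the true covariance matrix for a large number of i.i.d. draws.

```python
import numpy as np

# Draw samples of a RV x in R^2 with known covariance (our toy choice).
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
rng = np.random.default_rng(1)
x = rng.multivariate_normal([0.0, 0.0], true_cov, size=100_000)

# Sample version of E{(x - E{x})(x - E{x})^T}.
centered = x - x.mean(axis=0)
C = centered.T @ centered / len(x)
```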
data minimization principle European data protection regulation includes
a data minimization principle. This principle requires a data controller
to limit the collection of personal information to what is directly relevant
and necessary to accomplish a specified purpose. The data should be
retained only for as long as necessary to fulfill that purpose [82, Article
5(1)(c)].. 19
data point A data point is any object that conveys information [109]. Data
points might be students, radio signals, trees, forests, images, RVs, real
numbers or proteins. We characterize data points using two types of
properties. One type of property is referred to as a feature. Features
are properties of a data point that can be measured or computed in
an automated fashion. A different kind of property is referred to as
a label. The label of a data point represents some higher-level fact (or
quantity of interest). In contrast to features, determining the label of a
data point typically requires human experts (domain experts). Roughly
speaking, ML aims to predict the label of a data point based solely on
its features.. 1–20, 23–31, 34–36, 38–48
dataset With a slight abuse of language we use the term dataset or set of
data points to refer to an indexed list of data points z(1) , z(2) , . . .. Thus,
there is a first data point z(1) , a second data point z(2) and so on. Strictly
speaking, a dataset is a list and not a set [112]. We need to keep track
of the order of data points in order to cope with several data points
having the same features and labels. Database theory studies formal
languages for defining, structuring, and reasoning about datasets..
1–17, 19, 24, 28–31, 35, 45
is continued until the data point ends up in a leaf node (having no
children nodes). . 3, 6, 13, 14
deep net A deep net is an ANN with a (relatively) large number of hidden
layers. Deep learning is an umbrella term for ML methods that use a
deep net as their model [71].. 14, 15, 26
device Any physical system that can be used to store and process data.
In the context of ML, we typically mean a computer that is able to read
in data points from different sources and, in turn, to train a ML model
using these data points.. 3, 14, 16
privacy leakage incurred by revealing the output. Roughly speaking, a
ML method is differentially private if the probability distribution of the
output A(D) does not change too much if the sensitive attribute of one
data point in the training set is changed. Note that differential privacy
builds on a probabilistic model for a ML method, i.e., we interpret its
output A(D) as the realization of a RV. The randomness in the output
can be ensured by intentionally adding the realization of an auxiliary
RV (noise) to the output of the ML method.. 6–12, 35
eigenvalue decomposition The eigenvalue decomposition for a square
matrix A ∈ Rd×d is a factorization of the form
A = VΛV⁻¹.
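A quick numerical sketch of this factorization (the matrix A below is our own toy example): NumPy returns the eigenvalues and eigenvectors, from which A can be reconstructed as VΛV⁻¹.

```python
import numpy as np

# A small symmetric matrix (our example); it has a real-valued EVD.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, V = np.linalg.eig(A)   # eigenvalues lam, eigenvectors as columns of V
Lam = np.diag(lam)

# Reconstruct A from its eigenvalue decomposition A = V Lam V^{-1}.
A_rec = V @ Lam @ np.linalg.inv(V)
```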
the average loss incurred by h when applied to the data points in D.. 3,
13, 14, 33, 44, 45
estimation error Consider data points with feature vectors x and label y. In
some applications we can model the relation between the features and the
label of a data point as y = h̄(x) + ε. Here, we used some true underlying
hypothesis h̄ and a noise term ε which summarizes any modelling or
labelling errors. The estimation error incurred by a ML method that
learns a hypothesis ĥ, e.g., using ERM, is defined as ĥ(x) − h̄(x), for
some feature vector x. For a parameterized hypothesis space, consisting
of hypothesis maps that are determined by model parameters w, we
can define the estimation error as ∆w = ŵ − w [68, 113].. 8–10, 12, 13
4, 7, 16, 29, 38
explainability of a trained model can be constructed by comparing its
predictions with the predictions provided by a user on a test-set [88, 90].
Alternatively, we can use probabilistic models for data and measure
explainability of a trained ML model via the conditional (differential)
entropy of its predictions, given the user predictions [89, 116]. . 1, 3, 6,
17–19
feature A feature of a data point is one of its properties that can be mea-
sured or computed easily without the need for human supervision. For
example, if a data point is a digital image (e.g., stored as a jpeg
file), then we could use the red-green-blue intensities of its pixels as
features. Domain-specific synonyms for the term feature are covariate,
explanatory variable, independent variable, input (variable), predictor
(variable) or regressor [117–119].. 1–20, 23, 24, 27–31, 34, 35, 38, 41–43,
47, 48
feature map A map that transforms the original features of a data point
into new features. The so-obtained new features might be preferable over
the original features for several reasons. For example, the arrangement
of data points might become simpler (of more linear ) in the new feature
space, allowing to use linear models in the new features. This idea is
a main driver for the development of kernel methods [120]. Moreover,
the hidden layers of a deep net can be interpreted as a trainable feature
map followed by a linear model in the form of the output layer. Another
reason for learning a feature map could be that learning a small number
of new features helps to avoid overfitting and ensure interpretability [121].
The special case of a feature map delivering two numeric features is
particularly useful for data visualization. Indeed, we can depict data
points in a scatterplot by using two features as the coordinates of a
data point.. 14, 15
of [14] consider federated averaging methods where the local model
training is implemented by running several GD steps.. 2, 14
Gaussian random variable A standard Gaussian RV is a real-valued ran-
dom variable x with probability density function (pdf) [5, 108, 122]
p(x) = (1/√(2π)) exp(−x²/2).
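The standard Gaussian pdf is easy to check numerically; the sketch below (function name is ours) verifies that it peaks at 1/√(2π) and integrates to one over a wide grid.

```python
import numpy as np

def standard_normal_pdf(x):
    """pdf of a standard Gaussian RV: p(x) = exp(-x^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

# Crude Riemann-sum check that the pdf integrates to one.
grid = np.linspace(-8.0, 8.0, 100_001)
total = standard_normal_pdf(grid).sum() * (grid[1] - grid[0])
```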
• Data minimization principle: ML systems should only use the
amount of personal data necessary for their purpose.
. 4, 6
w(k) . We can achieve this typically by using a gradient step
with a sufficiently small step size η > 0. Figure 10.3 illustrates the effect
of a single GD step (10.2). . 1, 2, 5–8, 11–14, 17, 20, 22, 28, 36, 43, 44
Figure 10.3: A single gradient step (10.2) towards the minimizer of f (w).

ŵ := w − η∇f (w).    (10.3)
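The gradient step (10.3) can be sketched in a few lines; the objective below (a simple quadratic with known minimizer) is our own toy choice, used to show that repeated gradient steps with a sufficiently small step size converge to the minimizer.

```python
import numpy as np

def grad_step(w, grad_f, eta):
    """One basic gradient step (10.3): w_hat = w - eta * grad_f(w)."""
    return w - eta * grad_f(w)

# Toy smooth objective f(w) = ||w - w_star||^2 with known minimizer.
w_star = np.array([3.0, -1.0])
grad_f = lambda w: 2.0 * (w - w_star)

w = np.zeros(2)
for _ in range(200):
    w = grad_step(w, grad_f, eta=0.1)
# w converges to w_star
```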
Figure 10.4: The basic gradient step (10.3) maps a given vector w to the
updated vector ŵ. It defines an operator T (f,η) (·) : Rd → Rd : w ↦ ŵ.

ŵ = argmin_{w′ ∈ Rd} f (w′ ) + (1/η) ∥w − w′ ∥₂².    (10.4)
GTV of local model parameters as a regularizer [33].. 1–15, 17, 18
allows for efficient matrix operations, and there is an (approximately)
linear relation between features and label, a useful choice for the hy-
pothesis space might be the linear model.. 1, 3, 4, 12–16, 29–31, 34, 40,
45–47
pdf of the underlying RVs.. 3–5, 7, 8, 10, 13–18, 24, 27, 31, 36, 41, 43
linear feature space X ′ which is (in general) different from the original
feature space X . The feature space X ′ has a specific mathematical
structure, i.e., it is a reproducing kernel Hilbert space [120, 124].. 15,
25, 34
Figure 10.6: Five data points characterized by feature vectors x(r) and
labels y (r) ∈ {◦, □}, for r = 1, . . . , 5. With these feature vectors, there is no
way to separate the two classes by a straight line (representing the decision
boundary of a linear classifier). In contrast, the transformed feature vectors
z(r) = K(x(r) , ·) allow us to separate the data points using a linear classifier.
whether the image contains a cat or not. Synonyms for label, commonly
used in specific domains, include response variable, output variable, and
target [117–119].. 1–18, 20, 23–31, 34, 35, 38, 41–43, 45, 48
Here, Ai,i′ denotes the edge weight of an edge {i, i′ } ∈ E. . 2–6, 10,
16–18
        ⎡  2  −1  −1 ⎤
L(G) =  ⎢ −1   1   0 ⎥
        ⎣ −1   0   1 ⎦
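This Laplacian matrix can be reproduced as L = D − A from the adjacency matrix of a three-node graph whose edges we read off the matrix entries (an assumption based on the reconstructed example, namely edges {1, 2} and {1, 3} with unit weights):

```python
import numpy as np

# Adjacency matrix of a graph with nodes {1, 2, 3} and edges {1,2}, {1,3}.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

# Graph Laplacian L = D - A, with degree matrix D = diag(A @ ones).
L = np.diag(A.sum(axis=1)) - A

# Basic properties: rows sum to zero and L is positive semi-definite.
row_sums = L.sum(axis=1)
eigvals = np.linalg.eigvalsh(L)
```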
law of large numbers The law of large numbers refers to the convergence
of the average of an increasing (large) number of i.i.d. RVs to the
mean of their common probability distribution. Different instances
of the law of large numbers are obtained using different notions of
convergence [122].. 14, 15
learning method is GD and its variants such as SGD or projected GD.
The term learning rate refers to a parameter of an iterative learning method
that controls the extent to which the current hypothesis can be modified
during a single iteration. A prime example of such a parameter is the
step size used in GD [6, Ch. 5].. 1, 2, 4–11, 13, 16, 17, 43
least absolute shrinkage and selection operator (Lasso) The least ab-
solute shrinkage and selection operator (Lasso) is an instance of struc-
tural risk minimization (SRM) to learn the weights w of a linear map
h(x) = wT x based on a training set. The Lasso is obtained from linear
regression by adding the scaled ℓ1 -norm α ∥w∥1 to the average squared
error loss incurred on the training set.. 40
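One standard way to compute the Lasso (not necessarily the solver meant in the text) is proximal gradient descent (ISTA), which alternates a gradient step on the squared error with entry-wise soft thresholding; the 1/m scaling of the objective and all data below are our own assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entry-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=2000):
    """Minimize (1/m) ||y - X w||_2^2 + alpha ||w||_1 via ISTA."""
    m, d = X.shape
    step = m / (2.0 * np.linalg.norm(X, 2) ** 2)  # inverse Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ w) / m
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# Toy data with a sparse ground truth; the Lasso zeroes out the
# irrelevant weights.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
w_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.01 * rng.standard_normal(200)
w_hat = lasso_ista(X, y, alpha=0.1)
```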
data points which are characterized by features and labels. In contrast
to a single dataset used in basic ML methods, a local dataset is also
related to other local datasets via different notions of similarities. These
similarities might arise from probabilistic models or communication
infrastructure and are encoded in the edges of a FL network.. 1–18, 23,
30, 33, 35, 47
local model Consider a collection of local datasets that are assigned to the
nodes of a FL network. A local model H(i) is a hypothesis space assigned
to a node i ∈ V. Different nodes might be assigned different hypothesis
spaces, i.e., in general H(i) ̸= H(i′) for different nodes i, i′ ∈ V.. 1–7, 9,
11–14, 16–18
loss ML methods use a loss function L (z, h) to measure the error incurred
by applying a specific hypothesis to a specific data point. With slight
abuse of notation, we use the term loss both for the loss function L
itself and for its value L (z, h) at a specific data point z and hypothesis
h.. 1–9, 11–17, 31, 33, 39–41, 43, 45, 46, 48
L : X × Y × H → R+ : ((x, y), h) ↦ L ((x, y), h),
which assigns to a pair of a data point, with features x and label y, and
a hypothesis h ∈ H the non-negative real number L ((x, y), h). The
loss value L ((x, y), h) quantifies the discrepancy between the true label
y and the prediction h(x). Lower (closer to zero) values L ((x, y), h)
indicate a smaller discrepancy between prediction h(x) and label y.
Figure 10.8 depicts a loss function for a given data point, with features
x and label y, as a function of the hypothesis h ∈ H. . 1–9, 11, 12, 14,
Figure 10.8: Some loss function L ((x, y), h) for a fixed data point, with
feature vector x and label y, and varying hypothesis h. ML methods try to
find (learn) a hypothesis that incurs minimum loss.
model In the context of ML methods, the term model typically refers to the
hypothesis space used by a ML method [6, 131].. 1–6, 8, 9, 11–18, 23,
28, 29, 32, 33, 38, 40, 43, 45
model parameters Model parameters are quantities that are used to select a
specific hypothesis map from a model. We can think of model parameters
as a unique identifier for a hypothesis map, similar to how a social
security number identifies a person in Finland.. 1–11, 13–23, 31, 33, 35,
36, 38, 40, 44, 47
. 3, 4, 6, 11, 17, 18
neighbours The neighbours of a node i ∈ V within a FL network are those
nodes i′ ∈ V \ {i} that are connected (via an edge) to node i.. 6, 7, 10,
16, 32, 33
networked data Networked data consists of local datasets that are related
by some notion of pair-wise similarity. We can represent networked
data using a graph whose nodes carry local datasets and edges encode
pairwise similarities. One example for networked data arises in FL
applications where local datasets are generated by spatially distributed
devices.. 12, 33
maps. For example, the linear model H := {h(w) : h(w) (x) = w1 x + w2 }
consists of all hypothesis maps h(w) (x) = w1 x + w2 with a particular
choice for the parameters w = (w1 , w2 )T ∈ R2 . Another example of
parameters is the weights assigned to the connections between neurons
of an ANN.. 1–18
privacy leakage Consider a (ML or FL) system that processes a local dataset
D(i) and shares data, such as the predictions obtained for new data
points, with other parties. Privacy leakage arises if the shared data
carries information about a private (sensitive) feature of a data point
(which might be a human) of D(i) . The amount of privacy leakage can
be measured via MI using a probabilistic model for the local dataset.
Another quantitative measure for privacy leakage is DP.. 2
probability density function (pdf ) The probability density function (pdf)
p(x) of a real-valued RV x ∈ R is a particular representation of its prob-
ability distribution. If the pdf exists, it can be used to compute the
probability that x takes on a value from a (measurable) set B ⊆ R
via p(x ∈ B) = ∫B p(x′ )dx′ [5, Ch. 3]. The pdf of a vector-valued RV..
18, 24
computed efficiently are sometimes referred to as proximable or simple
[37].. 11
f (w) = wT Qw + qT w + a,
to real numbers R. A vector-valued random variable maps elementary
events to the Euclidean space Rd . Probability theory uses the concept
of measurable spaces to rigorously define and study the properties of
(large) collections of random variables [95, 108].. 3–15, 17, 18, 24, 27,
31, 32, 35, 36, 38, 39, 41, 47
regression).
[Figure: label y versus feature x for the original training set D and an
augmented training set, together with hypothesis maps from
{h : h(x) = w1 x + w0 ; w1 ∈ [0.4, 0.6]}, illustrating the SRM objective
(1/m) ∑ᵐr=1 L((x(r) , y (r) ), h) + αR(h).]
training set might differ from its prediction errors on data points outside
the training set. Ridge regression uses the regularizer R h := ∥w∥22
for linear hypothesis maps h(w) (x) := wT x [6, Ch. 3]. The least
absolute shrinkage and selection operator (Lasso) uses the regularizer
R h := ∥w∥1 for linear hypothesis maps h(w) (x) := wT x [6, Ch. 3]..
1, 6, 9, 13, 16, 23
esis map h(w) (x) = wT x. The quality of a particular choice for the
parameter vector w is measured by the sum of two components. The
first component is the average squared error loss incurred by h(w) on a
set of labeled data points (the training set). The second component is
the scaled squared Euclidean norm α∥w∥22 with a regularization param-
eter α > 0. It can be shown that the effect of adding α∥w∥22 to the
average squared error loss is equivalent to replacing the original data
points by an ensemble of realizations of a RV centered around these
data points.. 2, 3, 9, 10, 13, 16–19, 39, 40
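Setting the gradient of the ridge objective to zero yields a linear system with a closed-form solution; the sketch below uses an average squared error with our own 1/m scaling convention and toy data, and checks that α = 0 recovers ordinary least squares while a larger α shrinks the parameters towards zero.

```python
import numpy as np

def ridge(X, y, alpha):
    """Minimize (1/m) ||y - X w||_2^2 + alpha ||w||_2^2 (our scaling).
    The zero-gradient condition gives (X^T X / m + alpha I) w = X^T y / m."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + alpha * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless toy data

w0 = ridge(X, y, alpha=0.0)          # alpha = 0: ordinary least squares
w1 = ridge(X, y, alpha=1.0)          # larger alpha shrinks w towards zero
```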
risk Consider a hypothesis h used to predict the label y of a data point based
on its features x. We measure the quality of a particular prediction
using a loss function L ((x, y), h). If we interpret data points as the
realizations of i.i.d. RVs, the loss L ((x, y), h) also becomes the realization
of a RV. The i.i.d. assumption allows us to define the risk of a hypothesis
as the expected loss E{L ((x, y), h)}. Note that the risk of h depends
on both the specific choice for the loss function and the probability
distribution of the data points.. 2–4, 7, 11, 33, 39, 46
Figure 10.10: A scatterplot of data points that represent daily weather
conditions in Finland. Each data point is characterized by its minimum
daytime temperature x as feature and its maximum daytime temperature
y as the label. The temperatures have been measured at the FMI weather
station Helsinki Kaisaniemi during 1.9.2024 - 28.10.2024.
. 1, 4, 5, 47
soft clustering Soft clustering refers to the task of partitioning a given set of
data points into (few) overlapping clusters. Each data point is assigned
to several different clusters with varying degree of belonging. Soft
clustering methods determine the degree of belonging (or soft cluster
assignment) for each data point and each cluster. A principled approach
to soft clustering is to interpret data points as i.i.d. realizations of
a GMM. We then obtain a natural choice for the degree of belonging
as the conditional probability of a data point belonging to a specific
mixture component.. 6, 11
squared error loss The squared error loss measures the prediction error of
a hypothesis h when predicting a numeric label y ∈ R from the features
x of a data point. It is defined as
L ((x, y), h) := (y − h(x))², where ŷ = h(x) denotes the prediction.
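The definition translates directly into code; a one-line sketch (function name is ours):

```python
def squared_error_loss(y, y_hat):
    """Squared error loss (y - y_hat)^2 with the prediction y_hat = h(x)."""
    return (y - y_hat) ** 2

# A perfect prediction incurs zero loss; errors are penalized quadratically.
loss_perfect = squared_error_loss(3.0, 3.0)
loss_off_by_two = squared_error_loss(3.0, 1.0)
```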
the gradient sum ∑ᵐr=1 ∇w L (z(r) , w) by a sum ∑r∈B over a randomly
chosen subset (batch) B of the training set, and then iterating.. 1, 3, 4,
6, 7, 10, 13, 15, 16
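The batch-gradient idea can be sketched for linear regression as follows; the objective scaling, step size, and toy data are our own choices, not from the text.

```python
import numpy as np

def sgd_linreg(X, y, batch_size=10, eta=0.1, n_iter=1000, seed=0):
    """SGD for linear regression: in each iteration, approximate the full
    gradient sum by a sum over a randomly chosen batch B."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        B = rng.choice(m, size=batch_size, replace=False)   # random batch
        grad = -2.0 * X[B].T @ (y[B] - X[B] @ w) / batch_size
        w -= eta * grad
    return w

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                    # noiseless toy data
w_hat = sgd_linreg(X, y)
```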
test set A set of data points that have neither been used to train a model,
e.g., via ERM, nor in a validation set to choose between different models..
2
training error The average loss of a hypothesis when predicting the labels of
data points in a training set. We sometimes also use the term training
error for the minimum average loss incurred on the training set by the
optimal hypothesis from a hypothesis space.. 1, 3, 11–15, 17–19, 39, 45, 46
The European Union has put forward seven key requirements (KRs)
for trustworthy AI (that typically build on ML methods) [73]: KR1 -
Human Agency and Oversight, KR2 - Technical Robustness
and Safety, KR3 - Privacy and Data Governance, KR4 - Trans-
parency, KR5 - Diversity, Non-Discrimination and Fairness,
KR6 - Societal and Environmental Well-Being, KR7 - Account-
ability. . 1
validation Consider a hypothesis ĥ that has been learnt via some ML method,
e.g., by solving ERM on a training set D. Validation refers to the practice
of evaluating the loss incurred by hypothesis ĥ on a validation set that
consists of data points that are not contained in the training set D.. 1,
6, 11
validation set A set of data points used to estimate the risk of a hypothesis
ĥ that has been learnt by some ML method (e.g., solving ERM). The
average loss of ĥ on the validation set is referred to as the validation
error and can be used to diagnose a ML method (see [6, Sec. 6.6.]).
The comparison between training error and validation error can inform
directions for improvements of the ML method (such as using a different
hypothesis space).. 3, 11, 12, 14, 15, 17, 19, 45, 46
variance The variance of a real-valued RV x is defined as the expectation
E{(x − E{x})²} of the squared difference between x and its expectation
E{x}. We extend this definition to vector-valued RVs x as
E{∥x − E{x}∥₂²}.. 3, 4, 14
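Both the scalar and the vector-valued definition are easy to check by sampling; the distributions below are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Real-valued RV: sample version of E{(x - E{x})^2}; true variance is 9.
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
var_hat = np.mean((x - x.mean()) ** 2)

# Vector-valued RV: E{||x - E{x}||_2^2} equals the sum of per-coordinate
# variances (here 1 + 4 = 5).
z = rng.normal(size=(1_000_000, 2)) * np.array([1.0, 2.0])
vec_var_hat = np.mean(np.sum((z - z.mean(axis=0)) ** 2, axis=1))
```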
vertical FL Vertical FL uses local datasets that consist of the same
data points but characterize them with different features [55]. For
example, different healthcare providers might all hold information
about the same population of patients. However, different healthcare
providers collect different measurements (blood values, electrocardiography,
lung X-ray) for the same patients.. 2, 11, 12
∇f (ŵ) = 0 ⇔ f (ŵ) = min_{w∈Rd} f (w).
. 12
0/1 loss The 0/1 loss L ((x, y) , h) measures the quality of a classifier h(x)
that delivers a prediction ŷ (e.g., via thresholding (10.1)) for the label
y of a data point with features x. It is equal to 0 if the prediction
is correct, i.e., L ((x, y) , h) = 0 when ŷ = y. It is equal to 1 if the
prediction is wrong, L ((x, y) , h) = 1 when ŷ ̸= y.. 1
References
[1] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-
Hill, 1987.
[2] ——, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-
Hill, 1976.
[3] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Bal-
timore, MD: Johns Hopkins University Press, 2013.
[4] G. Golub and C. van Loan, “An analysis of the total least squares
problem,” SIAM J. Numerical Analysis, vol. 17, no. 6, pp. 883–893, Dec.
1980.
[6] A. Jung, Machine Learning: The Basics, 1st ed. Springer Singapore,
Feb. 2022.
[9] H. Ates, A. Yetisen, F. Güder, and C. Dincer, “Wearable devices for
the detection of covid-19,” Nature Electronics, vol. 4, no. 1, pp. 13–14,
2021. [Online]. Available: [Link]
[11] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, Eds., Big Data over Networks.
Cambridge Univ. Press, 2016.
[15] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:
Challenges, methods, and future directions,” IEEE Signal Processing
Magazine, vol. 37, no. 3, pp. 50–60, May 2020.
google keyboard query suggestions,” 2018. [Online]. Available:
[Link]
[30] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed.
New York: Springer, 1998.
[39] R. Peng and D. A. Spielman, “An efficient parallel solver for SDD linear
systems,” in Proc. ACM Symposium on Theory of Computing, New
York, NY, 2014, pp. 333–342.
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, PyTorch: An Imperative Style, High-
Performance Deep Learning Library. Red Hook, NY, USA: Curran
Associates Inc., 2019.
[52] D. J. Spiegelhalter, “An omnibus test for normality for small samples,”
Biometrika, vol. 67, no. 2, pp. 493–496, 1980. [Online]. Available: [Link]
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2018, pp. 1199–1208.
[62] G. Keiser, Optical Fiber Communication, 4th ed. New Delhi: Mc-Graw
Hill, 2011.
[65] Y.-T. Chow, W. Shi, T. Wu, and W. Yin, “Expander graph and
communication-efficient decentralized optimization,” in 2016 50th Asilomar
Conference on Signals, Systems and Computers, 2016, pp. 1715–1720.
[69] S. Chepuri, S. Liu, G. Leus, and A. Hero, “Learning sparse graphs under
smoothness prior,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech
and Signal Processing, 2017, pp. 6508–6512.
[73] European Commission, Directorate-General for Communications Net-
works, Content and Technology, The Assessment List for Trustworthy
Artificial Intelligence (ALTAI) for self assessment. Publications Office, 2020.
[81] P. Samarati, “Protecting respondents identities in microdata release,”
IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6,
pp. 1010–1027, 2001.
[87] C. Wang, Y. Yang, and P. Zhou, “Towards efficient scheduling of feder-
ated mobile devices under computational and statistical heterogeneity,”
IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 2,
pp. 394–410, 2021.
in 2019 IEEE Fifth International Conference on Big Data Computing
Service and Applications (BigDataService), 2019, pp. 271–278.
[93] R. B. Ash, Probability and Measure Theory, 2nd ed. New York:
Academic Press, 2000.
[95] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
France: PMLR, 07–09 Jul 2015, pp. 1376–1385. [Online]. Available:
[Link]
USA: Association for Computing Machinery, 2016, pp. 43–54. [Online].
Available: [Link]
[108] R. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed.
New York: Springer, 2009.
[113] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical
Learning, ser. Springer Series in Statistics. New York, NY, USA:
Springer, 2001.
[121] M. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?”:
Explaining the predictions of any classifier,” in Proc. 22nd ACM
SIGKDD, Aug. 2016, pp. 1135–1144.
[122] A. Papoulis and S. U. Pillai, Probability, Random Variables, and Stochas-
tic Processes, 4th ed. New York: Mc-Graw Hill, 2002.
[130] C. Rudin, “Stop explaining black box machine learning models for high-
stakes decisions and use interpretable models instead,” Nature Machine
Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[131] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning
– from Theory to Algorithms. Cambridge University Press, 2014.