Mixing Times in Markov Chains Explained
Justin Salez
Abstract
How many times should one shuffle a deck of 52 cards? This course is a self-
contained introduction to the modern theory of mixing for Markov chains. It consists
of a guided tour through the various methods for estimating mixing times, including
couplings, spectral analysis, discrete geometry, and functional inequalities. Each of
those tools is illustrated on a variety of examples from different contexts: interacting
particle systems, card shufflings, random walks on graphs and networks, etc. Particular
attention is devoted to the cutoff phenomenon, a remarkable but still mysterious
phase transition in the convergence to equilibrium of certain chains.
Contents
1 Framework
1.1 Markov chains
1.2 Distance to equilibrium
1.3 Relaxation time and mixing time
1.4 The cutoff phenomenon
1.5 Random walks on graphs and groups
2 Probabilistic techniques
2.1 Distinguishing statistics
2.2 Couplings
2.3 Coalescence times
2.4 Application: random walk on the cycle
2.5 Application: Ehrenfest model
3 Spectral techniques
3.1 Spectral radius
3.2 Diagonalization of reversible kernels
3.3 Wilson's method
3.4 Application: limit profile for the cycle
3.5 Application: cutoff for the hypercube
4 Geometric techniques
4.1 Volume, degree, diameter
4.2 Conductance
4.3 Curvature
4.4 Application: phase transition in the Curie-Weiss model
4.5 Carne-Varopoulos bound
5 Variational techniques
5.1 Dirichlet form and Poincaré constant
5.2 Cheeger inequalities
5.3 Comparison principle
5.4 Distinguished paths
1 Framework
This first chapter sets the stage on which we are going to perform. We start with a brief
but self-contained reminder on finite Markov chains and their large-time behavior. We then
introduce the total-variation distance to equilibrium, collect some of its basic properties, and
use them to define three fundamental notions that will lie at the center of our attention: the
relaxation time, the mixing time, and the cutoff phenomenon. Finally, we briefly present two
rich classes of Markov chains which are important from both a theoretical and a practical
viewpoint: random walks on graphs, and random walks on groups.
Definition 1 (Markov chain). A Markov chain is specified by three ingredients:
(i) A state space X , which in our case will just be a finite, non-empty set.
(ii) A transition kernel P : X × X → [0, 1], i.e. a matrix with non-negative entries whose rows each sum to 1.
(iii) An initial law ν ∈ P(X ).
The associated Markov chain is a process X = (Xt)t≥0 with
P(X0 = x0, . . . , Xt = xt) = ν(x0) P(x0, x1) · · · P(xt−1, xt), (3)
for every t ∈ N = {0, 1, 2, . . .} and every (x0, . . . , xt) ∈ X^{t+1}. We write X ∼ MC(X , P, ν).
Remark 1 (Markov property). The product form (3) asserts that the past (X0 , . . . , Xt−1 )
and the future (Xt+1 , Xt+2 , . . .) are conditionally independent, given the present Xt .
[Figure 1: the transition diagram of a Markov chain on the three states a, b, c, with edges labeled by their transition probabilities.]
Let X ∼ MC(X , P, ν), and let µt denote the law of the random variable Xt . Then it
follows from the above definition that µ0 = ν and that for every t ≥ 1,
∀y ∈ X , µt(y) = Σ_{x∈X} µt−1(x) P(x, y). (4)
In matrix notation, viewing µt as a row vector, this reads
µt = µt−1 P. (5)
We shall be interested in the large-time distribution of our Markov chain, i.e., the asymptotic
behavior of µt as t → ∞. If the sequence (µt )t≥0 converges to some distribution π ∈ P(X ),
then passing to the limit in the above recursion shows that π must be stationary, i.e.
π = πP. (6)
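Numerically, a stationary distribution can be found by simply iterating the recursion (5). A minimal sketch in Python, on an illustrative 3-state kernel (the matrix is an assumption, not taken from the text):

```python
# Iterating the recursion (5) numerically; the 3-state kernel below is an
# illustrative assumption, not taken from the text.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]

def step(mu):
    """One application of the map mu -> mu P from (5)."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

mu = [1.0, 0.0, 0.0]                 # start from the Dirac mass delta_0
for _ in range(500):
    mu = step(mu)

# if mu_t converges, its limit must solve the stationary equation (6)
pi = mu
assert max(abs(a - b) for a, b in zip(pi, step(pi))) < 1e-9
assert abs(sum(pi) - 1.0) < 1e-9
```

Since this kernel is ergodic, the iterates converge and the limit solves (6) up to floating-point error.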
We are thus naturally led to investigate the question of existence and uniqueness of stationary
distributions. In that respect, the following definition will be useful.
Proof. Consider the empirical mean of the first t ≥ 1 instantaneous laws of the chain:
πt := (1/t) Σ_{s=0}^{t−1} µs ∈ P(X ). (8)
By compactness, we can extract from the sequence (πt)t≥1 a subsequence that admits a limit
π ∈ P(X ). The latter is necessarily stationary, because for each x ∈ X ,
(πt P)(x) − πt(x) = (µt(x) − µ0(x))/t −−−→ 0, as t → ∞.
This shows existence. Note that the stationary equation πP = π implies that the support of π
is closed under the accessibility relation x → y defining the diagram GP . In particular, if P is
irreducible, then any stationary law π has full support. Finally, suppose for a contradiction
that an irreducible kernel P admits two distinct stationary distributions π1 and π2 , and
consider the function m : [0, 1] → [−1, 1] defined as follows:
m(θ) := min_{x∈X} (π1(x) − θ π2(x)).
It is clear that m is continuous, with m(0) > 0 (because π1 has full support) and m(1) < 0
(because π1 ̸= π2 ). Thus, there is θ⋆ ∈ (0, 1) such that m(θ⋆ ) = 0. But then, the vector
π : x ↦ (π1(x) − θ⋆ π2(x)) / (1 − θ⋆), (9)
satisfies πP = π (because so do π1 , π2 ), has minimal entry equal to zero (by definition of
θ⋆ ) and has entry sum equal to 1 (because so do π1 , π2 ). Thus, π is a non-fully supported
stationary distribution for the irreducible kernel P , a contradiction.
Let us note that the more natural conclusion P(Xt = y) → π(y), which is stronger than (10),
requires extra assumptions, as the following counter-example shows.
Consider X = {0, 1} and the deterministic kernel P that swaps the two states (P(0, 1) =
P(1, 0) = 1), which is irreducible with stationary law π = Unif(X ). Since P^2 = I (the identity
matrix on X ), we have P^{2t} = I and P^{2t+1} = P for all t ∈ N. In particular, if we start from
the initial condition ν = δ0, then the sequence (µt)t≥0 keeps alternating between the two
Dirac masses δ0 and δ1, and mixing only occurs in the Cesàro-mean sense (10).
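This counter-example is easy to reproduce numerically; a minimal sketch:

```python
# The alternating example above, in code: X = {0, 1} and P swaps the two
# states, so mu_t oscillates between delta_0 and delta_1 while the Cesaro
# averages pi_t of (10) converge to the uniform stationary law.
P = [[0.0, 1.0],
     [1.0, 0.0]]

mu, cesaro, laws = [1.0, 0.0], [0.0, 0.0], []
T = 1000
for _ in range(T):
    laws.append(tuple(mu))
    cesaro = [c + m for c, m in zip(cesaro, mu)]
    mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]

pi_t = [c / T for c in cesaro]
assert laws[0] == (1.0, 0.0) and laws[1] == (0.0, 1.0)   # alternation
assert abs(pi_t[0] - 0.5) < 1e-9 and abs(pi_t[1] - 0.5) < 1e-9
```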
Definition 3 (Ergodicity). The transition kernel P is called ergodic if there exists t ∈ N such that P^t(x, y) > 0 for all x, y ∈ X .
The following fundamental result constitutes the starting point of our study. It asserts that
any ergodic Markov chain mixes: as the number of iterations grows, the chain approaches
the stationary distribution, regardless of the initial condition.
We will later see many proofs of this result, and numerous refinements. Here is a nice practical
application: in order to approximately sample from a sophisticated target distribution π, all
we have to do is to design an ergodic Markov chain whose stationary law is π, and let it run
for sufficiently long. This observation is at the basis of a number of technological revolutions,
such as Markov Chain Monte Carlo methods or Google's PageRank algorithm. It motivates
the following question, to which the present course is entirely devoted.
To answer this question, we first need to agree on a way to measure the distance to equilib-
rium, i.e., the discrepancy between the law µt at time t, and the stationary law π.
1.2 Distance to equilibrium
There are many natural ways to measure the distance between two probability measures µ
and ν on a finite set X : Hellinger distance for statisticians, Hilbert/Lp norms for analysts,
relative entropy for physicists and information theorists, etc. Each has its own flavor and its
specific list of advantages and drawbacks, making it more relevant to the study of certain
chains than others. Rather than competing with each other, these distances are complemen-
tary: they are related one to another by an array of inequalities, and combining different
viewpoints is often the best way to analyze a given Markov chain. We will here mainly focus
on the total-variation distance, which is more natural for probabilists.
This is clearly a distance on the set P(X ) of all probability measures on X , and we have
dtv (µ, ν) ≤ 1, with equality if and only if µ and ν have disjoint supports. Let us start by
collecting a list of useful alternative expressions.
Proof. The first identity follows from the observation that changing the set A to its complement changes the value µ(A) − ν(A) to its opposite. For the second, note that
µ(A) − ν(A) = Σ_{x∈A} (µ(x) − ν(x)) ≤ Σ_{x∈A} (µ(x) − ν(x))+ ≤ Σ_{x∈X} (µ(x) − ν(x))+,
for all A ⊆ X , and that those become equalities for A = {x ∈ X : µ(x) ≥ ν(x)}. The third
and fourth claims are obtained by taking a = µ(x) and b = ν(x) in the two identities
(a − b)+ = a − a ∧ b = (1/2)(|a − b| + a − b).
Finally, for the last inequality, simply note that for any f : X → [−1, 1],
|µf − νf| = |Σ_{x∈X} (µ(x) − ν(x)) f(x)| ≤ Σ_{x∈X} |µ(x) − ν(x)| = 2 dtv(µ, ν),
and that there is equality in the case f (x) = sign(µ(x) − ν(x)), with sign = 1R+ − 1R− .
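The expressions collected in Lemma 2 can be cross-checked on a small example; a minimal sketch, with two illustrative laws µ and ν (the numbers are assumptions):

```python
from itertools import combinations

# Cross-check of the expressions in Lemma 2 on two illustrative laws
# (assumptions) over a 4-point space: max over events, positive part,
# half L1 norm, and 1 minus the overlap all coincide.
mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]
X = range(4)

subsets = [A for r in range(5) for A in combinations(X, r)]
event_form = max(sum(mu[x] - nu[x] for x in A) for A in subsets)
plus_form  = sum(max(mu[x] - nu[x], 0.0) for x in X)
half_l1    = 0.5 * sum(abs(mu[x] - nu[x]) for x in X)
min_form   = 1.0 - sum(min(mu[x], nu[x]) for x in X)

for value in (plus_form, half_l1, min_form):
    assert abs(value - event_form) < 1e-12      # all equal to 0.4 here
```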
Let us now record a couple of basic but important properties of total variation distance.
Proof. The first claim follows from the observation that (µ, ν) 7→ µ(A) − ν(A) is trivially
convex for each A ⊆ X , and that any maximum of convex functions is convex. The second
follows from the last expression in Lemma 2, because if f ∈ [−1, 1]X , then so does P f .
Definition 5 (Distance to equilibrium). The distance to equilibrium associated with an
irreducible transition kernel P on X is the function DP : N → [0, 1] defined by
Remark 7 (Properties). Let us make four important comments about the function DP .
1. The first item in Lemma 3 ensures that the maximum over all distributions ν ∈
P(X ) in (14) can be reduced to a maximum over all extremal distributions (δx)x∈X :
DP(t) = max_{x∈X} dtv(P^t(x, ·), π).
3. The initial distance DP(0) is close to 1 if the state space X is large. Indeed,
DP(0) = 1 − π⋆, where π⋆ := min_{x∈X} π(x) ≤ 1/|X |.
Proof. Let Π denote the “idealized” transition kernel which mixes exactly in one step, i.e.
Thus, writing ∥A∥ := max_{x∈X} Σ_{y∈X} |A(x, y)|, we arrive at the important representation
for all t ≥ 1. The desired claim now follows from the sub-multiplicativity property
Lemma 4 asserts that the non-negative sequence (ut )t∈N defined by ut := 2D(t) is sub-
multiplicative, i.e. ut+s ≤ ut us for all t, s ∈ N. By Fekete’s Lemma, this implies
ut^{1/t} −−−→ inf{ us^{1/s} : s ≥ 1 }, as t → ∞. (18)
Note that the infimum is less than 1 if P is ergodic, because we then have DP (t) < 1/2 for
t large enough. This establishes the following notable refinement of Theorem 1.
Remark 8 (Spectral radius). In Chapter 3, we will show that the fundamental quantity
λ⋆(P) = inf{ (2DP(t))^{1/t} : t ≥ 1 } (19)
admits a simple and beautiful characterization in terms of the eigenvalues of P . For this
reason, λ⋆ (P ) is often called the spectral radius of the chain.
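The characterization (19) can be explored numerically: by sub-multiplicativity of ut = 2DP(t), the quantities ut^{1/t} decrease along the doubling sequence t, 2t, 4t and approach λ⋆(P). A minimal sketch, on an illustrative 3-state ergodic kernel (an assumption, not taken from the text):

```python
# Numerical illustration of (19): by sub-multiplicativity of u_t = 2 D_P(t),
# the quantities u_t^{1/t} decrease along t, 2t, 4t. The kernel is an
# illustrative assumption.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
n = len(P)

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pt = [row[:] for row in P]          # approximate the stationary law pi
for _ in range(300):
    Pt = mat_mul(Pt, P)
pi = Pt[0]

def D(t):
    """D_P(t) = max_x dtv(P^t(x, .), pi)."""
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        Q = mat_mul(Q, P)
    return max(0.5 * sum(abs(Q[x][y] - pi[y]) for y in range(n))
               for x in range(n))

rates = [(2.0 * D(t)) ** (1.0 / t) for t in (5, 10, 20)]
assert rates[1] <= rates[0] + 1e-9 and rates[2] <= rates[1] + 1e-9
```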
At first sight, Corollary 1 seems to bring a definitive answer to Question 1: the distance to
equilibrium DP (t) decays essentially like (λ⋆ (P ))t as t → ∞. It is tempting to deduce that
the chain is well mixed when the number of iterations is larger than the relaxation time
trel(P) := 1 / log(1/λ⋆(P)), (20)
defined so that λ⋆(P) = exp(−1/trel(P)) (our definition differs slightly from the classical one,
which we find less natural in discrete time). In fact, this intuition is wrong, and one often
[Figure 2: a window of length n sliding along an infinite sequence of independent fair coin tosses, e.g. 001100101100010110010111011...]
needs to wait much longer than the relaxation time before the chain even starts to mix. The
reason is that the approximation
DP(t) ≈ (λ⋆(P))^t = exp(−t/trel(P)) (21)
is only accurate up to a prefactor, which may itself be huge. This motivates the following
more robust notion: for a precision ε ∈ (0, 1), the mixing time of P at precision ε is
t^{(ε)}_mix(P) := min{ t ∈ N : DP(t) ≤ ε }.
The default precision is ε = 1/4, in which case we write tmix(P) instead of t^{(1/4)}_mix(P).
The value ε = 1/4 is standard, and sufficient in practice: any smaller precision can be
achieved by just increasing tmix (P ) by a universal factor. Indeed, Lemma 18 implies
∀ε ∈ (0, 1/4), tmix(P) ≤ t^{(ε)}_mix(P) ≤ tmix(P) ⌈log2(1/ε)⌉. (22)
but the latter can be terribly poor, as we shall see. Let us illustrate these concepts on a toy
example which is not exciting, but is simple enough to allow for explicit computations.
[Figure 3: the distance to equilibrium t ↦ DP(t), together with the mixing times t^{(ε)}_mix(P) ≤ t^{(ε′)}_mix(P) associated with two precision levels 1 > ε > ε′ > 0.]
which describes the content of a window of length n “sliding” along an infinite sequence
of independent fair coin tosses (see Figure 2). Clearly, P is ergodic with π = Unif(X ).
For any x = (x1 , . . . , xn ) ∈ X and t ≤ n, we have P t (x, ·) = Unif(Ax,t ), where
1.4 The cutoff phenomenon
From a practical point-of-view, estimating mixing times is particularly meaningful when
the number of states becomes large. Rather than studying a fixed Markov chain, we will
thus consider a sequence of transition kernels (Pn )n≥1 whose dimensions grow with n, and
investigate the order of magnitude of tmix (Pn ) as n → ∞. Recall that the particular choice
ε = 1/4 in the definition of tmix is irrelevant: for any fixed ε ∈ (0, 1/4),
t^{(ε)}_mix(Pn) = Θ(tmix(Pn)), (24)
where the notation an = Θ(bn ) means that the sequence (an /bn )n≥1 is bounded from above
and below by strictly positive constants. For many chains, we will see that an even stronger
insensitivity holds: the dependency in ε disappears completely, in the following sense.
Definition 7 (Cutoff). The sequence (Pn)n≥1 exhibits cutoff if
∀ε ∈ (0, 1), t^{(ε)}_mix(Pn) ∼ tmix(Pn), (25)
as n → ∞.
In words, the number of iterations required to even slightly mix (say, ε = 0.99) is asymp-
totically the same as that needed to completely mix (say, ε = 0.01), at least to first order.
Equivalently, the associated distance to equilibrium DPn undergoes a sharp phase transition,
dropping abruptly from 1 to 0 around some appropriate time-scale (tn )n≥1 , i.e.
∀α ∈ [0, ∞), DPn(⌊αtn⌋) −−−→ 1 if α < 1, and 0 if α > 1, as n → ∞. (26)
Note that this means that t^{(ε)}_mix(Pn) ∼ tn for all ε ∈ (0, 1), hence the equivalence with
Definition 7. For example, our computations for the sliding window of length n give
t^{(ε)}_mix(Pn) ∼ n, (27)
for any fixed ε ∈ (0, 1), providing our first instance of cutoff. Discovered in the 1980s in the
context of card shuffling, this remarkable phase transition has since then been established
for a broad variety of chains, from random walks on certain groups to various interacting
particle systems. However, the proofs of cutoff remain chain-specific, and finding a general
explanation constitutes the most important open problem in the field.
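The sliding-window computation above can be made concrete. A minimal sketch, assuming (as indicated by the description of P^t(x, ·)) that P^t(x, ·) is uniform over a set of 2^t configurations for t ≤ n, so that DPn(t) = 1 − 2^{t−n} for t ≤ n and 0 afterwards:

```python
# Cutoff for the sliding-window chain, assuming that P^t(x, .) is uniform
# over 2^t configurations for t <= n, so that D(t) = 1 - 2^(t - n) for
# t <= n and D(t) = 0 for t >= n (this closed form is an assumption
# consistent with the description of the chain).
def D(t, n):
    return max(0.0, 1.0 - 2.0 ** (t - n))

def t_mix(eps, n):
    """Smallest t with D(t) <= eps."""
    return next(t for t in range(2 * n + 1) if D(t, n) <= eps)

# for every fixed eps in (0, 1), t_mix^(eps)(P_n) / n -> 1: cutoff at time n
for eps in (0.99, 0.5, 0.01):
    ratio = t_mix(eps, 1000) / 1000
    assert abs(ratio - 1.0) < 0.02, (eps, ratio)
```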
Question 2. What is the exact mechanism underlying the cutoff phenomenon?
This condition (the product condition, namely trel(Pn) ≪ tmix(Pn)) is easy to verify in
practice, because it only involves a comparison of orders of magnitude, whereas Definition 7
requires determining the precise prefactor in front of mixing times. It is easy to see that
the product condition is necessary for cutoff:
Lemma 5 (No cutoff without the product condition). Any sequence of transition kernels
(Pn )n≥1 which exhibits cutoff must satisfy the product condition.
Proof. Suppose that (Pn)n≥1 exhibits cutoff, and fix ε ∈ (0, 1/2). We then have tmix(Pn) ∼
t^{(ε)}_mix(Pn) as n → ∞. On the other hand, the lower bound (23) implies
t^{(ε)}_mix(P) ≥ trel(P) log(1/(2ε)).
Combining those two estimates, we see that
lim sup_{n→∞} trel(Pn) / tmix(Pn) ≤ 1 / ln(1/(2ε)),
and the right-hand side can be made arbitrarily small by choosing ε small.
Unfortunately, the converse statement – which is the one that would have been useful in
practice – does not hold in general. Even worse, any sequence (Pn )n≥1 exhibiting cutoff can
be perturbed so as to produce a counter-example, as we now explain. Given an ergodic
transition kernel P with stationary law π on a finite state space X , define
Q := (1 − θ) P + θ Π, (28)
where Π is the rank-one transition kernel defined at (15), and θ ∈ (0, 1) is a parameter to be
adjusted later. Note that Q is ergodic, with stationary law π. The interpretation of Q is
simple: at each step, a biased coin with parameter θ is tossed: if it lands on tails, the next
state is chosen according to P ; otherwise, it is chosen according to the stationary law π.
Lemma 6 (Rank-one perturbations destroy cutoff). Let (Pn)n≥1 be a sequence of ergodic
transition kernels exhibiting cutoff, and choose (θn)n≥1 in (0, 1) so that
1/tmix(Pn) ≪ θn ≪ 1/trel(Pn).
Then, the sequence (Qn )n≥1 defined by the rank-one perturbation (28) satisfies
trel(Qn) ∼ trel(Pn), and t^{(ε)}_mix(Qn) ∼ (1/θn) log(1/ε).
In particular, (Qn )n≥1 satisfies the product condition, but fails to exhibit cutoff.
Proof. The impact of the rank-one perturbation (28) on the distance to equilibrium is easy
to determine: we have Q − Π = (1 − θ)(P − Π), so that (16) yields
trel(Q) = trel(P) / (1 − trel(P) log(1 − θ)).
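The algebra behind this proof is easy to verify numerically. A minimal sketch, assuming the explicit form Q = (1 − θ)P + θΠ of the rank-one perturbation (28), on an illustrative 3-state kernel (an assumption):

```python
# Check of the algebra behind Lemma 6, assuming Q = (1 - theta) P + theta Pi
# as the explicit form of (28); the 3-state kernel P is illustrative.
theta = 0.3
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
n = len(P)

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pt = [row[:] for row in P]          # approximate the stationary law pi
for _ in range(300):
    Pt = mat_mul(Pt, P)
pi = Pt[0]

Pi = [pi[:] for _ in range(n)]      # rank-one kernel Pi(x, y) = pi(y)
Q = [[(1 - theta) * P[x][y] + theta * Pi[x][y] for y in range(n)]
     for x in range(n)]

# pi is stationary for Q, and Q - Pi = (1 - theta)(P - Pi)
piQ = [sum(pi[x] * Q[x][y] for x in range(n)) for y in range(n)]
assert all(abs(piQ[y] - pi[y]) < 1e-9 for y in range(n))
assert all(abs((Q[x][y] - Pi[x][y]) - (1 - theta) * (P[x][y] - Pi[x][y])) < 1e-12
           for x in range(n) for y in range(n))
```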
Note that when n is large, the entries of the perturbed matrix Qn are extremely close to
those of Pn . Yet, (Pn )n≥1 satisfies cutoff, whereas (Qn )n≥1 does not: cutoff is a delicate and
sensitive phenomenon, which can not be captured by a basic criterion such as the product
condition. Nevertheless, chains that satisfy the product condition without exhibiting cutoff
are regarded as pathological by the community, and an informal conjecture states that the
product condition should correctly predict cutoff for all reasonable chains. Giving an honest
mathematical content to this vague claim constitutes a natural open problem, which can be
seen as a first step towards Question 2.
Question 3 (Reasonable?). For which chains does the product condition imply cutoff ?
Known answers include birth and death chains and random walks on trees. However, the
above construction shows that the product condition will incorrectly predict cutoff for many
chains, including certain random walks on groups as defined next.
1.5 Random walks on graphs and groups
Among the various chains that will be considered in this course, particular attention is
devoted to random walks on groups and graphs, which play an important role in many
modern applications, from card shuffling to page-rank algorithms or the exploration of
complex networks. Let us here briefly recall how these random walks are defined.
(ii) Every element x ∈ X admits an inverse x−1 ∈ X such that x ⋆ x−1 = x−1 ⋆ x = id.
Here are three classical examples of finite groups to which we shall come back later:
• The cyclic group (Zn, +) of integers modulo n, equipped with addition modulo n.
• The hypercube (Zn2, +) of binary vectors of length n, equipped with addition mod 2.
• The symmetric group (Sn, ◦) of permutations of {1, . . . , n}, equipped with composition.
Given a finite group (X , ⋆) and a probability distribution µ ∈ P(X ), let (Zt)t≥1 be i.i.d.
random variables with law µ, and consider the process X = (Xt)t≥0 defined by
Xt := Zt ⋆ · · · ⋆ Z1, (29)
with the usual convention that an empty product is the identity. Then X is a Markov chain
on X , called the random walk on (X , ⋆) with increment law µ. Its transition kernel is
P(x, y) = µ(y ⋆ x^{−1}).
This matrix is bi-stochastic, meaning that its transpose is also stochastic. In other words,
the uniform distribution π = Unif(X ) is stationary for P . Note that P is irreducible if
and only if the incremental support supp(µ) := {x ∈ X : µ(x) > 0} generates (X , ⋆), in
the sense that any group element x can be written as x = xt ⋆ · · · ⋆ x1 for some t ∈ N and
some x1 , . . . , xt ∈ supp(µ). A pleasant feature of random walks on groups is their intrinsic
symmetry. In particular, the choice of the initial state is irrelevant.
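These properties can be checked directly in code. A minimal sketch for the cyclic group (Zn, +), with an illustrative lazy ±1 increment law (an assumption) and the kernel P(x, y) = µ(y − x mod n):

```python
# Sketch: the kernel P(x, y) = mu(y - x mod n) of a random walk on the
# cyclic group (Z_n, +), with an illustrative lazy increment law mu
# (an assumption). P is bi-stochastic and supp(mu) generates the group.
n = 6
mu = {0: 0.5, 1: 0.25, n - 1: 0.25}

P = [[mu.get((y - x) % n, 0.0) for y in range(n)] for x in range(n)]

# stochastic rows and stochastic columns: Unif(X) is stationary
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
assert all(abs(sum(P[x][y] for x in range(n)) - 1.0) < 1e-12 for y in range(n))

# supp(mu) generates (Z_n, +), hence P is irreducible
reached, frontier = {0}, [0]
while frontier:
    x = frontier.pop()
    for s in mu:
        y = (x + s) % n
        if y not in reached:
            reached.add(y)
            frontier.append(y)
assert reached == set(range(n))
```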
Lemma 7 (Symmetry). For random walks on groups, we have for all x ∈ X and t ∈ N,
dtv(P^t(x, ·), π) = dtv(P^t(id, ·), π).
Proof. By induction over t ∈ N, we have P^t(x, y) = P^t(id, y ⋆ x^{−1}) for all x, y ∈ X . Thus,
dtv(P^t(x, ·), π) = (1/2) Σ_{y∈X} |P^t(id, y ⋆ x^{−1}) − 1/|X ||
= (1/2) Σ_{y∈X} |P^t(id, y) − 1/|X ||
= dtv(P^t(id, ·), π),
where the middle equality uses the change of variables y ↦ y ⋆ x.
Two vertices x, y ∈ X such that {x, y} ∈ E are said to be neighbors, and the number of
neighbors of x is called its degree, denoted by deg(x).
2 Probabilistic techniques
In this chapter, we introduce two simple but powerful probabilistic tools to estimate mixing
times: distinguishing statistics (to obtain lower bounds), and couplings (to obtain upper
bounds). These important techniques are then applied to obtain sharp estimates for the
random walk on the cycle and the Ehrenfest urn model.
2.1 Distinguishing statistics
It readily follows from this definition that any choice of an initial state x ∈ X and a target
event A ⊆ X provides a lower bound on the distance to equilibrium. This trivial observation
will be used so often that it deserves a lemma for better visibility.
Lemma 8. The inequality DP(t) ≥ |P^t(x, A) − π(A)| holds for any time t ≥ 0, any initial state x ∈ X and any event A ⊆ X .
To get a good bound, the pair (x, A) should of course be chosen so that P t (x, A) is abnormally
large or small compared to the equilibrium value π(A). In practice, good candidates for
(x, A) are easily guessed, but estimating P t (x, A)−π(A) can be difficult. A simple alternative
consists in computing the first and second moment of an observable f : X → R that behaves
very abnormally when the chain is far from equilibrium. This formalizes as follows.
Lemma 9 (Distinguishing statistics). For any observable f : X → R, we have
dtv(µ, ν) ≥ δ² / (δ² + σ²),
where δ := |µf − νf| and σ² := 2 Varµ(f) + 2 Varν(f).
Proof. Since the right-hand side is invariant under translating f by a constant, we may
assume that µf + νf = 0, so that (µf)² = (νf)² = δ²/4. By Cauchy-Schwarz, we have
δ² = ( Σ_{x∈X} (µ(x) − ν(x)) f(x) )² ≤ ( Σ_{x∈X} (µ(x) + ν(x)) f²(x) ) ( Σ_{x∈X} (µ(x) − ν(x))² / (µ(x) + ν(x)) ).
The first sum is exactly equal to (σ 2 + δ 2 )/2, while the second is at most 2dtv (µ, ν) by the
crude bound |µ(x) − ν(x)| ≤ µ(x) + ν(x). Rearranging yields the claim.
Remark 9 (Concentration). The above bound says that µ and ν are far apart if the ratio
σ²/δ² is small. The intuition is as follows: under the measure µ, the observable f typically
takes values in Iµ := [µ(f) − 10√Varµ(f), µ(f) + 10√Varµ(f)] by Chebyshev's
inequality, and similarly for ν. When σ²/δ² is small, the two intervals Iµ and Iν are
disjoint, so A := f^{−1}(Iµ) forms a distinguishing event showing that dtv(µ, ν) is large.
Remark 10 (Complex values). The proof readily extends to the case of a complex-valued
observable f : X → C, with Varµ (f ) := µ (|f − µf |2 ) = µ|f |2 − |µf |2 .
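A quick numerical sanity check of Lemma 9, on two illustrative laws (the numbers and the observable are assumptions):

```python
# Sanity check of Lemma 9 on two illustrative laws (assumptions) over
# X = {0, 1, 2, 3}, with the observable f(x) = x: the lower bound
# delta^2 / (delta^2 + sigma^2) never exceeds the true distance.
mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]
f  = [0.0, 1.0, 2.0, 3.0]

def mean(p):
    return sum(pi * fi for pi, fi in zip(p, f))

def var(p):
    return sum(pi * fi * fi for pi, fi in zip(p, f)) - mean(p) ** 2

delta2 = (mean(mu) - mean(nu)) ** 2          # delta^2 = 1.0 here
sigma2 = 2.0 * var(mu) + 2.0 * var(nu)       # sigma^2 = 4.0 here
bound  = delta2 / (delta2 + sigma2)          # lower bound 0.2
exact  = 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))   # dtv = 0.4
assert bound <= exact
```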
We will illustrate the strength of those generic lower bounds on various concrete Markov
chains once we have a complementary technique to obtain matching upper bounds.
2.2 Couplings
In probability theory, coupling is a very general technique which allows one to compare two
given distributions µ and ν by constructing a pair of random variables (X, Y ) whose marginal
distributions are µ and ν. Every such pair provides a particular relation between µ and ν,
and the whole idea is to choose a relation that sheds useful light on µ and ν.
Of course, we can always take X and Y to be independent with respective marginals µ and
ν, but this is usually not the most interesting choice: the above definition allows for X and
Y to be entangled in an arbitrary way, and this degree of freedom can often be exploited to
obtain non-trivial information about the pair (µ, ν). The following lemma is a simple but
fruitful illustration of this general philosophy.
Lemma 10 (Coupling inequality). We have dtv(µ, ν) ≤ P(X ≠ Y) for any coupling (X, Y) of µ and ν. Moreover, there is a coupling which achieves equality.
Proof. If (X, Y) is a coupling of µ and ν, then for any set A ⊆ X we can write
µ(A) − ν(A) = P(X ∈ A) − P(Y ∈ A) ≤ P(X ∈ A, X ≠ Y) ≤ P(X ≠ Y).
Taking a maximum over all A ⊆ X establishes the first claim. Conversely, let us construct a
coupling (X, Y ) which achieves equality. The extreme cases dtv (µ, ν) = 0 and dtv (µ, ν) = 1
are trivial, so we leave them aside. By Lemma 2, we have dtv (µ, ν) = 1 − p, where
p := Σ_{x∈X} µ(x) ∧ ν(x).
We thus want to construct a coupling (X, Y ) such that P(X = Y ) = p. To do so, we define
(X, Y) := (Z, Z) if B = 1, and (X̂, Ŷ) if B = 0,
Remark 11 (Variational formula for total variation distance). Lemma 10 provides yet
another definition of total variation distance, to be added to the list of Lemma 2:
dtv(µ, ν) = min{ P(X ≠ Y) : (X, Y) coupling of µ and ν }.
A considerable advantage of this new expression is the fact that it involves a minimum:
any coupling (X, Y ) of µ, ν provides an upper bound on dtv (µ, ν). The more likely
the event {X = Y } is, the better the bound will be, and the existence of a coupling
achieving equality guarantees that this strategy has no intrinsic limitation. In practice
however, estimating P(X = Y ) can become difficult if the coupling is too sophisticated,
and devising couplings that are both efficient and tractable is a delicate art.
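The coupling achieving equality in Lemma 10 can be sampled explicitly, following the overlap/residual construction from the proof. A minimal sketch (the laws µ, ν are illustrative assumptions):

```python
import random

# A sampler for the optimal coupling of Lemma 10: with probability
# p = sum_x mu(x) /\ nu(x), draw a common value from the normalized
# overlap; otherwise draw X and Y from the two normalized residuals,
# whose supports are disjoint. The laws mu, nu are illustrative.
def maximal_coupling(mu, nu, rng):
    p = sum(min(m, v) for m, v in zip(mu, nu))
    if rng.random() < p:                                  # B = 1: X = Y
        z = rng.choices(range(len(mu)),
                        [min(m, v) for m, v in zip(mu, nu)])[0]
        return z, z
    wx = [max(m - v, 0.0) for m, v in zip(mu, nu)]        # B = 0: X != Y
    wy = [max(v - m, 0.0) for m, v in zip(mu, nu)]
    return (rng.choices(range(len(mu)), wx)[0],
            rng.choices(range(len(nu)), wy)[0])

mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]
rng = random.Random(0)
samples = [maximal_coupling(mu, nu, rng) for _ in range(100000)]
freq_neq = sum(x != y for x, y in samples) / len(samples)
dtv = 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))
assert abs(freq_neq - dtv) < 0.01    # P(X != Y) matches dtv(mu, nu) = 0.4
```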
In order to use Lemma 10 to estimate the mixing time of an ergodic kernel P , we need to
construct, for an appropriate time t ∈ N and for every initial state x ∈ X , a coupling between
P t (x, ·) and π which puts as much mass as possible on the diagonal set {(y, y) : y ∈ X }.
This strategy may seem difficult to implement at first sight, but we will now make a couple
of important observations that will considerably facilitate our task.
Choosing µ = P t (x, ·) and then maximizing over all x ∈ X leads to the claimed upper
bound. For the lower bound, we simply invoke the triangle inequality
dtv P t (x, ·), P t (y, ·) ≤ dtv P t (x, ·), π + dtv P t (y, ·), π ,
We are thus naturally led to the problem of coupling P t (x, ·) and P t (y, ·), for arbitrary
x, y ∈ X and t ∈ N. Recall that the instantaneous measures P t (x, ·) and P t (y, ·) were
actually constructed sequentially, by iterating t times the map µ 7→ µP , starting from the
initial measures δx and δy , respectively. In light of this sequential structure, the most natural
way to couple P t (x, ·) and P t (y, ·) is to actually couple the entire chains, i.e. to produce a
random pair (X, Y) such that X ∼ MC(X , P, δx ) and Y ∼ MC(X , P, δy ). For each t ∈ N,
the random pair (Xt, Yt) is then a coupling of P^t(x, ·) and P^t(y, ·), so we have
dtv(P^t(x, ·), P^t(y, ·)) ≤ P(Xt ≠ Yt).
Now, a very simple way to produce such a trajectorial coupling, simultaneously for all initial
pairs (x, y) ∈ X 2 , is to use what is known as a coupling kernel.
Any Markov chain (X, Y) on X × X whose transition kernel is of this form is called a
Markovian coupling for P , and the associated coalescence time is defined as
T := inf {t ≥ 0 : Xt = Yt } ,
With this terminology at hand, we can now state and prove the main result of this section,
which asserts that the convergence to equilibrium of a Markov chain is fast if trajectories
from different starting states can be coupled so as to meet quickly.
Theorem 2 (Coalescence and mixing). Consider an arbitrary coupling kernel Q for P ,
and let T denote the associated coalescence time. Then,
∀t ∈ N, DP(t) ≤ max_{(x,y)∈X²} P(x,y)(T > t),
where the notation P(x,y) indicates that the initial state is (x, y).
Proof. Let (X, Y) denote a Markov chain on X² with transition kernel Q. The fact that Q
is a coupling kernel ensures that for each t ≥ 1, the conditional laws of Xt and Yt given the
past (X0, . . . , Xt−1, Y0, . . . , Yt−1) are P(Xt−1, ·) and P(Yt−1, ·), respectively. In particular, X and
Y are Markov chains on X with transition kernel P . Consequently, Lemma 10 yields
dtv(P^t(x, ·), P^t(y, ·)) ≤ P(x,y)(Xt ≠ Yt), (32)
for every t ∈ N and every (x, y) ∈ X². Now, let us make the additional assumption that the
diagonal set ∆ := {(z, z) : z ∈ X } is absorbing for the coupling kernel Q, in the sense that
∀x ∈ X , Σ_{y∈X} Q((x, x), (y, y)) = 1. (33)
Then almost-surely, the trajectories X and Y coincide forever after coalescing, hence
∀t ∈ N, P(x,y)(Xt ≠ Yt) = P(x,y)(T > t).
Inserting this into (32) and then taking the maximum over all pairs (x, y) ∈ X² establishes
the claim (recall Lemma 11). Finally, note that if Q fails to satisfy the condition (33), then
we can always replace it with the new coupling kernel
′ ′
Q ((x, y), (x , y ))
if x ̸= y
Qe ((x, y), (x′ , y ′ )) := P (x, x′ ) if x = y and x′ = y ′
if x = y and x′ ̸= y ′ .
0
Since Q̃ is a coupling kernel for P which satisfies (33), we obtain
∀t ≥ 0, DP(t) ≤ max_{(x,y)∈X²} P(x,y)(T̃ > t),
Figure 4: The n−cycle with n = 15.
Remark 12 (Product kernel). The above result provides a simple proof of the conver-
gence to equilibrium of ergodic Markov chains (Theorem 1). Indeed, the product kernel
2.4 Application: random walk on the cycle
This is the transition kernel of the lazy random walk on the n−cycle graph. It is also a
random walk on the cyclic group (Z/nZ, +), with increment law µ = (1/2)δ0 + (1/4)δ1 + (1/4)δ−1. We
will show that the mixing time of this random walk grows quadratically with n.
Proposition 1 (Mixing time of the n−cycle). For lazy random walk on the n−cycle,
n²/32 ≤ tmix(P) ≤ n².
Upper bound: first attempt. The most natural way to construct a chain X with tran-
sition kernel P starting from a given state x ∈ X is to set X0 := x and for all t ≥ 1,
Xt := Xt−1 + ξt mod n,
where (ξt)t≥1 are i.i.d. with P(ξ1 = 0) = 1/2 and P(ξ1 = 1) = P(ξ1 = −1) = 1/4. In light of this,
a naive way to couple X with a chain Y starting from another state y ∈ X consists in using
the same increments for both chains, i.e. setting Y0 := y and for all t ≥ 1,
It is then clear that (X, Y) is a Markovian coupling for P , with coupling kernel
Q((x, y), (x′, y′)) = 1/2 if (x′, y′) = (x, y); 1/4 if (x′, y′) ∈ {(x + 1, y + 1), (x − 1, y − 1)}; 0 else.
Unfortunately, this coupling is not smart at all: the difference Xt − Yt is preserved at each
step, so the coalescence time T is a.-s. infinite, and Theorem 2 only yields DP (t) ≤ 1!
Upper bound: second attempt. In order to “favor encounter”, one could try to let the
two trajectories move in opposite directions, i.e. replace (34) with
Yt := Yt−1 − ξt mod n.
Note that this remains a valid coupling, because the increment sequence (−ξt )t≥1 has the
same law as (ξt )t≥1 . The corresponding coupling kernel is then
Q((x, y), (x′, y′)) = 1/2 if (x′, y′) = (x, y); 1/4 if (x′, y′) ∈ {(x + 1, y − 1), (x − 1, y + 1)}; 0 else.
Unfortunately, this choice is also problematic: when n is even and x − y is odd, the difference
Xt − Yt remains odd at each iteration, so that T = ∞ a.-s. again!
Upper bound: third attempt. A solution to this parity issue consists in letting only
one of the two coordinates jump at each step, as dictated by the following coupling kernel:
Q((x, y), (x′, y′)) = 1/4 if |x′ − x| + |y − y′| = 1; 0 else.
The sequence of differences (Xt − Yt )t≥0 is then a simple random walk on Z/nZ starting
from x − y, and the coalescence time T is precisely the time it takes for this walk to hit 0.
Equivalently, T is the hitting time of the set {0, n} by a simple random walk (Wt )t≥0 on Z
starting from W0 = |x − y|. The expectation of T is easily computed, for example by Doob’s
optional stopping theorem applied to the martingales (Wt )t≥0 and (Wt2 − t)t≥0 :
E[WT] = |x − y| and E[WT²] = E[T] + |x − y|².
Since the random variable WT is {0, n}−valued, we have WT² = nWT, and it follows that
E[T] = n|x − y| − |x − y|² = (n − |x − y|) |x − y| ≤ n²/4.
Applying Theorem 2 and Markov's inequality, we conclude that DP(t) ≤ n²/(4t), or equivalently,
t^{(ε)}_mix(P) ≤ n²/(4ε).
This proves the upper bound in Proposition 1. We now turn to the lower bound.
Lower bound. Intuitively, when the number t of iterations is too small, the random walk
Xt is confined in a small interval around its starting point, and hence can not be mixed. To
formalize this, we use Lemma 8 with the starting state x = 0 and the distinguishing event
A = [n/4, 3n/4] ∩ N. We have π(A) = |A|/|X | ≥ 1/2, and by Chebyshev's inequality,
P^t(x, A) ≤ P(|ξ1 + · · · + ξt| ≥ n/4) ≤ 8t/n²,
where we have used the fact that ξ1 , . . . , ξt are i.i.d. with mean 0 and variance 1/2. For
t < n2 /32 the right-hand side is less than 1/4, and Lemma 8 yields DP (t) > 1/4, as desired.
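The optional-stopping computation for the third coupling can be checked by simulation. A minimal sketch, with illustrative parameters:

```python
import random

# Simulation of the third coupling: at each step exactly one of the two
# walkers moves by +/-1 (each of the four options with probability 1/4).
# The empirical mean of the coalescence time T should match the
# optional-stopping value E[T] = (n - |x - y|) |x - y|. Parameters are
# illustrative.
rng = random.Random(1)
n, x0, y0, runs = 10, 0, 3, 20000
total = 0
for _ in range(runs):
    x, y, t = x0, y0, 0
    while x != y:
        if rng.random() < 0.5:
            x = (x + rng.choice((-1, 1))) % n
        else:
            y = (y + rng.choice((-1, 1))) % n
        t += 1
    total += t

k = abs(x0 - y0)
expected = (n - k) * k                 # = 21 for these parameters
assert abs(total / runs - expected) < 2.0
```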
2.5 Application: Ehrenfest model
Introduced by Tatiana and Paul Ehrenfest to explain the second law of thermodynamics, the
Ehrenfest urn model is a very simple model for the diffusion of gas molecules. Consider n labelled
particles evolving among two neighboring containers as follows: at each time step, a particle
is chosen uniformly at random and jumps from its current container to the other one. If one
starts with, say, all particles in one container, how long will it take for the gas to equilibrate?
We can describe the state of the system by a binary vector x = (x1 , . . . , xn ), where the
variable xi ∈ {0, 1} indicates the container in which the i−th particle currently lies. The
above diffusion is then a Markov chain with state space X := {0, 1}n and transition kernel
$$\widetilde{P}(x,y) \;:=\; \begin{cases} \frac{1}{n} & \text{if } x, y \text{ differ by exactly one coordinate} \\ 0 & \text{else.} \end{cases}$$
This is the transition kernel of simple random walk on a well-known graph, namely the $n$−dimensional hypercube. Alternatively, $\widetilde{P}$ is the transition kernel of the random walk on the binary group $\mathbb{F}_2^n$ with increments being uniform on the set of vectors having exactly one non-zero coordinate. To avoid periodicity issues, we consider the lazy kernel $P = (I + \widetilde{P})/2$.
Proposition 2 (Mixing time of the lazy Ehrenfest urn). For any $\varepsilon \in (0,1)$, we have
$$\frac{1-o(1)}{2}\, n\log n \;\le\; t_{\mathrm{mix}}^{(\varepsilon)}(P_n) \;\le\; (1+o(1))\, n\log n.$$
Proof. Starting from $X_0 = x$, we can generate $X \sim \mathrm{MC}(\mathcal{X}, P, \delta_x)$ by setting, for $t \ge 1$,
$$X_t := F(X_{t-1}, I_t, B_t),$$
where $F(x, i, b) := (x_1, \ldots, x_{i-1}, b, x_{i+1}, \ldots, x_n)$, and where the random variables $I_t, B_t$, $t \ge 1$ are independent with $B_t \sim \mathrm{Bernoulli}(1/2)$ and $I_t \sim \mathrm{Unif}(\{1,\ldots,n\})$. Given another initial state $y \in \mathcal{X}$, one can then naturally couple $Y \sim \mathrm{MC}(\mathcal{X}, P, \delta_y)$ with $X$ by using the same update variables $(I_t, B_t)_{t\ge 1}$ for both trajectories, i.e. by setting $Y_0 = y$ and, for $t \ge 1$,
$$Y_t := F(Y_{t-1}, I_t, B_t).$$
The pair (X, Y) is then a Markovian coupling for P . By construction, at any time t ≥ 0,
the two random vectors Xt and Yt agree on the set of coordinates {I1 , . . . , It }. Consequently,
the coalescence time $T$ is at most the time it takes for all coordinates to have been hit. (In fact, there is even equality when $x$ and $y$ are antipodal.) Consequently, we have
$$D_P(t) \;\le\; \mathbb{P}\left(\{I_1,\ldots,I_t\} \ne \{1,\ldots,n\}\right) \;\le\; \sum_{i=1}^{n}\mathbb{P}\left(i\notin\{I_1,\ldots,I_t\}\right) \;=\; n\left(1-\frac{1}{n}\right)^t \;\le\; n\,e^{-t/n},$$
where we have successively used the union bound, the fact that $I_1,\ldots,I_t$ are independent and uniform on $\{1,\ldots,n\}$, and the convexity inequality $1+u \le e^u$, valid for all $u\in\mathbb{R}$.
The upper bound readily follows from this. For the lower bound, we apply the method of
distinguishing statistics (Lemma 9) to the observable
$$f : x \mapsto x_1 + \cdots + x_n,$$
which counts the number of coordinates equal to 1. Under the equilibrium law $\pi$, the coordinates are independent Bernoulli variables with parameter 1/2, so we have
$$\pi f \;=\; \frac{n}{2}, \qquad \mathrm{Var}_\pi(f) \;=\; \frac{n}{4}.$$
On the other hand, after $t$ iterations starting from $x = (0,\ldots,0)$, the coordinates are easily seen to be negatively correlated Bernoulli variables with parameter $\frac{1}{2}\left(1 - \left(1-\frac{1}{n}\right)^t\right)$, so
$$\mu f \;=\; \frac{n}{2}\left(1 - \left(1-\frac{1}{n}\right)^t\right), \qquad \mathrm{Var}_\mu(f) \;\le\; \frac{n}{4},$$
where $\mu = P^t(0,\cdot)$. Thus, Lemma 9 yields
$$D_P(t) \;\ge\; 1 - \frac{1}{1 + \frac{n}{4}\left(1-\frac{1}{n}\right)^{2t}}.$$
Taking $t = \frac{1-\delta}{2}\, n\log n$ with $\delta \in (0,1)$ fixed makes the right-hand side tend to 1, which proves the lower bound.
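The coupling argument above reduces the coalescence time to a coupon-collector problem: coalescence is complete once every coordinate index has been drawn. A quick simulation (the helper `coalescence_time` and all parameters are our illustrative choices) confirms the $n \log n$ scale:

```python
import random

def coalescence_time(n, rng):
    """First time every coordinate in {0,...,n-1} has been drawn at least once:
    an upper bound on the coalescence time of the coupling (equality for
    antipodal starting states)."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n = 32
times = [coalescence_time(n, rng) for _ in range(2000)]
mean = sum(times) / len(times)
# coupon-collector: the mean is n * H_n ~ n log n
harmonic = sum(1 / k for k in range(1, n + 1))
assert n * harmonic * 0.9 < mean < n * harmonic * 1.1
assert min(times) >= n
```

The empirical mean matches $n H_n \approx n\log n$ closely, in line with the upper bound of Proposition 2.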
Figure 5: Eigenvalues (in red) and spectral radius (in blue) of a typical transition kernel.
3 Spectral techniques
This chapter investigates the eigenvalues and eigenvectors of transition kernels, and their
relation to mixing times. This relation is particularly deep for reversible chains, to which we
devote a central attention. To illustrate the strength of spectral techniques, we revisit two
models that were introduced in the previous chapter and obtain considerably refined results
on their convergence to equilibrium: we establish a limiting profile for random walk on the
cycle, and we prove the cutoff phenomenon for random walk on the hypercube.
3.1 Spectral radius
We have seen earlier (Corollary 1) that the convergence to equilibrium of Markov chains
occurs exponentially fast: more precisely, for any ergodic kernel P , the limit
$$\lambda_\star(P) \;:=\; \lim_{t\to\infty}\left(D_P(t)\right)^{1/t}$$
exists and is strictly less than 1. We will now see that this quantity admits a remarkable
spectral characterization. Let us first recall some terminology. An eigenvalue of P is a root
of the characteristic polynomial λ 7→ det(P −λI), i.e. a number λ ∈ C such that the equation
$$P f \;=\; \lambda f \tag{35}$$
admits a non-zero solution $f : \mathcal{X} \to \mathbb{C}$. The set of all eigenvalues of $P$ is called its spectrum, denoted $\mathrm{Sp}(P)$. Two basic observations:

(i) $1 \in \mathrm{Sp}(P)$.

(ii) Every $\lambda \in \mathrm{Sp}(P)$ satisfies $|\lambda| \le 1$.
Proof. The constant function f = 1 solves the harmonic equation P f = f , proving the first
claim. The second follows from the fact that P contracts the ∥·∥∞ norm: for any f : X → R,
∥P f ∥∞ ≤ ∥f ∥∞ .
We now turn our attention to eigenfunctions.

Lemma 13 (Eigenpairs). Let $(\lambda, f)$ be an eigenpair of $P$.

(i) If $\lambda \ne 1$, then $\pi f = 0$.

(ii) If $\lambda = 1$, then $f$ is constant.
Proof. Multiplying both sides of the identity P f = λf by π yields πf = λπf , proving the
first claim. Now, assuming that f = P f , let us prove that f is constant. Upon replacing f
by its real and imaginary parts if necessary, we may assume that f is real-valued. Denoting
by A := argminf the set of minimizers of f , we wish to prove that A = X . Fix an arbitrary
x ∈ A, and suppose for a contradiction that there is y ∈ X \ A. By irreducibility, we can
find $t \ge 0$ such that $P^t(x,y) > 0$. Evaluating the relation $P^t f = f$ at $x$ yields
$$\sum_{z\in\mathcal{X}} P^t(x,z)\left(f(z) - f(x)\right) \;=\; 0.$$
Since x ∈ A, each term in this sum is non-negative, so all terms must actually be zero. This
is a contradiction, because P t (x, y) > 0 and f (y) > f (x).
We are now ready to provide a spectral characterization of the asymptotic decay rate λ⋆ (P ).
Thus, the claim boils down to the identity $\rho(P - \Pi) = \lambda_\star(P)$. We will actually show that
$$\mathrm{Sp}(P - \Pi) \;=\; \left(\mathrm{Sp}(P)\setminus\{1\}\right) \cup \{0\},$$
which is more than enough. Let us first prove the inclusion $\supseteq$. Clearly, $(0, \mathbf{1})$ is an eigenpair
of P − Π, so 0 ∈ Sp(P − Π). On the other hand, if (λ, f ) is an eigenpair of P with λ ̸= 1,
then Lemma 13 forces πf = 0, i.e. Πf = 0. Thus, (λ, f ) is also an eigenpair of P − Π.
Conversely, if (λ, f ) is an eigenpair of P −Π with λ ̸= 0, then the identity (P −Π)f = λf can
be left-multiplied by Π to obtain Πf = 0, so that (λ, f ) is also an eigenpair of P . Moreover,
we have λ ̸= 1, as otherwise Lemma 13 would imply that f is constant equal to πf = 0.
3.2 Diagonalization of reversible kernels
Let P be an irreducible transition kernel on X , with stationary law π. Consider the (com-
plex) Hilbert space H = L2C (X , π) of all functions f : X → C, with scalar product
$$\langle f, g\rangle \;:=\; \sum_{x\in\mathcal{X}} \pi(x)\, f(x)\,\overline{g(x)}.$$
The adjoint of $P$ with respect to this scalar product is the kernel $P^\star$ given by
$$\forall (x,y)\in\mathcal{X}^2, \qquad P^\star(x,y) \;=\; \frac{\pi(y)\,P(y,x)}{\pi(x)}.$$
Note that P ⋆ is again a transition kernel on X , which is irreducible and with stationary
law π. Note also that P ⋆⋆ = P . The duality P ↔ P ⋆ can be interpreted as time reversal: if
X ∼ MC(X , P, π) and X⋆ ∼ MC(X , P ⋆ , π), then it is easy to check that for all t ≥ 0,
$$\left(X_0^\star, \ldots, X_t^\star\right) \;\stackrel{d}{=}\; \left(X_t, \ldots, X_0\right).$$
The kernel $P$ is called reversible (with respect to $\pi$) when $P^\star = P$, i.e. when
$$\forall (x,y)\in\mathcal{X}^2, \qquad \pi(x)\,P(x,y) \;=\; \pi(y)\,P(y,x).$$
The above equation, called detailed balance, is satisfied by all random walks on undirected
graphs, as well as many other interesting Markov chains. It is much stronger than the
stationarity property πP = π (which can be recovered by summing over all x ∈ X ), and has
remarkable consequences on the mixing properties of the associated Markov chain. Indeed,
the spectral theorem for self-adjoint operators can then be applied to guarantee the following.
(i) The eigenvalues of $P$ are real, and can be ordered as
$$1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N \ge -1,$$
where $N := |\mathcal{X}|$.
(ii) There is an orthonormal basis (ϕ1 , . . . , ϕN ) of eigenfunctions of P : for all 1 ≤ i ̸= j ≤ N ,
P ϕi = λi ϕi , ∥ϕi ∥ = 1, ⟨ϕi , ϕj ⟩ = 0.
Note that, with these notations, we have λ⋆ (P ) = max{λ2 , −λN }. We will always choose
ϕ1 = 1 (this is indeed a unit eigenfunction associated with the eigenvalue λ1 = 1). Such a
spectral decomposition provides an explicit expression for the distribution of the chain.
Proof. Any function $f : \mathcal{X} \to \mathbb{C}$ can be decomposed over the orthonormal basis $(\phi_1, \ldots, \phi_N)$:
$$f \;=\; \sum_{i=1}^{N} \langle f, \phi_i\rangle\, \phi_i.$$
Remark 13 (Exponential mixing). The sum in Lemma 14 clearly behaves like $\lambda_\star^t(P)$ as $t \to \infty$, thereby providing yet another proof of the convergence to equilibrium (Theorem 1) and of its geometric refinement (Corollary 1), albeit for reversible chains.
In order to use the exact expression given in Lemma 14, we need to have explicit access
to the eigenvalues and eigenfunctions of P , which is not often the case. Fortunately, the
expression can be bounded by a function of λ⋆ (P ) only, yielding the following simple and
general estimate on the mixing time of a reversible chain.
Theorem 4 (Mixing times of reversible chains). If $P$ is reversible, then for all $\varepsilon \in (0,1)$,
$$t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\le\; t_{\mathrm{rel}}(P)\, \log\left(\frac{1}{2\varepsilon\sqrt{\pi_\star}}\right),$$
where $\pi_\star := \min_{x\in\mathcal{X}} \pi(x)$.
Proof. Fix $t \in \mathbb{N}$ and $x \in \mathcal{X}$. By Lemma 14, the function $y \mapsto \frac{P^t(x,y)}{\pi(y)} - 1$ has squared norm
$$\begin{aligned}
\left\|\frac{P^t(x,\cdot)}{\pi} - 1\right\|^2 \;&=\; \sum_{i=2}^{N} \lambda_i^{2t}\,|\phi_i(x)|^2 \\
&\le\; \lambda_\star^{2t}(P) \sum_{i=2}^{N} |\phi_i(x)|^2 \\
&=\; \lambda_\star^{2t}(P)\left(\frac{1}{\pi(x)} - 1\right) \\
&\le\; \frac{\lambda_\star^{2t}(P)}{\pi(x)},
\end{aligned}$$
where the third line is obtained by setting $t = 0$ and $y = x$ in Lemma 14. On the other
hand, for any probability measure $\mu \in \mathcal{P}(\mathcal{X})$, the Cauchy-Schwarz inequality gives
$$d_{\mathrm{tv}}(\mu, \pi) \;=\; \frac{1}{2}\sum_{x\in\mathcal{X}} \pi(x)\left|\frac{\mu(x)}{\pi(x)} - 1\right| \;\le\; \frac{1}{2}\left\|\frac{\mu}{\pi} - 1\right\|. \tag{36}$$
Choosing $\mu = P^t(x,\cdot)$ and combining this with the previous estimate, we conclude that
$$D_P(t) \;\le\; \frac{\lambda_\star^t(P)}{2\sqrt{\pi_\star}},$$
and the claim follows.
Remark 14 (L2 bound). The Cauchy-Schwarz inequality (36) plays a decisive role in the
proof, because it connects the probabilistic quantity of interest (total-variation distance)
to a much more tractable analytic quantity (Hilbert norm).
Remark 15 (Relaxation time vs mixing time). Combining this result with the lower bound (23) (which does not require reversibility), we obtain
$$t_{\mathrm{rel}}(P)\,\log\frac{1}{2\varepsilon} \;\le\; t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\le\; t_{\mathrm{rel}}(P)\,\log\frac{1}{2\varepsilon\sqrt{\pi_\star}}.$$
Thus, for reversible chains, the relaxation time provides an approximation of the mixing time that is precise up to a factor which is only logarithmic in the "size" $\frac{1}{\pi_\star}$. Note that this would not be true without reversibility, as Example 2 shows.
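The final estimate of Theorem 4's proof, $D_P(t) \le \lambda_\star^t(P)/(2\sqrt{\pi_\star})$, is easy to test on a two-state chain, where everything is explicit. In the sketch below the rates `a`, `b` are arbitrary illustrative choices, and the closed form for $P^t(0,0)$ is the classical two-state formula:

```python
import math

a, b = 0.3, 0.1                    # illustrative transition rates (our choice)
pi0, pi1 = b / (a + b), a / (a + b)
lam = 1 - a - b                    # second eigenvalue of P = [[1-a, a], [b, 1-b]]
pi_min = min(pi0, pi1)

def dist_to_equilibrium(t):
    """D_P(t) for the 2-state kernel, via P^t(0,0) = pi0 + pi1*lam^t
    (and symmetrically from state 1): D_P(t) = max(pi0, pi1)*|lam|^t."""
    return max(pi0, pi1) * abs(lam) ** t

for t in range(1, 60):
    assert dist_to_equilibrium(t) <= abs(lam) ** t / (2 * math.sqrt(pi_min)) + 1e-12
```

Here $\max(\pi_0,\pi_1) \le 1/(2\sqrt{\pi_\star})$ always holds, which is exactly why the bound is valid.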
Proof. We may assume that |λ| < 1, since otherwise the bound is trivial. Upon dividing f
by ∥f ∥∞ if necessary, we may further assume that ∥f ∥∞ = 1. Now, fix x ∈ X and t ∈ N.
Following Wilson’s idea, we estimate dtv (P t (x, ·), π) by applying (the complex version of)
Lemma 9 to the eigenfunction f . Letting X ∼ MC(X , P, δx ), we have
where we have first used the Markov property and then the fact that $(\lambda, f)$ is an eigenpair of $P$. Taking expectations, we find that $\mathbb{E}[f(X_t)] = \lambda^t f(x)$. On the other hand, Lemma 13 (or sending $t \to \infty$) gives $\pi f = 0$. With the notation of Lemma 9, we thus have
$$\delta^2 \;=\; |\lambda|^{2t}\,|f(x)|^2.$$
Moreover, conditionally on $X_0, \ldots, X_t$,
$$\mathbb{E}\left[\,|f(X_{t+1}) - f(X_t)|^2 \,\middle|\, X_0,\ldots,X_t\,\right] \;=\; \mathbb{E}\left[\,|f(X_{t+1})|^2 \,\middle|\, X_0,\ldots,X_t\,\right] + \left(1 - 2\Re(\lambda)\right)|f(X_t)|^2.$$
Since the left-hand side is at most $V$, taking expectations yields
$$\mathbb{E}\left[|f(X_{t+1})|^2\right] \;\le\; |\lambda|^2\,\mathbb{E}\left[|f(X_t)|^2\right] + V,$$
because $2\Re(\lambda) \le 1 + |\lambda|^2$. Subtracting $|\mathbb{E}[f(X_{t+1})]|^2 = |\lambda|^{2t+2}|f(x)|^2$, we obtain the recursion $\mathrm{Var}(f(X_{t+1})) \le |\lambda|^2\,\mathrm{Var}(f(X_t)) + V$, hence $\mathrm{Var}(f(X_t)) \le \frac{V}{1-|\lambda|^2}$ for all $t$. Plugging these estimates into Lemma 9 and solving for $t$ leads to
$$t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\ge\; \frac{t_{\mathrm{rel}}(P)}{2}\,\log\left(\frac{(1-|\lambda|^2)(1-\varepsilon)}{4\varepsilon V}\right),$$
as claimed.
3.4 Application: limit profile for the cycle
Let us illustrate our spectral techniques by applying them to the lazy random walk on the
cycle X = Z/nZ, whose transition kernel Pn acts on functions f : X → C as follows:
$$\forall x\in\mathcal{X}, \qquad (P_n f)(x) \;=\; \frac{f(x)}{2} + \frac{f(x+1)}{4} + \frac{f(x-1)}{4}.$$
In particular, for any $1 \le k \le n$, the function $\phi_k : x \mapsto \exp\left(\frac{2i\pi k x}{n}\right)$ is an eigenfunction of $P_n$ with eigenvalue $\lambda_k = \frac{1+\cos\left(\frac{2\pi k}{n}\right)}{2}$. Moreover, for $1 \le k, \ell \le n$, we have
$$\frac{1}{n}\sum_{x\in\mathcal{X}} e^{\frac{2i\pi(k-\ell)x}{n}} \;=\; \begin{cases} 1 & \text{if } k = \ell \\ 0 & \text{else,} \end{cases}$$
showing that $(\phi_1, \ldots, \phi_n)$ is an orthonormal basis of $\mathbb{C}^{\mathcal{X}}$. In particular, $\lambda_\star(P_n) = \frac{1+\cos\left(\frac{2\pi}{n}\right)}{2}$.
Using $1 - \cos(h) \sim \frac{h^2}{2}$ and $\ln(1+h) \sim h$ for $h \ll 1$, we obtain the asymptotic estimate
$$t_{\mathrm{rel}}(P_n) \;\sim\; \frac{n^2}{\pi^2}. \tag{38}$$
Using only this information, Remark 15 already gives $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) = \Omega(n^2)$ and $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) = O(n^2\ln n)$. Of course, we already know from Chapter 2 that the lower bound is sharp. Moreover, cutoff can not occur, because the product condition is not satisfied. This is confirmed by the following refined result, which uses the entire spectral decomposition of $P_n$ to conclude that the rescaled distance to equilibrium converges to a smoothly decreasing function (hence, not a step function) displayed on Figure 6.
Theorem 5 (Limit profile for random walk on the cycle). For any α > 0, we have
$$D_{P_n}\left(\alpha n^2\right) \;\xrightarrow[n\to\infty]{}\; \Psi(\alpha) \;:=\; \int_0^1 \left|\,\sum_{k=1}^{\infty} e^{-\alpha\pi^2 k^2}\cos\left(2\pi k u\right)\right| du.$$
In other words, $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) \sim \Psi^{-1}(\varepsilon)\,n^2$ as $n\to\infty$, for any fixed $\varepsilon \in (0,1)$.
Proof. By symmetry (Lemma 7), we can choose the initial state to be 0. Our starting point
is the following integral representation, which follows from the definition:
$$D_{P_n}(t) \;=\; \frac{1}{2}\int_0^1 \left|1 - n\,P_n^t\left(0, \lfloor un\rfloor\right)\right| du. \tag{39}$$
Figure 6: Plot of the limit profile Ψ : (0, ∞) → (0, 1) appearing in Theorem 5: the conver-
gence to equilibrium of random walk on the cycle occurs very gradually (no cutoff).
To estimate the integrand, we use the spectral decomposition given in Lemma 14:
$$n P^t(x,y) \;=\; \sum_{k=-\lfloor n/2\rfloor}^{\lceil n/2\rceil - 1} \left(\frac{1+\cos\left(\frac{2\pi k}{n}\right)}{2}\right)^t \cos\left(\frac{2\pi k (x-y)}{n}\right).$$
Taking $x = 0$, $y = \lfloor un\rfloor$ and $t = t_n := \lceil \alpha n^2\rceil$, each fixed-$k$ term converges:
$$\left(\frac{1+\cos\left(\frac{2\pi k}{n}\right)}{2}\right)^{t_n} \cos\left(\frac{2\pi k \lfloor un\rfloor}{n}\right) \;\xrightarrow[n\to\infty]{}\; e^{-\alpha\pi^2 k^2}\cos\left(2\pi k u\right).$$
On the other hand, since $\frac{1+\cos(a\pi)}{2} \le 1 - a^2 \le e^{-a^2}$ for all $a \in [-1,1]$, we have the domination
$$\left|\left(\frac{1+\cos\left(\frac{2\pi k}{n}\right)}{2}\right)^{t_n} \cos\left(\frac{2\pi k \lfloor un\rfloor}{n}\right)\right| \;\le\; e^{-4\alpha k^2}.$$
Moreover, the above domination shows that the left-hand side is bounded uniformly in $n$ and $u$, so we can pass to the limit in the integral representation (39) to obtain
$$D_{P_n}(t_n) \;\xrightarrow[n\to\infty]{}\; \frac{1}{2}\int_0^1 \left|1 - \sum_{k\in\mathbb{Z}} e^{-\alpha\pi^2 k^2}\cos\left(2\pi k u\right)\right| du.$$
We conclude by noting that the $k = 0$ term is 1, and that the others are even in $k$.
By the central limit theorem, for any fixed $a < b$,
$$\mathbb{P}\left(X_{\lceil \alpha n^2\rceil} \in [an, bn]\right) \;\xrightarrow[n\to\infty]{}\; \int_a^b f_\alpha(u)\,du,$$
where $f_\alpha$ denotes the corresponding limiting density. Therefore, the convergence (40) constitutes a very precise local refinement of the above CLT, where the macroscopic interval $[an, bn]$ is replaced by a singleton!
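The limit profile $\Psi$ is easy to evaluate numerically from its integral formula, e.g. by a midpoint Riemann sum with the series truncated at a finite $k$ (the truncation and grid parameters below are arbitrary choices of ours):

```python
import math

def Psi(alpha, kmax=30, m=2000):
    """Midpoint-rule approximation of
    Psi(alpha) = int_0^1 | sum_{k>=1} exp(-alpha*pi^2*k^2) * cos(2*pi*k*u) | du."""
    total = 0.0
    for j in range(m):
        u = (j + 0.5) / m
        s = sum(math.exp(-alpha * math.pi ** 2 * k * k) * math.cos(2 * math.pi * k * u)
                for k in range(1, kmax + 1))
        total += abs(s)
    return total / m

values = [Psi(a) for a in (0.01, 0.05, 0.2, 1.0)]
assert all(x > y for x, y in zip(values, values[1:]))   # Psi is decreasing
assert values[0] > 0.5 and values[-1] < 0.01            # from near 1 down to near 0
```

As expected from Theorem 5, the computed profile decreases smoothly from 1 to 0: the absence of a sharp threshold is exactly the absence of cutoff.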
so that the $2^n$ eigenfunctions $(\phi_I)_{I\subseteq[n]}$ form an orthonormal basis of $\mathbb{C}^{\mathcal{X}}$. In particular, the spectral radius is $\lambda_\star(P_n) = 1 - \frac{1}{n}$, which yields the asymptotic estimate
$$t_{\mathrm{rel}}(P_n) \;\sim\; n.$$
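That the functions $\phi_I(x) = \prod_{i\in I}(1-2x_i)$ are eigenfunctions of the lazy hypercube kernel with eigenvalue $1 - |I|/n$ can be checked by brute force on a small cube (a sketch; `lazy_P_apply` is our illustrative name, and $n = 4$ is arbitrary):

```python
from itertools import product

n = 4

def lazy_P_apply(f):
    """(Pf)(x) for the lazy hypercube walk: stay w.p. 1/2, else flip a uniform coordinate."""
    g = {}
    for x in product((0, 1), repeat=n):
        s = f[x] / 2
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            s += f[y] / (2 * n)
        g[x] = s
    return g

for I in [(0,), (1, 3), (0, 1, 2, 3)]:
    phi = {}
    for x in product((0, 1), repeat=n):
        v = 1.0
        for i in I:
            v *= 1 - 2 * x[i]
        phi[x] = v
    lam = 1 - len(I) / n
    Pphi = lazy_P_apply(phi)
    assert all(abs(Pphi[x] - lam * phi[x]) < 1e-12 for x in phi)
```

Flipping a coordinate $i \in I$ changes the sign of $\phi_I$, which is why the averaged eigenvalue comes out as $1 - |I|/n$.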
Thus, Remark 15 gives $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) = \Omega(n)$ and $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) = O(n^2)$. However, both estimates can be considerably refined if we use the entire spectral decomposition of $P_n$:
Theorem 6 (Cutoff for random walk on the hypercube). For fixed $\alpha \ge 0$, we have
$$D_{P_n}(\alpha n \ln n) \;\xrightarrow[n\to\infty]{}\; \begin{cases} 1 & \text{if } \alpha < \frac{1}{2}; \\ 0 & \text{if } \alpha > \frac{1}{2}. \end{cases}$$
In other words, $t_{\mathrm{mix}}^{(\varepsilon)}(P_n) \sim \frac{n\ln n}{2}$ as $n \to \infty$, for any fixed precision $\varepsilon \in (0,1)$.
Proof. Since the eigenfunctions $(\phi_I)_{I\subseteq[n]}$ take values in $\{-1,1\}$, the $L^2$ bound (36) yields
$$\begin{aligned}
4\, d_{\mathrm{tv}}^2\left(P^t(x,\cdot), \pi\right) \;&\le\; \left\|\frac{P^t(x,\cdot)}{\pi} - 1\right\|^2 \\
&=\; \sum_{\emptyset \ne I \subseteq [n]} \lambda_I^{2t}\, |\phi_I(x)|^2 \\
&=\; \sum_{k=1}^{n} \binom{n}{k}\left(1 - \frac{k}{n}\right)^{2t} \\
&\le\; \sum_{k=1}^{n} \binom{n}{k}\exp\left(-\frac{2kt}{n}\right) \\
&=\; \left(1 + e^{-\frac{2t}{n}}\right)^n - 1 \\
&\le\; e^{n e^{-2t/n}} - 1.
\end{aligned}$$
This suffices to establish the second half of the theorem (case $\alpha > \frac{1}{2}$). For the first half, we apply Wilson's method (Lemma 15) to the eigenpair $(\lambda, f)$, where $\lambda = 1 - \frac{1}{n}$ and
$$f(x) \;:=\; \sum_{i=1}^{n} \phi_{\{i\}}(x) \;=\; \sum_{i=1}^{n} (1 - 2x_i).$$
Since modifying a coordinate of $x$ changes $f(x)$ by $\pm 2$, we have (taking laziness into account)
$$\forall x\in\mathcal{X}, \qquad \sum_{y\in\mathcal{X}} P(x,y)\,|f(y) - f(x)|^2 \;=\; 2,$$
and $\|f\|_\infty = n$. Thus, Lemma 15 applies with $\lambda = 1 - 1/n$ and $V = 2/n^2$, yielding
$$D_{P_n}(t_n) \;\ge\; \left(1 + \frac{4V}{(1-|\lambda|^2)\,|\lambda|^{2t_n}}\right)^{-1} \;=\; \left(1 + \frac{1}{n^{1-2\alpha+o(1)}}\right)^{-1},$$
when $t_n \sim \alpha n\log n$ with fixed $\alpha \in (0,\infty)$. In particular, $D_{P_n}(t_n) \to 1$ when $\alpha < 1/2$.
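The explicit upper bound $4\,d_{\mathrm{tv}}^2 \le e^{n e^{-2t/n}} - 1$ from the proof can be evaluated directly at times slightly beyond $\frac{1}{2}n\ln n$, illustrating (slowly) the second half of the theorem. A sketch, with arbitrary sample sizes:

```python
import math

def l2_upper_bound(n, t):
    """Upper bound on d_tv from the proof: 2*d_tv <= sqrt(exp(n*exp(-2t/n)) - 1)."""
    return 0.5 * math.sqrt(math.exp(n * math.exp(-2.0 * t / n)) - 1.0)

# evaluate slightly beyond the cutoff time (alpha = 0.6 > 1/2)
bounds = [l2_upper_bound(n, 0.6 * n * math.log(n)) for n in (10**4, 10**5, 10**6)]
assert all(x > y for x, y in zip(bounds, bounds[1:]))   # the bound decreases with n
assert bounds[-1] < 0.2                                  # ... albeit quite slowly
```

The slow decay is expected: at time $\alpha n\ln n$ the bound behaves like $\frac{1}{2}n^{(1-2\alpha)/2}$, a power of $n$ rather than of $t$.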
4 Geometric techniques
We have seen that a transition kernel P is irreducible if and only if its associated diagram
GP is connected, in the sense that it contains a path from any vertex to any other. In light
of this, it is natural to suspect an intimate relation between the mixing behavior of P and
the geometry of GP . Formalizing this intuition is precisely the purpose of this chapter.
The function dist : X × X → [0, ∞] is not necessarily symmetric, but it always satisfies the
two other axioms that define a distance, namely:
Understanding how the volume of these balls grows with $t$ is a natural geometric question. We therefore introduce a function $\mathrm{vol}_P : \mathbb{N} \to [0,1]$, called the volume growth of the chain:
$$\mathrm{vol}_P(t) \;:=\; \min_{x\in\mathcal{X}} \pi\left(B(x,t)\right).$$
A basic observation is that $\mathrm{vol}_P(t)$ has to be large for the chain to be mixed at time $t$.
Lemma 16 (Volume bound). We have DP (t) ≥ 1 − volP (t) for all t ∈ N.
Proof. Simply apply the distinguishing event method (Lemma 8) to the pair (x, A), where
x is any state that realizes the minimum in the definition of volP (t), and A = B(x, t).
We now present two useful consequences of this result, which are easy to apply in practice.
Recall that the degree of a vertex $x \in \mathcal{X}$ is the number of vertices at distance 1 from $x$, and write $\deg(P)$ for the maximal degree in $G_P$.

Proof. For any $x \in \mathcal{X}$ and $t \in \mathbb{N}$, we have $\pi(B(x,t)) \le (\max\pi)\times|B(x,t)|$, with
$$|B(x,t)| \;\le\; 1 + \deg(P) + \deg(P)^2 + \cdots + \deg(P)^t.$$
When $\deg(P) \ge 2$, this geometric sum is less than $\deg(P)^{t+1}$, hence $\mathrm{vol}_P(t) \le (\max\pi)\deg(P)^{t+1}$, and the claim follows from Lemma 16.
Example 3 (Sliding window). Consider the chain in Example 2: each state has degree 2, so $\deg(P) = 2$. Since $\pi$ is the uniform law on $\{0,1\}^n$, Corollary 2 yields
$$t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\ge\; n - \log_2\left(\frac{1}{1-\varepsilon}\right).$$
This is off by only 1, the exact value of $t_{\mathrm{mix}}^{(\varepsilon)}(P)$ being obtained by replacing $\lfloor\cdot\rfloor$ with $\lceil\cdot\rceil$.
Example 4 (Riffle shuffle). One of the most standard methods for shuffling a deck of n
cards consists in repeating the following two-step procedure:
(ii) interleave cards from the two halves to produce a new deck.
How many times shall one repeat this procedure for the deck to be well mixed? To
formalize this question, let us identify each card with a unique label i ∈ [n] and represent
a deck of cards by a permutation σ ∈ Sn , where σ(i) indicates the label of the i−th
top card in the deck. Then, the above procedure transforms σ into a new permutation
of the form σ ′ = σ ◦ γI , where the set I ⊆ [n] indicates the positions to which the top
"half" gets relocated, and where $\gamma_I$ is the permutation that takes values $1, 2, \ldots, |I|$ (in this order) on $I$ and $|I|+1, \ldots, n$ (in this order) on $[n]\setminus I$. If we choose the subset $I \subseteq [n]$ at random according to some prescribed distribution (e.g., uniform) and repeat this procedure independently at each step, we obtain a well-defined random walk on the symmetric group $\mathcal{S}_n$, whose kernel is denoted by $P_n$. Since there are at most $2^n$ possible choices for the subset $I \subseteq [n]$, we have $\deg(P_n) \le 2^n$, so Corollary 2 yields
$$t_{\mathrm{mix}}^{(\varepsilon)}(P_n) \;\ge\; \frac{1}{n}\,\log_2\left((1-\varepsilon)\,n!\right) \;\sim\; \log_2 n,$$
for any fixed $\varepsilon \in (0,1)$. This general bound happens to be remarkably sharp: indeed, when $I$ is uniform, the sequence $(P_n)_{n\ge 1}$ is known to exhibit cutoff at time $\frac{3}{2}\log_2 n$!
Our second application of Lemma 16 involves the radius of the chain, defined as the smallest
integer t ∈ N such that any two balls of radius t intersect:
Corollary 3 (Radius bound). For any $\varepsilon \in \left(0, \frac{1}{2}\right)$, we have $t_{\mathrm{mix}}^{(\varepsilon)}(P) \ge \mathrm{rad}(P)$.

Proof. For $t = t_{\mathrm{mix}}^{(\varepsilon)}(P)$ with $\varepsilon < 1/2$, Lemma 16 yields $\mathrm{vol}_P(t) > \frac{1}{2}$. Since two events of probability more than 1/2 must intersect, the result follows.
Example 5 (Sliding window). Consider the kernel P of Example 2. The balls of radius
n − 1 around the states (0, . . . , 0) and (1, . . . , 1) are disjoint, because the former consists
of all binary words of length n that start with a 0, and the latter those with a 1. Thus,
$\mathrm{rad}(P) \ge n$ (there is in fact equality), and Corollary 3 gives
$$t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\ge\; n,$$
for all $\varepsilon < \frac{1}{2}$. This is in fact the correct answer, as we have seen in Example 2.
where diam(P ) := maxx,y dist(x, y) denotes the diameter. In general however, the radius
may be significantly smaller than the diameter, as shown in the following example.
The latter represents the evolution of a climber on a greasy ladder, where each step has a chance 1/2 to result in an abrupt fall. Clearly, $\mathrm{diam}(P) = n - 1$. However, $\mathrm{rad}(P) = 1$ because $P(x,1) > 0$ for all $x \in \mathcal{X}$. Thus, Corollary 3 yields the seemingly poor bound
$$\forall \varepsilon \in \left(0, \tfrac{1}{2}\right), \qquad t_{\mathrm{mix}}^{(\varepsilon)}(P) \;\ge\; 1.$$
This is in fact sharp. Indeed, consider the obvious coupling where falls occur simultaneously in both chains: at each step, coalescence occurs with chance at least one half, so Theorem 2 yields $D_P(t) \le 2^{-t}$. In particular, $t_{\mathrm{mix}}^{(\varepsilon)}(P) \le 2$ for $\varepsilon = 1/4$.
The elementary bounds presented in the previous section can not be expected to be sharp
in all situations, because the parameters deg(P ) and rad(P ) only depend on the structure of
the graph GP , and not on the precise transition probabilities. We will now introduce a more
sophisticated parameter called the conductance, which provides more accurate lower bounds
on mixing times by taking the precise transition probabilities into account.
4.2 Conductance
We turn $G_P$ into a weighted graph by defining the weight of a pair $(x,y) \in E$ as follows:
$$\vec{\pi}(x,y) \;:=\; \pi(x)\,P(x,y). \tag{42}$$
By the Ergodic Theorem, this quantity represents the asymptotic proportion of time that the edge $(x,y)$ is traversed by the chain. Note that the formula (42) extends to a probability measure on $\mathcal{X}^2$ whose first and second marginals are equal to $\pi$:
$$\forall x\in\mathcal{X}, \qquad \sum_{y\in\mathcal{X}} \vec{\pi}(x,y) \;=\; \sum_{y\in\mathcal{X}} \vec{\pi}(y,x) \;=\; \pi(x).$$
We will measure the surface of a set A ⊆ X by the quantity ⃗π (A × Ac ), and compare it with
the volume π(A). The ratio of those two quantities is called the conductance.
Definition 15 (Conductance). The conductance of a set $\emptyset \ne A \subseteq \mathcal{X}$ is the ratio
$$\Phi(A) \;:=\; \frac{\vec{\pi}\left(A\times A^c\right)}{\pi(A)}.$$
The conductance of a set measures the facility for the walk to escape from it. Indeed, letting $X \sim \mathrm{MC}(\mathcal{X}, P, \pi)$ denote a stationary chain with transition kernel $P$, we have for any $t \in \mathbb{N}$,
$$\mathbb{P}\left(X_{t+1}\notin A \,\middle|\, X_t \in A\right) \;=\; \frac{\mathbb{P}\left(X_t\in A,\, X_{t+1}\notin A\right)}{\mathbb{P}\left(X_t\in A\right)} \;=\; \Phi(A).$$
Thus, a set $A$ with small conductance constitutes a "bottleneck" in which the walk is likely to remain "trapped" for a long time. In particular, if that set misses a significant portion of the state space ($\pi(A) \le 1/2$), then mixing should take a long time. Here is a rigorous confirmation, where $\Phi(P) := \min\left\{\Phi(A) : \emptyset \ne A \subseteq \mathcal{X},\ \pi(A)\le \frac{1}{2}\right\}$.

Lemma 17 (Conductance bound). We always have $t_{\mathrm{mix}}(P) \;\ge\; \left\lceil \frac{1}{4\,\Phi(P)}\right\rceil$.
Proof. Consider a stationary chain $X \sim \mathrm{MC}(\mathcal{X}, P, \pi)$. Then for any $A \subseteq \mathcal{X}$ and $t \in \mathbb{N}$,
$$\{X_0\in A,\ X_t\notin A\} \;\subseteq\; \bigcup_{s=1}^{t}\{X_{s-1}\in A,\ X_s\notin A\}.$$
Taking probabilities, we deduce that
$$\sum_{x\in A} \pi(x)\,P^t\left(x, A^c\right) \;\le\; t\,\vec{\pi}\left(A\times A^c\right).$$
To conclude, choose a set A realizing the definition of Φ(P ) and set t = tmix (P ).
Example 7 (Random walk on a binary tree). Consider the lazy simple random walk on
the binary tree of height n (see Figure 7). This is the graph G = (V, E), where
• $V = \bigcup_{k=0}^{n}\{0,1\}^k$ consists of all binary words of length at most $n$;

• two words form an edge if one is obtained from the other by deleting the last letter.
Consider the "left subtree", i.e. the set $A \subseteq V$ of all words that start with a 0. Note that $\pi(A) = \frac{1}{2} - \frac{1}{2|E|}$, and that $A \times A^c$ contains a single edge. Thus,
$$\Phi(P) \;\le\; \Phi(A) \;=\; \frac{1}{2|E| - 2}.$$
Thus, Lemma 17 gives $t_{\mathrm{mix}}(P) \ge \left\lceil\frac{|E|-1}{2}\right\rceil = 2^n - 1$. This is in fact the correct order of magnitude as $n \to \infty$, as can be shown by coupling.
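The conductance computation for the left subtree can be replayed exactly with rational arithmetic. In the sketch below the helper `check` is our own name; it verifies both $\pi(A) = \frac{1}{2} - \frac{1}{2|E|}$ and $\Phi(A) = \frac{1}{2|E|-2}$ for small heights $h \ge 2$:

```python
from fractions import Fraction
from itertools import product

def check(h):
    """Replay the conductance computation on the binary tree of height h >= 2."""
    V = [""] + ["".join(w) for k in range(1, h + 1) for w in product("01", repeat=k)]
    parent = {v: v[:-1] for v in V if v}       # each non-root vertex -> its parent
    deg = {v: 0 for v in V}
    for v, p in parent.items():
        deg[v] += 1
        deg[p] += 1
    nE = len(parent)                           # |E|
    pi = {v: Fraction(deg[v], 2 * nE) for v in V}  # stationary law of SRW on a graph
    A = [v for v in V if v.startswith("0")]    # the left subtree
    piA = sum(pi[v] for v in A)
    # the single boundary edge of A is ("0", root); for the lazy walk,
    # vec_pi("0", root) = pi("0") * (1/2) * (1/deg("0"))
    surface = pi["0"] * Fraction(1, 2 * deg["0"])
    assert piA == Fraction(1, 2) - Fraction(1, 2 * nE)
    assert surface / piA == Fraction(1, 2 * nE - 2)

for h in (2, 3, 4):
    check(h)
```

Exact fractions remove any numerical ambiguity: the two identities hold precisely, not just approximately.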
In order to identify the worst bottleneck of a chain, the following remark may be helpful. For any $A \subseteq \mathcal{X}$,
$$\vec{\pi}\left(A\times A^c\right) \;=\; \vec{\pi}\left(A\times\mathcal{X}\right) - \vec{\pi}\left(A\times A\right) \;=\; \vec{\pi}\left(\mathcal{X}\times A\right) - \vec{\pi}\left(A\times A\right) \;=\; \vec{\pi}\left(A^c\times A\right).$$
It follows that $\Phi(A)$ is not modified if $P$ is replaced with $P^\star$ or with $(P+P^\star)/2$. Thus,
$$\Phi(P) \;=\; \Phi(P^\star) \;=\; \Phi\left(\frac{P+P^\star}{2}\right).$$
4.3 Curvature
The geometric methods described so far only provide lower bounds. In the present section, we
introduce a fundamental geometric notion that will provide powerful upper bounds on mixing
times. Our starting point is the observation that the total-variation distance is “blind” to
the geometry of the state space: we have dtv (δx , δy ) = 1 for any x ̸= y ∈ X , regardless of
how close x is to y. A simple but far-reaching idea consists in replacing total-variation with
the following geometric quantity.
where the infimum is taken over all possible couplings (X, Y) of µ and ν.
Thus, W(·, ·) extends the function dist(·, ·) from points to probability measures.
The infimum in Definition 16 is that of a linear (hence continuous) function over the compact (and convex) set of all coupling distributions of µ and ν:
$$\mathcal{C}(\mu,\nu) \;:=\; \left\{ p \in [0,1]^{\mathcal{X}\times\mathcal{X}} \;:\; \forall x\in\mathcal{X},\ \sum_{z\in\mathcal{X}} p(x,z) = \mu(x),\ \sum_{z\in\mathcal{X}} p(z,x) = \nu(x)\right\}.$$
In particular, this infimum is attained. Thus, there is always a coupling (X, Y ) that
attains the minimum in Definition 16: we call it an optimal coupling from µ to ν.
The function W is not symmetric in general, because dist is not. However, this is the only
obstruction for W to be a nice distance on P(X ), as the next lemma shows.
Lemma 18 (Properties). The Wasserstein distance W is convex and satisfies the sep-
aration axiom and the triangle inequality. It is symmetric if and only if dist is.
Proof of convexity. Fix $\mu, \mu', \nu, \nu' \in \mathcal{P}(\mathcal{X})$ and $\theta \in [0,1]$. Let $p$ be the law of an optimal coupling from $\mu$ to $\mu'$, and let $q$ be the law of an optimal coupling from $\nu$ to $\nu'$. Then $r := \theta p + (1-\theta)q$ is the law of a coupling from $\theta\mu + (1-\theta)\nu$ to $\theta\mu' + (1-\theta)\nu'$, so
$$\begin{aligned}
\mathcal{W}\left(\theta\mu + (1-\theta)\nu,\ \theta\mu' + (1-\theta)\nu'\right) \;&\le\; \sum_{(x,y)\in\mathcal{X}^2} r(x,y)\,\mathrm{dist}(x,y) \\
&=\; \sum_{(x,y)\in\mathcal{X}^2} \left(\theta\, p(x,y) + (1-\theta)\, q(x,y)\right)\mathrm{dist}(x,y) \\
&=\; \theta\,\mathcal{W}(\mu,\mu') + (1-\theta)\,\mathcal{W}(\nu,\nu'),
\end{aligned}$$
by the optimality of $p$ and $q$.
Proof of the separation axiom. Let µ, ν ∈ P(X ) be such that W(µ, ν) = 0. By Remark
22, we can find a coupling (X, Y ) of µ and ν such that E[dist(X, Y )] = 0. This forces
dist(X, Y ) = 0 a.-s., because dist(·, ·) is non-negative. Since dist(·, ·) moreover satisfies the
separation axiom, we deduce that X = Y a.-s., hence in distribution. Thus, µ = ν.
Proof of the triangle inequality. Fix $\lambda, \mu, \nu \in \mathcal{P}(\mathcal{X})$. Write $p$ (resp. $q$) for the law of an optimal coupling from $\lambda$ to $\mu$ (resp. from $\mu$ to $\nu$). Consider a random triple $(X, Y, Z)$ with law
$$\forall (x,y,z)\in\mathcal{X}^3, \qquad \mathbb{P}\left(X = x, Y = y, Z = z\right) \;=\; \frac{p(x,y)\,q(y,z)}{\mu(y)},$$
this ratio being interpreted as 0 if the denominator (hence also the numerator) is zero. Summing over all $z \in \mathcal{X}$ shows that $(X, Y)$ has law $p$, and summing over all $x \in \mathcal{X}$ shows that $(Y, Z)$ has law $q$. In particular, $(X, Z)$ is a coupling of $\lambda$ and $\nu$, so we have
$$\mathcal{W}(\lambda,\nu) \;\le\; \mathbb{E}\left[\mathrm{dist}(X,Z)\right] \;\le\; \mathbb{E}\left[\mathrm{dist}(X,Y)\right] + \mathbb{E}\left[\mathrm{dist}(Y,Z)\right] \;=\; \mathcal{W}(\lambda,\mu) + \mathcal{W}(\mu,\nu),$$
where we have used the triangle inequality for $\mathrm{dist}(\cdot,\cdot)$, and the optimality of $p$ and $q$.
Proof of symmetry. It is clear from Definition 16 that $\mathcal{W}(\cdot,\cdot)$ is symmetric whenever $\mathrm{dist}(\cdot,\cdot)$ is. The converse readily follows from Remark 21.
Remark 23 (Robustness). The above proofs remain valid for any function dist : X 2 →
R+ satisfying the separation axiom and the triangle inequality. Thus, the Wasserstein
distance is a very general tool that “lifts” any distance on X to a distance on P(X ).
The choice dist(x, y) := 1x̸=y gives rise to the total-variation distance, by Remark 11.
We now show that the Wasserstein distance controls the total variation distance.
Proof. The inequality 1x̸=y ≤ dist(x, y) trivially holds for all (x, y) ∈ X 2 . Integrating this
against the law of an optimal coupling from µ to ν concludes the proof.
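Lemma 19 can be tested concretely. For the path graph on $\{0,\ldots,m-1\}$ with $\mathrm{dist}(x,y) = |x-y|$, the Wasserstein distance has the classical one-dimensional closed form $\mathcal{W}(\mu,\nu) = \sum_k |F_\mu(k) - F_\nu(k)|$, where $F$ denotes the cumulative distribution function (a standard optimal-transport identity, not stated in the text). The sketch below compares it with the total-variation distance on random pairs of laws:

```python
import random

def tv(mu, nu):
    """Total-variation distance between two laws on {0,...,m-1}."""
    return sum(abs(a - b) for a, b in zip(mu, nu)) / 2

def w1_on_path(mu, nu):
    """Wasserstein distance for dist(x,y) = |x-y|: sum of |F_mu(k) - F_nu(k)|."""
    diff, total = 0.0, 0.0
    for a, b in zip(mu, nu):
        diff += a - b
        total += abs(diff)
    return total

rng = random.Random(0)
for _ in range(100):
    mu = [rng.random() for _ in range(6)]
    nu = [rng.random() for _ in range(6)]
    mu = [v / sum(mu) for v in mu]
    nu = [v / sum(nu) for v in nu]
    assert tv(mu, nu) <= w1_on_path(mu, nu) + 1e-12
```

Since $\mathrm{dist}(x,y) = |x-y| \ge \mathbf{1}_{x\ne y}$ here, Lemma 19 applies and the inequality holds on every trial.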
Thus, any upper bound on the Wasserstein distance is also an upper bound on the total-
variation distance. The interest of the Wasserstein distance is that it can be efficiently
controlled by exploiting the geometry of the state space, as we will now see. The curvature
of a Markov chain measures the amount by which Wasserstein distances are contracted under
the action of the underlying transition kernel P .
This global definition seems far too delicate for practical use. Fortunately, a pleasant feature
of curvature is that it admits a simple, local characterization.
Proof. Setting $\rho := \max_{(x,y)\in E} \mathcal{W}\left(P(x,\cdot), P(y,\cdot)\right)$, we will establish the inequality
$$\forall \mu,\nu\in\mathcal{P}(\mathcal{X}), \qquad \mathcal{W}\left(\mu P, \nu P\right) \;\le\; \rho\,\mathcal{W}(\mu,\nu). \tag{43}$$
Fix $(x,y)\in\mathcal{X}^2$ and let $x = x_0, x_1, \ldots, x_t = y$ be a geodesic path from $x$ to $y$ in $G_P$, so that $t = \mathrm{dist}(x,y)$ and $(x_{s-1}, x_s) \in E$ for each $s$. Using the triangle inequality for $\mathcal{W}$ (Lemma 18) and the definition of $\rho$, we have
$$\mathcal{W}\left(P(x,\cdot), P(y,\cdot)\right) \;\le\; \sum_{s=1}^{t} \mathcal{W}\left(P(x_{s-1},\cdot), P(x_s,\cdot)\right) \;\le\; \rho\, t \;=\; \rho\,\mathrm{dist}(x,y). \tag{44}$$
This establishes (43) in the special case $(\mu, \nu) = (\delta_x, \delta_y)$. For the general case, let $p$ be the law of an optimal coupling from $\mu$ to $\nu$, and observe that
$$(\mu P, \nu P) \;=\; \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\left(P(x,\cdot),\, P(y,\cdot)\right).$$
By the convexity of $\mathcal{W}$ (Lemma 18), this mixture representation gives
$$\mathcal{W}\left(\mu P, \nu P\right) \;\le\; \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\,\mathcal{W}\left(P(x,\cdot), P(y,\cdot)\right) \;\le\; \rho \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\,\mathrm{dist}(x,y) \;=\; \rho\,\mathcal{W}(\mu,\nu),$$
where the second inequality uses (44) and the last equality the optimality of $p$. Thus, (43) is established.
Conversely, note that (43) is an equality when (µ, ν) = (δx , δy ) with (x, y) ∈ E achieving
the maximum in the definition of ρ. Thus, ρ is in fact the smallest constant for which (43)
holds. Comparing this with Definition 17, we conclude that ρ = e−κ(P ) .
Proof. Using the definition of $\kappa(P)$ and an immediate induction over $t \in \mathbb{N}$, we have
$$\forall \mu,\nu\in\mathcal{P}(\mathcal{X}), \qquad \mathcal{W}\left(\mu P^t, \nu P^t\right) \;\le\; e^{-\kappa(P)\,t}\,\mathcal{W}(\mu,\nu).$$
Combining this with the crude bound $\mathcal{W}(\cdot,\cdot) \le \mathrm{diam}(P)$ and Lemma 19, we obtain
$$d_{\mathrm{tv}}\left(\mu P^t, \nu P^t\right) \;\le\; \mathrm{diam}(P)\, e^{-\kappa(P)\,t}.$$
Finally, choosing $\nu = \pi$, $\mu = \delta_x$ and maximizing over $x \in \mathcal{X}$ yields
$$D_P(t) \;\le\; \mathrm{diam}(P)\, e^{-\kappa(P)\,t},$$
for all $t \in \mathbb{N}$. The first claim is obtained by sending $t \to \infty$ (recall that $\left(D_P(t)\right)^{1/t} \to e^{-1/t_{\mathrm{rel}}(P)}$), and the second by choosing $t = \left\lceil \frac{1}{\kappa(P)}\log\frac{\mathrm{diam}(P)}{\varepsilon}\right\rceil$.
Example 8 (Hypercube). Consider the lazy random walk on the n−dimensional hyper-
cube. Fix two neighboring states x, y, and consider the coupling (X, Y ) of P (x, ·) and
P (y, ·) that updates the same coordinate using the same Bernoulli variable. Then,
$$\mathcal{W}\left(P(x,\cdot),\, P(y,\cdot)\right) \;\le\; \mathbb{E}\left[\mathrm{dist}(X,Y)\right] \;=\; 1 - \frac{1}{n}.$$
Since this holds for all $(x,y) \in E$, we deduce that
$$\kappa(P) \;\ge\; -\log\left(1-\frac{1}{n}\right) \;\ge\; \frac{1}{n}.$$
Thus, Theorem 7 gives $t_{\mathrm{rel}}(P) \le n$ and $t_{\mathrm{mix}}^{(\varepsilon)}(P) \le n\log\frac{n}{\varepsilon}$. A comparison with the results of Section 3.5 shows that those estimates are remarkably sharp. Interestingly, the bound $\kappa(P) \le \frac{1}{t_{\mathrm{rel}}(P)} = -\log\left(1-\frac{1}{n}\right)$ shows that the first inequality in (45) is an equality.
probabilities $\psi(+s)$ and $\psi(-s)$, where
$$s \;:=\; \frac{\beta}{n}\sum_{j\in[n]\setminus\{i\}} x_j \qquad\text{and}\qquad \psi(s) \;:=\; \frac{e^s}{e^s + e^{-s}}.$$
Note that ψ(s) + ψ(−s) = 1 for any s ∈ R, as required. Note also that ψ(s) increases from
0 to 1 as s ranges from −∞ to +∞. Thus, the new state of the i−th particle is likely to be
“plus” if s is a large positive number, and “minus” if s is a large negative number. Formally,
we have defined a Markov chain on $\mathcal{X} = \{-1,+1\}^n$ with transition kernel
$$P_n(x,y) \;:=\; \begin{cases} \displaystyle \frac{1}{n}\sum_{i=1}^{n} \psi\left(\frac{\beta}{n}\, x_i \sum_{j\in[n]\setminus\{i\}} x_j\right) & \text{if } y = x, \\[2mm] \displaystyle \frac{1}{n}\,\psi\left(-\frac{\beta}{n}\, x_i \sum_{j\in[n]\setminus\{i\}} x_j\right) & \text{if } y = (x_1,\ldots,x_{i-1},-x_i,x_{i+1},\ldots,x_n), \\[2mm] 0 & \text{else.} \end{cases}$$
The fact that ψ > 0 ensures that this transition kernel is ergodic. Moreover, it is easily
checked to be reversible with respect to the measure
$$\pi(x) \;:=\; \frac{1}{Z(\beta)}\,\exp\left(\frac{\beta}{2n}\left(\sum_{i=1}^{n} x_i\right)^2\right),$$
where Z(β) denotes the appropriate normalizing constant. How does the mixing time of this
chain depend on the interaction parameter β? In the easy case β = 0 (no interaction), we
recover the random walk on the hypercube, which has mixing time tmix (Pn ) = Θ(n log n).
On the other hand, in the limit where β → +∞ (strong interaction), the selected particle
will systematically adopt the majority state, so the chain will need an infinite amount of
time to move from (−1, . . . , −1) to (+1, . . . , +1). For “intermediate” values of β, it is then
tempting to believe that the mixing time will simply interpolate between those two extreme
situations, in a gradual way. In fact, the mixing time changes dramatically from O(n log n)
(fast-mixing regime) to exp(Ω(n)) (slow-mixing regime) as β passes the critical value 1.
Theorem 8 (Phase transition).

1. For any fixed $\beta < 1$, we have $t_{\mathrm{mix}}(P_n) = O(n\log n)$ (fast mixing).

2. For any fixed $\beta > 1$, we have $t_{\mathrm{mix}}(P_n) = \exp\left(\Omega(n)\right)$ (exponentially slow mixing).
Proof of fast mixing when $\beta < 1$. In light of Theorem 7, it suffices to prove that
$$\kappa(P_n) \;\ge\; \frac{1-\beta}{n},$$
which we now do. Let I and U be independent with I ∼ Unif({1, . . . , n}) and U ∼
Unif([0, 1]). Starting from a fixed state x = (x1 , . . . , xn ) ∈ X , one can construct a ran-
dom state X = (X1 , . . . , Xn ) with law P (x, ·) by setting for each i ∈ [n],
$$X_i \;:=\; \begin{cases} x_i & \text{if } I \ne i, \\[1mm] +1 & \text{if } I = i \text{ and } U \le \psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{i\}} x_j\right), \\[1mm] -1 & \text{if } I = i \text{ and } U > \psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{i\}} x_j\right). \end{cases}$$
Now, consider a state $y \in \mathcal{X}$ which differs from $x$ by a single coordinate, say $x_k = -1$ and $y_k = +1$. Then, the coupling $(X, Y)$ of $P(x,\cdot), P(y,\cdot)$ that uses the same pair $(U, I)$ gives
$$\mathrm{dist}(X,Y) \;=\; \begin{cases} 0 & \text{if } I = k, \\[1mm] 2 & \text{if } I \ne k \text{ and } \psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{I\}} x_j\right) \le U < \psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{I\}} y_j\right), \\[1mm] 1 & \text{else.} \end{cases}$$
But $\sum_{j\ne I} y_j - \sum_{j\ne I} x_j \le 2$ and $\|\psi'\|_\infty \le \frac{1}{2}$, so
$$\psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{I\}} y_j\right) - \psi\left(\frac{\beta}{n}\sum_{j\in[n]\setminus\{I\}} x_j\right) \;\le\; \frac{\beta}{n}.$$
Thus, the second case occurs with probability at most $\beta/n$, and we deduce that
$$\mathbb{E}\left[\mathrm{dist}(X,Y)\right] \;\le\; \left(1 - \frac{1}{n}\right)\left(1 + \frac{\beta}{n}\right) \;\le\; e^{\frac{\beta-1}{n}}.$$
This shows that $\kappa(P_n) \ge \frac{1-\beta}{n}$, as desired.
Proof of slow mixing when β > 1. Let us consider the event of negative magnetization:
$$A \;:=\; \left\{x\in\mathcal{X} \;:\; \sum_{i=1}^{n} x_i < 0\right\}.$$
The symmetry property $\pi(x) = \pi(-x)$ ensures that $\pi(A) \le \frac{1}{2}$, so that $\Phi(P) \le \Phi(A)$. By the conductance bound (Lemma 17), we only have to show that $\Phi(A) = \exp(-\Omega(n))$. For
$0 \le k \le n$, let $A_k$ consist of all configurations with $k$ "plus" and $n-k$ "minus" states:
$$A_k \;:=\; \left\{x\in\mathcal{X} \;:\; \sum_{i=1}^{n} x_i = 2k - n\right\}.$$
Since at most one coordinate is modified at each step, the only way for the chain to jump from $A$ to $A^c$ is to actually jump from $A_{\lceil n/2\rceil - 1}$ to $A_{\lceil n/2\rceil}$. Thus,
$$\vec{\pi}\left(A\times A^c\right) \;=\; \vec{\pi}\left(A_{\lceil n/2\rceil - 1}\times A_{\lceil n/2\rceil}\right) \;\le\; \pi\left(A_{\lceil n/2\rceil}\right),$$
where the inequality follows from the fact that the second marginal of $\vec{\pi}$ is $\pi$. On the other
hand, we have $\pi(A) \ge \max_{k<\lceil n/2\rceil} \pi(A_k)$ and, for $0 \le k \le n$, $\pi(A_k) = a_k/Z(\beta)$, where
$$a_k \;:=\; \binom{n}{k}\,\exp\left(\frac{\beta}{2n}\,(2k-n)^2\right).$$
Consequently, $\Phi(A) \le \min_{k<\lceil n/2\rceil} \frac{a_{\lceil n/2\rceil}}{a_k}$. To see that this ratio is exponentially small in $n$, fix $\theta \in (0,1)$ and observe that when $k = k(n) \sim \theta n$ as $n \to \infty$, we have
$$\frac{1}{n}\log a_{k(n)} \;\xrightarrow[n\to\infty]{}\; f(\theta) \;:=\; \frac{\beta}{2}(2\theta-1)^2 - \theta\log\theta - (1-\theta)\log(1-\theta).$$
Thus, our task boils down to showing the existence of $\theta < \frac{1}{2}$ so that $f(\theta) > f(1/2)$. But
$$f'(\theta) \;=\; 2\beta(2\theta-1) + \log\left(\frac{1-\theta}{\theta}\right) \qquad\text{and}\qquad f''(\theta) \;=\; 4\beta - \frac{1}{\theta} - \frac{1}{1-\theta}.$$
Thus, $f'(1/2) = 0$ and $f''(1/2) = 4(\beta - 1)$, so $f\left(\frac{1}{2}\right)$ is a strict local minimum when $\beta > 1$, and any $\theta < \frac{1}{2}$ close enough to $\frac{1}{2}$ works.
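The behaviour of $f$ near $\theta = 1/2$ is easy to confirm numerically: both the strict local minimum for $\beta > 1$ and the value $f''(1/2) = 4(\beta-1)$. The choices of $\beta$, $\theta$ and the step size `h` below are illustrative:

```python
import math

def f(theta, beta):
    """f(theta) = (beta/2)(2 theta - 1)^2 + binary entropy term, as in the text."""
    return (beta / 2) * (2 * theta - 1) ** 2 \
        - theta * math.log(theta) - (1 - theta) * math.log(1 - theta)

beta = 1.5                               # any fixed beta > 1 (our choice)
assert f(0.45, beta) > f(0.5, beta)      # f(1/2) is a strict local minimum

# central finite-difference check of f''(1/2) = 4*(beta - 1)
h = 1e-4
second = (f(0.5 + h, beta) - 2 * f(0.5, beta) + f(0.5 - h, beta)) / h**2
assert abs(second - 4 * (beta - 1)) < 1e-3
```

For $\beta < 1$ the same computation gives $f''(1/2) < 0$: the point $\theta = 1/2$ flips from maximum to minimum exactly at the critical value $\beta = 1$.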
Consequently, after $t$ steps, the walk typically lies at distance $\Theta(\sqrt{t})$ from its starting point,
rather than t as we used in our volume bound. In light of this, it is natural to hope for a
quadratic improvement over the diameter lower bound (Corollary 3) for “diffusive” chains.
Note that this intuition is correct for the random walk on the n−cycle, for which we have
seen that tmix (Pn ) ≍ n2 ≍ diam(Pn )2 . With those preliminary observations in mind, let us
now state a remarkable inequality which compares the transition kernel of any reversible
chain to that of the simple random walk on Z.
Theorem 9 (Carne-Varopoulos estimate). For any reversible kernel $P$, we have
$$P^t(x, y) \;\le\; \sqrt{\frac{\pi(y)}{\pi(x)}}\;\mathbb{P}\left(|X_t| \ge \mathrm{dist}(x, y)\right),$$
where $(X_t)_{t\ge 0}$ denotes simple random walk on $\mathbb{Z}$ started at $0$. In particular,
$$P^t(x, y) \;\le\; 2\sqrt{\frac{\pi(y)}{\pi(x)}}\;e^{-\frac{\mathrm{dist}^2(x,y)}{2t}}.$$
Proof. The proof uses the Chebyshev polynomials $(q_k)_{k\ge 0}$, which are defined by the recursion $q_0(z) = 1$, $q_1(z) = z$ and $q_{k+1}(z) = 2zq_k(z) - q_{k-1}(z)$ for $k\ge 1$. An easy induction shows that $q_k(\cos\theta) = \cos(k\theta)$ for all $\theta \in \mathbb{R}$ and $k \in \mathbb{N}$. Now, observe that for all $t \in \mathbb{N}$,
$$(\cos\theta)^t \;=\; \left(\frac{e^{i\theta} + e^{-i\theta}}{2}\right)^t \;=\; \mathbb{E}\left[e^{i\theta X_t}\right] \;=\; \mathbb{E}\left[\cos\left(\theta X_t\right)\right] \;=\; \sum_{k=0}^t \mathbb{P}\left(|X_t| = k\right) q_k(\cos\theta).$$
Since this is true for all θ ∈ R, we must have the polynomial identity
$$z^t \;=\; \sum_{k=0}^t \mathbb{P}\left(|X_t| = k\right) q_k(z).$$
Let us apply this polynomial identity to the matrix P , and evaluate the (x, y)−entry:
$$P^t(x, y) \;=\; \sum_{k=0}^t \mathbb{P}\left(|X_t| = k\right) q_k(P)(x, y) \;=\; \sum_{k=\mathrm{dist}(x,y)}^t \mathbb{P}\left(|X_t| = k\right) q_k(P)(x, y),$$
where we have observed that $P^j(x, y) = 0$ for all $0 \le j \le k$ whenever $k < \mathrm{dist}(x, y)$, so that $q_k(P)(x, y) = 0$ (since $q_k$ has degree $k$). To conclude, it remains to show that
$$q_k(P)(x, y) \;\le\; \sqrt{\frac{\pi(y)}{\pi(x)}}, \qquad (45)$$
for all k ≥ 0, and this is where we use reversibility: qk (P ) is a self-adjoint operator with
spectrum {qk (λ) : λ ∈ Sp(P )} ⊆ qk ([−1, 1]) ⊆ [−1, 1], where the last inclusion follows from
the identity $q_k(\cos\theta) = \cos(k\theta)$. Thus, $q_k(P)$ is a contraction, which means that
$$\left|\left\langle q_k(P)f, g\right\rangle\right| \;\le\; \|f\|_2\,\|g\|_2,$$
for all observables $f, g\colon \mathcal{X}\to\mathbb{C}$. Taking $f = \delta_y$ and $g = \delta_x$ yields exactly (45). The second claim follows from a classical application of Markov's inequality: for $d, \lambda > 0$,
$$\mathbb{P}\left(X_t \ge d\right) \;=\; \mathbb{P}\left(e^{\lambda X_t} \ge e^{\lambda d}\right) \;\le\; e^{-\lambda d}\,\mathbb{E}\left[e^{\lambda X_t}\right] \;=\; e^{-\lambda d}\left(\frac{e^{\lambda} + e^{-\lambda}}{2}\right)^t \;\le\; e^{-\lambda d + \frac{\lambda^2 t}{2}}.$$
The right-hand side is minimized for $\lambda = \frac{d}{t}$, in which case it is equal to $e^{-\frac{d^2}{2t}}$. This of course also applies to $-X_t$, and combining the two estimates concludes the proof.
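The estimate is simple to test numerically. The sketch below (my own illustration, not from the text) checks it for simple random walk on a path of five vertices, an arbitrary small reversible example, computing the tail of simple random walk on $\mathbb{Z}$ exactly via the binomial distribution:

```python
import numpy as np
from math import comb, sqrt

n, t = 5, 6
# Simple random walk on the path 0 - 1 - 2 - 3 - 4, reflecting at the ends.
P = np.zeros((n, n))
for x in range(n):
    nbrs = [y for y in (x - 1, x + 1) if 0 <= y < n]
    for y in nbrs:
        P[x, y] = 1 / len(nbrs)
deg = np.array([1 if x in (0, n - 1) else 2 for x in range(n)])
pi = deg / deg.sum()                          # reversible stationary law

def srw_tail(t: int, d: int) -> float:
    """P(|X_t| >= d) for simple random walk on Z started at 0."""
    # X_t = 2B - t with B ~ Binomial(t, 1/2).
    return sum(comb(t, b) for b in range(t + 1) if abs(2 * b - t) >= d) / 2 ** t

Pt = np.linalg.matrix_power(P, t)
cv_ok = all(
    Pt[x, y] <= sqrt(pi[y] / pi[x]) * srw_tail(t, abs(x - y)) + 1e-12
    for x in range(n)
    for y in range(n)
)
```

Here `dist(x, y) = |x - y|` on the path, and the check passes for every pair of states.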
Corollary 4 (Diffusive bound). For lazy simple random walk on any $N$-vertex graph (with $N \ge 65$),
$$t_{\mathrm{mix}}(P) \;\ge\; \frac{\mathrm{diam}^2(P)}{16\ln N}.$$
Proof. Set $d = \lceil \mathrm{diam}(P)/2\rceil$, so that $\mathrm{diam}(P) > 2(d-1)$: this means that we can find two
disjoint balls of radius d − 1. Thus, there is x ∈ X such that A = B(x, d − 1) satisfies
$$\pi(A) \;\le\; \frac{1}{2}.$$
But the elements of Ac are at distance d ≥ diam(P )/2 from x, so Theorem 9 ensures that
$$P^t\left(x, A^c\right) \;\le\; 2N^2\, e^{-\frac{3\,\mathrm{diam}^2(P)}{8t}},$$
where we have used the crude estimates $|A^c| \le N$ and $\frac{\pi(y)}{\pi(x)} = \frac{\deg(y)}{\deg(x)} \le N$. We conclude that
$$\mathrm{D}_P(t) \;\ge\; \frac{1}{2} - 2N^2\, e^{-\frac{3\,\mathrm{diam}^2(P)}{8t}}.$$
Thus, as long as $t \le \frac{\mathrm{diam}^2(P)}{16\ln N}$, we have $\mathrm{D}_P(t) \ge \frac{1}{2} - \frac{2}{\sqrt{N}}$, concluding the proof.
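The lazy cycle already illustrates this diffusive scaling: its mixing time is of order $n^2$, far beyond the diameter. The sketch below (my own check; the choice $n = 16$ and the search ranges are arbitrary) computes the mixing time exactly, using the fact that by vertex-transitivity it suffices to start the walk at $0$:

```python
import math
import numpy as np

n = 16
# Lazy simple random walk on Z/nZ.
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] = 0.25
    P[x, (x - 1) % n] = 0.25
pi = np.full(n, 1 / n)

def tv_from_0(t: int) -> float:
    """Total variation distance to pi after t steps from state 0."""
    return 0.5 * np.abs(np.linalg.matrix_power(P, t)[0] - pi).sum()

t_mix = next(t for t in range(1, 10_000) if tv_from_0(t) <= 0.25)
diffusive_floor = (n // 2) ** 2 / (16 * math.log(n))   # Corollary 4
```

The computed `t_mix` is of order $n^2$ and comfortably exceeds the diffusive lower bound.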
5 Variational techniques
For an ergodic transition kernel $P$ with stationary law $\pi$ on a state space $\mathcal{X}$, the convergence to equilibrium $\mathrm{D}_P(t)\to 0$ as $t\to\infty$ can be equivalently formulated as follows:
$$P^t f \;\xrightarrow[t\to\infty]{}\; \pi(f)\,\mathbf{1},$$
for all $f\colon\mathcal{X}\to\mathbb{R}$. In words, observables become constant under the repeated action of $P$. Equivalently, the variance $\mathrm{Var}(f) = \pi(f^2) - \pi^2(f)$ decays under the repeated action of $P$:
$$\mathrm{Var}\left(P^t f\right) \;\xrightarrow[t\to\infty]{}\; 0.$$
Definition 18 (Dirichlet form). The Dirichlet form is the quadratic form defined by
$$\mathcal{E}_P(f) \;=\; \frac{1}{2}\,\mathbb{E}_\pi\left[\left(f(X_1) - f(X_0)\right)^2\right] \;=\; \frac{1}{2}\sum_{x,y\in\mathcal{X}}\vec\pi(x, y)\left(f(y) - f(x)\right)^2 \;=\; \left\langle (I - P)f, f\right\rangle.$$
Remark 24 (Ideal chain). As a sanity check, consider the ideal kernel $\Pi$ defined in (15), which mixes exactly in a single step. Since $\vec\pi(x, y) = \pi(x)\pi(y)$, we have
$$\mathcal{E}_\Pi(f) \;=\; \frac{1}{2}\sum_{x,y\in\mathcal{X}}\pi(x)\pi(y)\left(f(x) - f(y)\right)^2 \;=\; \pi(f^2) - \pi^2(f) \;=\; \mathrm{Var}(f).$$
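The three expressions in Definition 18, and the identity $\mathcal{E}_\Pi(f) = \mathrm{Var}(f)$, are easy to confirm numerically. A small sketch of mine on a randomly generated reversible chain (seed and size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = W + W.T                                   # symmetric weights => reversible P
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()                  # stationary law of P
f = rng.random(n)

edge = pi[:, None] * P                        # edge weights pi(x) P(x, y)
dirichlet = 0.5 * (edge * (f[None, :] - f[:, None]) ** 2).sum()
quad = (pi * ((np.eye(n) - P) @ f) * f).sum() # <(I - P) f, f> in L^2(pi)
variance = (pi * f ** 2).sum() - (pi * f).sum() ** 2
Pi = np.tile(pi, (n, 1))                      # ideal kernel: every row equals pi
dir_ideal = 0.5 * ((pi[:, None] * Pi) * (f[None, :] - f[:, None]) ** 2).sum()
```

Both identities hold to machine precision.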
A natural way to quantify the variational behavior of P consists in comparing its Dirichlet
form with that of the ideal chain Π. This leads to the following fundamental definition.
Definition 19 (Poincaré constant). The Poincaré constant of the chain is the quantity
$$\gamma(P) \;:=\; \inf\left\{\frac{\mathcal{E}_P(f)}{\mathrm{Var}(f)} \;:\; f\colon\mathcal{X}\to\mathbb{R} \text{ is not constant}\right\}.$$
Since the ratio $\mathcal{E}_P(f)/\mathrm{Var}(f)$ is invariant under translation and scaling of $f$, we also have
$$\gamma(P) \;=\; \inf\left\{\mathcal{E}_P(f) \;:\; \pi(f) = 0,\ \|f\|_2 = 1\right\}.$$
Remark 25 (Time reversal). It follows readily from Definition 18 that
$$\mathcal{E}_P(f) \;=\; \mathcal{E}_{P^\star}(f) \;=\; \mathcal{E}_{\frac{P + P^\star}{2}}(f), \qquad\text{hence}\qquad \gamma(P) \;=\; \gamma(P^\star) \;=\; \gamma\left(\frac{P + P^\star}{2}\right).$$
In words, the Dirichlet form and the Poincaré constant are invariant under time reversal.
The Poincaré constant admits the following spectral characterization: for any transition kernel $P$ with stationary law $\pi$,
$$\gamma(P) \;=\; 1 - \lambda_2\left(\frac{P + P^\star}{2}\right),$$
where $\lambda_2(Q)$ denotes the second largest eigenvalue of a self-adjoint transition matrix $Q$.
In particular, if P is lazy and reversible, then
γ(P ) = 1 − λ⋆ (P ).
Proof. First consider the case where P is reversible. By decomposing the observable f in
our orthonormal basis (ϕ1 , . . . , ϕN ) of eigenfunctions, one finds
$$\left\langle (I - P)f, f\right\rangle \;=\; \sum_{k=2}^N\left(1 - \lambda_k\right)\left|\left\langle f, \phi_k\right\rangle\right|^2.$$
Minimizing over all observables $f$ with $\pi(f) = 0$ and $\|f\|_2 = 1$ yields
$$\gamma(P) \;=\; 1 - \lambda_2(P),$$
which establishes the claim when P is reversible. The general case is obtained by replacing
P with (P + P ⋆ )/2, which is always reversible and has the same Poincaré constant (Remark
25). Finally, when P is lazy and reversible, we have Sp(P ) ⊆ [0, 1], so λ⋆ (P ) = λ2 (P ).
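This characterization is easy to test numerically: on a random non-reversible chain, every observable's Rayleigh ratio stays above $1 - \lambda_2\big(\frac{P+P^\star}{2}\big)$, and a second eigenfunction attains it. A sketch of mine (seed and size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)             # generic non-reversible kernel
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()                                # stationary law

Pstar = (pi[:, None] * P).T / pi[:, None]     # adjoint in L^2(pi)
Q = (P + Pstar) / 2                           # additive reversibilization
D = np.diag(np.sqrt(pi))
S = D @ Q @ np.linalg.inv(D)                  # symmetric conjugate of Q
vals, vecs = np.linalg.eigh(S)                # ascending eigenvalues
gamma = 1 - vals[-2]

def ratio(f):
    """Rayleigh quotient E_P(f) / Var(f)."""
    E = (pi * ((np.eye(n) - P) @ f) * f).sum()
    var = (pi * f ** 2).sum() - (pi * f).sum() ** 2
    return E / var

f2 = np.linalg.inv(D) @ vecs[:, -2]           # optimizer: second eigenfunction
above = all(ratio(rng.standard_normal(n)) >= gamma - 1e-9 for _ in range(200))
```

The conjugation by $D = \mathrm{diag}(\sqrt{\pi})$ turns the self-adjointness in $L^2(\pi)$ into ordinary matrix symmetry, so `eigh` applies.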
Remark 26 (Range of γ(P )). The above result shows that we always have γ(P ) ∈ [0, 2],
and even that γ(P ) ∈ [0, 1] in the case where P is lazy.
We can now answer the first question raised at the beginning of this chapter.
Lemma 22 (Variance decay). For every observable $f\colon\mathcal{X}\to\mathbb{R}$,
$$\mathrm{Var}\left(Pf\right) \;\le\; \left(1 - \gamma\left(P^\star P\right)\right)\mathrm{Var}(f).$$
Moreover, if $P$ is lazy, then $\gamma(P^\star P) \ge \gamma(P)$.
Proof. Using the equality $\pi f = \pi P f$ and the definition of the adjoint $P^\star$, we have
$$\mathrm{Var}(f) - \mathrm{Var}(Pf) \;=\; \left\langle f, f\right\rangle - \left\langle Pf, Pf\right\rangle \;=\; \left\langle \left(I - P^\star P\right)f, f\right\rangle \;=\; \mathcal{E}_{P^\star P}(f) \;\ge\; \gamma\left(P^\star P\right)\mathrm{Var}(f),$$
and the first claim readily follows. Now, if $P$ is lazy, then for all $x, y \in \mathcal{X}$, we have
$$\pi(x)\,P^\star P(x, y) \;=\; \pi(x)\sum_{z\in\mathcal{X}} P^\star(x, z)P(z, y) \;\ge\; \pi(x)\left(P(x, x)P(x, y) + P^\star(x, y)P(y, y)\right) \;\ge\; \frac{1}{2}\left(\pi(x)P(x, y) + \pi(y)P(y, x)\right).$$
Multiplying by $\frac{1}{2}\left(f(x) - f(y)\right)^2$ and then summing over all $x, y \in \mathcal{X}$, we obtain
$$\mathcal{E}_{P^\star P}(f) \;\ge\; \mathcal{E}_P(f).$$
Since this is true for all f : X → R, we can safely conclude that γ(P ⋆ P ) ≥ γ(P ).
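Both claims of the lemma can be checked numerically on a random lazy chain. A sketch of mine (all parameter choices arbitrary), reusing the spectral characterization to compute Poincaré constants:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
P0 = rng.random((n, n))
P0 /= P0.sum(axis=1, keepdims=True)
P = (np.eye(n) + P0) / 2                      # lazy kernel
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
Pstar = (pi[:, None] * P).T / pi[:, None]     # adjoint in L^2(pi)
D = np.diag(np.sqrt(pi))

def gamma(Q):
    """Poincare constant of a kernel Q that is self-adjoint in L^2(pi)."""
    vals = np.linalg.eigvalsh(D @ Q @ np.linalg.inv(D))
    return 1 - np.sort(vals)[-2]

def variance(f):
    return (pi * f ** 2).sum() - (pi * f).sum() ** 2

g = gamma(Pstar @ P)
decay_ok = all(
    variance(P @ f) <= (1 - g) * variance(f) + 1e-12
    for f in rng.standard_normal((100, n))
)
lazy_ok = g >= gamma((P + Pstar) / 2) - 1e-12
```

Here `gamma((P + Pstar) / 2)` equals $\gamma(P)$ by Remark 25, so `lazy_ok` is exactly the second claim.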
The above lemma shows that the variance of any observable decays exponentially fast under
the repeated action of P , with rate 1/γ(P ⋆ P ). This answers the first question raised at the
beginning of the chapter. The following result answers the second question, by showing that $1/\gamma(P)$ plays the role of a relaxation time, without requiring reversibility.
Theorem 10. If $P$ is lazy, then for any $\varepsilon\in(0,1)$,
$$t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\le\; \left\lceil\frac{2}{\gamma(P)}\,\ln\left(\frac{1}{2\varepsilon\sqrt{\pi_\star}}\right)\right\rceil.$$
Proof. Our starting point is the Cauchy-Schwarz bound, which we recall here:
$$d_{\mathrm{tv}}\left(P^t(x,\cdot), \pi\right) \;\le\; \frac{1}{2}\left\|\frac{P^t(x,\cdot)}{\pi(\cdot)} - 1\right\|_2.$$
Now, observe that
$$\frac{P^t(x, y)}{\pi(y)} \;=\; \frac{P^{\star t}(y, x)}{\pi(x)} \;=\; P^{\star t}f_x(y),$$
where we have introduced the observable $f_x\colon y\mapsto \frac{\delta_x(y)}{\pi(x)}$. Since $\pi P^{\star t}f_x = \pi f_x = 1$, we see that
$$\left\|\frac{P^t(x,\cdot)}{\pi(\cdot)} - 1\right\|_2^2 \;=\; \mathrm{Var}\left(P^{\star t}f_x\right),$$
and the right-hand side can be estimated using Lemma 22 and Remark 25:
$$\mathrm{Var}\left(P^{\star t}f_x\right) \;\le\; \left(1 - \gamma(P)\right)^t\,\mathrm{Var}\left(f_x\right) \;\le\; \frac{e^{-\gamma(P)\,t}}{\pi_\star},$$
so that $d_{\mathrm{tv}}\left(P^t(x,\cdot), \pi\right) \le \frac{1}{2\sqrt{\pi_\star}}\,e^{-\gamma(P)t/2}$, and the claim readily follows.
Remark 27 (Comparison with Theorem 4). A careful inspection reveals that the above argument actually holds with $\gamma(PP^\star)$ instead of $\gamma(P)$, without the need for laziness. When $P$ is reversible, we have $1 - \gamma(PP^\star) = \lambda_\star^2(P)$, and we recover Theorem 4.
Theorem 11 (Cheeger inequalities). For any transition kernel $P$ with stationary law $\pi$,
$$\frac{\Phi^2(P)}{2} \;\le\; \gamma(P) \;\le\; 2\,\Phi(P).$$
Proof. In view of Remarks 25 and 19, we may suppose that P is reversible. Fix A ⊆ X .
The observable $f = \mathbf{1}_A$ satisfies $\mathrm{Var}(f) = \pi(A)\pi(A^c)$ and $\mathcal{E}_P(f) = \vec\pi\left(A\times A^c\right)$. Consequently,
$$\Phi(A) \;=\; \pi(A^c)\,\frac{\mathcal{E}_P(f)}{\mathrm{Var}(f)}.$$
The claimed upper bound follows immediately. For the lower bound, we can assume that $\lambda_2(P) \ge 0$, because $\Phi(P) \in [0, 1]$. Consider a non-negative observable $f\colon\mathcal{X}\to\mathbb{R}_+$ with $\pi(f = 0) \ge \frac{1}{2}$. For each $t\in\mathbb{R}_+$, we may choose $A = \{f > t\}$ in the definition of $\Phi(P)$ to get
$$\Phi(P)\,\pi\left(f > t\right) \;\le\; \sum_{x,y\in\mathcal{X}}\vec\pi(x, y)\,\mathbf{1}_{\left(f(y)\le t < f(x)\right)}.$$
Applying this inequality to $f^2$ in place of $f$ and integrating over $t\in\mathbb{R}_+$, we obtain
$$\Phi(P)\,\|f\|_2^2 \;\le\; \frac{1}{2}\sum_{x,y\in\mathcal{X}}\vec\pi(x, y)\left|f^2(x) - f^2(y)\right| \;\le\; \sqrt{\mathcal{E}_P(f)}\,\sqrt{\frac{1}{2}\sum_{x,y\in\mathcal{X}}\vec\pi(x, y)\left(f(x) + f(y)\right)^2},$$
by the Cauchy-Schwarz inequality. Expanding the squares, we see that the right-hand side simplifies to $\sqrt{\|f\|_2^4 - \langle f, Pf\rangle^2}$, so that
$$\Phi^2(P)\,\|f\|_2^4 \;\le\; \|f\|_2^4 - \left\langle f, Pf\right\rangle^2. \qquad (48)$$
To conclude, we would like to take $f = \phi_2$, but our initial assumption $\pi(f = 0) \ge \frac{1}{2}$ has no reason to be satisfied. Let us instead choose $f = \max(\phi_2, 0)$, which verifies the assumption upon changing $\phi_2$ to $-\phi_2$ if necessary. Since $f \ge 0$ and $f \ge \phi_2$, we have $Pf \ge 0$ and $Pf \ge \lambda_2\phi_2$, so $Pf \ge \lambda_2 f$. Thus, $\langle f, Pf\rangle \ge \lambda_2\|f\|_2^2$, and (48) easily implies the claim.
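Both Cheeger inequalities can be confirmed by brute force on a small reversible chain, enumerating every set $A$ with $\pi(A)\le\frac12$. A sketch of mine (seed and size arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 6
W = rng.random((n, n))
W = W + W.T                                   # symmetric weights => reversible P
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
edge = pi[:, None] * P                        # pi(x) P(x, y)

phi = min(                                    # conductance: brute force over A
    edge[np.ix_(A, [x for x in range(n) if x not in A])].sum() / pi[list(A)].sum()
    for r in range(1, n)
    for A in combinations(range(n), r)
    if pi[list(A)].sum() <= 0.5
)
D = np.diag(np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh(D @ P @ np.linalg.inv(D)))
gamma = 1 - lam[-2]
cheeger_ok = phi ** 2 / 2 - 1e-9 <= gamma <= 2 * phi + 1e-9
```

Exhaustive enumeration is only feasible for tiny chains, which is exactly why the conductance bounds of this section matter in practice.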
By combining Theorems 10 and 11, we obtain the following important upper bound.
Corollary 5 (Conductance upper-bound). For any lazy kernel $P$ and any $\varepsilon\in(0,1)$,
$$t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\le\; \frac{4}{\Phi^2(P)}\,\ln\left(\frac{1}{2\varepsilon\sqrt{\pi_\star}}\right).$$
Example 9 (Hypercube). Consider lazy random walk on the hypercube. The set $A := \{x\in\{0,1\}^n : x_1 = 0\}$ satisfies $\pi(A) = \frac{1}{2}$ and $\vec\pi\left(A\times A^c\right) = \frac{1}{4n}$, so that $\Phi(A) = \frac{1}{2n}$. We deduce that $\Phi(P) \le \frac{1}{2n}$. On the other hand, we know that $\gamma(P) = 1 - \lambda_\star(P) = \frac{1}{n}$, so that there is equality in Cheeger's upper bound. In particular,
$$\Phi(P) \;=\; \frac{1}{2n}.$$
Example 10 (Cycle). On $\mathbb{Z}/n\mathbb{Z}$, any set $A$ of size $k \le n/2$ contains at least two boundary edges, so $\Phi(A) \ge \frac{1}{2k}$. Moreover, there is equality when $A = \{1, \ldots, k\}$. Thus,
$$\Phi(P) \;=\; \frac{1}{2\lfloor n/2\rfloor}.$$
We know that $\gamma(P) = 1 - \lambda_\star(P) \sim \frac{\pi^2}{n^2}$, so Cheeger's lower bound is sharp up to prefactors.
In particular, when $(G_n)_{n\ge 1}$ is a sequence of bounded-degree expanders (graphs whose conductance remains bounded away from zero), the lazy random walk on $G_n$ satisfies $t_{\mathrm{mix}}(P_n) = \Theta(\log |G_n|)$, by Corollaries 2 and 5.
Given two transition kernels $P$ and $Q$ on the same state space $\mathcal{X}$, their comparison constant is defined as
$$\gamma(P : Q) \;:=\; \inf\left\{\frac{\mathcal{E}_P(f)}{\mathcal{E}_Q(f)} \;:\; f\colon\mathcal{X}\to\mathbb{R} \text{ is not constant}\right\}.$$
Note in particular that γ(P : Π) = γ(P ), by Remark 24. The motivation behind this
definition is contained in the following elementary but fruitful observation.
Lemma 23 (Comparison principle). If $P$ and $Q$ have the same stationary law, then
$$\gamma(P) \;\ge\; \gamma(P : Q)\,\gamma(Q).$$
Thus, any lower bound on $\gamma(Q)$ yields a lower bound on $\gamma(P)$, at a price $\gamma(P : Q)$.
Proof. For any non-constant observable $f\colon\mathcal{X}\to\mathbb{R}$,
$$\frac{\mathcal{E}_P(f)}{\mathrm{Var}(f)} \;=\; \frac{\mathcal{E}_P(f)}{\mathcal{E}_Q(f)}\times\frac{\mathcal{E}_Q(f)}{\mathrm{Var}(f)} \;\ge\; \gamma(P : Q)\,\frac{\mathcal{E}_Q(f)}{\mathrm{Var}(f)}.$$
Taking an infimum over all possible choices for $f$ concludes the proof.
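The principle can be tested exactly by computing each constant as a generalized eigenvalue problem restricted to the complement of constant observables. The sketch below (my own construction, not from the text) compares two Metropolis chains with the same target $\pi$ but different proposals; all parameter choices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
pi = rng.random(n) + 0.1
pi /= pi.sum()                                # common stationary law

def metropolis(prop):
    """Metropolis kernel for target pi and symmetric proposal matrix prop."""
    K = prop * np.minimum(1, pi[None, :] / pi[:, None])
    np.fill_diagonal(K, 0)
    np.fill_diagonal(K, 1 - K.sum(axis=1))    # rejection mass on the diagonal
    return K

prop_full = (np.ones((n, n)) - np.eye(n)) / (n - 1)   # propose uniformly
C = np.roll(np.eye(n), 1, axis=1)
prop_cycle = (C + C.T) / 2                             # propose a cycle neighbor
P, Q = metropolis(prop_full), metropolis(prop_cycle)

def dirichlet_matrix(K):
    """Symmetric M with f^T M f = E_K(f)."""
    A = np.diag(pi) @ (np.eye(n) - K)
    return (A + A.T) / 2

# Orthonormal basis of the space orthogonal to constant observables.
V = np.linalg.qr(np.column_stack([np.ones(n), np.eye(n)[:, :-1]]))[0][:, 1:]

def gen_min(A, B):
    """min of f^T A f / f^T B f over f orthogonal to constants."""
    L = np.linalg.cholesky(V.T @ B @ V)
    Li = np.linalg.inv(L)
    return np.linalg.eigvalsh(Li @ (V.T @ A @ V) @ Li.T).min()

M_var = np.diag(pi) - np.outer(pi, pi)        # f^T M_var f = Var(f)
g_P = gen_min(dirichlet_matrix(P), M_var)
g_Q = gen_min(dirichlet_matrix(Q), M_var)
g_PQ = gen_min(dirichlet_matrix(P), dirichlet_matrix(Q))
```

Since both quadratic forms are invariant under adding constants, restricting to the orthogonal complement of constants loses nothing, and `g_P >= g_PQ * g_Q` holds exactly.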
Remark 28 (Global comparison). Recall the Courant-Fischer variational principle: for a self-adjoint operator $A$ on a Hilbert space $H$,
$$\lambda_k(A) \;=\; \max_{\dim(F) = k}\;\min_{f\in F\setminus\{0\}}\;\frac{\langle Af, f\rangle}{\langle f, f\rangle},$$
where the maximum ranges over all $k$-dimensional subspaces $F\subseteq H$. Applying this to $A = I - \frac{P + P^\star}{2}$ on $H = L^2(\mathcal{X}, \pi)$, we obtain the global comparison principle
$$1 - \lambda_k\left(\frac{P + P^\star}{2}\right) \;\ge\; \gamma(P : Q)\left(1 - \lambda_k\left(\frac{Q + Q^\star}{2}\right)\right),$$
for all 1 ≤ k ≤ |X |, of which the above lemma is only the special case k = 2.
Theorem 12 (Distinguished paths). Let $P, Q$ be kernels on $\mathcal{X}$ with supports $E_P, E_Q$ and weights $\vec\pi_P, \vec\pi_Q$. For each $(x, y)\in E_Q$, let $\Upsilon_{x,y}$ be a path from $x$ to $y$ in $G_P$. Then,
$$\frac{1}{\gamma(P : Q)} \;\le\; \max_{e\in E_P} c(e), \qquad\text{where}\quad c(e) \;:=\; \frac{1}{\vec\pi_P(e)}\sum_{(x,y)\in E_Q}\vec\pi_Q(x, y)\,\left|\Upsilon_{x,y}\right|\,\mathbf{1}_{(e\in\Upsilon_{x,y})}.$$
Here |Υ| denotes the length of the path Υ, and e ∈ Υ means that e is traversed by Υ.
Proof. Fix an observable $f\colon\mathcal{X}\to\mathbb{R}$ and write $\nabla f(x, y) := f(y) - f(x)$. Now for each $(x, y)\in E_Q$, we may use the fact that $\Upsilon_{x,y}$ is a path from $x$ to $y$ to write
$$\left|\nabla f(x, y)\right|^2 \;=\; \left(\sum_{e\in\Upsilon_{x,y}}\nabla f(e)\right)^2 \;\le\; \left|\Upsilon_{x,y}\right|\sum_{e\in\Upsilon_{x,y}}\left|\nabla f(e)\right|^2,$$
by the Cauchy-Schwarz inequality. Multiplying by $\frac{1}{2}\vec\pi_Q(x, y)$ and summing over all $(x, y)\in E_Q$, we obtain
$$\mathcal{E}_Q(f) \;\le\; \frac{1}{2}\sum_{e\in E_P}\vec\pi_P(e)\,c(e)\left|\nabla f(e)\right|^2 \;\le\; \mathcal{E}_P(f)\,\max_{e\in E_P} c(e).$$
Since this inequality holds for all observables f : X → R, the result follows.
The simplest and most natural choice for Q is the ideal matrix Π that mixes in one step:
Corollary 6 (Rank-one case $Q = \Pi$). Let $P$ be a transition kernel on $\mathcal{X}$ and for each $(x, y)\in\mathcal{X}\times\mathcal{X}$, let $\Upsilon_{x,y}$ be a path from $x$ to $y$ in $G_P$. Then,
$$\frac{1}{\gamma(P)} \;\le\; \max_{e\in E_P} c(e), \qquad\text{where}\quad c(e) \;:=\; \frac{1}{\vec\pi(e)}\sum_{(x,y)\in\mathcal{X}\times\mathcal{X}}\pi(x)\pi(y)\,\left|\Upsilon_{x,y}\right|\,\mathbf{1}_{(e\in\Upsilon_{x,y})}.$$
Remark 29 (Congestion). The quantity c(e) is called the congestion induced at the edge
e by the collection of paths (Υx,y )(x,y)∈EQ . The challenge lies in choosing paths that will
make the maximal congestion maxe∈EP c(e) as low as possible. Note that we must have
$$\max_{e\in E_P} c(e) \;\ge\; \sum_{e\in E_P}\vec\pi_P(e)\,c(e) \;=\; \sum_{(x,y)\in E_Q}\vec\pi_Q(x, y)\left|\Upsilon_{x,y}\right|^2 \;\ge\; \sum_{(x,y)\in E_Q}\vec\pi_Q(x, y)\,\mathrm{dist}^2(x, y).$$
Thus, the best congestion we can hope for is the average quadratic distance (in GP )
between two consecutive states of MC(X , Q, π). To achieve this optimum (at least ap-
proximately), we need our paths (Υx,y )(x,y)∈EQ to be (close to) shortest paths in GP , and
the resulting congestion to be (close to) uniform across all edges, as in the next example.
Example 13 (Square grid). Let $P_n$ be the transition matrix for lazy simple random walk on an $n\times n$ grid: the state space is $\mathcal{X} = [n]\times[n]$, and two vertices $x = (x_1, x_2)$ and $y = (y_1, y_2)$ are neighbors if $|x_1 - y_1| + |x_2 - y_2| = 1$. For $x, y\in\mathcal{X}$, consider the path $\Upsilon_{x,y}$ of minimum length that goes from $x$ to $y$ first horizontally, then vertically. The congestion induced at any edge is easily seen to be of order $n^2$, so Corollary 6 gives $\frac{1}{\gamma(P_n)} = O(n^2)$.
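The congestion computation is easy to carry out explicitly on a small grid. The sketch below (mine; $n = 4$ is an arbitrary choice) builds the horizontal-then-vertical paths, computes $\max_e c(e)$, and checks the resulting bound against the exact Poincaré constant:

```python
import numpy as np

n = 4
states = [(i, j) for i in range(n) for j in range(n)]
idx = {s: k for k, s in enumerate(states)}
N = len(states)
P = np.zeros((N, N))
for (i, j) in states:
    nbrs = [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= i + di < n and 0 <= j + dj < n]
    P[idx[(i, j)], idx[(i, j)]] = 0.5          # lazy walk
    for m in nbrs:
        P[idx[(i, j)], idx[m]] = 0.5 / len(nbrs)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

def path(x, y):
    """Directed edges of the horizontal-then-vertical path from x to y."""
    (i, j), (k, l) = x, y
    pts = [(a, j) for a in range(i, k, 1 if k > i else -1)]
    pts += [(k, b) for b in range(j, l, 1 if l > j else -1)]
    pts.append((k, l))
    return list(zip(pts, pts[1:]))

c = {}
for x in states:
    for y in states:
        edges = path(x, y)
        for e in edges:
            c[e] = c.get(e, 0.0) + pi[idx[x]] * pi[idx[y]] * len(edges)
congestion = max(val / (pi[idx[a]] * P[idx[a], idx[b]]) for (a, b), val in c.items())
D = np.diag(np.sqrt(pi))
gamma = 1 - np.sort(np.linalg.eigvalsh(D @ P @ np.linalg.inv(D)))[-2]
```

Every edge used by a path is a genuine transition of $P$, so `1 / gamma <= congestion` as Corollary 6 guarantees.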
Example 14 (Universal bound). One can always choose $\Upsilon_{x,y}$ to be a shortest path from $x$ to $y$, and use the crude bound $|\Upsilon_{x,y}| \le \mathrm{diam}(P)$ to deduce that
$$\frac{1}{\gamma(P)} \;\le\; \frac{\mathrm{diam}(P)}{\vec\pi_\star}.$$
In particular, for lazy simple random walk on a graph $G = (V, E)$, we have
$$t_{\mathrm{rel}}(P) \;=\; O\left(\mathrm{diam}(G)\,|E|\right) \qquad\text{and}\qquad t_{\mathrm{mix}}(P) \;=\; O\left(\mathrm{diam}(G)\,|E|\,\log|E|\right).$$
The two inequalities appearing in Remark 29 show that our distinguished paths (Υx,y )(x,y)∈EQ
should not only be as short as possible, but also as spread-out as possible across the graph,
so that the congestion is fairly balanced among the edges. A simple but powerful idea for
reducing the imbalance consists in letting the paths (Υx,y )(x,y)∈EQ be random.
More precisely, the comparison bound of Theorem 12 remains valid when the distinguished paths are random: if each $\Upsilon_{x,y}$ is a random path from $x$ to $y$ in $G_P$, then
$$\frac{1}{\gamma(P : Q)} \;\le\; \max_{e\in E_P} c(e), \qquad\text{where}\quad c(e) \;:=\; \frac{1}{\vec\pi_P(e)}\sum_{(x,y)\in E_Q}\vec\pi_Q(x, y)\,\mathbb{E}\left[\left|\Upsilon_{x,y}\right|\,\mathbf{1}_{(e\in\Upsilon_{x,y})}\right].$$
Proof. The argument is exactly the same as in the case of deterministic paths, except that
we take expectations just before the very last inequality.
Here is an example that illustrates the advantage of random paths over deterministic ones.
Example 15 (Lazy random walk on $K_{2,n}$). The graph $K_{2,n}$ has vertex set $\mathcal{X} = \{L, R\}\cup[n]$, two vertices being neighbors if one is in $\{L, R\}$ and the other in $[n]$. Any deterministic path $\Upsilon_{L,R}$ from $L$ to $R$ contributes to the congestion of $e\in\Upsilon_{L,R}$ by at least
$$c(e) \;\ge\; \frac{1}{\vec\pi(e)}\,\pi(L)\,\pi(R)\left|\Upsilon_{L,R}\right| \;=\; n.$$
On the other hand, when the distinguished path $\Upsilon_{x,y}$ is chosen uniformly at random over all shortest paths from $x$ to $y$, the congestion is only $c(e) = 2 - \frac{1}{n}$ for every edge $e$.
The transition kernel $P$ is called transitive if for any two states $x, x'\in\mathcal{X}$, there is an automorphism of $P$ that takes $x$ to $x'$. It is called arc-transitive if for any two edges $(x, y), (x', y')\in E$, there is an automorphism that takes $x$ to $x'$ and $y$ to $y'$.
Intuitively, a chain is transitive if "all states play the same role". Examples include all random walks on groups (take $\phi(z) = z\star x^{-1}\star x'$). Arc-transitivity is the stronger requirement that "all edges play the same role". Examples include the lazy simple random walk on the hypercube, the cycle, and more generally discrete tori of the form $\mathbb{Z}_n^d$ for any $d, n\in\mathbb{N}$.
(i) If $P$ is transitive, then $\frac{1}{\gamma(P)} \le \frac{\mathrm{diam}^2(P)}{p_\star}$, where $p_\star = \min_{(x,y)\in E} P(x, y)$.
(ii) If $P$ is arc-transitive, then $\frac{1}{\gamma(P)} \le \frac{\mathrm{diam}^2(P)}{1 - \alpha}$, where $\alpha = \min_{x\in\mathcal{X}} P(x, x)$.
Proof. We use the methods of random paths to compare P with the rank-one matrix Q = Π.
For each x, y ∈ X , we take Υx,y to be a uniformly chosen shortest path between x and y.
In view of Remark 29, we know that the resulting congestion satisfies
$$\sum_{e\in E}\vec\pi(e)\,c(e) \;=\; \sum_{x,y\in\mathcal{X}}\pi(x)\pi(y)\,\mathrm{dist}^2(x, y) \;\le\; \mathrm{diam}^2(P).$$
When $P$ is arc-transitive, the quantities $\vec\pi(e)$ and $c(e)$ do not depend on $e\in E$. Consequently, the left-hand side equals $\vec\pi(E)\times\max_{e\in E} c(e)$, and the second bound follows because $\vec\pi(E) = 1 - \sum_{x\in\mathcal{X}}\vec\pi(x, x) = 1 - \alpha$. If $P$ is only transitive, then we can write
$$\sum_{e\in E}\vec\pi(e)\,c(e) \;=\; \sum_{x\in\mathcal{X}}\pi(x)\sum_{y\ne x}P(x, y)\,c(x, y) \;\ge\; p_\star\,\max_{e\in E} c(e),$$
because the quantity $\pi(x)\sum_{y\ne x}P(x, y)\,c(x, y)$ does not depend on $x$.
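Bound (ii) is easy to check on an arc-transitive example. A sketch of mine for the lazy cycle $\mathbb{Z}/8\mathbb{Z}$ (the size is arbitrary):

```python
import numpy as np

n = 8
# Lazy simple random walk on Z/nZ: arc-transitive, alpha = 1/2.
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] = 0.25
    P[x, (x - 1) % n] = 0.25
gamma = 1 - np.sort(np.linalg.eigvalsh(P))[-2]   # P symmetric: pi is uniform
diam, alpha = n // 2, 0.5
bound_ok = 1 / gamma <= diam ** 2 / (1 - alpha)
```

Here $1/\gamma \approx 6.8$ while the bound gives $\mathrm{diam}^2/(1-\alpha) = 32$, consistent with the quadratic-in-diameter scaling.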
Cheeger's inequalities establish a crucial link between geometric and spectral properties in Markov chains through conductance, denoted by Φ(P), and the Poincaré constant γ(P). These inequalities assert that for any reversible transition kernel P, the conductance provides bounds for the Poincaré constant. Specifically, the upper and lower bounds are Φ²(P)/2 ≤ γ(P) ≤ 2Φ(P), thus indicating that higher conductance corresponds to faster mixing times. This illustrates that the robustness of the chain's convergence and its variance decay rate are simultaneously governed by the system's underlying graph structure, captured via conductance, and its spectral characteristics, measured by the Poincaré constant.
A coupling kernel facilitates trajectorial coupling in Markov chains by providing a structured mechanism to create paired stochastic processes whose marginal distributions align with a given transition kernel P. This means that for a joint transition kernel Q defined on the product space, the marginals agree with P, thus allowing simultaneous construction of coupled processes (X, Y) such that each follows a Markov chain with transitions prescribed by P. This construction is pivotal in obtaining coupling trajectories which accurately represent the behavior of individual chains starting from different states, by ensuring that each step maintains consistency with the underlying chain dynamics.
The geometric approach through the concept of conductance offers the advantages of intuitive understanding and empirical estimates for mixing times, due to its clear-cut interpretation of flow and bottleneck properties in state spaces. Conductance, Φ(P), effectively bounds mixing rates and yields insights into convergence dynamics without delving into the spectral complexities that generally accompany eigenvalue approaches. However, practical challenges arise as accurate computation of conductance can be inherently difficult, especially in large or complex networks where optimizing over all potential cuts of the state space becomes computationally infeasible. It requires careful balancing of precision versus feasibility, and often necessitates supplementary assumptions or approximations to manage computational complexity effectively.
Establishing a coupling that maximizes the concentration on the diagonal set {(y, y) : y ∈ X} is challenging because constructing this optimal coupling involves complex calculations and requires knowledge of the entire structure of the probability distributions involved. The practical difficulty lies in efficiently estimating the probability of the event that the coupled variables are equal, P(X = Y), without encountering infeasible computational demands. Sophisticated couplings might not be directly apparent, necessitating simplifications that may deviate significantly from optimal designs, particularly when faced with distributions or transitions that are not explicitly defined or are overly complex.
Reversible Markov chains are significant in understanding convergence to equilibrium because they ensure the existence of an orthonormal basis of eigenfunctions, allowing for precise analysis of spectral properties and eigenvalue gaps crucial for studying convergence rates. Reversibility simplifies the computation of variances and centralizes the transition probability framework through detailed balance, which ensures that equilibrium states are inherently consistent with iterative chain behavior. This trait implies that spectral methods, such as those involving the Poincaré constant, become applicable and effective, offering clear and concise insights into mixing times and relaxation properties, thus providing foundational support for studying equilibrium dynamics and asymptotic distributional behavior.
The variational formula for total variation distance, as described, provides a framework where minimizing the probability of discrepancy between coupled random variables yields the precise total variation distance. This involves coupling methods which are theoretically advantageous, since every coupling offers an upper bound to the total variation distance, and a coupling achieving equality confirms this method's reliability. However, the practical challenge arises in the complexity and sophistication needed to estimate the probability P(X = Y) effectively through coupling. Therefore, while theoretically sound, practical estimation requires careful design of efficient and tractable couplings.
The cutoff phenomenon in random walks on the hypercube signifies a sharp transition in the convergence of the chain to its stationary distribution. It indicates that for a fixed precision ε, the mixing time is characterized by a threshold, after which the total variation distance between the chain's distribution and the stationary distribution drops abruptly. This is quantitatively expressed as t_mix^{(ε)}(P_n) ∼ (n ln n)/2 as n approaches infinity, for any ε in (0, 1). The realization of this cutoff phenomenon implies precise control over mixing times, revealing the deterministic aspect of apparent random behaviors, and suggests that convergence is mainly driven by specific eigenvalue arrangements and the combinatorial structure of the hypercube.
Coalescence time plays a crucial role in determining the mixing time of Markov chains, as it reflects how quickly two processes from different initial states converge. Using coupling kernels, the coalescence time T is defined as the first time the coupled Markov chains' trajectories coincide. Results such as Theorem 2 show that fast convergence to equilibrium is achieved if these trajectories meet quickly. Hence, coupling techniques that reduce coalescence time result in a lower mixing time, driving the Markov chain faster to its equilibrium.
The spectral radius significantly influences the relaxation time in Markov chains, as it is indicative of the rate at which a distribution converges to equilibrium. The spectral radius, often denoted λ⋆(P), is tied to the Poincaré constant γ(P): in reversible settings, 1 − γ(PP⋆) coincides with λ⋆²(P). A smaller spectral radius means the nontrivial eigenvalues lie further from 1, leading to faster exponential decay of variances and hence shorter relaxation times. Conversely, if the spectral radius nears 1, it connotes slow convergence, reflecting longer retention in non-equilibrium states. Thus, spectral properties dictate mixing dynamics, with relaxation time acting as a bridge between abstract algebraic characteristics and practical asymptotic behavior.
The Central Limit Theorem applies to random walks on the cycle by illustrating that the distribution of the walk's position approximates a Gaussian distribution over time. Specifically, for sufficiently large n, the rescaled positions X_⌈αn²⌉/n converge in law to a Gaussian of variance α reduced modulo 1 on the interval [0, 1). This application of the CLT shows refined local behavior as the walk progresses, highlighting the blending of predictability and randomness once the walk has had time to explore the state space diffusively. Hence, it quantifies the state's distributional dynamics over time, leveraging Gaussian properties for large state spaces.