
Mixing times of Markov chains

Justin Salez

Abstract

How many times should one shuffle a deck of 52 cards? This course is a self-contained introduction to the modern theory of mixing for Markov chains. It consists
of a guided tour through the various methods for estimating mixing times, including
couplings, spectral analysis, discrete geometry, and functional inequalities. Each of
those tools is illustrated on a variety of examples from different contexts: interacting
particle systems, card shufflings, random walks on graphs and networks, etc. Particular
attention is devoted to the cutoff phenomenon, a remarkable but still mysterious
phase transition in the convergence to equilibrium of certain chains.

Contents

1 Framework
  1.1 Markov chains
  1.2 Distance to equilibrium
  1.3 Relaxation time and mixing time
  1.4 The cutoff phenomenon
  1.5 Random walks on graphs and groups

2 Probabilistic techniques
  2.1 Distinguishing statistics
  2.2 Couplings
  2.3 Coalescence times
  2.4 Application: random walk on the cycle
  2.5 Application: Ehrenfest model

3 Spectral techniques
  3.1 Spectral radius
  3.2 Diagonalization of reversible kernels
  3.3 Wilson’s method
  3.4 Application: limit profile for the cycle
  3.5 Application: cutoff for the hypercube

4 Geometric techniques
  4.1 Volume, degree, diameter
  4.2 Conductance
  4.3 Curvature
  4.4 Application: phase transition in the Curie-Weiss model
  4.5 Carne-Varopoulos bound

5 Variational techniques
  5.1 Dirichlet form and Poincaré constant
  5.2 Cheeger inequalities
  5.3 Comparison principle
  5.4 Distinguished paths

1 Framework
This first chapter sets the stage on which we are going to perform. We start with a brief
but self-contained reminder on finite Markov chains and their large-time behavior. We then
introduce the total-variation distance to equilibrium, collect some of its basic properties, and
use them to define three fundamental notions that will lie at the center of our attention: the
relaxation time, the mixing time, and the cutoff phenomenon. Finally, we briefly present two
rich classes of Markov chains which are important from both a theoretical and a practical
viewpoint: random walks on graphs, and random walks on groups.

1.1 Markov chains


A Markov chain is specified by a triple (X , P, ν) consisting of the following ingredients:

(i) A state space X , which in our case will just be a finite, non-empty set.

(ii) A transition kernel P : X × X → [0, 1], which satisfies

    ∀x ∈ X ,   Σ_{y∈X} P(x, y) = 1.                                (1)

(iii) An initial law ν : X → [0, 1], which satisfies

    Σ_{x∈X} ν(x) = 1.                                              (2)

Definition 1 (Markov chain). A Markov chain with parameters (X , P, ν) is a sequence X = (X0, X1, . . .) of X-valued random variables on a probability space (Ω, F, P) such that

    P(X0 = x0, X1 = x1, . . . , Xt = xt) = ν(x0)P(x0, x1) · · · P(xt−1, xt),      (3)

for every t ∈ N = {0, 1, 2, . . .} and every (x0, . . . , xt) ∈ X^{t+1}. We write X ∼ MC(X , P, ν).

Remark 1 (Markov property). The product form (3) asserts that the past (X0 , . . . , Xt−1 )
and the future (Xt+1 , Xt+2 , . . .) are conditionally independent, given the present Xt .
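The product formula (3) translates directly into a sampling procedure: draw X0 from ν, then each subsequent state from the row P(Xt, ·). Here is a minimal sketch; the two-state kernel is an arbitrary illustration, not the kernel of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(P, nu, t):
    """Draw (X_0, ..., X_t) from MC(X, P, nu): X_0 ~ nu, then X_{s+1} ~ P(X_s, .)."""
    n = P.shape[0]
    path = [rng.choice(n, p=nu)]
    for _ in range(t):
        path.append(rng.choice(n, p=P[path[-1]]))
    return path

# Arbitrary two-state kernel and initial law concentrated on state 0.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
nu = np.array([1.0, 0.0])
path = sample_chain(P, nu, 10)
```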

Figure 1: A particular transition kernel on X = {a, b, c}, represented by its diagram.

Remark 2 (Existence and uniqueness). By Dynkin’s theorem, the finite-dimensional marginals (3) determine the law of X as a stochastic process, i.e. a random variable taking values in the product space (X , P(X ))^{⊗N}. Conversely, Kolmogorov’s extension theorem (or a direct construction) always ensures the existence of X ∼ MC(X , P, ν).

Let X ∼ MC(X , P, ν), and let µt denote the law of the random variable Xt. Then it follows from the above definition that µ0 = ν and that for every t ≥ 1,

    ∀y ∈ X ,   µt(y) = Σ_{x∈X} µt−1(x) P(x, y).                    (4)

It will be convenient to view a probability distribution µ ∈ P(X ) as a row vector, a function f : X → R as a column vector, and a transition kernel P : X × X → [0, 1] as a matrix. In particular, we may rewrite the above identity in the following compact way:

µt = µt−1 P. (5)

We shall be interested in the large-time distribution of our Markov chain, i.e., the asymptotic
behavior of µt as t → ∞. If the sequence (µt )t≥0 converges to some distribution π ∈ P(X ),
then passing to the limit in the above recursion shows that π must be stationary, i.e.

π = πP. (6)

We are thus naturally led to investigate the question of existence and uniqueness of stationary
distributions. In that respect, the following definition will be useful.
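Numerically, a stationary law can be found by solving the linear system π = πP together with the normalization Σ_{x} π(x) = 1. A minimal sketch (the two-state kernel below is an assumption chosen for illustration):

```python
import numpy as np

def stationary(P):
    """Solve pi P = pi together with sum(pi) = 1 as an overdetermined but
    consistent least-squares system; existence is guaranteed by Lemma 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = stationary(P)   # stationary law of this two-state chain: (2/3, 1/3)
```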

Definition 2 (Irreducibility). The transition kernel P is said to be irreducible if

    ∀(x, y) ∈ X^2,  ∃t ∈ N,  P^t(x, y) > 0.                        (7)

Remark 3 (Graph-theoretical interpretation). The diagram of the chain is the directed graph GP on X obtained by placing an edge x → y whenever P(x, y) > 0 (see Figure 1). The irreducibility of P simply means that GP is strongly connected, in the sense that it contains a path from every vertex to every other. When this is not the case, we can always restrict P to a strongly connected component to obtain an irreducible kernel.

Lemma 1 (Stationary laws). Any transition kernel P has a stationary law π. If P is irreducible, then the stationary law is unique and has full support.

Proof. Consider the empirical mean of the first t ≥ 1 instantaneous laws of the chain:

    πt := (1/t) Σ_{s=0}^{t−1} µs ∈ P(X ).                          (8)

By compactness, we can extract from the sequence (πt)t≥1 a subsequence that admits a limit π ∈ P(X ). The latter is necessarily stationary, because for each x ∈ X ,

    (πt P)(x) − πt(x) = (µt(x) − µ0(x))/t −−−→ 0 as t → ∞.

This shows existence. Note that the stationary equation πP = π implies that the support of π
is closed under the accessibility relation x → y defining the diagram GP . In particular, if P is
irreducible, then any stationary law π has full support. Finally, suppose for a contradiction that an irreducible kernel P admits two distinct stationary distributions π1 and π2, and consider the function m : [0, 1] → [−1, 1] defined as follows:

    m(θ) := min_{x∈X} (π1(x) − θπ2(x)).

It is clear that m is continuous, with m(0) > 0 (because π1 has full support) and m(1) < 0 (because π1 ̸= π2). Thus, there is θ⋆ ∈ (0, 1) such that m(θ⋆) = 0. But then, the vector

    π : x 7→ (π1(x) − θ⋆π2(x)) / (1 − θ⋆),                         (9)

satisfies πP = π (because so do π1, π2), has minimal entry equal to zero (by definition of θ⋆) and has entry sum equal to 1 (because so do π1, π2). Thus, π is a non-fully supported stationary distribution for the irreducible kernel P , a contradiction.

Remark 4 (Optimality). The converse is easily shown to be true: the irreducibility of P is necessary and sufficient for the invariant law π to be unique and fully supported.

Remark 5 (Cesàro mixing). By a standard compactness-uniqueness argument, the above proof shows that any irreducible Markov chain X satisfies the convergence

    ∀y ∈ X ,   (1/t) Σ_{s=0}^{t−1} P(Xs = y) −−−→ π(y) as t → ∞,   (10)

regardless of the chosen initial condition.

Let us note that the more natural conclusion P(Xt = y) → π(y), which is stronger than (10),
requires extra assumptions, as the following counter-example shows.

Example 1 (Periodic chain). On X = {0, 1}, consider the transition kernel

    P = ( 0  1
          1  0 ),

which is irreducible with stationary law π = Unif(X ). Since P^2 = I (the identity matrix on X ), we have P^{2t} = I and P^{2t+1} = P for all t ∈ N. In particular, if we start from the initial condition ν = δ0, then the sequence (µt)t≥0 keeps alternating between the two Dirac masses δ0 and δ1, and mixing only occurs in the Cesàro-mean sense (10).

In order to preclude the type of pathological periodicity displayed by Example 1, we need to strengthen the irreducibility assumption (7) by exchanging the order of quantifiers.

Definition 3 (Ergodicity). The transition kernel P is called ergodic if

    ∃t ∈ N,  ∀(x, y) ∈ X^2,  P^t(x, y) > 0.                        (11)

Remark 6 (Laziness). Ergodicity is strictly stronger than irreducibility. A simple way to make an irreducible kernel P ergodic consists in replacing it with its lazy version:

    P̂ := (I + P) / 2.                                             (12)

In other words, a fair coin is tossed at each step: if heads comes up, a transition is made according to the original kernel P , otherwise nothing happens. The resulting chain X̂ is just a time-changed version of the original chain X, and it retains most of its essential features, including the stationary distribution π. Conversely, any transition kernel whose diagonal entries are at least 1/2 is the lazy version of some transition kernel.
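As an illustration of (12), consider the deterministic rotation on three states, which is irreducible but periodic: its powers never converge, while the powers of its lazy version approach the uniform stationary law. A small numerical sketch:

```python
import numpy as np

# Deterministic rotation on X = {0, 1, 2}: irreducible but periodic (period 3).
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

P_lazy = (np.eye(3) + P) / 2      # the lazy version (12)

# P^t cycles forever among three permutation matrices, so P^t(x, y) cannot
# converge; the lazy kernel is ergodic (all entries of P_lazy^2 are already
# positive) and its powers approach the rank-one kernel Pi(x, y) = 1/3.
Pt = np.linalg.matrix_power(P, 60)            # = I, since P^3 = I
Pt_lazy = np.linalg.matrix_power(P_lazy, 60)  # ~ all entries 1/3
```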

The following fundamental result constitutes the starting point of our study. It asserts that
any ergodic Markov chain mixes: as the number of iterations grows, the chain approaches
the stationary distribution, regardless of the initial condition.

Theorem 1 (Convergence to equilibrium). If P is ergodic, then we have

    ∀y ∈ X ,   P(Xt = y) −−−→ π(y) as t → ∞,                       (13)

regardless of the initial condition. Equivalently, P^t(x, y) → π(y) for all x, y ∈ X .

We will later see many proofs of this result, and numerous refinements. Here is a nice practical application: in order to approximately sample from a sophisticated target distribution π, all we have to do is to design an ergodic Markov chain whose stationary law is π, and let it run for sufficiently long. This observation is at the basis of a number of technological revolutions, such as Markov Chain Monte Carlo methods or Google’s PageRank algorithm. It motivates the following question, to which the present course is entirely devoted.

Question 1 (Speed of convergence). How fast is the convergence to equilibrium (13)?

To answer this question, we first need to agree on a way to measure the distance to equilib-
rium, i.e., the discrepancy between the law µt at time t, and the stationary law π.

1.2 Distance to equilibrium
There are many natural ways to measure the distance between two probability measures µ
and ν on a finite set X : Hellinger distance for statisticians, Hilbert/Lp norms for analysts,
relative entropy for physicists and information theorists, etc. Each has its own flavor and its
specific list of advantages and drawbacks, making it more relevant to the study of certain
chains than others. Rather than competing with each other, these distances are complemen-
tary: they are related one to another by an array of inequalities, and combining different
viewpoints is often the best way to analyze a given Markov chain. We will here mainly focus
on the total-variation distance, which is more natural for probabilists.

Definition 4 (Total variation). The total variation distance between µ and ν is

    dtv(µ, ν) := max_{A⊆X} |µ(A) − ν(A)|.

This is clearly a distance on the set P(X ) of all probability measures on X , and we have
dtv (µ, ν) ≤ 1, with equality if and only if µ and ν have disjoint supports. Let us start by
collecting a list of useful alternative expressions.

Lemma 2 (Alternative expressions). For any µ, ν ∈ P(X ), we also have

    dtv(µ, ν) = max_{A⊆X} (µ(A) − ν(A))
              = Σ_{x∈X} (µ(x) − ν(x))_+
              = 1 − Σ_{x∈X} µ(x) ∧ ν(x)
              = (1/2) Σ_{x∈X} |µ(x) − ν(x)|
              = (1/2) sup_{f : X →[−1,1]} |µf − νf|.

Proof. The first identity follows from the observation that changing the set A to its complement changes the value µ(A) − ν(A) to its opposite. For the second, note that

    µ(A) − ν(A) = Σ_{x∈A} (µ(x) − ν(x))
                ≤ Σ_{x∈A} (µ(x) − ν(x))_+
                ≤ Σ_{x∈X} (µ(x) − ν(x))_+,

for all A ⊆ X , and that those become equalities for A = {x ∈ X : µ(x) ≥ ν(x)}. The third and fourth claims are obtained by taking a = µ(x) and b = ν(x) in the two identities

    (a − b)_+ = a − a ∧ b
              = (1/2)(|a − b| + a − b).

Finally, for the last identity, simply note that for any f : X → [−1, 1],

    |µf − νf| = |Σ_{x∈X} (µ(x) − ν(x)) f(x)|
              ≤ Σ_{x∈X} |µ(x) − ν(x)|
              = 2dtv(µ, ν),

and that there is equality in the case f(x) = sign(µ(x) − ν(x)), with sign = 1_{R+} − 1_{R−}.
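The equivalence of these expressions is easy to confirm numerically; the sketch below evaluates four of them on a random pair of distributions, brute-forcing the maximum over all events A on a 5-point space.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
mu = rng.random(5); mu /= mu.sum()
nu = rng.random(5); nu /= nu.sum()

half_l1   = 0.5 * np.abs(mu - nu).sum()        # (1/2) sum |mu - nu|
pos_part  = np.clip(mu - nu, 0.0, None).sum()  # sum (mu - nu)_+
one_minus = 1.0 - np.minimum(mu, nu).sum()     # 1 - sum (mu AND nu)
# maximum of |mu(A) - nu(A)| over all 2^5 events A, by brute force
max_event = max(abs(mu[list(A)].sum() - nu[list(A)].sum())
                for r in range(6) for A in combinations(range(5), r))
```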

Let us now record a couple of basic but important properties of total variation distance.

Lemma 3 (Convexity and contraction).

1. The function (µ, ν) 7→ dtv (µ, ν) is convex on P(X ) × P(X ).

2. Any transition kernel P on X contracts dtv :

∀µ, ν ∈ P(X ), dtv (µP, νP ) ≤ dtv (µ, ν).

Proof. The first claim follows from the observation that (µ, ν) 7→ µ(A) − ν(A) is trivially
convex for each A ⊆ X , and that any maximum of convex functions is convex. The second
follows from the last expression in Lemma 2, because if f ∈ [−1, 1]X , then so does P f .

We are now ready to introduce the main object of our study.

Definition 5 (Distance to equilibrium). The distance to equilibrium associated with an irreducible transition kernel P on X is the function DP : N → [0, 1] defined by

    DP(t) := max_{ν∈P(X )} dtv(νP^t, π),                           (14)

where π denotes the unique stationary distribution under P .

Remark 7 (Properties). Let us make four important comments about the function DP.

1. The first item in Lemma 3 ensures that the maximum over all distributions ν ∈ P(X ) in (14) can be reduced to a maximum over all extremal distributions (δx)x∈X :

    DP(t) = max_{x∈X} dtv(P^t(x, ·), π).

2. The second item in Lemma 3 ensures that the function DP is non-increasing.

3. The initial distance DP(0) is close to 1 if the state space X is large. Indeed,

    DP(0) = 1 − π⋆,  where  π⋆ := min_{x∈X} π(x) ≤ 1/|X |.

4. Theorem 1 asserts that DP(t) → 0 as t → ∞ whenever P is ergodic.
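These properties can be checked on any small ergodic kernel. The sketch below computes DP(t) via the reduction of item 1, for the lazy walk on a 3-cycle (an arbitrary choice):

```python
import numpy as np

def D(P, pi, t):
    """D_P(t) = max_x d_tv(P^t(x, .), pi), using item 1 of Remark 7."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(P.shape[0]))

# Lazy walk on the 3-cycle; its stationary law is uniform.
step = np.roll(np.eye(3), 1, axis=1)   # deterministic rotation kernel
P = (np.eye(3) + step) / 2
pi = np.full(3, 1 / 3)

vals = [D(P, pi, t) for t in range(12)]
# vals[0] = 1 - pi_star = 2/3 (item 3), and vals is non-increasing (item 2)
```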

1.3 Relaxation time and mixing time


We initiate our study of DP by establishing a fundamental sub-multiplicativity property, which shows that mixing cannot slow down once it has started: if s iterations suffice to reduce the distance to equilibrium to 1/4, then ks iterations suffice to reduce it to 2^{−k}.

Lemma 4 (Sub-multiplicativity). We have DP (t + s) ≤ 2DP (t)DP (s) for all s, t ∈ N.

Proof. Let Π denote the “idealized” transition kernel which mixes exactly in one step, i.e.

    Π(x, y) := π(y).                                               (15)

Then, it is immediate to check that Π^2 = PΠ = ΠP = Π, so that for all t ≥ 1,

    (P − Π)^t = Σ_{s=0}^{t} (−1)^s (t choose s) Π^s P^{t−s} = P^t − Π.

Thus, writing ∥A∥ := max_{x∈X} Σ_{y∈X} |A(x, y)|, we arrive at the important representation

    2DP(t) = ∥(P − Π)^t∥,                                          (16)

for all t ≥ 1. The desired claim now follows from the sub-multiplicativity property

    ∥AB∥ ≤ ∥A∥∥B∥,                                                 (17)

which is easily checked to hold for all matrices A, B ∈ R^{X ×X }.

Lemma 4 asserts that the non-negative sequence (ut)t∈N defined by ut := 2DP(t) is sub-multiplicative, i.e. ut+s ≤ ut us for all t, s ∈ N. By Fekete’s Lemma, this implies

    ut^{1/t} −−−→ inf{us^{1/s} : s ≥ 1} as t → ∞.                  (18)

Note that the infimum is less than 1 if P is ergodic, because we then have DP (t) < 1/2 for
t large enough. This establishes the following notable refinement of Theorem 1.

Corollary 1 (Geometric decay). If P is ergodic, then there is λ⋆(P) < 1 such that

    (DP(t))^{1/t} −−−→ λ⋆(P) as t → ∞.

Remark 8 (Spectral radius). In Chapter 3, we will show that the fundamental quantity

    λ⋆(P) = inf{(2DP(t))^{1/t} : t ≥ 1}                            (19)

admits a simple and beautiful characterization in terms of the eigenvalues of P . For this reason, λ⋆(P) is often called the spectral radius of the chain.

At first sight, Corollary 1 seems to bring a definitive answer to Question 1: the distance to
equilibrium DP (t) decays essentially like (λ⋆ (P ))t as t → ∞. It is tempting to deduce that
the chain is well mixed when the number of iterations is larger than the relaxation time
    trel(P) := 1 / log(1/λ⋆(P)),                                   (20)

defined so that λ⋆(P) = exp(−1/trel(P)) (our definition differs slightly from the classical one, which we find less natural in discrete time). In fact, this intuition is wrong, and one often needs to wait much longer than the relaxation time before the chain even starts to mix. The reason is that the approximation

    DP(t) ≈ (λ⋆(P))^t = exp(−t/trel(P))                            (21)

promised by Corollary 1 is only valid asymptotically, in a regime where t is so large that the chain is already infinitesimally close to equilibrium. Thus, the relaxation time does not answer the real question: how long do we need to wait before the distance to equilibrium becomes small? This naturally leads to the following definition, illustrated on Figure 3.

001100101100010110010111011...

Figure 2: The “sliding window” chain with n = 6.

Definition 6 (Mixing time). The mixing time of P with precision ε ∈ (0, 1) is

    tmix^{(ε)}(P) := min{t ∈ N : DP(t) ≤ ε}.

The default precision is ε = 1/4, in which case we write tmix(P) instead of tmix^{(1/4)}(P).

The value ε = 1/4 is standard, and sufficient in practice: any smaller precision can be achieved by just increasing tmix(P) by a universal factor. Indeed, the sub-multiplicativity of Lemma 4 implies

    ∀ε ∈ (0, 1/4),   tmix(P) ≤ tmix^{(ε)}(P) ≤ tmix(P) ⌈log2(1/ε)⌉.     (22)

Thus, the fundamental quantity tmix(P) provides a rigorous formalization of Question 1. Understanding its dependency on the underlying kernel P is a fascinating task, to which the present course is devoted. By (19), the relaxation time always provides the lower bound

    tmix^{(ε)}(P) ≥ trel(P) log(1/(2ε)),                           (23)

for every ε ∈ (0, 1/2), but the latter can be terribly poor, as we shall see. Let us illustrate these concepts on a toy example which is not exciting, but is simple enough to allow for explicit computations.

Figure 3: Distance to equilibrium and mixing times.

Example 2 (Sliding window). On X = {0, 1}^n, consider the transition kernel

    P(x, y) = 1/2 if (y1, . . . , yn−1) = (x2, . . . , xn), and 0 otherwise,

which describes the content of a window of length n “sliding” along an infinite sequence of independent fair coin tosses (see Figure 2). Clearly, P is ergodic with π = Unif(X ). For any x = (x1, . . . , xn) ∈ X and t ≤ n, we have P^t(x, ·) = Unif(Ax,t), where

    Ax,t := {y ∈ {0, 1}^n : (y1, y2, . . . , yn−t) = (xt+1, . . . , xn)}.

Since |Ax,t| = 2^t, one finds DP(t) = 1 − (1/2)^{n−t}. In other words, for all ε ∈ (0, 1),

    tmix^{(ε)}(P) = n − ⌊log2(1/(1 − ε))⌋.

In particular, DP(t) = 0 for all t ≥ n. This forces λ⋆(P) = 0, hence trel(P) = 0: the relaxation time here drastically underestimates the time it takes for the chain to mix.
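The closed formula for DP(t) can be confirmed by brute force for a small window (here n = 6, matching Figure 2; bit-strings are packed into integers):

```python
import numpy as np

n = 6
N = 1 << n

# Sliding-window kernel on {0,1}^n: shift the window left, append a fair bit.
P = np.zeros((N, N))
for x in range(N):
    for b in (0, 1):
        P[x, ((x << 1) & (N - 1)) | b] += 0.5

pi = np.full(N, 1.0 / N)

def D(t):
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(N))

# Example 2 predicts D_P(t) = 1 - (1/2)^(n-t) for t <= n, and 0 afterwards.
vals = [D(t) for t in range(n + 2)]
```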

1.4 The cutoff phenomenon
From a practical point-of-view, estimating mixing times is particularly meaningful when
the number of states becomes large. Rather than studying a fixed Markov chain, we will
thus consider a sequence of transition kernels (Pn )n≥1 whose dimensions grow with n, and
investigate the order of magnitude of tmix (Pn ) as n → ∞. Recall that the particular choice
ε = 1/4 in the definition of tmix is irrelevant: for any fixed ε ∈ (0, 1/4),

    tmix^{(ε)}(Pn) = Θ(tmix(Pn)),                                  (24)

where the notation an = Θ(bn ) means that the sequence (an /bn )n≥1 is bounded from above
and below by strictly positive constants. For many chains, we will see that an even stronger
insensitivity holds: the dependency in ε disappears completely, in the following sense.

Definition 7 (Cutoff). The sequence (Pn)n≥1 exhibits a cutoff phenomenon if

    ∀ε ∈ (0, 1),   tmix^{(ε)}(Pn) ∼ tmix(Pn),                      (25)

where the notation an ∼ bn means that an/bn → 1 as n → ∞.

In words, the number of iterations required to even slightly mix (say, ε = 0.99) is asymptotically the same as that needed to completely mix (say, ε = 0.01), at least to first order. Equivalently, the associated distance to equilibrium DPn undergoes a sharp phase transition, dropping abruptly from 1 to 0 around some appropriate time-scale (tn)n≥1, i.e.

    ∀α ∈ [0, ∞),   DPn(⌊αtn⌋) −−−→ 1 if α < 1, and 0 if α > 1, as n → ∞.     (26)

Note that this means that tmix^{(ε)}(Pn) ∼ tn for all ε ∈ (0, 1), hence the equivalence with Definition 7. For example, our computations for the sliding window of length n give

    tmix^{(ε)}(Pn) ∼ n,                                            (27)

for any fixed ε ∈ (0, 1), providing our first instance of cutoff. Discovered in the 80’s in the context of card shuffling, this remarkable phase transition has since then been established for a broad variety of chains, from random walks on certain groups to various interacting particle systems. However, the proofs of cutoff remain chain-specific, and finding a general explanation constitutes the most important open problem in the field.

Question 2. What is the exact mechanism underlying the cutoff phenomenon?

The following simple criterion was proposed as a possible answer in 2004.

Definition 8 (Product condition). (Pn )n≥1 satisfies the product condition if

tmix (Pn ) ≫ trel (Pn ),

where the notation an ≫ bn means that bn /an → 0 as n → ∞.

This condition is easy to verify in practice, because it only involves a comparison of orders of magnitude, whereas Definition 7 requires determining the precise prefactor in front of mixing times. It is easy to see that the product condition is necessary for cutoff:

Lemma 5 (No cutoff without the product condition). Any sequence of transition kernels
(Pn )n≥1 which exhibits cutoff must satisfy the product condition.

Proof. Suppose that (Pn)n≥1 exhibits cutoff, and fix ε ∈ (0, 1/2). We then have tmix(Pn) ∼ tmix^{(ε)}(Pn) as n → ∞. On the other hand, the lower bound (23) implies

    tmix^{(ε)}(Pn) ≥ trel(Pn) log(1/(2ε)).

Combining those two estimates, we see that

    lim sup_{n→∞} trel(Pn)/tmix(Pn) ≤ 1/log(1/(2ε)),

and the right-hand side can be made arbitrarily small by choosing ε small.

Unfortunately, the converse statement – which is the one that would have been useful in
practice – does not hold in general. Even worse, any sequence (Pn )n≥1 exhibiting cutoff can
be perturbed so as to produce a counter-example, as we now explain. Given an ergodic
transition kernel P with stationary law π on a finite state space X , define

Q := (1 − θ)P + θΠ, (28)

where Π is the rank-one transition kernel defined at (15), and θ ∈ (0, 1) a parameter to be
adjusted later. Note that Q is ergodic, with stationary law π. The interpretation of Q is
simple: at each step, a biased coin with parameter θ is tossed: if it lands on tails, the next
state is chosen according to P ; otherwise, it is chosen according to the stationary law π.

Lemma 6 (Rank-one perturbations destroy cutoff). Let (Pn)n≥1 be a sequence of ergodic transition kernels exhibiting cutoff, and choose (θn)n≥1 in (0, 1) so that

    1/tmix(Pn) ≪ θn ≪ 1/trel(Pn).

Then, the sequence (Qn)n≥1 defined by the rank-one perturbation (28) satisfies

    trel(Qn) ∼ trel(Pn),   and   tmix^{(ε)}(Qn) ∼ (1/θn) log(1/ε).

In particular, (Qn)n≥1 satisfies the product condition, but fails to exhibit cutoff.

Proof. The impact of the rank-one perturbation (28) on the distance to equilibrium is easy to determine: we have Q − Π = (1 − θ)(P − Π), so that (16) yields

    ∀t ≥ 0,   DQ(t) = (1 − θ)^t DP(t).

Recalling Corollary 1, we deduce that λ⋆(Q) = (1 − θ)λ⋆(P), i.e.

    trel(Q) = trel(P) / (1 − trel(P) log(1 − θ)).

Specializing these general identities to P = Pn and θ = θn easily leads to the claim.
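The key identity DQ(t) = (1 − θ)^t DP(t) is exact and can be observed numerically; the sketch below uses the lazy walk on a 4-cycle with θ = 0.1, both arbitrary choices.

```python
import numpy as np

def D(K, pi, t):
    """Distance to equilibrium D_K(t), via the reduction of Remark 7."""
    Kt = np.linalg.matrix_power(K, t)
    return max(0.5 * np.abs(Kt[x] - pi).sum() for x in range(K.shape[0]))

# Lazy walk on the 4-cycle, and its rank-one perturbation (28).
P = (np.eye(4) + np.roll(np.eye(4), 1, axis=1)) / 2
pi = np.full(4, 0.25)
theta = 0.1
Pi = np.tile(pi, (4, 1))              # the idealized kernel (15)
Q = (1 - theta) * P + theta * Pi

lhs = [D(Q, pi, t) for t in range(1, 8)]
rhs = [(1 - theta) ** t * D(P, pi, t) for t in range(1, 8)]
```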

Note that when n is large, the entries of the perturbed matrix Qn are extremely close to
those of Pn . Yet, (Pn )n≥1 satisfies cutoff, whereas (Qn )n≥1 does not: cutoff is a delicate and
sensitive phenomenon, which cannot be captured by a basic criterion such as the product
condition. Nevertheless, chains that satisfy the product condition without exhibiting cutoff
are regarded as pathological by the community, and an informal conjecture states that the
product condition should correctly predict cutoff for all reasonable chains. Giving an honest
mathematical content to this vague claim constitutes a natural open problem, which can be
seen as a first step towards Question 2.

Question 3 (Reasonable?). For which chains does the product condition imply cutoff?

Known answers include birth and death chains and random walks on trees. However, the
above construction shows that the product condition will incorrectly predict cutoff for many
chains, including certain random walks on groups as defined next.

1.5 Random walks on graphs and groups
Among the various chains that will be considered in this course, particular attention is devoted to random walks on groups and graphs, which play an important role in many modern applications, from card shuffling to page-rank algorithms or the exploration of complex networks. Let us here briefly recall how these random walks are defined.

Definition 9 (Group). A group is a pair (X , ⋆), where X is a set and ⋆ a binary operation on X satisfying the following axioms:

(i) There is an identity element id ∈ X satisfying x ⋆ id = id ⋆ x = x for all x ∈ X .

(ii) Every element x ∈ X admits an inverse x^{−1} ∈ X such that x ⋆ x^{−1} = x^{−1} ⋆ x = id.

(iii) The operation ⋆ is associative, i.e. (x ⋆ y) ⋆ z = x ⋆ (y ⋆ z) for all (x, y, z) ∈ X^3.

Here are three classical examples of finite groups to which we shall come back later:

• The cyclic group (Zn , +) of integers modulo n, equipped with addition modulo n.

• The hypercube (Z2^n, +) of binary vectors of length n, equipped with addition mod 2.

• The symmetric group (Sn , ◦) of permutations of [n], equipped with composition.

Given a finite group (X , ⋆) and a probability distribution µ ∈ P(X ), let (Zt)t≥1 be i.i.d. random variables with law µ, and consider the process X = (Xt)t≥0 defined by

    Xt := Zt ⋆ · · · ⋆ Z1,                                         (29)

with the usual convention that an empty product is the identity. Then X is a Markov chain
on X , called the random walk on (X , ⋆) with increment law µ. Its transition kernel is

    ∀x, y ∈ X ,   P(x, y) := µ(y ⋆ x^{−1}).

This matrix is bi-stochastic, meaning that its transpose is also stochastic. In other words,
the uniform distribution π = Unif(X ) is stationary for P . Note that P is irreducible if
and only if the incremental support supp(µ) := {x ∈ X : µ(x) > 0} generates (X , ⋆), in
the sense that any group element x can be written as x = xt ⋆ · · · ⋆ x1 for some t ∈ N and
some x1 , . . . , xt ∈ supp(µ). A pleasant feature of random walks on groups is their intrinsic
symmetry. In particular, the choice of the initial state is irrelevant.

Lemma 7 (Symmetry). For random walks on groups, we have for all x ∈ X and t ∈ N,

    dtv(P^t(x, ·), π) = dtv(P^t(id, ·), π).

Proof. By induction over t ∈ N, we have P^t(x, y) = P^t(id, y ⋆ x^{−1}) for all x, y ∈ X . Thus,

    dtv(P^t(x, ·), π) = (1/2) Σ_{y∈X} |P^t(id, y ⋆ x^{−1}) − 1/|X ||
                      = (1/2) Σ_{y∈X} |P^t(id, y) − 1/|X ||
                      = dtv(P^t(id, ·), π),

where the second equality uses the bijective change of variables y 7→ y ⋆ x.

Definition 10 (Graph). A (finite, simple) graph is a pair G = (X , E), where

• X is a finite set whose elements are called vertices;

• E is a set of unordered pairs of vertices {x, y} called edges.

Two vertices x, y ∈ X such that {x, y} ∈ E are said to be neighbors, and the number of
neighbors of x is called its degree, denoted by deg(x).

If a graph G = (X , E) has all degrees at least 1, we may define a transition kernel on X by

    P(x, y) := 1/deg(x) if {x, y} ∈ E, and 0 otherwise.

The corresponding Markov chain is called simple random walk on G. It describes the evolution of a walker which, at each step, jumps from the current vertex to a uniformly chosen neighbor. P is irreducible if and only if G is connected, meaning that one can go from any vertex to any other by traversing a sequence of edges. Note that the formula

    ∀x ∈ X ,   π(x) := deg(x)/(2|E|),

defines a probability vector on X , which satisfies the detailed balance equation

    ∀(x, y) ∈ X^2,   π(x)P(x, y) = π(y)P(y, x).                    (30)

By summation over all x ∈ X , this implies that π is stationary for P .
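Both the detailed balance equation (30) and stationarity are one-line checks for any concrete graph; here is a sketch on a small, arbitrarily chosen connected graph with unequal degrees.

```python
import numpy as np

# A small connected graph: a path attached to a triangle (arbitrary example).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 4)]
n = 5
A = np.zeros((n, n))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

deg = A.sum(axis=1)
P = A / deg[:, None]             # simple random walk kernel
pi = deg / (2 * len(edges))      # candidate stationary law deg(x) / 2|E|

F = pi[:, None] * P              # F[x, y] = pi(x) P(x, y)
detailed_balance = np.allclose(F, F.T)   # equation (30): F is symmetric
is_stationary = np.allclose(pi @ P, pi)
```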

2 Probabilistic techniques
In this chapter, we introduce two simple but powerful probabilistic tools to estimate mixing
times: distinguishing statistics (to obtain lower bounds), and couplings (to obtain upper
bounds). These important techniques are then applied to obtain sharp estimates for the
random walk on the cycle and the Ehrenfest urn model.

2.1 Distinguishing statistics


In practice, it is often easier to obtain lower bounds on mixing times than upper bounds. The reason is that the distance to equilibrium is defined as a (double) maximum:

    DP(t) = max_{x∈X} max_{A⊆X} |P^t(x, A) − π(A)|.

It readily follows from this definition that any choice of an initial state x ∈ X and a target
event A ⊆ X provides a lower bound on the distance to equilibrium. This trivial observation
will be used so often that it deserves a lemma for better visibility.

Lemma 8 (Distinguishing event). The lower bound

    DP(t) ≥ |P^t(x, A) − π(A)|,

holds for any time t ≥ 0, any initial state x ∈ X and any event A ⊆ X .

To get a good bound, the pair (x, A) should of course be chosen so that P t (x, A) is abnormally
large or small compared to the equilibrium value π(A). In practice, good candidates for
(x, A) are easily guessed, but estimating P t (x, A)−π(A) can be difficult. A simple alternative
consists in computing the first and second moment of an observable f : X → R that behaves
very abnormally when the chain is far from equilibrium. This formalizes as follows.

Lemma 9 (Distinguishing statistics). For any µ, ν ∈ P(X ) and f : X → R, we have

    dtv(µ, ν) ≥ δ^2 / (δ^2 + σ^2),

where δ = |µf − νf| and σ^2 = 2Varµ(f) + 2Varν(f).

Proof. Since the right-hand side is invariant under translating f by a constant, we may
assume that µf + νf = 0, so that (µf )2 = (νf )2 = δ 2 /4. By Cauchy-Schwarz, we have

δ 2 = ( Σ_{x∈X} (µ(x) − ν(x)) f (x) )2
    ≤ ( Σ_{x∈X} (µ(x) + ν(x)) f 2 (x) ) × ( Σ_{x∈X} (µ(x) − ν(x))2 / (µ(x) + ν(x)) ).

The first sum is exactly equal to (σ 2 + δ 2 )/2, while the second is at most 2dtv (µ, ν) by the
crude bound |µ(x) − ν(x)| ≤ µ(x) + ν(x). Rearranging yields the claim.
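As a quick numerical sanity check (my own toy example, not from the text), the bound of Lemma 9 can be evaluated against the exact total variation distance on a small state space:

```python
import numpy as np

# Two arbitrary probability vectors on a 6-point space, and the observable f(x) = x.
mu = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])
nu = np.array([0.05, 0.10, 0.10, 0.20, 0.25, 0.30])
f = np.arange(6.0)

# Exact total variation distance.
dtv = 0.5 * np.abs(mu - nu).sum()

# First and second moments of f under each measure.
mean_mu, mean_nu = mu @ f, nu @ f
var_mu = mu @ f**2 - mean_mu**2
var_nu = nu @ f**2 - mean_nu**2

# The lower bound of Lemma 9: dtv >= delta^2 / (delta^2 + sigma^2).
delta2 = (mean_mu - mean_nu) ** 2
sigma2 = 2 * var_mu + 2 * var_nu
lower_bound = delta2 / (delta2 + sigma2)
```

Here dtv equals 0.5 while the bound evaluates to roughly 0.27, consistent with (and strictly below) the true distance, as Lemma 9 guarantees.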

Remark 9 (Concentration). The above bound says that µ and ν are far apart if the ratio
σ 2 /δ 2 is small. The intuition is as follows: under the measure µ, the observable f
typically takes values in Iµ := [µ(f ) − 10 √Varµ (f ), µ(f ) + 10 √Varµ (f )] by Chebyshev's
inequality, and similarly for ν. When σ 2 /δ 2 is small, the two intervals Iµ and Iν are
disjoint, so A := f −1 (Iµ ) forms a distinguishing event showing that dtv (µ, ν) is large.

Remark 10 (Complex values). The proof readily extends to the case of a complex-valued
observable f : X → C, with Varµ (f ) := µ (|f − µf |2 ) = µ|f |2 − |µf |2 .

We will illustrate the strength of those generic lower bounds on various concrete Markov
chains once we have a complementary technique to obtain matching upper bounds.

2.2 Couplings
In probability theory, coupling is a very general technique which allows one to compare two
given distributions µ and ν by constructing a pair of random variables (X, Y ) whose marginal
distributions are µ and ν. Every such pair provides a particular relation between µ and ν,
and the whole idea is to choose a relation that sheds useful light on µ and ν.

Definition 11 (Coupling). A coupling of two probability measures µ, ν is a pair (X, Y )
of random variables defined on the same probability space, such that X ∼ µ and Y ∼ ν.
Of course, we can always take X and Y to be independent with respective marginals µ and
ν, but this is usually not the most interesting choice: the above definition allows for X and
Y to be entangled in an arbitrary way, and this degree of freedom can often be exploited to
obtain non-trivial information about the pair (µ, ν). The following lemma is a simple but
fruitful illustration of this general philosophy.

Lemma 10 (Estimating total variation distance by coupling). Fix µ, ν ∈ P(X ). Then,

dtv (µ, ν) ≤ P(X ̸= Y ),

for any coupling (X, Y ) of µ and ν. Moreover, there is a coupling which achieves equality.

Proof. If (X, Y ) is a coupling of µ and ν, then for any set A ⊆ X we can write

µ(A) − ν(A) = P(X ∈ A) − P(Y ∈ A)
            ≤ P(X ∈ A) − P(X ∈ A, Y ∈ A)
            = P(X ∈ A, Y ∉ A)
            ≤ P(X ̸= Y ).

Taking a maximum over all A ⊆ X establishes the first claim. Conversely, let us construct a
coupling (X, Y ) which achieves equality. The extreme cases dtv (µ, ν) = 0 and dtv (µ, ν) = 1
are trivial, so we leave them aside. By Lemma 2, we have dtv (µ, ν) = 1 − p, where
p := Σ_{x∈X} µ(x) ∧ ν(x).

We thus want to construct a coupling (X, Y ) such that P(X = Y ) = p. To do so, we define
(X, Y ) := { (Z, Z)     if B = 1
           { (X̂, Ŷ )    if B = 0,

where B, Z, X̂, Ŷ are independent, with B ∼ Bernoulli(p) and for all x ∈ X ,

P(Z = x) = (µ(x) ∧ ν(x)) / p,
P(X̂ = x) = (µ(x) − ν(x))+ / (1 − p),
P(Ŷ = x) = (ν(x) − µ(x))+ / (1 − p).
It is immediate to check that (X, Y ) is a coupling of µ and ν such that P(X = Y ) = p.
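The construction above can be sketched in code: with two arbitrary measures of my own choosing on three points, the joint law of (X, Y ) has the right marginals and puts mass exactly p = 1 − dtv (µ, ν) on the diagonal:

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

p = np.minimum(mu, nu).sum()           # overlap mass: P(B = 1)
dtv = 0.5 * np.abs(mu - nu).sum()

# Laws of the independent residuals X-hat and Y-hat (disjoint supports).
law_Xhat = np.clip(mu - nu, 0, None) / (1 - p)
law_Yhat = np.clip(nu - mu, 0, None) / (1 - p)

# Joint law of (X, Y): diagonal part (Z, Z) plus off-diagonal product part.
joint = np.diag(np.minimum(mu, nu)) + (1 - p) * np.outer(law_Xhat, law_Yhat)

row_marginal = joint.sum(axis=1)       # should equal mu
col_marginal = joint.sum(axis=0)       # should equal nu
match_prob = np.trace(joint)           # P(X = Y)
```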

Remark 11 (Variational formula for total variation distance). Lemma 10 provides yet
another definition of total variation distance, to be added to the list of Lemma 2:

dtv (µ, ν) = min_{X∼µ, Y ∼ν} P(X ̸= Y ).

A considerable advantage of this new expression is the fact that it involves a minimum:
any coupling (X, Y ) of µ, ν provides an upper bound on dtv (µ, ν). The more likely
the event {X = Y } is, the better the bound will be, and the existence of a coupling
achieving equality guarantees that this strategy has no intrinsic limitation. In practice
however, estimating P(X = Y ) can become difficult if the coupling is too sophisticated,
and devising couplings that are both efficient and tractable is a delicate art.

In order to use Lemma 10 to estimate the mixing time of an ergodic kernel P , we need to
construct, for an appropriate time t ∈ N and for every initial state x ∈ X , a coupling between
P t (x, ·) and π which puts as much mass as possible on the diagonal set {(y, y) : y ∈ X }.
This strategy may seem difficult to implement at first sight, but we will now make a couple
of important observations that will considerably facilitate our task.

2.3 Coalescence times


A first useful observation is that we do not need to compare P t (x, ·) directly with the
equilibrium distribution π (which is sometimes very far from being explicit): it is enough to
compare P t (x, ·) with the more similar measure P t (y, ·), for all (x, y) ∈ X 2 .

Lemma 11 (Mixing means forgetting). For every t ∈ N, we have

(1/2) max_{(x,y)∈X 2} dtv ( P t (x, ·), P t (y, ·) ) ≤ DP (t) ≤ max_{(x,y)∈X 2} dtv ( P t (x, ·), P t (y, ·) ).

Proof. By stationarity, we have π = πP = πP 2 = · · · = πP t = Σ_{y∈X} π(y)P t (y, ·). Thus, π
is a convex combination of the measures {P t (y, ·) : y ∈ X }. Since dtv (·, ·) is convex (Lemma
3), we deduce that for any µ ∈ P(X ),

dtv (µ, π) ≤ max_{y∈X} dtv ( µ, P t (y, ·) ).

Choosing µ = P t (x, ·) and then maximizing over all x ∈ X leads to the claimed upper
bound. For the lower bound, we simply invoke the triangle inequality

dtv ( P t (x, ·), P t (y, ·) ) ≤ dtv ( P t (x, ·), π ) + dtv ( P t (y, ·), π ),

and then take a maximum over all initial conditions (x, y) ∈ X 2 .

We are thus naturally led to the problem of coupling P t (x, ·) and P t (y, ·), for arbitrary
x, y ∈ X and t ∈ N. Recall that the instantaneous measures P t (x, ·) and P t (y, ·) were
actually constructed sequentially, by iterating t times the map µ 7→ µP , starting from the
initial measures δx and δy , respectively. In light of this sequential structure, the most natural
way to couple P t (x, ·) and P t (y, ·) is to actually couple the entire chains, i.e. to produce a
random pair (X, Y) such that X ∼ MC(X , P, δx ) and Y ∼ MC(X , P, δy ). For each t ∈ N,
the random pair (Xt , Yt ) is then a coupling of P t (x, ·) and P t (y, ·), so we have

dtv ( P t (x, ·), P t (y, ·) ) ≤ P(Xt ̸= Yt ). (31)

Now, a very simple way to produce such a trajectorial coupling, simultaneously for all initial
pairs (x, y) ∈ X 2 , is to use what is known as a coupling kernel.

Definition 12 (Coupling kernel). A coupling kernel for P is a transition kernel Q on
the product space X × X , whose marginals agree with P in the following sense:

∀(x, y, x′ ) ∈ X 3 , Σ_{y′ ∈X} Q ((x, y), (x′ , y ′ )) = P (x, x′ ),
∀(x, y, y ′ ) ∈ X 3 , Σ_{x′ ∈X} Q ((x, y), (x′ , y ′ )) = P (y, y ′ ).

Any Markov chain (X, Y) on X × X whose transition kernel is of this form is called a
Markovian coupling for P , and the associated coalescence time is defined as

T := inf {t ≥ 0 : Xt = Yt } ,

with the usual convention inf ∅ = +∞.

With this terminology at hand, we can now state and prove the main result of this section,
which asserts that the convergence to equilibrium of a Markov chain is fast if trajectories
from different starting states can be coupled so as to meet quickly.

Theorem 2 (Coalescence and mixing). Consider an arbitrary coupling kernel Q for P ,
and let T denote the associated coalescence time. Then,

∀t ≥ 0, DP (t) ≤ max_{(x,y)∈X 2} P(x,y) (T > t),

where the notation P(x,y) indicates that the initial state is (x, y).

Proof. Let (X, Y) denote a Markov chain on X 2 with transition kernel Q. The fact that Q
is a coupling kernel ensures that for each t ≥ 1, the conditional laws of Xt and Yt given the
past (X0 , . . . , Xt−1 , Y0 , . . . , Yt−1 ) are P (Xt−1 , ·) and P (Yt−1 , ·), respectively. In particular, X and
Y are Markov chains on X with transition kernel P . Consequently, Lemma 10 yields

dtv ( P t (x, ·), P t (y, ·) ) ≤ P(x,y) (Xt ̸= Yt ), (32)

for every t ∈ N and every (x, y) ∈ X 2 . Now, let us make the additional assumption that the
diagonal set ∆ := {(z, z) : z ∈ X } is absorbing for the coupling kernel Q, in the sense that
∀x ∈ X , Σ_{y∈X} Q ((x, x), (y, y)) = 1. (33)

Then almost-surely, the trajectories X and Y coincide forever after coalescing, hence

P(x,y) (Xt ̸= Yt ) = P(x,y) (T > t).

Inserting this into (32) and then taking the maximum over all pairs (x, y) ∈ X 2 establishes
the claim (recall Lemma 11). Finally, note that if Q fails to satisfy the condition (33), then
we can always replace it with the new coupling kernel

Q̃ ((x, y), (x′ , y ′ )) := { Q ((x, y), (x′ , y ′ ))   if x ̸= y
                           { P (x, x′ )                if x = y and x′ = y ′
                           { 0                         if x = y and x′ ̸= y ′ .

Since Q̃ is a coupling kernel for P which satisfies (33), we obtain

∀t ≥ 0, DP (t) ≤ max_{(x,y)∈X 2} P(x,y) ( T̃ > t ),

where T̃ denotes the coalescence time under Q̃. But T̃ has the same law as T , because Q̃
and Q differ only on transitions that start from the diagonal set ∆.

Figure 4: The n−cycle with n = 15.

Remark 12 (Product kernel). The above result provides a simple proof of the conver-
gence to equilibrium of ergodic Markov chains (Theorem 1). Indeed, the product kernel

Q ((x, y), (x′ , y ′ )) := P (x, x′ ) × P (y, y ′ ),

corresponding to the independent evolution of X and Y, is obviously a coupling kernel


for P . Moreover, it is clear that the above formula tensorizes, in the sense that

Qt ((x, y), (x′ , y ′ )) = P t (x, x′ ) × P t (y, y ′ ),

for all t ≥ 0. In particular, Q is ergodic (since so is P ), and this ensures that

∀(x, y) ∈ X 2 , P(x,y) (T < ∞) = 1.

Consequently, Theorem 2 gives DP (t) → 0 as t → ∞, as desired.
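Theorem 2 can be tested numerically. The sketch below (with an arbitrary 3-state ergodic kernel of my own choosing) builds the product coupling kernel, makes the diagonal absorbing as in the proof, and checks that the coalescence tail bounds the exact distance to equilibrium:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])           # arbitrary ergodic kernel
N = P.shape[0]
pi = np.linalg.matrix_power(P, 200)[0]    # stationary law, by iteration

# Product coupling kernel on pairs, with the diagonal set made absorbing.
Q = np.zeros((N * N, N * N))
for x in range(N):
    for y in range(N):
        if x == y:
            for z in range(N):
                Q[x * N + y, z * N + z] = P[x, z]   # move together after meeting
        else:
            Q[x * N + y] = np.outer(P[x], P[y]).ravel()

t = 10
Pt = np.linalg.matrix_power(P, t)
Qt = np.linalg.matrix_power(Q, t)
DP = max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(N))
off_diag = [i * N + j for i in range(N) for j in range(N) if i != j]
tail = max(Qt[s, off_diag].sum() for s in range(N * N))   # max P_(x,y)(T > t)
```

Since the diagonal is absorbing, the mass left on off-diagonal pairs at time t is exactly the probability that coalescence has not yet occurred.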

2.4 Application: random walk on the cycle


As a first pedagogic illustration, let us consider the lazy random walk on the n−cycle. The
state space is X = Z/nZ, and the transition kernel is

P (x, y) = { 1/2   if y = x
           { 1/4   if y ∈ {x − 1, x + 1}
           { 0     else.

This is the transition kernel of the lazy random walk on the n−cycle graph. It is also a
random walk on the cyclic group (Z/nZ, +), with increment law µ = (1/2) δ0 + (1/4) δ1 + (1/4) δ−1 . We

will show that the mixing time of this random walk grows quadratically with n.

Proposition 1 (Mixing time of the n−cycle). For the lazy random walk on the n−cycle,

n2 /32 ≤ tmix (P ) ≤ n2 .

Upper bound: first attempt. The most natural way to construct a chain X with tran-
sition kernel P starting from a given state x ∈ X is to set X0 := x and for all t ≥ 1,

Xt := Xt−1 + ξt mod n,

where (ξt )t≥1 are i.i.d. with P(ξ1 = 0) = 1/2 and P(ξ1 = 1) = P(ξ1 = −1) = 1/4. In light of this,
a naive way to couple X with a chain Y starting from another state y ∈ X consists in using
the same increments for both chains, i.e. setting Y0 := y and for all t ≥ 1,

Yt := Yt−1 + ξt mod n. (34)

It is then clear that (X, Y) is a Markovian coupling for P , with coupling kernel

Q ((x, y), (x′ , y ′ )) = { 1/2   if (x′ , y ′ ) = (x, y)
                          { 1/4   if (x′ , y ′ ) ∈ {(x + 1, y + 1), (x − 1, y − 1)}
                          { 0     else.
Unfortunately, this coupling is not smart at all: the difference Xt − Yt is preserved at each
step, so the coalescence time T is a.-s. infinite, and Theorem 2 only yields DP (t) ≤ 1!

Upper bound: second attempt. In order to “favor encounter”, one could try to let the
two trajectories move in opposite directions, i.e. replace (34) with

Yt := Yt−1 − ξt mod n.

Note that this remains a valid coupling, because the increment sequence (−ξt )t≥1 has the
same law as (ξt )t≥1 . The corresponding coupling kernel is then

Q ((x, y), (x′ , y ′ )) = { 1/2   if (x′ , y ′ ) = (x, y)
                          { 1/4   if (x′ , y ′ ) ∈ {(x + 1, y − 1), (x − 1, y + 1)}
                          { 0     else.
Unfortunately, this choice is also problematic: when n is even and x − y is odd, the difference
Xt − Yt remains odd at each iteration, so that T = ∞ a.-s. again!

Upper bound: third attempt. A solution to this parity issue consists in letting only
one of the two coordinates jump at each step, as dictated by the following coupling kernel:
Q ((x, y), (x′ , y ′ )) = { 1/4   if |x′ − x| + |y ′ − y| = 1
                          { 0     else.

The sequence of differences (Xt − Yt )t≥0 is then a simple random walk on Z/nZ starting
from x − y, and the coalescence time T is precisely the time it takes for this walk to hit 0.
Equivalently, T is the hitting time of the set {0, n} by a simple random walk (Wt )t≥0 on Z
starting from W0 = |x − y|. The expectation of T is easily computed, for example by Doob’s
optional stopping theorem applied to the martingales (Wt )t≥0 and (Wt2 − t)t≥0 :

E[WT ] = E[W0 ] = |x − y| ;
E[WT2 − T ] = E[W02 − 0] = |x − y|2 .

Since the random variable WT is {0, n}−valued, we have WT2 = nWT and it follows that

E[T ] = (n − |x − y|) |x − y| ≤ n2 /4.
Applying Theorem 2 and Markov's inequality, we conclude that DP (t) ≤ n2 /(4t), or equivalently,

tmix (ε) (P ) ≤ ⌈ n2 /(4ε) ⌉.

This proves the upper bound in Proposition 1. We now turn to the lower bound.
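As an aside, the formula E[T ] = (n − |x − y|)|x − y| can be checked by simulating the third coupling; n = 12 and the initial distance 3 are arbitrary test values:

```python
import random

def coalescence_time(n, x, y, rng):
    """Run the coupling in which exactly one walker moves by +/-1 at each step."""
    t = 0
    while x != y:
        step = rng.choice((-1, 1))
        if rng.random() < 0.5:
            x = (x + step) % n          # X moves, Y stays
        else:
            y = (y + step) % n          # Y moves, X stays
        t += 1
    return t

rng = random.Random(0)
n, d, trials = 12, 3, 40000
avg = sum(coalescence_time(n, 0, d, rng) for _ in range(trials)) / trials
expected = d * (n - d)                  # E[T] from the optional stopping argument
```

The empirical mean of the coalescence time concentrates around d(n − d) = 27 for these parameters.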

Lower bound. Intuitively, when the number t of iterations is too small, the random walk
Xt is confined in a small interval around its starting point, and hence can not be mixed. To
formalize this, we use Lemma 8 with the starting state x = 0 and the distinguishing event
A = [n/4, 3n/4] ∩ N. We have π(A) = |A|/|X | ≥ 1/2 and, by Chebyshev's inequality,

P t (x, A) ≤ P ( |ξ1 + · · · + ξt | ≥ n/4 ) ≤ 8t/n2 ,
where we have used the fact that ξ1 , . . . , ξt are i.i.d. with mean 0 and variance 1/2. For
t < n2 /32 the right-hand side is less than 1/4, and Lemma 8 yields DP (t) > 1/4, as desired.

2.5 Application: Ehrenfest model
Introduced by Tatiana and Paul Ehrenfest to explain the second law of thermodynamics, the
Ehrenfest urn is a very simple model for the diffusion of gas molecules. Consider n labelled
particles evolving among two neighboring containers as follows: at each time step, a particle
is chosen uniformly at random and jumps from its current container to the other one. If one
starts with, say, all particles in one container, how long will it take for the gas to equilibrate?
We can describe the state of the system by a binary vector x = (x1 , . . . , xn ), where the
variable xi ∈ {0, 1} indicates the container in which the i−th particle currently lies. The
above diffusion is then a Markov chain with state space X := {0, 1}n and transition kernel
P̃ (x, y) := { 1/n   if x, y differ in exactly one coordinate
            { 0     else.

This is the transition kernel of simple random walk on a well-known graph, namely the
n−dimensional hypercube. Alternatively, P̃ is the transition kernel of the random walk on
the binary group F_2^n with increments being uniform on the set of vectors having exactly one
non-zero coordinate. To avoid periodicity issues, we consider the lazy kernel P = (I + P̃ )/2.

Proposition 2 (Mixing time of the lazy Ehrenfest urns). For any ε ∈ (0, 1), we have

((1 − o(1))/2) n log n ≤ tmix (ε) (Pn ) ≤ (1 + o(1)) n log n.

Proof. Given an initial state x ∈ X , a convenient sequential construction of the trajectory


X ∼ MC(X , P, δx ) consists in setting X0 = x and then, for all t ≥ 1,

Xt := F (Xt−1 , It , Bt ),

where F (x, i, b) := (x1 , . . . , xi−1 , b, xi+1 , . . . , xn ), and where the random variables It , Bt , t ≥ 1
are independent with Bt ∼ Bernoulli(1/2) and It ∼ Unif({1, . . . , n}). Given another initial
state y ∈ X , one can then naturally couple Y ∼ MC(X , P, δy ) with X by using the same
update variables (It , Bt )t≥1 for both trajectories, i.e. by setting Y0 = y and for t ≥ 1,

Yt := F (Yt−1 , It , Bt ).

The pair (X, Y) is then a Markovian coupling for P . By construction, at any time t ≥ 0,
the two random vectors Xt and Yt agree on the set of coordinates {I1 , . . . , It }. Consequently,

the coalescence time T is at most the time it takes for all coordinates to have been hit:

T ≤ inf {t ≥ 0 : {I1 , . . . , It } = {1, . . . , n}} .

(In fact, there is even equality when x and y are antipodal.) Consequently, we have

DP (t) ≤ P ( {I1 , . . . , It } ̸= {1, . . . , n} )
       ≤ Σ_{i=1}^{n} P ( i ∉ {I1 , . . . , It } )
       = n (1 − 1/n)^t
       ≤ n e^{−t/n} ,

where we have successively used the union bound, the fact that I1 , . . . , It are independent
and uniform on {1, . . . , n}, and the convexity inequality 1 + u ≤ eu , valid for all u ∈ R.
The upper bound readily follows from this. For the lower bound, we apply the method of
distinguishing statistics (Lemma 9) to the observable

f : x 7→ x1 + · · · + xn ,

which counts the number of coordinates equal to 1. Under the equilibrium law π, the
coordinates are independent Bernoulli variables with parameter 1/2, so we have
πf = n/2,     Varπ (f ) = n/4.
On the other hand, after t iterations starting from x = (0, . . . , 0), the coordinates are easily
seen to be negatively correlated Bernoulli variables with parameter (1 − (1 − 1/n)^t )/2, so

µf = (n/2) (1 − (1 − 1/n)^t ),     Varµ (f ) ≤ n/4,
where µ = P t (0, ·). Thus, Lemma 9 yields

DP (t) ≥ 1 − 1 / ( 1 + (n/4)(1 − 1/n)^{2t} ),

from which the lower bound readily follows.
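Both bounds from this proof are elementary to evaluate. The following sketch (with the arbitrary choice n = 10000) confirms that the walk is far from equilibrium well before (n log n)/2 steps and essentially mixed after 2n log n steps:

```python
import math

def coupling_upper(n, t):
    """Upper bound DP(t) <= n exp(-t/n) from the coordinate coupling."""
    return n * math.exp(-t / n)

def statistic_lower(n, t):
    """Lower bound DP(t) >= 1 - 1/(1 + (n/4)(1 - 1/n)^(2t)) from Lemma 9."""
    return 1 - 1 / (1 + (n / 4) * (1 - 1 / n) ** (2 * t))

n = 10_000
t_early = int(0.25 * n * math.log(n))   # well below (n log n)/2
t_late = int(2.0 * n * math.log(n))     # well above n log n

early = statistic_lower(n, t_early)     # close to 1: not mixed yet
late = coupling_upper(n, t_late)        # close to 0: essentially mixed
```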

Figure 5: Eigenvalues (in red) and spectral radius (in blue) of a typical transition kernel.

3 Spectral techniques
This chapter investigates the eigenvalues and eigenvectors of transition kernels, and their
relation to mixing times. This relation is particularly deep for reversible chains, to which we
devote particular attention. To illustrate the strength of spectral techniques, we revisit two
models that were introduced in the previous chapter and obtain considerably refined results
on their convergence to equilibrium: we establish a limiting profile for random walk on the
cycle, and we prove the cutoff phenomenon for random walk on the hypercube.

3.1 Spectral radius
We have seen earlier (Corollary 1) that the convergence to equilibrium of Markov chains
occurs exponentially fast: more precisely, for any ergodic kernel P , the limit
λ⋆ (P ) := lim_{t→∞} ( DP (t) )^{1/t}

exists and is strictly less than 1. We will now see that this quantity admits a remarkable
spectral characterization. Let us first recall some terminology. An eigenvalue of P is a root
of the characteristic polynomial λ 7→ det(P −λI), i.e. a number λ ∈ C such that the equation

P f = λf (35)

admits a non-trivial solution f : X → C. Any such solution f is called an eigenfunction
associated with the eigenvalue λ, and the pair (λ, f ) an eigenpair of P . Since the characteristic
polynomial has degree N := |X |, P has precisely N eigenvalues, counted with algebraic
multiplicities. The set of eigenvalues is called the spectrum of P , and denoted by Sp(P ).
For concreteness, Figure 5 shows a plot of the spectrum of a typical transition kernel. Here
are a few elementary properties of the spectrum of any transition kernel.

Lemma 12 (Eigenvalues). Let P be any transition kernel on X . Then,

(i) 1 ∈ Sp(P ).

(ii) Sp(P ) ⊆ {λ ∈ C : |λ| ≤ 1}.

(iii) If P is ergodic, then 1 is the only eigenvalue on the unit circle.

Proof. The constant function f = 1 solves the harmonic equation P f = f , proving the first
claim. The second follows from the fact that P contracts the ∥·∥∞ norm: for any f : X → R,

∥P f ∥∞ ≤ ∥f ∥∞ .

Finally, consider an eigenpair (λ, f ) with |λ| = 1. If P is ergodic, we can choose t ∈ N so
that the number α := min_{x∈X} P t (x, x) is strictly positive. But then, the matrix

Q := (P t − αI) / (1 − α)

is a transition kernel, of which f is an eigenfunction with eigenvalue (λt − α)/(1 − α). Thus,
(ii) yields |λt − α| ≤ 1 − α. Since |λt | = 1, this forces λt = 1. Replacing t with t + 1 gives λ = 1.
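Lemma 12 is easy to confirm numerically on a concrete example (the kernel below is an arbitrary ergodic choice of my own):

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.2, 0.3]])

eigvals = np.linalg.eigvals(P)
has_eigenvalue_one = bool(np.any(np.isclose(eigvals, 1.0)))
spectral_bound = float(np.abs(eigvals).max())   # largest modulus, equal to 1
```

Since all entries of this kernel are positive, 1 is the unique eigenvalue of maximal modulus, as in part (iii) of the lemma.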

We now turn our attention to eigenfunctions.

Lemma 13 (Eigenfunctions). Let (λ, f ) be an eigenpair of P . Then,

(i) If λ ̸= 1, then πf = 0.

(ii) If λ = 1 and P is irreducible, then f is constant.

Proof. Multiplying both sides of the identity P f = λf by π yields πf = λπf , proving the
first claim. Now, assuming that f = P f , let us prove that f is constant. Upon replacing f
by its real and imaginary parts if necessary, we may assume that f is real-valued. Denoting
by A := argminf the set of minimizers of f , we wish to prove that A = X . Fix an arbitrary
x ∈ A, and suppose for a contradiction that there is y ∈ X \ A. By irreducibility, we can
find t ≥ 0 such that P t (x, y) > 0. Evaluating the relation P t f = f at x yields
Σ_{z∈X} P t (x, z) (f (z) − f (x)) = 0.

Since x ∈ A, each term in this sum is non-negative, so all terms must actually be zero. This
is a contradiction, because P t (x, y) > 0 and f (y) > f (x).

We are now ready to provide a spectral characterization of the asymptotic decay rate λ⋆ (P ).

Theorem 3 (Spectral radius). We have λ⋆ (P ) = max {|λ| : λ ∈ Sp(P ) \ {1}}.


Proof. Recall the key representation (16): writing ∥A∥ := max_{x∈X} Σ_{y∈X} |A(x, y)|, we have

∀t ≥ 1, 2DP (t) = ∥(P − Π)t ∥.

Since ∥ · ∥ is a matrix norm, Gelfand's formula asserts that for any A ∈ CX ×X ,

∥At ∥^{1/t} −−−−→ ρ(A) := max {|λ| : λ ∈ Sp(A)} .
            t→∞

Thus, the claim boils down to the identity ρ(P − Π) = λ⋆ (P ). We will actually show that

Sp(P − Π) = {0} ∪ Sp(P ) \ {1},

which is more than enough. Let us first prove the inclusion ⊇. Clearly, (0, 1) is an eigenpair
of P − Π, so 0 ∈ Sp(P − Π). On the other hand, if (λ, f ) is an eigenpair of P with λ ̸= 1,
then Lemma 13 forces πf = 0, i.e. Πf = 0. Thus, (λ, f ) is also an eigenpair of P − Π.
Conversely, if (λ, f ) is an eigenpair of P −Π with λ ̸= 0, then the identity (P −Π)f = λf can
be left-multiplied by Π to obtain Πf = 0, so that (λ, f ) is also an eigenpair of P . Moreover,
we have λ ̸= 1, as otherwise Lemma 13 would imply that f is constant equal to πf = 0.

3.2 Diagonalization of reversible kernels
Let P be an irreducible transition kernel on X , with stationary law π. Consider the
(complex) Hilbert space H = L2C (X , π) of all functions f : X → C, with scalar product

⟨f, g⟩ := Σ_{x∈X} π(x) f (x) ḡ(x).

As any operator on H, P admits an adjoint P ⋆ , characterized by the duality relation

∀f, g ∈ H, ⟨P ⋆ f, g⟩ = ⟨f, P g⟩.

Choosing f = δx and g = δy yields the following explicit expression.

Definition 13 (Adjoint). The adjoint of P is defined by

∀(x, y) ∈ X 2 , P ⋆ (x, y) = π(y)P (y, x) / π(x).

Note that P ⋆ is again a transition kernel on X , which is irreducible and with stationary
law π. Note also that P ⋆⋆ = P . The duality P ↔ P ⋆ can be interpreted as time reversal: if
X ∼ MC(X , P, π) and X⋆ ∼ MC(X , P ⋆ , π), then it is easy to check that for all t ≥ 0,
(X0⋆ , . . . , Xt⋆ ) =d (Xt , . . . , X0 ).
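The adjoint formula and the duality relation can be checked numerically; the kernel below is an arbitrary irreducible example of my own, and f, g are random real observables:

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.1, 0.7, 0.2]])
pi = np.linalg.matrix_power(P, 200)[0]        # stationary law, by iteration

# P*(x, y) = pi(y) P(y, x) / pi(x)
P_star = (pi[None, :] * P.T) / pi[:, None]

rng = np.random.default_rng(1)
f, g = rng.standard_normal(3), rng.standard_normal(3)
lhs = np.sum(pi * (P_star @ f) * g)           # <P* f, g> in L2(pi)
rhs = np.sum(pi * f * (P @ g))                # <f, P g>
```

The checks below confirm that P⋆ is again a transition kernel with the same stationary law, and that it satisfies the duality relation.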

Definition 14 (Reversibility). P is reversible if P ⋆ = P , or equivalently,

∀(x, y) ∈ X 2 , π(x)P (x, y) = π(y)P (y, x).

The above equation, called detailed balance, is satisfied by all random walks on undirected
graphs, as well as many other interesting Markov chains. It is much stronger than the
stationarity property πP = π (which can be recovered by summing over all x ∈ X ), and has
remarkable consequences on the mixing properties of the associated Markov chain. Indeed,
the spectral theorem for self-adjoint operators can then be applied to guarantee the following.

(i) P admits N = |X | real eigenvalues, which can thus be ordered as follows:

1 = λ1 ≥ λ2 ≥ · · · ≥ λN ≥ −1,

(ii) There is an orthonormal basis (ϕ1 , . . . , ϕN ) of eigenfunctions of P : for all 1 ≤ i ̸= j ≤ N ,

P ϕi = λi ϕi , ∥ϕi ∥ = 1, ⟨ϕi , ϕj ⟩ = 0.

Note that, with these notations, we have λ⋆ (P ) = max{λ2 , −λN }. We will always choose
ϕ1 = 1 (this is indeed a unit eigenfunction associated with the eigenvalue λ1 = 1). Such a
spectral decomposition provides an explicit expression for the distribution of the chain.

Lemma 14 (Eigen-decomposition of reversible chains). If P is reversible, then


P t (x, y) / π(y) = 1 + Σ_{i=2}^{N} λi^t ϕi (x) ϕi (y),

for all t ∈ N and all x, y ∈ X .

Proof. Any function f : X → C can be decomposed over the orthonormal basis (ϕ1 , . . . , ϕN ):

f = Σ_{i=1}^{N} ⟨f, ϕi ⟩ ϕi .

Since ϕ1 , . . . , ϕN are eigenfunctions of P , we deduce that for all t ∈ N,

P t f = Σ_{i=1}^{N} ⟨f, ϕi ⟩ λi^t ϕi .

Choosing f = δy /π(y) and evaluating at x ∈ X yields exactly the result.
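Lemma 14 can be verified on a small reversible chain via the standard symmetrization D^{1/2} P D^{−1/2}; the birth-and-death kernel below and its stationary law are my own test values:

```python
import numpy as np

# A reversible (birth-and-death) kernel and its stationary law.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])       # satisfies detailed balance with P

# Symmetrize: S = D^{1/2} P D^{-1/2} is symmetric exactly when P is reversible.
d = np.sqrt(pi)
S = (d[:, None] * P) / d[None, :]
lam, U = np.linalg.eigh(S)              # real eigenvalues, orthonormal columns
phi = U / d[:, None]                    # eigenfunctions, orthonormal in L2(pi)

t = 5
lhs = np.linalg.matrix_power(P, t) / pi[None, :]
rhs = sum(lam[i] ** t * np.outer(phi[:, i], phi[:, i]) for i in range(3))
```

Here rhs sums over all eigenpairs, the i = 1 term contributing the constant 1 of Lemma 14.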

Remark 13 (Exponential mixing). The sum in Lemma 14 clearly behaves like λt⋆ (P ) as
t → ∞, thereby providing yet another proof of the convergence to equilibrium (Theorem
1) and of its geometric refinement (Corollary 1), albeit for reversible chains.

In order to use the exact expression given in Lemma 14, we need to have explicit access
to the eigenvalues and eigenfunctions of P , which is not often the case. Fortunately, the
expression can be bounded by a function of λ⋆ (P ) only, yielding the following simple and
general estimate on the mixing time of a reversible chain.

Theorem 4 (Mixing times of reversible chains). If P is reversible, then for all ε ∈ (0, 1),

tmix (ε) (P ) ≤ ⌈ trel (P ) log ( 1/(2ε √π⋆ ) ) ⌉,

where we recall that π⋆ = minx∈X π(x).

Proof. Fix t ∈ N and x ∈ X . By Lemma 14, the function y 7→ P t (x, y)/π(y) − 1 has squared norm

∥ P t (x, ·)/π − 1 ∥2 = Σ_{i=2}^{N} λi^{2t} |ϕi (x)|2
                      ≤ λ⋆^{2t} (P ) Σ_{i=2}^{N} |ϕi (x)|2
                      = λ⋆^{2t} (P ) ( 1/π(x) − 1 )
                      ≤ λ⋆^{2t} (P ) / π(x),

where the third line is obtained by setting t = 0 and y = x in Lemma 14. On the other
hand, for any probability measure µ ∈ P(X ), the Cauchy-Schwarz inequality gives

dtv (µ, π) = (1/2) Σ_{x∈X} π(x) | µ(x)/π(x) − 1 | ≤ (1/2) ∥ µ/π − 1 ∥ . (36)

Choosing µ = P t (x, ·) and combining this with the previous estimate, we conclude that

DP (t) ≤ λ⋆^t (P ) / (2 √π⋆ ),

from which the claim readily follows.

Remark 14 (L2 bound). The Cauchy-Schwarz inequality (36) plays a decisive role in the
proof, because it connects the probabilistic quantity of interest (total-variation distance)
to a much more tractable analytic quantity (Hilbert norm).

Remark 15 (Relaxation time vs mixing time). Combining this result with the lower
bound (23) (which does not require reversibility), we obtain

trel (P ) log ( 1/(2ε) ) ≤ tmix (ε) (P ) ≤ trel (P ) log ( 1/(2ε √π⋆ ) ).

Thus, for reversible chains, the relaxation time provides an approximation of the mixing
time that is precise up to a factor which is only logarithmic in the "size" 1/π⋆ . Note that
this would not be true without reversibility, as Example 2 shows.

3.3 Wilson’s method


The method of distinguishing statistics (Lemma 9) provides a lower bound on the distance
to equilibrium based on the first and second moments of an observable f : X → C that is
expected to behave very abnormally when the chain is far from equilibrium. In a celebrated
paper, Wilson obtained remarkably sharp lower bounds for several concrete examples of
Markov chains by taking f to be an eigenfunction of P . This fruitful idea is now known as
Wilson's method, and is summarized in the following lemma. We emphasize that reversibility
is not required here. In particular, (λ, f ) can be complex.

Lemma 15 (Wilson's method). If (λ, f ) is an eigenpair of P , then

∀t ∈ N, DP (t) ≥ ( 1 + 4V / ((1 − |λ|2 )|λ|^{2t}) )−1 ,

where V is the worst-case expected quadratic variation of f under P , i.e.

V := (1/∥f ∥∞2 ) max_{x∈X} Σ_{y∈X} P (x, y) |f (y) − f (x)|2 .

Proof. We may assume that |λ| < 1, since otherwise the bound is trivial. Upon dividing f
by ∥f ∥∞ if necessary, we may further assume that ∥f ∥∞ = 1. Now, fix x ∈ X and t ∈ N.
Following Wilson’s idea, we estimate dtv (P t (x, ·), π) by applying (the complex version of)
Lemma 9 to the eigenfunction f . Letting X ∼ MC(X , P, δx ), we have

E [f (Xt+1 )|X0 , . . . , Xt ] = (P f )(Xt ) = λf (Xt ), (37)

where we have first used the Markov property and then the fact that (λ, f ) is an eigenpair
of P . Taking expectations, we find that E[f (Xt )] = λt f (x). On the other hand, Lemma 13
(or sending t → ∞) gives πf = 0. With the notation of Lemma 9, we thus have

δ 2 = |λ|2t |f (x)|2 .

Let us now estimate the variance parameter σ 2 . By (37), we have

E[ |f (Xt+1 ) − f (Xt )|2 | X0 , . . . , Xt ] = E[ |f (Xt+1 )|2 | X0 , . . . , Xt ] + (1 − 2ℜ(λ)) |f (Xt )|2 .

But the left-hand side is at most V by definition, so taking expectations gives

E[ |f (Xt+1 )|2 ] ≤ (2ℜ(λ) − 1) E[ |f (Xt )|2 ] + V
                ≤ |λ|2 E[ |f (Xt )|2 ] + V,

because 2ℜ(λ) ≤ 1 + |λ|2 . Subtracting |E[f (Xt+1 )]|2 = |λ|2t+2 |f (x)|2 , we obtain

Var (f (Xt+1 )) ≤ |λ|2 Var (f (Xt )) + V,

from which it follows inductively that

Var (f (Xt )) ≤ V (1 − |λ|^{2t}) / (1 − |λ|2 ) ≤ V / (1 − |λ|2 ).
Taking t → ∞ shows that the same is true under the equilibrium measure π. Thus,
σ 2 ≤ 4V / (1 − |λ|2 ).
Consequently, the complex version of Lemma 9 guarantees that

dtv ( P t (x, ·), π ) ≥ ( 1 + σ 2 /δ 2 )−1 ≥ ( 1 + 4V / ((1 − |λ|2 )|λ|^{2t} |f (x)|2 ) )−1 ,
and taking a maximum over x ∈ X concludes the proof (recall that ∥f ∥∞ = 1).

Remark 16 (Spectral radius). Choosing λ so that |λ| = λ⋆ (P ), we obtain

tmix (ε) (P ) ≥ (1/2) trel (P ) log ( (1 − |λ|2 )(1 − ε) / (4εV ) ),

which constitutes a considerable improvement over (23) when V ≪ 1 − |λ|2 .

3.4 Application: limit profile for the cycle
Let us illustrate our spectral techniques by applying them to the lazy random walk on the
cycle X = Z/nZ, whose transition kernel Pn acts on functions f : X → C as follows:
∀x ∈ X , (Pn f )(x) = f (x)/2 + f (x + 1)/4 + f (x − 1)/4.
In particular, for any 1 ≤ k ≤ n, the function ϕk : x 7→ exp(2iπkx/n) is an eigenfunction of Pn
with eigenvalue λk = (1 + cos(2πk/n))/2. Moreover, for 1 ≤ k, ℓ ≤ n, we have

(1/n) Σ_{x∈X} e^{2iπ(k−ℓ)x/n} = { 1   if k = ℓ
                                 { 0   else,

showing that (ϕ1 , . . . , ϕn ) is an orthonormal basis of CX . In particular, λ⋆ (Pn ) = (1 + cos(2π/n))/2.
Using 1 − cos(h) ∼ h2 /2 and ln(1 + h) ∼ h for h ≪ 1, we obtain the asymptotic estimate

trel (Pn ) ∼ n2 /π 2 . (38)
Using only this information, Remark 15 already gives tmix (ε) (Pn ) = Ω(n2 ) and tmix (ε) (Pn ) =
O(n2 ln n). Of course, we already know from Chapter 2 that the lower bound is sharp.
Moreover, cutoff can not occur, because the product condition is not satisfied. This is
confirmed by the following refined result, which uses the entire spectral decomposition of
Pn to conclude that the rescaled distance to equilibrium converges to a smoothly decreasing
function (hence, not a step function) displayed in Figure 6.

Theorem 5 (Limit profile for random walk on the cycle). For any α > 0, we have

DPn (αn2 ) −−−−→ Ψ(α) := ∫_0^1 | Σ_{k=1}^{∞} e^{−απ^2 k^2} cos (2πku) | du.
            n→∞

In other words, tmix (ε) (Pn ) ∼ Ψ−1 (ε) n2 as n → ∞, for any fixed ε ∈ (0, 1).

Proof. By symmetry (Lemma 7), we can choose the initial state to be 0. Our starting point
is the following integral representation, which follows from the definition:

DPn (t) = (1/2) ∫_0^1 | 1 − nPn^t (0, ⌊un⌋) | du. (39)

Figure 6: Plot of the limit profile Ψ : (0, ∞) → (0, 1) appearing in Theorem 5: the conver-
gence to equilibrium of random walk on the cycle occurs very gradually (no cutoff).

To estimate the integrand, we use the spectral decomposition given in Lemma 14:

nPn^t (x, y) = Σ_{k=−⌊n/2⌋}^{⌈n/2⌉−1} ( (1 + cos(2πk/n))/2 )^t cos ( 2πk(x − y)/n ).

We choose x = 0, y = ⌊un⌋ and t = tn = ⌈αn2 ⌉. As n → ∞, we have for all k ∈ Z,

( (1 + cos(2πk/n))/2 )^{tn} cos ( 2πk⌊un⌋/n ) −−−−→ e^{−απ^2 k^2} cos (2πku).
                                               n→∞

On the other hand, since (1 + cos(aπ))/2 ≤ 1 − a2 ≤ e^{−a^2} for all a ∈ [−1, 1], we have the domination

| ( (1 + cos(2πk/n))/2 )^{tn} cos ( 2πk⌊un⌋/n ) | ≤ e^{−4αk^2} .

Since the right-hand side is summable in k, we can safely conclude that

nPn^{tn} (0, ⌊un⌋) −−−−→ Σ_{k∈Z} e^{−απ^2 k^2} cos (2πku). (40)
                    n→∞

Moreover, the above domination shows that the left-hand side is bounded uniformly in n and
u, so we can pass to the limit in the integral representation (39) to obtain

DPn (tn ) −−−−→ (1/2) ∫_0^1 | 1 − Σ_{k∈Z} e^{−απ^2 k^2} cos (2πku) | du.
           n→∞

We conclude by noting that the k = 0 term is 1, and that the others are even in k.
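The limit profile Ψ is straightforward to evaluate numerically; the sketch below uses a crude Riemann sum, with arbitrary truncation and grid parameters of my own choosing:

```python
import math

def psi(alpha, kmax=40, grid=2000):
    """Psi(alpha) = int_0^1 | sum_{k>=1} exp(-alpha pi^2 k^2) cos(2 pi k u) | du."""
    total = 0.0
    for j in range(grid):
        u = (j + 0.5) / grid                      # midpoint rule on [0, 1]
        s = sum(math.exp(-alpha * math.pi ** 2 * k * k) * math.cos(2 * math.pi * k * u)
                for k in range(1, kmax + 1))
        total += abs(s)
    return total / grid

profile = [psi(a) for a in (0.05, 0.2, 0.5)]
```

The computed values decrease smoothly toward 0 as α grows, in line with the gradual (cutoff-free) convergence shown in Figure 6.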

Remark 17 (Local CLT). Our random walk has the representation Xt = ξ1 + · · · + ξt
mod n, where (ξt )t≥1 are i.i.d., centered and with variance 1/2. Thus, the CLT yields

P ( X⌈αn2⌉ ∈ [an, bn] ) −−−−→ ∫_a^b fα (u) du,
                         n→∞

for 0 ≤ a ≤ b ≤ 1, where fα is the density of N (0, α/2) mod 1. In view of (40), we have

fα (u) = Σ_{k∈Z} e^{−απ^2 k^2} cos (2πku).

Therefore, the convergence (40) constitutes a very precise local refinement of the above
CLT, where the macroscopic interval [an, bn] is replaced by a singleton!

3.5 Application: cutoff for the hypercube


As a second illustration, let us come back to the random walk on the hypercube X = {0, 1}n
(also known as the Ehrenfest model). For any f : X → C and x = (x1 , . . . , xn ) ∈ X ,
(Pn f )(x) = (1/2n) Σ_{i=1}^{n} ( f (x1 , . . . , xi−1 , 0, xi+1 , . . . , xn ) + f (x1 , . . . , xi−1 , 1, xi+1 , . . . , xn ) ).
P
xi
In particular, for any fixed set I ⊆ {1, . . . , n}, the observable ϕI : x 7→ (−1) i∈I is an
|I|
eigenfunction of Pn with eigenvalue λI = 1 − n
. Since ϕI ϕJ = ϕI∆J , we have
(
1 X 1 if I = J
ϕI (x)ϕJ (x) =
|X | x∈X 0 else,

so that the 2n eigenfunctions (fI )I⊆[n] form an orthonormal basis of CX . In particular, the
spectral radius is λ⋆ (Pn ) = 1 − n1 , which yields the asymptotic estimate

trel (Pn ) ∼ n.

Thus, Remark 15 gives \(t^{(\varepsilon)}_{\mathrm{mix}}(P_n)=\Omega(n)\) and \(t^{(\varepsilon)}_{\mathrm{mix}}(P_n)=O(n^2)\). However, both estimates can
be considerably refined if we use the entire spectral decomposition of P_n:

Theorem 6 (Cutoff for random walk on the hypercube). For fixed α ≥ 0, we have
\[
D_{P_n}(\alpha n\ln n) \;\xrightarrow[n\to\infty]{}\; \begin{cases}1 & \text{if } \alpha<\frac12;\\ 0 & \text{if } \alpha>\frac12.\end{cases}
\]
In other words, \(t^{(\varepsilon)}_{\mathrm{mix}}(P_n) \sim \frac{n\ln n}{2}\) as n → ∞, for any fixed precision ε ∈ (0, 1).

Proof. Since the eigenfunctions (ϕ_I)_{I⊆[n]} take values in {−1, 1}, the L² bound (36) yields
\begin{align*}
4\,d^2_{\mathrm{tv}}\bigl(P^t(x,\cdot),\pi\bigr) &\le \left\|\frac{P^t(x,\cdot)}{\pi}-1\right\|^2\\
&= \sum_{\emptyset\ne I\subseteq[n]}\lambda_I^{2t}\,|\varphi_I(x)|^2\\
&= \sum_{k=1}^n\binom{n}{k}\left(1-\frac{k}{n}\right)^{2t}\\
&\le \sum_{k=1}^n\binom{n}{k}\exp\left(-\frac{2kt}{n}\right)\\
&\le \left(1+e^{-\frac{2t}{n}}\right)^n - 1\\
&\le e^{n e^{-\frac{2t}{n}}} - 1.
\end{align*}

This suffices to establish the second half of the theorem (case α > 1/2). For the first half, we
apply Wilson's method (Lemma 15) to the eigenpair (λ, f), where λ = 1 − 1/n and
\[
f(x) \;:=\; \sum_{i=1}^n \varphi_{\{i\}}(x) \;=\; \sum_{i=1}^n (1-2x_i).
\]
Since modifying a coordinate of x changes f(x) by ±2, we have (taking laziness into account),
\[
\forall x\in\mathcal{X},\qquad \sum_{y\in\mathcal{X}} P(x,y)\,|f(y)-f(x)|^2 \;=\; 2,
\]
and ∥f∥_∞ = n. Thus, Lemma 15 applies with λ = 1 − 1/n and V = 2/n², yielding
\[
D_{P_n}(t_n) \;\ge\; \left(1+\frac{4V}{(1-|\lambda|^2)\,|\lambda|^{2t_n}}\right)^{-1} \;=\; \left(1+\frac{1}{n^{1-2\alpha+o(1)}}\right)^{-1},
\]
when t_n ∼ αn log n with fixed α ∈ (0, ∞). In particular, D_{P_n}(t_n) → 1 when α < 1/2.
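The upper-bound half of the proof is easy to probe numerically, since the right-hand side of the L² estimate is an explicit sum. A sketch (our code; n and the time multiples are arbitrary):

```python
import math

def l2_bound(n, t):
    # Bound on d_tv(P^t(x, .), pi) from the proof of Theorem 6:
    # 4 d_tv^2 <= sum_{k=1}^n C(n, k) (1 - k/n)^(2t).
    s = sum(math.comb(n, k) * (1 - k / n) ** (2 * t) for k in range(1, n + 1))
    return 0.5 * math.sqrt(s)

n = 200
t_half = n * math.log(n) / 2          # the cutoff time (n ln n) / 2
for c in (0.9, 1.0, 1.5, 2.0):
    print(c, l2_bound(n, int(c * t_half)))
```

The printed values show the bound is vacuous just below ½ n ln n and decays once t exceeds it, consistent with the cutoff.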

4 Geometric techniques
We have seen that a transition kernel P is irreducible if and only if its associated diagram
GP is connected, in the sense that it contains a path from any vertex to any other. In light
of this, it is natural to suspect an intimate relation between the mixing behavior of P and
the geometry of GP . Formalizing this intuition is precisely the purpose of this chapter.

4.1 Volume, degree, diameter


Any transition kernel P on a finite state space X naturally induces a directed graph GP ,
called the diagram of the chain: its vertex set is X and its edge set is

\[
E \;:=\; \bigl\{(x,y)\in\mathcal{X}^2 \,:\, x\ne y \text{ and } P(x,y)>0\bigr\}.
\]
In this graph, a path of length t ∈ N is a sequence of t + 1 vertices (x0 , . . . , xt ) such that


(x_{s−1}, x_s) is an edge for each 1 ≤ s ≤ t. The distance from a vertex x to a vertex y is defined
as the minimum length of a path that starts at x and ends at y. More concisely,
\[
\mathrm{dist}(x,y) \;:=\; \min\bigl\{t\ge0 \,:\, P^t(x,y)>0\bigr\}. \tag{41}
\]

The function dist : X × X → [0, ∞] is not necessarily symmetric, but it always satisfies the
two other axioms that define a distance, namely:

(i) Separation: dist(x, y) = 0 ⇐⇒ x = y for all x, y ∈ X ;

(ii) Triangle inequality: dist(x, z) ≤ dist(x, y) + dist(y, z) for all x, y, z ∈ X .

We may then consider the (forward) ball of radius t ∈ N centered at x ∈ X :

B(x, t) := {y ∈ X : dist(x, y) ≤ t}.

Understanding how the volume of these balls grows with t is a natural geometric question.
We therefore introduce a function volP : N → [0, 1], called the volume growth of the chain:

\[
\mathrm{vol}_P(t) \;:=\; \min_{x\in\mathcal{X}} \pi\bigl(B(x,t)\bigr).
\]

A basic observation is that volP (t) has to be large for the chain to be mixed at time t.

Lemma 16 (Volume bound). We have DP (t) ≥ 1 − volP (t) for all t ∈ N.

Proof. Simply apply the distinguishing event method (Lemma 8) to the pair (x, A), where
x is any state that realizes the minimum in the definition of volP (t), and A = B(x, t).
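Both ingredients of Lemma 16 are mechanically computable from P, so the bound can be verified exactly on small examples. A sketch (our helper names; the lazy walk on Z/8Z is just a convenient test case):

```python
from collections import deque

def bfs_dist(P, x):
    # Distances from x in the diagram G_P (edges: y != x with P(x, y) > 0).
    d = {x: 0}
    q = deque([x])
    while q:
        u = q.popleft()
        for v in range(len(P)):
            if v != u and P[u][v] > 0 and v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def vol(P, pi, t):
    # vol_P(t) = min_x pi(B(x, t)).
    return min(sum(pi[y] for y, dy in bfs_dist(P, x).items() if dy <= t)
               for x in range(len(P)))

def worst_tv(P, pi, t):
    # D_P(t) = max_x d_tv(P^t(x, .), pi), via explicit matrix powers.
    n = len(P)
    Q = [row[:] for row in P]
    for _ in range(t - 1):
        Q = [[sum(Q[x][z] * P[z][y] for z in range(n)) for y in range(n)]
             for x in range(n)]
    return max(0.5 * sum(abs(Q[x][y] - pi[y]) for y in range(n)) for x in range(n))

# Lazy walk on the cycle Z/8Z; pi is uniform.
n = 8
P = [[0.5 if y == x else 0.25 if (y - x) % n in (1, n - 1) else 0.0
      for y in range(n)] for x in range(n)]
pi = [1.0 / n] * n
for t in range(1, 5):
    assert worst_tv(P, pi, t) >= 1 - vol(P, pi, t) - 1e-12  # Lemma 16
```

At t = 1 the bound is even attained with equality for this chain.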

We now present two useful consequences of this result, which are easy to apply in practice.
Recall that the degree of a vertex x ∈ X is the number of vertices at distance 1 from x:

deg(x) := #{y ∈ X : dist(x, y) = 1}.

Of particular interest is the maximum degree deg(P ) := maxx∈X deg(x).

Corollary 2 (Degree bound). For all ε ∈ (0, 1), one has
\[
t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\ge\;
\begin{cases}
\left\lfloor \dfrac{\log\left(\frac{1-\varepsilon}{\max\pi}\right)}{\log\deg(P)} \right\rfloor & \text{if } \deg(P)\ge2\\[4mm]
\dfrac{1-\varepsilon}{\max\pi} - 1 & \text{if } \deg(P)=1.
\end{cases}
\]

Proof. For any x ∈ X and t ∈ N, we have π(B(x, t)) ≤ (max π) × |B(x, t)| and
\[
|B(x,t)| \;\le\; 1 + \deg(P) + \cdots + (\deg(P))^t.
\]
When deg(P) ≥ 2, this geometric sum is less than (deg(P))^{t+1}, hence
\[
\mathrm{vol}_P(t) \;<\; (\max\pi)\times(\deg(P))^{t+1}.
\]
On the other hand, when deg(P) = 1, the geometric sum is t + 1, so we obtain
\[
\mathrm{vol}_P(t) \;\le\; (\max\pi)\times(t+1).
\]
To conclude, take \(t := t^{(\varepsilon)}_{\mathrm{mix}}(P)\) and note that vol_P(t) ≥ 1 − ε, by Lemma 16.

Example 3 (Sliding window). Consider the chain in Example 2: each state has degree
2, so deg(P) = 2. Since π is the uniform law on {0, 1}^n, Corollary 2 yields
\[
t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\ge\; \left\lfloor\, n - \log_2\left(\frac{1}{1-\varepsilon}\right)\right\rfloor.
\]
This is off by only 1, the exact value of \(t^{(\varepsilon)}_{\mathrm{mix}}(P)\) being obtained by replacing ⌊·⌋ with ⌈·⌉.

Example 4 (Riffle shuffle). One of the most standard methods for shuffling a deck of n
cards consists in repeating the following two-step procedure:

(i) cut the deck into two (possibly unequal) “halves”;

(ii) interleave cards from the two halves to produce a new deck.

How many times shall one repeat this procedure for the deck to be well mixed? To
formalize this question, let us identify each card with a unique label i ∈ [n] and represent
a deck of cards by a permutation σ ∈ S_n, where σ(i) indicates the label of the i-th
top card in the deck. Then, the above procedure transforms σ into a new permutation
of the form σ′ = σ ◦ γ_I, where the set I ⊆ [n] indicates the positions to which the top
“half ” gets relocated, and where γ_I is the permutation that takes values 1, 2, . . . , |I| (in
this order) on I and |I| + 1, . . . , n (in this order) on [n] \ I. If we choose the subset
I ⊆ [n] at random according to some prescribed distribution (e.g., uniform) and repeat
this procedure independently at each step, we obtain a well-defined random walk on the
symmetric group S_n, whose kernel is denoted by P_n. Since there are at most 2^n possible
choices for the subset I ⊆ [n], we have deg(P_n) ≤ 2^n, so Corollary 2 yields
\[
t^{(\varepsilon)}_{\mathrm{mix}}(P_n) \;\ge\; \frac{1}{n}\log_2\bigl((1-\varepsilon)\,n!\bigr) \;\sim\; \log_2 n,
\]
for any fixed ε ∈ (0, 1). This general bound happens to be remarkably sharp: indeed,
when I is uniform, the sequence (P_n)_{n≥1} is known to exhibit cutoff at time \(\frac{3}{2}\log_2(n)\)!

Our second application of Lemma 16 involves the radius of the chain, defined as the smallest
integer t ∈ N such that any two balls of radius t intersect:
\[
\mathrm{rad}(P) \;:=\; \min\bigl\{t\ge0 \,:\, \forall(x,y)\in\mathcal{X}^2,\ B(x,t)\cap B(y,t)\ne\emptyset\bigr\}.
\]
Corollary 3 (Radius bound). For any ε ∈ (0, 1/2), we have \(t^{(\varepsilon)}_{\mathrm{mix}}(P) \ge \mathrm{rad}(P)\).

Proof. For \(t = t^{(\varepsilon)}_{\mathrm{mix}}(P)\) with ε < 1/2, Lemma 16 yields vol_P(t) > 1/2. Since two events of
probability more than 1/2 must intersect, the result follows.

Example 5 (Sliding window). Consider the kernel P of Example 2. The balls of radius
n − 1 around the states (0, . . . , 0) and (1, . . . , 1) are disjoint, because the former consists
of all binary words of length n that start with a 0, and the latter those with a 1. Thus,
rad(P ) ≥ n (there is in fact equality), and Corollary 3 gives

\[
t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\ge\; n,
\]
for all ε < 1/2. This is in fact the correct answer, as we have seen in Example 2.

Note that when the edge set E is symmetric, we have
\[
\mathrm{rad}(P) \;=\; \left\lceil\frac{\mathrm{diam}(P)}{2}\right\rceil,
\]

where diam(P ) := maxx,y dist(x, y) denotes the diameter. In general however, the radius
may be significantly smaller than the diameter, as shown in the following example.

Example 6 (Greasy ladder). On X = {1, 2, . . . , n}, consider the transition kernel


\[
P(x,y) \;:=\; \begin{cases}\frac12 & \text{if } y=1 \text{ or } y=(x+1)\wedge n\\ 0 & \text{else.}\end{cases}
\]

The latter represents the evolution of a climber on a greasy ladder, where each step has a
chance 1/2 to result in an abrupt fall. Clearly, diam(P ) = n − 1. However, rad(P ) = 1
because P (x, 1) > 0 for all x ∈ X . Thus, Corollary 3 yields the seemingly poor bound
 
\[
\forall\varepsilon\in\Bigl(0,\frac12\Bigr),\qquad t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\ge\; 1.
\]
This is in fact sharp. Indeed, consider the obvious coupling where falls occur simultaneously
in both chains: at each step, coalescence occurs with chance at least a half, so
Theorem 2 yields \(D_P(t) \le 2^{-t}\). In particular, \(t^{(\varepsilon)}_{\mathrm{mix}}(P) \le 2\) for ε = 1/4.

The elementary bounds presented in the previous section cannot be expected to be sharp
in all situations, because the parameters deg(P ) and rad(P ) only depend on the structure of
the graph GP , and not on the precise transition probabilities. We will now introduce a more
sophisticated parameter called the conductance, which provides more accurate lower bounds
on mixing times by taking the precise transition probabilities into account.

4.2 Conductance
We turn GP into a weighted graph by defining the weight of a pair (x, y) ∈ E as follows:

⃗π (x, y) := π(x)P (x, y). (42)

By the Ergodic Theorem, this quantity represents the asymptotic proportion of time that
the edge (x, y) is traversed by the chain. Note that the formula (42) extends to a probability
measure on X 2 whose first and second marginals are equal to π:
\[
\forall x\in\mathcal{X},\qquad \sum_{y\in\mathcal{X}}\vec{\pi}(x,y) \;=\; \sum_{y\in\mathcal{X}}\vec{\pi}(y,x) \;=\; \pi(x).
\]

We will measure the surface of a set A ⊆ X by the quantity ⃗π (A × Ac ), and compare it with
the volume π(A). The ratio of those two quantities is called the conductance.

Definition 15 (Conductance). The conductance of a set ∅ ≠ A ⊆ X is the ratio
\[
\Phi(A) \;:=\; \frac{\vec{\pi}(A\times A^c)}{\pi(A)}.
\]
The conductance of the chain is the quantity
\[
\Phi(P) \;:=\; \min\left\{\Phi(A) \,:\, \emptyset\ne A\subseteq\mathcal{X},\ \pi(A)\le\frac12\right\}.
\]

The conductance of a set measures how easily the walk escapes from it. Indeed, letting
X ∼ MC(X , P, π) denote a stationary chain with transition kernel P, we have for any t ∈ N,
\[
\mathbb{P}(X_{t+1}\notin A \mid X_t\in A) \;=\; \frac{\mathbb{P}(X_t\in A,\ X_{t+1}\notin A)}{\mathbb{P}(X_t\in A)} \;=\; \Phi(A).
\]
Thus, a set A with small conductance constitutes a “bottleneck” in which the walk is likely to
remain “trapped” for a long time. In particular, if that set misses a significant portion of the
state space (π(A) ≤ 1/2), then mixing should take long. Here is a rigorous confirmation.

Lemma 17 (Conductance bound). We always have \(t_{\mathrm{mix}}(P) \ge \left\lceil\frac{1}{4\Phi(P)}\right\rceil\).

Proof. Consider a stationary chain X ∼ MC(X , P, π). Then for any A ⊆ X and t ∈ N,
\[
\{X_0\in A,\ X_t\notin A\} \;\subseteq\; \bigcup_{s=1}^{t}\,\{X_{s-1}\in A,\ X_s\notin A\}.
\]

Taking probabilities, we deduce that
\[
\sum_{x\in A}\pi(x)\,P^t(x,A^c) \;\le\; t\,\vec{\pi}(A\times A^c).
\]

On the other hand, we have P t (x, Ac ) ≥ π(Ac ) − DP (t) by Lemma 9. Thus,

π(Ac ) − DP (t) ≤ tΦ(A).

To conclude, choose a set A realizing the definition of Φ(P ) and set t = tmix (P ).
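On small state spaces, Φ(P) can be computed by brute force, which makes Lemma 17 easy to check. A sketch for the lazy cycle (our code; by Remark 18 below one could restrict to connected sets, but exhaustive enumeration is harmless here):

```python
import itertools

n = 10
# Lazy walk on the cycle Z/nZ; pi is uniform.
P = [[0.5 if y == x else 0.25 if (y - x) % n in (1, n - 1) else 0.0
      for y in range(n)] for x in range(n)]
pi = [1.0 / n] * n

def Phi(A):
    # Conductance of A: pi-weight of the edges leaving A, divided by pi(A).
    flow = sum(pi[x] * P[x][y] for x in A for y in range(n) if y not in A)
    return flow / sum(pi[x] for x in A)

# Brute-force Phi(P) over all nonempty A with pi(A) <= 1/2.
conductance = min(Phi(set(A))
                  for r in range(1, n // 2 + 1)
                  for A in itertools.combinations(range(n), r))
print(conductance)  # worst bottleneck: an arc of n/2 consecutive states
```

Here Φ(P) = 1/n, so Lemma 17 gives t_mix(P) ≥ ⌈n/4⌉, of the right order n² up to one power of n.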

Figure 7: The binary tree of height n = 3.

Example 7 (Random walk on a binary tree). Consider the lazy simple random walk on
the binary tree of height n (see Figure 7). This is the graph G = (V, E), where

• V = ⋃_{k=0}^{n} {0, 1}^k consists of all binary words of length at most n

• Two words form an edge if one is obtained from the other by deleting the last letter.

Consider the “left subtree”, i.e. the set A ⊆ V of all words that start with a 0. Note that
\(\pi(A) = \frac12 - \frac{1}{2|E|}\), and that A × A^c contains a single edge. Thus,
\[
\Phi(P) \;\le\; \Phi(A) \;=\; \frac{1}{2|E|-2}.
\]

Thus, Lemma 17 gives \(t_{\mathrm{mix}}(P) \ge \left\lceil\frac{|E|-1}{2}\right\rceil = 2^n - 1\). This is in fact the correct order of
magnitude as n → ∞, as can be shown by coupling.

In order to identify the worst bottleneck of a chain, the following remark may be helpful.

Remark 18 (Connected bottlenecks). Let A ⊆ X be any set realizing the definition of


Φ(P ), and suppose that A is disconnected, in the sense that it can be partitioned into
two proper subsets A1 , A2 with (A1 × A2 ) ∩ E = (A2 × A1 ) ∩ E = ∅. Then, we can write

\[
\Phi(A) \;=\; \frac{\vec{\pi}(A_1\times A_1^c)+\vec{\pi}(A_2\times A_2^c)}{\pi(A_1)+\pi(A_2)} \;=\; \frac{\pi(A_1)\Phi(A_1)+\pi(A_2)\Phi(A_2)}{\pi(A_1)+\pi(A_2)},
\]

so A1 , A2 must also minimize Φ. Iterating this procedure eventually produces connected


minimizers. Thus, the definition of Φ(P ) can safely be restricted to connected sets.

Remark 19 (Time-reversal). The measure ⃗π is not symmetric unless P is reversible.


Nevertheless, we may use the fact that ⃗π has equal marginals to write, for any A ⊆ X

⃗π (A × Ac ) = ⃗π (A × X ) − ⃗π (A × A)
= ⃗π (X × A) − ⃗π (A × A)
= ⃗π (Ac × A).

It follows that Φ(A) is not modified if P is replaced with P ⋆ or with (P + P ⋆ )/2. Thus,

\[
\Phi(P) \;=\; \Phi(P^\star) \;=\; \Phi\left(\frac{P+P^\star}{2}\right).
\]

Remark 20 (Reversibility on trees). The identity ⃗π (A × Ac ) = ⃗π (Ac × A) has the


following interesting consequence. Let P be any transition kernel supported on a tree:
removing any edge (x, y) partitions X into two connected components Ax and Ay , and

⃗π (x, y) = ⃗π (Ax × Ay ) = ⃗π (Ay × Ax ) = ⃗π (y, x).

Thus, any transition kernel supported on a tree is reversible.

4.3 Curvature
The geometric methods described so far only provide lower bounds. In the present section, we
introduce a fundamental geometric notion that will provide powerful upper bounds on mixing
times. Our starting point is the observation that the total-variation distance is “blind” to
the geometry of the state space: we have dtv (δx , δy ) = 1 for any x ̸= y ∈ X , regardless of
how close x is to y. A simple but far-reaching idea consists in replacing total-variation with
the following geometric quantity.

Definition 16 (Wasserstein distance). Let µ, ν be two probability measures on X . The


Wasserstein (or transportation) distance from µ to ν is the quantity

\[
\mathcal{W}(\mu,\nu) \;:=\; \inf_{X\sim\mu,\,Y\sim\nu} \mathbb{E}[\mathrm{dist}(X,Y)],
\]
where the infimum is taken over all possible couplings (X, Y) of µ and ν.

Let us make a couple of important comments before proceeding further.

Remark 21 (Dirac masses). For any x, y ∈ X , we trivially have

W(δx , δy ) = dist(x, y).

Thus, W(·, ·) extends the function dist(·, ·) from points to probability measures.

Remark 22 (Optimal coupling). W(µ, ν) is the infimum of the continuous functional


\[
p \;\mapsto\; \sum_{x,y\in\mathcal{X}} \mathrm{dist}(x,y)\,p(x,y)
\]
over the compact (and convex) set of all coupling distributions of µ and ν:
\[
C(\mu,\nu) \;:=\; \Bigl\{p\in[0,1]^{\mathcal{X}\times\mathcal{X}} \,:\, \forall x\in\mathcal{X},\ \sum_{z\in\mathcal{X}}p(x,z)=\mu(x),\ \sum_{z\in\mathcal{X}}p(z,x)=\nu(x)\Bigr\}.
\]

In particular, this infimum is attained. Thus, there is always a coupling (X, Y ) that
attains the minimum in Definition 16: we call it an optimal coupling from µ to ν.
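Remark 22 says nothing about how to find the optimal coupling, but in one special case it is explicit: on a path with dist(x, y) = |x − y|, the classical one-dimensional identity W(µ, ν) = Σ_t |F_µ(t) − F_ν(t)| in terms of cumulative distribution functions applies. A sketch (this closed form is standard for the line, not stated in the text):

```python
import random

def wasserstein_path(mu, nu):
    # W(mu, nu) on {0, ..., m-1} with dist(x, y) = |x - y|, via the
    # one-dimensional CDF identity W = sum_t |F_mu(t) - F_nu(t)|.
    w = fmu = fnu = 0.0
    for a, b in zip(mu[:-1], nu[:-1]):
        fmu += a
        fnu += b
        w += abs(fmu - fnu)
    return w

# Dirac masses recover the graph distance (Remark 21).
delta = lambda i, m: [float(j == i) for j in range(m)]
assert wasserstein_path(delta(1, 6), delta(4, 6)) == 3.0

# Triangle inequality on random probability vectors (Lemma 18 below).
random.seed(1)
def rand_prob(m):
    v = [random.random() for _ in range(m)]
    s = sum(v)
    return [x / s for x in v]
lam, mu, nu = rand_prob(6), rand_prob(6), rand_prob(6)
assert (wasserstein_path(lam, nu)
        <= wasserstein_path(lam, mu) + wasserstein_path(mu, nu) + 1e-12)
```

For a general finite metric, computing W amounts to solving the small linear program over C(µ, ν).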

The function W is not symmetric in general, because dist is not. However, this is the only
obstruction for W to be a nice distance on P(X ), as the next lemma shows.

Lemma 18 (Properties). The Wasserstein distance W is convex and satisfies the sep-
aration axiom and the triangle inequality. It is symmetric if and only if dist is.

Proof of convexity. Fix µ, µ′, ν, ν′ ∈ P(X) and θ ∈ [0, 1]. Let p be the law of an optimal
coupling from µ to µ′, and let q be the law of an optimal coupling from ν to ν′. Then
r := θp + (1 − θ)q is the law of a coupling from θµ + (1 − θ)ν to θµ′ + (1 − θ)ν′, so
\begin{align*}
\mathcal{W}\bigl(\theta\mu+(1-\theta)\nu,\ \theta\mu'+(1-\theta)\nu'\bigr) &\le \sum_{(x,y)\in\mathcal{X}^2} r(x,y)\,\mathrm{dist}(x,y)\\
&= \sum_{(x,y)\in\mathcal{X}^2} \bigl(\theta p(x,y)+(1-\theta)q(x,y)\bigr)\,\mathrm{dist}(x,y)\\
&= \theta\,\mathcal{W}(\mu,\mu') + (1-\theta)\,\mathcal{W}(\nu,\nu').
\end{align*}
This proves that W is convex on P(X) × P(X).

Proof of the separation axiom. Let µ, ν ∈ P(X ) be such that W(µ, ν) = 0. By Remark
22, we can find a coupling (X, Y ) of µ and ν such that E[dist(X, Y )] = 0. This forces
dist(X, Y ) = 0 a.-s., because dist(·, ·) is non-negative. Since dist(·, ·) moreover satisfies the
separation axiom, we deduce that X = Y a.-s., hence in distribution. Thus, µ = ν.

Proof of the triangle inequality. Fix λ, µ, ν ∈ P(X ). Write p (resp. q) for the law of an
optimal coupling from λ to µ (resp. µ to ν). Consider a random triple (X, Y, Z) with law
\[
\forall(x,y,z)\in\mathcal{X}^3,\qquad \mathbb{P}(X=x,\,Y=y,\,Z=z) \;=\; \frac{p(x,y)\,q(y,z)}{\mu(y)},
\]
this ratio being interpreted as 0 if the denominator (hence also the numerator) is zero.
Summing over all z ∈ X shows that (X, Y ) has law p, and summing over all x ∈ X shows
that (Y, Z) has law q. In particular, (X, Z) is a coupling of λ and ν, so we have

\begin{align*}
\mathcal{W}(\lambda,\nu) &\le \mathbb{E}[\mathrm{dist}(X,Z)]\\
&\le \mathbb{E}[\mathrm{dist}(X,Y) + \mathrm{dist}(Y,Z)]\\
&= \mathcal{W}(\lambda,\mu) + \mathcal{W}(\mu,\nu),
\end{align*}

where we have used the triangle inequality for dist(·, ·), and the optimality of p and q.

Proof of symmetry. It is clear from Definition 16 that W(·, ·) is symmetric whenever
dist(·, ·) is. The converse readily follows from Remark 21.

Remark 23 (Robustness). The above proofs remain valid for any function dist : X 2 →
R+ satisfying the separation axiom and the triangle inequality. Thus, the Wasserstein
distance is a very general tool that “lifts” any distance on X to a distance on P(X ).
The choice dist(x, y) := 1x̸=y gives rise to the total-variation distance, by Remark 11.

We now show that the Wasserstein distance controls the total variation distance.

Lemma 19 (Wasserstein bound). For any µ, ν ∈ P(X ), we have

dtv (µ, ν) ≤ W(µ, ν).

Proof. The inequality 1x̸=y ≤ dist(x, y) trivially holds for all (x, y) ∈ X 2 . Integrating this
against the law of an optimal coupling from µ to ν concludes the proof.

Thus, any upper bound on the Wasserstein distance is also an upper bound on the total-
variation distance. The interest of the Wasserstein distance is that it can be efficiently
controlled by exploiting the geometry of the state space, as we will now see. The curvature
of a Markov chain measures the amount by which Wasserstein distances are contracted under
the action of the underlying transition kernel P .

Definition 17 (Curvature). The curvature κ(P ) is the largest κ ∈ R such that

∀µ, ν ∈ P(X ), W(µP, νP ) ≤ e−κ W(µ, ν).

This global definition seems far too delicate for practical use. Fortunately, a pleasant feature
of curvature is that it admits a simple, local characterization.

Lemma 20 (Local characterization of curvature). We have
\[
e^{-\kappa(P)} \;=\; \max_{(x,y)\in E}\, \mathcal{W}\bigl(P(x,\cdot),\,P(y,\cdot)\bigr).
\]

Proof. Setting ρ := max(x,y)∈E W (P (x, ·), P (y, ·)), we will establish the inequality

∀µ, ν ∈ P(X ), W(µP, νP ) ≤ ρW(µ, ν). (43)

Fix x, y ∈ X , and let (σ0 , . . . , σt ) be a shortest path from x to y, i.e.

t = dist(x, y), σ0 = x, σt = y, and (σs−1 , σs ) ∈ E for 1 ≤ s ≤ t.

Using the triangle inequality for W (Lemma 18) and the definition of ρ, we have
\[
\mathcal{W}\bigl(P(x,\cdot),P(y,\cdot)\bigr) \;\le\; \sum_{s=1}^{t} \mathcal{W}\bigl(P(\sigma_{s-1},\cdot),P(\sigma_s,\cdot)\bigr) \;\le\; \rho t \;=\; \rho\,\mathrm{dist}(x,y). \tag{44}
\]

This establishes (43) in the special case (µ, ν) = (δx , δy ). For the general case, let p be the
law of an optimal coupling from µ to ν, and observe that
\[
(\mu P,\ \nu P) \;=\; \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\,\bigl(P(x,\cdot),\ P(y,\cdot)\bigr).
\]

Since W is convex (Lemma 18), we immediately deduce that
\begin{align*}
\mathcal{W}(\mu P,\nu P) &\le \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\,\mathcal{W}\bigl(P(x,\cdot),P(y,\cdot)\bigr)\\
&\le \rho \sum_{(x,y)\in\mathcal{X}^2} p(x,y)\,\mathrm{dist}(x,y)\\
&= \rho\,\mathcal{W}(\mu,\nu),
\end{align*}

where the second line uses (44) and the third the optimality of p. Thus, (43) is established.
Conversely, note that (43) is an equality when (µ, ν) = (δx , δy ) with (x, y) ∈ E achieving
the maximum in the definition of ρ. Thus, ρ is in fact the smallest constant for which (43)
holds. Comparing this with Definition 17, we conclude that ρ = e−κ(P ) .

The interest of curvature is contained in the following result.

Theorem 7 (Curvature bound). If κ(P) > 0, then
\[
t_{\mathrm{rel}}(P) \;\le\; \frac{1}{\kappa(P)},
\qquad\qquad
t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\le\; \left\lceil\frac{1}{\kappa(P)}\,\log\left(\frac{\mathrm{diam}(P)}{\varepsilon}\right)\right\rceil.
\]

Proof. Using the definition of κ(P) and an immediate induction over t ∈ N, we have
\[
\forall\mu,\nu\in\mathcal{P}(\mathcal{X}),\qquad \mathcal{W}(\mu P^t,\nu P^t) \;\le\; \mathcal{W}(\mu,\nu)\,e^{-\kappa(P)t}.
\]

Combining this with the crude bound W(·, ·) ≤ diam(P ) and Lemma 19, we obtain

\[
d_{\mathrm{tv}}\bigl(\mu P^t,\ \nu P^t\bigr) \;\le\; e^{-\kappa(P)t}\,\mathrm{diam}(P).
\]
Finally, choosing ν = π, µ = δ_x and maximizing over x ∈ X yields
\[
D_P(t) \;\le\; \mathrm{diam}(P)\,e^{-\kappa(P)t},
\]

for all t ∈ N. The first claim is obtained by sending t → ∞ (recall that \(D_P(t)^{1/t} \to e^{-1/t_{\mathrm{rel}}(P)}\)),
and the second by choosing \(t = \bigl\lceil\frac{1}{\kappa(P)}\log\frac{\mathrm{diam}(P)}{\varepsilon}\bigr\rceil\).

Example 8 (Hypercube). Consider the lazy random walk on the n−dimensional hyper-
cube. Fix two neighboring states x, y, and consider the coupling (X, Y ) of P (x, ·) and
P (y, ·) that updates the same coordinate using the same Bernoulli variable. Then,
\[
\mathcal{W}\bigl(P(x,\cdot),P(y,\cdot)\bigr) \;\le\; \mathbb{E}[\mathrm{dist}(X,Y)] \;=\; 1-\frac1n.
\]
Since this holds for all (x, y) ∈ E, we deduce that
\[
\kappa(P) \;\ge\; -\log\left(1-\frac1n\right) \;\ge\; \frac1n.
\]
Thus, Theorem 7 gives \(t_{\mathrm{rel}}(P) \le n\) and \(t^{(\varepsilon)}_{\mathrm{mix}}(P) \le n\log\frac{n}{\varepsilon}\). A comparison with the
results of Section 3.5 shows that those estimates are remarkably sharp. Interestingly, the
bound \(\kappa(P) \le \frac{1}{t_{\mathrm{rel}}(P)} = -\log(1-\frac1n)\) shows that the first inequality in Theorem 7 is an equality.

4.4 Application: phase transition in the Curie-Weiss model


In this section, we demonstrate the strength of the above methods by establishing a dynami-
cal phase transition for one of the most fundamental statistical physics models: the mean-field
Ising ferromagnet. The latter describes the evolution of n particles (called “spins”), each
being in one of two possible states (“plus” or “minus”). Each particle has a tendency to align
its state with those of the other particles, and the strength of this interaction is controlled
by a parameter β ≥ 0 (the “inverse temperature”). The precise model is as follows.
The system can be represented by a vector x = (x1 , . . . , xn ) ∈ {−1, +1}n , where xi = +1
(resp. xi = −1) indicates that the i−th particle is in the “plus” (resp. “minus”) state.
At each step, the vector x is randomly modified as follows: a coordinate i ∈ [n] is selected
uniformly at random, and its current value xi is replaced by +1 or −1 with respective

probabilities ψ(+s) and ψ(−s), where
\[
s \;:=\; \frac{\beta}{n}\sum_{j\in[n]\setminus\{i\}} x_j
\qquad\text{and}\qquad
\psi(s) \;:=\; \frac{e^s}{e^s+e^{-s}}.
\]

Note that ψ(s) + ψ(−s) = 1 for any s ∈ R, as required. Note also that ψ(s) increases from
0 to 1 as s ranges from −∞ to +∞. Thus, the new state of the i−th particle is likely to be
“plus” if s is a large positive number, and “minus” if s is a large negative number. Formally,
we have defined a Markov chain on X = {−1, +1}n with transition kernel
\[
P_n(x,y) \;:=\;
\begin{cases}
\dfrac1n\displaystyle\sum_{i=1}^n \psi\Bigl(\tfrac{\beta}{n}\,x_i\!\!\sum_{j\in[n]\setminus\{i\}}\!\!x_j\Bigr) & \text{if } y = x\\[3mm]
\dfrac1n\,\psi\Bigl(-\tfrac{\beta}{n}\,x_i\!\!\sum_{j\in[n]\setminus\{i\}}\!\!x_j\Bigr) & \text{if } y = (x_1,\dots,x_{i-1},-x_i,x_{i+1},\dots,x_n)\\[3mm]
0 & \text{else.}
\end{cases}
\]

The fact that ψ > 0 ensures that this transition kernel is ergodic. Moreover, it is easily
checked to be reversible with respect to the measure
 !2 
n
1 β X
π(x) := exp  xi  ,
Z(β) 2n i=1

where Z(β) denotes the appropriate normalizing constant. How does the mixing time of this
chain depend on the interaction parameter β? In the easy case β = 0 (no interaction), we
recover the random walk on the hypercube, which has mixing time tmix (Pn ) = Θ(n log n).
On the other hand, in the limit where β → +∞ (strong interaction), the selected particle
will systematically adopt the majority state, so the chain will need an infinite amount of
time to move from (−1, . . . , −1) to (+1, . . . , +1). For “intermediate” values of β, it is then
tempting to believe that the mixing time will simply interpolate between those two extreme
situations, in a gradual way. In fact, the mixing time changes dramatically from O(n log n)
(fast-mixing regime) to exp(Ω(n)) (slow-mixing regime) as β passes the critical value 1.

Theorem 8 (Dynamic phase transition for the mean-field Ising model).

1. For any fixed β < 1, we have \(t_{\mathrm{mix}}(P_n) \le (1+o(1))\,\frac{n\log n}{1-\beta}\) as n → ∞ (fast mixing).

2. For any fixed β > 1, we have \(t_{\mathrm{mix}}(P_n) = \exp(\Omega(n))\) (exponentially slow mixing).

Proof of fast mixing when β < 1. In light of Theorem 7, it suffices to prove that
\[
\kappa(P_n) \;\ge\; \frac{1-\beta}{n},
\]
which we now do. Let I and U be independent with I ∼ Unif({1, . . . , n}) and U ∼
Unif([0, 1]). Starting from a fixed state x = (x1 , . . . , xn ) ∈ X , one can construct a ran-
dom state X = (X1 , . . . , Xn ) with law P (x, ·) by setting for each i ∈ [n],

\[
X_i \;:=\;
\begin{cases}
x_i & \text{if } I\ne i\\[1mm]
+1 & \text{if } I=i \text{ and } U \le \psi\Bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{i\}}x_j\Bigr)\\[1mm]
-1 & \text{if } I=i \text{ and } U > \psi\Bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{i\}}x_j\Bigr).
\end{cases}
\]

Now, consider a state y ∈ X which differs from x by a single coordinate, say x_k = −1 and
y_k = +1. Then, the coupling (X, Y) of P(x, ·), P(y, ·) that uses the same pair (U, I) gives
\[
\mathrm{dist}(X,Y) \;=\;
\begin{cases}
0 & \text{if } I=k\\[1mm]
2 & \text{if } I\ne k \text{ and } \psi\Bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{I\}}x_j\Bigr) \le U < \psi\Bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{I\}}y_j\Bigr)\\[1mm]
1 & \text{else.}
\end{cases}
\]
But \(\sum_{j\ne I}y_j - \sum_{j\ne I}x_j \le 2\) and \(\|\psi'\|_\infty \le \frac12\), so \(\psi\bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{I\}}y_j\bigr) - \psi\bigl(\tfrac{\beta}{n}\sum_{j\in[n]\setminus\{I\}}x_j\bigr) \le \frac{\beta}{n}\).
Thus, the second case occurs with probability at most β/n, and we deduce that
\[
\mathbb{E}[\mathrm{dist}(X,Y)] \;\le\; \Bigl(1-\frac1n\Bigr)\Bigl(1+\frac{\beta}{n}\Bigr) \;\le\; e^{\frac{\beta-1}{n}}.
\]
This shows that \(\kappa(P_n) \ge \frac{1-\beta}{n}\), as desired.

Proof of slow mixing when β > 1. Let us consider the event of negative magnetization:
\[
A \;:=\; \Bigl\{x\in\mathcal{X} \,:\, \sum_{i=1}^n x_i < 0\Bigr\}.
\]

The symmetry property π(x) = π(−x) ensures that π(A) ≤ 1/2, so that Φ(P) ≤ Φ(A). By
the conductance bound (Lemma 17), we only have to show that Φ(A) = exp(−Ω(n)). For
0 ≤ k ≤ n, let A_k consist of all configurations with k “plus” and n − k “minus” states:
\[
A_k \;:=\; \Bigl\{x\in\mathcal{X} \,:\, \sum_{i=1}^n x_i = 2k-n\Bigr\}.
\]

Since at most one coordinate is modified at each step, the only way for the chain to jump
from A to Ac is to actually jump from A⌈n/2⌉−1 to A⌈n/2⌉ . Thus,

⃗π (A × Ac ) = ⃗π (A⌈n/2⌉−1 × A⌈n/2⌉ ) ≤ π(A⌈n/2⌉ ),

where the inequality follows from the fact that the second marginal of ⃗π is π. On the other
hand, we have \(\pi(A) \ge \max_{k<\lceil n/2\rceil} \pi(A_k)\) and, for 0 ≤ k ≤ n, \(\pi(A_k) = a_k/Z(\beta)\), where
\[
a_k \;:=\; \binom{n}{k}\exp\left(\frac{\beta}{2n}(2k-n)^2\right).
\]
Consequently, \(\Phi(A) \le \min_{k<\lceil n/2\rceil} \frac{a_{\lceil n/2\rceil}}{a_k}\). To see that this ratio is exponentially small in n, fix
θ ∈ (0, 1) and observe that when k = k(n) ∼ θn as n → ∞, we have
\[
\frac1n \log a_{k(n)} \;\xrightarrow[n\to\infty]{}\; f(\theta) \;:=\; \frac{\beta}{2}(2\theta-1)^2 - \theta\log\theta - (1-\theta)\log(1-\theta).
\]
Thus, our task boils down to showing the existence of θ < 1/2 such that f(θ) > f(1/2). But
\[
f'(\theta) \;=\; 2\beta(2\theta-1) + \log(1-\theta) - \log\theta,
\qquad
f''(\theta) \;=\; 4\beta - \frac{1}{\theta(1-\theta)}.
\]
Thus, f′(1/2) = 0 and f′′(1/2) = 4(β − 1), so f(1/2) is a strict local minimum when β > 1.
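The bottleneck estimate is quantitative: the leading term of log(a_{⌈n/2⌉}/max_k a_k) is n(f(1/2) − max f), which is negative for β > 1. This can be evaluated directly. A sketch (our code; β = 1.5 is an arbitrary supercritical value):

```python
import math

def log_ratio(n, beta):
    # log of the bound Phi(A) <= a_{ceil(n/2)} / max_{k < ceil(n/2)} a_k,
    # where a_k = C(n, k) * exp(beta * (2k - n)^2 / (2n)).
    def log_a(k):
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + beta * (2 * k - n) ** 2 / (2 * n))
    half = math.ceil(n / 2)
    return log_a(half) - max(log_a(k) for k in range(half))

beta = 1.5
for n in (50, 100, 200):
    print(n, log_ratio(n, beta))  # decreases roughly linearly in n
```

The printed values decay roughly linearly in n, confirming Φ(A) = exp(−Ω(n)) and hence, by Lemma 17, exponentially slow mixing.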

4.5 Carne-Varopoulos bound


The volume bound (Lemma 16) and its useful consequences (Corollaries 2 and 3) relied on
a crude observation: after t steps, the chain is necessarily at distance at most t from its
starting point. In many cases however, this best-case scenario is rather optimistic, compared
to the typical displacement of the chain. For example, a simple random walk X = (Xt )t≥0
on Z has asymptotic speed zero by the strong law of large numbers, and it is the statistical
fluctuations that really drive the motion, as quantified by the Central Limit Theorem:
\[
\frac{\mathrm{dist}(X_0,X_t)}{\sqrt{t}} \;\xrightarrow[t\to\infty]{d}\; |Z|, \qquad\text{where } Z\sim\mathcal{N}(0,1).
\]
Consequently, after t steps, the walk typically lies at distance Θ(√t) from its starting point,
rather than t as we used in our volume bound. In light of this, it is natural to hope for a
quadratic improvement over the diameter lower bound (Corollary 3) for “diffusive” chains.
Note that this intuition is correct for the random walk on the n−cycle, for which we have
seen that tmix (Pn ) ≍ n2 ≍ diam(Pn )2 . With those preliminary observations in mind, let us
now state a remarkable inequality which compares the transition kernel of any reversible
chain to that of the simple random walk on Z.

Theorem 9 (Carne-Varopoulos estimate). For any reversible kernel P, we have
\[
P^t(x,y) \;\le\; \sqrt{\frac{\pi(y)}{\pi(x)}}\;\mathbb{P}\bigl(|X_t| \ge \mathrm{dist}(x,y)\bigr),
\]
where X = (X_t)_{t≥0} denotes the simple random walk on Z. In particular,
\[
P^t(x,y) \;\le\; 2\sqrt{\frac{\pi(y)}{\pi(x)}}\,\exp\left\{-\frac{(\mathrm{dist}(x,y))^2}{2t}\right\},
\]
for all x, y ∈ X and t ∈ N.

Proof. The proof uses the Chebychev polynomials (qk )k≥0 , which are defined by the recursion

qk+1 (z) := 2zqk (z) − qk−1 (z) (k ≥ 1),

with initial conditions q0 (z) = 1 and q1 (z) = z. The trigonometric identity

2 cos(kθ) cos(θ) = cos ((k + 1)θ) + cos ((k − 1)θ) ,

shows that q_k(cos θ) = cos(kθ) for all θ ∈ R and k ∈ N. Now, observe that for all t ∈ N,
\[
(\cos\theta)^t \;=\; \left(\frac{e^{i\theta}+e^{-i\theta}}{2}\right)^{t} \;=\; \mathbb{E}\bigl[e^{i\theta X_t}\bigr] \;=\; \mathbb{E}[\cos(\theta X_t)] \;=\; \sum_{k=0}^{t}\mathbb{P}(|X_t|=k)\,q_k(\cos\theta).
\]

Since this is true for all θ ∈ R, we must have the polynomial identity
\[
z^t \;=\; \sum_{k=0}^{t}\mathbb{P}(|X_t|=k)\,q_k(z).
\]

Let us apply this polynomial identity to the matrix P, and evaluate the (x, y)−entry:
\begin{align*}
P^t(x,y) &= \sum_{k=0}^{t}\mathbb{P}(|X_t|=k)\,q_k(P)(x,y)\\
&= \sum_{k=\mathrm{dist}(x,y)}^{t}\mathbb{P}(|X_t|=k)\,q_k(P)(x,y),
\end{align*}
where we have observed that \(I(x,y) = P(x,y) = \cdots = P^k(x,y) = 0\) for k < dist(x, y), so
that q_k(P)(x, y) = 0 (since q_k has degree k). To conclude, it remains to show that
\[
q_k(P)(x,y) \;\le\; \sqrt{\frac{\pi(y)}{\pi(x)}}, \tag{45}
\]

for all k ≥ 0, and this is where we use reversibility: qk (P ) is a self-adjoint operator with
spectrum {qk (λ) : λ ∈ Sp(P )} ⊆ qk ([−1, 1]) ⊆ [−1, 1], where the last inclusion follows from
the identity qk (cos θ) = cos(kθ). Thus, qk (P ) is a contraction, which means that

|⟨qk (P )f, g⟩| ≤ ∥f ∥∥g∥, (46)

for all observables f, g : X → C. Taking f = δy and g = δx yields exactly (45). The second
claim follows from a classical application of Markov's inequality: for d, λ > 0,
\[
\mathbb{P}(X_t\ge d) \;=\; \mathbb{P}\bigl(e^{\lambda X_t}\ge e^{\lambda d}\bigr) \;\le\; e^{-\lambda d}\,\mathbb{E}\bigl[e^{\lambda X_t}\bigr] \;=\; e^{-\lambda d}\left(\frac{e^{\lambda}+e^{-\lambda}}{2}\right)^{t} \;\le\; e^{-\lambda d+\frac{\lambda^2t}{2}}.
\]
The right-hand side is minimized for λ = d/t, in which case it is equal to \(e^{-\frac{d^2}{2t}}\). This of course
also applies to −X_t, and combining the two estimates concludes the proof.
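Since the law of |X_t| and the polynomials q_k are both explicit, the key identity can be verified numerically. A sketch (our code; t and the evaluation points are arbitrary):

```python
from math import comb

def srw_abs_law(t):
    # P(|X_t| = k) for k = 0..t, where X_t is a sum of t uniform +-1 steps.
    law = [0.0] * (t + 1)
    for j in range(t + 1):          # j = number of +1 steps, so X_t = 2j - t
        law[abs(2 * j - t)] += comb(t, j) / 2 ** t
    return law

def chebyshev(kmax, z):
    # q_0 = 1, q_1 = z, q_{k+1} = 2 z q_k - q_{k-1}.
    q = [1.0, z]
    for _ in range(kmax - 1):
        q.append(2 * z * q[-1] - q[-2])
    return q[:kmax + 1]

t = 9
for z in (-0.7, 0.1, 0.5, 1.0):
    law = srw_abs_law(t)
    q = chebyshev(t, z)
    rhs = sum(law[k] * q[k] for k in range(t + 1))
    assert abs(z ** t - rhs) < 1e-12   # z^t = sum_k P(|X_t| = k) q_k(z)
```

Being a polynomial identity, it then applies verbatim with the matrix P substituted for z.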

Corollary 4 (Diffusive bound). For lazy simple random walk on any N-vertex graph,
\[
t^{(\varepsilon)}_{\mathrm{mix}}(P) \;\ge\; \frac{(\mathrm{diam}(P))^2}{16\ln N},
\]
for all ε ∈ (0, 1/2), provided N is large enough.

Proof. Set d = ⌈diam(P)/2⌉, so that diam(P) > 2(d − 1): this means that we can find two
disjoint balls of radius d − 1. Thus, there is x ∈ X such that A = B(x, d − 1) satisfies
\[
\pi(A) \;\le\; \frac12.
\]
But the elements of A^c are at distance d ≥ diam(P)/2 from x, so Theorem 9 ensures that
\[
P^t(x,A^c) \;\le\; 2N^{\frac32}\,e^{-\frac{\mathrm{diam}^2(P)}{8t}},
\]
where we have used the crude estimates |A^c| ≤ N and \(\frac{\pi(y)}{\pi(x)} = \frac{\deg(y)}{\deg(x)} \le N\). We conclude that
\[
D_P(t) \;\ge\; \frac12 - 2N^{\frac32}\,e^{-\frac{\mathrm{diam}^2(P)}{8t}}.
\]
Thus, as long as \(t \le \frac{(\mathrm{diam}\,P)^2}{16\ln N}\), we have \(D_P(t) \ge \frac12 - \frac{2}{\sqrt{N}}\), concluding the proof.

5 Variational techniques
For an ergodic transition kernel P with stationary law π on a state space X , the convergence
to equilibrium D_P(t) → 0 as t → ∞ can be equivalently formulated as follows:
\[
\forall x\in\mathcal{X},\qquad (P^tf)(x) \;\xrightarrow[t\to\infty]{}\; \pi f,
\]

for all f : X → R. In words, observables become constant under the repeated action of P .
Equivalently, the variance \(\mathrm{Var}(f) = \pi(f^2) - \pi^2(f)\) decays under the repeated action of P:
\[
\mathrm{Var}(P^tf) \;\xrightarrow[t\to\infty]{}\; 0. \tag{47}
\]

This naturally raises the following two questions:

1. At what speed does the convergence (47) take place?

2. What are the consequences in terms of mixing times?

To answer those questions, we introduce a fundamental object: the Dirichlet form.

5.1 Dirichlet form and Poincaré constant


The Dirichlet form is a quadratic form on the Hilbert space L2 (X , π) that measures the
expected quadratic variation of observables under a transition of the stationary chain.

Definition 18 (Dirichlet form). The Dirichlet form is the quadratic form defined by
\begin{align*}
\mathcal{E}_P(f) &= \frac12\,\mathbb{E}_\pi\bigl[(f(X_1)-f(X_0))^2\bigr]\\
&= \frac12\sum_{x,y\in\mathcal{X}}\vec{\pi}(x,y)\,(f(y)-f(x))^2\\
&= \langle (I-P)f,\,f\rangle
\end{align*}
for any observable f : X → R, where E_π denotes expectation under MC(X , P, π).

Remark 24 (Rank-one case). It is instructive to consider the “ideal chain” P = Π
defined in (15), which mixes exactly in a single step. Since ⃗π(x, y) = π(x)π(y), we have
\begin{align*}
\mathcal{E}_\Pi(f) &= \frac12\sum_{x,y\in\mathcal{X}}\pi(x)\pi(y)\,(f(x)-f(y))^2\\
&= \pi(f^2) - \pi^2(f)\\
&= \mathrm{Var}(f).
\end{align*}
A natural way to quantify the variational behavior of P consists in comparing its Dirichlet
form with that of the ideal chain Π. This leads to the following fundamental definition.

Definition 19 (Poincaré constant). The Poincaré constant of the chain is the quantity
\[
\gamma(P) \;:=\; \inf\left\{\frac{\mathcal{E}_P(f)}{\mathrm{Var}(f)} \,:\, f\colon\mathcal{X}\to\mathbb{R} \text{ is not constant}\right\}.
\]

Since the ratio E_P(f)/Var(f) is invariant under translation and scaling, we also have
\[
\gamma(P) \;=\; \inf\bigl\{\mathcal{E}_P(f) \,:\, \|f\|=1,\ \pi f=0\bigr\},
\]

which shows that the infimum is actually attained.

Remark 25 (Time reversal). Replacing P by P⋆ changes ⃗π(x, y) to ⃗π(y, x), so we have
\[
\mathcal{E}_P(f) \;=\; \mathcal{E}_{P^\star}(f) \;=\; \mathcal{E}_{\frac{P+P^\star}{2}}(f),
\]
for all observables f : X → R. In particular, we deduce that
\[
\gamma(P) \;=\; \gamma(P^\star) \;=\; \gamma\left(\frac{P+P^\star}{2}\right).
\]

In words, the Dirichlet form and the Poincaré constant are invariant under time reversal.

The Poincaré constant happens to enjoy a simple spectral interpretation.

Lemma 21 (Spectral interpretation of the Poincaré constant). We always have
\[
\gamma(P) \;=\; 1 - \lambda_2\left(\frac{P+P^\star}{2}\right),
\]
where λ_2(Q) denotes the second largest eigenvalue of a self-adjoint transition matrix Q.
In particular, if P is lazy and reversible, then

γ(P ) = 1 − λ⋆ (P ).
Proof. First consider the case where P is reversible. By decomposing the observable f in
our orthonormal basis (ϕ_1, . . . , ϕ_N) of eigenfunctions, one finds

    ⟨(I − P)f, f⟩ = Σ_{k=2}^{N} (1 − λ_k) |⟨f, ϕ_k⟩|^2.

On the other hand, since ϕ_1 ≡ 1, we have ⟨f, ϕ_1⟩ = πf, so that

    Var(f) = Σ_{k=2}^{N} |⟨f, ϕ_k⟩|^2.

It readily follows that E_P(f) ≥ (1 − λ_2(P))Var(f), with equality when f = ϕ_2. Thus,

    γ(P) = 1 − λ_2(P),

which establishes the claim when P is reversible. The general case is obtained by replacing
P with (P + P⋆)/2, which is always reversible and has the same Poincaré constant
(Remark 25). Finally, when P is lazy and reversible, we have Sp(P) ⊆ [0, 1], so λ⋆(P) = λ_2(P).

Remark 26 (Range of γ(P )). The above result shows that we always have γ(P ) ∈ [0, 2],
and even that γ(P ) ∈ [0, 1] in the case where P is lazy.
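Lemma 21 can be tested on the lazy random walk on Z/nZ, whose spectrum is known in closed form: the Fourier mode x ↦ cos(2πkx/n) is an eigenfunction with eigenvalue (1 + cos(2πk/n))/2, so the lemma predicts γ(P) = (1 − cos(2π/n))/2. A minimal sketch (assuming this standard spectral formula) checks that the second eigenfunction attains the variational infimum and that random observables never beat it:

```python
import math, random

n = 9                                         # cycle length
pi = 1 / n                                    # uniform stationary law
gamma = (1 - math.cos(2 * math.pi / n)) / 2   # 1 - lambda_2(P), per Lemma 21

def dirichlet(f):
    # E_P(f) = (1/2) sum_{x,y} pi(x) P(x,y) (f(y) - f(x))^2, with P(x, x+-1) = 1/4
    return 0.5 * sum(pi * 0.25 * (f[(x + s) % n] - f[x]) ** 2
                     for x in range(n) for s in (1, -1))

def variance(f):
    m = sum(f) / n
    return sum(pi * (v - m) ** 2 for v in f)

# The second eigenfunction x -> cos(2*pi*x/n) attains the infimum...
phi2 = [math.cos(2 * math.pi * x / n) for x in range(n)]
attained = abs(dirichlet(phi2) / variance(phi2) - gamma) < 1e-12

# ...and no observable achieves a smaller ratio E_P(f)/Var(f).
random.seed(0)
ok = True
for _ in range(200):
    f = [random.gauss(0, 1) for _ in range(n)]
    ok = ok and dirichlet(f) >= gamma * variance(f) - 1e-12
print(attained, ok)
```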

We can now answer the first question raised at the beginning of this chapter.

Lemma 22 (Variational contraction). For all f : X → R, we have

Var(P f ) ≤ [1 − γ(P ⋆ P )] Var(f ).

Moreover, if P is lazy, then γ(P ⋆ P ) ≥ γ(P ).

Proof. Using the equality πf = πP f and the definition of the adjoint P⋆, we have

    Var(f) − Var(P f) = ⟨f, f⟩ − ⟨P f, P f⟩
                      = ⟨f, f⟩ − ⟨f, P⋆P f⟩
                      = ⟨(I − P⋆P)f, f⟩
                      = E_{P⋆P}(f)
                      ≥ γ(P⋆P) Var(f),

and the first claim readily follows. Now, if P is lazy, then for all x, y ∈ X, we have

    π(x) P⋆P(x, y) = π(x) Σ_{z∈X} P⋆(x, z) P(z, y)
                   ≥ π(x) (P(x, x) P(x, y) + P⋆(x, y) P(y, y))
                   ≥ (1/2) (π(x) P(x, y) + π(y) P(y, x)).

Multiplying by (1/2)(f(x) − f(y))^2 and then summing over all x, y ∈ X, we obtain

    E_{P⋆P}(f) ≥ E_P(f).

Since this is true for all f : X → R, we can safely conclude that γ(P⋆P) ≥ γ(P).
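On a reversible example the contraction of Lemma 22 can be checked directly. For the lazy walk on Z/nZ we have P⋆ = P and Sp(P) ⊆ [0, 1], so γ(P⋆P) = 1 − λ_2(P)^2 with λ_2(P) = (1 + cos(2π/n))/2. A sketch under these (standard, but here assumed) spectral facts:

```python
import math, random

n = 8
pi = 1 / n                                # uniform stationary law
lam2 = (1 + math.cos(2 * math.pi / n)) / 2
rate = 1 - lam2 ** 2                      # gamma(P*P) = 1 - lambda_2(P)^2 here

def apply_P(f):
    # Lazy walk on Z/nZ: (Pf)(x) = f(x)/2 + f(x+1)/4 + f(x-1)/4
    return [0.5 * f[x] + 0.25 * f[(x + 1) % n] + 0.25 * f[(x - 1) % n]
            for x in range(n)]

def variance(f):
    m = sum(f) / n
    return sum(pi * (v - m) ** 2 for v in f)

# Var(Pf) <= (1 - gamma(P*P)) Var(f) for random observables
random.seed(1)
ok = all(
    variance(apply_P(f)) <= (1 - rate) * variance(f) + 1e-12
    for f in ([random.gauss(0, 1) for _ in range(n)] for _ in range(200))
)
print(ok)
```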

The above lemma shows that the variance of any observable decays exponentially fast under
the repeated action of P , at rate γ(P⋆P). This answers the first question raised at the
beginning of the chapter. The following result answers the second question, by showing that
1/γ(P) plays the role of a relaxation time, without requiring reversibility.

Theorem 10 (Poincaré bound). If P is lazy, then for all ε ∈ (0, 1),

    t_rel(P) ≤ 2/γ(P)   and   t_mix^{(ε)}(P) ≤ (2/γ(P)) log(1/(2ε√π⋆)).

Proof. Our starting point is the Cauchy-Schwarz bound, which we recall here:

    d_tv(P^t(x, ·), π) ≤ (1/2) ∥ P^t(x, ·)/π(·) − 1 ∥.

In the reversible case, the existence of an orthonormal basis of eigenfunctions of P had
allowed us to prove exponential decay of the right-hand side. Without reversibility, we no
longer have an orthonormal basis of eigenfunctions at our disposal, but we can write

    P^t(x, y)/π(y) = P^{⋆t}(y, x)/π(x) = P^{⋆t} f_x(y),

where we have introduced the observable f_x : y ↦ δ_x(y)/π(x). Since πP^{⋆t} f_x = πf_x = 1, we see that

    ∥ P^t(x, ·)/π(·) − 1 ∥^2 = Var(P^{⋆t} f_x),

and the right-hand side can be estimated using Lemma 22 and Remark 25:

    Var(P^{⋆t} f_x) ≤ Var(f_x) (1 − γ(P⋆))^t
                   = (1/π(x) − 1) (1 − γ(P))^t
                   ≤ e^{−γ(P)t}/π⋆.

Putting things together leads to the following conclusion, which implies the two claims:

    ∀t ∈ N,   D_P(t) ≤ e^{−γ(P)t/2} / (2√π⋆).

Remark 27 (Comparison with Theorem 4). A careful inspection reveals that the above
argument actually holds with γ(P P⋆) instead of γ(P), without the need for laziness.
When P is reversible, we have 1 − γ(P P⋆) = λ⋆^2(P), and we recover Theorem 4.
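The conclusion of the proof, D_P(t) ≤ e^{−γ(P)t/2}/(2√π⋆), can be verified numerically on a small lazy chain. The sketch below uses the lazy walk on Z/6Z, for which γ(P) = (1 − cos(2π/n))/2 is known in closed form (an assumed spectral fact), and compares the worst-case total variation distance with the bound at every time step:

```python
import math

n = 6
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    P[x][x] = 0.5                          # lazy walk on Z/nZ
    P[x][(x + 1) % n] += 0.25
    P[x][(x - 1) % n] += 0.25
pi = [1 / n] * n                           # uniform stationary law
gamma = (1 - math.cos(2 * math.pi / n)) / 2

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

ok = True
Pt = [row[:] for row in P]                 # P^t, starting at t = 1
for t in range(1, 40):
    worst = max(0.5 * sum(abs(Pt[x][y] - pi[y]) for y in range(n))
                for x in range(n))         # D_P(t) = max_x d_tv(P^t(x,.), pi)
    bound = math.exp(-gamma * t / 2) / (2 * math.sqrt(min(pi)))
    ok = ok and worst <= bound + 1e-12
    Pt = matmul(Pt, P)
print(ok)
```

Here the true distance decays at rate 1 − λ_2(P), so the Poincaré bound holds with room to spare, as expected from a worst-case estimate.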

5.2 Cheeger inequalities


In a previous chapter, we have used the geometric notion of conductance to provide a good
lower bound on mixing times. As we will now see, this quantity also gives a two-sided control
on the Poincaré constant. This result is fundamental, because it creates a bridge between a
geometric notion (conductance) and a spectral one (the Poincaré constant).

Theorem 11 (Cheeger’s inequalities). For any transition kernel P , we have

    Φ^2(P)/2 ≤ γ(P) ≤ 2Φ(P).

Proof. In view of Remarks 25 and 19, we may suppose that P is reversible. Fix A ⊆ X.
The observable f = 1_A satisfies Var(f) = π(A)π(A^c) and E_P(f) = ⃗π(A × A^c). Consequently,

    Φ(A) = π(A^c) E_P(f)/Var(f).

The claimed upper bound follows immediately. For the lower bound, we can assume that
λ_2(P) ≥ 0, because Φ(P) ∈ [0, 1]. Consider a non-negative observable f : X → R_+ with
π(f = 0) ≥ 1/2. For each t ∈ R_+, we may choose A = {f > t} in the definition of Φ(P) to get

    Φ(P) π(f > t) ≤ Σ_{x,y∈X} ⃗π(x, y) 1_{(f(y) ≤ t < f(x))}.

Integrating w.r.t. t and interchanging the sum and integral, we obtain

    Φ(P) πf ≤ (1/2) Σ_{x,y∈X} ⃗π(x, y) |f(y) − f(x)|.

We now replace f by f^2, and apply the Cauchy-Schwarz inequality:

    Φ^2(P) ∥f∥^4 ≤ (1/4) ( Σ_{x,y∈X} ⃗π(x, y) (f(y) − f(x))^2 ) ( Σ_{x,y∈X} ⃗π(x, y) (f(y) + f(x))^2 ).

Expanding the squares, we see that the right-hand side simplifies to ∥f∥^4 − ⟨f, P f⟩^2, so that

    Φ^2(P) ∥f∥^4 ≤ ∥f∥^4 − ⟨f, P f⟩^2.                    (48)

To conclude, we would like to take f = ϕ_2, but our initial assumption π(f = 0) ≥ 1/2 has no
reason to be satisfied. Let us instead choose f = max(ϕ_2, 0), which verifies the assumption
upon changing ϕ_2 to −ϕ_2 if necessary. Since f ≥ 0 and f ≥ ϕ_2, we have P f ≥ 0 and
P f ≥ λ_2 ϕ_2, so P f ≥ λ_2 f. Thus, ⟨f, P f⟩ ≥ λ_2 ∥f∥^2, and (48) easily implies the claim.

By combining Theorems 10 and 11, we obtain the following important upper bound.

Corollary 5 (Conductance upper-bound). For any lazy kernel P and any ε ∈ (0, 1),

    t_mix^{(ε)}(P) ≤ (4/Φ^2(P)) ln(1/(2ε√π⋆)).

Example 9 (Hypercube). Consider lazy random walk on the hypercube. The set A :=
{x ∈ {0, 1}^n : x_1 = 0} satisfies π(A) = 1/2 and ⃗π(A × A^c) = 1/(4n), so that Φ(A) = 1/(2n).
We deduce that Φ(P) ≤ 1/(2n). On the other hand, we know that γ(P) = 1 − λ⋆(P) = 1/n, so
that there is equality in Cheeger’s upper bound. In particular,

    Φ(P) = 1/(2n).
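For a small dimension, this can be confirmed by exhaustive search. The sketch below takes the conductance of a set A to be ⃗π(A × A^c)/π(A), minimized over sets with π(A) ≤ 1/2 (the convention the example implicitly uses), and checks both Φ(P) = 1/(2n) and the Cheeger equality γ(P) = 2Φ(P) for n = 3:

```python
from itertools import combinations

n = 3                                      # dimension; 2^8 subsets stay enumerable
N = 2 ** n
pi = 1 / N                                 # uniform stationary law

def P(x, y):
    # Lazy walk: stay put w.p. 1/2, else flip a uniformly chosen coordinate
    if x == y:
        return 0.5
    if bin(x ^ y).count("1") == 1:
        return 1 / (2 * n)
    return 0.0

# Phi(P) = min over A with pi(A) <= 1/2 of pi_vec(A x A^c)/pi(A)
Phi = min(
    sum(pi * P(x, y) for x in A for y in range(N) if y not in A) / (pi * len(A))
    for r in range(1, N // 2 + 1) for A in combinations(range(N), r)
)
print(abs(Phi - 1 / (2 * n)) < 1e-12)      # Phi(P) = 1/(2n), as claimed
```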

Example 10 (Cycle). On Z/nZ, any set A of size k ≤ n/2 contains at least two
boundary edges, so Φ(A) ≥ 1/(2k). Moreover, there is equality when A = {1, . . . , k}. Thus,

    Φ(P) = 1/(2⌊n/2⌋).

We know that γ(P) = 1 − λ⋆(P) ∼ π^2/n^2, so Cheeger’s lower bound is sharp up to prefactors.
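The same kind of exhaustive check works on a small cycle. The sketch below (n = 6; same convention Φ(P) = min over π(A) ≤ 1/2 of ⃗π(A × A^c)/π(A), together with the closed-form γ(P) = (1 − cos(2π/n))/2 for the lazy walk, an assumed spectral fact) verifies the claimed value of Φ(P) and the Cheeger sandwich:

```python
import math
from itertools import combinations

n = 6
pi = 1 / n                                     # uniform stationary law
gamma = (1 - math.cos(2 * math.pi / n)) / 2    # 1 - lambda_2 for the lazy walk

def P(x, y):
    if x == y:
        return 0.5                             # lazy walk on Z/nZ
    if (x - y) % n in (1, n - 1):
        return 0.25
    return 0.0

Phi = min(
    sum(pi * P(x, y) for x in A for y in range(n) if y not in A) / (pi * len(A))
    for r in range(1, n // 2 + 1) for A in combinations(range(n), r)
)
print(abs(Phi - 1 / (2 * (n // 2))) < 1e-12,   # Phi(P) = 1/(2*floor(n/2))
      Phi ** 2 / 2 - 1e-12 <= gamma <= 2 * Phi + 1e-12)  # Cheeger sandwich
```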

Example 11 (Universal bound). For lazy random walk on an undirected graph G =
(V, E), the crude bound Φ(P) ≥ 2⃗π⋆ gives t_rel(P) ≤ 8|E|^2, hence t_mix(P) = O(|E|^2 log |E|).

Example 12 (Expanders). A sequence of graphs (Gn )n≥1 is an expander sequence if

1. the number of vertices diverges;

2. the degrees are uniformly bounded;

3. the conductances are uniformly bounded from below.

The lazy random walk on Gn satisfies tmix (Pn ) = Θ(log |Gn |), by Corollaries 2 and 5.

5.3 Comparison principle


The Poincaré constant γ(P ) was defined by comparing the Dirichlet form of P to that of
the idealized kernel Π. Replacing the latter by an arbitrary kernel Q is the starting point of
a very powerful comparison theory for Markov chains: suppose P is a sophisticated chain,
whose mixing time is delicate to estimate directly, and consider a much simpler chain Q
which has the same stationary law π. Then, one can transfer quantitative results from Q to
P , by paying a “price” γ(P : Q) that depends on how close Q is to P .

Definition 20 (Comparison constant). Given two irreducible transition kernels P, Q on
the same state space X , their comparison constant is defined as

    γ(P : Q) := inf { E_P(f)/E_Q(f) : f : X → R is not constant }.

Note in particular that γ(P : Π) = γ(P ), by Remark 24. The motivation behind this
definition is contained in the following elementary but fruitful observation.

Lemma 23 (Comparison principle). If P and Q have the same stationary law, then

    γ(P) ≥ γ(Q) γ(P : Q).

Thus, any lower bound on γ(Q) yields a lower bound on γ(P), at a price γ(P : Q).

Proof. For any non-constant observable f : X → R, we have by definition

    E_P(f)/Var(f) = (E_P(f)/E_Q(f)) × (E_Q(f)/Var(f)) ≥ γ(P : Q) × E_Q(f)/Var(f).

Taking an infimum over all possible choices for f concludes the proof.
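Lemma 23 can be checked exactly on a pair of circulant chains, for which all three constants are computable: both Dirichlet forms are diagonal in the Fourier basis, so each infimum reduces to a minimum over the nonconstant modes. The sketch below (an illustrative choice, not from the text: the lazy ±1 and lazy ±2 walks on Z/7Z, which share the uniform stationary law) verifies γ(P) ≥ γ(Q)γ(P : Q):

```python
import math

n = 7                                      # odd, so the +-2 walk is irreducible

def gaps(step):
    # 1 - lambda_k for the lazy walk with increments +-step on Z/nZ,
    # listed over the nonconstant Fourier modes k = 1, ..., n-1
    return [(1 - math.cos(2 * math.pi * k * step / n)) / 2 for k in range(1, n)]

gP, gQ = gaps(1), gaps(2)
gamma_P = min(gP)                          # Poincare constant of P
gamma_Q = min(gQ)                          # Poincare constant of Q
# Both Dirichlet forms are diagonal in the same (Fourier) basis, so the
# comparison constant is the minimal ratio of spectral gaps over the modes.
gamma_PQ = min(p / q for p, q in zip(gP, gQ))

print(gamma_P >= gamma_Q * gamma_PQ - 1e-12)   # Lemma 23
```

Here γ(P : Q) < 1, so the comparison bound is strictly weaker than the exact value of γ(P): this is the “price” mentioned above.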

Remark 28 (Extension). The Courant-Fischer-Weyl min-max principle expresses the
k−th largest eigenvalue of any compact self-adjoint operator A on a Hilbert space H as

    λ_k(A) = max_{dim(F)=k} min_{f∈F} ⟨Af, f⟩/⟨f, f⟩,

where the maximum ranges over all k−dimensional subspaces F ⊆ H. Applying this to
A = I − (P + P⋆)/2 on H = L^2(X , π), we obtain the global comparison principle

    1 − λ_k((P + P⋆)/2) ≥ γ(P : Q) (1 − λ_k((Q + Q⋆)/2)),

for all 1 ≤ k ≤ |X |, of which the above lemma is only the special case k = 2.

5.4 Distinguished paths


We now present a very robust technique to establish a lower bound on the comparison
constant γ(P : Q) between two irreducible chains P and Q on the same state space X .

Theorem 12 (Distinguished paths). Let P, Q be kernels on X with supports E_P, E_Q
and weights ⃗π_P, ⃗π_Q. For each (x, y) ∈ E_Q, let Υ_{x,y} be a path from x to y in G_P. Then,

    1/γ(P : Q) ≤ max_{e∈E_P} c(e),   where   c(e) := (1/⃗π_P(e)) Σ_{(x,y)∈E_Q} ⃗π_Q(x, y) |Υ_{x,y}| 1_{(e∈Υ_{x,y})}.

Here |Υ| denotes the length of the path Υ, and e ∈ Υ means that e is traversed by Υ.

Proof. Fix an observable f : X → R. Writing ∇f(x, y) := f(y) − f(x), we have

    E_Q(f) = (1/2) Σ_{(x,y)∈E_Q} ⃗π_Q(x, y) |∇f(x, y)|^2.

Now for each (x, y) ∈ E_Q, we may use the fact that Υ_{x,y} is a path from x to y to write

    |∇f(x, y)|^2 = ( Σ_{e∈Υ_{x,y}} ∇f(e) )^2 ≤ |Υ_{x,y}| Σ_{e∈Υ_{x,y}} |∇f(e)|^2,

by the Cauchy-Schwarz inequality. Inserting above and re-arranging, we obtain

    E_Q(f) ≤ (1/2) Σ_{e∈E_P} ⃗π_P(e) |∇f(e)|^2 c(e)
           ≤ E_P(f) max_{e∈E_P} c(e).

Since this inequality holds for all observables f : X → R, the result follows.

The simplest and most natural choice for Q is the ideal matrix Π that mixes in one step:

Corollary 6 (Rank-one case Q = Π). Let P be a transition kernel on X and for each
(x, y) ∈ X × X , let Υ_{x,y} be a path from x to y in G_P. Then,

    1/γ(P) ≤ max_{e∈E_P} c(e),   where   c(e) := (1/⃗π(e)) Σ_{(x,y)∈X×X} π(x)π(y) |Υ_{x,y}| 1_{(e∈Υ_{x,y})}.

Remark 29 (Congestion). The quantity c(e) is called the congestion induced at the edge
e by the collection of paths (Υ_{x,y})_{(x,y)∈E_Q}. The challenge lies in choosing paths that will
make the maximal congestion max_{e∈E_P} c(e) as low as possible. Note that we must have

    max_{e∈E_P} c(e) ≥ Σ_{e∈E_P} ⃗π_P(e) c(e)
                     = Σ_{(x,y)∈E_Q} ⃗π_Q(x, y) |Υ_{x,y}|^2
                     ≥ Σ_{(x,y)∈E_Q} ⃗π_Q(x, y) dist^2(x, y).

Thus, the best congestion we can hope for is the average quadratic distance (in G_P)
between two consecutive states of MC(X , Q, π). To achieve this optimum (at least
approximately), we need our paths (Υ_{x,y})_{(x,y)∈E_Q} to be (close to) shortest paths in G_P,
and the resulting congestion to be (close to) uniform across all edges, as in the next example.

Example 13 (Square grid). Let P_n be the transition matrix for lazy simple random walk
on an n × n grid: the state space is X = [n] × [n], and two vertices x = (x_1, x_2) and
y = (y_1, y_2) are neighbors if |x_1 − y_1| + |x_2 − y_2| = 1. For x, y ∈ X , consider the path
Υ_{x,y} of minimum length that goes from x to y first horizontally, then vertically. The
congestion induced at any edge is easily seen to be of order n^2, so Corollary 6 gives

    t_rel(P_n) = O(n^2),

which is actually the correct order of magnitude.
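The congestion computation is easy to carry out exactly on the cycle as well. The sketch below builds the (unique, since n is odd) shortest paths on Z/7Z, computes the congestion of every directed edge, and checks Corollary 6 against the closed-form γ(P) = (1 − cos(2π/n))/2 of the lazy walk (an assumed spectral fact):

```python
import math

n = 7                                    # odd, so shortest paths in the cycle are unique
pi = 1 / n                               # uniform stationary law of the lazy walk
gamma = (1 - math.cos(2 * math.pi / n)) / 2   # Poincare constant of the lazy walk
edge_pi = pi * 0.25                      # weight of each directed nearest-neighbour edge

cong = {}                                # congestion c(e) of each directed edge e = (u, v)
for x in range(n):
    for y in range(n):
        if x == y:
            continue                     # |Upsilon_{x,x}| = 0: no contribution
        d = (y - x) % n
        step, L = (1, d) if d <= n // 2 else (-1, n - d)
        u = x
        for _ in range(L):               # walk the shortest path, charging each edge
            v = (u + step) % n
            cong[(u, v)] = cong.get((u, v), 0.0) + pi * pi * L / edge_pi
            u = v

cmax = max(cong.values())
print(1 / gamma <= cmax + 1e-9)          # Corollary 6: 1/gamma(P) <= max_e c(e)
```

By symmetry the congestion is the same on every edge (here 8 for n = 7), illustrating the "uniform congestion" ideal of Remark 29.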

Example 14 (Universal bound). One can always choose Υ_{x,y} to be a shortest path from
x to y, and use the crude bound |Υ_{x,y}| ≤ diam(P) to deduce that

    1/γ(P) ≤ diam(P)/⃗π⋆.

In particular, for lazy simple random walk on a graph G = (V, E), we have

    t_rel(P) = O(diam(G)|E|)   and   t_mix(P) = O(diam(G)|E| log |E|).

The two inequalities appearing in Remark 29 show that our distinguished paths (Υx,y )(x,y)∈EQ
should not only be as short as possible, but also as spread-out as possible across the graph,
so that the congestion is fairly balanced among the edges. A simple but powerful idea for
reducing the imbalance consists in letting the paths (Υx,y )(x,y)∈EQ be random.

Theorem 13 (Random paths). Let P, Q be kernels on X with supports E_P, E_Q and
weights ⃗π_P, ⃗π_Q. For (x, y) ∈ E_Q, let Υ_{x,y} be a random path from x to y in G_P. Then,

    1/γ(P : Q) ≤ max_{e∈E_P} c(e),   where   c(e) := (1/⃗π_P(e)) Σ_{(x,y)∈E_Q} ⃗π_Q(x, y) E[|Υ_{x,y}| 1_{(e∈Υ_{x,y})}].

Proof. The argument is exactly the same as in the case of deterministic paths, except that
we take expectations just before the very last inequality.

Here is an example that illustrates the advantage of random paths over deterministic ones.

Example 15 (Lazy random walk on K_{2,n}). The graph K_{2,n} has vertex set X = {L, R} ∪ [n],
two vertices being neighbors if one is in {L, R} and the other in [n]. Any deterministic
path Υ_{L,R} from L to R contributes to the congestion of e ∈ Υ_{L,R} by at least

    c(e) ≥ (1/⃗π(e)) π(L)π(R) |Υ_{L,R}| = n.

On the other hand, when the distinguished path Υ_{x,y} is chosen uniformly at random over
all shortest paths from x to y, the congestion is only c(e) = 2 − 1/n for every edge e.

We conclude with an application to chains with a high amount of symmetry.

Definition 21 (Transitivity). An automorphism of P is a permutation ϕ on X so that

    ∀x, y ∈ X ,   P(ϕ(x), ϕ(y)) = P(x, y).

The transition kernel P is called transitive if for any two states x, x′ ∈ X , there is an
automorphism of P that takes x to x′. It is called arc-transitive if for any two edges
(x, y), (x′, y′) ∈ E, there is an automorphism that takes x to x′ and y to y′.

Intuitively, a chain is transitive if “all states play the same role”. Examples include all
random walks on groups (take ϕ(z) = z ⋆ x^{−1} ⋆ x′). Arc-transitivity is the stronger
requirement that “all edges play the same role”. Examples include the lazy simple random
walk on the hypercube, the cycle, and more generally discrete tori of the form Z_n^d for any d, n ∈ N.

Corollary 7 (Poincaré inequality for symmetric chains).

(i) If P is transitive, then 1/γ(P) ≤ diam^2(P)/p⋆, where p⋆ = min_{(x,y)∈E} P(x, y).

(ii) If P is arc-transitive, then 1/γ(P) ≤ diam^2(P)/(1 − α), where α = min_{x∈X} P(x, x).

Proof. We use the method of random paths to compare P with the rank-one matrix Q = Π.
For each x, y ∈ X , we take Υ_{x,y} to be a uniformly chosen shortest path between x and y.
In view of Remark 29, we know that the resulting congestion satisfies

    Σ_{e∈E} ⃗π(e) c(e) = Σ_{x,y∈X} π(x)π(y) dist^2(x, y) ≤ diam^2(P).

When P is arc-transitive, the quantities ⃗π(e) and c(e) do not depend on e ∈ E. Consequently,
the left-hand side equals ⃗π(E) × max_{e∈E} c(e), and the second bound follows because
⃗π(E) = 1 − Σ_{x∈X} ⃗π(x, x) = 1 − α. If P is only transitive, then we can write

    Σ_{e∈E} ⃗π(e) c(e) = Σ_{x∈X} π(x) Σ_{y≠x} P(x, y) c(x, y) ≥ p⋆ max_{e∈E} c(e),

because the quantity π(x) Σ_{y≠x} P(x, y) c(x, y) does not depend on x.
