Bayesian Inverse Problems Overview

The document outlines a lecture on inverse problems, focusing on Bayes' theorem and its application in estimating signals from noisy observations. It discusses the modeling of prior and likelihood distributions, and presents a case study on one-dimensional deconvolution with examples of posterior distributions and Bayesian estimators. The document also emphasizes the challenges of high-dimensional problems and the importance of credible sets in quantifying uncertainty.
Inverse Problems

Sommersemester 2022

Vesa Kaarnioja
[Link]@[Link]

FU Berlin, FB Mathematik und Informatik

Sixth lecture, May 30, 2022


Public holiday next Monday (June 6)

Monday June 6 (next week) is a public holiday.


→ There will be no lecture or exercise session on Monday June 6.
→ The next lecture and exercise session will be on Monday June 13
(two weeks from today)! The deadline for the sixth exercise sheet will also
be on Monday June 13.
Recap: Bayes’ formula for inverse problems
We are interested in the inverse problem of solving x ∈ Rd from
y = F (x) + η,
where y ∈ Rk is the measurement vector, F : Rd → Rk the forward
mapping, and η ∈ Rk is noise. We model x, y , and η as random variables.
Then we have:
Theorem (Bayes’ theorem)
We assume:
The noise η has the probability density ν on Rk .
The parameter x has the probability density π on Rd .
The random variables x and η are independent.
Then the likelihood is P(y|x) = ν(y − F(x)) and we can write

π^y(x) := P(x|y) = P(y|x)P(x) / P(y) =: ν(y − F(x))π(x) / Z(y),

provided that Z(y) := ∫_{R^d} ν(y − F(x))π(x) dx > 0.
Bayes' formula:

π^y(x) = ν(y − F(x))π(x) / Z(y).

The prior model π(x) describes a priori information. It should assign high probability to objects x which are typical in light of a priori information, and low probability to unexpected x.
The likelihood model P(y|x) = ν(y − F(x)) processes measurement information. It gives low probability to objects that produce simulated data very different from the measured data.
The number Z (y ) can be seen as a normalization constant.
The posterior distribution π y (x) = P(x|y ) represents the updated
knowledge about the parameter of interest x, given the evidence y .
Since the normalization constant Z(y) is often not of interest in our considerations, we frequently write Bayes' formula as

π^y(x) ∝ ν(y − F(x))π(x),

where the symbol ∝ means equality up to a constant factor.


Case study: one-dimensional deconvolution

As motivation†, suppose that we are interested in estimating a signal f : [0, 1] → R from noisy, blurred observations modeled as

y_i = y(s_i) = ∫₀¹ K(s_i, t) f(t) dt + η_i,   i ∈ {1, . . . , k},

where the blurring kernel is

K(s, t) = exp(−(s − t)² / (2ω²)),   ω = 0.5,

and η ∈ R^k is measurement noise.


† We will consider the so-called "linear-Gaussian setting", as well as computational techniques for sampling posterior densities, in more detail in a couple of weeks. Specifically, we will not consider the question of how to draw samples from the posterior density today; we will revisit this question at a later time.
Discrete model
Midpoint rule:

y_i = ∫₀¹ K(s_i, t) f(t) dt + η_i ≈ (1/d) Σ_{j=1}^{d} K(s_i, t_j) x_j + η_i,

where t_j = j/d − 1/(2d) and x_j = f(t_j) for j ∈ {1, . . . , d}.

If we have s_i = i/k − 1/(2k) for i ∈ {1, . . . , k}, then we have the discrete linear model

y = Ax + η,   where A_{i,j} = (1/d) K(s_i, t_j).
To employ the Bayesian approach, we treat y, η, and x as random variables. We assume that η is Gaussian noise with covariance σ²I,

η ∼ N(0, σ²I),   ν(η) ∝ exp(−‖η‖² / (2σ²)).

The likelihood is then given by

P(y|x) = ν(y − Ax) ∝ exp(−‖y − Ax‖² / (2σ²)).

Next, we have to choose a prior distribution for the unknown. Assume that we know that x(0) = x(1) = 0 and that x is quite smooth, that is, the value of x(t) at a point is more or less the same as at its neighbors. We will then model the unknown as

x_j = (1/2)(x_{j−1} + x_{j+1}) + W_j,   j = 1, . . . , d,   (1)

where the term W_j follows a Gaussian distribution N(0, γ²) and we use the convention x₀ = x_{d+1} = 0, reflecting the boundary conditions.
The variance γ² determines how much the reconstructed function x departs from the smoothness model x_j = (1/2)(x_{j−1} + x_{j+1}). We can write (1) as

Lx = W,   where L := (1/2) ·
    [  2  −1                 ]
    [ −1   2  −1             ]
    [      ⋱   ⋱   ⋱         ]
    [         −1   2  −1     ]
    [             −1   2     ]

This leads to the so-called smoothness prior

π(x) ∝ exp(−‖Lx‖² / (2γ²)).

Using Bayes' formula, we get the posterior distribution

π^y(x) ∝ exp(−‖y − Ax‖² / (2σ²) − ‖Lx‖² / (2γ²)).
For the numerical experiment, we simulate measurements using the (smooth) ground truth signal

f(t) = 8t³ − 16t² + 8t,

which satisfies f(0) = f(1) = 0. The measurements are contaminated with 10% relative noise (σ ≈ 0.0618) and we set d = k = 120.
Let us draw samples from the prior and posterior. As comparison, we also consider a posterior obtained using the white noise prior, i.e.,

π₀^y(x) ∝ exp(−‖y − Ax‖² / (2σ²)) π_{pr,0}(x),   π_{pr,0}(x) ∝ exp(−‖x‖² / (2γ²)).
Remark: When we simulate the measurement data, it is important to
avoid the inverse crime. One way to do this is to generate the
measurement data using a denser grid and then interpolate the forward
solution onto a coarser computational grid, which is actually used to
compute the reconstruction. (See week6.m on the course webpage!)
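In this linear-Gaussian setting the posterior is itself Gaussian, with covariance C = (AᵀA/σ² + LᵀL/γ²)⁻¹ and mean m = C Aᵀy/σ², so the posterior mean and samples can be computed directly. The sketch below is our own NumPy reconstruction of the experiment (not the course's week6.m), uses an arbitrary seed, and for brevity commits the inverse crime by simulating data on the same grid:

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 120
omega, sigma, gamma = 0.5, 0.0618, 0.0032

t = (np.arange(1, d + 1) - 0.5) / d               # quadrature nodes t_j
s = (np.arange(1, k + 1) - 0.5) / k               # observation points s_i
A = np.exp(-(s[:, None] - t[None, :]) ** 2 / (2 * omega ** 2)) / d

# Smoothness prior operator L = (1/2) tridiag(-1, 2, -1)
L = 0.5 * (2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1))

f = 8 * t ** 3 - 16 * t ** 2 + 8 * t              # ground truth signal
y = A @ f + sigma * rng.standard_normal(k)        # simulated noisy data

# Gaussian posterior: C = (A^T A/sigma^2 + L^T L/gamma^2)^{-1}, m = C A^T y/sigma^2
P = A.T @ A / sigma ** 2 + L.T @ L / gamma ** 2
C = np.linalg.inv(P)
C = 0.5 * (C + C.T)                               # symmetrize against round-off
m = C @ (A.T @ y) / sigma ** 2                    # posterior (= conditional) mean

# Posterior samples via a Cholesky factor of the covariance
samples = m[:, None] + np.linalg.cholesky(C) @ rng.standard_normal((d, 5))
```

The direct inversion of P is affordable here only because d = 120; for large d one would factorize P instead.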
[Figure: six panels of prior samples on [0, 1] — white noise prior with γ = 0.2, 0.6864, 2 (left column) and smoothness prior with γ = 0.0005, 0.0032, 0.01 (right column).]
Samples drawn from the white noise prior and the smoothness prior for several values of γ.
[Figure: six panels of posterior samples on [0, 1] — posterior with white noise prior for γ = 0.2, 0.6864, 2 and with smoothness prior for γ = 0.0005, 0.0032, 0.01; each panel also shows the ground truth and the posterior mean.]
Samples drawn from the posterior corresponding to both the white noise prior and the
smoothness prior for several values of γ. We also plot the ground truth solution and the posterior
mean. The solutions in the middle row roughly satisfy the Morozov discrepancy principle.
As the previous example illustrates, many practical problems tend to be high-dimensional. The measurement model for the discretized deconvolution example is

y = Ax + η,

with A ∈ R^{k×d}, x ∈ R^d, and y, η ∈ R^k, where k is the number of points s₁, . . . , s_k at which we observe the signal and d is the number of quadrature points t₁, . . . , t_d discretizing the unknown quantity x.
A grid with only k = d = 120 points already corresponds to a
120-dimensional posterior, so visualization of the posterior density is highly
nontrivial.
In practice, we are often interested in various point estimates, statistics,
samples, or the spread of the posterior distribution.
Bayesian estimators
The posterior distribution can be used to define estimators for the conditional random variable x|y ∼ π^y(x). In general, an estimator x̂ is any function of the data y. The estimate x̂(y) is itself an R^d-valued random variable whose properties give information about the usefulness and quality of the estimator.
Bayesian estimators are those defined via the posterior distribution π^y. We present the two most prominent ones. The conditional mean (CM) estimator is defined as the mean

x̂_CM(y) = E[x|y] = ∫_{R^d} u π^y(u) du

of the posterior distribution.

The maximum a posteriori (MAP) estimator is defined as the mode

x̂_MAP(y) = arg max_{u ∈ R^d} π^y(u)

of the posterior distribution (if a unique mode exists).


One way to estimate spread is via Bayesian credible sets. A level 1 − α credible set C_α with α ∈ (0, 1) satisfies

P(x ∈ C_α | y) = ∫_{C_α} π^y(u) du = 1 − α.

For small α, it is a region that contains a large fraction of the posterior mass.
Another way of quantifying uncertainty is to consider the problem

y† = F(x†) + η,

where x† is thought of as a deterministic "true" value of the unknown.

We would then like to find random sets C_α that frequently contain the truth x†, that is,

P(x† ∈ C_α) = 1 − α.

Such a set C_α is called a frequentist confidence region of level 1 − α.
Deconvolution example: posteriors with 2σ credibility envelopes.
[Figure: six panels of posterior samples with 2σ credibility envelopes — posterior with white noise prior for γ = 0.2, 0.6864, 2 and with smoothness prior for γ = 0.0005, 0.0032, 0.01; each panel shows the ground truth and the posterior mean ± 2σ.]
Example. Assume that x ∈ R and that the posterior density is given by

π^y(u) = (c/σ₁) φ(u/σ₁) + ((1 − c)/σ₂) φ((u − 1)/σ₂),

where c ∈ (0, 1), σ₁, σ₂ > 0, and φ is the density of the standard normal distribution, φ(u) = (1/√(2π)) exp(−u²/2). In this case,

x̂_CM = 1 − c   and   x̂_MAP = 0 if c/σ₁ > (1 − c)/σ₂, while x̂_MAP = 1 if c/σ₁ < (1 − c)/σ₂.

If c = 1/2 and σ₁, σ₂ are small, the probability that x takes values near x̂_CM is small. On the other hand, if σ₁ = cσ₂, then c/σ₁ = 1/σ₂ > (1 − c)/σ₂, so that x̂_MAP = 0. If c is small, this is, however, a bad estimate for x, since the probability for x to take values near 0 is small. Last of all, we notice that when the conditional mean gives a poor estimate, this is reflected in a larger posterior variance

σ² = ∫_{−∞}^{∞} (u − x̂_CM)² π^y(u) du.

We cannot say that one estimator is better than the other in all applications.

Left: the density with σ₁ = 0.08, σ₂ = 0.04, and c = 1/2. The CM estimate represents the distribution poorly. Notice that when the CM gives a poor estimate, this is reflected in a wider variance (1 standard deviation is depicted as a red line). Right: the density with σ₁ = 0.001, σ₂ = 0.1, and c = 0.01. The MAP gives a poor estimate since it lies in an unlikely part of the computational domain.
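The claims in the example are easy to verify numerically on a grid. The parameter values below are those of the right-hand figure (c = 0.01, σ₁ = 0.001, σ₂ = 0.1); the grid bounds and resolution are our own choices:

```python
import numpy as np

c, s1, s2 = 0.01, 0.001, 0.1                       # right-hand figure parameters
u = np.linspace(-0.5, 1.5, 400001)                 # fine grid resolving the narrow mode
du = u[1] - u[0]

def phi(z):
    """Standard normal density."""
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

post = c / s1 * phi(u / s1) + (1 - c) / s2 * phi((u - 1) / s2)

x_cm = np.sum(u * post) * du                       # conditional mean, exactly 1 - c
x_map = u[np.argmax(post)]                         # mode: 0, since c/s1 = 10 > (1-c)/s2 = 9.9
```

Here x_cm ≈ 0.99 = 1 − c, while x_map ≈ 0 even though only 1% of the posterior mass lies near 0 — precisely the situation criticized in the text.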
The maximum likelihood estimate

x̂_ML(y) = arg max_{u ∈ R^d} P(y|u)

answers the question: "which value of the unknown is most likely to produce the measured data?"

The ML estimate is a non-Bayesian estimate and, in the case of ill-posed inverse problems, often not useful. It is analogous to solving a classical inverse problem without regularization.
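For the deconvolution example this failure is easy to observe: the matrix A is severely ill-conditioned, so the unregularized least-squares ("ML") reconstruction amplifies the noise enormously. A small sketch of our own, with an arbitrary seed and again ignoring the inverse crime:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 120
t = (np.arange(1, d + 1) - 0.5) / d
A = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 0.5 ** 2)) / d   # blur matrix, omega = 0.5
f = 8 * t ** 3 - 16 * t ** 2 + 8 * t                               # ground truth
y = A @ f + 0.0618 * rng.standard_normal(d)                        # 10% relative noise

# Unregularized least-squares reconstruction (the ML estimate for Gaussian noise)
x_ml = np.linalg.lstsq(A, y, rcond=None)[0]
```

np.linalg.cond(A) is astronomically large here, and ‖x_ml‖ exceeds ‖f‖ by orders of magnitude: the data are fitted, but the reconstruction is useless without the regularizing effect of a prior.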
Well-posedness
Assume that the posterior density is given by

π^y(x) = (1/Z) g(x) π(x)

with likelihood g(x) and prior density π(x). Now consider an approximation

π_δ^y(x) = (1/Z_δ) g_δ(x) π(x)

resulting from an approximated likelihood g_δ(x). Such an approximation can result, for example, from an approximation F_δ of the forward operator F or from perturbed data y_δ.

The question is therefore:

does |g − g_δ| = O(δ) imply d(π^y, π_δ^y) = O(δ)

for small enough δ > 0 and some metric d(·, ·) on probability densities?
Well-posedness refers to the continuity of the method of obtaining the posterior distribution with respect to perturbations in the parameters. In practice, this could mean, for example, the following: if we have two measurements close to each other, are the corresponding posterior distributions close in some metric? Recall that ill-posed problems are generally discontinuous in this regard, i.e., without regularization, small differences in measurements can induce arbitrarily large differences in the reconstruction. Does the Bayesian approach then regularize the problem? The answer is yes, under certain assumptions on the modeling.
We will proceed to show that, under certain conditions, π^y and π_δ^y satisfy

d(π^y, π_δ^y) ≤ cδ

for δ small enough, some c > 0, and some metric d(·, ·) on probability densities.
To this end, we define two metrics for probability densities: the total
variation distance and the Hellinger distance.
Metrics for probability densities

We introduce the total variation distance and the Hellinger distance, both
of which have been used to show well-posedness results. Here, we will use
the Hellinger distance to establish the well-posedness of Bayesian inverse
problems.
Let π and π′ be the probability densities of two random variables with values in R^d. We define the total variation distance between π and π′ as

d_TV(π, π′) = (1/2) ∫_{R^d} |π(x) − π′(x)| dx = (1/2) ‖π − π′‖_{L¹},

and the Hellinger distance between π and π′ as

d_H(π, π′) = ((1/2) ∫_{R^d} (√π(x) − √π′(x))² dx)^{1/2} = (1/√2) ‖√π − √π′‖_{L²}.

The normalization constants are chosen in such a way that the largest
possible distance between two densities is one, as can be seen in the
following lemma.
Lemma
For any two probability densities π and π′,

0 ≤ d_TV(π, π′) ≤ 1   and   0 ≤ d_H(π, π′) ≤ 1.

Proof. The lower bounds follow immediately from the definitions of d_TV and d_H. It remains to prove the upper bounds. To this end, we estimate

d_TV(π, π′) = (1/2) ∫_{R^d} |π(x) − π′(x)| dx ≤ (1/2) ∫_{R^d} π(x) dx + (1/2) ∫_{R^d} π′(x) dx = 1

and

d_H(π, π′)² = (1/2) ∫_{R^d} (√π(x) − √π′(x))² dx
= (1/2) ∫_{R^d} (π(x) + π′(x) − 2√(π(x)π′(x))) dx
≤ (1/2) ∫_{R^d} (π(x) + π′(x)) dx = 1. □
In what follows, we will establish bounds between Hellinger and total
variation distance and show how both distances can be used to bound the
difference of expected values with respect to two different densities; these
results will be useful in subsequent lectures.
Lemma
For any two probability densities π and π′, the total variation and Hellinger distances are related by the inequalities

(1/√2) d_TV(π, π′) ≤ d_H(π, π′) ≤ √(d_TV(π, π′)).

Proof. Using the Cauchy–Schwarz inequality and (a + b)² ≤ 2a² + 2b², we obtain

d_TV(π, π′) = (1/2) ∫_{R^d} |√π(x) − √π′(x)| · |√π(x) + √π′(x)| dx
≤ ((1/2) ∫_{R^d} (√π(x) − √π′(x))² dx)^{1/2} ((1/2) ∫_{R^d} (√π(x) + √π′(x))² dx)^{1/2}
≤ d_H(π, π′) ((1/2) ∫_{R^d} (2π(x) + 2π′(x)) dx)^{1/2} = √2 d_H(π, π′).

Notice that |√π(x) − √π′(x)| ≤ √π(x) + √π′(x), since √π(x), √π′(x) ≥ 0. Thus we have

d_H(π, π′)² = (1/2) ∫_{R^d} (√π(x) − √π′(x))² dx
≤ (1/2) ∫_{R^d} |√π(x) − √π′(x)| · (√π(x) + √π′(x)) dx
= (1/2) ∫_{R^d} |π(x) − π′(x)| dx = d_TV(π, π′). □
The following lemmata show that if two densities are close in total
variation or Hellinger distance, expectations computed with respect to
both densities are also close.
Lemma
Let f be a real-valued function on R^d such that E^π[f²] + E^{π′}[f²] =: f₂² < ∞. Then

|E^π[f] − E^{π′}[f]| ≤ 2 f₂ d_H(π, π′).   (2)

Proof. We estimate

|E^π[f] − E^{π′}[f]| = |∫_{R^d} f(x) (π(x) − π′(x)) dx|
= |∫_{R^d} f(x) (√π(x) − √π′(x)) (√π(x) + √π′(x)) dx|
≤ ((1/2) ∫_{R^d} (√π(x) − √π′(x))² dx)^{1/2} (2 ∫_{R^d} |f(x)|² (√π(x) + √π′(x))² dx)^{1/2}
≤ d_H(π, π′) (4 ∫_{R^d} |f(x)|² (π(x) + π′(x)) dx)^{1/2} = 2 f₂ d_H(π, π′). □
Lemma
Let f be a real-valued function on R^d such that sup_{x∈R^d} |f(x)| =: ‖f‖_∞ < ∞. Then

|E^π[f] − E^{π′}[f]| ≤ 2 ‖f‖_∞ d_TV(π, π′).

Moreover, the following variational characterization of the total variation distance holds:

d_TV(π, π′) = (1/2) sup_{‖f‖_∞ ≤ 1} |E^π[f] − E^{π′}[f]|.

Remark: Note that the result for the Hellinger distance only assumes that
f is square integrable with respect to π and π 0 , whereas the result for the
total variation distance requires that f is bounded.
Proof. For the first part of the lemma, note that

|E^π[f] − E^{π′}[f]| = |∫_{R^d} f(x) (π(x) − π′(x)) dx|
≤ 2‖f‖_∞ · (1/2) ∫_{R^d} |π(x) − π′(x)| dx = 2‖f‖_∞ d_TV(π, π′).

This in particular shows that, for any f with ‖f‖_∞ = 1,

d_TV(π, π′) ≥ (1/2) |E^π[f] − E^{π′}[f]|.

Our goal now is to exhibit a choice of f with ‖f‖_∞ = 1 that achieves equality. Define f(x) := sign(π(x) − π′(x)), so that f(x)(π(x) − π′(x)) = |π(x) − π′(x)|. Then ‖f‖_∞ = 1 and

d_TV(π, π′) = (1/2) ∫_{R^d} |π(x) − π′(x)| dx = (1/2) ∫_{R^d} f(x)(π(x) − π′(x)) dx = (1/2) (E^π[f] − E^{π′}[f]).

This completes the proof of the variational characterization. □
Approximation theorem

We denote by

g(x) = ν(y − F(x))   and   g_δ(x) = ν(y − F_δ(x))

the likelihoods associated with F and F_δ, so that

π^y(x) = (1/Z) g(x) π(x)   and   π_δ^y(x) = (1/Z_δ) g_δ(x) π(x)

with corresponding normalising constants Z, Z_δ > 0. We make the following assumptions on g and g_δ.

Assumption 1. There exist δ⁺ > 0, constants K₁, K₂ > 0, and a function ϕ : R^d → R such that E^π[ϕ²] ≤ K₁ and, for all δ ∈ (0, δ⁺),

1. |√g(x) − √(g_δ(x))| ≤ ϕ(x) δ for all x ∈ R^d,
2. √g(x) + √(g_δ(x)) ≤ K₂ for all x ∈ R^d.
Lemma
Under Assumption 1 there exist δ̃⁺ > 0 and c₁, c₂ ∈ (0, ∞) such that

|Z − Z_δ| ≤ c₁δ   and   Z, Z_δ > c₂   for δ ∈ (0, δ̃⁺).

Proof. Since Z = ∫_{R^d} g(x)π(x) dx and Z_δ = ∫_{R^d} g_δ(x)π(x) dx, we have

|Z − Z_δ| = |∫_{R^d} (g(x) − g_δ(x)) π(x) dx|
≤ (∫_{R^d} (√g(x) − √(g_δ(x)))² π(x) dx)^{1/2} (∫_{R^d} (√g(x) + √(g_δ(x)))² π(x) dx)^{1/2}
≤ δ (∫_{R^d} ϕ(x)² π(x) dx)^{1/2} (∫_{R^d} K₂² π(x) dx)^{1/2}
≤ √K₁ K₂ δ,   δ ∈ (0, δ⁺).

And when δ ≤ δ̃⁺ := min{ Z/(2√K₁ K₂), δ⁺ }, we have

Z_δ ≥ Z − |Z − Z_δ| ≥ (1/2) Z.

The lemma follows by taking c₁ = √K₁ K₂ and c₂ = (1/2) Z. □
Theorem (Well-posedness)

Under Assumption 1, there exist δ̃⁺ > 0 and c > 0 such that

d_H(π^y, π_δ^y) ≤ cδ   for all δ ∈ (0, δ̃⁺).

Proof. We break the distance into two error parts, one caused by the difference between Z and Z_δ, the other caused by the difference between g and g_δ:

d_H(π^y, π_δ^y) = (1/√2) ‖√(π^y) − √(π_δ^y)‖_{L²}
= (1/√2) ‖√(gπ/Z) − √(gπ/Z_δ) + √(gπ/Z_δ) − √(g_δπ/Z_δ)‖_{L²}
≤ (1/√2) ‖√(gπ/Z) − √(gπ/Z_δ)‖_{L²} + (1/√2) ‖√(gπ/Z_δ) − √(g_δπ/Z_δ)‖_{L²}.

Using the previous lemma, for δ ∈ (0, δ̃⁺), we have for the first term

‖√(gπ/Z) − √(gπ/Z_δ)‖_{L²} = |1/√Z − 1/√(Z_δ)| (∫_{R^d} g(x)π(x) dx)^{1/2}
= |1 − √(Z/Z_δ)| = |√(Z_δ) − √Z| / √(Z_δ) = |Z − Z_δ| / (√(Z_δ)(√Z + √(Z_δ))) ≤ (c₁/(2c₂)) δ,

where the underbraced integral equals Z. For the second term, we obtain

‖√(gπ/Z_δ) − √(g_δπ/Z_δ)‖_{L²} = (1/√(Z_δ)) (∫_{R^d} (√g(x) − √(g_δ(x)))² π(x) dx)^{1/2} ≤ √(K₁/c₂) δ.

Therefore

d_H(π^y, π_δ^y) ≤ (1/√2) (c₁/(2c₂)) δ + (1/√2) √(K₁/c₂) δ = cδ,

with c = (1/√2) c₁/(2c₂) + (1/√2) √(K₁/c₂) independent of δ. □
Notice that, together with (2), i.e., the inequality

|E^π[f] − E^{π′}[f]| ≤ 2 f₂ d_H(π, π′),   f₂² := E^π[f²] + E^{π′}[f²],

this theorem guarantees that expectations computed with respect to π^y and π_δ^y are of order δ apart.
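The O(δ) rate can also be observed numerically in a toy one-dimensional problem. The forward map F(x) = x, the perturbation F_δ(x) = x + δ, and all parameter values below are our own illustrative assumptions, not part of the lecture:

```python
import numpy as np

x = np.linspace(-6, 6, 120001)
dx = x[1] - x[0]
prior = np.exp(-x ** 2 / 2)                        # standard normal prior (unnormalized)
y_obs, sig = 0.3, 0.5                              # datum and noise level (arbitrary)

def posterior(delta):
    """Normalized posterior density for the perturbed forward map F_delta(x) = x + delta."""
    g = np.exp(-(y_obs - (x + delta)) ** 2 / (2 * sig ** 2))   # perturbed likelihood
    dens = g * prior
    return dens / (np.sum(dens) * dx)              # normalize on the grid

p0 = posterior(0.0)
deltas = (0.02, 0.01, 0.005)
dists = [np.sqrt(0.5 * np.sum((np.sqrt(p0) - np.sqrt(posterior(dl))) ** 2) * dx)
         for dl in deltas]
```

Halving δ halves the Hellinger distance, so dists[0]/dists[2] ≈ 4, consistent with d_H(π^y, π_δ^y) = O(δ).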
