MCMC with Caching for Gaussian Processes

This document summarizes research on using temporary mapping and caching to improve the efficiency of Markov chain Monte Carlo (MCMC) methods for approximating distributions like the posterior in Gaussian process regression. It describes using a subset of the data to form an approximating distribution π* that allows mapping proposals to a temporary space to mix faster before mapping back. Experiments on a synthetic dataset show this "mapping" method leads to faster mixing than standard MCMC, as evidenced by shorter autocorrelation times. Ongoing work includes exploring other approximation methods and using "tempered transitions" for the mappings.

MCMC with Temporary Mapping and Caching with Application on Gaussian Process Regression

Advisor: Professor Radford Neal Chunyi Wang Department of Statistics, University of Toronto

Joint Statistical Meeting August 3rd, 2011

Markov Chain Monte Carlo Methods

We construct an ergodic Markov chain with transition T(x'|x) which leaves the target distribution π(x) invariant, i.e.

∫ π(x) T(x'|x) dx = π(x')

Metropolis algorithm: propose a move from x to x* (according to a proposal distribution S(x*|x)), and accept the proposal with probability min[1, π(x*)/π(x)]. This satisfies the detailed balance condition

π(x) T(x'|x) = π(x') T(x|x')

and thus the chain (called reversible) leaves the target distribution invariant.
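As an illustration, here is a minimal random-walk Metropolis sampler for a generic log target; the function and parameter names are illustrative, not from the talk:

```python
import math
import random

def metropolis(log_pi, x0, n_steps, step=1.0, rng=None):
    """Random-walk Metropolis: propose x* = x + step * N(0, 1),
    accept with probability min[1, pi(x*)/pi(x)]."""
    rng = rng or random.Random(0)
    x, log_px = x0, log_pi(x0)
    samples = []
    for _ in range(n_steps):
        x_star = x + step * rng.gauss(0.0, 1.0)
        log_px_star = log_pi(x_star)
        # accept with probability min[1, exp(log pi(x*) - log pi(x))]
        if rng.random() < math.exp(min(0.0, log_px_star - log_px)):
            x, log_px = x_star, log_px_star
        samples.append(x)
    return samples

# target: standard normal, pi(x) proportional to exp(-x^2 / 2)
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=5000)
```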

Caching Results for Future Re-use

The evaluation of π(x) is typically expensive, so we should always save the value of π(x) if it might be used in the future:
- If a Metropolis proposal x* is rejected, the current state will still be x, and therefore π(x) is needed for the next update;
- If a Metropolis proposal x* is accepted, the current state will be x*, and so π(x*) is needed for the next update;
- If the state space is discrete, previously visited states may recur, so their cached π values can be re-used.
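A minimal sketch of this caching idea, assuming hashable states, is a memoizing wrapper around the log density (the class name and counter are illustrative):

```python
class CachedDensity:
    """Cache log pi(x) values so a rejected proposal's pi(x) and an
    accepted proposal's pi(x*) need not be recomputed at the next update."""
    def __init__(self, log_pi):
        self._log_pi = log_pi
        self._cache = {}
        self.evaluations = 0  # counts actual (expensive) evaluations

    def __call__(self, x):
        if x not in self._cache:
            self.evaluations += 1
            self._cache[x] = self._log_pi(x)
        return self._cache[x]

cached = CachedDensity(lambda x: -0.5 * x * x)
cached(1.0)
cached(1.0)   # second call is a cache hit, no new evaluation
```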

MCMC with Temporary Mapping


We can combine three stochastic mappings T̂, T̄ and Ť to form the transition T(x'|x), as follows:

x →(T̂) y →(T̄) y' →(Ť) x'

where x ∈ X is the original sample space and y ∈ Y is a temporary space. To leave the target distribution invariant these mappings have to satisfy

∫ π(x) T̂(y|x) dx = ρ(y)
∫ ρ(y) T̄(y'|y) dy = ρ(y')
∫ ρ(y') Ť(x'|y') dy' = π(x')

for some distribution ρ on Y.

Mapping to a Discretizing Chain


Suppose we have a Markov chain which leaves a distribution π* invariant. We can map to a space of realizations of such a chain: the current state x is mapped to a chain with one time step (whose value is x) marked.

[Figure: the state x in X is mapped to a marked position on a realized chain in Y.]

We don't actually compute everything beforehand, but simulate new states (and save them for future re-use) when needed.

Mapping to a Discretizing Chain - Continued


We then attempt to move the marker along the chain to another state (whose value is x'), with acceptance probability

min[1, (π(x')/π*(x')) / (π(x)/π*(x))]

We can do multiple such updates in this space before mapping back to the original space.

[Figure: marker moves along the chain in Y; solid line segments are the updates that are actually simulated, while the dashed segments are not.]
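A rough sketch of one marker move along an already-realized π*-chain segment; a full implementation must also simulate the chain lazily in both directions, which is omitted here, and all names are illustrative:

```python
import math
import random

def marker_update(chain, idx, log_pi, log_pi_star, rng):
    """Propose moving the marker one step along a realized pi*-chain;
    accept with prob min[1, (pi(x')/pi*(x')) / (pi(x)/pi*(x))]."""
    j = idx + rng.choice([-1, 1])
    if j < 0 or j >= len(chain):
        return idx                      # no state simulated there yet
    x, x_new = chain[idx], chain[j]
    log_ratio = (log_pi(x_new) - log_pi_star(x_new)) \
              - (log_pi(x) - log_pi_star(x))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return j
    return idx

# toy check: when pi = pi*, every in-range marker move is accepted
rng = random.Random(1)
chain = [0.0, 0.5, 1.0, 1.5]
f = lambda x: -0.5 * x * x
idx = marker_update(chain, 1, f, f, rng)
```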

Gaussian Process Regression: Model

We observe n training cases (x1, y1), ..., (xn, yn) where xi is a vector of inputs of length p, and yi is the corresponding scalar response, which we assume is a function of the inputs plus some noise:

yi = f(xi) + εi,  where εi iid ~ N(0, σ²)

In a Gaussian process regression model, the prior mean of the function f is 0, and the covariance of the responses is

Cov(yi, yj) = k(xi, xj) + σ² δij
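A small sketch of building this covariance matrix for a 1-d input, assuming a squared-exponential k with illustrative hyperparameter names (eta2, ell2):

```python
import numpy as np

def cov_matrix(X, eta2, ell2, sigma2):
    """Cov(y_i, y_j) = k(x_i, x_j) + sigma^2 * delta_ij, with a
    squared-exponential k(x, x') = eta2 * exp(-(x - x')^2 / ell2)."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return eta2 * np.exp(-d2 / ell2) + sigma2 * np.eye(len(X))

X = np.linspace(0.0, 3.0, 5)
C = cov_matrix(X, eta2=4.0, ell2=1.0, sigma2=0.25)
```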

Gaussian Process Regression: Prediction


We wish to predict the response y*, for a test case x*, based on the training data. The predictive distribution for the response y* is Gaussian:

E[y*|y] = kᵀ C⁻¹ y
Var[y*|y] = v − kᵀ C⁻¹ k

where C is the covariance matrix for the training responses, k is the vector of covariances between y* and each of the yi, and v is the prior variance of y* [i.e. Cov(y*, y*)]. To do this in the Bayesian framework, we obtain a random sample from the posterior density for the hyperparameters θ:

π(θ|y) ∝ (2π)^(−n/2) det(C)^(−1/2) exp(−(1/2) yᵀ C⁻¹ y) π(θ)

where π(θ) is the prior for θ.
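The two predictive formulas can be sketched directly (the helper name gp_predict is illustrative):

```python
import numpy as np

def gp_predict(C, k, v, y):
    """Predictive mean and variance for a test case:
    E[y*|y] = k^T C^{-1} y,  Var[y*|y] = v - k^T C^{-1} k."""
    Cinv_y = np.linalg.solve(C, y)
    Cinv_k = np.linalg.solve(C, k)
    return k @ Cinv_y, v - k @ Cinv_k

# toy check with C = I: mean = k^T y, var = v - k^T k
mean, var = gp_predict(np.eye(2), np.array([0.5, 0.0]), 1.0,
                       np.array([2.0, 0.0]))
```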

Complexity for the GP Regression Model


The posterior density is

π(θ|y) ∝ (2π)^(−n/2) det(C)^(−1/2) exp(−(1/2) yᵀ C⁻¹ y) π(θ)

The time needed to perform the following major computations is (asymptotically, with implementation-specific constant coefficients):

C           p n²
det(C)      n³
C⁻¹         n³
yᵀ C⁻¹ y    n²

In practice we compute C (p n²) and the Cholesky decomposition of C (n³); then we can cheaply obtain det(C) and yᵀ C⁻¹ y.
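A minimal sketch of that Cholesky trick, on a tiny hand-checkable matrix:

```python
import numpy as np

# one O(n^3) Cholesky factorization C = L L^T gives both quantities
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
y = np.array([1.0, -1.0])

L = np.linalg.cholesky(C)                     # C = L L^T
log_det = 2.0 * np.sum(np.log(np.diag(L)))    # log det(C) = 2 sum log L_ii
z = np.linalg.solve(L, y)                     # solve L z = y, O(n^2)
quad = z @ z                                  # y^T C^{-1} y = z^T z
```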

π* as an Approximation of π: dimension reduction

We wish to find some π* that is easier to compute than π while similar in distribution to π. Some approximation methods are listed as candidates:

Subset of data (SoD): π* is the posterior given only a subset (of m observations) of (x1, y1), ..., (xn, yn). Needs time proportional to p m² to compute C*, and m³ to invert C*.

Linear combination of responses: let ỹ = A y where A is of rank m. ỹ is also Gaussian, with lower dimension. π* is the posterior based on the covariance matrix for ỹ, C* = A C Aᵀ, of rank m.

Others: subset of regressors (SoR), Bayesian Committee Machine, etc.
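A sketch of the SoD step, which just keeps m of the n training cases (the function name and seeding are illustrative):

```python
import numpy as np

def subset_of_data(X, y, m, rng=None):
    """Subset-of-data: keep m of the n training cases, so the m x m
    covariance costs O(p m^2) to form and O(m^3) to factor."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], y[idx]

X = np.arange(10.0)
y = 2 * X
Xs, ys = subset_of_data(X, y, m=4)
```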

π* as an Approximation of π: diagonal plus low rank


C is usually of the form σ² I + C⁰, where C⁰ is non-negative definite. If C⁰ can be approximated by some lower-rank matrix, then we can reduce the computation by these lemmas:

(D + U W Vᵀ)⁻¹ = D⁻¹ − D⁻¹ U (W⁻¹ + Vᵀ D⁻¹ U)⁻¹ Vᵀ D⁻¹
det(D + U W Vᵀ) = det(W⁻¹ + Vᵀ D⁻¹ U) det(W) det(D)

Eigen-method: C* = σ² I + B Λm Bᵀ, where Λm is the diagonal matrix with the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λm of C⁰ on its diagonal, and B is an n × m matrix whose columns are the corresponding orthonormal eigenvectors. Need to compute C (p n²) and the first m eigenvalues and eigenvectors of C⁰ (m n², with a large constant factor).

Nyström method: C* = σ² I + C⁰₍n,m₎ [C⁰₍m,m₎]⁻¹ C⁰₍m,n₎, where C⁰₍n,m₎ is an n × m matrix whose m columns are m randomly selected columns from C⁰. Need to compute C⁰₍n,m₎ (p m n), then find the Cholesky decomposition of some m × m matrix (m³).
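The two lemmas (the Woodbury identity and the matrix determinant lemma) can be checked numerically on a small random diagonal-plus-low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
D = np.diag(rng.uniform(1.0, 2.0, n))    # diagonal part, e.g. sigma^2 I
U = rng.standard_normal((n, m))
W = np.diag(rng.uniform(0.5, 1.5, m))
V = U                                    # symmetric low-rank case

A = D + U @ W @ V.T
Dinv = np.diag(1.0 / np.diag(D))

# Woodbury: (D + U W V^T)^{-1}, inverting only an m x m matrix
small = np.linalg.inv(W) + V.T @ Dinv @ U
A_inv = Dinv - Dinv @ U @ np.linalg.inv(small) @ V.T @ Dinv

# matrix determinant lemma: det(D + U W V^T)
det_A = np.linalg.det(small) * np.linalg.det(W) * np.linalg.det(D)
```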

Example: Use SoD to form the π*


We generate a synthetic dataset as follows:

y = 3 sin(x²) + 2 sin(1.5x + 1) + ε,  where x ~ Unif(0, 3) and ε ~ N(0, 0.5²)

We generated 500 observations as the training set, and another 1000 for the testing set.

We use a squared exponential covariance function:

Cov(y, y') = 10² + η² exp(−(x − x')²/ℓ²) + σ² δ

and the priors are

log η² ~ N(3, 3²)
log ℓ² ~ N(2, 3²)
log σ² ~ N(0, 3²)

[Figure: scatter plot of the training set, y against x over (0, 3).]
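The synthetic dataset above can be generated directly (seed and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2011)

def make_data(n, rng):
    """y = 3 sin(x^2) + 2 sin(1.5 x + 1) + eps,
    with x ~ Unif(0, 3) and eps ~ N(0, 0.5^2)."""
    x = rng.uniform(0.0, 3.0, n)
    y = 3 * np.sin(x ** 2) + 2 * np.sin(1.5 * x + 1) + rng.normal(0.0, 0.5, n)
    return x, y

x_train, y_train = make_data(500, rng)    # training set
x_test, y_test = make_data(1000, rng)     # testing set
```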
Example: Use SoD to form the π* - Predictions


The first 50 observations are used as the subset to form the π* used to implement the MCMC with mapping, and the results are compared to a standard Metropolis MCMC. The sample ACFs are adjusted so that they reflect the same number of evaluations of π(x).

[Figure: predictions on the testing cases from the Metropolis and mapping methods, plotted against x.]

Example: Use SoD to form the π* - Autocorrelations


[Figure: sample ACFs, up to lag 100 for Metropolis and up to lag 30 for mapping with SoD.]

Comparison of autocorrelation times:

          Metropolis   Mapping
log η²        37.9        8.5
log ℓ²        31.5        7.1
log σ²        12.0        1.9
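Autocorrelation times like those in the table can be estimated from the sampler output with a standard truncated-sum estimator; this is a minimal sketch (the truncation rule here is a simple illustrative choice, not the one used in the talk):

```python
import numpy as np

def autocorr_time(x, max_lag=100):
    """Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k,
    truncating the sum at the first non-positive estimated rho_k."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = (x @ x) / len(x)
    tau = 1.0
    for k in range(1, min(max_lag, len(x) - 1)):
        rho = (x[:-k] @ x[k:]) / len(x) / var
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

# for i.i.d. draws the autocorrelation time is about 1
iid = np.random.default_rng(0).standard_normal(20000)
tau_iid = autocorr_time(iid)
```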

Ongoing Work

Other approximation methods for π*: among the various approximation methods, which one is the best overall, and which is better in certain situations?

Mapping with tempered transitions: instead of directly mapping a state into the temporary space, we borrow the idea of tempered transitions and form a sequence of mappings.

References

1. Neal, R. M. (2006) Constructing Efficient MCMC Methods Using Temporary Mapping and Caching, talk at Columbia University, December 2006.
2. Neal, R. M. (1998) Regression and Classification Using Gaussian Process Priors, Bayesian Statistics 6, pp. 475-501, Oxford University Press.
3. Neal, R. M. (2008) Approximate Gaussian Process Regression Using Matrix Approximations and Linear Response Combinations, Tech. Report (Draft), Dept. of Statistics, University of Toronto.
4. Quiñonero-Candela, J., Rasmussen, C. E. and Williams, C. K. I. (2007) Approximation Methods for Gaussian Process Regression, Tech. Report MSR-TR-2007-124, Microsoft Research.
5. Rasmussen, C. E. and Williams, C. K. I. (2006) Gaussian Processes for Machine Learning, The MIT Press.
