Monte Carlo Methods in Sampling Techniques

The document discusses Monte Carlo methods and sampling techniques in the context of Markov chains for density estimation and inference tasks. It covers various approaches to training Markov chains, sampling from different distributions, and the use of cumulative distribution functions (CDFs) for efficient sampling. The document also provides examples of ancestral sampling and its application in modeling sequences and transitions.


Introduction to Sampling Monte Carlo Approximation

CPSC 440: Advanced Machine Learning


Monte Carlo Methods

Mark Schmidt

University of British Columbia

Winter 2021

Last Time: Markov Chains


We can use Markov chains for density estimation,
p(x) = p(x1) ∏_{j=2}^d p(xj | xj−1),
       (initial prob.)  (transition prob.)

which model dependency between adjacent features.


Different from mixture models, which focus on clusters in the data.

Homogeneous chains use the same transition probability for all j (“parameter tying”).
Gives more data to estimate transitions, and allows examples of different sizes.

Inhomogeneous chains allow different transitions at different times.


More flexible, but need more data.

The MLE for discrete-state Markov chains has a simple closed-form “counting” solution.



Training Markov Chains

Some common setups for fitting the parameters of Markov chains:


1 We have one long sequence, and fit parameters of a homogeneous Markov chain.
Here, we just focus on the transition probabilities.

2 We have many sequences of different lengths, and fit a homogeneous chain.


And we can use it to model sequences of any length.

3 We have many sequences of the same length, and fit an inhomogeneous Markov chain.
This allows “position-specific” effects.

4 We use domain knowledge to guess the initial and transition probabilities.



Fun with Markov Chains


Markov Chains “Explained Visually”:
[Link]

Snakes and Ladders:


[Link]

Candyland:
[Link]

Yahtzee:
[Link]

Chess pieces returning home and K-pop vs. ska:


[Link]

Inference in Markov Chains

Given a Markov chain model, these are the most common inference tasks:
1 Sampling: generate sequences that follow the probability.

2 Marginalization: compute probability of being in state c at time j.

3 Decoding: compute assignment to the xj with highest joint probability.


Decoding and marginalization will be important when we return to supervised learning.

4 Conditioning: do any of the above, assuming xj = c for some j and c.


For example, “filling in” missing parts of the image.

5 Stationary distribution: probability of being in state c as j goes to ∞.


Usually for homogeneous Markov chains.

Outline

1 Introduction to Sampling

2 Monte Carlo Approximation



Fundamental Problem: Sampling from a Density

A common inference task is sampling from a density.


Generating examples xi that are distributed according to a given density p(x).
Basically, the “opposite” of density estimation: going from a model to data.
 
p(x) =  1 w.p. 0.5,  2 w.p. 0.25,  3 w.p. 0.25   ⇒   X = (1, 2, 1, 3, 2, 1, 3)ᵀ.

Fundamental Problem: Sampling from a Density


A common inference task is sampling from a density.
Generating examples xi that are distributed according to a given density p(x).
Basically, the “opposite” of density estimation: going from a model to data.

We’ve been using pictures of samples to “tell us what the model has learned”.
If the samples look like real data, then we have a good density model.

Samples can also be used in Monte Carlo estimation (today):


Replace complicated p(x) with samples to solve hard problems at test time.

Simplest Case: Sampling from a Bernoulli


Consider sampling from a Bernoulli, for example

p(x = 1) = 0.9, p(x = 0) = 0.1.

Sampling methods assume we can sample uniformly over [0, 1].


Usually, a “pseudo-random” number generator is good enough (like Julia’s rand).

How to use a uniform sample to sample from the Bernoulli above:


1 Generate a uniform sample u ∼ U(0, 1).
2 If u ≤ 0.9, set x = 1 (otherwise, set x = 0).

If uniform samples are “good enough”, then we have x = 1 with probability 0.9.
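As a sketch (Python here for illustration; the slides use Julia's rand, and any uniform generator works the same way), the two steps look like:

```python
import random

def sample_bernoulli(p=0.9):
    """Return 1 with probability p, else 0, using one uniform sample."""
    u = random.random()          # u ~ U(0, 1)
    return 1 if u <= p else 0

# Sanity check: the empirical frequency of 1s should be close to p.
samples = [sample_bernoulli(0.9) for _ in range(100_000)]
print(sum(samples) / len(samples))  # roughly 0.9
```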

Sampling from a Categorical Distribution


Consider a more general categorical density like

p(x = 1) = 0.4, p(x = 2) = 0.1, p(x = 3) = 0.2, p(x = 4) = 0.3,

we can divide up the [0, 1] interval based on probability values:

If u ∼ U(0, 1), 40% of the time it lands in the x = 1 region, 10% of the time in the x = 2 region, and so on.

Sampling from a Categorical Distribution

Consider a more general categorical density like

p(x = 1) = 0.4, p(x = 2) = 0.1, p(x = 3) = 0.2, p(x = 4) = 0.3.

To sample from this categorical density we can use (sampleDiscrete function):


1 Generate u ∼ U(0, 1).
2 If u ≤ 0.4, output 1.
3 Else if u ≤ 0.4 + 0.1, output 2.
4 Else if u ≤ 0.4 + 0.1 + 0.2, output 3.
5 Otherwise, output 4.

Sampling from a Categorical Distribution


General case for sampling from categorical.
1 Generate u ∼ U(0, 1).
2 If u ≤ p(x ≤ 1), output 1.
3 If u ≤ p(x ≤ 2), output 2.
4 If u ≤ p(x ≤ 3), output 3.
5 ...
The value p(x ≤ c) = p(x = 1) + p(x = 2) + · · · + p(x = c) is the CDF.
“Cumulative distribution function”.

Worst case cost with k possible states is O(k) by incrementally computing CDFs.

But to generate t samples only costs O(k + t log k) instead of O(tk):


One-time O(k) cost to store the CDF p(x ≤ c) for each c.
Per-sample O(log k) cost to do binary search for smallest c with u ≤ p(x ≤ c).
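A sketch of this two-phase scheme (make_sampler is a hypothetical name; the slides' sampleDiscrete plays the same role):

```python
import bisect
import itertools
import random

def make_sampler(probs):
    """One-time O(k) pass: store the CDF p(x <= c) for states c = 1..k."""
    cdf = list(itertools.accumulate(probs))
    def sample():
        u = random.random()
        # O(log k) binary search: smallest c with u <= p(x <= c).
        idx = bisect.bisect_left(cdf, u)
        return min(idx, len(cdf) - 1) + 1  # clamp guards float round-off
    return sample

sample = make_sampler([0.4, 0.1, 0.2, 0.3])
print([sample() for _ in range(10)])  # ten states, each in {1, 2, 3, 4}
```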

Cumulative Distribution Function (CDF)


We often use F (c) = p(x ≤ c) to denote the CDF.
F (c) is between 0 and 1, giving proportion of times x is below c.
F (c) monotonically increases with c.
F can be used for discrete and continuous variables:

[Link]

The “binary search for smallest c” method finds smallest c such that u ≤ F (c).
This same approach works for continuous and general densities.

General approach uses the inverse CDF (or “quantile”) function:


If F is invertible, then F −1 is the usual inverse:
F −1 (u) = c for the unique c where F (c) = u.
The generalization that covers non-invertible cases is F −1 (u) = inf{c | F (c) ≥ u}.
“Return the smallest c where F (c) is at least u.”

Inverse Transform Method (Exact 1D Sampling)


Inverse transform method for exact sampling in 1D:
1 Sample u ∼ U(0, 1).
2 Return F −1 (u).

Why this works (invertible case):

p(F⁻¹(u) ≤ c) = p(u ≤ F(c))   (apply monotonic F to both sides)
              = F(c)          (since p(u ≤ y) = y for uniform u)

So the output of this algorithm has the same CDF as the distribution we want to sample from.

Video on pseudo-random numbers and inverse-transform sampling:


[Link]

Example: Sampling from a 1D Gaussian


Consider a Gaussian distribution,
x ∼ N (µ, σ 2 ).
CDF has the form

F(c) = p(x ≤ c) = ½ [1 + erf((c − µ)/(σ√2))],

where the “error function” is erf(c) = (2/√π) ∫₀ᶜ exp(−t²) dt.
erf(c) lies in [−1, 1] and has no closed-form expression, but is typically available in stats packages.

Inverse CDF has the form

F⁻¹(u) = µ + σ√2 · erf⁻¹(2u − 1).

To sample from a Gaussian:
1 Generate u ∼ U(0, 1).
2 Return µ + σ√2 · erf⁻¹(2u − 1).
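A sketch in Python: the standard library's statistics.NormalDist exposes the Gaussian inverse CDF directly, which is mathematically equivalent to the erf⁻¹ formula above:

```python
import random
from statistics import NormalDist

def sample_gaussian(mu, sigma):
    """Inverse-transform sampling for N(mu, sigma^2).

    NormalDist.inv_cdf evaluates the same quantity as
    mu + sigma*sqrt(2)*erfinv(2u - 1).
    """
    u = random.random()                      # step 1: u ~ U(0, 1)
    return NormalDist(mu, sigma).inv_cdf(u)  # step 2: return F^{-1}(u)

xs = [sample_gaussian(5.0, 2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # sample mean, roughly 5.0
```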

Sampling from a Product Distribution

Consider a product distribution,

p(x1 , x2 , . . . , xd ) = p(x1 )p(x2 ) · · · p(xd ).

Because variables are independent, we can sample independently:


Sample x1 from p(x1 ).
Sample x2 from p(x2 ).
...
Sample xd from p(xd ).

Example: sampling from a multivariate Gaussian with diagonal covariance.


Sample each variable independently based on µj and σj2 .
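A minimal sketch of sampling the variables independently (sample_product is a hypothetical name):

```python
import random

def sample_product(mus, sigmas):
    """Diagonal-covariance Gaussian: each xj independently ~ N(mu_j, sigma_j^2)."""
    return [random.gauss(m, s) for m, s in zip(mus, sigmas)]

print(sample_product([0.0, 10.0], [1.0, 0.1]))  # one sample (x1, x2)
```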

Digression: Sampling from a Multivariate Gaussian

In some cases we can sample from multivariate distributions by transformation.

Recall the affine property of multivariate Gaussian:


If x ∼ N (µ, Σ), then Ax + b ∼ N (Aµ + b, AΣAᵀ).

To sample from a general multivariate Gaussian N (µ, Σ):


1 Sample x from a N (0, I) (each xj coming independently from N (0, 1)).
2 Transform to a sample from the right Gaussian using the affine property:

Ax + µ ∼ N (µ, AAᵀ),

where we choose A so that AAᵀ = Σ (e.g., by Cholesky factorization).
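A sketch of both steps, with a small pure-Python Cholesky factorization so that A·Aᵀ = Σ (for real problems a library routine would be used):

```python
import math
import random

def cholesky(Sigma):
    """Lower-triangular A with A A^T = Sigma (pure Python, small matrices)."""
    n = len(Sigma)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(Sigma[i][i] - s)
            else:
                A[i][j] = (Sigma[i][j] - s) / A[j][j]
    return A

def sample_mvn(mu, Sigma):
    """Step 1: z ~ N(0, I); step 2: return A z + mu ~ N(mu, Sigma)."""
    A = cholesky(Sigma)
    n = len(mu)
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    return [mu[i] + sum(A[i][k] * z[k] for k in range(n)) for i in range(n)]

print(sample_mvn([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]]))
```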



Ancestral Sampling
To sample dependent random variables we can use the chain rule of probability,

p(x1 , x2 , x3 , . . . , xd ) = p(x1 )p(x2 | x1 )p(x3 | x2 , x1 ) · · · p(xd | xd−1 , xd−2 , . . . , x1 ).

The chain rule suggests the following sampling strategy:


Sample x1 from p(x1 ).
Given x1 , sample x2 from p(x2 | x1 ).
Given x1 and x2 , sample x3 from p(x3 | x2 , x1 ).
...
Given x1 through xd−1 , sample xd from p(xd | xd−1 , xd−2 , . . . x1 ).

This is called ancestral sampling.


It’s easy if (conditional) probabilities are simple, since sampling in 1D is usually easy.
But they may not be simple: with binary variables, conditional j is defined over the 2^j possible values of {x1, x2, …, xj}.

Ancestral Sampling Examples


For Markov chains the chain rule simplifies to
p(x1 , x2 , x3 , . . . , xd ) = p(x1 )p(x2 | x1 )p(x3 | x2 ) · · · p(xd | xd−1 ),
So ancestral sampling simplifies too:
1 Sample x1 from initial probabilities p(x1 ).
2 Given x1 , sample x2 from transition probabilities p(x2 | x1 ).
3 Given x2 , sample x3 from transition probabilities p(x3 | x2 ).
4 ...
5 Given xd−1 , sample xd from transition probabilities p(xd | xd−1 ).

For mixture models with cluster variables z we could write


p(x, z) = p(z)p(x | z),
so we can first sample cluster z and then sample x given cluster z.
If you want samples of x, sample (x, z) pairs and ignore the z values.
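The Markov-chain case above can be sketched as follows; the two-state initial and transition probabilities are made up for illustration:

```python
import itertools
import random

def sample_categorical(probs):
    """Inverse-CDF sampling: smallest state c with u <= p(x <= c)."""
    u = random.random()
    for state, F in enumerate(itertools.accumulate(probs)):
        if u <= F:
            return state
    return len(probs) - 1   # guard against floating-point round-off

def sample_markov_chain(p_init, p_trans, d):
    """Ancestral sampling: x1 from initial probs, then each xj from its row."""
    x = [sample_categorical(p_init)]
    for _ in range(d - 1):
        x.append(sample_categorical(p_trans[x[-1]]))
    return x

# Toy two-state chain (probabilities invented for illustration).
p_init = [0.8, 0.2]
p_trans = [[0.9, 0.1],   # row c: p(x_j = . | x_{j-1} = c)
           [0.5, 0.5]]
print(sample_markov_chain(p_init, p_trans, 10))
```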

Markov Chain Toy Example: CS Grad Career


“Computer science grad career” Markov chain:
Initial probabilities:

Transition probabilities (from row to column):

So p(xt = “Grad School” | xt−1 = “Industry”) = 0.01.



Example of Sampling x1

Initial probabilities are: So initial CDF is:


0.1 (Video Games) 0.1 (Video Games)
0.6 (Industry) 0.7 (Industry)
0.3 (Grad School) 1 (Grad School)
0 (Video Games with PhD) 1 (Video Games with PhD)
0 (Academia) 1 (Academia)
0 (Deceased) 1 (Deceased)

To sample the initial state x1 :


First generate a uniform number u, for example u = 0.724.
Now find the first CDF value bigger than u, which in this case is “Grad School”.

Example of Sampling x2 , Given x1 = “Grad School”

So we sampled x1 = “Grad School”.


To sample x2 , we’ll use the “Grad School” row in transition probabilities:

Example of Sampling x2 , Given x1 = “Grad School”

Transition probabilities: So transition CDF is:


0.06 (Video Games) 0.06 (Video Games)
0.06 (Industry) 0.12 (Industry)
0.75 (Grad School) 0.87 (Grad School)
0.05 (Video Games with PhD) 0.97 (Video Games with PhD)
0.02 (Academia) 0.99 (Academia)
0.01 (Deceased) 1 (Deceased)

To sample the second state x2 :


First generate a uniform number u, for example u = 0.113.
Now find the first CDF value bigger than u, which in this case is “Industry”.
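The two worked examples (u = 0.724 against the initial CDF, u = 0.113 against the “Grad School” row) can be checked with a short sketch; state_for is a hypothetical helper:

```python
import itertools

STATES = ["Video Games", "Industry", "Grad School",
          "Video Games with PhD", "Academia", "Deceased"]

def state_for(u, probs):
    """Return the first state whose CDF value is at least u."""
    for state, F in zip(STATES, itertools.accumulate(probs)):
        if u <= F:
            return state
    return STATES[-1]

p_init = [0.1, 0.6, 0.3, 0.0, 0.0, 0.0]
p_grad = [0.06, 0.06, 0.75, 0.05, 0.02, 0.01]   # "Grad School" row

print(state_for(0.724, p_init))   # Grad School (CDF 0.7 < 0.724 <= 1.0)
print(state_for(0.113, p_grad))   # Industry    (CDF 0.06 < 0.113 <= 0.12)
```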

Markov Chain Toy Example: CS Grad Career


Samples from “computer science grad career” Markov chain:

The “Deceased” state is called an absorbing state (no probability of leaving).


Samples often give you an idea of what model knows (and what should be fixed).

Outline

1 Introduction to Sampling

2 Monte Carlo Approximation



Marginalization and Conditioning


Given a density estimator, we often want to make probabilistic inferences:
Marginals: what is the probability that xj = c?
What is the probability we’re in industry 10 years after graduation?
Conditionals: what is the probability that xj = c, given xj′ = c′?
What is the probability of industry after 10 years, if we immediately go to grad school?

This is easy for simple independent models:

We directly model the marginals p(xj), and conditionals are just marginals:
p(xj | xj′) = p(xj).

This is also easy for mixtures of simple independent models.


Do inference for each mixture component, then add the results using mixture probabilities:

p(xj) = Σ_z p(z, xj) = Σ_z p(z) · p(xj | z)   (p(xj | z) is inference within cluster z).

For Markov chains, it’s more complicated...



Marginals in CS Grad Career


All marginals p(xj = c) from “computer science grad career” Markov chain:

Each row is a state c and each column is a year j (the values in each column sum to 1).

Monte Carlo: Marginalization by Sampling


A basic Monte Carlo method for estimating probabilities of events:
1 Generate a large number of samples xi from the model,
 
X = [ 0 0 1 0
      1 1 1 0
      0 1 1 1
      1 1 1 1 ].

2 Compute frequency that the event happened in the samples,

p(x2 = 1) ≈ 3/4,
p(x3 = 0) ≈ 0/4.
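With the four samples above, the frequency computation is just:

```python
# The 4 samples, one per row; columns are x1..x4.
X = [[0, 0, 1, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 1, 1]]

def monte_carlo_prob(samples, event):
    """Fraction of samples where the event holds."""
    return sum(event(x) for x in samples) / len(samples)

print(monte_carlo_prob(X, lambda x: x[1] == 1))  # p(x2 = 1) ~ 3/4
print(monte_carlo_prob(X, lambda x: x[2] == 0))  # p(x3 = 0) ~ 0/4
```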

Monte Carlo methods are the second most important class of ML algorithms.


Originally developed to build better atomic bombs :(
Run physics simulator to “sample”, then see if it leads to a chain reaction.

Monte Carlo Method for Rolling Dice

Monte Carlo estimate of the probability of an event A:


(number of samples where A happened) / (number of samples).
Computing probability of a pair of dice rolling a sum of 7:
Roll two dice, check if the sum is 7.
Roll two dice, check if the sum is 7.
Roll two dice, check if the sum is 7.
Roll two dice, check if the sum is 7.
Roll two dice, check if the sum is 7.
...

Monte Carlo estimate: fraction of samples where sum is 7.
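A sketch of this estimator (the exact answer is 6/36 ≈ 0.167):

```python
import random

def estimate_sum7(n_samples=200_000):
    """Monte Carlo estimate of p(two dice sum to 7)."""
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
               for _ in range(n_samples))
    return hits / n_samples

print(estimate_sum7())  # roughly 0.167
```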



Summary
Markov chain inference tasks (MEMORIZE):
Sampling, marginalization, decoding, conditioning, stationary distributions.

Inverse Transform generates samples from simple 1D distributions.


When we can easily invert the CDF.

Ancestral sampling generates samples from multivariate distributions.


When conditionals have a nice form.

Monte Carlo method for approximating probabilities of an event.


Generate samples, then count how many times event happened.

Next time: the original Google algorithm.
