The basic goal of stochastic sampling is to draw random samples from a given distribution p(x). Sometimes, p(x) is not exactly known, for example, knowing only f(x)=p(x)/K.

Given the ability of stochastic sampling, we may do the following things: (i) approximate p(x), (ii) integrate p(x), and (iii) optimize p(x).

A difficulty for stochastic sampling is that the form of p(x) or f(x) is usually complex, however, we can only sample directly from some easy-formed distributions, like discrete distribution, uniform distribution or unit Gaussian.

So, usually we sample from an easy-to-sample distribution q(x) instead of from p(x) or f(x), then we either assign each sample point a “weight”, or an “acceptance probability” which decides whether we accept this point as a sample from p(x) or f(x).

Some basic sampling methods include:

- Monte Carlo: Sample from q(x). No weight or acceptance probability is calculated. Just pretends that samples from q(x) is from p(x), and use the samples to compute the integration of p(x).
- Weighted Sampling: Sample from q(x). Assign each sample x a weight w=p(x)/q(x). Use the weighted samples to compute the integration of p(x).
- Rejection Sampling: Sample from q(x), where q(x) > p(x) in the whole domain of p(x). Accept each sample x by an acceptance probability a=p(x)/q(x).

In above methods, we have to choose q(x) which is easy-to-sample and approximate p(x). These two requirements are often contradictory. In order not to be bothered by choosing q(x), we may resort to Markov Chain Monte Carlo methods, which constructively approach p(x) without q(x) along a Markov chain.

- Metropolis Sampling: Sample xt from a transition proposal q(xt|xt-1). Accept the sample by the probability

The intuition is to accept xt with high probability if f(xt) > f(xt-1). This requires that q(a|b)=q(b|a). - Metropolis-Hasting Sampling: Change the acceptance probability to

This trick removes the requirement of q(a|b)=q(b|a). - Gibbs Sampling: In case that f(x) is very high-dimensional, it is expansive to compute f(x). So we avoid computing acceptance probability by assuming a=1, which means always accept the sample.
- Simulated Annealing: Change the acceptance probability to

where T(t) decreases from a large value to 1. The intuition is that initially, a large T increases the probability to accept an xt which does not increase the value of f(xt).

Few more words for simulated annealing: to take it as a sampling method, we decrease T down to 1; to make it an optimization method, we decrease T down to 0.

References:

- B. Walsh. Markov Chain Monte Carlo and Gibbs Sampling. Lecture notes. 2004.
- David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Chapter 29 and Chapter 30.