# Stacked Monte Carlo

One of the themes of simulation informatics is a blending of ideas from statistics and machine learning and how they apply to problems in simulation.  A beautiful example of such a fusion is a recent idea from

Brendan Tracey, David Wolpert, and Juan J. Alonso. Using Supervised Learning to Improve Monte Carlo Integral Estimation. arXiv:1108.4870.

These authors tackle the problem of how to reduce the variance in a Monte Carlo estimation of an integral.  The standard Monte Carlo scheme is:

$I[f] = \int_D f(x) \, dx \approx \frac{|D|}{k} \sum_{i=1}^k f(x_i)$

where each sample $x_i$ is drawn from a uniform random distribution over the domain $D$, and $|D|$ is the volume of $D$ (for $D = [-1,1]$, $|D| = 2$).  (Note that they address more complicated probability spaces in the paper and include a discussion of importance sampling, but these complicate the essence of the idea.)  This estimate is known to be bias-free, by which we mean that the expectation of the sum is equal to the integral.  But errors arise in Monte Carlo estimates because the variance of the random estimate depends on the variance of the function $f(x)$ over the domain.  In contrast, quadrature integration formulae have zero variance, but have a bias term.  Consequently, the authors hope to improve a Monte Carlo estimate by reducing its variance.
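To make the setup concrete, here is a small Python sketch of the plain estimator (the function and interval are the ones used in the Matlab example later in the post; the seed is fixed only for reproducibility):

```python
import numpy as np

# Plain Monte Carlo estimate of int_{-1}^{1} 1/(1.1 - x) dx.
f = lambda x: 1.0 / (1.1 - x)
true_integral = np.log(2.1) - np.log(0.1)   # exact value, log(21) ~ 3.0445

rng = np.random.default_rng(0)
k = 100_000
x = rng.uniform(-1.0, 1.0, size=k)          # uniform samples on D = [-1, 1]
vol = 2.0                                   # |D|, the length of [-1, 1]
estimate = vol * np.mean(f(x))

print(estimate, true_integral)
```

With enough samples the estimate is close to the exact value, but the convergence is slow: the standard error falls only like $1/\sqrt{k}$, scaled by the standard deviation of $f$.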

To do so, they combine two ideas.  First, suppose that we have a function $g(x)$ that is a reasonable approximation to $f(x)$.  If it’s a good approximation, then we can write:

$I[f] = \int_D \alpha g(x) \, dx + \int_D \bigl( f(x) - \alpha g(x) \bigr) \, dx$

where, for a good choice of $\alpha$, the function $f(x) - \alpha g(x)$ has smaller variance than $f(x)$ itself.  When $g(x)$ is a function that is easy to integrate exactly, the first term is known in closed form, and a Monte Carlo estimate of $I[f - \alpha g]$ will produce a more accurate estimate of the underlying integral.
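The variance reduction is easy to see numerically.  A minimal Python sketch, where the surrogate $g$ is my own illustrative choice (the degree-3 Taylor polynomial of $f(x) = 1/(1.1 - x)$ about $x = 0$, rather than a fitted cubic as in the scheme below), with $\alpha = 1$:

```python
import numpy as np

f = lambda x: 1.0 / (1.1 - x)

# Toy surrogate g: the degree-3 Taylor polynomial of f about x = 0,
# g(x) = 1/1.1 + x/1.1^2 + x^2/1.1^3 + x^3/1.1^4.  (Illustrative choice only.)
gcoeffs = np.array([1/1.1**4, 1/1.1**3, 1/1.1**2, 1/1.1])  # highest power first
alpha = 1.0

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200_000)
resid = f(x) - alpha * np.polyval(gcoeffs, x)

var_f = np.var(f(x))          # variance of the raw integrand
var_resid = np.var(resid)     # variance of the residual -- much smaller

# Exact integral of g plus a Monte Carlo estimate of the residual integral.
G = np.polyint(gcoeffs)
int_g = np.polyval(G, 1.0) - np.polyval(G, -1.0)
est = alpha * int_g + 2.0 * np.mean(resid)   # 2.0 = volume of [-1, 1]

print(var_f, var_resid, est)
```

Both the raw and the residual estimators are unbiased for their respective integrals; the residual's smaller variance is what buys the accuracy.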

But where can we get such a function $g(x)$?  This is where the second idea arises.  We don't get a single function $g(x)$; we get a set of functions $g(x)$ from a cross-validation study, which brings up another simulation informatics theme: treating functions like data.

Suppose we have 20 evaluations of $f(x)$ over the real interval $[-1, 1]$.  With any four of them, we can fit a unique cubic polynomial.  Such a polynomial is our function $g(x)$, and it is easy to integrate exactly, which gives us the left-hand term $\int_D \alpha g(x) \, dx$ in our approximation.  (The issue of how to pick $\alpha$ is discussed in the paper, and we don't deal with it here.)  The remaining 16 points let us compute a Monte Carlo estimate of the residual integral $\int_D \bigl( f(x) - \alpha g(x) \bigr) \, dx$.  In a standard cross-validation study, these values would be used to pick the best cubic model of the data by minimizing the error on the "hold-out" set, the remaining 16 examples.  Here, they constitute useful data!

The cross-validation idea arises because the authors propose to do this for all of the cross-validation sets.  That is, we divide the 20 points into 5 bins of 4 points, fit one $g(x)$ per bin, and average the 5 estimates we get from the 5 functions $g(x)$.  Hence the name "stacked Monte Carlo": it's a Monte Carlo estimate inside of a Monte Carlo estimate.
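The "fit a cubic through four points, then integrate it in closed form" step is easy to check in isolation.  A quick Python sketch (the four points here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Four sample points (arbitrary, for illustration) of f(x) = 1/(1.1 - x).
xtrain = np.array([-0.8, -0.2, 0.3, 0.9])
ftrain = 1.0 / (1.1 - xtrain)

# A degree-3 fit through 4 distinct points interpolates them exactly.
gcoeffs = np.polyfit(xtrain, ftrain, 3)

# A polynomial is trivial to integrate in closed form over [-1, 1].
G = np.polyint(gcoeffs)
int_g = np.polyval(G, 1.0) - np.polyval(G, -1.0)
print(int_g)
```

Since the cubic agrees with $f$ at the fit points, its exact integral is a cheap stand-in for the hard integral of $f$.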

Here is some Matlab code that exemplifies the idea (note that each sample mean must be multiplied by the domain volume, 2 for $[-1,1]$, to estimate an integral rather than an average):

```matlab
%% A quick example of Stacked Monte Carlo integration

%% Set up the data
f = @(x) 1./(1.1-x);
xvals = rand(20,1)*2 - 1;      % 20 uniform samples on [-1,1]
fvals = f(xvals);
vol = 2;                       % volume (length) of the domain [-1,1]

%% Compute the Monte Carlo estimate
mc = vol*mean(fvals)

%%
% Boy that was easy!
trueint = log(2.1) - log(0.1)

%%
% It's in the right ballpark, but noisy.

%% Compute the stacked Monte Carlo estimate
nsets = 5;                     % number of cross-validation folds
nsetvals = 4;                  % points per fold (4 points -> cubic fit)
alpha = 0.5;

ests = zeros(nsets,1);

for seti=1:nsets
    trainset = nsetvals*(seti-1)+1:nsetvals*seti;
    testset = setdiff(1:nsetvals*nsets,trainset);

    % compute the exact integral of the cubic fit g
    xtrain = xvals(trainset);
    ftrain = fvals(trainset);
    gcoeffs = polyfit(xtrain,ftrain,nsetvals-1);
    intgcoeffs = polyint(gcoeffs);
    intg = polyval(intgcoeffs,1)-polyval(intgcoeffs,-1);

    % compute the MC estimate of the residual integral
    xtest = xvals(testset);
    ftest = fvals(testset);
    intr = vol*mean(ftest - alpha*polyval(gcoeffs,xtest));

    ests(seti) = alpha*intg + intr;
end

smc = mean(ests)
```
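For readers without Matlab, here is a rough Python port of the same scheme (numpy only; like the Matlab version it uses 20 points, 5 folds of 4 points, and a fixed $\alpha = 0.5$, and it multiplies sample means by the domain volume $|D| = 2$ because the samples are uniform on $[-1, 1]$):

```python
import numpy as np

f = lambda x: 1.0 / (1.1 - x)
rng = np.random.default_rng(7)

npts, nsets = 20, 5
nsetvals = npts // nsets          # 4 points per fold -> cubic fits
alpha = 0.5
vol = 2.0                         # volume of the domain [-1, 1]

xvals = rng.uniform(-1.0, 1.0, size=npts)
fvals = f(xvals)

ests = np.zeros(nsets)
for s in range(nsets):
    fit = slice(s * nsetvals, (s + 1) * nsetvals)
    mask = np.ones(npts, dtype=bool)
    mask[fit] = False             # held-out points for the residual estimate

    # exact integral of the cubic fit g
    gcoeffs = np.polyfit(xvals[fit], fvals[fit], nsetvals - 1)
    G = np.polyint(gcoeffs)
    intg = np.polyval(G, 1.0) - np.polyval(G, -1.0)

    # Monte Carlo estimate of the residual integral on the hold-out points
    resid = fvals[mask] - alpha * np.polyval(gcoeffs, xvals[mask])
    ests[s] = alpha * intg + vol * np.mean(resid)

smc = np.mean(ests)               # stacked estimate: average over the folds
mc = vol * np.mean(fvals)         # plain estimate, for comparison
true_int = np.log(2.1) - np.log(0.1)
print(mc, smc, true_int)
```

With only 20 samples both estimates are noisy, so individual runs vary; averaging over the folds is what stabilizes the stacked estimate.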

When I run this, the exact value is

```
trueint =
    3.0445
```

while mc and smc change from run to run, since the sample points are random.  I repeated this a few times; the smc estimate is almost always the better of the two, i.e. closer to trueint.  Playing around with $\alpha$ shows that using a larger $\alpha$ produces a better estimate here.  But it's still great to get better accuracy from a rather simple modification.

While it's true in this case that the higher $\alpha$ is, the better, I have not found that to be true generally. A quick counterexample is when the fit is really bad; then one would want an $\alpha$ close to zero. I would recommend that anyone who is going to use this method also use the method for choosing $\alpha$ described in the paper.
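For completeness, one standard default comes from the classical control-variates literature: take $\alpha \approx \mathrm{Cov}(f, g)/\mathrm{Var}(g)$, estimated from samples, which automatically pushes $\alpha$ toward zero when the fit $g$ is poor.  (This is a generic heuristic, not necessarily the estimator derived in the paper.)  A minimal Python sketch:

```python
import numpy as np

# Control-variate-style choice of alpha: alpha* = Cov(f, g) / Var(g).
# (Classical heuristic; the paper derives its own method for choosing alpha.)
def alpha_star(fvals, gvals):
    cov = np.cov(fvals, gvals)        # 2x2 sample covariance matrix
    return cov[0, 1] / cov[1, 1]

# Sanity check: if g is a scaled copy of f, alpha* recovers the scale factor.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=1000)
fvals = 1.0 / (1.1 - x)
gvals = 0.5 * fvals                   # g = f/2, so alpha* should be 2

print(alpha_star(fvals, gvals))
```

When $f$ and $g$ are uncorrelated the numerator, and hence $\alpha$, goes to zero, which is exactly the "bad fit" behavior one wants.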