One of the themes of simulation informatics is a blending of ideas from statistics and machine learning and how they apply to problems in simulation. A beautiful example of such a fusion is a recent idea from

Brendan Tracey, David Wolpert, Juan J. Alonso. Using Supervised Learning to Improve Monte Carlo Integral Estimation. arXiv:1108.4870

These authors tackle the problem of how to reduce the variance in a Monte Carlo estimation of an integral. The standard Monte Carlo scheme is:

$$\int_D f(x) \, dx \approx \frac{|D|}{N} \sum_{i=1}^{N} f(x_i),$$

where each sample $x_i$ is drawn from a uniform random distribution over the domain $D$. (Note that they address more complicated probability spaces in the paper and include a discussion of importance sampling, but these complicate the essence of the idea.) This estimate is known to be *bias-free*, by which we mean that the expectation of the sum is equal to the integral. But errors arise in Monte Carlo estimates because the variance of the random estimate depends on the variance of the function $f$. In contrast, quadrature integration formulae have *zero variance*, but have a *bias* term. Consequently, the authors hope to improve a Monte Carlo estimate by reducing its variance.
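To make the bias-free-but-noisy behavior concrete, here's a quick sketch in Python/NumPy (rather than Matlab) of the plain Monte Carlo estimate for the integrand used later in this post; all names are mine, and the factor of 2 is the length of the domain $[-1,1]$:

```python
import numpy as np

# Plain Monte Carlo estimate of int_{-1}^{1} 1/(1.1 - x) dx,
# the same integrand as the Matlab example in this post.
rng = np.random.default_rng(0)

f = lambda x: 1.0 / (1.1 - x)
true_int = np.log(2.1) - np.log(0.1)   # exact value

n = 20
x = rng.uniform(-1.0, 1.0, size=n)
vol = 2.0                              # length of the domain [-1, 1]
mc = vol * f(x).mean()                 # a single noisy estimate

# Averaging many independent estimates recovers the true integral,
# which is what "bias-free" means: the expectation equals the integral.
many = vol * f(rng.uniform(-1, 1, size=(5000, n))).mean(axis=1)
print(mc, true_int, many.mean())
```

A single 20-sample estimate scatters widely because $f$ has large variance near $x = 1$; the average of many such estimates settles onto the true value.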

To do so, they combine two ideas. First, suppose that we have a function $g$ that is a reasonable approximation to $f$. If it’s a good approximation, then we can write:

$$\int f(x) \, dx = \alpha \int g(x) \, dx + \int \big( f(x) - \alpha g(x) \big) \, dx,$$

where the function $f - \alpha g$ will have smaller variance than $f$ itself. When $g$ is a function that is easy to integrate exactly, then using a Monte Carlo estimate of $\int (f - \alpha g)$ will produce a more accurate estimate of the underlying integral.
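A minimal check of this variance-reduction claim in Python/NumPy: here I build $g$ as a cubic least-squares fit on a grid (my own stand-in for the cross-validation fit described below) and compare the sample variance of $f - \alpha g$ against that of $f$:

```python
import numpy as np

# Sketch of the identity  int f = alpha*int g + int (f - alpha*g):
# the residual f - alpha*g should fluctuate less than f itself.
rng = np.random.default_rng(1)

f = lambda x: 1.0 / (1.1 - x)

# A decent cubic approximation g, fit by least squares on a grid.
xgrid = np.linspace(-1.0, 1.0, 201)
g = np.polynomial.Polynomial.fit(xgrid, f(xgrid), deg=3)

x = rng.uniform(-1, 1, size=100_000)
alpha = 0.5
residual = f(x) - alpha * g(x)

print(np.var(f(x)), np.var(residual))
```

The residual's variance comes out well below the variance of $f$, which is exactly why a Monte Carlo estimate of the residual integral is less noisy.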

But where can we get such a function $g$? This is where the second idea arises. We don’t get a single function $g$; we actually get a set of functions $g_i$ from a cross-validation study, which brings up another simulation informatics theme: treating functions like data. Suppose we have 20 evaluations of $f$ over the real interval $[-1,1]$. With any four of them, we can fit a unique cubic polynomial. Such a polynomial is the function $g$, and something that is easy to integrate exactly. Consequently, we have the left-hand term in our approximation. (The issue of how to pick $\alpha$ is discussed in the paper, and we don’t deal with it here.) The remaining 16 points will let us compute a Monte Carlo estimate of the residual integral $\int (f - \alpha g)$. In a standard cross-validation study, these values would be used to pick the best cubic model of the data by minimizing the error on the “hold-out” set, the remaining 16 examples. Here, they constitute useful data!

The cross-validation idea arises because the authors propose to do this for all of the cross-validation sets. That is, we divide the points into 5 bins of 4 points each, and then average the 5 estimates we get from the 5 functions $g_i$. Hence the term “stacked Monte Carlo”: it’s a Monte Carlo estimate inside of a Monte Carlo estimate.

Here is Matlab code that exemplifies the idea:

```matlab
%% A quick example of Stacked Monte Carlo integration
%% Setup the data
f = @(x) 1./(1.1-x);
xvals = rand(20,1)*2 - 1;
fvals = f(xvals);
%% Compute the Monte Carlo estimate
mc = mean(fvals)
%%
% Boy that was easy!
trueint = log(2.1) - log(0.1)
%%
% But it's really wrong.
%% Compute the stacked Monte Carlo estimate
nsets = 5;
nsetvals = 4;
alpha = 0.5;
ests = zeros(nsets,1);
for seti=1:nsets
    trainset = nsetvals*(seti-1)+1:nsetvals*seti;
    testset = setdiff(1:nsetvals*nsets,trainset);
    % compute the integral of the function g
    xtrain = xvals(trainset);
    ftrain = fvals(trainset);
    gcoeffs = polyfit(xtrain,ftrain,nsetvals-1);
    intgcoeffs = polyint(gcoeffs);
    intg = polyval(intgcoeffs,1)-polyval(intgcoeffs,-1);
    % compute the MC estimate of the remainder
    xtest = xvals(testset);
    ftest = fvals(testset);
    intr = mean(ftest - alpha*polyval(gcoeffs,xtest));
    ests(seti) = alpha*intg + intr;
end
smc = mean(ests)
```

When I run this, I get

```
mc = 1.7526
trueint = 3.0445
smc = 2.2405
```

I repeated this a few times, and the smc estimate is always better. Playing around with $\alpha$ shows that using a larger $\alpha$ produces a better estimate here. But it’s still great to get better accuracy from a rather simple modification.

Two quick comments:

1) Normally in cross-validation, the training sets are larger than the testing sets. So, instead of fitting a polynomial with 4 points and using 16 points to correct the fit, one would normally flip those two numbers: fit the polynomial with 16 points, and correct the fit with the other 4. One could fit the parameters of the polynomial using least squares, for example, as is done in the paper. The reason is that we want to use as many points as we can in order to get the most accurate polynomial fit.
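The flipped split is easy to sketch in Python/NumPy. This is my own illustration, not the paper's exact procedure: each fold fits a cubic by least squares to 16 points and corrects with the held-out 4, with the factor of 2 accounting for the length of the domain:

```python
import numpy as np

# Flipped split: fit g to the 16 "training" points by least squares,
# then correct with a Monte Carlo estimate over the 4 held-out points.
rng = np.random.default_rng(2)

f = lambda x: 1.0 / (1.1 - x)
true_int = np.log(2.1) - np.log(0.1)

x = rng.uniform(-1, 1, size=20)
y = f(x)
alpha = 0.5
vol = 2.0                                        # length of [-1, 1]

ests = []
for k in range(5):
    test = np.arange(4 * k, 4 * (k + 1))         # 4 points to correct with
    train = np.setdiff1d(np.arange(20), test)    # 16 points to fit g
    g = np.polynomial.Polynomial.fit(x[train], y[train], deg=3)
    gi = g.integ()
    int_g = gi(1.0) - gi(-1.0)                   # exact integral of the cubic
    resid = vol * np.mean(y[test] - alpha * g(x[test]))
    ests.append(alpha * int_g + resid)

print(np.mean(ests), true_int)
```

Averaging the five fold estimates gives the stacked value; with the larger training sets, the cubic fits are more stable than the 4-point interpolants.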

2) While it’s true in this case that the higher $\alpha$ is, the better, I have not found that to be true generally. A quick counterexample: if the fit is really bad, one would want an $\alpha$ close to zero. I would recommend that anyone who is going to use this method use the method for choosing $\alpha$ described in the paper.
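For intuition about why a bad fit pushes $\alpha$ toward zero: if $g$ is viewed as a control variate, the classical variance-minimizing coefficient is $\alpha^* = \mathrm{Cov}(f,g)/\mathrm{Var}(g)$. A Python/NumPy sketch of estimating it from samples (this is the textbook control-variate rule, not necessarily the paper's selection method):

```python
import numpy as np

# Classical control-variate coefficient alpha* = Cov(f, g) / Var(g),
# estimated from samples of f and g.
rng = np.random.default_rng(3)

f = lambda x: 1.0 / (1.1 - x)

xgrid = np.linspace(-1, 1, 201)
g = np.polynomial.Polynomial.fit(xgrid, f(xgrid), deg=3)  # a decent fit

x = rng.uniform(-1, 1, size=10_000)
fx, gx = f(x), g(x)
alpha_star = np.cov(fx, gx)[0, 1] / np.var(gx, ddof=1)

# A "fit" that is unrelated to f drives the coefficient toward zero.
bad = rng.uniform(-1, 1, size=x.shape)
alpha_bad = np.cov(fx, bad)[0, 1] / np.var(bad, ddof=1)
print(alpha_star, alpha_bad)
```

For a good approximation $\alpha^*$ comes out near 1, matching the experiment above; for a useless one it collapses toward 0, matching the counterexample.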