Adapting the Variance-Covariance Matrix

As has been noted before, there have been instances of chains simply failing to mix. While this seems rather silly in principle (sooner or later, after all, some mixing should happen and it should eventually be good enough to get convergence by various ergodicity theorems), it is rather annoying in practice.

In thinking about this problem, I have concluded that, despite the assurances of BCIMA that the chain is actually from the target distribution, it provides no assurance that the resultant chain (when the length is picked a priori) will be useful at all. This might be called the curse of slow mixing. Part of the point of targeting a 20% acceptance ratio is that the probability that, in continuous state spaces, the chain has failed to move at least once is a 32%, thus, if one were to generate five times the number of desired points and then take every fifth (a reasonable if time expensive procedure), then one can reasonably expect good mixing of the final subsample. But dropping down to a 10% acceptance ratio, the probability of failing to move after five steps is 59%.

While there is no particular reason to believe that any particular target is appropriate, I have made the decision to include a target mixing rate in the ABCIMA code because of exactly the above fact: if the mixing rate is not reasonably high (between 10% and 30%), then it is hard to generate chains of sufficient length as to be useful without taking a very long time to compute. Since points burned in the process of Backward Coupling are dead computations weight by every measure, it seems intelligent to attempt to make good use of them by using them to attain chains which use a proposal density which is expected to yield a productive chain. This, by almost any measure, is certainly one which mixes enough to explore the space.

A word of note: if the proposal is reasonably close to the target and the state space is continuous, then almost any mixing rate above a certain floor is a good one. The proposal being close to the target means that the candidate samples are coming proportionally from the right parts of the state space and the high mixing means that the chain will shuffle around between points in the high-density regions of the chain.

Small Optimizations and Numerical integration

I’ve been thinking and reading about little ways in which I can numerically optimize my code. While on the one hand preallocated memory, etc, can go a long way, I am also thinking about algorithms.

One area in particular is the numerical integrations which need to happen over and over and over again in the course of the software. At the moment it turns out that my method, while naive, is probably a pretty good one as some numerical experiments show good convergence properties with relatively small sample sizes (between 1000 and 2000 points) based on the variance of the last half of the estimates and the ACF of the same.

I went looking around for places to replace loops with vectorized operations with little luck. It is the nature of MCMC that loops become necessary. On the other hand, there are a couple places where if I am willing to sacrifice the memory to the task, I could precompute inputs to each loop iteration using the “apply” function and then use the loop to do the final transitions. Alternatively, I am thinking that because many, many inputs don’t ever change, that it may be possible to use the “call” and “do.call” functionality to prepackage all the unchanging bits into something that can then be executed much faster because, again, there is no need to do a bunch of memory accesses. At this time I am using do.call with a seperate list of arguments, so the list has to be created then fed to the do.call mechanism. This, I suspect, is far from efficient.

Adaptation Schemes for IMA

I have been reading about adaptation schemes since my origional scheme has some major problems (as has been discussed before) and I am trying to be somewhat less naive in my approach. Interestingly enough, my big issue at the moment is trying to come up with an intelligent way to do something called “diminishing adaptation” whereby as time goes on the adaptations are less and less severe.

One simple approach which I have a particular problem with based on my own experience is simply taking the covariance matrix of the data up to this point. Well, as I have already seen, it is possible to have degeneracy happen in practice even though there is a theoretical result out there in preprint on Cambridge’s MCMC preprint server which discuses the fact that the eigenvalues of the covariance matrix will not become degenerate under certain conditions. I am studying this to see how I may leverage this result and to see why I may have been having the problem before. Yet as I have seen, chains may may fail to mix in the initial 50 points which would then (if I immediately adapted based on those 50 points) cause my distribution to collapse.

I am considering two possible schema for doing adaptation of the variance-covariance matrix. In either case there is a desire to enable targeted mixing rates, to which end I am keeping track of what the candidate distributions have been and the associated mixing rates in an attempt to use past information to calibrate choices rather than take whatever is the latest set of values from the chain. So my ideas are as follows:

1) use a multiple of the old covariance matrix based on the new mixing rate where the multiplier monotonically approaches one from either above or below for the cases of too high and too low a mixing rate respectively

2) Use the mean (arithmetic? geometric? hyper-geometric?) of the eigenvalues of the old and new covariance matrices.

In either case I am unsure as to how to adapt the eigenvectors as they play a substantial role making the candidate distribution match up with the posterior (exactly why we would expect better mixing with adaptation!) I am thinking that I should merely take the most recent set and let things do at that, although it would perhaps be better to use some sort of average set of eigenvectors too.

Errors, Small Samples, Random Chance, and Bugs

The title aptly sums up the issues that my project advisor (recall, I started this project four years ago in university and then the life after college intervened to slow things down) have been fighting over the holidays while trying to prep for both a talk and the still eventual paper submission.

All this has taught me three things:

1) Thank goodness for the Unix Way of Programming, because it means that much of the code that we think is suspect will be easy to go back and test should we find evidence of sam.

2) The beauty of simulations is that you can just run more of them until the law of large numbers kicks in and observed behavior is true behavior. However this project has taught me that the law of large numbers takes a long time to kick in when small samples are at play. (SPECT is often considered a Poisson process, and as such there are a large number of small samples, at least in the simulations we’ve been running because naive methods are less costly than sophisticated ones and both handle large datasets fairly well).

3) If I’m going to stay in the numerical simulation business where my personal research projects are concerned, then I need to hurry up and both buy a much more powerful computer and learn to program in C++.

Adaptation Going Forward

After failing to make any progress on the software package, I was graced with a copy of a book about Monte Carlo methods using R (citation to follow) with a nice little section on adaptation (nothing on backward coupling at first glance, so there is at least something new in my work) and it talked about controlling the acceptance rate as the criterion for the adaptation. Their target is 40%, which is a rule of thumb, but is much higher that what I’ve been seeing (maybe 25% after a couple rounds of adaptation)

I have already mentioned problems with the candidate distribution converging too quickly and my one cludged method of preventing this: using the entire chain history to damp convergence. The other degeneracy prevention method involves an additive inflation of the eigenvalues of the variance-covariance matrix. This latter could be manipulated too, possibly based on the acceptance rate. A simple formula would be to compute the acceptance rate on the last subchain and set the inflator to split the difference between the old and new subchains.

It is clear that more thought on how to control the eigenvalues of the variance-covariance matrix to optimize mixing is a “good idea”, especially in the face of what I suspect is a multimodal posterior distribution. Some of the computed examples suggest that adapting too rapidly results in the chain finding a local maxima vice the global one (the domain is bounded and small, so exploring it should not be a problem (or so one would expect)).

Draft Paper, Posteriors and Errors

So a draft version of an academic paper should be on its way out to the editors in a few days. So hopefully all will go well. I think I’ll probably get a copy of my published paper framed. It’s not very often something like that happens to a guy like me. Once the paper is accepted for publication then it will be time to move on and start another project.

One thing I just realized having now looked at the numerical results is that small image sets can still yield very good estimates. The key metric really isn’t simple radial error, for considering posteriors quickly reveals that if the difference between the posterior density of the true source and the estimate are large then we an expect a large error. Why? Because this indicates that the true source is less likely to generate the given image than the estimated point. With that in mind, the key metric for determining how well an estimation technique performs becomes the dispersion of the estimates, and that can be perhaps best measured by the eigenvalues of the variance-covariance matrix of the estimates and the variance-covariance matrix of the errors when expressed in (x,y) terms (which because of the simple link between those two can be used interchangeably).

What those eigenvalues reveal is the consistency of the estimates. And while it is possible to construct example posterior densities which are multi-modal or otherwise degenerate so long as the chain explores the space well then obtaining consistent estimates is a good thing because it means we don’t need to run many estimates in order to obtain an ensemble estimate, a definite good thing.

Simulations Have Ran

Now that all the ABCIMA runs are done I can start working out imagery. The only small issue I’ve observed is that the adaptation seems to mean that if the mixing is poor on the initial couple chains then each subchain is going to be poor. This places a high emphasis on finding ways to ensure proper mixing.

Running Examples

So far I’ve done eight example runs. Turns out the BCIMA runs for images one and three are just as bad as the IMA runs where becoming stationary or near stationary is concerned. So there are some mixing problems. Images two and four are doing better to the point where mixing seems fine.

The ABCIMA runs are being started tonight. My other machine is having a small hickup which I think may be an Emacs issue vice an R issue, so I’ll see whether or not the first run works without constant attention. In short, a single subchain will be completed and then it seems to stop until I hit “enter” at which point it starts up again. So it could also be an issue with the code itself and having it print out status messages. This will be explored as time permit.

Once I have all the examples run I’ll start making the imagery and from there start writing the paper. I’m hoping that I don’t need more examples, but I think two will be enough for what I need to convey.

Starting To Run Examples

I am starting to run the examples that will be used in a paper my advisor and I will be submitting to a European mathematics journal.

There are four of them, all using a centered standard normal prior and a standard normal candidate distribution (for better comparison across estimation techniques). I ran the IMA analyses today to obtain 250 point chains with 1000 point burn ins.

So in two of the cases I obtained near stationary chains and in the other two there was very nice mixing. Later today, perhaps tomorrow I will make the plots for the sample images and the IMA chains. I want to see if there is something in the different examples which might hint at an explanation.

For giggles I ran the IMA process again for my two near stationary chain producing images using a uniform proposal and obtained slightly better results. Likewise with a 2I (2×2) variance-covariance matrix vice the standard normal candidate distribution. So I think the issue is that in some cases a standard normal prior is simply too strong to allow for good mixing. This is most interesting and I will need to run some more chains to see if I can repeat this result.

On a side note, some previous testing for ABCIMA suggested that the candidate distribution can rapidly degenerate to a single point with mass one if two precautions are not taken:

  1. Artificially inflating the eigenvalues of the variance-covariance matrix by a fixed amount to ensure a minimum value of same
  2. Using the entire set of previous chain points to compute the variance-covariance matrix for the next adaptation.

I think what I am seeing today is a reflection of the same basic problem: the variance-covariance matrix of the candidate distribution controls mixing and has a very strong effect. Ergo a possible correct remediation of poor mixing is very simple: inflate eigenvalues.

Fortunately in the case of IMA this is very easy to do. Likewise in BCIMA. However I have written the ABCIMA code to make the parameter which dictates the minimum eigenvalues of the said matrix inaccessible to anyone who isn’t interesting in digging around the source to manipulate it. That may need to change in version 0.5 of the package. I may also suggest that an intentionally broad candidate distribution be selected (vice simply using the prior) in (BC)IMA analyses in order to help ensure proper mixing. It would take many, many runs to ensure we obtain this–runs I don’t have the computer power to handle–but may be a correct next step.

My Slow Eee PC

It turns out with a couple of small tweaks everything works great on my Eee PC. It also turns out that programming on an EeePC enforces a very high level of efficiency in algorithms because while the software runs, it runs SLOWLY.

So the reality is when I have access to my other machine I’ll use it to do the runs and then transfer the .Rdata files over in order to economize my time.

I have four spect data files for use as examples and they look promising. So that means twelve runs, four with each algorithm. Should work out well.

Follow

Get every new post delivered to your Inbox.