Adapting the Variance-Covariance Matrix
As has been noted before, there have been instances of chains simply failing to mix. While this seems rather silly in principle (sooner or later, after all, some mixing should happen and it should eventually be good enough to get convergence by various ergodicity theorems), it is rather annoying in practice.
In thinking about this problem, I have concluded that although BCIMA assures us that the chain is actually drawn from the target distribution, it provides no assurance that the resulting chain (when the length is picked a priori) will be useful at all. This might be called the curse of slow mixing. Part of the point of targeting a 20% acceptance ratio is that, in continuous state spaces, the probability that the chain fails to move even once in five steps is 0.8^5, or about 33%. Thus, if one were to generate five times the number of desired points and then take every fifth (a reasonable if time-expensive procedure), one can reasonably expect good mixing of the final subsample. But dropping down to a 10% acceptance ratio, the probability of failing to move after five steps is 0.9^5, or about 59%.
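The arithmetic behind those percentages can be checked with a few lines of Python. This is only a sketch: the function name `prob_no_move` is made up for illustration, and it assumes every proposal is accepted independently with the same probability, which is a simplification of how a real chain behaves.

```python
def prob_no_move(acceptance_rate: float, steps: int) -> float:
    """Probability that every one of `steps` proposals is rejected.

    Assumption (not from the post): each proposal is accepted independently
    with the same probability `acceptance_rate`.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - acceptance_rate) ** steps


# Compare a few acceptance ratios when keeping every fifth point.
for rate in (0.10, 0.20, 0.30):
    p = prob_no_move(rate, 5)
    print(f"{rate:.0%} acceptance: P(no move in 5 steps) = {p:.3f}")
```

The same function makes it easy to ask how much thinning a given acceptance ratio would need before a stuck subsample becomes unlikely.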
While there is no particular reason to believe that any particular target is appropriate, I have made the decision to include a target mixing rate in the ABCIMA code because of exactly the above fact: if the mixing rate is not reasonably high (between 10% and 30%), then it is hard to generate chains long enough to be useful without taking a very long time to compute. Since points burned in the process of Backward Coupling are dead computational weight by every measure, it seems intelligent to make good use of them by using them to tune a proposal density which is expected to yield a productive chain. By almost any measure, a productive chain is one which mixes enough to explore the space.
A word of note: if the proposal is reasonably close to the target and the state space is continuous, then almost any mixing rate above a certain floor is a good one. The proposal being close to the target means that the candidate samples are coming proportionally from the right parts of the state space, and the high mixing means that the chain will shuffle around between points in the high-density regions of the target.
Draft Paper, Posteriors and Errors
A draft version of an academic paper should be on its way out to the editors in a few days, so hopefully all will go well. I think I’ll probably get a copy of my published paper framed. It’s not very often something like that happens to a guy like me. Once the paper is accepted for publication, it will be time to move on and start another project.
One thing I just realized, having now looked at the numerical results, is that small image sets can still yield very good estimates. The key metric really isn’t simple radial error, for considering posteriors quickly reveals that if the difference between the posterior density at the true source and at the estimate is large, then we can expect a large error. Why? Because this indicates that the true source is less likely to generate the given image than the estimated point. With that in mind, the key metric for determining how well an estimation technique performs becomes the dispersion of the estimates. That is perhaps best measured by the eigenvalues of the variance-covariance matrix of the estimates or of the variance-covariance matrix of the errors when expressed in (x,y) terms (because of the simple link between those two, they can be used interchangeably).
What those eigenvalues reveal is the consistency of the estimates. While it is possible to construct example posterior densities which are multi-modal or otherwise degenerate, so long as the chain explores the space well, obtaining consistent estimates is a good thing: it means we don’t need to run many estimates in order to obtain an ensemble estimate.
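As a rough illustration of that dispersion measure, the sketch below computes the eigenvalues of the sample variance-covariance matrix of a set of (x,y) estimates with NumPy. The function name `dispersion_eigenvalues`, the true source location, and the synthetic estimates are all invented for the example; none of them come from the paper's data. It also shows why the two matrices can be used interchangeably: the errors are just the estimates shifted by a fixed true source, and a shift does not change a covariance matrix.

```python
import numpy as np


def dispersion_eigenvalues(points: np.ndarray) -> np.ndarray:
    """Eigenvalues (largest first) of the sample variance-covariance matrix.

    `points` is an (n, d) array with one estimate per row.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim != 2 or points.shape[0] < 2:
        raise ValueError("need an (n, d) array with at least two rows")
    cov = np.atleast_2d(np.cov(points, rowvar=False))
    # eigvalsh is meant for symmetric matrices and returns ascending order.
    return np.linalg.eigvalsh(cov)[::-1]


# Hypothetical data: 200 synthetic estimates scattered around a made-up source.
rng = np.random.default_rng(0)
true_source = np.array([0.5, -0.25])
estimates = true_source + rng.normal(scale=[0.1, 0.3], size=(200, 2))
errors = estimates - true_source

eig_estimates = dispersion_eigenvalues(estimates)
eig_errors = dispersion_eigenvalues(errors)
print("eigenvalues from estimates:", eig_estimates)
print("eigenvalues from errors:   ", eig_errors)
```

Small eigenvalues in every direction indicate tightly clustered, consistent estimates; one large eigenvalue indicates that the estimates are spread out along a particular direction.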
Starting To Run Examples
I am starting to run the examples that will be used in a paper my advisor and I will be submitting to a European mathematics journal.
There are four of them, all using a centered standard normal prior and a standard normal candidate distribution (for better comparison across estimation techniques). I ran the IMA analyses today to obtain 250 point chains with 1000 point burn ins.
In two of the cases I obtained near-stationary chains, and in the other two there was very nice mixing. Later today, or perhaps tomorrow, I will make the plots for the sample images and the IMA chains. I want to see if there is something in the different examples which might hint at an explanation.
For giggles I ran the IMA process again for my two near stationary chain producing images using a uniform proposal and obtained slightly better results. Likewise with a 2I (2×2) variance-covariance matrix vice the standard normal candidate distribution. So I think the issue is that in some cases a standard normal prior is simply too strong to allow for good mixing. This is most interesting and I will need to run some more chains to see if I can repeat this result.
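To make the comparison concrete, here is a toy sketch of that kind of experiment. It rests on several assumptions that are not in the post: it takes IMA to be an independence Metropolis sampler, it uses a made-up two-dimensional target (a standard normal prior multiplied by a narrow Gaussian likelihood term), and the function names are invented. It is not the ABCIMA or IMA code, only an illustration of how the candidate's variance-covariance matrix can be swapped and the acceptance rate compared.

```python
import numpy as np


def independence_metropolis(log_target, cov, n_keep=250, n_burn=1000, seed=0):
    """Toy independence Metropolis sampler with a N(0, cov) candidate.

    Returns the kept chain and the acceptance rate over the kept steps.
    """
    rng = np.random.default_rng(seed)
    cov = np.asarray(cov, dtype=float)
    dim = cov.shape[0]
    chol = np.linalg.cholesky(cov)
    cov_inv = np.linalg.inv(cov)

    def log_weight(x):
        # log(target / candidate); normalising constants cancel in the ratio.
        return log_target(x) + 0.5 * x @ cov_inv @ x

    x = chol @ rng.standard_normal(dim)
    lw_x = log_weight(x)
    chain = np.empty((n_keep, dim))
    accepted = 0
    for i in range(n_burn + n_keep):
        y = chol @ rng.standard_normal(dim)  # candidate ignores current state
        lw_y = log_weight(y)
        if np.log(rng.uniform()) < lw_y - lw_x:
            x, lw_x = y, lw_y
            if i >= n_burn:
                accepted += 1
        if i >= n_burn:
            chain[i - n_burn] = x
    return chain, accepted / n_keep


def log_target(x):
    """Hypothetical unnormalised log posterior: N(0, I) prior times a narrow
    Gaussian likelihood term centred at a made-up location."""
    centre = np.array([1.5, -1.0])
    return -0.5 * x @ x - 0.5 * np.sum((x - centre) ** 2) / 0.3**2


chain_std, rate_std = independence_metropolis(log_target, np.eye(2))
chain_wide, rate_wide = independence_metropolis(log_target, 2.0 * np.eye(2))
print(f"standard normal candidate: acceptance rate {rate_std:.2f}")
print(f"2I candidate:              acceptance rate {rate_wide:.2f}")
```

The defaults mirror the 250 point chains with 1000 point burn-ins described above; whether the wider candidate helps will depend on the target, which is exactly what repeated runs would have to establish.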
On a side note, some previous testing for ABCIMA suggested that the candidate distribution can rapidly degenerate to a single point with mass one if two precautions are not taken:
- Artificially inflating the eigenvalues of the variance-covariance matrix by a fixed amount to ensure a minimum value of same.
- Using the entire set of previous chain points to compute the variance-covariance matrix for the next adaptation.
I think what I am seeing today is a reflection of the same basic problem: the variance-covariance matrix of the candidate distribution controls mixing and has a very strong effect. Ergo a possible correct remediation of poor mixing is very simple: inflate eigenvalues.
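The two precautions listed above can be sketched together with NumPy. This is not the ABCIMA implementation: the function name `inflated_covariance` and the inflation amount of 0.05 are assumptions chosen for illustration. The sketch computes the variance-covariance matrix from the entire chain history and then adds a fixed amount to every eigenvalue, which keeps the smallest eigenvalue at or above that amount.

```python
import numpy as np


def inflated_covariance(chain: np.ndarray, inflation: float = 0.05) -> np.ndarray:
    """Candidate variance-covariance matrix with inflated eigenvalues.

    `chain` is an (n, d) array holding ALL previous chain points, not just
    the most recent block. `inflation` is a hypothetical fixed amount added
    to each eigenvalue.
    """
    chain = np.asarray(chain, dtype=float)
    if chain.ndim != 2 or chain.shape[0] < 2:
        raise ValueError("need an (n, d) array with at least two rows")
    cov = np.atleast_2d(np.cov(chain, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Clip tiny negative round-off, then add the fixed amount.
    eigvals = np.clip(eigvals, 0.0, None) + inflation
    # Rebuild V diag(lambda) V^T; same result as cov + inflation * I.
    return (eigvecs * eigvals) @ eigvecs.T


# A chain that never moved has a zero covariance matrix, so an adapted
# candidate would collapse to a point mass without the inflation.
stuck_chain = np.tile([1.0, 2.0], (50, 1))
safe_cov = inflated_covariance(stuck_chain, inflation=0.05)
print(safe_cov)
```

Adding the amount to every eigenvalue matches the "by a fixed amount" wording; flooring the eigenvalues at a minimum with `np.maximum` would be an alternative reading of the same precaution.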
Fortunately, in the case of IMA this is very easy to do. Likewise in BCIMA. However, I have written the ABCIMA code so that the parameter which dictates the minimum eigenvalues of the said matrix is inaccessible to anyone who isn’t interested in digging around the source to manipulate it. That may need to change in version 0.5 of the package. I may also suggest that an intentionally broad candidate distribution be selected (vice simply using the prior) in (BC)IMA analyses in order to help ensure proper mixing. It would take many, many runs to confirm that this works (runs I don’t have the computer power to handle), but it may be a correct next step.