{"id":372,"date":"2014-11-07T10:09:07","date_gmt":"2014-11-07T10:09:07","guid":{"rendered":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/?p=372"},"modified":"2025-02-26T13:21:38","modified_gmt":"2025-02-26T13:21:38","slug":"adaptive-mcmc","status":"publish","type":"post","link":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/2014\/11\/07\/adaptive-mcmc\/","title":{"rendered":"Adaptive MCMC"},"content":{"rendered":"<p>The Stata Journal has not published very\u00a0much on Bayesian statistics, so I was delighted to see the article by Matthew Baker in the latest issue (Stata Journal 2014;14(3):623-661). Matthew describes a Mata program for adaptive MCMC and his paper\u00a0has encouraged me to discuss this topic. You should certainly read that article alongside this blog.<\/p>\n<p>I mentioned adaptation in chapter 3 of <em>Bayesian analysis with Stata<\/em> and I added an adapt option to the command &#8211;<strong>mcmcrun<\/strong>&#8211; so that users can automatically adapt the proposal standard deviations of their Metropolis-Hastings samplers. However, my approach was based on adaptation during the burn-in, while Matthew\u2019s paper extends this approach to continued adaptation during the whole MCMC run. This is an interesting topic about which there has been quite a lot of research.<\/p>\n<p>Adapting the sampler\u00a0throughout the chain has the\u00a0potential to be more efficient than fixing it at one value. Imagine sampling values from the conditional distribution of beta when the posterior resembles the plot below.\u00a0Ideally we would like\u00a0to use a proposal distribution for beta with a small standard deviation\u00a0when alpha is around 10 but a proposal distribution with a much larger standard deviation when alpha is around 20. 
Since we are unlikely to have the information necessary for designing such a sampler when the chain starts, wouldn&#8217;t it be helpful if the chain could learn how to adapt the proposal distribution as it goes along?<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/11\/contour.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-396\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/11\/contour-300x218.png\" alt=\"contour\" width=\"300\" height=\"218\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/11\/contour-300x218.png 300w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/11\/contour.png 811w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>At first sight adaptation looks attractive, but before we start considering it, there is one very important message:<\/p>\n<p><span style=\"color: #ff0000\">adapting the sampler during an MCMC run is dangerous.<\/span><\/p>\n<p style=\"text-align: left\">So if you are not sure that it is valid, then don&#8217;t do it.<\/p>\n<p style=\"text-align: left\"><strong>Beware the mathematician<\/strong><\/p>\n<p style=\"text-align: left\">In order for an MCMC algorithm to converge, we want the chain to have detailed balance, that is, it needs the property that for any two values of the parameter \u03b8, say \u03b81 and \u03b82,<\/p>\n<p style=\"text-align: center\">P(\u03b81) P(move \u03b81 to \u03b82) = P(\u03b82) P(move \u03b82 to \u03b81)<\/p>\n<p style=\"text-align: left\">Typically the moves are made dependent on (i) a proposal distribution and (ii) an acceptance probability, so the requirement for detailed balance becomes,<\/p>\n<p style=\"text-align: center\">P(\u03b81) P(propose \u03b81 to \u03b82) P(accept \u03b81 to \u03b82) = P(\u03b82) P(propose \u03b82 to \u03b81) P(accept \u03b82 to \u03b81)<\/p>\n<p style=\"text-align: left\">If we adapt the proposal distribution then we alter P(propose 
\u03b81 to \u03b82) as the chain progresses and we will have great difficulty in choosing an acceptance probability that ensures balance.<\/p>\n<p style=\"text-align: left\">A simple solution to the problem is to insist that the changes to the proposal distribution get smaller over time; eventually they will fade away and the proposal distribution will almost be fixed, and as a consequence we will eventually be able to choose acceptance probabilities that guarantee detailed balance. This is usually called diminishing adaptation.<\/p>\n<p style=\"text-align: left\">Now there are mathematical proofs to the effect that if the changes in the proposal distribution fade towards zero then you can easily make the chain converge to the correct distribution. Unfortunately these are mathematical proofs about what will happen if the chain is run for ever and of course we do not run chains of infinite length.<\/p>\n<p style=\"text-align: left\">Instead of accepting the mathematical proof, we must ask ourselves a more practical question. Suppose I run an adaptive chain for 10,000 iterations and maybe the proposal distribution is changed by noticeable, but diminishing amounts; will this have a meaningful impact on my estimated posterior distributions?<\/p>\n<p style=\"text-align: left\">Remember that if we restrict adaptation to the discarded burn-in, as I do with the adapt option, then we will not have a problem.<\/p>\n<p style=\"text-align: left\"><strong>A program for investigating adaptation<\/strong><\/p>\n<p style=\"text-align: left\">Here is a relatively general program for a random walk Metropolis-Hastings sampler\u00a0that simulates values from a normal prior distribution with mean 1 and standard deviation 2. 
This may look rather unusual since we are used to simulating from posterior distributions that depend on data,\u00a0while here we are simulating from a prior.\u00a0It is also true that\u00a0sampling from N(1,sigma=2) is trivial and does not need MCMC but\u00a0there are advantages in testing an algorithm on a problem\u00a0for which\u00a0we know the correct answer.<\/p>\n<p style=\"text-align: left\">To try to be clear, I will use sigma to represent the standard deviation of the\u00a0target distribution\u00a0and sd for the standard deviation of the proposal distribution.<\/p>\n<pre style=\"text-align: left\"><strong> <span style=\"color: #000080\">clear<\/span>\r\n<span style=\"color: #000080\"> set seed 51817<\/span>\r\n<span style=\"color: #000080\"> set obs 10000<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> * INITIAL VALUES<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> \/*A*\/ local sd = 1<\/span>\r\n<span style=\"color: #000080\"> \/*B*\/ local lambda = 1<\/span>\r\n<span style=\"color: #000080\"> gen sd = `sd'<\/span>\r\n<span style=\"color: #000080\"> gen lambda = `lambda'<\/span>\r\n<span style=\"color: #000080\"> gen theta = 1<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> * 10,000 iterations<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> local oldlogL = log(normalden(1,1,2))<\/span>\r\n<span style=\"color: #000080\"> forvalues i=2\/10000 {<\/span>\r\n<span style=\"color: #000080\"> qui replace sd = `sd' in `i'<\/span>\r\n<span style=\"color: #000080\"> qui replace lambda = `lambda' in `i'<\/span>\r\n<span style=\"color: #000080\"> local j = `i' - 1<\/span>\r\n <span style=\"color: #000080\">*-------------------------------------<\/span>\r\n<span style=\"color: #000080\"> * 
METROPOLIS-HASTINGS<\/span>\r\n<span style=\"color: #000080\"> * &amp; UPDATE PROPOSAL DISTRIBUTION (SD)<\/span>\r\n<span style=\"color: #000080\"> *-------------------------------------<\/span>\r\n<span style=\"color: #000080\"> local newtheta = theta[`j']+rnormal(0,`sd')<\/span>\r\n<span style=\"color: #000080\"> local logL = log(normalden(`newtheta',1,2))<\/span>\r\n<span style=\"color: #000080\"> if log(runiform()) &lt; `logL' - `oldlogL' {<\/span>\r\n<span style=\"color: #000080\"> qui replace theta = `newtheta' in `i'<\/span>\r\n<span style=\"color: #000080\"> local oldlogL = `logL'<\/span>\r\n<span style=\"color: #000080\"> local sd = `lambda'*`sd'<\/span>\r\n<span style=\"color: #000080\"> }<\/span>\r\n<span style=\"color: #000080\"> else {<\/span>\r\n<span style=\"color: #000080\"> local sd = `sd'\/`lambda'<\/span>\r\n<span style=\"color: #000080\"> qui replace theta = theta[`j'] in `i'<\/span>\r\n<span style=\"color: #000080\"> }<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> * UPDATE DIMINISHING FACTOR<\/span>\r\n<span style=\"color: #000080\"> *----------------------------<\/span>\r\n<span style=\"color: #000080\"> \/*C*\/ local lambda = `lambda'<\/span>\r\n<span style=\"color: #000080\"> qui replace lambda = `lambda' in `i'<\/span>\r\n<span style=\"color: #000080\"> }<\/span>\r\n<span style=\"color: #000080\"> histogram theta , \/\/\/<\/span>\r\n<span style=\"color: #000080\"> addplot( function y = normalden(x,1,2), range(-5 7) ) \/\/\/<\/span>\r\n<span style=\"color: #000080\"> leg(off)<\/span>\r\n<span style=\"color: #000080\"> mcmcstats theta<\/span>\r\n<span style=\"color: #000080\"> count if theta != theta[_n-1]<\/span>\r\n<span style=\"color: #000080\"> mcmctrace theta lambda sd , cgopt(row(3))<\/span><\/strong><\/pre>\n<p>I&#8217;ve highlighted 3 lines by labelling them as A, B and C. 
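If you want to experiment outside Stata, the program translates almost line for line into other languages. Here is my own rough Python analogue (the name run_chain and its defaults are mine, and Python's random number generator will not reproduce Stata's stream, so the numbers will differ slightly from those reported below):

```python
import math
import random

def run_chain(n=10000, sd=1.0, lam=1.0, seed=51817):
    """Random-walk Metropolis chain targeting N(mean=1, sigma=2).

    sd  : initial proposal standard deviation (line A in the Stata program)
    lam : adaptation factor (line B); lam = 1 leaves the proposal fixed
    """
    random.seed(seed)

    def logdens(x):
        # log of the N(1, 2) density, up to an additive constant
        return -0.5 * ((x - 1.0) / 2.0) ** 2

    theta, old_logl = 1.0, logdens(1.0)
    draws = []
    for _ in range(n):
        proposal = theta + random.gauss(0.0, sd)
        new_logl = logdens(proposal)
        if math.log(random.random()) < new_logl - old_logl:
            theta, old_logl = proposal, new_logl
            sd *= lam  # widen the proposal after an acceptance
        else:
            sd /= lam  # narrow it after a rejection
        draws.append(theta)
    return draws

draws = run_chain()
mean = sum(draws) / len(draws)
sd_chain = (sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)) ** 0.5
print(mean, sd_chain)  # should be close to the target's mean 1 and sigma 2
```

As in the Stata version, the log of the acceptance ratio simplifies because the normalising constant of the target cancels, and with lam = 1 the chain is a standard fixed-proposal sampler with detailed balance.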
By changing those lines we can alter the properties of the algorithm.<\/p>\n<p>The random walk updates have a standard deviation that is controlled by the local <strong>sd<\/strong> and the proposal distribution is altered by scaling sd\u00a0by a factor <strong>lambda<\/strong>. As the program stands, I set sd=1 and lambda=1 so the proposal distribution does not change. The mean posterior estimate of theta is 1.089 and its standard error (sem) is 0.0939. The standard deviation of the chain (estimating sigma) is 2.006. Here is a trace plot of the chain.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-383\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig1.png\" alt=\"adapt_fig1\" width=\"716\" height=\"521\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig1.png 716w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig1-300x218.png 300w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><\/a><\/p>\n<p>The chain mixes quite slowly because the proposal standard deviation is fixed at a value well below its optimal value. Setting line (A) to sd = 4 produces better mixing with an estimate of 1.030 and a sem of 0.0422.<\/p>\n<p>Both of these are valid chains that have detailed balance because sd is fixed. So now let\u2019s allow sd to change and see what happens. Suppose we return line (A) to read sd = 1 but set line (B) to lambda = 1.02. The proposal distribution will then broaden when moves are accepted and narrow when a proposal is rejected. Eventually the chain will settle to reject about 50% of proposals. 
This is the algorithm used during the burn-in by &#8211;<strong>mcmcrun<\/strong>&#8211; whenever the adapt option is set.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-384\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig2.png\" alt=\"adapt_fig2\" width=\"716\" height=\"521\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig2.png 716w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig2-300x218.png 300w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><\/a><br \/>\nThe proposal standard deviation moves quickly towards the (3,5) range. The mean of theta is 1.035 with a sem of 0.0416. The changes in the proposal distribution have no material impact on the convergence. Sometimes sd is too large and sometimes it is too small but the effect is small and seems to cancel.<\/p>\n<p>To create a diminishing effect we can set line (C) to be lambda = 1 + 0.02*0.999^`i'. Now we can see the impact of the diminishing changes in the way that sd settles around 4 once we pass iteration 4000 and lambda gets close to 1. The algorithm has the flexibility to adjust, but we soon learn an appropriate sd and the changes become negligible. 
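Incidentally, the schedule on line (C) is easy to study in isolation. A quick illustrative check (plain Python here, though any language would do) of how fast lambda shrinks towards 1:

```python
# Diminishing adaptation factor from line (C): lambda_i = 1 + 0.02 * 0.999^i
def diminishing_lambda(i):
    return 1 + 0.02 * 0.999 ** i

# Early in the chain the proposal sd can still change by about 2% per step,
# but by iteration 4000 the factor is within 0.0004 of 1, i.e. essentially fixed.
for i in (1, 1000, 4000, 10000):
    print(i, diminishing_lambda(i))
```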
The results are fine, with a posterior mean of 0.973 and sem = 0.0414.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-385\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig3.png\" alt=\"adapt_fig3\" width=\"716\" height=\"521\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig3.png 716w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig3-300x218.png 300w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><\/a><\/p>\n<p>The difficulty with adaptation is that often\u00a0we cannot be totally sure that the effect of the changes\u00a0will be\u00a0negligible within a run of finite length. This is especially true for problems with many correlated parameters,\u00a0since the target conditional distribution will change depending on the values of the other parameters. For this reason I much prefer methods that adapt during the burn-in.<\/p>\n<p><strong>An Investigation<\/strong><\/p>\n<p>Here is a suggestion for an adaptive algorithm that you might like to try. Make the standard deviation of the proposal distribution equal to half the standard deviation of the previous m simulated values of the parameter. If sigma=2 then sd should\u00a0have a mean of about 1.\u00a0Start with m=50 and then\u00a0investigate the effect of changing m. 
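If you prefer to prototype outside Stata, here is an illustrative Python sketch of this suggestion (my own code; the function name, seed and starting sd are arbitrary choices, and the target is the same N(1, sigma=2) as before):

```python
import math
import random
import statistics

def adaptive_window_chain(n=10000, m=50, seed=2014):
    """Random-walk Metropolis targeting N(mean=1, sigma=2), where the
    proposal sd is half the standard deviation of the previous m draws."""
    random.seed(seed)

    def logdens(x):
        # log of the N(1, 2) density, up to an additive constant
        return -0.5 * ((x - 1.0) / 2.0) ** 2

    theta, old_logl = 1.0, logdens(1.0)
    draws = [theta]
    sd = 1.0  # arbitrary value used until m draws are available
    for _ in range(n - 1):
        if len(draws) >= m:
            window_sd = statistics.stdev(draws[-m:])
            if window_sd > 0:  # guard: a completely stuck window gives sd = 0
                sd = 0.5 * window_sd
        proposal = theta + random.gauss(0.0, sd)
        new_logl = logdens(proposal)
        if math.log(random.random()) < new_logl - old_logl:
            theta, old_logl = proposal, new_logl
        draws.append(theta)
    return draws

draws = adaptive_window_chain()
```

Because m is fixed, the adaptation never diminishes, so there is no guarantee of detailed balance; treat this purely as material for the experiment, not as a recommended sampler.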
In this algorithm the adaptation does not diminish so you might also try allowing m to get larger as the chain increases in length.<\/p>\n<p>Here is what happens if we set m to be fixed at 50.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-386\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig4.png\" alt=\"adapt_fig4\" width=\"538\" height=\"394\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig4.png 538w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2014\/10\/adapt_fig4-300x219.png 300w\" sizes=\"auto, (max-width: 538px) 100vw, 538px\" \/><\/a><\/p>\n<p>Clearly all is not well. Adaptive MCMC is potentially dangerous.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Stata Journal has not published very\u00a0much on Bayesian statistics, so I was delighted to see the article by Matthew Baker in the latest issue (Stata Journal 2014;14(3):623-661). Matthew describes a Mata program for adaptive MCMC and his paper\u00a0has encouraged me to discuss this topic. You should certainly read that article alongside this blog. 
I [&hellip;]<\/p>\n","protected":false},"author":134,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[41,4],"class_list":["post-372","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-adaptive-mcmc","tag-stata"],"_links":{"self":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/users\/134"}],"replies":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/comments?post=372"}],"version-history":[{"count":13,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/372\/revisions"}],"predecessor-version":[{"id":402,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/372\/revisions\/402"}],"wp:attachment":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/media?parent=372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/categories?post=372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/tags?post=372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}