{"id":590,"date":"2015-06-18T08:34:51","date_gmt":"2015-06-18T08:34:51","guid":{"rendered":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/?p=590"},"modified":"2025-02-26T13:21:37","modified_gmt":"2025-02-26T13:21:37","slug":"stan-with-stata-part-v-kicking-a-marble-around-in-a-bucket","status":"publish","type":"post","link":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/2015\/06\/18\/stan-with-stata-part-v-kicking-a-marble-around-in-a-bucket\/","title":{"rendered":"Stan with Stata, Part V: Kicking a marble around in a bucket"},"content":{"rendered":"<p>In\u00a0the last few postings I have described how the Bayesian analysis program,\u00a0<strong>Stan<\/strong>, can be called from within Stata. Before we can go any further we have to understand how Stan works, which is to say we need to understand the basics of Hamiltonian Monte Carlo,\u00a0as this\u00a0is Stan&#8217;s way of sampling from the posterior.<\/p>\n<p>I&#8217;ll start the explanation this week and complete it next time.<\/p>\n<p><strong>The problem with Gibbs Sampling<\/strong><br \/>\nWinBUGS, OpenBUGS and JAGS are all based on Gibbs Sampling, hence the GS in\u00a0each of their names. Gibbs samplers work by moving through the set of parameters in some defined order, updating one set of parameters, then moving on to update the next. Each update draws from the conditional distribution of the current set of parameters, with all of the others held at their current values.<\/p>\n<p>Gibbs samplers reduce multi-dimensional problems to a long sequence of low-dimensional moves and as such are simple to program. Their chief disadvantage is illustrated in the diagram below. Here we have two parameters with a reasonably high correlation and we update them one at a time, holding the other parameter at its current value. The result is that we tend to make very local moves and it can take a long time to get from one end of the distribution to the other. 
Hence poor mixing and an inefficient algorithm.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/gibbs-e1427786868674.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-599\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/gibbs-e1427786868674.png\" alt=\"gibbs\" width=\"500\" height=\"363\" \/><\/a><\/p>\n<p>Ideally we would like a sampler that takes the multi-dimensional shape of the posterior into account and that is able to move more quickly across the distribution. Block updating, in which correlated parameters are grouped together in the Gibbs sampler, is an attempt to do this, but\u00a0it requires that the user understand the shape of the distribution when designing the sampler, and so\u00a0it is\u00a0not truly automatic.<\/p>\n<p><strong>Hamiltonian Monte Carlo<\/strong><\/p>\n<p>Hamiltonian Monte Carlo (HMC) is an alternative to Gibbs sampling that is designed to give better mixing when parameters are correlated, but without the user needing to have\u00a0advance knowledge about the shape of the posterior. Sometimes HMC is called Hybrid Monte Carlo.<\/p>\n<p>Many years ago, I studied mathematics at university and in my first year they made us study pure maths, applied maths and statistics. As much as anything else, it was my inability to do applied maths that made me become a statistician. The equations of motion of a spinning top were a mystery to me\u00a0but I remember that at the end of the year we reached the Lagrangian and the Hamiltonian.\u00a0Unfortunately,\u00a0by that stage, I had lost what little intuition I had for mechanics.<\/p>\n<p>If you understand Hamiltonian mechanics then the parallels will help you to understand HMC and you will probably love the chapter by Radford Neal called <em>MCMC using Hamiltonian Dynamics<\/em> in the <em>Handbook of Markov Chain Monte Carlo<\/em> published by Chapman and Hall. 
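<\/p>
<p>To make the one-at-a-time Gibbs updates of the previous section concrete, here is a minimal sketch, written in Python rather than Stata so that it is easy to run anywhere; the bivariate normal target and its correlation of 0.95 are invented for illustration.<\/p>

```python
# A minimal Gibbs sampler for two correlated parameters. Each update draws
# one parameter from its conditional distribution, holding the other fixed.
# The bivariate normal target and rho = 0.95 are illustrative choices.
import math
import random

def gibbs_bivariate_normal(rho=0.95, iters=5000, seed=1):
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)  # sd of each conditional distribution
    draws = []
    for _ in range(iters):
        # draw x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x1 = rng.gauss(rho * x2, sd)
        # draw x2 | x1 ~ N(rho * x1, 1 - rho^2)
        x2 = rng.gauss(rho * x1, sd)
        draws.append((x1, x2))
    return draws

draws = gibbs_bivariate_normal()
```

<p>With the correlation this high, each conditional draw can only move a short distance, so successive points crawl along the ridge of the distribution: exactly the local moves and poor mixing described above.<\/p>
<p>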
If, on the other hand, you are like me, then you will need a\u00a0gentler approach.<\/p>\n<p>It is actually quite difficult to talk about HMC without using physical analogies, but let&#8217;s try to avoid diving into Hamilton&#8217;s equations of motion.\u00a0Instead let&#8217;s start by\u00a0imagining a much simpler algorithm for a one-dimensional parameter, \u03b8, with posterior f(\u03b8). We start with a random guess \u03b8<sub>t<\/sub> and then update iteratively such that<br \/>\n\u03b8<sub>t+1<\/sub> = \u03b8<sub>t<\/sub> +\u00a0\u03b5 df\/d\u03b8<br \/>\nwhere the derivative is evaluated at the current point. This is a crude hill climbing algorithm that takes steps of size\u00a0\u03b5 whose direction and length are set by the gradient. When we reach the top of the hill the gradient will be zero and we will stay there. What we have is an algorithm for finding the mode of the posterior.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_climb-e1427788934171.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-603\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_climb-e1427788934171.png\" alt=\"hill_climb\" width=\"500\" height=\"363\" \/><\/a><\/p>\n<p>Of course we could equally well work on the log-scale and the\u00a0value of \u03b8 at\u00a0the mode would not change.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_climb_log-e1427789018101.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-604\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_climb_log-e1427789018101.png\" alt=\"hill_climb_log\" width=\"500\" height=\"363\" \/><\/a><\/p>\n<p>Alternatively we could invert the curve and use the method to find\u00a0the minimum rather than\u00a0the mode.<\/p>\n<p><a 
href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_descent_log-e1427789083613.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-605\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/hill_descent_log-e1427789083613.png\" alt=\"hill_descent_log\" width=\"500\" height=\"363\" \/><\/a><\/p>\n<p>We naturally think of a hill climbing or descending algorithm as making a series of discrete steps, but we could make the step length very small and\u00a0imagine a continuous path from our initial guess to the mode.\u00a0We would find that\u00a0the speed of movement\u00a0is greater when we pass through a steep region of the hill.\u00a0My analogy is with a marble rolling inside a bucket.<\/p>\n<p>Now there is\u00a0an aspect of the algorithm that does not come across in these two-dimensional pictures: the bucket is multi-dimensional, because in this algorithm all of the parameters are handled simultaneously. So the shape of the bucket reflects all of the correlations between the parameters and the algorithm automatically adjusts to whatever the multi-dimensional shape of the posterior happens to be.<\/p>\n<p>One\u00a0other point to notice is that this journey is entirely deterministic: the movement depends only on the current position and the slope at that position, not on the route that we took to get there, and so we could reverse the algorithm and deduce the path that we must have taken.<\/p>\n<p>If you believe in maximum likelihood estimation, then\u00a0make f the likelihood and this algorithm would find the MLE. 
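<\/p>
<p>The climbing rule can be sketched in a few lines; here it is in Python rather than Stata, applied to a posterior proportional to \u03b8<sup>2<\/sup>(1-\u03b8), the B(3,2) distribution that reappears later in the post. The step size, starting value and iteration count are arbitrary choices for illustration.<\/p>

```python
# Hill climbing on the log-posterior: theta <- theta + eps * (d/dtheta) log f.
# Target: posterior proportional to theta^2 * (1 - theta), i.e. B(3,2).
# eps, the starting value and the iteration count are arbitrary choices.

def grad_log_post(theta):
    # d/dtheta of log[theta^2 * (1 - theta)] = 2/theta - 1/(1 - theta)
    return 2.0 / theta - 1.0 / (1.0 - theta)

eps = 0.001   # step size
theta = 0.2   # initial guess
for _ in range(10000):
    theta += eps * grad_log_post(theta)

# the gradient is zero at theta = 2/3, so the walk settles at the mode
```

<p>Because the gradient vanishes at the mode, the sequence converges to \u03b8 = 2\/3 and then stays there, whichever of the two scales we climb on.<\/p>
<p>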
It has its problems; for instance, we have to be careful when choosing the step length and starting value or\u00a0the algorithm might settle on a local mode when there is a higher mode somewhere else.<\/p>\n<p>But we are Bayesians, so our hill represents a posterior distribution and we want to characterise its whole shape so that we\u00a0will be able to\u00a0find the mean of \u03b8 rather than the mode. So we do not want to settle at the mode, but instead we need to keep travelling over the distribution.<\/p>\n<p>To keep the movement going, we allow\u00a0the\u00a0marble to move deterministically\u00a0within the multi-dimensional bucket, but periodically we give it a random kick. We might decide to allow the parameter to\u00a0move for a fixed time or for, say,\u00a010 discrete\u00a0steps, then kick it again and let it move for another ten steps, and so on.<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/kick-e1427789565852.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-607\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/kick-e1427789565852.png\" alt=\"kick\" width=\"500\" height=\"363\" \/><\/a><\/p>\n<p>We want to do this in such a way that the values of the parameter at the end of each set of 10\u00a0small steps\u00a0act like a random sample from the posterior. The key is how the marble moves after we give it a kick. This problem has already been solved by the equations of motion of mechanics, and a neat formulation of those equations was provided by the Irish mathematician, Hamilton, who used the conservation of energy to describe the motion. In our analogy the shape of the surface provides the marble with potential energy and the kick imparts kinetic energy. 
Balancing changes in one against changes in the other allows us to deduce the path.<\/p>\n<p>It turns out that if the kick has a size (momentum) z,\u00a0we can\u00a0update<br \/>\n\u03b8<sub>t+1<\/sub> = \u03b8<sub>t<\/sub> + \u03b5 z<sub>t<\/sub><br \/>\nand then the energy in the kick is lost in such a way that<br \/>\nz<sub>t+1<\/sub> = z<sub>t<\/sub> &#8211; \u03b5 df\/d\u03b8<br \/>\nwhere f is minus the log-posterior (the shape of the surface, which dictates the potential energy). These two equations are a special case in which the momentum of the kick is drawn independently of position from a distribution g such that<br \/>\nd\/dz[ -log(g) ] = z<br \/>\nwhich amounts to g = exp(-z<sup>2<\/sup>\/2), which of course is proportional to a N(0,1) distribution.\u00a0This is a\u00a0very convenient way to generate the random kicks, but it is not the only way.<\/p>\n<p>Let us see how this works in practice. Suppose that the posterior is a beta distribution, B(3,2), so<br \/>\nit is proportional to \u03b8<sup>2<\/sup>(1-\u03b8)<br \/>\nf = -2log\u03b8 &#8211; log(1-\u03b8)<br \/>\ndf\/d\u03b8 = -2\/\u03b8 + 1\/(1-\u03b8)<br \/>\nThere are a lot of tuning parameters in this algorithm, but for my example I will choose to make sets of 20 steps of size 0.01. 
In all, I will generate 500 values from the posterior.<br \/>\n<span style=\"color: #0000ff\">clear<\/span><br \/>\n<span style=\"color: #008000\">* 500 sets of 20 steps<\/span><br \/>\n<span style=\"color: #0000ff\"> set obs 500<\/span><br \/>\n<span style=\"color: #0000ff\"> gen theta = 0<\/span><br \/>\n<span style=\"color: #0000ff\"> gen z = 0<\/span><br \/>\n<span style=\"color: #0000ff\"> local eps = 0.01<\/span><br \/>\n<span style=\"color: #008000\"> * random starting value<\/span><br \/>\n<span style=\"color: #0000ff\"> local t = runiform()<\/span><\/p>\n<p><span style=\"color: #0000ff\">forvalues iter=1\/500 {<\/span><br \/>\n<span style=\"color: #008000\"> * a random kick<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0 local z = rnormal()<\/span><br \/>\n<span style=\"color: #008000\"> * twenty steps<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0 forvalues i=1\/20 {<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 local z = `z' - `eps'*(-2\/`t'+1\/(1-`t'))<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 local t = `t' + `eps'*`z'<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0 }<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0 qui replace theta = `t' in `iter'<\/span><br \/>\n<span style=\"color: #0000ff\">\u00a0\u00a0\u00a0 qui replace z = `z' in `iter'<\/span><br \/>\n<span style=\"color: #0000ff\"> }<\/span><br \/>\n<span style=\"color: #0000ff\"> histogram theta , addplot(function y = 12*x*x*(1-x) , range(0 1)) leg(off)<\/span><\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-600\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta.png\" alt=\"stan_beta\" width=\"538\" height=\"394\" 
srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta.png 538w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta-300x220.png 300w\" sizes=\"auto, (max-width: 538px) 100vw, 538px\" \/><\/a><\/p>\n<p>Amazingly successful for such a simple piece of code, but it can go wrong. Here is another random sample generated by the same code,<\/p>\n<p><a href=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta_fail.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-601\" src=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta_fail.png\" alt=\"stan_beta_fail\" width=\"538\" height=\"394\" srcset=\"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta_fail.png 538w, https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/files\/2015\/03\/stan_beta_fail-300x220.png 300w\" sizes=\"auto, (max-width: 538px) 100vw, 538px\" \/><\/a><\/p>\n<p>Of course this should not happen because theta ought to stay between 0 and 1, but if, due to the finite size of the step, we do veer outside this range then the algorithm can go wild. Think of the analogy with badly tuned mode finding algorithms that diverge away from the peak, or of someone kicking the marble out of the bucket.<\/p>\n<p>Enough for one week. More on\u00a0HMC next time.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In\u00a0the last few postings I have described how the Bayesian analysis program,\u00a0Stan, can be called from within Stata. Before we can go any further we have to understand how Stan works, which is to say we need to understand the basics of Hamiltonian Monte Carlo,\u00a0as this\u00a0is Stan&#8217;s way of sampling from the posterior. 
I&#8217;ll start [&hellip;]<\/p>\n","protected":false},"author":134,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[65,66,56,4],"class_list":["post-590","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-hamiltonian-monte-carlo","tag-hmc","tag-stan","tag-stata"],"_links":{"self":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/users\/134"}],"replies":[{"embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/comments?post=590"}],"version-history":[{"count":12,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/590\/revisions"}],"predecessor-version":[{"id":717,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/posts\/590\/revisions\/717"}],"wp:attachment":[{"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/media?parent=590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/categories?post=590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/staffblogs.le.ac.uk\/bayeswithstata\/wp-json\/wp\/v2\/tags?post=590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}