I have had it in mind for some time that I should say something about the Bayesian analysis facilities introduced in Stata 14, and while preparing for that topic I was forced to question the sense of advocating my own parallel set of Stata commands. When I wrote the book, *Bayesian analysis with Stata*, there were no Bayesian options in Stata and I was assured repeatedly that none were planned. Then, shortly after publication, there was a change of mind and some very basic Bayesian commands were added to the latest release of Stata.

Clearly there is no sense in having two competing systems, and equally obviously StataCorp have the resources and experience to produce far better commands than I could, although it must be said that the official additions to Stata 14 are pretty basic and cannot produce the same variety of Bayesian analyses as my own commands.

It looks to me as though Stata have rushed out a very limited command for Metropolis-Hastings sampling without paying much attention to the way in which it will fit within the broader Bayesian software that they will eventually need to produce. While it is true that the Bayesian commands in Stata14 are limited, experience suggests that StataCorp will build quickly on those foundations and eventually they will produce something very impressive. I would not be surprised if they wrote a general Gibbs sampling command using Mata, similar to OpenBUGS or JAGS, or perhaps they will opt for HMC and write a Stan-like program in Mata.

StataCorp seem to have a policy of not linking Stata to other software. One can see the sense of this from an economic point of view as it makes users dependent on Stata and encourages them to buy the new releases. It also allows StataCorp to control the quality of the analyses that Stata produces. The downside is that developments are slower than they need to be and StataCorp is constantly engaged in reinventing the wheel. I think that the experience of more flexible approaches, as typified by R and Wikipedia, is that looser control has many advantages and quality is better maintained by the continual testing that results from heavy use.

Anyway, there is little point in my continuing to develop my own Stata commands for Bayesian analysis. It is a race that I cannot win.

There is, I think, still scope for a blog explaining Bayesian methods and illustrating different types of Bayesian analyses but this ought to be based on the official version of Stata and not my own commands. This creates a problem because the official commands are still too basic to do anything really interesting. In that sense, this blog is a few years ahead of its time.

There is one final factor that needs to be taken into account. I enjoy writing this blog; I certainly would not do it otherwise.

I have not made a final decision, but I will probably leave this blog open while posting less frequently, and meantime give thought to starting a different blog to keep myself amused.

If I were to write another book, it would be called *Bayesian methods in genetic epidemiology* and I would try to follow the style of *Bayesian analysis with Stata*, by which I mean I would attempt to emphasise practical application and the understanding of the basic concepts, but not worry too much about the rigour of the explanation. There are many people analysing genetic data who have not had a formal training in statistics and it interests me to try to explain complex statistical ideas to such non-specialists, though I realise that this is challenging and my attempts will not always be successful.

Writing a book is a major undertaking and I do not have time for it at present but I am attracted by the idea of self-publishing a book as a blog and releasing it in regular instalments, rather as Dickens released *Pickwick Papers* (and no, I do not have any delusions about my writing style, which I know to be very limited).

The trouble is that Stata is not a good program for handling genetic epidemiology datasets, which are often huge. StataCorp made some initial decisions that have been overtaken by events and StataCorp has shown itself slow to adapt. Clearly the idea of only having one spreadsheet of data open at a time is too limiting and restricting that spreadsheet to a few thousand columns makes it virtually impossible to use Stata for many modern applications. I’m sure that Stata will evolve, but progress is so slow that I fear for its long-term market share.

Anyway, a serially released book called *Bayesian methods in genetic epidemiology* would have to use R, so it would not be a natural continuation of this blog. I will give myself a few months to think it over and if I have the energy, I will make a start on *Bayesian methods in genetic epidemiology* as a fresh blog in the New Year. Meantime this blog will continue, at least for a while, but perhaps with a posting every month rather than a posting every week.

This is a Bayesian blog so, as you might expect, I have a great deal of sympathy with anyone who has a problem with p-values, but is a ban the right reaction?

Here is a question for you. What have the following in common?

(a) the charge of the Light Brigade

(b) the election of Jeremy Corbyn to lead the UK Labour party

(c) the banning of p-values by *Basic and Applied Social Psychology*

My answer is that all three are admirable because they were inspired by sincerely held, well-founded beliefs, but in each case it was clear from the outset that the venture was going to end in tears.

Let me steer clear of history and politics and concentrate on the statistics.

Anyone who knows anything about the fundamentals of statistics will know that the logic of null hypotheses and p-values has very little of relevance to say about scientific investigation. This is not the place to repeat those arguments but there are so many well-known problems with p-values that they definitely deserve a place in the rubbish bin of history.

If that were not enough, we can add to the list of problems with p-values that so many scientists misunderstand and misuse them. In my experience, few people outside of the specialist statistics community have any real understanding of what a p-value actually is. So most people use p-values as a black-box method, and even then there are problems because, due to a historical accident, conclusions are usually based on whether p<0.05. If we were to reset the threshold today, we would surely choose a stricter cut-off.

So, good for *Basic and Applied Social Psychology*, they have done the scientific world a great favour and I’m sure that people will look back on the banning of p-values as an important step forward.

The Journal’s ban extends to confidence intervals, as logically it must once p-values have been dropped, and they even express doubts about Bayesian methods, so they end up with something that is very close to an attack on statistics, masquerading as an attack on p-values.

In fact, their doubts about Bayesian statistics are, in themselves, quite interesting. The editors attack the Laplacian assumption that ignorance can be expressed by equal probabilities. This is effectively an attack on the adoption of flat priors, which is so common in poor-quality Bayesian analyses, so I am even in sympathy with those views.

Where the ban turns into glorious farce is that the editors have so little that is positive to offer as an alternative.

For instance, they say “we encourage the use of larger sample sizes than is typical in much psychological research”. I would certainly give my students a hard time if they said something like that. How large is large? The statement means nothing unless you have a method for defining the necessary sample size, and how do you do that without a calculation that comes very close to depending on the standard error, which in turn is linked to the p-value?

The editors’ other requirement is for “strong descriptive statistics, including effect sizes”. So I do a “large” psychological experiment and I find a 6% difference between the average scores of men and women. What do you conclude? Without some consideration of sampling variation you can conclude nothing. So I tell you that men scored between 30 and 80 and women between 28 and 85. Descriptive but still not enough. Already you will probably be saying to yourself, a range of 50, so the standard deviation must have been just over 10, if I knew the sample size I could calculate the standard error of the mean.
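To see the point numerically, here is a toy sketch of the thought experiment above (the numbers are invented to match the story: a 6-point difference in means and a standard deviation of just over 10 guessed from the range); the same descriptive statistics tell completely different stories depending on the sample size.

```python
import math

# Invented numbers matching the example above: a 6-point difference in mean
# scores and a standard deviation of just over 10, guessed from the range.
diff = 6
sd = 10.5
for n in (20, 200, 2000):             # per-group sample size (unknown in the story)
    se_diff = sd * math.sqrt(2 / n)   # standard error of the difference in means
    print(n, round(diff / se_diff, 1))
```

With 20 subjects per group the difference is unremarkable; with 2,000 it is overwhelming. Without n, the "strong descriptive statistics" decide nothing.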

Drop p-values by all means but you cannot escape from a consideration of random variation.

Banning p-values and then leaving the authors without firm guidance is gloriously daft. It is a pity that the editors missed the opportunity to advocate subjective Bayesian analysis or pure likelihood analysis, but I am really pleased that they have taken a step in the right direction. Their journal will suffer, but science will gain.

We have two binomial samples y_{1}=8, n_{1}=20 and y_{2}=16, n_{2}=30, but we are not sure whether to model them as,

M1: binomials with different probabilities π_{1} and π_{2}

M2: binomials with a common probability ξ.

For my priors I chose p(M1)=0.3 and p(M2)=0.7. Under M1 my priors on the parameters were p(π_{1}|M1)=Beta(5,5) and p(π_{2}|M1)=Beta(10,10) and under M2, p(ξ|M2)=Beta(20,20).

In the previous programs we imagined a vector of parameters Ψ with two elements (Ψ1,Ψ2). When we selected model M1 we just set π_{1}=Ψ1 and π_{2}=Ψ2, while in model M2 we set ξ=Ψ1 and created an arbitrarily distributed, unused parameter u=Ψ2. I then set up a Metropolis-Hastings algorithm to control switches between the models and everything worked well.

It turned out that the posterior probability of M2 is about 0.77, only slightly higher than our prior model probability of 0.7, so the data move us slightly in the direction of model 2. The posterior distribution of the binomial probabilities when in model 1 has means π_{1}=0.433 and π_{2}=0.520, and under model 2 the posterior mean is ξ=0.489.
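Because the priors are conjugate, these posterior means are easy to check directly: a Beta(a,b) prior combined with y successes out of n gives a Beta(a+y, b+n-y) posterior. A small Python sketch of the arithmetic (any language would do; this is just a check, not part of the Stata programs):

```python
# Checking the quoted posterior means with the conjugate beta-binomial
# update: prior Beta(a,b) + data (y, n) -> posterior Beta(a+y, b+n-y).
y1, n1 = 8, 20
y2, n2 = 16, 30

# Model M1: independent priors Beta(5,5) and Beta(10,10)
post1 = (5 + y1, 5 + n1 - y1)        # Beta(13, 17)
post2 = (10 + y2, 10 + n2 - y2)      # Beta(26, 24)

# Model M2: common probability with prior Beta(20,20), pooled data
post_xi = (20 + y1 + y2, 20 + (n1 - y1) + (n2 - y2))   # Beta(44, 46)

mean = lambda ab: ab[0] / (ab[0] + ab[1])
print(round(mean(post1), 3), round(mean(post2), 3), round(mean(post_xi), 3))
# 0.433 0.52 0.489
```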

This algorithm represents a rather special case of RJMCMC because it does not involve any transformations when we move between Ψ and the particular parameters of the two models, or if you wish you can say that there is a transformation but it takes the form of a simple equality. The result is that the Jacobians of the transformations are 1 and we can leave them out of the calculations.

What I want to do now is to introduce a transformation into the algorithm so that we are forced to calculate the Jacobians.

First let me present again the Stata program for RJMCMC without transformation. This is effectively the same as the program that we had before, except that I have expanded the code slightly with a few redundant calculations intended to prepare us for the generalization needed when we incorporate transformations.

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if( runiform() < 0.5 ) {
        * update within the current model
        if( `M' == 1 ) {
            local pi1 = rbeta(13,17)
            local pi2 = rbeta(26,24)
            local psi1 = `pi1'
            local psi2 = `pi2'
        }
        else {
            local xi = rbeta(44,46)
            local u = rbeta(1,1)
            local psi1 = `xi'
            local psi2 = `u'
        }
        matrix `b'[1,2] = `psi1'
        matrix `b'[1,3] = `psi2'
    }
    else {
        * try to switch
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        local pi1 = `psi1'
        local pi2 = `psi2'
        local xi = `psi1'
        local u = `psi2'
        local a1 = log(binomialp(20,8,`pi1'))+log(binomialp(30,16,`pi2'))+ ///
            log(betaden(5,5,`pi1'))+log(betaden(10,10,`pi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`xi'))+log(binomialp(30,16,`xi'))+ ///
            log(betaden(20,20,`xi'))+log(betaden(1,1,`u'))+log(0.7)
        if `M' == 1 {
            if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
        }
        else {
            if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
        }
    }
end

matrix b = (1, 0.5, 0.5)
mcmcrun logpost b using temp.csv , replace ///
    param(m psi1 psi2) burn(500) update(20000) thin(2) ///
    samplers( myRJMCMC , dim(3) )
insheet using temp.csv, clear
gen pi1 = psi1
gen pi2 = psi2
mcmcstats pi1 pi2 if m == 1
gen xi = psi1
gen u = psi2
mcmcstats xi u if m == 2
tabulate m

The main features of the program are:

(a) randomly with probability 0.5, I decide either to update the parameters within the current model or to propose a move between models

(b) the within model updates are able to use Gibbs sampling because we have chosen conjugate priors

(c) the MH algorithm for switching between models calculates the acceptance probability from the likelihood of the data under that model, the prior on the parameters under that model, the prior model probability and the probability of selecting that move. Calculations are performed on a log-scale to avoid loss of precision due to very small quantities.

(d) The chance of proposing a move to M2 when we are in M1 is the same as the chance of proposing a model to M1 when we are in M2 (i.e. 0.5) so these probabilities cancel and I have left them out of the acceptance probabilities.

(e) the distribution of M2’s unused parameter, u, is arbitrary. Here it is Beta(1,1) which is a very poor choice given that under the other model the equivalent parameter is Beta(26,24). This will have a small impact on mixing but the algorithm will recover and give the correct posterior.
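For readers without Stata, the same sampler can be sketched in a few lines of Python. This is only an illustrative translation of the program above, not the code I actually use; it should spend roughly 77% of its iterations in model 2.

```python
# Python sketch of the no-transformation RJMCMC sampler above: conjugate
# Gibbs updates within models, Metropolis-Hastings jumps between them.
import math
import random

random.seed(1)

def log_binom(n, y, p):
    # log of the binomial pmf
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log(1 - p))

def log_betaden(a, b, x):
    # log of the Beta(a,b) density
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

M, psi1, psi2 = 1, 0.5, 0.5
iters, in_m2 = 20000, 0
for _ in range(iters):
    if random.random() < 0.5:
        # update within the current model (conjugate Gibbs draws)
        if M == 1:
            psi1, psi2 = random.betavariate(13, 17), random.betavariate(26, 24)
        else:
            psi1, psi2 = random.betavariate(44, 46), random.betavariate(1, 1)
    else:
        # propose a model switch; the Jacobians are 1 and are omitted
        a1 = (log_binom(20, 8, psi1) + log_binom(30, 16, psi2)
              + log_betaden(5, 5, psi1) + log_betaden(10, 10, psi2)
              + math.log(0.3))
        a2 = (log_binom(20, 8, psi1) + log_binom(30, 16, psi1)
              + log_betaden(20, 20, psi1) + log_betaden(1, 1, psi2)
              + math.log(0.7))
        if M == 1 and math.log(random.random()) < a2 - a1:
            M = 2
        elif M == 2 and math.log(random.random()) < a1 - a2:
            M = 1
    in_m2 += (M == 2)

print(round(in_m2 / iters, 2))  # should land near 0.77
```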

Now let us introduce a transformation into the algorithm. Under model 2, ξ is a common binomial parameter that replaces the separate binomial probabilities π_{1} and π_{2}. So a natural idea would be to make ξ equal to the average of π_{1} and π_{2}. We could write this as,

Under M1

π_{1} = Ψ1

π_{2} = Ψ2

Under M2

ξ = (Ψ1+Ψ2)/2

u = Ψ2

A word of warning before we continue. In the paper in *The American Statistician* from which I took this problem, the authors use a very similar transformation; in fact, they use a weighted average instead of a simple average. However, as we will see, transformations based on any sort of average cause problems and the algorithm breaks down under extreme conditions. So this is a nice illustration of RJMCMC, partly because it fails and we can discover why it fails. Perhaps you can already see from the formulae for the transformations why we are going to run into difficulties.

A couple of weeks ago we derived the formula for the MH acceptance probability as

[ f(Y|Ψ,M_{k}) f(Ψ|M_{k}) dΨ P(M_{k}) g(M_{k} to M_{j}) ] / [ f(Y|Ψ,M_{j}) f(Ψ|M_{j}) dΨ P(M_{j}) g(M_{j} to M_{k}) ]

Each product in the ratio includes a likelihood, a prior on the parameters, a prior on the model and the probability of selecting that move. The extra term dΨ is present to turn the density of the prior into a probability.

We are going to work in terms of π_{1} and π_{2}, or ξ and u, and so we will need to transform the dΨ's into the small elements dπ_{1} and dπ_{2}, or dξ and du. This means multiplying by the Jacobian formed from derivatives such as dπ_{1}/dΨ1.

In our program we calculate the MH acceptance probabilities with the lines

local a1 = log(binomialp(20,8,`pi1'))+log(binomialp(30,16,`pi2'))+ ///
    log(betaden(5,5,`pi1'))+log(betaden(10,10,`pi2'))+log(0.3)
local a2 = log(binomialp(20,8,`xi'))+log(binomialp(30,16,`xi'))+ ///
    log(betaden(20,20,`xi'))+log(betaden(1,1,`u'))+log(0.7)

The initial transformations both had a Jacobian of 1 because they were based on equalities with derivatives of 1 and, of course, log(1) is zero, so I did not bother to include the Jacobians in the code.

The new transformation for M2 has the form

ξ = (Ψ1+Ψ2)/2

u = Ψ2

So the matrix of derivatives, with entries dξ/dΨ1, dξ/dΨ2, du/dΨ1 and du/dΨ2, has the form

0.5 0.5

0 1

which has determinant 0.5. This is our Jacobian. It says that small increments on the ξ scale are half the size of the corresponding increments on the Ψ1 scale.
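If you want to check the arithmetic, the determinant is immediate; a trivial Python sketch:

```python
# direct check of the 2x2 determinant quoted above
J = [[0.5, 0.5],   # d(xi)/d(psi1), d(xi)/d(psi2)
     [0.0, 1.0]]   # d(u)/d(psi1),  d(u)/d(psi2)
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
print(det)  # 0.5
```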

Noting that our transformation implies that Ψ1=2ξ-u, we can modify our Stata code. The changes are the new calculation of psi1 in the within-model update and the extra log(0.5) Jacobian term in a2.

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if( runiform() < 0.5 ) {
        * update within the current model
        if( `M' == 1 ) {
            local pi1 = rbeta(13,17)
            local pi2 = rbeta(26,24)
            local psi1 = `pi1'
            local psi2 = `pi2'
        }
        else {
            local xi = rbeta(44,46)
            local u = rbeta(1,1)
            local psi1 = 2*`xi'-`u'
            local psi2 = `u'
        }
        matrix `b'[1,2] = `psi1'
        matrix `b'[1,3] = `psi2'
    }
    else {
        * try to switch
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        local pi1 = `psi1'
        local pi2 = `psi2'
        local xi = (`psi1'+`psi2')/2
        local u = `psi2'
        local a1 = log(binomialp(20,8,`pi1'))+log(binomialp(30,16,`pi2'))+ ///
            log(betaden(5,5,`pi1'))+log(betaden(10,10,`pi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`xi'))+log(binomialp(30,16,`xi'))+ ///
            log(betaden(20,20,`xi'))+log(betaden(1,1,`u'))+log(0.7)+log(0.5)
        if `M' == 1 {
            if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
        }
        else {
            if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
        }
    }
end

As I warned earlier, this is almost correct but not quite: it fails to give the correct model posterior and spends too much time in model 1.

To illustrate the breakdown of the algorithm, I ran this program with different priors for u: Beta(25,25), Beta(10,10), Beta(5,5), Beta(3,3), Beta(2,2) and Beta(1,1). Each time I ran the algorithm for 10,000 iterations and repeated the analysis 25 times. The number of iterations is rather on the low side, so we get quite a bit of sampling error between the 25 repeat runs. Here is a boxplot of the estimates of the mean posterior probability for Model 2 plotted on a percentage scale.

Our previous analysis showed that the correct answer is just under 77%. So we get reasonable, but not perfect, estimates when we use Beta(25,25) as our prior for u but the estimates deteriorate markedly when we use Beta(1,1). Yet the theory says that the prior on u is arbitrary.

Beta(1,1) is equivalent to a uniform distribution, so occasionally under model M2 we might generate a value of 0.95 for u. Suppose, at the same time, we generate ξ=0.45; then we will calculate Ψ1=2ξ-u=-0.05 and, when we propose a move to model M1, we will set π_{1}=-0.05 and we are in trouble because π_{1} is a probability and it must be between 0 and 1.

Our choice of making ξ the average would be a sensible transformation if there were no constraints on the parameters but when the parameters must all lie in (0,1), it will sometimes fail. Now when we use Beta(25,25) this problem does not arise very often because there is such a small probability of picking an extreme value for u and so we stay in the region where the elements of Ψ are both between 0 and 1.
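How often does this actually happen? A quick Monte Carlo sketch (mine, not part of the original programs) counts how often the implied Ψ1 = 2ξ-u falls outside (0,1) when ξ is a typical posterior draw and u comes from each of the two extreme priors:

```python
import random

random.seed(0)

def frac_invalid(a, b, reps=100_000):
    # fraction of implied psi1 = 2*xi - u values falling outside (0,1)
    bad = 0
    for _ in range(reps):
        xi = random.betavariate(44, 46)   # typical posterior draw for xi
        u = random.betavariate(a, b)      # draw for the padding parameter u
        bad += not (0 < 2 * xi - u < 1)
    return bad / reps

print(frac_invalid(1, 1))    # a noticeable fraction of invalid values
print(frac_invalid(25, 25))  # almost never invalid
```

With Beta(1,1) a non-trivial share of draws is invalid, while with Beta(25,25) the problem is vanishingly rare, which matches the pattern in the boxplot.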

It is more complicated, but much better, to average on the logit scale and use the transformation

Under M1

π_{1} = invlogit(Ψ1)

π_{2} = invlogit(Ψ2)

Under M2

ξ = invlogit((Ψ1+Ψ2)/2)

u = invlogit(Ψ2)

In this set-up, the elements of Ψ are unconstrained and can take any value between -∞ and ∞ and, whatever those values, they always transform to π_{1} and π_{2} or ξ and u that are between 0 and 1. As before, we need derivatives such as dξ/dΨ1 and dξ/dΨ2 although, in this case, the algebra is marginally easier if we calculate dΨ1/dξ and then use the relationship dξ/dΨ1 = 1/(dΨ1/dξ).

Let’s start with M1 and invert the transformation.

Ψ1 = logit(π_{1}) = log( π_{1}/(1-π_{1}))

Ψ2 = logit(π_{2}) = log( π_{2}/(1-π_{2}))

So dΨ1/dπ_{1} = 1/[π_{1}(1-π_{1})] and inverting this formula we obtain dπ_{1}/dΨ1 = π_{1}(1-π_{1}) which means that the Jacobian matrix is

π_{1}(1-π_{1}) 0

0 π_{2}(1-π_{2})

and the determinant is π_{1}(1-π_{1})π_{2}(1-π_{2})

Similar calculations show that the Jacobian for model 2 is ξ(1-ξ)u(1-u)/2.
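These determinants are easy to verify numerically. The following finite-difference sketch (mine; any language would do) takes the M2 transformation in the form that the update step of the program below inverts, ξ = invlogit((Ψ1+Ψ2)/2) and u = invlogit(Ψ2):

```python
# Finite-difference check of the two Jacobian determinants:
# pi1(1-pi1)pi2(1-pi2) for M1 and xi(1-xi)u(1-u)/2 for M2.
import math

def invlogit(x):
    return 1 / (1 + math.exp(-x))

def num_det(f, p1, p2, h=1e-6):
    # numerical determinant of the 2x2 Jacobian of f at (p1, p2)
    d1 = [(f(p1 + h, p2)[i] - f(p1 - h, p2)[i]) / (2 * h) for i in (0, 1)]
    d2 = [(f(p1, p2 + h)[i] - f(p1, p2 - h)[i]) / (2 * h) for i in (0, 1)]
    return d1[0] * d2[1] - d1[1] * d2[0]

m1 = lambda p1, p2: (invlogit(p1), invlogit(p2))
m2 = lambda p1, p2: (invlogit((p1 + p2) / 2), invlogit(p2))

p1, p2 = 0.3, -0.8                        # an arbitrary test point
pi1, pi2 = invlogit(p1), invlogit(p2)
xi, u = invlogit((p1 + p2) / 2), invlogit(p2)

print(abs(num_det(m1, p1, p2) - pi1*(1-pi1)*pi2*(1-pi2)) < 1e-6)  # True
print(abs(num_det(m2, p1, p2) - xi*(1-xi)*u*(1-u)/2) < 1e-6)      # True
```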

Now we can write a much better RJMCMC program.

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if( runiform() < 0.5 ) {
        * update within the current model
        if( `M' == 1 ) {
            local pi1 = rbeta(13,17)
            local pi2 = rbeta(26,24)
            local psi1 = logit(`pi1')
            local psi2 = logit(`pi2')
        }
        else {
            local xi = rbeta(44,46)
            local u = rbeta(1,1)
            local psi1 = 2*logit(`xi')-logit(`u')
            local psi2 = logit(`u')
        }
        matrix `b'[1,2] = `psi1'
        matrix `b'[1,3] = `psi2'
    }
    else {
        * try to switch
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        local pi1 = invlogit(`psi1')
        local pi2 = invlogit(`psi2')
        local xi = invlogit((`psi1'+`psi2')/2)
        local u = invlogit(`psi2')
        local J1 = `pi1'*(1-`pi1')*`pi2'*(1-`pi2')
        local J2 = `xi'*(1-`xi')*`u'*(1-`u')/2
        local a1 = log(binomialp(20,8,`pi1'))+log(binomialp(30,16,`pi2'))+ ///
            log(betaden(5,5,`pi1'))+log(betaden(10,10,`pi2'))+log(0.3)+log(`J1')
        local a2 = log(binomialp(20,8,`xi'))+log(binomialp(30,16,`xi'))+ ///
            log(betaden(20,20,`xi'))+log(betaden(1,1,`u'))+log(0.7)+log(`J2')
        if `M' == 1 {
            if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
        }
        else {
            if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
        }
    }
end

I ran this program for the same range of priors on u and obtained the following results.

It is clear that the mean estimate does not depend on the prior for u, but the variation between repeat runs is larger when we use Beta(1,1) because we make fewer moves between models.

Here is a boxplot of the numbers of model switches per run of 10,000 iterations. Beta(10,10) looks best.

Before finishing I should point out that Peter Green’s original version of RJMCMC is very slightly different to the formulation that I have presented.

Suppose that we have a large set of alternative models that can require anything between 1 and 20 parameters. We are currently in a model that requires 3 parameters and we propose a move to a model with 5 parameters. We do not need to worry about the 15 parameters that neither model requires. In other words instead of defining a full set of parameters, Ψ, we can work with the reduced set that covers the two models that we are currently comparing. This is more efficient but it requires slightly more complicated programming. In my opinion, it is both easier to follow and easier to code if we always return to the full set of parameters.

I think that we have done this example to death. Now we need to move on to more complex applications such as the flu epidemic analysis that started us on the RJMCMC path.

We have two binomial samples y_{1}=8, n_{1}=20 and y_{2}=16, n_{2}=30, but we are not sure whether to model them as,

M1: binomials with different probabilities π_{1} and π_{2}

M2: binomials with a common probability ξ.

We will need priors. In their paper, Barker and Link use identical vague priors that simply cancel out of the equations, so you cannot see exactly what is going on. I prefer to set different priors and I’ll arbitrarily choose, p(M1)=0.3 and p(M2)=0.7. Under M1 my priors on the parameters will be p(π_{1}|M1)=Beta(5,5) and p(π_{2}|M1)=Beta(10,10) and under M2, p(ξ|M2)=Beta(20,20).

Before setting up the RJMCMC program let us analyse the two models separately.

Here is a standard MH program for analysing model M1. It uses the truncated MH sampler, **mhstrnc**, to ensure that all proposals for π_{1} and π_{2} are between 0 and 1. I have lazily used the function **logdensity**; this is not very efficient, but the program is so simple that speed is not an issue.

program logpost
    args lnf b
    local pi1 = `b'[1,1]
    local pi2 = `b'[1,2]
    scalar `lnf' = 0
    logdensity binomial `lnf' 8 20 `pi1'
    logdensity binomial `lnf' 16 30 `pi2'
    logdensity beta `lnf' `pi1' 5 5
    logdensity beta `lnf' `pi2' 10 10
end

matrix b = (0.5, 0.5)
mcmcrun logpost b using temp.csv , replace ///
    param(pi1 pi2) burn(500) update(5000) ///
    samplers( 2(mhstrnc , sd(0.5) lb(0) ub(1) ) )
insheet using temp.csv, clear
mcmcstats *

Here is a summary of the chain,

-------------------------------------------------------
Parameter     n     mean     sd     sem   median      95% CrI
-------------------------------------------------------
pi1        5000    0.431  0.088  0.0030    0.428  ( 0.260, 0.615 )
pi2        5000    0.521  0.071  0.0022    0.522  ( 0.376, 0.656 )
-------------------------------------------------------

Mixing is fine so I will not bother with the graphics.

Probably you know that the beta prior is conjugate to the binomial probability so it can be shown theoretically that the posterior will also have a beta distribution. When the data are y out of n and the prior on p is Beta(a,b) then the posterior on p will be Beta(a+y,b+n-y).
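In Python notation (a throwaway sketch, just to fix ideas), the rule is simply:

```python
# conjugate beta-binomial update: Beta(a,b) prior + y successes out of n
def beta_update(a, b, y, n):
    return a + y, b + n - y

print(beta_update(5, 5, 8, 20))     # (13, 17) for pi1
print(beta_update(10, 10, 16, 30))  # (26, 24) for pi2
```

These are exactly the Beta(13,17) and Beta(26,24) distributions drawn from in the Gibbs sampler below.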

We could take advantage of this and program our own Gibbs sampler. This new algorithm ought to converge more quickly than the Metropolis-Hastings algorithm and run faster too.

program myGibbs
    args logpost b ipar
    matrix `b'[1,1] = rbeta(13,17)
    matrix `b'[1,2] = rbeta(26,24)
end

matrix b = (0.5, 0.5)
mcmcrun logpost b using temp.csv , replace ///
    param(pi1 pi2) burn(500) update(5000) ///
    samplers( myGibbs , dim(2) )
insheet using temp.csv, clear
mcmcstats *

Now the results are,

-------------------------------------------------------
Parameter     n     mean     sd     sem   median      95% CrI
-------------------------------------------------------
pi1        5000    0.433  0.088  0.0012    0.431  ( 0.270, 0.612 )
pi2        5000    0.520  0.070  0.0010    0.522  ( 0.380, 0.654 )
-------------------------------------------------------

As expected we have the same answer but we have reduced the sampling error as measured by the MCMC standard error (sem).

We can even calculate the true posterior mean since Wikipedia tells us that a Beta(a,b) distribution has a mean a/(a+b), so a Beta(13,17) should have a mean 0.433 and a Beta(26,24) should have a mean 0.520, just as we found.

We can do the same for model 2 although I will not present all of the details as they are so similar. If y_{1}=8, n_{1}=20 and y_{2}=16, n_{2}=30 have a common binomial probability with prior, Beta(20,20), the posterior will be Beta(8+16+20,12+14+20)=Beta(44,46) so the posterior mean is 0.489.

Now we will set up a chain that moves between the two models. When it is in M1 the chain will estimate (π_{1}, π_{2}) and should find exactly the same posterior that we have already derived. When the chain is in model 2 it should estimate ξ and again get the same estimate as before. The extra information will come about because the algorithm will decide whether to spend its time in M1 or M2 and it will do so in such a way that the time will be proportional to the model’s posterior probability.

Model M1 has 2 parameters and model M2 has 1 parameter. So in the notation that I set up last week, the maximal set of parameters is, Ψ = (Ψ1, Ψ2) and depending on the choice of model, we need to convert (Ψ1, Ψ2) into (π_{1}, π_{2}) or (Ψ1, Ψ2) into (ξ,u) where u is an arbitrary, unused parameter.

Now we need to know how to move between models.

There is no unique way to jump between the models, but as long as the jumps are 1:1 and we do not enter any cul-de-sacs, we can do what we like, although daft choices will lead to poor mixing and a very long chain.

For instance, if under M1 we currently have estimates (0.5,0.3), it would be odd to propose a move to M2 with ξ=0.9. It would mean that we think that the separate probabilities are 0.5 and 0.3 and yet we try a move to a model with a common probability that is almost 1. Obviously such a move would rarely be accepted and the chain would take a long time to converge. However, it would eventually move and it would eventually recover, though 'eventually' might mean millions of iterations.

I’m going to start with the simplest possible proposal,

M1: π_{1} = Ψ1 and π_{2}= Ψ2

M2: ξ = Ψ1 and u=Ψ2

Last time I derived the form of the Metropolis-Hastings algorithm and mentioned that we need to be aware of the problem of parameter transformation and the way that transformation affects the increments dΨ in the Metropolis-Hastings acceptance probability. Here both of the sets of parameters are derived, untransformed, from Ψ. Depending on how you choose to think about it, there is either no transformation of parameters or the transformation is based on simple equality, so either the dΨ cancel directly, or both of the Jacobians are equal to 1 and can be omitted.

This leads us to our first RJMCMC program

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if( `M' == 1 ) {
        * currently in model 1: update
        matrix `b'[1,2] = rbeta(13,17)
        matrix `b'[1,3] = rbeta(26,24)
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        * try switch to M=2
        local a1 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi2'))+ ///
            log(betaden(5,5,`psi1'))+log(betaden(10,10,`psi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi1'))+ ///
            log(betaden(20,20,`psi1'))+log(betaden(26,24,`psi2'))+log(0.7)
        if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
    }
    else {
        * currently in model 2: update
        matrix `b'[1,2] = rbeta(44,46)
        matrix `b'[1,3] = rbeta(26,24)
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        * try switch to M=1
        local a1 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi2'))+ ///
            log(betaden(5,5,`psi1'))+log(betaden(10,10,`psi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi1'))+ ///
            log(betaden(20,20,`psi1'))+log(betaden(26,24,`psi2'))+log(0.7)
        if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
    }
end

matrix b = (1, 0.5, 0.5)
mcmcrun logpost b using temp.csv , replace ///
    param(m psi1 psi2) burn(500) update(5000) ///
    samplers( myRJMCMC , dim(3) )
insheet using temp.csv, clear
tabulate m
mcmcstats *
mcmcstats * if m == 1
mcmcstats * if m == 2

The vector of parameters, b, has been expanded to contain an indicator of the model as well as Ψ.

Let’s note a few of the features of the program.

(a) I have programmed for clarity rather than for efficiency, for instance, I copy the parameter values into locals which strictly I do not need to do.

(b) First I find out which model the chain is currently in. Then I update the parameter estimates just as I did when I was analysing the models separately. Finally I decide whether to stay with the current model or to switch and I do this with a Metropolis-Hastings step.

(c) I calculate the log of the acceptance ratio because I want to avoid loss of precision due to very small probabilities.

(d) Each part of the acceptance ratio involves the likelihood, the prior on Ψ for that model and the prior on the model. Often there would also be a Jacobian but as noted already, the Jacobians for our moves are both 1.

(e) The proposal probabilities for the moves (g() in the notation that I set up last time) are omitted from the acceptance ratio because they cancel. Since there are only two models, we propose a move to M2 when in M1 and a move to M1 when in M2 with equal probability.

(f) Under model 2 I still have to update u even though it does not occur in the model’s likelihood. Here I have chosen to use the same distribution that I used for π_{2} under model 1, but this is only for convenience. The choice is arbitrary.

Here are the results

          m |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,165       23.30       23.30
          2 |      3,835       76.70      100.00
------------+-----------------------------------
      Total |      5,000      100.00

So we spend about 77% of the time in model 2, which means that the posterior probability of model 2 is 0.77. Our prior was 0.7 so the data have pushed us slightly further in the direction of model 2.

We can even calculate the Bayes Factor. The prior ratio of model probabilities is 0.7/0.3=2.33 and the posterior ratio is 0.77/0.23=3.35 so the Bayes Factor is 3.35/2.33=1.44. Not a very impressive Bayes Factor since it suggests that the data only push us weakly in the direction of Model 2; most of our preference for model 2 comes from the prior.
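With conjugate priors we can also get the Bayes Factor exactly, because each model's marginal likelihood is available in closed form. A Python sketch of the calculation (mine, not part of the Stata code; the binomial coefficients are common to both models and cancel in the ratio):

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    # log of the beta function B(a,b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

y1, n1, y2, n2 = 8, 20, 16, 30

# log marginal likelihoods, dropping the shared binomial coefficients
log_m1 = (log_beta_fn(5 + y1, 5 + n1 - y1) - log_beta_fn(5, 5)
          + log_beta_fn(10 + y2, 10 + n2 - y2) - log_beta_fn(10, 10))
log_m2 = (log_beta_fn(20 + y1 + y2, 20 + (n1 - y1) + (n2 - y2))
          - log_beta_fn(20, 20))

bf = exp(log_m2 - log_m1)              # Bayes Factor in favour of M2
post_m2 = 0.7 * bf / (0.3 + 0.7 * bf)  # posterior probability of M2
print(round(bf, 2), round(post_m2, 2))
```

The exact answer is a Bayes Factor of about 1.4 and a posterior model probability just over 0.76, in good agreement with the chain.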

The Stata command

. count if m != m[_n-1]

tells us that we switched models 1785 times in the chain of 5,000 iterations. The ideal for MH algorithms is to move around 30%-50% of the time, so this represents good mixing.

If we extract the results for the time that the chain spent in Model 1, we get

----------------------------------------------------------------------------
Parameter     n     mean     sd     sem   median      95% CrI
----------------------------------------------------------------------------
m          1165    1.000  0.000       .    1.000  ( 1.000, 1.000 )
psi1       1165    0.435  0.091  0.0028    0.436  ( 0.262, 0.618 )
psi2       1165    0.519  0.073  0.0021    0.518  ( 0.371, 0.657 )
----------------------------------------------------------------------------

The parameter estimates are very similar to the values that we calculated before (0.433 and 0.520).

Extracting the values for Model 2 we get

    ----------------------------------------------------------------------------
    Parameter      n      mean       sd      sem    median          95% CrI
    ----------------------------------------------------------------------------
    m           3835     2.000    0.000        .     2.000  (  2.000,  2.000 )
    psi1        3835     0.489    0.052   0.0009     0.490  (  0.387,  0.591 )
    psi2        3835     0.521    0.070   0.0011     0.523  (  0.379,  0.654 )
    ----------------------------------------------------------------------------

Once again the estimate of the common probability is very close to the theoretical mean (0.489). The second parameter is really u and should be ignored.

Basically that is RJMCMC, but let us try some variations.
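For anyone who wants to experiment outside Stata, the sampler can be translated into a few lines of Python. This is only an illustrative sketch of the original program (the version in which u is drawn from Beta(26,24)), using just the standard library, not the code that produced the results above:

```python
import math, random

def log_binom(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_beta(x, a, b):
    """Log of the Beta(a,b) density at x."""
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

random.seed(1)
m, n_iter, in_m2 = 1, 5000, 0
for _ in range(n_iter):
    if m == 1:
        psi1 = random.betavariate(13, 17)   # Beta(5,5) prior + 8/20 successes
        psi2 = random.betavariate(26, 24)   # Beta(10,10) prior + 16/30 successes
    else:
        psi1 = random.betavariate(44, 46)   # Beta(20,20) prior + 24/50 successes
        psi2 = random.betavariate(26, 24)   # u, drawn from its arbitrary proposal
    # log of likelihood x prior x model probability (x proposal density for u);
    # the Jacobians of both moves are 1, as noted in the text
    a1 = (log_binom(8, 20, psi1) + log_binom(16, 30, psi2)
          + log_beta(psi1, 5, 5) + log_beta(psi2, 10, 10) + math.log(0.3))
    a2 = (log_binom(8, 20, psi1) + log_binom(16, 30, psi1)
          + log_beta(psi1, 20, 20) + log_beta(psi2, 26, 24) + math.log(0.7))
    if m == 1 and math.log(random.random()) < a2 - a1:
        m = 2
    elif m == 2 and math.log(random.random()) < a1 - a2:
        m = 1
    in_m2 += (m == 2)

print(round(in_m2 / n_iter, 2))   # proportion of time in model 2; about 0.77
```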

First, I said that the updating of u is arbitrary, so I will change from Beta(26,24) to Beta(1,1), which is equivalent to using a uniform distribution. The two densities look very different, but since the distribution is only used for u, the change may affect the mixing but should not alter the posteriors.

The code is almost the same; the only changes are the lines that draw u and evaluate its density, which now use Beta(1,1).

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if ( `M' == 1 ) {
        matrix `b'[1,2] = rbeta(13,17)
        matrix `b'[1,3] = rbeta(26,24)
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        * try switch to M=2
        local a1 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi2'))+ ///
            log(betaden(5,5,`psi1'))+log(betaden(10,10,`psi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi1'))+ ///
            log(betaden(20,20,`psi1'))+log(betaden(1,1,`psi2'))+log(0.7)
        if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
    }
    else {
        matrix `b'[1,2] = rbeta(44,46)
        matrix `b'[1,3] = rbeta(1,1)
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        * try switch to M=1
        local a1 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi2'))+ ///
            log(betaden(5,5,`psi1'))+log(betaden(10,10,`psi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi1'))+ ///
            log(betaden(20,20,`psi1'))+log(betaden(1,1,`psi2'))+log(0.7)
        if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
    }
end

The results are so similar that I will not even bother to show them. The mixing is slightly worse with a tendency to stay longer in the same model before switching, but even that effect is minor.

At present the algorithm always updates the parameters of the current model and then always attempts to switch models, but we do not need to follow this rigid pattern. For instance, we could randomly either update the current model or try to switch, and we could vary the probabilities of updates and attempted model changes to try to improve the overall mixing.

Here is a program that updates the current model 3/4 of the time and attempts a move between models 1/4 of the time.

program myRJMCMC
    args logpost b ipar
    local M = `b'[1,1]
    if ( runiform() < 0.75 ) {
        * update within the current model
        if ( `M' == 1 ) {
            matrix `b'[1,2] = rbeta(13,17)
            matrix `b'[1,3] = rbeta(26,24)
        }
        else {
            matrix `b'[1,2] = rbeta(44,46)
            matrix `b'[1,3] = rbeta(1,1)
        }
    }
    else {
        * try to switch
        local psi1 = `b'[1,2]
        local psi2 = `b'[1,3]
        local a1 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi2'))+ ///
            log(betaden(5,5,`psi1'))+log(betaden(10,10,`psi2'))+log(0.3)
        local a2 = log(binomialp(20,8,`psi1'))+log(binomialp(30,16,`psi1'))+ ///
            log(betaden(20,20,`psi1'))+log(betaden(1,1,`psi2'))+log(0.7)
        if `M' == 1 {
            if log(runiform()) < `a2' - `a1' matrix `b'[1,1] = 2
        }
        else {
            if log(runiform()) < `a1' - `a2' matrix `b'[1,1] = 1
        }
    }
end

Once again the results are so similar that it is not worth showing them. The only noticeable difference is that the mixing is worse. This time we make around 400 model switches every 5000 iterations instead of the nearly 1800 that we achieved with the first program. As a result the estimate of P(M2|Y) is not as good and we ought to run the chain for longer.

Perhaps the main conclusion is, *this is such a simple problem that virtually everything that we try will work perfectly well*.

There is one further complication to put into the algorithm and that is to introduce a parameter transformation that will require Jacobians that are not equal to 1. Since this posting is already quite long and because the Jacobian is the element that is most likely to cause an error, I will cover this next time.

A few weeks ago I looked at the Google Flu Trends data and fitted a Bayesian model that identifies the start and end of the annual flu epidemic. A limitation of that model is that it must have exactly one epidemic every year. This is unrealistic as there is a clear suggestion that in some years we do not see a flu epidemic at all and in other years there appear to be two distinct outbreaks.

These patterns can be seen in the data for Austria.

When we allowed exactly one epidemic per year, the Poisson model for the number of cases per week had an expected number, mu, where,

log(mu) = a + b time/1000 + c sin(0.060415*week)+Σ H_{i}δ(year==i & week>=S_{i} & week<=E_{i})

and (H_{i},S_{i},E_{i}) represent the height, start and end of the single epidemic in year i and we sum over the different years.

What I want to do now is to develop a model that allows the number of epidemics per year to vary. Suppose that we were to generalize so that there could be 0, 1 or 2 epidemics in any year. We have 12 years of data from Austria, so there will be 3^{12}=531,441 possible models, each with a unique combination of 0, 1 or 2 epidemics in the different years.

Since we cannot be certain which of the 531,441 models will apply, we need to develop a Bayesian analysis that will select the most appropriate model as part of the fitting process and this means that our MCMC chain must move freely between the models. The main obstacle to developing such an algorithm is that, as we move from one model to another, so the number of parameters will change with the number of epidemics. We will need to develop a version of the Metropolis-Hastings algorithm that can cope with this problem.

The most commonly used algorithm for this type of Bayesian model selection is called reversible jump Markov chain Monte Carlo or RJMCMC.

This week I want to outline the theory behind RJMCMC and then next week I’ll take a simple example and program it in Stata. Eventually, I want to return to the Austrian flu data and see if I can produce a Stata program to fit a Bayesian model to those data. I must admit that I have not tried this yet and I suspect that selection from half-a-million possible models might prove quite a challenge.

When we allow up to 2 epidemics per year, the Poisson model will have three parameters that control the pattern outside of an epidemic, namely, the intercept, the coefficient of the linear time trend and the coefficient of the sine term. On top of these, there will be a further three parameters for each epidemic (start, end and height). So, in the extreme case, if there were 2 epidemics in every one of the 12 years, we would require 3+12*2*3=75 parameters. Let’s imagine a vector Ψ that contains all 75.

For a particular model we might have one epidemic during most years but occasionally no epidemic or two epidemics, so often we will need some of the 75 parameters and not others. Let’s suppose that for model M_{k} we will need parameters θ_{k} and we will not need parameters u_{k}, where θ_{k} and u_{k} together make up Ψ.

Now imagine that the chain is currently in Model k and we are considering a move to Model j, i.e. from parameters (θ_{k},u_{k}) to (θ_{j},u_{j}).

Our aim is to generate a chain in which we visit model M_{k} with the posterior probability P(M_{k}|Y,Ψ), where Ψ is the current value of the parameters. Averaging over the whole chain will be like integrating over Ψ and will mean that the chain is in model k with probability P(M_{k}|Y).

If the chain really does have model probabilities P(M_{k}|Y,Ψ) and it is constructed by movements between models that have transition probabilities such as P(move M_{j} to M_{k}) then for the chain to stay converged we will require that,

Sum_{j} P(M_{j}|Y,Ψ) P(move M_{j} to M_{k}) = P(M_{k}|Y,Ψ)

which, in words, means that the total chance of moving into Model k must equal the probability of Model k.

One way to ensure that this property holds is to have **detailed balance**, which says that, the probability of being in Model k and moving to Model j must equal the chance of being in Model j and moving to Model k.

P( M_{k}|Y, Ψ) P(move M_{k} to M_{j}) = P( M_{j}|Y, Ψ) P(move M_{j} to M_{k})

If detailed balance holds then,

Sum_{j} P(M_{j}|Y,Ψ) P(move M_{j} to M_{k}) = Sum_{j} P( M_{k}|Y, Ψ) P(move M_{k} to M_{j})

and as the sum of the probabilities of all of the allowed moves from Model k must be one,

Sum_{j} P(M_{j}|Y,Ψ) P(move M_{j} to M_{k}) = P( M_{k}|Y, Ψ) Sum_{j} P(move M_{k} to M_{j}) = P(M_{k}|Y, Ψ)

which is just what we need.
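A tiny numerical check makes this concrete. The following sketch (my own illustration, not part of the original argument) builds the transition matrix for a two-model chain with Metropolis-Hastings acceptance probabilities and confirms that the target model probabilities are stationary:

```python
# Two models with assumed posterior probabilities 0.3 and 0.7; we always
# propose the other model, accept with the Metropolis-Hastings probability,
# and check that the target probabilities are left unchanged.
p = [0.3, 0.7]                      # assumed P(M_k | Y, psi)
g = [[0.0, 1.0], [1.0, 0.0]]        # proposal: always suggest the other model

# acceptance a(i -> j) = min{ p_j g(j -> i) / (p_i g(i -> j)), 1 }
a = [[min(1.0, p[j] * g[j][i] / (p[i] * g[i][j])) if i != j else 0.0
      for j in range(2)] for i in range(2)]

# transition probabilities: move with probability g*a, otherwise stay put
T = [[g[i][j] * a[i][j] if i != j
      else 1.0 - sum(g[i][k] * a[i][k] for k in range(2) if k != i)
      for j in range(2)] for i in range(2)]

# stationarity: applying one step of the chain to p returns p
new_p = [sum(p[i] * T[i][j] for i in range(2)) for j in range(2)]
print([round(x, 10) for x in new_p])   # -> [0.3, 0.7]
```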

In Metropolis-Hastings algorithms, movement between models is a two-stage process consisting of a proposal and its acceptance or rejection. So detailed balance becomes,

P( M_{k}|Y, Ψ) g(M_{k} to M_{j}) a(M_{k} to M_{j}) = P( M_{j}|Y, Ψ) g(M_{j} to M_{k}) a(M_{j} to M_{k})

where I have used g to represent the proposal probability and a to represent the acceptance probability.

This tells us that, at any point in the chain, the ratio of the acceptance probabilities a(M_{k} to M_{j})/ a(M_{j} to M_{k}) must equal

P(M_{j}|Y, Ψ) g(M_{j} to M_{k}) / P(M_{k}|Y, Ψ) g(M_{k} to M_{j})

In essence this is the Metropolis-Hastings algorithm for movement between Models. We calculate this ratio based on the posterior P(M_{k}|Y, Ψ) and our personal choice of movement probabilities g(). Then any pair of acceptance probabilities that have this ratio will create a chain that converges to the correct posterior, so such a chain will select the different models with frequencies that are proportional to the posterior model probabilities.

Since we want to encourage moves in order to obtain good mixing, it makes sense to make the larger of the two acceptance probabilities equal to one. For instance, if the larger is a(M_{j} to M_{k}) then we set it to 1 and make,

a(M_{k} to M_{j}) = P(M_{j}|Y, Ψ) g(M_{j} to M_{k}) / P(M_{k}|Y, Ψ) g(M_{k} to M_{j})

and if a(M_{k} to M_{j}) is the larger then we set it to 1 and make,

a(M_{j} to M_{k}) = P(M_{k}|Y, Ψ) g(M_{k} to M_{j}) / P(M_{j}|Y, Ψ) g(M_{j} to M_{k})

This can be summarised as

a(M_{j} to M_{k}) = min { P( M_{k}|Y, Ψ) g(M_{k} to M_{j}) / P( M_{j}|Y, Ψ) g(M_{j} to M_{k}) , 1}

which is how people usually write the M-H acceptance probability.

All we need to do now is to decide how to calculate P(M_{k}|Y, Ψ). Bayes theorem tells us that

P(M_{k}|Y,Ψ) ∝ P(Y|Ψ,M_{k}) P(M_{k}|Ψ) ∝ P(Y|Ψ,M_{k}) P(Ψ|M_{k}) P(M_{k})

So

a(M_{j} to M_{k}) = min { P(Y|Ψ,M_{k}) P(Ψ|M_{k}) P(M_{k}) g(M_{k} to M_{j}) / P(Y|Ψ,M_{j}) P(Ψ|M_{j}) P(M_{j}) g(M_{j} to M_{k}) , 1}

This looks like a standard MH algorithm, so where is the problem? In fact there are three hidden issues.

(a) we do not use the whole of Ψ in our model specification but only a subset of it. Remember that Ψ=(θ,u) and the split will not be the same for M_{k} and M_{j}.

(b) the algebraic forms of the models M_{k} and M_{j} may be different (e.g. a gamma and a log-normal), in which case the constants in the formulae may not cancel as they do in most MH calculations.

(c) Bayes theorem applies to probabilities, P, but our models are usually expressed as probability densities, f. Since the area under f defines probability, you might say that P(Ψ|M_{k}) = f(Ψ|M_{k}) dΨ where dΨ defines a tiny region with the same dimension as Ψ. So the acceptance probabilities ought to be written as

a(M_{j} to M_{k}) = min { f(Y|Ψ,M_{k})dY f(Ψ|M_{k}) dΨ P(M_{k}) g(M_{k} to M_{j}) / f(Y|Ψ,M_{j})dY f(Ψ|M_{j}) dΨ P(M_{j}) g(M_{j} to M_{k}) , 1}

Notice that there is no small increment associated with M_{k} because the models are discrete and are described directly by probabilities and the delta for Y will cancel as the data are the same whichever model we use.

The reason that we do not usually bother with the dΨ is that, in many circumstances, it too cancels. Unfortunately we have to be very careful over this cancellation when the parameterizations of the two models are different. In that case, the ratio of the deltas becomes the Jacobian of the transformation from one set of parameters to the other.

We will sort out the specifics of this calculation when we consider particular examples.

One question that we have not yet addressed is how we choose the moves; a choice that is equivalent to defining the function g(). As with any MH algorithm this choice is arbitrary provided that a possible move from M_{k} to M_{j} is matched by a possible move from M_{j} to M_{k} and we do not allow cul-de-sacs from where it is impossible to reach some of the other models. So the algorithm will work for a more or less arbitrary g() but some choices will lead to very poorly mixing chains in which we stay with the same model for a long time before we switch, while other types of proposals will lead to frequent moves and better mixing. Part of the skill of using RJMCMC is to come up with an efficient strategy for proposing moves between models.

In the case of the epidemic models, we could have a scheme that proposes a move from M_{k} to a random choice among the 531,440 possible alternative models but such an algorithm would be incredibly difficult to program and would lead to a lot of rejections, so it would be hugely inefficient. An algorithm in which we make local moves is likely to perform much better. Perhaps, at any stage in the chain, we could select a year at random and either propose that we add one epidemic or drop one epidemic. Once again this is best illustrated in the context of a specific example.
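Such a local move might look like the following hypothetical Python sketch. The names are my own and a full implementation would also need the matching proposal probabilities g() at the boundary counts of 0 and 2 epidemics:

```python
import random

# Local move: pick a year at random and propose adding or dropping one
# epidemic, keeping the count per year within the allowed range 0-2.
def propose(epidemics_per_year):
    proposal = epidemics_per_year.copy()
    year = random.randrange(len(proposal))
    step = random.choice([-1, 1])
    if 0 <= proposal[year] + step <= 2:   # otherwise propose no change
        proposal[year] += step
    return proposal

random.seed(0)
print(propose([1] * 12))   # one year's count nudged up or down by one
```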

So next week I will take an example, fill in the details and then program the algorithm in Stata.

I’ll finish this week with a reference. The original idea of RJMCMC was developed by Peter Green in 1995 (*Biometrika*, *82*(4), 711-732). That paper is surprisingly readable for *Biometrika* but it is probably not the best introduction to the topic. My suggestion is to start with the simplified approach described in,

Barker, R. J., & Link, W. A. (2013). Bayesian multimodel inference by RJMCMC: a Gibbs sampling approach. *The American Statistician*, *67*(3), 150-156.

Dear Allstat Member

The Royal Statistical Society is delighted to announce that Bradley Efron will present his recently published paper at our next Journal webinar on Wednesday 21st October at 4pm (UK time). We are also very pleased to welcome Andrew Gelman, as discussant for this event, and Peter Diggle, as chair, to what we are sure will be an interesting and informative event.

Journal webinars are sponsored by Quintiles and are free and open to everyone.

From our prestigious Series B journal, we have selected this important, thought-provoking paper:

Paper: ‘Frequentist accuracy of Bayesian estimates’

Author and presenter: Bradley Efron, Max H Stein Professor of Humanities and Sciences, Professor of Statistics at Stanford University

Discussant: Andrew Gelman, Professor of Statistics and Political Science and Director of the Applied Statistics Center, Columbia University

Chair: Peter Diggle, President of the Royal Statistical Society and Distinguished University Professor, CHICAS, Lancaster University Medical School

The paper was published in June in the Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol 77 (3), 617-646.

It will be free to access from early until late October.

Webinars are simple and free to join. Just dial in and join us on 21st October to listen to the author’s presentation and the ensuing discussion.

You can ask the author a question over the phone or simply issue a message using the web based teleconference system.

Questions can also be emailed in advance and further information can be requested from journalwebinar@rss.org.uk.

More details about this journal webinar can be found in StatsLife and on the RSS website.

Those unable to listen in live can listen to the podcast and view slides from the presentation afterwards on the main RSS website.

The recording will also be posted on YouTube.

To recap, we were analysing the weekly data from Chile.

In the graph you can see that the number of cases rises slightly during the winter and falls back to zero in the summer. On top of that cyclical pattern there are winter spikes of varying intensity and duration that represent the flu epidemics. The model for these data assumes that the number of cases follows a Poisson distribution and that each year the log of the expected number of cases, mu, is made up of a sine wave plus a random jump, H_{i}, in the level that has a starting week, S_{i}, and a finishing week, E_{i}.

log(mu) = a + b time/1000 + c sin(0.060415*week) + H_{i}δ(year==i & week>=S_{i} & week<=E_{i})

The model was quite successful at capturing the epidemics in Chile so now we will look at the data for another country. I’ve chosen Belgium because its epidemics are fairly clear, just as they are in Chile, but also because year 3 in Belgium shows the suggestion of a second peak that occurs prior to the main flu epidemic.

We can fit the same Bayesian model to these data using the code that I gave before, provided that we adjust the definitions of weeks and years to reflect the fact that Belgium is in the Northern Hemisphere and its summer minimum occurs in the middle of a conventional year. To make life easier for myself, I have defined my years to start on the 1st of July.

With this modification we can fit the model exactly as before. The plot below shows the same data together with a red line representing the model fit created from the posterior means for each parameter. On the whole the Bayesian algorithm has again performed well.

However, look carefully at year 3. The algorithm has picked up the minor peak and failed to find the larger one. This suggests that we have two problems.

Firstly, we need to improve the mixing so that the chain can move between peaks within the same year. This would mean trying bigger jumps in the location of the epidemic with the start and end times both moving together. To achieve this it might be easier to parameterize in terms of the start and the length of the epidemic rather than the start and the end.

The second problem is even more fundamental. Some years appear to have two genuine epidemics and our model only allows for one. Look at the data for Austria. It appears that there are often two spikes in a year, one small and one large.

One might argue whether or not the second peak reflects a pattern in influenza or a pattern in internet use, but our job is to explain these data and at present we are not able to represent this pattern.

There are various ways of adapting the model to allow for a second peak. We might, for example, fit a model that has 2 epidemics every year and rely on the size of the second spike to be small in years with only one epidemic.

Such a model would be a little untidy because the position of the non-existent epidemics would be very poorly defined and mixing would be a problem. What is more, when we came to analyse the results we would have to pick an arbitrary spike height and declare spikes smaller than that to be non-existent or false epidemics. The choice of this cut-off value would be very difficult to justify. It just does not feel Bayesian.

A better solution would be to allow the number of epidemics each year to be a parameter in the model that we could estimate as part of the chain. Then the analysis would control whether or not a small spike stays in the model. Small spikes might come and go in the chain, and the proportion of the time that two spikes are present would give us an estimate of the probability that there were two real epidemics that year.

This approach would be neater but it creates a problem. The MCMC chain will need to jump between solutions that have different numbers of epidemics and, as each epidemic has a height, a start and an end, it follows that whole sets of parameters will drop out of the model when we jump to a solution with a smaller number of spikes and those sets of parameters will re-appear when we jump to a solution with extra spikes.

Rather than think of this situation as a single model in which the number of parameters changes, it is simpler to think of it as a large set of models each with a fixed number of parameters. Our job is both to select the best model and to estimate its parameters, although as Bayesians we will acknowledge uncertainty over which model is actually the best.

We have 12 years of data from Belgium and each year can have 0, 1 or 2 epidemics, so we are left to choose between 3^{12}=531,441 different models. Each epidemic has a start, an end and a height, so the model without any epidemics will have 0 spike parameters and at the other extreme the model with 2 epidemics every year will have 12*3*2=72 spike parameters. Somehow, we must jump randomly between the 531,441 models, changing the parameters as we move.

The most commonly used Bayesian algorithm for moving between models with different parameterizations is called Reversible Jump Markov chain Monte Carlo (RJMCMC). This algorithm is very important in statistics because it is needed for a wide range of problems in which the exact form of the model is not known in advance.

Think of all of the regression models that you have fitted in which you did not know at the outset which covariates needed to be included.

When I was writing the book, *Bayesian Analysis with Stata*, I considered putting in a chapter on RJMCMC but in the end decided that it would take too much space to cover the topic properly, so it got left out. Now we need it.


The plot below shows the estimates of the number of flu cases each week for Chile. The vertical lines denote periods of 52 weeks. As Chile is in the southern hemisphere, the winter peak in flu cases occurs around the middle of the year.

The pattern seems to be an annual cycle that is almost zero at the start and end of the year, on to which is imposed an annual spike corresponding to a flu epidemic. These spikes are of varying severity and duration. Perhaps there is also a drift upwards in the total number of estimated cases per year, which I suppose to be a result of increased internet use rather than a genuine increase in the number of flu cases, though the latter is not impossible given a growing population and increased urbanisation.

Anyway, our aim is to model these data in such a way that enables us to determine the starts of the annual epidemics. To that end, in my last posting I developed a Metropolis-Hastings sampler called **mhsint** that updates integer based parameters, such as the week in which an epidemic starts or the week when it ends. What we need now is a Bayesian model for these data.

The model that I am going to adopt just follows my description. I will suppose that the counts follow a Poisson distribution and that the log of the mean number of cases has a linear drift over time plus a sine term that has a frequency of π/52 (≈0.06), so that it rises and falls back to zero over 52 weeks

log(mean) = a + b time/1000 + c sin(0.060415*week)

On top of this basic trend I’ll impose an annual epidemic of strength H_{i} that starts in week S_{i} and ends in week E_{i}, where i denotes the year. So

log(mean) = a + b time/1000 + c sin(0.060415*week)+H_{i} δ(year==i & week>=S_{i} & week<=E_{i})

Here δ is 0 or 1 depending on whether the condition holds.
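To make the notation concrete, here is an illustrative Python version of the linear predictor for a single observation. The parameter values are made up purely for demonstration; the actual fitting is done in Stata below:

```python
import math

# log(mean) = a + b*time/1000 + c*sin(0.060415*week) + H*delta
def log_mean(a, b, c, time, week, H, S, E):
    eta = a + b * time / 1000 + c * math.sin(0.060415 * week)
    if S <= week <= E:          # the delta term: inside the epidemic window
        eta += H
    return eta

# inside the epidemic window the predictor is raised by exactly H
inside = log_mean(0, 1, 2, 100, 25, 1.5, 20, 27)
outside = log_mean(0, 1, 2, 100, 25, 0.0, 20, 27)
print(round(inside - outside, 6))   # -> 1.5
```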

**Preparing the data**

The data can be downloaded from https://www.google.org/flutrends/about/ by clicking on **world** or the country that interests you. There is a parallel dataset for Dengue Fever. I copied the portion of the file that contains the data and since it is comma delimited, I saved it as a .csv file that I called googleflu.csv.

A little preprocessing is required to turn dates into week numbers and to drop incomplete years. Eventually we obtain, year=1…12 corresponding to 2003…2014, week = 1…52 or 53 (it depends when the first day of the first week falls in that year) and time = 1…627 denoting the week number.

insheet using googleflu.csv , clear
gen day = date(date,"YMD")
gen time = 1+(day-day[1])/7
gen year = real(substr(date,1,4))
gen start = year != year[_n-1]
gen t = start*time
sort year time
by year: egen w = sum(t)
gen week = time - w + 1
drop w t start
drop if year==2002 | year==2015
drop if chile == .
replace year = year - 2002
keep year week time chile
save tempflu.dta, replace

**Priors**

The intercept parameter, called a in my model formula, represents the log of the predicted number of cases at the start of the first year, when it is the height of the Chilean summer. I’ll suppose that a ~ N(0,sd=1). A value of 0 would correspond to an expectation of 1 case per week.

I’ve scaled the time by dividing by 1000 because I expect the trend per week to be very small. If the coefficient b were 1 then after 1000 weeks (≈20 years) the predicted number of cases would have drifted up by a factor of about 3. I’ll suppose that b ~ N(1,sd=1).

Each year the sine term goes from 0 to a peak of 1 and back to 0. In the absence of an epidemic I would imagine that the background predicted number of cases might increase by a factor of perhaps 10 between the summer and the winter, so I’ll make the coefficient of the sine term, c ~ N(2,sd=1).

As to the extent of the epidemics, I imagine that they might increase the background rate by a factor of 10 or even 100, but an epidemic cannot lower the number of cases, so the H_{i} must all be positive. I’ll make H_{i} ~ Gamma(2,0.5).

It is more difficult to place a prior on the start and end of an epidemic because S_{i} and E_{i} will not be independent. I’ll suppose that the epidemic is likely to start around week 20 and make

P(S_{i}=k) ∝ 1/(1+abs(k-20))   1 < S_{i} < 50

and

P(E_{i}=h|S_{i}=k) ∝ 1/(1+abs(h-k-8))   E_{i} > S_{i} & E_{i} <= 52

which would imply that I expect the epidemic to last 8 weeks.
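As a quick illustration of how such a discrete prior behaves, this Python fragment (my own sketch, with an assumed allowed range of weeks 2 to 49) normalizes the start-week prior and confirms that week 20 is the most probable start:

```python
# Normalising the discrete prior on the start week: P(S=k) ∝ 1/(1+|k-20|)
weights = {k: 1.0 / (1 + abs(k - 20)) for k in range(2, 50)}
total = sum(weights.values())
prior = {k: w / total for k, w in weights.items()}

mode = max(prior, key=prior.get)
print(mode)                                # -> 20, the most probable start week
print(round(sum(prior.values()), 6))       # -> 1.0
```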

**Calculating the log-posterior**

The code for calculating the log-posterior follows the description of the model. The only places where we must be careful are when guarding against an invalid model. This can happen when the limits of the epidemic move outside [1,52], or when the epidemic ends before it starts, or when the value of the log-posterior becomes infinite.

It is an unfortunate aspect of Stata that infinity is represented by a large positive number, so any time that the log-posterior becomes infinite the Metropolis-Hastings algorithm will automatically accept the move. In this model, this can happen if we try a very large value for the severity of the spike, because the log-posterior depends on the exponential of this value and scalars in Stata can only go up to about 10^{300} before they exceed the range of double precision.
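The same trap occurs in other languages. As an illustrative Python analogue (Python raises an error on overflow rather than returning a huge number, but the defensive idea is the same), one can catch the failure and return a very large negative value so that the proposed move is always rejected:

```python
import math

# If the -exp(eta) term of the Poisson log-likelihood cannot be evaluated,
# return a very large negative log-posterior so the move is rejected.
def safe_term(eta):
    try:
        val = -math.exp(eta)
    except OverflowError:
        return -9e300
    return val if math.isfinite(val) else -9e300

print(safe_term(0.0))      # -> -1.0
print(safe_term(1000.0))   # -> -9e+300
```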

In the following code, the parameters are arranged in a row matrix, b, so that a, b and c come first, then the 12 H_{i}’s, then the 12 S_{i}’s and finally the 12 E_{i}’s.

program logpost
    args lnf b
    local invalid = 0
    tempvar eta lnL
    gen `eta' = `b'[1,1]+`b'[1,2]*time/1000+`b'[1,3]*sin(0.060415*week)
    local hterm = 0
    forvalues i=1/12 {
        local h = `b'[1,`i'+3]
        local hterm = `hterm' + log(`h') - 2*`h'
        local S = `b'[1,`i'+15]
        local E = `b'[1,`i'+27]
        if `E' <= `S' | `E' > 52 | `S' < 2 | `h' > 250 local invalid = 1
        local hterm = `hterm' - log(1+abs(`S'-20)) - log(1+abs(`E'-`S'-8))
        replace `eta' = `eta' + `h' if year == `i' & week >= `S' & week <= `E'
    }
    if `invalid' == 1 scalar `lnf' = -9e300
    else {
        gen `lnL' = chile*`eta' - exp(`eta')
        qui su `lnL'
        scalar `lnf' = r(sum) + `hterm' - `b'[1,1]*`b'[1,1]/2 - (`b'[1,2]-1)*(`b'[1,2]-1)/2 ///
            - (`b'[1,3]-2)*(`b'[1,3]-2)/2
        if `lnf' == . scalar `lnf' = -9e300
    }
end

**Fitting the Model**

In fitting the model I have used the new Metropolis-Hastings sampler, mhsint, that was introduced last time. The version of that command that can be downloaded from my webpage allows a range parameter to control the maximum allowable move. In this analysis the range is set to 3, so the proposed moves for the start and end of an epidemic can be randomly up or down by 1, 2 or 3 weeks.

matrix b = (0, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ///
    20,20,20,20,20,20,20,20,20,20,20,20, ///
    30,30,30,30,30,30,30,30,30,30,30,30)
matrix s = J(1,39,0.1)
mcmcrun logpost b using temp.csv, replace ///
    param(a b c h1-h12 s1-s12 e1-e12) burn(1000) update(10000) thin(5) adapt ///
    samplers( 3(mhsnorm , sd(s) ) 12(mhslogn , sd(s)) 24(mhsint , range(3)) ) jpost

There are a couple of coding points to note: I use the adapt option to tune the standard deviations of the MH samplers during the burn-in, and I set the jpost option because I am calculating the joint posterior and not the conditional posteriors of the individual parameters.

**Examining the Results**

The list of summary statistics for the parameters is rather long, so I have only shown the H_{i}, S_{i}, E_{i} values for 2 of the 12 years.

    ----------------------------------------------------------------------------
    Parameter      n      mean       sd      sem    median          95% CrI
    ----------------------------------------------------------------------------
    a           2000    -0.755    0.067   0.0095    -0.757  ( -0.903, -0.620 )
    b           2000     2.979    0.107   0.0105     2.980  (  2.764,  3.194 )
    c           2000     2.186    0.062   0.0073     2.188  (  2.059,  2.312 )
    h4          2000     0.398    0.193   0.0097     0.365  (  0.105,  0.897 )
    h5          2000     1.648    0.068   0.0024     1.647  (  1.512,  1.788 )
    s4          2000    28.418   10.649   1.5287    28.000  (  3.000, 45.000 )
    s5          2000    19.851    0.358   0.0120    20.000  ( 19.000, 20.000 )
    e4          2000    50.090    2.849   0.1463    51.000  ( 42.000, 52.000 )
    e5          2000    26.878    0.355   0.0118    27.000  ( 26.000, 27.000 )
    ----------------------------------------------------------------------------

The intercept, a, and the two regression coefficients, b and c, are reasonably close to my prior estimates, although perhaps the upward drift is a little larger than I was expecting.

Most of the epidemics are easy to detect but I have chosen to display results for one year without a clear epidemic and one with. The epidemic in year 4 is poorly defined. It has a low height and might start any time between week 3 and week 45. In contrast the epidemic in year 5 is much more clear cut. It raised the expected winter level by exp(1.6)≈5 times and almost certainly started in week 20 and finished in week 27.

Convergence is not a major problem although the estimated epidemic in year 4 does jump around and were we particularly interested in that year we would need a much longer run in order to capture its posterior. A more realistic solution might be to allow the model to have years in which there is no epidemic. Perhaps we will return to try that modification at a later date.

If we save the posterior mean estimates we can use them to plot a fitted curve

insheet using temp.csv , clear
mcmcstats *
forvalues i=1/39 {
    local p`i' = r(mn`i')
}
use tempflu.dta, clear
gen eta = `p1' + `p2'*time/1000 + `p3'*sin(0.060415*week)
forvalues i=1/12 {
    local j = 3 + `i'
    local h = `p`j''
    local j = 15 + `i'
    local S = `p`j''
    local j = 27 + `i'
    local E = `p`j''
    replace eta = eta + `h' if year == `i' & week >= `S' & week <= `E'
}
gen mu = exp(eta)
twoway (line chile time ) (line mu time, lcol(red) lpat(dash)) , leg(off)

Here the black line represents the data and the red line is the fitted curve based on the posterior means. We seem to pick up the flu epidemics quite well and it is clear that the lack of an epidemic in year 4 and the strong epidemic in year 5 accord with the numerical results.

Since the model uses a log-link function we can see the fit better if we look at the log-count and the log of the fitted values.

The drift upwards is clearer although perhaps it was not linear over the whole range of the data. My overall impression is that the model has been relatively successful in capturing the trend.

There is much more to do but this is enough for one week.

The novel aspect of this project is that influenza is monitored by tracking online Google searches related to influenza. The number of searches has been calibrated against surveys of the number of medically diagnosed cases. You can find more information on Wikipedia.

There is a fascinating on-going debate in the media and on the internet about the ability of such search engine data to predict cases of influenza. What is most interesting is the emotion that this debate generates. People who are attracted by big data get enthusiastic about the project and play down its limitations, while people who equate big data with big brother stress the weakness in the flu estimates and overlook the potential. It is as though they are not really talking about Google flu trends but merely using the debate as a vehicle for justifying their wider prejudices.

Anyway, I will just accept the data and see how we might model them to detect influenza epidemics. After all, even if the search-based predictions are wrongly calibrated they might still rise when the true number of cases of flu rises, which could be enough to enable us to detect the start of an outbreak. As an example, here are the data for Chile.

The dashed vertical lines denote periods of 52 weeks. Remember that Chile is in the southern hemisphere so the winter influenza peak occurs around the middle of the year. Just looking at the plot one can see a small seasonal trend of the type that is most obvious in year 4 (around 200 weeks). Then on top of that there is usually an annual epidemic that causes a sharp spike of variable size and variable duration. For the purposes of monitoring influenza, we would like to analyse these data in real time and to decide as soon as possible when an epidemic is underway.

I will try to derive a Bayesian analysis of these data and in the process of developing and fitting such a model I will need to create a new sampler and integrate it into the general structure described in my book.

In my model the weekly counts will be thought of as being dependent on a latent variable that denotes whether or not an epidemic is in progress. We will define a variable Z that takes the value 0 when there is no current epidemic, and 1, when there is a current epidemic. Of course, Z is not observed but has to be inferred from the pattern of counts. Our aim is to detect as soon as possible when Z switches from 0 to 1.

It appears from the plot that it would be reasonable to start by assuming that there is exactly one flu epidemic every year and that the severity varies from the very minor (e.g. year 4) to the very extreme (year 7).

Each epidemic is characterised by the week that it starts, S, and the week that it ends, E. These parameters will take integer values and of course S must be less than E. It would be nice to be able to update estimates of S and E using a Metropolis-Hastings algorithm but the programs given in my book only work for parameters defined over a continuous range: **mhsnorm** is good for parameters defined over an infinite range (-∞,∞), **mhslogn** is good for (0,∞) and **mhstrnc** is good for (a,b), where a and b are finite. Our first job therefore is to design a new Metropolis-Hastings sampler that can deal with integer valued parameters.

Let’s suppose that θ is a parameter defined over the integers (1, 2, … , n) and that we have a program that works out the log-posterior given θ. If the current value of the parameter is θ_{0}, we need to make a proposal, say θ_{1}, and compare logpost(θ_{0}) with logpost(θ_{1}) to see if we accept a move to θ_{1} or reject it and stay at θ_{0}.

We might propose a random move to any other integer in the range [1,n]. The disadvantage of completely random moves is that many of the proposed moves will take us to poorly supported values and such moves are very likely to be rejected. Lots of rejected moves will mean poor mixing.

An alternative strategy would be to propose a move to a neighbouring value, i.e. from θ_{0}=12 to θ_{1}=11 or 13. Such local moves are more likely to be accepted but the change is small so the algorithm will require lots of moves to cover the posterior and this might also be inefficient, especially early in the chain when we have not located the peak of the posterior.

On balance my gut feeling favours short moves to neighbouring points. So let’s assume that when θ_{0}=12 we are equally likely to suggest a move up to 13 or a move down to 11. Similarly if θ_{0}=11 we are equally likely to propose a move to 10 or 12. The key feature is Pr(propose 11 | θ_{0}=12 ) = Pr(propose 12 | θ_{0}=11 ). This is the characteristic that defines the special case known as a Metropolis algorithm in which the proposal probabilities cancel and we only need to worry about the comparative values of the log-posterior.

There is a small problem at the end points of the range. Consider the case when θ_{0}=1: the move to θ_{1}=0 is not allowed because the range is [1,n]. We can cope with this either by modifying the rule for making proposals, or by making the log-posterior at θ_{1}=0 so small that the move is never accepted. The latter is easier to program.
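Before turning to the Stata implementation, the whole integer-valued Metropolis step is easy to prototype. Here is a minimal Python sketch of my own (the function names and the toy uniform target are illustrative, not part of the Stata code); out-of-range proposals are blocked by giving them a log-posterior of minus infinity, the same trick described above.

```python
import math
import random

def int_metropolis_step(theta, log_post):
    # Propose a move one step up or one step down with equal probability.
    proposal = theta + (1 if random.random() < 0.5 else -1)
    # The proposal is symmetric, so the acceptance ratio needs only the
    # log-posteriors; a -inf log-posterior blocks out-of-range moves.
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        return proposal          # accept the move
    return theta                 # reject and stay put

# Toy target: uniform on the integers 1..5.
def log_post(k):
    return 0.0 if 1 <= k <= 5 else -math.inf

random.seed(1)
theta, draws = 3, []
for _ in range(1000):
    theta = int_metropolis_step(theta, log_post)
    draws.append(theta)
```

With a uniform target every in-range move is accepted, so the chain is a simple random walk that stalls at the boundaries rather than stepping outside them.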

The program for the new sampler will follow the same pattern as **mhsnorm**. So let’s look at that program and see what needs to be changed. I have highlighted the few places that need our attention in bold.

program **mhsnorm**, rclass
    syntax anything [fweight] [if] [in], **SD(string)** [ LOGP(real 1e20) DEBUG ]
    tokenize "`anything'"
    local logpost "`1'"
    local b "`2'"
    local ipar = `3'
    tempname logl newlogl
    if "`debug'" == "" local qui "qui"
    local oldvalue = `b'[1,`ipar']
    scalar `logl' = .
    if `logp' < 1e19 scalar `logl' = `logp'
    if `logl' == . {
        `qui' `logpost' `logl' `b' `ipar' [`weight'`exp'] `if' `in'
        if `logl' == . {
            di as err "**mhsnorm**: cannot calculate log-posterior for parameter `ipar'"
            matrix list `b'
            exit(1400)
        }
    }
    **matrix `b'[1,`ipar'] = `oldvalue' + `sd'*rnormal()**
    scalar `newlogl' = .
    `qui' `logpost' `newlogl' `b' `ipar' [`weight'`exp'] `if' `in'
    if `newlogl' == . {
        di as err "**mhsnorm**: cannot calculate log-posterior for parameter `ipar'"
        matrix list `b'
        exit(1400)
    }
    if log(uniform()) < (`newlogl'-`logl') local accept = 1
    else {
        local accept = 0
        matrix `b'[1,`ipar'] = `oldvalue'
        scalar `newlogl' = `logl'
    }
    return scalar accept = `accept'
    return scalar logp = `newlogl'
end

Most of the program is involved in housekeeping and error checking so the changes are minimal. First, since I am going to call the new sampler **mhsint**, we need to change all references to mhsnorm into mhsint, then we need to drop the redundant SD option and finally we change the line that makes the new proposal from

**matrix `b'[1,`ipar'] = `oldvalue' + `sd'*rnormal()**

to

**matrix `b'[1,`ipar'] = `oldvalue' + 1 - 2*(runiform()<0.5)**

This ensures that the proposal has an equal chance of being 1 more or 1 less than the previous value.
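To convince yourself that this expression behaves as claimed, here is a tiny Python transcription of my own (purely illustrative; in Stata, as in Python, the comparison with 0.5 evaluates to 0 or 1):

```python
import random

random.seed(42)
old = 12
# old + 1 - 2*(u < 0.5) gives old+1 when u >= 0.5 and old-1 when u < 0.5
proposals = [old + 1 - 2*(random.random() < 0.5) for _ in range(10000)]
up, down = proposals.count(13), proposals.count(11)
```

Over many repetitions the counts of up-moves and down-moves balance out, as required for a symmetric Metropolis proposal.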

Let’s design a simple test to show how mhsint can be used with mcmcrun in just the same way as the other samplers mentioned in my book. I will suppose that a variable can take the values 1 or 2 or 3 with posterior probabilities 0.2, 0.3 and 0.5. Here is a program for calculating the log-posterior. If we move outside the range [1,3] then the log-posterior is set to a very low number so as to effectively block the move.

program logpost
    args lnf b
    local k = `b'[1,1]
    if `k' == 1 scalar `lnf' = log(0.2)
    else if `k' == 2 scalar `lnf' = log(0.3)
    else if `k' == 3 scalar `lnf' = log(0.5)
    else scalar `lnf' = -999.9
end

I’ll start the run with the variable (that I’m calling k) set to 2. Then I use mcmcrun to create a short burn-in and then 500 samples.

matrix b = (2)
mcmcrun logpost b using temp.csv, replace ///
    param(k) burn(50) update(500) samp( mhsint )
insheet using temp.csv, clear
tabulate k

          k |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        102       20.40       20.40
          2 |        145       29.00       49.40
          3 |        253       50.60      100.00
------------+-----------------------------------
      Total |        500      100.00

Clearly the chain has settled to the correct proportions.
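The same test is easy to replicate outside Stata as an independent check that a ±1 Metropolis scheme recovers the target probabilities 0.2, 0.3 and 0.5. Here is a self-contained Python sketch of my own (seed and chain length are arbitrary choices, and out-of-range moves are blocked with a very small log-posterior exactly as in logpost above):

```python
import math
import random

# Target probabilities on k = 1, 2, 3.
p = {1: 0.2, 2: 0.3, 3: 0.5}

def log_post(k):
    return math.log(p[k]) if k in p else -999.9  # block moves outside 1..3

random.seed(2024)
k, draws = 2, []
for it in range(20050):
    prop = k + (1 if random.random() < 0.5 else -1)   # symmetric +/-1 proposal
    if math.log(random.random()) < log_post(prop) - log_post(k):
        k = prop
    if it >= 50:                  # discard a short burn-in, as in the Stata run
        draws.append(k)

freq = {v: draws.count(v) / len(draws) for v in (1, 2, 3)}
```

With 20,000 retained draws the empirical frequencies settle close to 0.2, 0.3 and 0.5, mirroring the tabulate output above.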

Next let’s design a test that is closer to the flu problem. Suppose that we generate a sequence of 50 values, i.e. the flu counts for each week over most of a year. At the start of the year the counts will be Poisson with a mean of 2 but at some random point between 10 and 40 weeks they will switch to a Poisson with a mean of 4.

clear
set seed 626169
set obs 50
local start = int(10+runiform()*31)
gen t = _n
gen mu = 2 + 2*(t >= `start')
gen y = rpoisson(mu)
scatter y t, c(l) xline(`start')

In this simulation the switching point is at 31. What we need to do is to estimate the time of the switch, which I’ll again call k, from the 50 counts.

Here is a program that calculates the log-posterior. If we step outside the range [10,40] then the move is blocked by setting a very small log-posterior, otherwise we calculate a standard Poisson log-likelihood with known means 2 or 4 and a switch at the current k, which is stored as the first element of our parameter vector, b.

program logpost
    args lnf b
    if `b'[1,1] < 10 | `b'[1,1] > 40 scalar `lnf' = -9999.9
    else {
        tempvar mu lnL
        gen `mu' = 2
        qui replace `mu' = 4 if _n >= `b'[1,1]
        gen `lnL' = y*log(`mu') - `mu'
        qui su `lnL'
        scalar `lnf' = r(sum)
    }
end
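The same switch-point likelihood is easy to write down in Python as a cross-check. This is my own sketch, not the Stata analysis: the Poisson generator is Knuth's classic algorithm (Python's standard library has none), the seed is arbitrary, and the true switch is hard-coded at week 31 to mirror the simulation above. As in logpost, the constant log(y!) terms are dropped and out-of-range values of k are blocked.

```python
import math
import random

def rpois(mu):
    # Knuth's Poisson generator: multiply uniforms until below exp(-mu).
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

random.seed(314)                  # arbitrary seed for this sketch
start = 31                        # true switch, matching the Stata simulation
y = [rpois(2 if t < start else 4) for t in range(1, 51)]

def log_lik(k):
    # Poisson log-likelihood (constants dropped): mean 2 before week k,
    # mean 4 from week k on; moves outside [10,40] are blocked.
    if k < 10 or k > 40:
        return -9999.9
    return sum(yt*math.log(4 if t >= k else 2) - (4 if t >= k else 2)
               for t, yt in enumerate(y, start=1))

best = max(range(10, 41), key=log_lik)   # mode under a flat prior on [10,40]
```

Scanning k over [10,40] like this gives the profile of the log-posterior directly, which is a useful sanity check on what the MCMC histogram should look like.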

We can run the analysis much as before

matrix b = (25)
mcmcrun logpost b using temp.csv, replace ///
    param(k) burn(50) update(500) samp( mhsint )
insheet using temp.csv, clear
histogram k, start(9.5) width(1) xlabel(10(5)40)

The estimated log-posterior does indeed peak at 31 but, given the nature of the data that we saw in the time series plot, it is not surprising that the switch looks more likely to have come later in the sequence than earlier.

If we inspect the trace plot of the chain then we can see that we only made steps of size one but, despite this, the mixing looks reasonable.

It would be a simple matter to change from steps of size 1 to steps of size 1 or 2, or of size 1, 2 or 3. We just need to change the line of code that creates the random proposal.

As it stands, the program logpost does not include a prior on k, so the implicit assumption is that prior to seeing the data we thought that all switching points in the range [10,40] were equally likely. As it happens, this accords with the way that the data were generated, but had the data been real counts of flu in a small population, we might well have had prior beliefs about when the switch would occur. For example, we might believe that 25 is the most likely switching point and that the prior probability is proportional to 1/(1+abs(k-25)).

clear
range k 10 40 31
gen prior = 1/(1+abs(k-25))
twoway bar prior k

We can incorporate this prior into the calculation of the log-posterior

program logpost
    args lnf b
    if `b'[1,1] < 10 | `b'[1,1] > 40 scalar `lnf' = -9999.9
    else {
        tempvar mu lnL
        gen `mu' = 2
        qui replace `mu' = 4 if _n >= `b'[1,1]
        gen `lnL' = y*log(`mu') - `mu'
        qui su `lnL'
        scalar `lnf' = r(sum) - log(1+abs(`b'[1,1]-25))
    }
end

and the posterior is pulled slightly to the left.
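The direction of that pull can be seen without any Monte Carlo noise. Here is a small deterministic Python calculation of my own, using idealised counts set exactly equal to their means (2 before the switch at week 31 and 4 after), so that the likelihood peaks at k=31 by construction; with the 1/(1+|k-25|) prior the posterior mean moves to the left.

```python
import math

# Idealised counts: exactly 2 each week before the switch at week 31
# and exactly 4 from week 31 on.
y = [2]*30 + [4]*20

def log_lik(k):
    # Same Poisson log-likelihood as before, constants dropped.
    return sum(yt*math.log(4 if t >= k else 2) - (4 if t >= k else 2)
               for t, yt in enumerate(y, start=1))

def log_prior(k):
    return -math.log(1 + abs(k - 25))   # prior proportional to 1/(1+|k-25|)

ks = list(range(10, 41))

def post_mean(log_f):
    # Normalise exp(log_f) over k = 10..40 and return the posterior mean.
    w = [math.exp(log_f(k)) for k in ks]
    return sum(k*wk for k, wk in zip(ks, w)) / sum(w)

flat = post_mean(log_lik)                                   # flat prior
withprior = post_mean(lambda k: log_lik(k) + log_prior(k))  # with the prior
```

Because the prior is decreasing over the region where the likelihood concentrates, the posterior mean under the prior is strictly smaller than under the flat prior, which is exactly the leftward pull seen in the histogram.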

Next time I will use the new sampler to help me build a model for the flu data.
