This is the third in a series of postings on the use of Dirichlet processes for non-parametric Bayesian analysis and their implementation in Stata.
In this posting I will create a Stata program that fits a Bayesian model incorporating a Dirichlet process prior. As my example I will model the peak expiratory flow measurements that were first used to illustrate the analysis of agreement studies in a 1986 Lancet article by Bland and Altman. These data were re-analysed by Sophia Rabe-Hesketh and Anders Skrondal in their book ‘Multilevel and Longitudinal Modeling Using Stata’. The data can be loaded into Stata using
. use http://www.stata-press.com/data/mlmus/pefr, clear
and the values are:
wp1 wp2 wm1 wm2
494 490 512 525
395 397 430 415
516 512 520 508
434 401 428 444
476 470 500 500
557 611 600 625
413 415 364 460
442 431 380 390
650 638 658 642
433 429 445 432
417 420 432 420
656 633 626 605
267 275 260 227
478 492 477 467
178 165 259 268
423 372 350 370
427 421 451 443
What we have here are repeated measurements of peak expiratory flow (how fast you can blow air out of your lungs, in litres/min) on 17 subjects. The measurements were made twice with the Wright peak flow meter (wp1 and wp2) and then twice with a simpler device called the mini Wright meter (wm1 and wm2). The objective was to measure the agreement between the two devices.
This problem has a hierarchical structure in that the subjects can be viewed as being sampled from a population with some distribution of peak flow. Then there is a random measurement error term associated with each reading that also has a distribution. The usual analysis assumes that both of these distributions are normal but we will create a model that keeps the normal distribution for the measurement errors but uses a Dirichlet process to give greater flexibility to the distribution across subjects.
We can write the model for subject i=1…17 with method j=1,2 on occasion k=1,2 as
Yijk = αi + β Xijk + εijk
where Xi1k = 0 and Xi2k = 1 denote which type of meter was used. The errors εi1k ~ N(0,τ1) and εi2k ~ N(0,τ2), where τ1 and τ2 are precisions, since the measurement errors might differ between meters. Finally the subject effects, αi ~ f(), where f is the distribution across subjects.
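To fix ideas, here is a minimal Python sketch (Python purely for illustration; the analysis itself is done in Stata below) that simulates one subject's four readings from this model. The function name and the parameter values at the bottom are hypothetical, chosen only to be of the right order of magnitude.

```python
import random

def simulate_subject(alpha_i, beta, sd1, sd2, rng):
    """Generate Y_ijk = alpha_i + beta*X_ijk + eps_ijk for one subject:
    X = 0 for the Wright meter (j=1), X = 1 for the mini Wright (j=2)."""
    wp = [alpha_i + rng.gauss(0, sd1) for _ in range(2)]         # Wright, k=1,2
    wm = [alpha_i + beta + rng.gauss(0, sd2) for _ in range(2)]  # mini Wright, k=1,2
    return wp, wm

# Hypothetical values: subject effect 450 L/min, meter offset 5, error sds 20 and 28
rng = random.Random(2024)
wp, wm = simulate_subject(450, 5, 20, 28, rng)
```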
Now we need some priors. Peak flow varies a lot with age, height and gender, with values typically ranging between 300 and 700 L/min. If a typical value is 500 L/min then we would not expect the meters to differ by more than, say, 10%, so β should be below 50, and hopefully a lot below 50. Measurement error would perhaps be 5%-10% and might be higher with the mini meter.
In the end I decided on the following priors:
f will be drawn from a Dirichlet process DP(M,F) with M = 100 and F=N(500,sd=100).
β ~ N(0,sd=25)
τ1 ~ G(3,0.001) and τ2 ~ G(2,0.001), where G(a,b) denotes a gamma distribution with shape a and scale b.
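Gamma priors on precisions are hard to read directly, so it is worth checking what they imply on the standard deviation scale, sd = 1/sqrt(τ). A small Python sketch (the helper `implied_sd` is my own invention, for illustration only):

```python
import random
import statistics

def implied_sd(shape, scale, n=20000, seed=1):
    """Median of 1/sqrt(tau) when tau ~ Gamma(shape, scale): the
    measurement-error sd, in L/min, that the prior considers typical."""
    rng = random.Random(seed)
    return statistics.median(1 / rng.gammavariate(shape, scale) ** 0.5
                             for _ in range(n))

# tau1 ~ G(3, 0.001) implies a typical error sd of roughly 19 L/min,
# consistent with the 5%-10% reasoning above
sd1_typical = implied_sd(3, 0.001)
```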
The program given below could easily be modified to accommodate your own priors should you not like mine.
So we want an MCMC algorithm that samples from the posterior
p(β, τ1, τ2, α, f | Y ) ∝ p(Y | β, τ1, τ2, α, f ) p(β) p(τ1) p(τ2) p(α | f) p(f)
The first term on the right hand side is the likelihood of the data, Y, and it does not depend on f() when we know αi, so the posterior simplifies to,
p(β, τ1, τ2, α, f | Y ) ∝ p(Y | β, τ1, τ2, α ) p(β) p(τ1) p(τ2) p(α | f) p(f)
If we integrate over this unknown distribution f() we get,
p(β, τ1, τ2, α | Y ) ∝ p(Y | β, τ1, τ2, α ) p(β) p(τ1) p(τ2) ∫ p(α | f) p(f) df
Although the integral is complex we do have the Chinese Restaurant algorithm that tells us how to sample successive α from ∫ p(α | f) p(f) df.
A Gibbs Sampler
The likelihood of the data, p(Y | β, τ1, τ2, α ), has a normal structure created by the measurement error, so my chosen priors for β and the τ’s are conjugate and updating is relatively straightforward (see chapter 4 of Bayesian Analysis with Stata).
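In outline, the two conjugate draws look like this. The Python below is an illustrative sketch with hypothetical helper names, not the Stata program given later (which implements exactly these draws in its own notation).

```python
import random

def update_beta(resid_sum, n, tau, prior_mean, prior_prec, rng):
    """Normal-normal conjugate draw: posterior precision is n*tau plus the
    prior precision; posterior mean is the precision-weighted average."""
    post_prec = n * tau + prior_prec
    post_mean = (tau * resid_sum + prior_prec * prior_mean) / post_prec
    return rng.gauss(post_mean, 1 / post_prec ** 0.5)

def update_tau(ss, n, prior_shape, prior_scale, rng):
    """Gamma conjugate draw for a precision: the shape gains n/2 and the
    rate (1/scale) gains half the residual sum of squares."""
    return rng.gammavariate(prior_shape + n / 2,
                            1 / (1 / prior_scale + ss / 2))

# e.g. beta with prior N(0, sd=25), i.e. precision 1/625, over 34 mini Wright readings
rng = random.Random(1)
beta_draw = update_beta(resid_sum=150.0, n=34, tau=0.0015,
                        prior_mean=0.0, prior_prec=1 / 625, rng=rng)
```

With the priors above, the τ1 update uses prior_shape = 3, prior_scale = 0.001 and n = 34 readings.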
p(f) is a Dirichlet process and as we have already noted we can sample α from ∫ p(α | f) p(f) df using the Chinese restaurant algorithm. However, now we have an extra problem because there is information about α in the likelihood as well as in this prior.
The Chinese restaurant algorithm says that we draw αi conditional on all other parameters (including the other α’s) from a mixture distribution that we can write as
Σj≠i δ(αj)/(M+16) + M F(α)/(M+16)
This is to be read as meaning that we take either one of the other 16 existing α’s, each with probability 1/(M+16), or we make a fresh draw from F with probability M/(M+16).
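A draw of all the subject effects from this prior alone (before seeing the data) can be coded directly; the Python sketch below (illustrative only, with a hypothetical `crp_draws` helper) mirrors that recipe.

```python
import random

def crp_draws(n, M, base_draw, seed=1):
    """Draw n values from DP(M, F) via the Chinese restaurant process:
    the k-th draw is fresh from the base F with probability M/(M+k),
    otherwise it repeats one of the k existing values, chosen uniformly."""
    rng = random.Random(seed)
    values = []
    for k in range(n):
        if rng.random() < M / (M + k):
            values.append(base_draw(rng))      # fresh draw from F
        else:
            values.append(rng.choice(values))  # repeat an existing value
    return values

# 17 subject effects with M = 100 and base F = N(500, sd=100)
alphas = crp_draws(17, 100, lambda r: r.gauss(500, 100))
```

A small M produces many repeated values (few clusters); a large M makes the draws look like an ordinary sample from F.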
If we denote the part of the likelihood that contains αi as L then we must draw one of the existing α’s, say αj, with probability proportional to L(αj)/(M+16), or a fresh value with probability proportional to M ∫ L(α)F(α) dα /(M+16); when a fresh value is called for it is drawn from the density proportional to L(α)F(α). The constant of proportionality can be found by using the fact that the 17 probabilities must sum to one.
Generally, the need to integrate and sample from L(α)F(α) will severely limit the usefulness of this algorithm, but here we have a likelihood based on a normal distribution and a normal base distribution F, and it turns out that the algebra is tractable.
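To make the recipe concrete, here is an illustrative Python sketch of the conditional draw for a single αi, simplified to one meter with a known error precision; the full Stata update below handles both meters. All names are hypothetical, but the weights q0 and qj mirror the quantities computed in the Stata program.

```python
import math
import random

def draw_alpha_i(y_i, other_alphas, tau_e, M, A, T, rng):
    """Conditional draw for one subject effect given its readings y_i
    (one meter, known error precision tau_e) and the other subjects'
    current alphas; base distribution F = N(A, precision T)."""
    n = len(y_i)
    C2 = T + n * tau_e                        # precision of a fresh draw
    C1 = T * A + tau_e * sum(y_i)
    C0 = T * A * A + tau_e * sum(y * y for y in y_i)
    # weight on a fresh draw: M times the marginal likelihood of y_i under F
    q0 = M * math.sqrt(T / C2) * math.exp(-0.5 * (C0 - C1 * C1 / C2))
    # weight on reusing alpha_j: its likelihood for this subject's readings
    qs = [math.exp(-0.5 * tau_e * sum((y - a) ** 2 for y in y_i))
          for a in other_alphas]
    u = rng.random() * (q0 + sum(qs))
    if u < q0:
        return rng.gauss(C1 / C2, 1 / math.sqrt(C2))  # fresh draw from L x F
    u -= q0
    for a, q in zip(other_alphas, qs):
        if u < q:
            return a                                  # reuse an existing alpha
        u -= q
    return other_alphas[-1]   # guard against floating-point round-off
```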
The algebra is a bit painful and quite difficult to write out in a blog that does not make it easy to create equations, so I have written it out in full in a downloadable pdf file: wright.
Here is a Stata program that fits the model using the algorithm described in the pdf.
use http://www.stata-press.com/data/mlmus/pefr, clear
*------------------------
* PRIORS
*------------------------
local a1 = 3
local b1 = 0.001
local a2 = 2
local b2 = 0.001
local m = 0
local t = 1/625
local A = 500
local T = 0.0001
local M = 100
*------------------------
* Initial values
*------------------------
gen alpha = (wp1+wp2)/2
gen tmp = wp1 - wp2
qui su tmp
local t1 = 2/r(Var)
qui replace tmp = wm1 - wm2
qui su tmp
local t2 = 2/r(Var)
qui replace tmp = (wp1+wp2)/2 - alpha
qui su tmp
local b = r(mean)
*------------------------
* Updates
*------------------------
tempname pf
postfile `pf' beta tau1 tau2 alpha1 alpha2 alpha3 alpha4 alpha5 alpha6 alpha7 alpha8 alpha9 alpha10 alpha11 alpha12 ///
   alpha13 alpha14 alpha15 alpha16 alpha17 using temp.dta, replace
forvalues iter=1/10500 {
   *------------------------
   * beta (b)
   *------------------------
   local tb = 34*`t2'+`t'
   qui replace tmp = (wm1-alpha) + (wm2-alpha)
   qui su tmp
   local mb = (`t2'*r(sum)+`t'*`m')/`tb'
   local b = rnormal(`mb',1/sqrt(`tb'))
   *------------------------
   * tau1 (t1)
   *------------------------
   qui replace tmp = (wp1-alpha)*(wp1-alpha)+(wp2-alpha)*(wp2-alpha)
   qui su tmp
   local at1 = `a1'+17
   local bt1 = 1/(0.5*r(sum)+1/`b1')
   local t1 = rgamma(`at1',`bt1')
   *------------------------
   * tau2 (t2)
   *------------------------
   qui replace tmp = (wm1-alpha-`b')*(wm1-alpha-`b')+(wm2-alpha-`b')*(wm2-alpha-`b')
   qui su tmp
   local at2 = `a2'+17
   local bt2 = 1/(0.5*r(sum)+1/`b2')
   local t2 = rgamma(`at2',`bt2')
   *------------------------
   * alpha
   *------------------------
   forvalues i=1/17 {
      local C2 = `T' + 2*`t1' + 2*`t2'
      local C1 = `T'*`A' + `t1'*(wp1[`i']+wp2[`i']) + ///
         `t2'*((wm1[`i']-`b')+(wm2[`i']-`b'))
      local C0 = `T'*`A'*`A' + `t1'*(wp1[`i']*wp1[`i']+wp2[`i']*wp2[`i']) ///
         + `t2'*((wm1[`i']-`b')*(wm1[`i']-`b')+(wm2[`i']-`b')*(wm2[`i']-`b'))
      local q0 = `M'*sqrt(`T'/`C2')*exp(-0.5*(`C0'-`C1'*`C1'/`C2'))
      local sq = `q0'
      forvalues j=1/17 {
         if `j' == `i' local q`j' = 0
         else local q`j' = exp(-0.5*(`t1'*(wp1[`i']-alpha[`j'])* ///
            (wp1[`i']-alpha[`j']) + `t1'*(wp2[`i']-alpha[`j'])*(wp2[`i']-alpha[`j']) ///
            + `t2'*(wm1[`i']-alpha[`j']-`b')*(wm1[`i']-alpha[`j']-`b') + ///
            `t2'*(wm2[`i']-alpha[`j']-`b')*(wm2[`i']-alpha[`j']-`b') ))
         local sq = `sq' + `q`j''
      }
      forvalues j=0/17 {
         local q`j' = `q`j''/`sq'
      }
      local j = 0
      local sq = `q0'
      local u = runiform()
      while `u' > `sq' {
         local ++j
         local sq = `sq' + `q`j''
      }
      if `j' == 0 qui replace alpha = rnormal(`C1'/`C2',1/sqrt(`C2')) in `i'
      else qui replace alpha = alpha[`j'] in `i'
   }
   local plist "(`b') (`t1') (`t2')"
   forvalues i=1/17 {
      local a = alpha[`i']
      local plist "`plist' (`a')"
   }
   post `pf' `plist'
}
postclose `pf'
use temp.dta, clear
gen sd1 = 1/sqrt(tau1)
gen sd2 = 1/sqrt(tau2)
I ran this program to create 10,500 updates and then discarded the first 500 as a burn-in, although this was rather longer than needed since my starting values are quite good. Here are the results:
. mcmcstats *
----------------------------------------------------------------------------
Parameter       n      mean        sd      sem    median        95% CrI
----------------------------------------------------------------------------
beta        10000     4.759     5.766   0.0806     4.796  (  -6.585,  16.060 )
tau1        10000     0.003     0.001   0.0000     0.003  (   0.001,   0.005 )
tau2        10000     0.001     0.001   0.0000     0.001  (   0.001,   0.003 )
alpha1      10000   499.418    11.764   0.1309   499.185  ( 477.360, 523.361 )
alpha2      10000   405.752    11.634   0.1332   405.933  ( 383.264, 428.662 )
alpha3      10000   511.760    11.172   0.1151   511.822  ( 489.389, 534.095 )
alpha4      10000   422.584    10.841   0.1178   422.445  ( 401.530, 444.636 )
alpha5      10000   481.346    11.436   0.1284   481.084  ( 459.092, 504.483 )
alpha6      10000   591.071    11.740   0.1301   590.951  ( 568.720, 615.034 )
alpha7      10000   413.595    10.816   0.1153   413.818  ( 391.458, 434.218 )
alpha8      10000   418.100    12.824   0.1795   419.044  ( 390.077, 441.458 )
alpha9      10000   642.327    10.814   0.1136   642.332  ( 620.764, 663.581 )
alpha10     10000   431.244    10.917   0.1148   430.854  ( 410.868, 453.090 )
alpha11     10000   420.439    10.662   0.1117   420.406  ( 399.531, 441.874 )
alpha12     10000   631.905    12.173   0.1495   632.674  ( 606.440, 654.117 )
alpha13     10000   262.062    12.832   0.1573   262.716  ( 234.243, 285.572 )
alpha14     10000   479.474    11.340   0.1235   479.730  ( 456.457, 501.834 )
alpha15     10000   205.544    17.827   0.3663   203.432  ( 175.602, 245.963 )
alpha16     10000   385.349    13.025   0.1674   385.670  ( 359.127, 410.078 )
alpha17     10000   429.787    11.392   0.1265   429.315  ( 408.605, 453.572 )
sd1         10000    19.689     4.161   0.0974    18.929  (  13.626,  29.886 )
sd2         10000    27.858     4.881   0.0887    27.631  (  18.978,  38.418 )
----------------------------------------------------------------------------
Let us contrast these results with the same model but with M=20 in the Dirichlet process prior. A smaller M represents less confidence in the normal base distribution.
----------------------------------------------------------------------------
Parameter       n      mean        sd      sem    median        95% CrI
----------------------------------------------------------------------------
beta        10000     5.028     5.785   0.0805     5.051  (  -6.592,  16.412 )
tau1        10000     0.003     0.001   0.0000     0.003  (   0.001,   0.006 )
tau2        10000     0.001     0.001   0.0000     0.001  (   0.001,   0.003 )
alpha1      10000   498.001    11.686   0.1313   497.852  ( 476.034, 521.054 )
alpha2      10000   407.969    11.642   0.1351   408.766  ( 384.703, 428.393 )
alpha3      10000   510.131    11.398   0.1212   509.970  ( 488.062, 532.669 )
alpha4      10000   421.499     9.441   0.1087   421.396  ( 402.823, 441.031 )
alpha5      10000   482.188    11.449   0.1306   481.866  ( 460.235, 505.658 )
alpha6      10000   591.471    12.285   0.1438   590.892  ( 568.875, 617.403 )
alpha7      10000   414.766     9.953   0.1116   415.716  ( 393.231, 432.483 )
alpha8      10000   418.245    11.662   0.1671   419.348  ( 391.538, 439.163 )
alpha9      10000   640.930    10.551   0.1171   640.892  ( 620.243, 662.228 )
alpha10     10000   427.722    10.377   0.1164   426.519  ( 409.715, 450.368 )
alpha11     10000   419.820     9.452   0.1044   420.137  ( 400.499, 438.751 )
alpha12     10000   633.032    12.139   0.1641   634.452  ( 605.031, 653.556 )
alpha13     10000   260.309    14.378   0.2157   261.895  ( 229.308, 285.398 )
alpha14     10000   480.481    11.333   0.1265   480.546  ( 456.773, 502.093 )
alpha15     10000   206.370    19.913   0.4823   202.659  ( 175.060, 248.730 )
alpha16     10000   387.677    13.881   0.1853   388.309  ( 358.905, 413.419 )
alpha17     10000   426.614    10.398   0.1213   425.484  ( 408.662, 449.858 )
sd1         10000    19.694     4.503   0.1144    18.673  (  13.332,  30.498 )
sd2         10000    28.151     5.137   0.0956    27.970  (  18.771,  38.937 )
----------------------------------------------------------------------------
It is interesting to see that the prior on the distribution of the subject effects has a noticeable influence on the estimate of beta, the parameter that measures the average difference between the two meters.
In my recent postings I have tried to explain the idea behind the Dirichlet process in a simple, non-mathematical way and to demonstrate that these priors can be incorporated into practical Stata programs for data analysis.
So far I have only scratched the surface of a vast and very active area of research. Other possibilities include making the base distribution depend on parameters in a hierarchical structure, placing a prior on the concentration parameter M, and extending to non-normal likelihoods and non-normal base distributions.
Dirichlet process priors give rise to data sets with clusters of repeated observations, so the results are inherently discrete even though we are modelling a continuous distribution. Some people find this unsatisfactory, and so a lot of work has gone into developing Dirichlet process mixture models, in which the Dirichlet process defines the distribution of the parameters of a standard distribution such as the normal. The data are then drawn from a mixture of those standard distributions and consequently there are no repeated values. This creates a natural extension of a finite mixture model.
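As a flavour of that idea, this illustrative Python sketch (the helper name is hypothetical) draws a sample from a Dirichlet process mixture of normals: the Chinese restaurant process clusters the component means, and each observation is then a continuous normal draw around its cluster's mean, so the exact ties disappear.

```python
import random

def dpm_sample(n, M, base_draw, comp_sd, seed=1):
    """Sample n observations from a DP mixture of normals: the Chinese
    restaurant process assigns each observation a cluster mean, then the
    observation is a continuous normal draw around that mean."""
    rng = random.Random(seed)
    means = []
    for k in range(n):
        if rng.random() < M / (M + k):
            means.append(base_draw(rng))      # open a new cluster
        else:
            means.append(rng.choice(means))   # join an existing cluster
    return [rng.gauss(m, comp_sd) for m in means]

# 100 draws: clustered means from N(500, 100), continuous data around them
y = dpm_sample(100, 2, lambda r: r.gauss(500, 100), 20)
```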
For all of these modified Dirichlet process models there are competing algorithms that might be used and these could be exact or approximate, fast or slow.
Dirichlet process methods are very powerful and have found application in virtually every area of applied research. The peak flow example demonstrates that they can be implemented in Stata but in my next posting I want to ask whether using Stata for such analyses is a good idea. After all, why implement an analysis in Stata if it is easier to program it in R, or even better, if someone else has already written the R code?
Unfortunately, that discussion will have to wait for a couple of weeks as I have a conference to go to.