This is the third in a series of postings on the use of Dirichlet processes for non-parametric Bayesian analysis and their implementation in Stata.
In this posting I will create a Stata program that fits a Bayesian model incorporating a Dirichlet process prior. As my example I will model the peak expiratory flow measurements that were first used to illustrate the analysis of agreement studies in a 1986 Lancet article by Bland and Altman. These data were re-analysed by Sophia Rabe-Hesketh and Anders Skrondal in their book ‘Multilevel and Longitudinal Modeling Using Stata’. The data can be loaded into Stata using
. use http://www.stata-press.com/data/mlmus/pefr, clear
and the values are:
wp1 wp2 wm1 wm2
494 490 512 525
395 397 430 415
516 512 520 508
434 401 428 444
476 470 500 500
557 611 600 625
413 415 364 460
442 431 380 390
650 638 658 642
433 429 445 432
417 420 432 420
656 633 626 605
267 275 260 227
478 492 477 467
178 165 259 268
423 372 350 370
427 421 451 443
What we have here are repeated measurements of peak expiratory flow (how fast you can blow air out of your lungs, in litres/min) on 17 subjects. The measurements were made twice with the Wright peak flow meter (wp1 and wp2) and then twice with a simpler device called the mini Wright meter (wm1 and wm2). The objective was to measure the agreement between the two devices.
This problem has a hierarchical structure in that the subjects can be viewed as being sampled from a population with some distribution of peak flow. Then there is a random measurement error term associated with each reading that also has a distribution. The usual analysis assumes that both of these distributions are normal but we will create a model that keeps the normal distribution for the measurement errors but uses a Dirichlet process to give greater flexibility to the distribution across subjects.
We can write the model for subject i=1…17 with method j=1,2 on occasion k=1,2 as
Yijk = αi + β Xijk + εijk
where Xi1k = 0 and Xi2k = 1 denote which type of meter was used. The errors εi1k ~ N(0,τ1) and εi2k ~ N(0,τ2), where τ1 and τ2 are precisions, since the measurement errors might differ between meters. Finally the subject effects, αi ~ f(), where f is the distribution across subjects.
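To fix ideas, here is a minimal Python sketch (Python purely for illustration; the analysis itself is done in Stata below) that simulates one subject's four readings from this model. The function name and the parameter values at the bottom are hypothetical, chosen only to be of the right order of magnitude.

```python
import random

def simulate_subject(alpha_i, beta, sd1, sd2, rng):
    """Generate Y_ijk = alpha_i + beta*X_ijk + eps_ijk for one subject:
    X = 0 for the Wright meter (j=1), X = 1 for the mini Wright (j=2)."""
    wp = [alpha_i + rng.gauss(0, sd1) for _ in range(2)]         # Wright, k=1,2
    wm = [alpha_i + beta + rng.gauss(0, sd2) for _ in range(2)]  # mini Wright, k=1,2
    return wp, wm

# Hypothetical values: subject effect 450 L/min, meter offset 5, error sds 20 and 28
rng = random.Random(2024)
wp, wm = simulate_subject(450, 5, 20, 28, rng)
```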
Now we need some priors. Peak flow varies a lot with age, height and gender, with values typically ranging between 300 and 700 L/min. If a typical value is 500 L/min then we would not expect the meters to differ by more than, say, 10%, so β should be below 50, and hopefully a lot below 50. Measurement error would perhaps be 5%-10% and might be higher with the mini meter.
In the end I decided on the following priors:
f will be drawn from a Dirichlet process DP(M,F) with M = 100 and F=N(500,sd=100).
β ~ N(0,sd=25)
τ1 ~ G(3,0.001) and τ2 ~ G(2,0.001), where G(a,b) denotes a gamma distribution with shape a and scale b.
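Gamma priors on precisions are hard to read directly, so it is worth checking what they imply on the standard deviation scale, sd = 1/sqrt(τ). A small Python sketch (the helper `implied_sd` is my own invention, for illustration only):

```python
import random
import statistics

def implied_sd(shape, scale, n=20000, seed=1):
    """Median of 1/sqrt(tau) when tau ~ Gamma(shape, scale): the
    measurement-error sd, in L/min, that the prior considers typical."""
    rng = random.Random(seed)
    return statistics.median(1 / rng.gammavariate(shape, scale) ** 0.5
                             for _ in range(n))

# tau1 ~ G(3, 0.001) implies a typical error sd of roughly 19 L/min,
# consistent with the 5%-10% reasoning above
sd1_typical = implied_sd(3, 0.001)
```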
The program given below could easily be modified to accommodate your own priors should you not like mine.
So we want an MCMC algorithm that samples from the posterior
p(β, τ1, τ2, α, f | Y ) ∝ p(Y | β, τ1, τ2, α, f ) p(β) p(τ1) p(τ2) p(α | f) p(f)
The first term on the right hand side is the likelihood of the data, Y, and it does not depend on f() when we know αi, so the posterior simplifies to,
p(β, τ1, τ2, α, f | Y ) ∝ p(Y | β, τ1, τ2, α ) p(β) p(τ1) p(τ2) p(α | f) p(f)
If we integrate over this unknown distribution f() we get,
p(β, τ1, τ2, α | Y ) ∝ p(Y | β, τ1, τ2, α ) p(β) p(τ1) p(τ2) ∫ p(α | f) p(f) df
Although the integral is complex we do have the Chinese Restaurant algorithm that tells us how to sample successive α from ∫ p(α | f) p(f) df.
A Gibbs Sampler
The likelihood of the data, p(Y | β, τ1, τ2, α ), has a normal structure created by the measurement error, so my chosen priors for β and the τ’s are conjugate and updating is relatively straightforward (see chapter 4 of Bayesian Analysis with Stata).
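In outline, the two conjugate draws look like this. The Python below is an illustrative sketch with hypothetical helper names, not the Stata program given later (which implements exactly these draws in its own notation).

```python
import random

def update_beta(resid_sum, n, tau, prior_mean, prior_prec, rng):
    """Normal-normal conjugate draw: posterior precision is n*tau plus the
    prior precision; posterior mean is the precision-weighted average."""
    post_prec = n * tau + prior_prec
    post_mean = (tau * resid_sum + prior_prec * prior_mean) / post_prec
    return rng.gauss(post_mean, 1 / post_prec ** 0.5)

def update_tau(ss, n, prior_shape, prior_scale, rng):
    """Gamma conjugate draw for a precision: the shape gains n/2 and the
    rate (1/scale) gains half the residual sum of squares."""
    return rng.gammavariate(prior_shape + n / 2,
                            1 / (1 / prior_scale + ss / 2))

# e.g. beta with prior N(0, sd=25), i.e. precision 1/625, over 34 mini Wright readings
rng = random.Random(1)
beta_draw = update_beta(resid_sum=150.0, n=34, tau=0.0015,
                        prior_mean=0.0, prior_prec=1 / 625, rng=rng)
```

With the priors above, the τ1 update uses prior_shape = 3, prior_scale = 0.001 and n = 34 readings.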
p(f) is a Dirichlet process and as we have already noted we can sample α from ∫ p(α | f) p(f) df using the Chinese restaurant algorithm. However, now we have an extra problem because there is information about α in the likelihood as well as in this prior.
The Chinese restaurant algorithm says that we draw αi conditional on all other parameters (including the other α’s) from a mixture distribution that we can write as
Σj≠i δ(αj)/(M+16) + M F(α)/(M+16)
This is to be read as meaning that we take either one of the other 16 existing α’s, each with probability 1/(M+16), or we make a fresh draw from F with probability M/(M+16).
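A draw of all the subject effects from this prior alone (before seeing the data) can be coded directly; the Python sketch below (illustrative only, with a hypothetical `crp_draws` helper) mirrors that recipe.

```python
import random

def crp_draws(n, M, base_draw, seed=1):
    """Draw n values from DP(M, F) via the Chinese restaurant process:
    the k-th draw is fresh from the base F with probability M/(M+k),
    otherwise it repeats one of the k existing values, chosen uniformly."""
    rng = random.Random(seed)
    values = []
    for k in range(n):
        if rng.random() < M / (M + k):
            values.append(base_draw(rng))      # fresh draw from F
        else:
            values.append(rng.choice(values))  # repeat an existing value
    return values

# 17 subject effects with M = 100 and base F = N(500, sd=100)
alphas = crp_draws(17, 100, lambda r: r.gauss(500, 100))
```

A small M produces many repeated values (few clusters); a large M makes the draws look like an ordinary sample from F.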
If we denote the part of the likelihood that contains αi as L then we must draw one of the existing α’s, say αj, with probability proportional to L(αj)/(M+16), or a fresh value with probability proportional to M ∫ L(α)F(α) dα /(M+16); when a fresh value is called for it is drawn from the density proportional to L(α)F(α). The constant of proportionality can be found by using the fact that the 17 probabilities must sum to one.
Generally, the need to integrate and sample from L(α)F(α) will severely limit the usefulness of this algorithm, but here we have a likelihood based on a normal distribution and a normal base distribution F, and it turns out that the algebra is tractable.
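To make the recipe concrete, here is an illustrative Python sketch of the conditional draw for a single αi, simplified to one meter with a known error precision; the full Stata update below handles both meters. All names are hypothetical, but the weights q0 and qj mirror the quantities computed in the Stata program.

```python
import math
import random

def draw_alpha_i(y_i, other_alphas, tau_e, M, A, T, rng):
    """Conditional draw for one subject effect given its readings y_i
    (one meter, known error precision tau_e) and the other subjects'
    current alphas; base distribution F = N(A, precision T)."""
    n = len(y_i)
    C2 = T + n * tau_e                        # precision of a fresh draw
    C1 = T * A + tau_e * sum(y_i)
    C0 = T * A * A + tau_e * sum(y * y for y in y_i)
    # weight on a fresh draw: M times the marginal likelihood of y_i under F
    q0 = M * math.sqrt(T / C2) * math.exp(-0.5 * (C0 - C1 * C1 / C2))
    # weight on reusing alpha_j: its likelihood for this subject's readings
    qs = [math.exp(-0.5 * tau_e * sum((y - a) ** 2 for y in y_i))
          for a in other_alphas]
    u = rng.random() * (q0 + sum(qs))
    if u < q0:
        return rng.gauss(C1 / C2, 1 / math.sqrt(C2))  # fresh draw from L x F
    u -= q0
    for a, q in zip(other_alphas, qs):
        if u < q:
            return a                                  # reuse an existing alpha
        u -= q
    return other_alphas[-1]   # guard against floating-point round-off
```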
The algebra is a bit painful and quite difficult to write out in a blog that does not make it easy to create equations, so I have written it out in full in a downloadable pdf file: wright.
Here is a Stata program that fits the model using the algorithm described in the pdf.
use http://www.stata-press.com/data/mlmus/pefr, clear
*------------------------
* PRIORS
*------------------------
local a1 = 3
local b1 = 0.001
local a2 = 2
local b2 = 0.001
local m = 0
local t = 1/625
local A = 500
local T = 0.0001
local M = 100
*------------------------
* Initial values
*------------------------
gen alpha = (wp1+wp2)/2
gen tmp = wp1 - wp2
qui su tmp
local t1 = 2/r(Var)
qui replace tmp = wm1 - wm2
qui su tmp
local t2 = 2/r(Var)
qui replace tmp = (wp1+wp2)/2 - alpha
qui su tmp
local b = r(mean)
*------------------------
* Updates
*------------------------
tempname pf
postfile `pf' beta tau1 tau2 alpha1 alpha2 alpha3 alpha4 alpha5 alpha6 alpha7 alpha8 alpha9 alpha10 alpha11 alpha12 ///
   alpha13 alpha14 alpha15 alpha16 alpha17 using temp.dta, replace
forvalues iter=1/10500 {
   *------------------------
   * beta (b)
   *------------------------
   local tb = 34*`t2'+`t'
   qui replace tmp = (wm1-alpha) + (wm2-alpha)
   qui su tmp
   local mb = (`t2'*r(sum)+`t'*`m')/`tb'
   local b = rnormal(`mb',1/sqrt(`tb'))
   *------------------------
   * tau1 (t1)
   *------------------------
   qui replace tmp = (wp1-alpha)*(wp1-alpha)+(wp2-alpha)*(wp2-alpha)
   qui su tmp
   local at1 = `a1'+17
   local bt1 = 1/(0.5*r(sum)+1/`b1')
   local t1 = rgamma(`at1',`bt1')
   *------------------------
   * tau2 (t2)
   *------------------------
   qui replace tmp = (wm1-alpha-`b')*(wm1-alpha-`b')+(wm2-alpha-`b')*(wm2-alpha-`b')
   qui su tmp
   local at2 = `a2'+17
   local bt2 = 1/(0.5*r(sum)+1/`b2')
   local t2 = rgamma(`at2',`bt2')
   *------------------------
   * alpha
   *------------------------
   forvalues i=1/17 {
      local C2 = `T' + 2*`t1' + 2*`t2'
      local C1 = `T'*`A' + `t1'*(wp1[`i']+wp2[`i']) + ///
         `t2'*((wm1[`i']-`b')+(wm2[`i']-`b'))
      local C0 = `T'*`A'*`A' + `t1'*(wp1[`i']*wp1[`i']+wp2[`i']*wp2[`i']) ///
         + `t2'*((wm1[`i']-`b')*(wm1[`i']-`b')+(wm2[`i']-`b')*(wm2[`i']-`b'))
      local q0 = `M'*sqrt(`T'/`C2')*exp(-0.5*(`C0'-`C1'*`C1'/`C2'))
      local sq = `q0'
      forvalues j=1/17 {
         if `j' == `i' local q`j' = 0
         else local q`j' = exp(-0.5*(`t1'*(wp1[`i']-alpha[`j'])* ///
            (wp1[`i']-alpha[`j']) + `t1'*(wp2[`i']-alpha[`j'])*(wp2[`i']-alpha[`j']) ///
            + `t2'*(wm1[`i']-alpha[`j']-`b')*(wm1[`i']-alpha[`j']-`b') + ///
            `t2'*(wm2[`i']-alpha[`j']-`b')*(wm2[`i']-alpha[`j']-`b') ))
         local sq = `sq' + `q`j''
      }
      forvalues j=0/17 {
         local q`j' = `q`j''/`sq'
      }
      local j = 0
      local sq = `q0'
      local u = runiform()
      while `u' > `sq' {
         local ++j
         local sq = `sq' + `q`j''
      }
      if `j' == 0 qui replace alpha = rnormal(`C1'/`C2',1/sqrt(`C2')) in `i'
      else qui replace alpha = alpha[`j'] in `i'
   }
   local plist "(`b') (`t1') (`t2')"
   forvalues i=1/17 {
      local a = alpha[`i']
      local plist "`plist' (`a')"
   }
   post `pf' `plist'
}
postclose `pf'
use temp.dta, clear
gen sd1 = 1/sqrt(tau1)
gen sd2 = 1/sqrt(tau2)
I ran this program to create 10,500 updates and then discarded the first 500 as a burn-in, although this was rather longer than needed since my starting values are quite good. Here are the results:
. mcmcstats *
----------------------------------------------------------------------------
Parameter       n      mean        sd      sem    median        95% CrI
----------------------------------------------------------------------------
beta        10000     4.759     5.766   0.0806     4.796  (  -6.585,  16.060 )
tau1        10000     0.003     0.001   0.0000     0.003  (   0.001,   0.005 )
tau2        10000     0.001     0.001   0.0000     0.001  (   0.001,   0.003 )
alpha1      10000   499.418    11.764   0.1309   499.185  ( 477.360, 523.361 )
alpha2      10000   405.752    11.634   0.1332   405.933  ( 383.264, 428.662 )
alpha3      10000   511.760    11.172   0.1151   511.822  ( 489.389, 534.095 )
alpha4      10000   422.584    10.841   0.1178   422.445  ( 401.530, 444.636 )
alpha5      10000   481.346    11.436   0.1284   481.084  ( 459.092, 504.483 )
alpha6      10000   591.071    11.740   0.1301   590.951  ( 568.720, 615.034 )
alpha7      10000   413.595    10.816   0.1153   413.818  ( 391.458, 434.218 )
alpha8      10000   418.100    12.824   0.1795   419.044  ( 390.077, 441.458 )
alpha9      10000   642.327    10.814   0.1136   642.332  ( 620.764, 663.581 )
alpha10     10000   431.244    10.917   0.1148   430.854  ( 410.868, 453.090 )
alpha11     10000   420.439    10.662   0.1117   420.406  ( 399.531, 441.874 )
alpha12     10000   631.905    12.173   0.1495   632.674  ( 606.440, 654.117 )
alpha13     10000   262.062    12.832   0.1573   262.716  ( 234.243, 285.572 )
alpha14     10000   479.474    11.340   0.1235   479.730  ( 456.457, 501.834 )
alpha15     10000   205.544    17.827   0.3663   203.432  ( 175.602, 245.963 )
alpha16     10000   385.349    13.025   0.1674   385.670  ( 359.127, 410.078 )
alpha17     10000   429.787    11.392   0.1265   429.315  ( 408.605, 453.572 )
sd1         10000    19.689     4.161   0.0974    18.929  (  13.626,  29.886 )
sd2         10000    27.858     4.881   0.0887    27.631  (  18.978,  38.418 )
----------------------------------------------------------------------------
Let us contrast these results with the same model but with M=20 in the Dirichlet process prior. A smaller M represents less confidence in the normal base distribution.
----------------------------------------------------------------------------
Parameter       n      mean        sd      sem    median        95% CrI
----------------------------------------------------------------------------
beta        10000     5.028     5.785   0.0805     5.051  (  -6.592,  16.412 )
tau1        10000     0.003     0.001   0.0000     0.003  (   0.001,   0.006 )
tau2        10000     0.001     0.001   0.0000     0.001  (   0.001,   0.003 )
alpha1      10000   498.001    11.686   0.1313   497.852  ( 476.034, 521.054 )
alpha2      10000   407.969    11.642   0.1351   408.766  ( 384.703, 428.393 )
alpha3      10000   510.131    11.398   0.1212   509.970  ( 488.062, 532.669 )
alpha4      10000   421.499     9.441   0.1087   421.396  ( 402.823, 441.031 )
alpha5      10000   482.188    11.449   0.1306   481.866  ( 460.235, 505.658 )
alpha6      10000   591.471    12.285   0.1438   590.892  ( 568.875, 617.403 )
alpha7      10000   414.766     9.953   0.1116   415.716  ( 393.231, 432.483 )
alpha8      10000   418.245    11.662   0.1671   419.348  ( 391.538, 439.163 )
alpha9      10000   640.930    10.551   0.1171   640.892  ( 620.243, 662.228 )
alpha10     10000   427.722    10.377   0.1164   426.519  ( 409.715, 450.368 )
alpha11     10000   419.820     9.452   0.1044   420.137  ( 400.499, 438.751 )
alpha12     10000   633.032    12.139   0.1641   634.452  ( 605.031, 653.556 )
alpha13     10000   260.309    14.378   0.2157   261.895  ( 229.308, 285.398 )
alpha14     10000   480.481    11.333   0.1265   480.546  ( 456.773, 502.093 )
alpha15     10000   206.370    19.913   0.4823   202.659  ( 175.060, 248.730 )
alpha16     10000   387.677    13.881   0.1853   388.309  ( 358.905, 413.419 )
alpha17     10000   426.614    10.398   0.1213   425.484  ( 408.662, 449.858 )
sd1         10000    19.694     4.503   0.1144    18.673  (  13.332,  30.498 )
sd2         10000    28.151     5.137   0.0956    27.970  (  18.771,  38.937 )
----------------------------------------------------------------------------
It is interesting to see that the prior on the distribution of the subject effects has a noticeable influence on the estimate of beta, the parameter that measures the average difference between the two meters.
In my recent postings I have tried to explain the idea behind the Dirichlet process in a simple, non-mathematical way and to demonstrate that these priors can be incorporated into practical Stata programs for data analysis.
So far I have only scratched the surface of a vast and very active area of research. Other possibilities include making the base distribution depend on parameters in a hierarchical structure, placing a prior on the concentration parameter M, and extending to non-normal likelihoods and non-normal base distributions.
Dirichlet process priors give rise to data sets with clusters of repeated observations, so the results are inherently discrete even though we are modelling a continuous distribution. Some people find this unsatisfactory, and so a lot of work has gone into developing Dirichlet process mixture models, in which the Dirichlet process defines the distribution of the parameters of a standard distribution such as the normal. The data are then drawn from a mixture of those standard distributions and consequently there are no repeated values. This creates a natural extension of a finite mixture model.
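As a flavour of that idea, this illustrative Python sketch (the helper name is hypothetical) draws a sample from a Dirichlet process mixture of normals: the Chinese restaurant process clusters the component means, and each observation is then a continuous normal draw around its cluster's mean, so the exact ties disappear.

```python
import random

def dpm_sample(n, M, base_draw, comp_sd, seed=1):
    """Sample n observations from a DP mixture of normals: the Chinese
    restaurant process assigns each observation a cluster mean, then the
    observation is a continuous normal draw around that mean."""
    rng = random.Random(seed)
    means = []
    for k in range(n):
        if rng.random() < M / (M + k):
            means.append(base_draw(rng))      # open a new cluster
        else:
            means.append(rng.choice(means))   # join an existing cluster
    return [rng.gauss(m, comp_sd) for m in means]

# 100 draws: clustered means from N(500, 100), continuous data around them
y = dpm_sample(100, 2, lambda r: r.gauss(500, 100), 20)
```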
For all of these modified Dirichlet process models there are competing algorithms that might be used and these could be exact or approximate, fast or slow.
Dirichlet process methods are very powerful and have found application in virtually every area of applied research. The peak flow example demonstrates that they can be implemented in Stata but in my next posting I want to ask whether using Stata for such analyses is a good idea. After all, why implement an analysis in Stata if it is easier to program it in R, or even better, if someone else has already written the R code?
Unfortunately, that discussion will have to wait for a couple of weeks as I have a conference to go to.