This is another in a series of posting about using conducting advanced statistical analyses in Stata by sending a job to R and then reading the results back into Stata. Our task for testing this process is to use the R package DPpackage to fit a Bayesian Dirichlet process mixture (DPM) model for smoothing a scatter plot. The data for demonstrating the code consist of 298 IgG measurements made on children of different ages.
In my last posting I described how we can use Stata’s –shell- command to submit an R job and this time we will consider how the R script can be created by a Stata ado file so that a Stata user need know nothing of R when they fit the model.
Previously, we worked with a text file called DPC.R than contains the R script. Repeating what we had last time, the file contains,
setwd(‘E:/IGG’)
require(foreign)
require(DPpackage)
D <- read.dta(‘igg.dta’)
zbar <- c(7,3)
Sigma <- matrix(c(4,1,1,1),2,2)
MyPrior <- list(alpha=5, k0=0.25, nu1=10, m1=zbar, psiinv1=7*solve(Sigma))
MyDesign <- list(nburn=5000, nsave=5000, nskip=3, ndisplay=1000)
xp <- c(1,2,3,4,5)
fitIgG <- DPcdensity(y=D$igg, x=D$age, xpred=xp,ngrid=100,
compute.band=TRUE, type.band=”HPD”, prior=MyPrior, mcmc=MyDesign,
state=NULL, status=TRUE)
nClu <- as.data.frame(fitIgG$save.state$thetasave[,7])
names(nClu) <- “nclus”
write.dta(nClu,file=’nclusters.dta’)
cDen <- as.data.frame(cbind(fitIgG$grid,t(rbind(fitIgG$densp.m,
fitIgG$densp.l,fitIgG$densp.h))))
head(cDen)
names(cDen) <- c(“IgG”,”m1″,”m2″,”m3″,”m4”,”m5”,
“l1″,”l2″,”l3″,”l4″,”l5″,”h1″,”h2″,”h3″,”h4″,”h5”)
write.dta(cDen,file=’cDensity.dta’)
Last time we prepared this script, submitted it from within Stata and read the results into Stata for further processing. This approach is fine for those of us who know R but let’s now suppose that we want to prepare a program for those Stata users who know nothing of R. We will need to prepare a program that writes the script for them as well as executing it.
There are lots of ways of going this but I will describe my preferred method.
Within my PERSONAL folder I have a subfolder called RTEMPLATES and here I store the templates for the R scripts that run the jobs that I want to make available. Here is the template corresponding to DPC.R.
setwd(‘@’)
if( !require(‘foreign’) ) {
install.packages(‘foreign’,repos=’http://cran.ma.imperial.ac.uk/’,dep=TRUE)
if( !require(‘foreign’) ) stop() }
if( !require(‘DPpackage’) ) {
install.packages(‘DPpackage’,repos=’http://cran.ma.imperial.ac.uk/’,dep=TRUE)
if( !require(‘DPpackage’) ) stop() }
D <- read.dta(‘@’)
zbar <- c(@)
Sigma <- matrix(c(@),2,2)
MyPrior <- list(alpha=@, k0=@, nu1=@, m1=zbar, psiinv1=@*solve(Sigma))
MyDesign <- list(nburn=@, nsave=@, nskip=@, ndisplay=1000)
xp <- c(@)
fitIgG <- DPcdensity(y=D£@, x=D£@, xpred=xp, ngrid=@,
compute.band=TRUE, type.band=”HPD”, prior=MyPrior, mcmc=MyDesign,
state=NULL, status=TRUE)
nClu <- as.data.frame(fitIgG£save.state£thetasave[,7])
names(nClu) <- “nclus”
write.dta(nClu,file=’nclusters.dta’)
cDen <- as.data.frame(cbind(fitIgG£grid,t(rbind(fitIgG£densp.m,
fitIgG£densp.l,fitIgG£densp.h))))
names(cDen) <- c(@)
write.dta(cDen,file=’cDensity.dta’)
A template is almost the same as the script itself but with three minor changes; the symbol @ replaces items that are problem dependent, such as the values of the priors, these will be replaced before the program is run; the symbol £ replaces $ and there is a little extra code associated with loading packages. The £’s stop Stata from treating lines containing code, such as D$age, as if they contain a Stata global and the extra code associated with loading packages is intended to protect the Stata user from the need to install the package.
In R, packages are installed over the internet and stored in a library; from there they can be loaded when needed. The extra code tries to load the package using R’s require() function but if that fails, the most likely reason is that the package has never been installed into the user’s R library, so an attempt is made to do that. In my example, I try to install from the R mirror site at Imperial College in London. Good practice says that you ought to install from your closest mirror site, so if your intended Stata user is not in southern England then you ought to replace this address with the address of a mirror site in their own country. You can find the addresses of the mirror sites by issuing the following command in R ,
> M <- getCRANmirrors()
and then looking at M. If your program is intended for Stata users in many different countries then you would need to make the Stata user choose a mirror site, perhaps by creating an ado file that shows the user a list of available sites that you obtain via the getCRANmirrors function.
I have also written a program called rtemplate.ado that fills in the missing elements of an R template with its arguments. So the command,
rtemplate DPcdensity.txt temp.R a b c “3,4”
would read the template DPcdensity.txt from RTEMPLATES and create a file temp.R in the current working directory. In temp.R, the first @ is replaced by “a” the second by “b” etc. At the same time pounds are converted into dollars. The source for this ado file is very simple,
program rtemplate
gettoken template 0 : 0
gettoken outfile 0 : 0
tempname fhin fhout
local personal : sysdir PERSONAL
file open `fhout’ using “`outfile'”, write replace
file open `fhin’ using “`personal’RTEMPLATES/`template'”, read
file read `fhin’ inLine
while !r(eof) {
tokenize `”`inLine'”‘ , parse(“@£”)
local j = 1
while `”“j””‘ != “” {
if `”“j””‘ == “@” {
gettoken item 0 : 0
file write `fhout’ `”`item'”‘
}
else if `”“j””‘ == “£” file write `fhout’ “$”
else file write `fhout’ `”“j””‘
local ++j
}
file write `fhout’ _n
file read `fhin’ inLine
}
file close `fhin’
file close `fhout’
end
Now all I need to do is to write a program that calculates the missing items, @, and sends them to rtemplate.ado in the correct order. For instance,
program r_DPcdensity
syntax varlist [ using/ ], XPredict(numlist) PMean(name) ///
PCOVariance(name) PAlpha(real) PK0(real) PNU1(real) ///
[ Burnin(integer 1000) Updates(integer 5000) Skip(integer 3) ///
NGrid(integer 50) ] LOGfile(string)
* current working directory
if “`using'” == “” tempfile using
local wd = subinstr(“`c(pwd)'”,”\”,”/”,.)
* temporary data file contain x and y
tokenize `varlist’
local y “`1′”
local x “`2′”
tempfile dfile
local dfile = subinstr(“`dfile'”,”\”,”/”,.)
preserve
keep `varlist’
save “`dfile'”, replace
restore
* mean of base distribution
local a1 = `pmean'[1,1]
local a2 = `pmean'[1,2]
local mv = “`a1′,`a2′”
* covariance matrix of base distribution
local a1 = `pcovariance'[1,1]
local a2 = `pcovariance'[1,2]
local a3 = `pcovariance'[2,2]
local cv = “`a1′,`a2′,`a2′,`a3′”
* df of base distribution
local df = `pnu1′ – 2 – 1
* values of xpredict & names for m,l,h
local xlist “”
local mlist “”
local llist “”
local hlist “”
local i = 0
foreach v of numlist `xpredict’ {
local ++i
local mlist `”`mlist’,”m`i'””‘
local llist `”`llist’,”l`i'””‘
local hlist `”`hlist’,”h`i'””‘
if “`xlist'” == “” local xlist “`v'”
else local xlist “`xlist’,`v'”
}
* make the script
rtemplate DPcdensity “`using'” “`wd'” “`dfile'” “`mv'” “`cv'” ///
“`palpha'” “`pk0′” “`pnu1′” “`df'” “`burnin'” “`updates'” ///
“`skip'” “`xlist'” “`y'” “`x'” “`ngrid'” `”`y’`mlist’`llist’`hlist'”‘
* run the R script
cap rm `logfile’
execute R CMD BATCH `using’ `logfile’
end
Now a Stata user can have all of the complexity of R hidden from them and all they need to do is to type commands such as,
. use igg.dta, clear
. matrix mu = (7,3)
. matrix sigma = (4,1\1,1)
. r_DPcdensity igg age using temp.R , xp(1 2 3 4 5) priorm(mu) priorcov(sigma) ///
priora(5) priork0(0.25) priornu1(10) log(log.out)
. type log.out
. use nclusters.dta, clear
. histogram nclus , start(0.5) width(1) freq title(Number of Clusters)
and they would fit a Bayesian DPM model with their chosen priors and plot a histogram of the number of clusters at each iteration of the sampler. The computation will be done in R but the Stata user need know nothing of that.
The key to making the facilities of R available within Stata it to prepare good quality ado files with a structure similar to r_DPcdensity.ado and good quality help files that explain the options. The ado file given above is very basic and would benefit from more checking of the options and some helpful error messages. None the less, writing a good version of r_DPcdensity.ado is very, very much easier than rewriting the source code of the R function DPcdensity.
Next time I will complete this series of postings on using R with Stata by revisiting the way in which results are transferred from R to Stata. Rather than re-package the results into neat .dta files as I have done so far, I will show how the entire set of results can be dumped to a text file and read by Stata.
Regarding automated installing of R packages from an R mirror site,
a general approach regardless of geographic location is to use:
http://cran.rstudio.com/. For example: install.packages(‘DPpackage’, repos=’http://cran.rstudio.com/’, dep=TRUE)
Thanks, I did not know that. One of the best things about blogging is that the blogger learns so much.