This is the last in a series of posting about using conducting advanced statistical analyses in Stata by sending a job to R and then reading the results back into Stata. I have spent a while on this topic and although I believe it to be very important, next time I want to return to discussing simpler matters.

Our task for illustrating the process has been to use an R package called DPpackage to fit a Bayesian Dirichlet process mixture (DPM) model for smoothing a scatter plot. The dataset for demonstrating the code contains 298 IgG measurements made on children of different ages.

So far we have seen how to write a Stata ado file that prepares an R script, sends it to R and then reads back the results. To simplify the task, up until now I have supposed that the R script includes R commands for packaging the results into Stata .dta files so that reading the results into Stata is simple.

Many R packages create very complicated list structures of their results and so R code for extraction and re-packaging for export to Stata is not trivial. For this reason I usually adopt a different approach; I dump the entire results structure into a text file and then I use Stata ado files to read the text file and extract any elements that I want. This makes my R scripts simpler and it means that I have access to all of the results in Stata and I am not limited to those exported by the R script.

First let’s see how we dump the results to a text file. Here is the R template that I developed last time but with the lines that extract the results replaced by new code for dumping the results structure (in green).

setwd(‘@’)

if( !require(‘foreign’) ) {

install.packages(‘foreign’,repos=’http://cran.ma.imperial.ac.uk/’,dep=TRUE)

if( !require(‘foreign’) ) stop() }

if( !require(‘DPpackage’) ) {

install.packages(‘DPpackage’,repos=’http://cran.ma.imperial.ac.uk/’,dep=TRUE)

if( !require(‘DPpackage’) ) stop() }

D <- read.dta(‘@’)

zbar <- c(@)

Sigma <- matrix(c(@),2,2)

MyPrior <- list(alpha=@, k0=@, nu1=@, m1=zbar, psiinv1=@*solve(Sigma))

MyDesign <- list(nburn=@, nsave=@, nskip=@, ndisplay=1000)

xp <- c(@)

fitIgG <- DPcdensity(y=D£@, x=D£@, xpred=xp, ngrid=@,

compute.band=TRUE, type.band=”HPD”, prior=MyPrior, mcmc=MyDesign,

state=NULL, status=TRUE)

sink(‘@’)

dput(fitIgG)

sink()

In R, the sink() function redirects output to a file, the name of which will be specified later as it is represented by an @ in the template. Then I write out the list structure of results using the dput() function and finally I redirect the output back to the log file. This approach makes for much simpler R code and places the emphasis of the programming where it belongs, in Stata.

This code saves the entire set of results to a text file, but it is not a text file that is easy to use because the structure of results is quite complex. To illustrate this here are the first few lines of the results created by this particular R script.

structure(list(call = quote(DPcdensity.default(y = D$igg, x = D$age,

xpred = xpred, ngrid = 50, compute.band = TRUE, type.band = “PD”,

prior = prior, mcmc = mcmc, state = NULL, status = TRUE)),

y = c(1.5, 2.70000004768372, 1.89999997615814, 4, 1.89999997615814,

4.40000009536743, 1.5, 2.20000004768372, 1.60000002384186,

4.69999980926514, 1.60000002384186, 1.39999997615814, 3,

The full text file has 7640 lines.

To enable me to extract data from such files I have written two Stata ado files. One called **rdescribe.ado** that reads dumped results and lists all of the items that could be extracted and one called **rextract.ado** that extracts a requested item.

Let’s suppose that I dumped my DPcdensity results to a file called fitigg.txt. The Stata command

. rdescribe using fitigg.txt

would list the names of the items in the results file but not their values. For this example the list is quite long, so I just give the first few lines.

Structures and sub-structures within file: fitigg.txt

call (string)

y (single variable)

coefficients (matrix or set of variables)

varnames (single variable)

modelname (single value)

cpo (single variable)

fso (single variable)

prior (list)

alpha (single value)

k0 (single value)

m1 (single variable)

nu1 (single value)

psiinv1 (matrix or set of variables)

Dim: 2, 2

mcmc (list)

nburn (single value)

nsave (single value)

nskip (single value)

ndisplay (single value)

state (list)

alpha (single value)

m1 (matrix or set of variables)

Dim: 2, 1

muclus (matrix or set of variables)

Dim: 398, 2

We see that the results include an item called **call**, which is a string containing the R command that was executed, next is an item containing the values of the y variable and so it goes on.

The results structure called **prior** contains the priors that I chose and is itself a list including items such as **alpha** and **k0**. In fact you will see that there are two items called **alpha**, one in **prior** and another in **state** (state gives the contents of the sampler at the end of the run and can be used to continue the chain if desired).

Once we know what is in the results file we can use **rextract.ado** to extract any of the items. For instance,

. rextract using iggresults.txt , item(y) clear

Will clear any existing data from Stata’s memory and extract y into a Stata variable with that name.

. rextract using iggresults.txt , item(prior(alpha))

Extracts the value of alpha (a single number) from the list called prior and returns it as the local r(prior_alpha).

. rextract using iggresults.txt , item(state(muclus)) matrix

Extracts muclus (cluster means at the end of the run) and stores them in a 398×2 Stata matrix called state_mclus.

. rextract using iggresults.txt , item(state(muclus)) clear

Also extracts muclus but this time stores the results in two Stata variables called state_muclus_1 and state_muclus_2.

So there you have it. There is no need to program advanced statistical analyses in Stata or even in Mata. Instead we only need to prepare a conversion program that enables a Stata user to select their options and that then prepares the script and sends it to R. In this way any Stata user can have access to these advanced analyses even if they have no idea how to program in R.

The code for **rdescribe.ado** and **rextract.ado** are contained in the pdf rdescribe_rextract that you can download from here and then copy and paste into separate ado files. The programs were prepared for personal use rather than general release, but you should not have a problem getting them to work.

## Recent Comments