My recent postings have been on Bayesian non-parametric analysis with Dirichlet processes and they have raised a couple of questions in my mind that I should now like to discuss.
Is Stata a suitable vehicle for advanced statistical analysis?
and if so,
Who should do the programming?
Dirichlet processes (DPs) provide very flexible Bayesian priors with an infinite number of parameters and as such they are quite complex and they require special algorithms. Despite this, they have been widely used and they are the subject of a lot of interesting recent research.
I tried to pitch my postings at what I consider to be a typical Stata user; in my mind this means someone who is keen to use the best statistical models but who is not themselves developing new theory or designing new algorithms. So I wrote my postings to explain the principles of a DP and I took a very simple example to show how a Gibbs sampler written in Stata could be used to fit a DP model. My hope was that, if the reader understood the principles, then they would be in a better position to follow more complex applications.
What next? Should I provide Stata programs for other DP applications? Or would it be better to advise a Stata user interested in such advanced analyses to learn R, because R is the language used by most of the statisticians who are developing these methods. There is, for instance, a package in R called, rather boringly, DPpackage, that already contains a host of programs for different Bayesian analyses with DP priors.
Before I get too negative let me comment on a few of the positives about using Stata. First, the interface is much better than R, better even than RStudio. Second Stata is more uniformly reliable than R; R does not have the equivalent of StataCorp to produce new commands and so has to rely on its users to do all of the applications programming. Next, Mata is fantastic and gives Stata a big advantage over many other statistics programs in that we can use it to produce fast compiled code in a fully integrated way. So a suite of Mata programs for DP models would be fast and perfectly practical. Finally, Stata has a large number of users doing important applied work who would benefit from access to such advanced methods, but these researchers might never use methods like DPs if it means transferring to R. Statistical programs have an important educational role because they introduce non-specialist users to new methods of analysis.
Despite these positives, there is no way that a I have the time to duplicate all of the work that went into DPpackage in order to create an equivalent set of programs in Stata and even if I did have the time, it is questionable whether this would be a good way forward.
One alternative would be for StataCorp to take on this programming task. They have the resources and would, I’m sure, do a fine job, but DP models are just one of a myriad of advanced statistical techniques and if they were to join the queue, they would have relatively low priority and we would probably be at Stata release 50 before they saw the light of day. As yet Stata does not even have an integrated program for basic Bayesian analysis, which is why I wrote the book on which this blog is based.
So what is the way forward? Well, in my opinion, no statistics package can ever hope to offer everything and so it should not try. When I want to run a basic Bayesian analysis I might use WinBUGS or OpenBUGS, programs specifically written for that purpose and so, to facilitate that, I wrote a set of Stata commands that make it easy to use Stata to configure the data for WinBugs and then call the program and read the results back into Stata. This model of working could be more widely used.
In my day job, I analyse a lot of genetic data so I have written a Stata program that enables me to send an analysis to a program called PLINK and to read the results back into Stata. So let’s not try to duplicate R’s DPpackage in Stata, instead let us write a program that creates an R job within Stata and sends it to R and then reads back the results. The key would then be to make this interface as smooth as possible.
Under this model of working, Stata becomes the control centre sending jobs to whatever software is most suitable for that particular computation.
Can Stata talk to R? Certainly it can, I know because I regularly use this approach. The limitation is that unless you can program in the R language, the interface between the two packages needs to do that work for you; effectively you would need a Stata ado file that understands how to prepare the R script for the particular task that you want to perform.
My applications are much simpler. R can read Stata .dta files so I do not need to prepare special data files. I write a text file containing the R script because I know the R language and then I call R from within Stata and I read the results into Stata via a text file. I could, of course, just use R and drop Stata entirely (as sometimes I do), but I like working in Stata particularly when I want to explore my data or summarize the results and anyway, Stata is useful when I am collaborating with colleagues who do not know R.
To help a Stata user who is not familiar with R, I could write a Stata ado file that understands the options in DPpackage and which creates the desired R job and runs it. This would certainly be quicker and easier for me than it would be to re-write the whole DPpackage.
If I could request two things that would really increase the scope of Stata, they would be; integrated methods for communicating with other statistical programs and some way of handling very large data sets. With these facilities we would not need to duplicate the code for every specialised application that takes our fancy. Perhaps though, this would be against the interests of StataCorp, after all who would upgrade to Stata15 if they already had access to all of its new features by calling equivalent functions available in R?
Over my next few postings I will take another example that uses a Dirichlet process prior and show how it can be analysed using Stata in combination with the DPpackage in R. Then I will draft a Stata ado file that automates the process.
Dear John, thank you very much for addressing these topics here. I’m wondering if there is any plan to address Dynamic Linear Models (DLM) here. I’d really appreciate this.
Marie-Eve, I am pleased that you are finding my blog helpful. I try to plan my postings ahead of time, so for example, I know that the next few will deal with linking Stata and R, but I am always on the look out for new topics and I’ll add DLMs to the list.
Dear Professor, I think the DPpackage written by Jara is very good although running MCMC using R is not the most efficient way. Stata has a very large user group (most users doing social sciences only use Stata), so coding Bayesian nonparametric things in Stata would be very exciting!! But I am afraid that running such an intensive MCMC algorithm using Stata would be much slower…
You make an important point – there is a huge community of Stata users who would be reluctant to switch to R but who want access to analyses that are not available in Stata – my current series of postings suggest how we might make R packages available to Stata users – the problem is where do we start? there are lots of good R packages and it is so difficult to know which would be most useful to Stata users – My proposal is to use R for computation in the background so the speed of Stata is not an issue – however my experience is that Mata is as fast as any R package
Thanks for your reply. I have read through your last posts. That’s very exciting! I really like Stata and I have worked with Stata for ages. But recently I have to use Matlab & R to code the DP mixture. Do you mean executing an ado file in stata which is programmed using “mata:” would be as fast as R package?