No Bayesian analysis this week; instead I want to talk more generally about statistical computing. I think that this discussion follows on naturally from my recent postings about linking Stata and R.
As you might imagine, I am quite a fan of Stata, but I am not blind to its limitations, and for a growing number of tasks I find myself turning to R. This week I would like to discuss the relative merits of the two packages and speculate a little about the future of statistical computing software.
About 15 years ago I went on sabbatical to the Royal Children’s Hospital in Melbourne, where I had a fantastic time. When I was there, they asked me to teach a statistics course to a group of medics and researchers. Alongside the lectures on statistics, the students were to get a course in using Stata. At that time, I had never used Stata, so obviously I had to give myself a crash course before I started.
I was amazed at how quickly the students took to Stata and it was clear that Stata has a lot of advantages:
- it is easy to learn
- it offers a wide range of statistical analyses
- it has a great degree of flexibility
- it presents results in a clear format
- it is supported by a wide range of introductory textbooks
- it is well-controlled by StataCorp so that one can have real faith in Stata’s results
- using Stata really does help students to learn about statistics
When I returned to Leicester after the sabbatical, I pressed very strongly for us to adopt Stata on our masters course in medical statistics. At the time, most of the analysis on the course was done using SAS. I got my way and we switched to teaching both SAS and Stata, but it was immediately evident that given the choice, almost all of the students preferred to work with Stata.
About 5 or 10 years ago we introduced an optional, intensive short course in R for students interested in working in areas where R is the package of choice, such as genetic research.
This year we have made another major change: Stata and R will be taught as the main software for the masters course and SAS will become the option. Time will tell, but I have a feeling that, left to choose between Stata and R, most students are going to choose R. In many ways I regret this, but Stata is showing signs of being rather slow to adapt. Perhaps the strong control coming from StataCorp is a weakness as well as a strength.
In my opinion, Stata continues to win over R in terms of ease of use and in the way that it presents results, but RStudio has helped R to catch up and R has some important advantages of its own.
Obviously, R is free and this will always be a big plus, especially for students. However, Stata is reasonably priced and I do not see this as the main reason why R will win out. Just as important as the fact that R is free is the way that R works with the family of other free software available on the web, such as LaTeX and knitr, and the way that so much free support can be accessed via the internet.
Like everything associated with computing, statistics has changed enormously in recent years and I am sure that the pace of change will only increase. Here are a few obvious trends: much larger datasets, the growth in the use of Bayesian statistics, the increased use of other computer-intensive methods of analysis, data mining, dynamic computer graphics, automatic report writing, websites for accessing databases and websites for presenting the results of statistical analyses. In all of these areas, R is ahead of Stata.
Take the simple issue of the size of the dataset. What student is going to spend hours becoming an expert in using Stata if, when they come to do their project towards the end of the course, they cannot read the data because of an arbitrary limit on the number of Stata variables? R, on the other hand, is limited only by the memory of your computer.
What student is going to spend hours becoming an expert in using Stata if, when they want to perform the latest type of analysis, they have to switch to R because that is the program adopted by the researcher who developed the method?
A big influence on our students was a demonstration that I gave of the new version of R Markdown. No more cutting and pasting results into Word; instead, there is a very simple way of converting the output of an R script into HTML, Word, PDF or even a slide presentation.
Why is R drawing further and further ahead of Stata? Perhaps it is because R is not owned by a company but it is, in a real sense, the property of its users. These users drive R forward in a way that no company can hope to compete with. It is true that Stata has its own community of users, but academic statisticians and people interested in statistical computing seem to have adopted R.
Could Stata counter the rise of R? I think so, but I do not expect that it will. Instead of Stata trying to provide everything itself, my suggestion would be for Stata to act as a control centre communicating with other software. To take the example of using R that I have been posting about recently, Stata could make it much easier for its users to send jobs to R in order to take advantage of R’s wide range of packages. Stata could provide features like the smooth transfer of data, control of R running in the background while the user continues with Stata, a generic way of setting R options in Stata, access to R help through the Stata viewer, and so on.
And what is good for R is also good for other free programs available over the internet, like MiKTeX, OpenBUGS and PLINK.
Sadly, I don’t see this happening. I fear that Stata will continue along its current path, in which case my prediction is that it will become a tool for use in introductory statistics courses and for use by people who only want to do relatively basic statistical analyses. Let’s hope that I am wrong.
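The control-centre idea can be sketched in a few lines. The example below is written in Python rather than Stata, purely as an illustration of the pattern: hand the data over in a neutral format, write a script for the other program, and start that program in the background so that the user can carry on working. The function names `prepare_r_job` and `launch_in_background` and the file names `data.csv` and `job.R` are hypothetical; the only external assumption is that the `Rscript` command that ships with R is on the search path.

```python
import csv
import shutil
import subprocess
import tempfile
from pathlib import Path


def prepare_r_job(rows, r_code, workdir):
    """Write the data and an R script into workdir; return the command to run.

    rows is a non-empty list of dicts sharing the same keys (hypothetical
    stand-in for the dataset held by the controlling program).
    """
    workdir = Path(workdir)
    data_path = workdir / "data.csv"
    script_path = workdir / "job.R"

    # Data transfer: CSV is the simplest format that both sides can read.
    with open(data_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # The script loads the data into `d` before running the user's R code.
    # as_posix() gives forward slashes, which R accepts on every platform.
    script_path.write_text(
        f'd <- read.csv("{data_path.as_posix()}")\n{r_code}\n'
    )
    return ["Rscript", "--vanilla", str(script_path)]


def launch_in_background(command):
    """Start the job without blocking; return None if the program is missing."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )


if __name__ == "__main__":
    rows = [{"x": 1, "y": 2.1}, {"x": 2, "y": 3.9}, {"x": 3, "y": 6.2}]
    with tempfile.TemporaryDirectory() as tmp:
        cmd = prepare_r_job(rows, "print(coef(lm(y ~ x, data = d)))", tmp)
        job = launch_in_background(cmd)
        if job is None:
            print("Rscript not found; nothing was run")
        else:
            # Other work could happen here while R runs in the background.
            out, err = job.communicate()
            print(out or err)
```

A real implementation inside Stata would need more than this, for example returning results to the calling program and reporting R errors cleanly, but the division of labour would be the same.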
My practice is to write postings a couple of weeks in advance so that I am not under last-minute pressure. Since I wrote this particular posting, we have had one of our staff-student meetings in which the students can make general comments on the course. One of the points raised by the students was that they need more help with basic statistical tasks in R. Stata is well designed and it makes it easy to perform simple analyses, but it becomes more cumbersome when you want to program a non-standard task. R, on the other hand, requires a lot of basic skills before you can do even the simplest analysis, but comes into its own for more complex tasks. The student feedback was very helpful and we will certainly modify the way that we introduce R next year. Time will tell whether the problems that the students have had getting started with R will put some of them off using it. I have had one piece of coursework back from the students and almost every one of them used R, even though they had a free choice between R and Stata. Perhaps there will be a split whereby those who find computing easy will gravitate to R and those who struggle with computing will prefer Stata.