No Bayesian analysis this week; instead, I want to talk more generally about statistical computing. I think that this discussion follows on naturally from my recent postings about linking Stata and R.
As you might imagine, I am quite a fan of Stata, but I am not blind to its limitations, and for a growing number of tasks I find myself turning to R. This week I would like to discuss the relative merits of the two packages and speculate a little about the future of statistical computing software.
About 15 years ago I went on sabbatical to the Royal Children’s Hospital in Melbourne, where I had a fantastic time. When I was there, they asked me to teach a statistics course to a group of medics and researchers. Alongside the lectures on statistics, the students were to get a course in using Stata. At that time, I had never used Stata, so obviously I had to give myself a crash course before I started.
I was amazed at how quickly the students took to Stata and it was clear that Stata has a lot of advantages:
- it is easy to learn
- it offers a wide range of statistical analyses
- it has a great degree of flexibility
- it presents results in a clear format
- it is supported by a wide range of introductory textbooks
- it is well-controlled by StataCorp so that one can have real faith in Stata’s results
- using Stata really does help students to learn about statistics
When I returned to Leicester after the sabbatical, I pressed very strongly for us to adopt Stata on our masters course in medical statistics. At the time, most of the analysis on the course was done using SAS. I got my way and we switched to teaching both SAS and Stata, but it was immediately evident that given the choice, almost all of the students preferred to work with Stata.
About 5 or 10 years ago we introduced an optional, intensive short course in R for students interested in working in areas where R is the package of choice, such as genetic research.
This year we have made another major change: Stata and R will be taught as the main software for the masters course and SAS will become the option. Time will tell, but I have a feeling that, left to choose between Stata and R, most students are going to choose R. In many ways I regret this, but Stata is showing signs of being rather slow to adapt. Perhaps the strong control coming from StataCorp is a weakness as well as a strength.
In my opinion, Stata continues to win over R in terms of ease of use and in the way that it presents results, but RStudio has helped R to catch up and R has some important advantages of its own.
Obviously, R is free and this will always be a big plus, especially for students. However, Stata is reasonably priced and I do not see this as the main reason why R will win out. Just as important as the fact that R is free is the way that R works with the family of other free software available on the web, such as LaTeX and knitr, and the way that so much free support can be accessed via the internet.
Like everything associated with computing, statistics has changed enormously in recent years and I am sure that the pace of change will only increase. Here are a few obvious trends:
- much larger datasets
- the growth in the use of Bayesian statistics
- the increased use of other computer-intensive methods of analysis
- data mining
- dynamic computer graphics
- automatic report writing
- websites for accessing databases and websites for presenting the results of statistical analyses
In all of these areas, R is ahead of Stata.
Take the simple issue of the size of the dataset. What student is going to spend hours becoming an expert in using Stata if, when they come to do their project towards the end of the course, they cannot read the data because of an arbitrary limit on the number of Stata variables? R, on the other hand, is limited only by the size of your computer.
What student is going to spend hours becoming an expert in using Stata if, when they want to perform the latest type of analysis, they have to switch to R because that is the program adopted by the researcher who developed the method?
A big influence on our students was a demonstration that I gave of the new version of R Markdown. No more cutting and pasting results into Word; instead, a very simple way of converting the output of an R script into HTML, Word, PDF or even a slide presentation.
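For readers who have not seen it, the whole process is driven by a single function call. Here is a minimal sketch (report.Rmd is a hypothetical file name; rmarkdown::render() is the function that does the work):

```r
# Render one R Markdown file to several different output formats
library(rmarkdown)

render("report.Rmd", output_format = "html_document")          # web page
render("report.Rmd", output_format = "word_document")          # Word
render("report.Rmd", output_format = "pdf_document")           # PDF (needs a LaTeX installation)
render("report.Rmd", output_format = "ioslides_presentation")  # HTML slides
```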
Why is R drawing further and further ahead of Stata? Perhaps it is because R is not owned by a company but it is, in a real sense, the property of its users. These users drive R forward in a way that no company can hope to compete with. It is true that Stata has its own community of users, but academic statisticians and people interested in statistical computing seem to have adopted R.
Could Stata counter the rise of R? I think so, but I do not expect that it will. Instead of Stata trying to provide everything itself, my suggestion would be for Stata to act as a control centre communicating with other software. To take the example of using R that I have been posting about recently, Stata could make it much easier for its users to send jobs to R in order to take advantage of R’s wide range of packages. Stata could provide features like the smooth transfer of data, control of R running in the background while the user continues with Stata, a generic way of setting R options in Stata, access to R help through the Stata viewer, and so on.
And what is good for R is also good for other free programs available over the internet, like MiKTeX, OpenBUGS and PLINK.
Sadly, I don’t see this happening and I fear that Stata will continue along its current path, in which case my prediction is that it will become a tool for use in introductory statistics courses and for use by people who only want to do relatively basic statistical analyses. Let’s hope that I am wrong.
PS
My practice is to write postings a couple of weeks in advance so that I am not under last-minute pressure. Since I wrote this particular posting we have had one of our staff-student meetings in which the students can make general comments on the course. One of the points raised by the students was that they need more help with basic statistical tasks in R. Stata is well designed and makes it easy to perform simple analyses, but it becomes more cumbersome when you want to program a non-standard task. R on the other hand requires a lot of basic skills before you can do even the simplest analysis but comes into its own for more complex tasks. The student feedback was very helpful and we will certainly modify the way that we introduce R next year. Time will tell whether the problems that the students have had getting started with R will put some students off using it. I have had one piece of coursework back from the students and almost every one of them used R, even though they had a free choice to use either R or Stata. Perhaps there will be a split whereby those who find computing easy will gravitate to R and those who struggle with computing will prefer Stata.
Hi John, interesting post. I am a genetic epidemiology PhD student at the University of North Carolina, and I am going through a similar transition from Stata/Mata to R, motivated by the excellent ggplot2 package, Git integration and, as you mentioned, output control.
With that being said, I am very glad I got started with Stata. Perhaps its greatest feature is the stellar quality of its help files (whoever is in charge of this at StataCorp deserves a raise, a yacht and a thank-you card). The manual provides excellent statistical overviews (the preface to their new Bayesian suite is a succinct yet effective introduction), and intuitively introduces commands so that newcomers can easily understand the tools they work with and implement them without much frustration. Further, since all main packages are regulated by StataCorp, syntax and options are highly consistent across functions, allowing new users to quickly build up intuition.
When I got started with statistical computing (from mostly C), I also loved the clarity of Stata’s interface, helpful error codes, easy logging, and default settings encouraging smart practices (such as modular programming). While Stata lacked project-management tools at the time, this has now been rectified somewhat (better than out-of-the-box R in my opinion).
I’ll end this by mentioning that SAS is a miserable piece of software maintained by what seem to be time-travelling FORTRAN enthusiasts. I rank SAS somewhere between paper-and-pencil and the google calculator.
I agree with pretty much everything that you say – perhaps it is because we share a background in genetic epidemiology. I still use Stata for some teaching and when I am collaborating with non-statistical colleagues on small projects, but for large-scale research R is just so much better.
I learned Stata on the economics course I took at the University of Essex, but I have since learned R myself, and I can’t for the love of God understand how anyone would think teaching Stata over R is a good idea. Outside academia, Stata is not widely used, at least not compared to R, in my experience. There also seems to be a trend, in the energy sector at least, to gradually use R more. Being familiar and comfortable with using R will thus be a key skill for many jobs in the very near future.
PS. Great article, I found it on Google by searching for Stata VS R as I am involved in a Facebook argument over the matter. 😀
I agree with your comment but I have to admit to having a soft spot for Stata; it has some well-designed features and I’ve seen non-statisticians produce work with Stata that would be beyond them in R. That said, for anyone who is serious about statistics, R is hugely superior and the gap between R and Stata is increasing because Stata has been slow to adapt.
I’m a SAS programmer who is trying to learn R because the healthcare agency I work for cannot afford SAS. But I’m finding R very difficult to learn from a book and thought about switching to STATA. I’m interested in something that’s easy to learn because I’ll need to teach the non-statistical people on my team to use it for basic data manipulation, freqs, etc. Would you recommend that I push forward with R and maybe it will eventually click with me? Or do you think STATA would be better since the others on my team are not well-versed with programming and statistics?
Kristin
This is such a difficult question. There is no doubt in my mind that non-specialists find Stata easier, especially at the beginning. You can think of Stata as a spreadsheet with a command language and people find that idea easy to grasp. However, if you get past those early stages then there are clear benefits in using R. I have been thinking recently about ways of teaching R and I’m experimenting with an introduction based on data frames, i.e. a spreadsheet with commands (see the little sketch at the end of this reply). I suspect that it is possible to teach R in a way that is intuitive to non-specialists, but this is rarely done because R appeals to people who think like mathematicians. I am sure that a reformed SAS programmer would have no problem with R, but I would be more concerned about your team. It is such a close decision that I’m tempted to duck the question, but that would be cowardly. On balance, in your position, I would go for R with RStudio, put a bit of work into becoming a relative expert myself, and then teach a very restricted subset of R to my team. I have a teach-yourself-R booklet that I use with my students. I’d happily send you a copy if you would be interested.
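To give a flavour of what I mean by a spreadsheet with commands, here is a tiny sketch (the blood-pressure figures are made up for illustration):

```r
# A data frame behaves like a small spreadsheet that you drive with commands
bp <- data.frame(
  patient = 1:6,
  group   = c("drug", "drug", "drug", "placebo", "placebo", "placebo"),
  sbp     = c(128, 135, 121, 142, 150, 138)
)

head(bp)                         # look at the 'spreadsheet'
summary(bp$sbp)                  # summarise one column
tapply(bp$sbp, bp$group, mean)   # mean systolic BP in each group
```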
I’m also slowly moving towards R and found datacamp.com extremely useful. They teach you online using RStudio, which is helpful. I still use Stata from time to time as I find the user interface a lot nicer and more intuitive. But I think, as mentioned above, the gap between R and Stata is increasing day by day and opting for R may be the better option for the future.
I also came to this article through a ‘Stata vs. R’ web search, and it’s very enlightening, so thank you!
I am currently trialling both of these (along with a few also-rans that have now fallen out of the race) for my, admittedly limited, post-hoc data analysis needs.
The thing that I like about Stata is that, as a non-statistician, it is relatively easy to get results for simple analyses (once you get around the overcrowded menu system). But it is very much lacking in some of the post-hoc tests that I need.
With R, on the other hand, I can run these tests and do so relatively confidently, with the hand-holding of RStudio, but do you think I can get my head around the complexity of something as simple as quantiles over several groups? It seems that R makes some of the most straightforward procedures almost willfully obtuse. And there are several ways of achieving what appears to be the same goal, each slightly different. In some respects I suspect that this might be a case of too many cooks and a lack of a true systematic approach to development.
I just wish that Stata would support those couple of missing tests and then I wouldn’t even have to think about R.
You make a very good point when you say ‘R makes some of the most straightforward procedures almost willfully obtuse’ and it is also true that the root cause is the ‘lack of a true systematic approach to development’, but that is R’s strength as well as its weakness. R is good because it has a wonderfully logical and flexible structure and a large group of users who have developed an amazing range of packages, but as there is no central monitoring you get duplication and inconsistent, sometimes poor, design. On the other hand, Stata is consistent and well designed, which makes it easy to learn, but it lacks flexibility and is slow to adapt, so there are important things that Stata cannot do. Stata is easier for non-specialists, but part of that is due to the way that R is taught. Teachers of R seem to want to start from the position that the student needs to understand the full scope of the language, when for many people a limited subset of R that imitates Stata would be much more appropriate. If teachers of Stata started with macro substitution and Mata programming, Stata would seem even more obtuse than R.
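Your quantile example actually illustrates the duplication rather nicely; here is a rough sketch with made-up data showing two of the several possible routes, one in base R and one using the dplyr package:

```r
# Made-up data: a numeric response measured in three groups
set.seed(1)
df <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  value = rnorm(60)
)

# Base R: quantiles within each group using tapply()
tapply(df$value, df$group, quantile, probs = c(0.25, 0.5, 0.75))

# The same summary with dplyr, one quantile per column
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(q25 = quantile(value, 0.25),
            q50 = quantile(value, 0.50),
            q75 = quantile(value, 0.75))
```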
Thank you for the thoughtful post. You have captured the essence of the R dilemma for those approaching R for the first time:
“R on the other hand requires a lot of basic skills before you can do even the simplest analysis but comes into its own for more complex tasks.”
Therein lies much of the barrier to learning R. Those who are not committed to getting to the later stages of using it for complex tasks tend to go away in frustration when they cannot figure out how to do the simple things. Thus, many people will fall back to using more limited menu-driven programs or hybrids like Stata.
It would certainly be in the best interests of Stata and its customers to embrace R rather than continuing in its efforts to compete against it. I see a similar thing happening in the physical science world, where there is competition between Matlab and R and/or Matlab and Python. Whereas some scientific graphics and data visualization programs like Origin Pro have provided an interface to R, Matlab continues to assert its superiority — such hubris is surely self-destructive and ultimately not a good way to further the aims of the company or its clients.
A good point. I did experiment with using Stata to run R in the background so that Stata could take advantage of the many R facilities that Stata lacks. In the end though it just seemed simpler to use R. The recent developments with RStudio and the tidyverse have gone a long way to address the difficulties in getting started with R and probably the right approach is to ask how best to teach R to non-statisticians rather than directing them to Stata.
Very interesting article (and ditto comments). I’m a student finishing off my degree in statistics and economics. I started with STATA when I was still doing my economics bachelor’s, but now in my statistics classes it’s all about R.
Another reason why I think that R will eventually “take over”, so to speak, is that not only different fields of study but also academia and business in general are getting more and more intertwined. Some industries/fields are switching quickly to R, where new methods, techniques and packages are created. Once some of these methods trickle down to other fields, they will want to use R as well. R can be used for basically anything and will therefore have a much higher adoption and development rate.
A final reason is the fact that the world is globalizing very fast and income differences are large. For us the Stata license fees are bothersome, but nothing to worry about. Someone doing research or business in a poor country, however, will not want to pay a year’s worth of food for some software.
Once R is basically a staple in academia as well as in business, who will want to learn Stata?
I have a sentimental attachment to Stata but I fear that you are right – the future belongs to R.
As a teacher of an upper-level undergrad/masters econometrics class, I totally agree with the strengths/weaknesses comparison outlined here. I especially love the R Markdown-knitr-LaTeX-PDF capability. Historically I have been using stata (for teaching only) and actively use Python and Matlab (and R to a lesser degree) in my own research.
I seriously contemplated converting the class to R and began porting stata code over to R. I found that the things economists love about stata (easy clustering of standard errors, robust standard errors, marginal effects and elasticities) can be done in R, but only after lots of code and jumping through hoops. Of course, I found code snippets on the web for doing this, but it just works in stata.
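For anyone curious, a rough sketch of the sort of workaround I mean, using the sandwich and lmtest packages (the data frame dat and the variables y, x and firm are purely illustrative):

```r
# Illustrative only: dat, y, x and firm are hypothetical names
library(sandwich)
library(lmtest)

fit <- lm(y ~ x, data = dat)

# Heteroskedasticity-robust standard errors (HC1 is similar in spirit
# to stata's robust option)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))

# Cluster-robust standard errors, clustering on firm
coeftest(fit, vcov = vcovCL(fit, cluster = ~ firm))
```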
While Mata is probably a lot less computationally efficient than R, I find it a more intuitive environment for linear algebra programming (which I require the students to do) than R.
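For comparison, the equivalent kind of exercise in base R might look something like this (simulated data, purely for illustration):

```r
# OLS by hand: simulate data and estimate beta = (X'X)^(-1) X'y
set.seed(42)
n <- 100
X <- cbind(1, rnorm(n))            # design matrix with an intercept
y <- X %*% c(2, 0.5) + rnorm(n)    # outcome generated from known coefficients

beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat
```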
I am not trying to start a flame war, because I really wanted to do this class in R, but I thought it more important for students to focus on the conceptual material rather than peel time away for coding. I have no doubt that a student who knows the conceptual material and masters R will have better job-market prospects than the same student who knows stata.
You offer an interesting perspective on a problem that many of us are struggling with at the moment. I teach on an MSc in Biostatistics that used to teach Stata and SAS, then converted to Stata and R, and is now discussing whether there is any point in continuing with Stata. My feeling is that the advent of Data Science and Big Data has had a knock-on effect on statistics and that Stata simply cannot cope with the current demands for data analysis. Stata is a good teaching tool, but we are asking ourselves whether it is helpful to the students to learn a program that they will eventually need to drop. Perhaps extra time on R would be more useful to them. Certainly R is more attractive to employers, although there are still employers using Stata.
I was a molecular biologist in cancer genetics and used SPSS or GraphPad Prism for my statistical analysis in the past. Recently, I have moved into a career in medical education research and SPSS/GraphPad Prism is not readily available in my organization. Therefore, I have jump-started the use of RStudio but have been progressing very slowly. While considering whether to stick with RStudio or switch to STATA to learn and advance my statistical analysis, I came to this website through a search for ‘STATA vs R’. My struggle with R was that it took me a lot of time to find information on getting one single command to either tidy up my data or run one analysis. Currently, I am interested in learning CFA/PCA/SEM analysis to advance my current work in education research; would you recommend that I stick with R or make a switch to STATA?
This is such a difficult question. Stata is much easier to learn but ultimately it is more limited than R. Of course, you may not need to go so far that Stata’s limitations ever become an issue. One of the problems with R is that it is usually taught as a programming language rather than as a data analysis tool, so people learn to do things that they will never need. If you decide to stick with R then try learning the tidyverse: a collection of packages that cover all of the basic steps in a data analysis. The packages are designed to work together and might be all that you really need. The author of these packages is Hadley Wickham and he has written an e-book about this approach that you can find at http://r4ds.had.co.nz/
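To give a flavour of the style, here is a minimal sketch of a tidyverse workflow (the file survey.csv and its columns group and score are invented for illustration):

```r
# Invented example: survey.csv is assumed to have columns 'group' and 'score'
library(tidyverse)

survey <- read_csv("survey.csv")

survey %>%
  filter(!is.na(score)) %>%          # drop missing scores
  group_by(group) %>%                # one row of output per group
  summarise(n          = n(),
            mean_score = mean(score),
            sd_score   = sd(score))
```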
Actually, let’s hope that you are right and STATA disappears so I don’t have to learn another statistics package when I have already learned R.
I have had some good results from using Stata with non-specialists and I wrote this posting with a sense of regret that Stata is not adapting to the changing needs of the whole statistics community. R is not without its faults, but at least it is addressing them.
Nice discussion above.
I’ve been looking around for quantitative studies comparing different statistical software and haven’t found any. The best such study would specify some approximation of a population of tasks or uses of statistical software. Then it would randomly sample from that “population.” Then it would assess how many lines of code are necessary to perform the tasks (or menu selections, etc.) or other metrics of how long it takes to complete them. Best of all would be to produce some type of prediction model that would identify the likelihood that a given task would be more efficiently done with one software package than another. Or perhaps different areas of science have subtle differences (one has more interaction variables, one has more dummy variables, one has more time series, etc.; some of these differences are widely known) which would imply that one software package is more efficient than another.
Perhaps stata is better for data management and R is better for statistical analysis. Perhaps different tasks are done better by one or the other.
Even if there are a mere 100,000 people using statistical software for 100 hours a year, that’s 10,000,000 hours of work. If a rigorous, quantitatively based study could improve the efficiency with which these people work by just 1%, that’s a gain of 100,000 hours, more than enough to justify the resources that would go into such a series of studies. The fact that there are no such studies is a sign of a collective action problem / a system of incentives for individual researchers that is not conducive to maximizing the good.
Instead of a few different statistical environments (R, stata, etc.), perhaps there should be a large number of radically different environments, of which one individual would know 10 or so, because the different approaches are so well suited to different tasks. Again, perhaps it is a sign of a collective action problem that numerous such tools aren’t being developed, but then again, a lot of people are creating all kinds of packages in R.
As far as how students react goes, you have to live in the real world and make things work in the real world. But individuals’ subjective reaction to a piece of software is not necessarily a good indication of whether the cost of learning it (perhaps over “the medium term / five years”) is outweighed by its increased efficacy over the lengthy amount of time the veteran statistician uses it. The impact of just a 1% rise in efficiency over a 30-year career is enormous, and may justify a “steep learning curve.”
Thanks
I am a statistician to my very core, so the idea of a quantitative comparison is very attractive to me, but even I hesitate in this instance. There are definitely some things that we can quantify, such as the maximum number of variables that the program can handle or the time taken to fit a particular complex model. The problem is that neither of these metrics is important to you if you only ever fit simple models to small datasets. Unfortunately many of the critical metrics are much harder (but not impossible) to quantify. How do you measure the clarity of the presentation of the results or the ease with which you can find an error in a piece of code? I believe that the best software is developed when one starts with a clear plan so that one maintains a consistent philosophy. In my opinion, it is the absence of such an idea that is the limitation of programs such as SAS, which started in an era of batch processing and has bolted on adaptations as each new development has come along. The original idea was not flexible enough to cope with modern analysis and hence the product is cumbersome to use. Stata suffers in much the same way; for instance, the idea that we can represent our data in a single rectangular array might have been acceptable 20 years ago but is now simply indefensible. Stata has been very stubborn in refusing to adapt from its original concept. You might argue that it is maintaining the purity of the original idea and avoiding the mistakes of SAS, but as a consequence Stata is becoming a good solution to a shrinking problem. Judged in this way you can see why R is so successful: it started with a clear, flexible philosophy and has an open approach to development, which makes it quick to adapt. How do you quantify these benefits? Perhaps the ultimate quantification comes from numbers such as how many people use each program, how many job adverts require each program, or how many articles are published that use each program.
I find this article so helpful, thank you Professor. I’m learning R programming and I had doubts about whether Stata is better for learning, but now I’ll stay with R; when I have time, I’ll take a look at Stata, it won’t be a waste of time of course ^^.
I am pleased that you found it helpful. Recently a lot of effort has gone into making R more standardised, and so easier to learn and use; search for ‘tidyverse’ to find an introduction to this work. There are several good videos on YouTube; especially good are those by Hadley Wickham, who is the leader of these developments.
I’m the one that left the December 19, 2016 comment. Yes, it would be hard to quantify whether R or STATA is better. But quantification is inescapable. If I choose STATA over R, I’ve implicitly said that the utility of using STATA is greater than the utility of using R. That is a quantification. So the question is, how do we bring down measurement error? Casual assessments of utility are going to have more measurement error than a rigorous attempt at measuring utility. There should at least be recommendations for people based on more rigorous evidence given a specific type of task: if you want to do X, then STATA users did the task in 4.5 hours on average, R users in 6.0 hours on average, etc.
I have some sympathy with your support for quantification provided that you can create a well-defined question. The difficulty when comparing Stata and R is that Stata is better at some things and R is better at others, what is more, Stata is better for some people and R is better for others.
Thanks John for this article. As somebody starting out in statistics and needing to choose a software package to work with this has been very useful.
Dear John,
I was delighted to stumble across your tackling of this question. I love STATA, and learned what little statistics I have managed to grasp with its help. I use it, in Leicester as it happens, mostly for fairly ‘standard’ survival analyses in patient data. But I am starting to feel major shortcomings, especially when trying to handle whole-genome gene expression data from cell culture experiments, and when trying to make beautiful graphs.
I think I will continue to use it for now for survival analyses, but I need to build up some skills in R. Where should I start? The tidyverse?
BW,
John
John, I used to advise researchers to start with Stata because it is so easy to learn but recently I have changed my approach and now I just advise everyone to use R from the beginning. The interesting debate is now about how to approach R. There is no doubt in my mind that the best solution is via RStudio, the tidyverse and the pipe; the problem is that most teaching resources are written using base R, so when you try to teach yourself you will inevitably come across lots of base R. I recently ran a course for our research students and post docs that was intended to act as a bridge. It introduced the tidyverse approach to R for people who had a very basic experience of base R. I will be repeating this as a one-day course at Imperial in November. It is quite possible that I’ll run the course again in Leicester at some point, in which case I’ll let you know. John