This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found on my GitHub page. If you have not already done so then you might start by reading my introduction to this blog.
In this post I discuss why I do not use tidymodels on the course.
In an earlier post I suggested that the tidyverse would have looked very different had it been designed by a group of statisticians rather than data scientists. Recently, RStudio has started a new project called tidymodels that provides tools for modelling based on tidyverse principles. Initially, I thought that tidymodels would help fill the statistical gap in the tidyverse, but sadly it has not done so. In fact, I think that tidymodels is a disappointment even when modelling is viewed from a data science perspective. In this post, I set out my objections to tidymodels and explain why I do not use it in my teaching.
This post is considerably longer than my target length of 2000 words, so I thought that an executive summary might be helpful.
When tidymodels was started there were two existing packages that could have been used as a template for the project. One was a tidyverse package written by Hadley Wickham called modelr and the other was a popular machine learning package written by Max Kuhn called caret. Faced with a 50-50 decision, I believe that the tidymodels team made the wrong choice by going with the caret approach. This mistake is not a trivial one. I’ll argue that teaching in the caret style encourages bad practice and in the long term is harmful for students. Development of modelr would have produced a much better framework for modelling, one that would have fitted better with the rest of the tidyverse.
The Argument against tidymodels
You will have gathered already that I do not like the design of tidymodels, so this post is going to be a bit of a rant. Of course, my dislike of tidymodels is just opinion; I am sure that there are many people who think that tidymodels is great. In recognition of this, I ought to start each sentence with “In my opinion …” or “It seems to me that …”, but this would make for a tedious read. Keep this in mind if you read on.
The need for a tidymodels project is clear. Whether you come from a statistical or a data science background, modelling is a key component of any data analysis. Despite this, modelling was by far the weakest element of the tidyverse, which is much stronger on data entry, data handling, visualisation and report production.
At present, tidymodels is made up of about thirty packages, some of which existed before the start of the project, but which have since been adopted under the tidymodels umbrella. The packages provide facilities such as a standard interface to modelling functions, standard presentation of the results of model fitting, resampling, inference and various ways to look at model performance.
Viewed from the outside, it seems to me that tidymodels started with an idea rather than a design. The idea was to develop packages for modelling in the style of caret, an earlier package for machine learning that was written by Max Kuhn, the same person who is now driving the development of tidymodels. The adoption of the caret approach prejudged many of the key questions that would have arisen had the team sat down with a blank sheet of paper and asked, how should we extend the tidyverse to incorporate modelling?
Perhaps it would have been better had tidymodels not used the word “tidy” in the name of the project and had it not been so closely linked to RStudio and the tidyverse. I would still not like the approach, but it would have left the door open for others to develop packages for modelling that are more in keeping with the general philosophy of the tidyverse.
To say that I do not like the tidymodels approach does not imply that I dislike all of its packages, so let’s start by looking at what is probably the best package in tidymodels; in that way we will be able to see how the others fall short. The package that I have in mind is called broom and, like several of the tidymodels packages, it pre-dates the project. broom is excellent: I use it regularly in my own work and I teach its use to all of my students. It does a very specific job that would be tedious if we had to write our own R code.
As you probably know, broom provides three functions tidy(), glance() and augment() that take the results returned by a model fitting function and extract respectively, the table of model coefficients, measures of goodness of fit and a table of fitted values and residuals. In each case the extracted information is returned as a tibble; so the functions work perfectly with the pipe.
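As a small illustration of the three functions just described, here is a sketch that applies them to an ordinary lm() fit. The choice of the built-in mtcars data and of the model formula is mine and is only for illustration.

```r
library(broom)

# An arbitrary linear model on a built-in dataset (illustrative choice)
fit <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- tidy(fit)     # one row per model coefficient
stats <- glance(fit)   # a single row of goodness-of-fit measures
obs   <- augment(fit)  # one row per observation, with .fitted and .resid

coefs
```

Because each result is a tibble, it can be piped straight into dplyr or ggplot2 without any further conversion.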
R users have contributed a wonderfully varied set of modelling functions, but the downside of relying on users to write those functions is that there is a lot of duplication and a lack of standardisation. R’s many model fitting functions use different terminology, require different forms of model specification and they return their results in a wide range of formats. broom addresses the last of these problems by converting the results of the model fits into a standard form.
When it started, broom provided the code needed to extract the results from a small selection of the most commonly used model fitting functions. Then, as broom became more popular, code was added to handle more and more models. This organic growth is typical of the way we all write code; we start with a simple idea, it proves useful and over time the code grows. In an ideal world, we would not write code in this way; instead, we would sit down and specify a plan for the project and code to that plan. In the real world, we do not know at the outset that a particular function will prove that useful and anyway, many of the best innovations only occur to us once we start to use our functions.
Had the author known at the outset that broom would take off in the way that it has, then I am sure that he would have started with a plan that acknowledged that it is impractical to provide code for handling every model fitting function in R; there are simply too many and every month more are released. Instead, what is needed is a framework that enables the authors of model fitting functions to contribute their own versions of tidy(), glance() and augment().
Creating a framework that accepts code contributed by other users is not a trivial exercise. The contributed functions need to be coded to a reasonable standard, they need consistent naming of returned objects and they need consistent documentation. Together this implies the need for software that checks the contributions. What about the issue of governance? If I use lm(), I obtain the model coefficients with broom::tidy.lm(), but if I fit a model with the function unusualModel() provided as part of the unusual package, will I use unusual::tidy.unusualModel() or broom::tidy.unusualModel()? Who has final responsibility, broom or unusual? How are name clashes to be avoided? What happens when unusualModel() is updated or enhanced or has bugs fixed? In many ways it is more difficult to provide a framework for others than it is to write the code yourself.
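To make the unusualModel() example concrete, here is a minimal sketch of how the author of a modelling function might supply their own tidy() method. Everything in it is hypothetical: unusualModel() is a toy fitting function invented for this illustration (a least-squares slope through the origin), and the returned columns are simply chosen to resemble those that broom uses.

```r
library(broom)   # supplies the tidy() generic
library(tibble)

# Hypothetical fitting function: least-squares slope through the origin
unusualModel <- function(x, y) {
  slope <- sum(x * y) / sum(x^2)
  structure(list(slope = slope, n = length(x)), class = "unusualModel")
}

# The method that the package author would contribute (an S3 method for tidy)
tidy.unusualModel <- function(x, ...) {
  tibble(term = "slope", estimate = x$slope)
}

fit <- unusualModel(x = c(1, 2, 3), y = c(2, 4, 6))
tidy(fit)
```

In a script the method is found by ordinary S3 dispatch. Inside a package it would also need to be registered in the NAMESPACE, which is exactly the sort of detail that a contribution framework has to document.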
None of this is news to the developers of broom who are currently addressing all of these issues by creating just such a framework. You can read about their approach here. In brief, broom is an excellent package that is developing exactly as it should.
The tidymodels project’s history is different from that of broom. tidymodels carries the ‘tidy’ brand name and the developers knew from the outset that it would be widely used. There is no excuse for starting without an overall design that ensures integration, extensibility, consistency and efficiency. Let’s list some principles that might have guided the development of tidymodels.
- Each package within tidymodels should address a single specific issue in modelling.
- The tidymodels project will use user-written functions for the actual model fitting, so it must create a framework that enables the authors of new functions to contribute to tidymodels.
- tidymodels should integrate with the tidyverse and be suitable for programming with the pipe.
- The tidymodels functions should look to the user as if they were a part of the tidyverse, with similar user interfaces.
- tidymodels should not duplicate what is already available in the tidyverse.
- tidymodels should be consistent with the philosophy of the R language and the tidyverse. In particular it should encourage the use of functions.
- tidymodels should provide functions that implement its design. Users will then be able to use those tools to develop new packages that enhance tidymodels.
- tidymodels should encourage good practice in data analysis.
This list seems very non-controversial to me, but you have to be very careful when you accept a set of principles because they often imply things that you had not anticipated.
Now let’s see how tidymodels matches up to these principles. The idea that the tidymodels packages should address specific issues, suggests packages for preparing the data prior to model fitting, packages to control the specification and fitting of models, packages for extracting information on the model fit, packages for model comparison, packages for assessing model fit, packages for measuring or summarising model performance and so on. In other words there needs to be an overall design.
Data manipulation is handled in the tidyverse by dplyr and tidyr, so it might well be that preparing the data prior to model fitting is best addressed by extensions to dplyr rather than by a completely new approach. We should be careful not to duplicate. Similarly, modelling makes a lot of use of visualisations whether it is for data exploration or for model checking, so there will be a need for packages of helper functions that work with ggplot2, along the lines of ggdendro, ggridges and any number of others.
Once the overall design has been formulated, most of what is needed will be in the form of either, frameworks to enable users to contribute, or tidyverse style functions that users can incorporate into their own code. It is not the job of tidymodels to do everything for us.
As an example, an important aspect of preparing the data for modelling is the replacement of missing data by imputation. So, tidymodels should have a design that helps us to make use of the many imputation packages already available in R. The problems of tidymodels imputation would exactly mirror those of broom: lots of external functions offering different types of imputation, with the potential for more functions to be added in the future. So tidymodels needs to provide a framework for coding imputation.
Single imputations would be relatively straightforward; data frame in and imputed data frame out, but what about multiple imputation? Do we have a nested tibble with a row for each of the multiple imputations and a list column in which each entry is itself an imputed data set, or do we unnest the data and stack the imputed data sets under one another in the style used by Stata? What do we do if the dataset is very large? Ten multiple imputations might not fit in memory. Perhaps there is an efficient way to store the results that only saves the imputations and which re-constructs the complete tibble from the actual and imputed data when it is needed; this is the approach taken by the mice package.
It is not my job to write the design for tidymodels imputation; that is the job of the tidymodels project. Once the design is formulated then they should think about the functions that implement it. The basic functions should be provided by the tidymodels project itself, but they should just be tools from which users can create their own packages. It is not in the spirit of R to have a centralised team of coders who prepare and oversee every function. Avoiding that mistake is what has made R stronger than other statistical packages such as Stata, SPSS and SAS.
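The two storage layouts mentioned above, nested and stacked, can be sketched as follows. This is only an illustration of the data structures, not a proposal for the design: impute_once() is a hypothetical helper that fills gaps by sampling from the observed values, the toy data and the column name .imp are my own choices, and a real analysis would call a proper imputation package.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical helper: replace each NA with a randomly sampled observed value
impute_once <- function(df) {
  mutate(df, across(everything(), function(v) {
    miss <- is.na(v)
    v[miss] <- sample(v[!miss], sum(miss), replace = TRUE)
    v
  }))
}

set.seed(1)
raw <- tibble(id = 1:6, height = c(150, NA, 165, 170, NA, 180))

# Nested layout: one row per imputation, each entry a complete data set
nested <- tibble(.imp = 1:3) %>%
  mutate(data = map(.imp, ~ impute_once(raw)))

# Stacked layout: the imputed data sets placed under one another
stacked <- unnest(nested, cols = data)
```

The nested form keeps each imputed data set intact for use with map(), while the stacked form is convenient for group_by(.imp); both hold every copy in memory, which is the problem raised above for large data.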
Now let’s look at imputation within the current version of tidymodels. The package that I’d want to consider is called recipes and it describes itself as a tool for creating design matrices prior to model fitting. Unfortunately, mission creep set in and recipes also addresses issues of pre-processing including data transformation and imputation. As a result, there is duplication of things that can already be done with dplyr, like variable standardisation, and there is even a function in recipes for log transformation. Sadly, the imputation facilities are limited and don’t seem to be designed in a way that will extend easily. What’s more, the facilities are deep within recipes, so they are not suitable for users to adapt. The end result is that recipes does too much, not very well.
My criticisms of the recipes package go even deeper than this. The way that the package works is that it sets up a recipe, which is a description of the steps required to prepare a data frame for fitting a model. Typical steps will involve variable selection, transformation and specification of the model formula. These steps are saved without being executed. Then, at some later point, the recipe is “baked”, in other words, the steps in the recipe are executed on a particular set of data. This is a slight simplification, but it captures the essentials.
Why do I object to this way of working? Well, delayed execution is just the same as placing everything in a function: the steps that you choose are like the arguments of the function and baking is just a word for executing the function. In other words, the recipes package duplicates the fundamental R process of writing functions and it does so by creating a rather clumsy alternative. It is artificial, unnecessary, less literate and it is not in the spirit of R. If you are trying to lead your students towards functional programming, then the recipes package is a giant step in the wrong direction.
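The function-based alternative that I have in mind can be sketched with nothing more than dplyr. The function prepare() and its argument names are hypothetical; the arguments play the role of the recipe steps and calling the function plays the role of baking. The sketch makes no attempt to reproduce everything that recipes does.

```r
library(dplyr)
library(tibble)

# Hypothetical pre-processing function: the arguments are the "steps"
prepare <- function(data, log_vars, scale_vars) {
  data %>%
    mutate(across(all_of(log_vars), log)) %>%
    mutate(across(all_of(scale_vars), ~ (.x - mean(.x)) / sd(.x)))
}

train <- tibble(x = c(1, 10, 100), y = c(2, 4, 6))

# "Baking" is just calling the function on a particular data set
prepared <- prepare(train, log_vars = "x", scale_vars = "x")
prepared
```

Because each step is an ordinary mutate(), a student can run the pipe one line at a time and inspect the intermediate result.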
I am sure that a large part of the design of recipes was motivated by the desire to aid R users who have not yet reached the stage of writing their own functions, but it is completely misguided. I would go as far as to say that getting students to work in this way will, in the long run, do more harm than good.
Let’s consider a common problem in model fitting and see how tidymodels might help. The problem I have in mind is that of specifying different linear predictors for model comparison or variable selection. R uses formulae for specifying the additive structure of the predictors. This works well when there are only a handful of predictors but it is extremely cumbersome when you have hundreds or thousands of potential predictors. Particular problems arise when you want to fit a range of models.
Suppose that you have 100 predictors ranked in order of importance by some feature selection algorithm and that you decide to fit a series of models taking just the first predictor, then taking the first and second, then the first, second and third until eventually you have a model that uses all 100. To program this, you could either create multiple datasets each with the required predictors, or you could create multiple formulae. Obviously, you could create the formulae using glue() or paste() but it would be tedious and a convenience function that facilitated this in a simple way would be very helpful. Perhaps the set of formulae could be saved as a column in a tibble, after which the fits could be added as an additional list column using functional programming and the map() functions. This use of list columns is the basis of the approach taken by the modelr package.
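Here is a minimal sketch of that idea, scaled down to four predictors. The ranking of the mtcars variables is simply assumed for the purposes of illustration, and base R’s reformulate() stands in for the convenience function that I would like to see.

```r
library(tibble)
library(dplyr)
library(purrr)

# Assumed ranking of predictors (illustrative, not from a real algorithm)
ranked <- c("wt", "hp", "disp", "qsec")

models <- tibble(n_pred = seq_along(ranked)) %>%
  mutate(
    # formula using the first n_pred predictors
    formula   = map(n_pred, ~ reformulate(ranked[seq_len(.x)], response = "mpg")),
    # fitted models stored in a list column
    fit       = map(formula, ~ lm(.x, data = mtcars)),
    r_squared = map_dbl(fit, ~ summary(.x)$r.squared)
  )

models
```

Each row of the tibble holds a formula, the corresponding fit and a summary of it, so model comparison becomes an ordinary dplyr operation.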
The problem of specifying model formulae is very similar to the problem of imputation; we need a grammar of formulae and a grammar of imputation in much the same way as ggplot2 implements the grammar of graphics. Of course, someone needs to design the grammar of formulae before they start programming and that someone ought to be the tidymodels team. After that, they should turn their attention to coding tools that others can use. Design first, program second.
Interestingly there are strong parallels between recipes and another tidymodels package called infer. infer calculates p-values by different methods for a limited but important set of hypothesis tests. The package was, I believe, intended as a teaching aid and it too re-invents a basic form of R coding with the aim of making it simpler to use.
Here is a typical piece of code using infer. It is adapted from one of the infer vignettes that can be found here. This code calculates a bootstrap p-value from a t-statistic for testing whether the variable, x, in the tibble, df, has a mean of 8.
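    library(infer)
    # observed t-statistic for the null hypothesis that the mean of x is 8
    t_bar <- df %>%
      specify(response = x) %>%
      hypothesize(null = "point", mu = 8) %>%
      calculate(stat = "t")
    # bootstrap null distribution of t and the two-sided p-value
    df %>%
      specify(response = x) %>%
      hypothesize(null = "point", mu = 8) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "t") %>%
      get_p_value(obs_stat = t_bar, direction = "two_sided")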
What do the functions, specify(), hypothesize(), generate() do? They do nothing, other than specify different components of the test specification. These components are eventually accessed by calculate(), which simulates the sampling distribution for the test. In R, a calculation is controlled by the arguments to a function, so in effect these functions just define arguments. It would have been possible to design infer so that the second of the pipes was coded (this would not run)
    # hypothetical interface: infer does not accept these arguments
    df %>%
      calculate(response = x,
                null = "point",
                mu = 8,
                reps = 1000,
                stat = "t") %>%
      get_p_value(obs_stat = t_bar, direction = "two_sided")
As an aside, the choice of the function name calculate() is awful; it gives no clue as to what the function does.
The main point is that infer, as it actually works, mis-uses the pipe to enable arguments to be built up in stages. I am sure that this design was well-meant; it was probably considered helpful in encouraging students to break down the testing process into a series of decisions. caret and recipes adopt the same logic. The problem is, it is not R. I spend a lot of time getting students to understand the structure of the R language and this style of coding undermines that effort.
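For comparison, here is a sketch of the same test written as an ordinary R function whose arguments carry the whole specification. The function boot_t_pvalue() is hypothetical and written in base R; it shifts the data so that the null hypothesis holds and then resamples, which is one common way of bootstrapping a t-statistic. It is not a copy of infer’s internal algorithm, so the p-values will not agree exactly.

```r
# Hypothetical function: bootstrap p-value for H0: mean(x) == mu
boot_t_pvalue <- function(x, mu, reps = 1000) {
  n <- length(x)
  t_stat <- function(v) (mean(v) - mu) / (sd(v) / sqrt(n))

  t_obs  <- t_stat(x)              # observed statistic
  x_null <- x - mean(x) + mu       # shift the data so that H0 is true
  t_null <- replicate(reps, t_stat(sample(x_null, n, replace = TRUE)))

  mean(abs(t_null) >= abs(t_obs))  # two-sided p-value
}

set.seed(42)
x <- rnorm(30, mean = 10, sd = 1)  # simulated data for illustration
p <- boot_t_pvalue(x, mu = 8, reps = 1000)
p
```

Every choice that the infer pipeline spreads across specify(), hypothesize() and generate() appears here as an argument, and a student can read the body of the function to see exactly what is calculated.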
The designs of recipes and infer raise the tricky issue of whether tidymodels should have the aim of making statistical analysis more accessible. It may be harsh to say it, but it is as if someone thought, data scientists are not up to doing proper modelling or proper R coding, so tidymodels will need to simplify the process for them.
There is a place in R for teaching packages, but they need to be designed as front-ends to professional software. We need a proper set of modelling tools that can be used by people who know what they are doing, before we develop the simplified interfaces.
One of the pieces of advice that I give to my students is, if you don’t understand the statistical method then don’t use it; otherwise the potential for disaster is too great. Much better to do something simple, like calculate a few means. Nobody listens to me; deep down everyone carries the belief that complex methods must be better than simple ones.
Programs, such as SPSS, are a pet hate of mine, not because they do anything wrong, but because they make it easy to do things that you do not understand. Unfortunately, tidymodels is taking us down the SPSS route.
The potential for tidymodels to encourage people to fit an inappropriate model is well-illustrated by some of the analyses linked to the TidyTuesday project. Recently, I have become a big fan of TidyTuesday and in particular the excellent videos produced by David Robinson and Julia Silge. Each week, the TidyTuesday project releases a new, moderately sized data set and invites everyone to explore and analyse it and to share their results.
David Robinson has very bravely taken up the TidyTuesday challenge by producing live videos in which he analyses that week’s data from scratch without any preparation. The videos are available on YouTube and they are an amazing teaching tool. The videos do not try to explain the code that is used, but it is rare that I have watched a video without picking up some good coding tip. They also provide wonderful illustrations of the art of problem solving in coding, a topic that is really hard to teach.
David’s videos concentrate on data exploration, usually without model fitting, while Julia’s are more pre-planned and are designed to demonstrate particular aspects of tidymodels. The videos are a great resource and I strongly recommend them. Be warned though, if you watch David’s videos you will come away with typing envy. He types code faster than I think.
David Robinson is not an expert in tidymodels, but few users will be able to match his expertise in R and the tidyverse. With this in mind, watch the video on Palmer penguins. That week’s data contained measurements on the sizes of three species of penguins and in the video David started with a few visualisations and then moved on to create models to predict species from the size measurements using tidymodels. Of course, he gets there in the end, but the process he goes through is tortuous and illustrates many of the shortcomings of the tidymodels project that I have already discussed. For instance, at one stage he manages to use tidymodels to fit a binary response model to data with three response categories.
I’ve spotted a number of modelling errors related to the use of tidymodels in TidyTuesday videos and I suspect that a large part of the problem is that tidymodels sits between you and the model, so you lose the feel for what you are doing. What is more, by specifying a list of processing steps that are subsequently executed together, the user is discouraged from checking each intermediate stage to make sure that it has executed correctly. The less that you check your analysis, the more likely it is to contain errors.
When you write code that bundles together many processing steps, there is a trap that it is very easy to fall into and that is, over-complicating everything. You want to cover every possibility and so the software just grows and grows. A better approach to coding is to provide simple, well-designed tools that users can put together in different ways to suit their own needs. The complexity should come from the application and not from the tools themselves.
Despite all that I have said, don’t be put off watching the TidyTuesday videos; they are great, but we are all human and we all make mistakes. I cringe inside when I think back to some of the analyses that I have published over my career. The point is that David and Julia are near the top of the tree and if they get things wrong when using tidymodels, what hope is there for the rest of us? You could argue that they might have made the same mistakes had they not been using tidymodels and of course that is true, but my feeling is that packages that are overly complex and which keep you at arm’s length from the analysis will encourage more errors.
Think of it like this, if you give someone a hammer and tell them to spend the day hammering nails into a block of wood, then eventually they will hit their own finger and it will be painful. However, if you give them a badly designed hammer with a heavy head and a wibbly-wobbly rubber handle then you will get a lot more accidents. Too much of tidymodels is like a hammer with a rubber handle.
I have never been lucky enough to attend one of the RStudio conferences, so I try to compensate by watching the YouTube videos of the talks. This week, I watched a talk by Claus Wilke on a package called ggtext. The package helps you to format the axis titles in ggplot2. In itself, the package is not that special, but about six minutes into the video, Claus presents the first illustration of his solution and I immediately thought to myself, “that’s neat” and at the same moment, the audience gave a spontaneous round of applause. The solution was simple, it did the job required and you could instantly see how you might use and adapt it. tidymodels is the exact opposite of ggtext, so it is not the least bit surprising that using tidymodels can lead to modelling errors.
Take, as another example, the parsnip package, which creates a common interface for model fitting. Amongst other things, it allows you to specify a linear regression model and subsequently to decide whether to fit it with lm or stan or whatever. The models fitted by lm and stan are obviously closely related but they are not identical and yet parsnip allows you to switch easily between them. Alter another argument and you can add a penalty and fit the model with glmnet; fine if you understand the form of the penalty, but penalties come in many forms. parsnip does more than create a common interface; it simplifies and it provides built-in defaults, which add assumptions that are hidden from the user.
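To show what this looks like in practice, here is a short sketch using the lm engine only, so that it runs without Stan or glmnet installed. The model formula and the mtcars data are my own illustrative choices, and the commented line merely indicates the kind of one-line change that swaps the fitting method.

```r
library(parsnip)

# Specify the model first, choose the engine second
spec <- linear_reg() %>%
  set_engine("lm")

# Swapping the engine is a one-line change (not run here), for example:
# spec <- linear_reg(penalty = 0.1) %>% set_engine("glmnet")

pfit <- fit(spec, mpg ~ wt + hp, data = mtcars)

# The underlying lm object is stored inside the parsnip fit
coef(pfit$fit)

# The same model fitted directly
coef(lm(mpg ~ wt + hp, data = mtcars))
```

With lm the two sets of coefficients agree, so nothing is lost, but nothing is gained either; with a penalised or Bayesian engine the same specification would quietly rely on whatever tuning values and priors are left at their defaults.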
From a teaching point of view, I am particularly concerned that the functions that I use should encourage good practice in data analysis. Almost all of my students start with two assumptions that need challenging;
- if it is an R package, it must be a good/well-established method
- a function’s defaults were chosen by the experts, so they must be sensible
Packages such as recipes and parsnip may be intended to take away the burden of controlling every aspect of the analysis, but in doing so they reinforce these false assumptions and consequently they encourage bad practice.
It has been a long argument, so perhaps a summary of the main points will help convince you that tidymodels should go back to the drawing board.
1. tidymodels appears to be based on the idea that modelling should be made easier, while a better approach would have been to base it on a comprehensive design that captures how modelling ought to be performed within the tidyverse.
2. tidymodels duplicates some features already available in the tidyverse.
3. tidymodels uses separate functions to define the stages in the analysis and only performs the calculations once the full method is specified. This has two consequences,
- it discourages checking of intermediate steps and so makes errors more likely.
- it strays away from the basic structure of R in which a function is used to define an action that is to be performed at some later stage and the arguments to that function define the specific choices.
4. tidymodels makes it easier to specify the model that you want to use by hiding the details and terminology of the package that will perform the calculations. In doing so,
- tidymodels makes decisions for the user in the form of defaults, so that the user will find it harder to keep track of the details of their analysis.
- the user is encouraged to use methods that they do not completely understand.
5. tidymodels offers a complete one-stop solution to modelling. Consequently,
- it is difficult for the user to adapt the tidymodels functions for use outside of tidymodels.
- it is difficult for users to add to tidymodels.
- the design is centrally controlled, which will limit the speed of its development and will not make full use of the skills of the R Community.
I believe that if the tidymodels team were to start afresh, they would come up with a design that would be very similar to that used in the modelr package. tidymodels needs to provide a strong design and a few basic tools. If the design and tools are any good, the community of R users will adopt them and develop new packages for data modelling.