Starting with the tidyverse

 

 

This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found at https://github.com/thompson575/RCourse. You can find the introduction to this blog at https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/.

 

In this post I’ll discuss teaching the tidyverse without first introducing base R.

 

 

There is an excellent blog by David Robinson called Variance Explained that includes a post on the advantages of teaching the tidyverse to beginners (http://varianceexplained.org/r/teach-tidyverse/). David has also given talks on starting R with the tidyverse that you can watch on YouTube (https://www.youtube.com/watch?v=dT5A0sAWc2I). In discussing the merits and problems associated with introducing R via the tidyverse, I’ll try not to duplicate what has already been said, but a degree of overlap is inevitable, not least because those sources played a role in persuading me to adopt this style of teaching.

 

The tidyverse, as I am sure that you know, is a collection of R packages inspired by the work of Hadley Wickham; indeed in the early days the packages were often referred to as the Hadleyverse. Hadley Wickham was motivated by a desire to create a set of high-level R functions that together cover all of the stages of a data analysis, so there are packages of functions for importing data, functions for manipulating data, functions for graphics and so on. Of course, these functions do nothing that you could not already do with base R, but they are well-written and they are in a fairly consistent style, which makes them efficient and relatively easy to use.

 

When designing a course, it is not enough to say that you are “starting with the tidyverse”; after all, you could teach the tidyverse by beginning with a thorough description of the readr package before moving on to a comprehensive look at dplyr and then tidyr. Instead, when I talk of starting with the tidyverse, I mean introducing a few carefully chosen tidyverse functions that work together to enable you to do something practical, and in addition, I mean presenting those functions without any prior mention of base R.

 

At least as important as the range of tasks carried out by the tidyverse is the fact that its functions are designed to be combined using the pipe. Personally, I doubt whether the tidyverse would have had nearly as much impact were it not for the pipe. The pipe enables a style of coding in which functions are linked by the %>% symbol, so that the output from one function becomes the first argument of the next function. This allows you to chain together a series of functions in the order in which they are to be executed, and it creates very neat and easy-to-read R code. The pipe is described in the magrittr vignette (https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html).
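
As a minimal sketch of the difference that the pipe makes, the code below writes the same calculation twice, once as a nested call and once as a pipe. The scores data frame and its column names are invented for illustration, and the example assumes that the dplyr package, which re-exports %>%, is installed.

```r
library(dplyr)  # provides filter(), summarise() and re-exports %>%

# Hypothetical marks for four students (made-up data)
scores <- tibble(
  student = c("A", "B", "C", "D"),
  mark    = c(52, 68, 75, 40)
)

# Nested calls have to be read from the inside out
nested <- summarise(filter(scores, mark >= 50), mean_mark = mean(mark))

# The pipe passes each result on as the first argument of the next
# function, so the code reads in the order in which it is executed
piped <- scores %>%
  filter(mark >= 50) %>%
  summarise(mean_mark = mean(mark))

# Check that the two versions give the same result
identical(nested, piped)
```

With only two functions the nested form is still readable, but each extra step adds another layer of brackets, while the piped form simply gains another line.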

 

Any type of data object could be passed along a pipe, but for beginners I find that it helps to start with the idea of “data frame in, data frame out”, that is, to envisage the functions in the pipe as completing a series of operations on a single data frame. This simplification helps beginners to build up a mental picture of what their code does and it fits neatly with their experience of working with data in a spreadsheet.

 

For better or for worse, most of my students come to R having already used Excel, and an early ambition of my R course is to convince them to stop using it; Excel is limited in its capability and the way that it is commonly used violates most of the principles of reproducibility. Despite this, picturing functions as operating on a rectangular data structure creates a comfortable starting point for most students.

 

When I teach base R, I introduce the idea of a list before I mention the data frame. In that way I can explain that a data frame is a special type of list made up of vectors of the same length, so operations on data frames are just special cases of operations on lists. When starting with the tidyverse, the approach has to be very different. I am forced to introduce the data frame by analogy with a rectangular spreadsheet with rows corresponding to subjects and columns corresponding to variables. This is a kind of ‘provisional truth’ and is typical of the approach that has to be adopted when you teach practical skills without the theoretical underpinning. We deliberately present a simplified version of the truth and develop the underlying concepts as we progress.
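
For readers who want to see the formal definition that sits behind the spreadsheet analogy, the following base R sketch checks that a data frame is a list of vectors of the same length. The heights data are made up for illustration.

```r
# Hypothetical data: three subjects (rows) and two variables (columns)
heights <- data.frame(
  id     = 1:3,
  height = c(1.62, 1.75, 1.80)
)

is.list(heights)          # a data frame is stored as a list ...
length(heights)           # ... with one element per column
sapply(heights, length)   # and every column is a vector of the same length

# List operations therefore also work on data frames
heights[["height"]]       # extract one column as a vector
lapply(heights, class)    # apply a function to each column in turn
```

This is the kind of detail that can be held back in a tidyverse-first course and introduced once students are comfortable with the rectangular picture.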

 

Slowly developing understanding is the way that young children learn; we do not define the system of integers before we teach children to count. However, with adults we tend to move away from this style of teaching towards a more logically ordered method. In my experience, most adults are happy to revert to the simplified form of learning, especially when their motivation is practical. However, you need to be prepared for awkward questions, especially from students who have been taught base R before, or who have read a book on R. How will you reply if during the very first session a student interrupts and says, “but I thought that a data frame was a type of list”?

 

Handling questions from students who have some knowledge beyond what you are currently trying to teach is tricky and we all develop our own way of coping. Obviously we want to encourage discussion and questioning, so we must take care not to be dismissive of a seemingly unnecessary question, as that could stop others from asking really important questions. Perhaps the first thing to do when presented with such a question is to judge whether it is being asked so that the student can show off their knowledge, or because they really are confused. In the former case you can direct your answer at the individual and keep it short, while confusion is more likely to be shared and requires a fuller answer directed at the whole class.

 

When we provide provisional truths, it is quite likely that the point raised by a student’s question will be covered in a subsequent session, so we can always point the student to the relevant part of their notes (I give students the entire set of course materials in advance of the course, rather than releasing it as we go along, because I want them to read ahead). However, “we will come to that” is slightly dismissive, so I always try to say something positive in response. Handling questions that go beyond the material that you are currently trying to teach is an issue in all forms of teaching; it is just that in this style of course the problem arises more often.

 

Over the years I’ve found that when I teach provisional truth to adults it pays to be honest, that is, to tell the students in advance that you are going to present a simplified version and to explain why you are doing it. Then, after you have covered the point, you can tell them how you plan to develop their understanding in future sessions. Generally, I find that the more context that you give, the better the students will understand. In my introduction to the course, I make a point of telling the students about the two approaches to teaching R and I explain that we will be jumping into the tidyverse without any mention of base R. Explaining your approach aids understanding and it helps you avoid those time-consuming questions from students who have previous experience of being taught base R.

 

When starting with the tidyverse and the pipe applied to a data frame, there is a natural order for the ideas that we need to get across in the first session.

  1. R is based on functions and data objects
  2. A data frame is a rectangular data object similar to a spreadsheet
  3. Some functions are provided by R itself, but most are obtained from downloadable packages
  4. The functions that we are going to use take a data frame, modify it and return the modified data frame
  5. Functions can be chained together in pipes
  6. If you want to save the object returned at the end of a pipe, you assign it a name.
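
The six ideas can be sketched in a few lines of code. This is only an illustration: the trial data frame is invented and dplyr is assumed to be installed.

```r
library(dplyr)   # idea 3: most functions come from packages

# ideas 1 and 2: a data object, here a rectangular data frame (made-up data)
trial <- tibble(
  arm      = c("drug", "placebo", "drug", "placebo"),
  response = c(12, 9, 15, 7)
)

# ideas 4 and 5: each function takes a data frame and returns a data
# frame, so the functions can be chained together in a pipe
trial %>%
  filter(arm == "drug") %>%
  select(response)

# The result above was printed but not saved, and trial is unchanged
nrow(trial)

# idea 6: assign a name to keep the object returned by the pipe
drug_response <- trial %>%
  filter(arm == "drug") %>%
  select(response)

is.data.frame(drug_response)
```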

Being explicit about this structure is very important to beginners because it gives them a mental picture of what we are trying to achieve. I try to get these ideas across in Session 1 while introducing a very limited range of functions. For my short course I chose:

  • read_csv() from the readr package
  • select(), filter(), rename(), group_by(), summarise() from dplyr
  • qplot() from ggplot2
  • rmarkdown for report writing

You can see that together these functions enable you to import some data, calculate summary statistics, plot the data and report the findings. Although these functions will not provide a very sophisticated analysis, the students will have covered the entire process of a data analysis. In subsequent sessions, I can extend their knowledge, for example by replacing qplot() with ggplot(), but when I teach an extension, the student is able to see how that topic fits into the overall structure.
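
A hedged sketch of how these functions might fit together in a first session is given below. The file contents, column names and the age cut-off are all invented for illustration; the data are written to a temporary file only so that the example is self-contained. Note also that ggplot() stands in for qplot() here, because qplot() is deprecated in recent releases of ggplot2.

```r
library(readr)
library(dplyr)
library(ggplot2)

# Hypothetical blood pressure data written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,sex,age,sbp,notes",
  "1,F,34,118,ok",
  "2,M,51,135,ok",
  "3,F,47,128,ok",
  "4,M,62,142,ok",
  "5,F,29,110,ok",
  "6,M,45,131,ok"
), csv_path)

# Import, tidy and summarise in a single pipe
bp_summary <- read_csv(csv_path) %>%
  select(id, sex, age, sbp) %>%       # keep the columns of interest
  filter(age >= 30) %>%               # keep subjects aged 30 or over
  rename(systolic = sbp) %>%          # give a column a clearer name
  group_by(sex) %>%                   # summarise separately by sex
  summarise(n = n(), mean_systolic = mean(systolic))

bp_summary

# Plot the summary (ggplot() used in place of the deprecated qplot())
bp_plot <- ggplot(bp_summary, aes(x = sex, y = mean_systolic)) +
  geom_col()

bp_plot
```

The same code could be placed in the chunks of an R Markdown document, so that the import, summary, plot and commentary are all produced as a single report.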

 

Starting with a small set of unrelated tidyverse functions could appear rather scatter-gun to the student: two minutes on read_csv(), three minutes on filter() and so on. Students need structure if they are to learn and since we are not providing the structure that comes from the logic of the language, it is vital that we replace it with some other structure. In my course, I hope this comes from the workflow that underpins a good data analysis and an early appreciation of the way in which R works with functions that operate on data objects. I will return to these issues in future posts.

 

I suggested in my introduction to this blog that “starting with the data” is more fundamental than “starting with the tidyverse” (https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/). In my next post I’ll develop this idea and discuss the implications of letting the data drive the syllabus.

John

About John

Professor of Genetic Epidemiology
