This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found on my GitHub page at https://github.com/thompson575/RCourse.
This is a new blog in which I will discuss my experiences of teaching R. I anticipate that there will be about twenty posts after which the blog will close.
Over the years I have tried to teach R in many ways, but I was inspired to start this blog by a tidyverse-based short course that I first ran in the autumn of 2018. As you are reading a blog on teaching R, you will know that there are two popular ways to introduce R. The traditional approach is to treat R as a computer language and to start by discussing data objects and the many base R functions that form the building blocks from which you eventually create a data analysis. The more fashionable alternative is to start with the tidyverse packages of functions; these enable the students to complete useful data analysis tasks from the very first session. In the past, I have tried different combinations of the two approaches, but in the course that motivated this blog, I switched entirely to the tidyverse. There is something satisfying about being up to date, but it is not always true that popular is best.
The students who enrolled on my course did so because they wanted to use R as a tool, and for them the ability to do something useful from the very beginning was definitely motivating. The danger with task-orientated teaching, however, is that the students learn to do things without really understanding why they work. When this happens, the students become very good at repeating what they have already been shown, but they have difficulty when they need to extend the ideas to related problems with slightly different features. The students lack a mental picture of what R is doing and so they have little feel for how they should modify their R scripts to tackle the new problem. The challenge in any task-orientated teaching is to ensure that the students develop an understanding of the foundations of the subject.
In planning this tidyverse-based course, I was very conscious of the need to build up the students’ basic understanding of R, but this is not easy to do without discussing the structure of the language as one would when teaching base R and consequently there were times when I got it wrong. In fact, I have run the course a couple of times and made a number of changes, which on the whole have helped. In this blog, I’ll point out the things that did not work first time and describe the changes that I made.
My introductory short course consisted of four half-day sessions and was advertised as being for researchers in the life sciences who had no prior knowledge of R. Of course, most, but not all, of the people who signed up for the course had used R before; some had even tried to teach themselves or been on other R courses without making much progress. This created two problems that will be familiar to anyone who teaches courses to non-specialists: the students start the course with widely varying levels of knowledge and, even more critically, they start with very different expectations. Perhaps one of the most challenging groups consists of the students whose previous experiences have convinced them that the topic is difficult and that they will never understand it. As the tutor, you have to build the confidence of those who find the subject demanding, while not making the more knowledgeable students feel that they are wasting their time. It can be quite a tough balance to achieve.
A second aspect of this short course that distinguished it from any of my previous R courses was that I decided to make it entirely problem-based. Each of the four half-day sessions centred on the analysis of a single real study taken from a published research paper and, within each session, I made a point of covering all aspects of the analysis from importing the data through to preparing a report on the findings. To enable me to do this, I divided each session into three parts. In the first part, I gave a short lecture covering the tidyverse skills needed for that particular data analysis. In the second part, I presented my analysis of the data in the form of a demonstration based on a set of prepared R scripts, then in the final part I set an exercise that invited the students to create a slightly different analysis of the same dataset.
Obviously, if you want to cover a wide range of skills in one session then you cannot go into much detail about any one of them. This needs careful judgement. The ordering of material and the gauging of the right level of detail are critical to the success of a problem-based course.
I found the problem-based style of teaching to be much better for motivation than my usual approach of using small datasets chosen to illustrate the use of specific R functions, but it too has its dangers. Problem-based teaching requires you to find real data that the students can understand and which they can analyse using the limited skill set that they currently possess. There are obvious dangers if you choose a study that can only be understood by people with specialist scientific knowledge, or if you choose a study with hidden complexities that require you either to introduce advanced R skills before the students are ready for them or to omit obviously important aspects of the analysis. Once again, I found that I made some good choices and some poor choices; this will provide yet more material for the blog.
It is interesting to speculate on whether using the tidyverse to create a task-orientated approach to teaching R was more or less important to the success of the course than the problem-based approach of basing each session on real data from a single study. My feeling is that choosing to make the course problem-based was the key, because that dictated the structure of the sessions and the way that R was used. However, it is true that the problem-based approach would have been very difficult, if not impossible, without the tidyverse, because I would have needed to teach so much base R before I could undertake even the simplest stage of the analysis. Perhaps we should stop trying to persuade teachers of R to “start with the tidyverse” and instead ask them to “start with the data”; that way they would soon find that it is impractical to limit oneself to base R.
There is one further aspect of my teaching that raises interesting issues for discussion. I am a biostatistician and not a data scientist. My short course was aimed at scientists who were more interested in the t-test than in neural networks. Now, I have become a strong advocate of the idea that statisticians should learn from data scientists and this means much more than just being aware of the latest machine learning algorithms. Most statistical analyses would be greatly improved if they incorporated ideas such as workflow, reproducibility and literate programming that are mainstream in data science.
Whilst statisticians have lessons to learn from data science, there are perhaps even more lessons that data scientists could learn from statisticians. I know that it is a little harsh to say it, but when I read articles and blogs on data analysis in R that are written by data scientists, they frequently give me the impression that the author’s understanding of statistics is both rudimentary and out of date. This is unfortunate for those data scientists, but it has broader implications because data scientists are currently driving the development of R. In particular, the tidyverse has been created by people, most of whom call themselves data scientists. There is no doubt in my mind that the tidyverse would have looked quite different had it been designed by a group of statisticians.
In recent years, data scientists have rather taken over the R community and consequently the needs of traditional statistical analysis have been pushed into the background. There is a strange contradiction in this development because R was originally designed for statistical analysis and it is not ideally suited to the needs of data science. There is a good argument that if you want to do data science, then you should use Python or some similar language. Yet we are where we are: R is currently dominated by data scientists and the needs of statistical analysis are being neglected. It is time for statisticians to step forward and redress the balance.
In the last few years the tidymodels project has sought to extend the tidyverse into the area of modelling and as such tidymodels ought to have addressed many of the problems associated with using the tidyverse for statistical analysis. In my opinion, tidymodels has taken a giant step in the wrong direction and as a consequence I do not use it in my teaching. I’ll explain why I don’t use tidymodels and I’ll outline what I think are better ways of handling statistical models within the framework of the tidyverse.
In summary then, I’ve identified three themes relating to teaching R that I want to explore in this blog:
- The pros and cons of teaching the tidyverse without first teaching base R
- The pros and cons of problem-based teaching
- How to teach the tidyverse to people who are more interested in biostatistics than in data science
I will discuss these issues and at the same time make available my course material and comment on what worked well and what did not. The blog is primarily aimed at teachers of R, so I will not be explaining how particular functions work; I’ll assume that either you know already, or you know where to look it up.