This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found on my GitHub page. If you have not already done so, you might start by reading my introduction to this blog.
In this post I discuss the ways in which reproducibility and coding style are introduced on the course.
In the introduction to my short course, I highlight three ideas that are basic to my approach to data analysis, namely workflow, reproducibility and coding style. I discussed workflow in my last post and now I want to comment on the other two topics.
Let me emphasise that I will be discussing reproducibility in the context of a short course on data analysis for early stage scientific researchers. My objective is to make the students aware of the need for reproducibility and to give them a few simple tools that will start them working in the right way. An introductory course is not the place to try to create the perfect framework for reproducible data analysis.
Most of my students are junior researchers in the life sciences and it is not uncommon for their supervisors to start them in a new area of research by giving them a published article and asking them to reproduce the experiment described in the paper. The students are immediately faced with the issue of published scientific findings that nobody else can replicate. The journal Nature covered this topic in a special collection of articles under the title “Challenges in irreproducible research”. The collection included the report of a survey that found that 70% of researchers had tried and failed to reproduce another scientist’s experiment.
Having students who are alive to the problem of scientific reproducibility would appear to be a good starting point for discussing reproducibility in data analysis, but unfortunately researchers often conflate two very different problems. One is the ability to reproduce the experimental method and the other is the ability to reproduce the experimental results. In the physical sciences these two things may be very close to one another, but in the biological sciences they most certainly are not. No matter how closely you reproduce the method, the variability inherent in living organisms means that you will not get exactly the same results. The problem is magnified if you are foolish enough to judge reproducibility in terms of statistical significance.
To avoid confusion, I talk of reproducing the method and replicating the results, but not everyone makes the distinction and when introducing the topic it is important to be aware that many of the students will blur the two ideas. So, instead of linking reproducibility in data science to reproducibility in scientific research, I opt for a different approach.
All scientists have a secret fear that one day they will publish their research in a high profile journal, only for someone to point out an error in the work. The thought of the public humiliation that would result is one of the major obstacles to openness in science. I play on this universal fear by inviting the students to consider the situation in which they perform an elaborate analysis of their data and publish a finding, say p=0.034. Then, when the analysis is questioned, they return to their data but, try as they might, they cannot reproduce the same p-value. The problem might lie in the data, or in the way that it was processed, or it could just have been a transcription error; there is no way to be sure. I sum this up by asking, “if you cannot reproduce your own result, how do you know if it is correct?”
The argument is very powerful and I have never had it questioned. The example is deliberately worded to emphasise personal reproduction of an analysis rather than reproduction of your analysis by another researcher. Reproducibility in the broader sense raises more issues than one can cover in an introductory course and personal reproduction of an analysis is more relevant to my students.
So the students are open to the importance of being able to reproduce their analyses, but the consequences are far reaching and come as a surprise to most of them. Most shocking is that I discourage use of their favourite app, Excel. Reproducibility requires that you keep an exact record of everything that was done. This is very difficult to do if you work interactively. So one should not interactively edit the data, interactively re-organise the data, or interactively analyse the data. Hence, no Excel.
I must admit that sometimes I do exploratory data analysis in the console in order to familiarise myself with the data before I start the formal analysis. However, there is no real reason why I should not do the same thing in an rmarkdown notebook. Unrecorded interactive data exploration is an old habit that I am trying to break. It would be sad to plot the data and find an interesting pattern that you want to develop in the main analysis, only to discover that you cannot quite remember how you produced the original visualisation. I have, on occasions, admitted this weakness to the students and explained that I am trying to change. It is good that the students appreciate that other people also do things that they should not and that it is OK to be open about it. I do not make much reference to openness in this introductory course, but it is an under-explored issue in statistics that is worth wider consideration.
The consequence of not working interactively is that all stages of the analysis must be performed with scripts and not in the console; this includes everything from changes to the data, such as correcting wrongly entered values, through to the preparation of the final report or presentation. Consequently, I introduce R through a script and at no stage of the introductory course do I run R in the console. In my workflow, the original data are stored and never edited; a script is used to read the raw data and a clean copy is saved as one or more rds files. That way, if there is ever a problem you can always return to the beginning and re-run your scripts.
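A minimal import script following this workflow might look like the sketch below. The file names, folder layout and variables are invented for illustration (and tempdir() stands in for a project's data folders so that the sketch is self-contained); the point is that the raw file is read but never changed, and the clean copy can always be recreated.

```r
# readData.R -- read the raw data, tidy it, save a clean copy
# (file names and variables are illustrative; tempdir() stands in
#  for the project's rawData and rData folders)
rawFile   <- file.path(tempdir(), "trial.csv")
cleanFile <- file.path(tempdir(), "trial.rds")

# stand-in for the untouched raw data file supplied to the project
writeLines(c("id,dob", "1,03/05/1990", "2,17/11/1985"), rawFile)

rawDF <- read.csv(rawFile)                             # raw data: read, never edited

cleanDF <- rawDF
cleanDF$dateBirth <- as.Date(cleanDF$dob, "%d/%m/%Y")  # store dates with a proper class
cleanDF$dob <- NULL                                    # drop the cryptic raw column

saveRDS(cleanDF, cleanFile)                            # clean copy for the analysis script
```

If anything goes wrong downstream, deleting the rds file and re-running this script recreates the clean data exactly, which is precisely the safety net described above.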
I make a particular point of mentioning cut and paste because that is the bread and butter of the way that my students work. They cut and paste data from one spreadsheet to another and they cut and paste plots and other results from the screen to their final reports. You can convey the dangers very easily by telling stories. Someone cuts and pastes the data to a new spreadsheet in preparation for some analysis, but inadvertently they miss the last line. The analysis will be wrong, it will not be reproducible and the error will be difficult to locate. Then, consider the case of someone who cuts and pastes a graph into a report. Later they decide to change the font size of the labels, but when they plot the new version, they find that the points do not match the original. How could they possibly tell what went wrong? Did they just copy and paste the wrong graph, or was it the right graph drawn with the wrong data?
Once I’ve got the students to abandon interactive working, my next step for improving reproducibility is better record keeping. Scientists all keep a lab-book in which they note information about their experiments. Each day they record details of their work in the lab, they might note the temperature, the time when they started etc. That way, if they find an inconsistency in the measurements that they made, they can look back at the lab-book to see if there was anything unusual about the work that they did that day. My message is that what is good for the lab is also good for data analysis.
Every time I start a new data analysis project I create a text file that I call diary.txt. I open the file in the RStudio editor and leave it there, so that it opens automatically whenever I work on that project. In this text file, I enter dated information about:
- the background to the data, what the variables mean, etc.
- who supplied the data, their contact details, etc.
- the analyses that I run,
- my ideas for future analyses,
- queries about the data,
- sources of any R code that I download and adapt,
- … and so on.
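A couple of entries from such a diary might look like this (the content is invented purely for illustration):

```text
2018-10-04
  Received the trial data as an Excel file. Column "dob" is date of
  birth, entered as dd/mm/yyyy. Two patients have missing weights --
  emailed the data supplier to query them.

2018-10-09
  Ran the import script to make the clean rds copy.
  Idea: plot weight against age before fitting the regression.
```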
At first, filling in the diary is a bit tedious but I cannot tell you how useful it has proved in my own work, especially when I leave a project for a few weeks and then try to return to it. Unfortunately, in a short course the students do not really see the benefit of the diary and so they have little incentive to use it. It would certainly be more appreciated in a course that ran over a whole semester. Despite this, I think that a diary is so useful that I still recommend it. Maybe it will survive as a seed in the backs of their minds.
An important aspect of reproducibility is the ability to return to the state of the analysis as it was on a particular day. The dated entries in the diary help with this, but it is also important to use archiving and version control. One has to do quite a lot of data analysis before the need for version control becomes apparent and so I do not use it on an introductory course. However, my students are all working towards presentations and publications, so they can readily appreciate the need to archive frozen copies of their data and scripts at important milestones in their work. I mention this, but it does not really affect the way that they work on the course exercises.
There are two other aspects of reproducibility that are important, but which I decided not to include. The first is the importance of specifying and recording the seed in any analysis that involves randomness, sampling or simulation. I guess that a data science course would, from the very beginning, split the data between a training set and a testing set and so such a course might need to emphasise the importance of the seed of the random number generator. On a statistics course, you are unlikely to need a random number generator until you meet methods such as the bootstrap, so recording the seed is not relevant on an introductory course.
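Should a bootstrap or simulation come up later, the fix is a one-liner. A minimal sketch (the seed value is arbitrary):

```r
# Recording the seed makes any analysis involving randomness repeatable
set.seed(6721)                     # the chosen seed would be noted in the diary
firstRun <- sample(1:100, 5)       # e.g. a bootstrap resample or a random split

set.seed(6721)                     # resetting the same seed ...
secondRun <- sample(1:100, 5)

identical(firstRun, secondRun)     # ... reproduces the draw exactly (TRUE)
```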
The other aspect of reproducibility that I do not mention at all is the order in which scripts are run. In an introductory course, it is a big step forward to get students to use two scripts, one for importing and tidying the data and a second for the analysis. It is way beyond their experience that a project might require so many scripts that the order in which they are run becomes an issue and so I choose not to mention it.
When planning an R course, you have to ask yourself whether you will insist on a specific style of coding, or whether you intend to explain the principles of good coding and allow the students to develop their own style. For me as a statistician, coding is just a tool, so perhaps it is not surprising that I do not feel strongly about the style that the students use.
Error avoidance motivates my approach to coding style; you are more likely to spot errors in a script that is coded in a literate style. I stress to the students the benefits of consistency, but I am not prescriptive about the style that they use. In practice, of course, most students start by copying my approach, but they do so knowing that they are free to modify it.
I give the students links to different R style guides, including the Google R style guide and Hadley Wickham’s style guide, and I invite them to read them and to take away any ideas that seem helpful. Recommended styles vary and it is difficult to argue that any one is better than the others.
My own coding style has changed over time and has elements that do not appear in any of the style guides. Since I do not recommend a specific style, there is no need for me to adapt my own style when teaching, so the code that I use in the demonstrations looks very much like the code that I use in my everyday work, oddities and all.
An important aspect of coding style is the naming of objects. My own preference is for lowerCaseCamel for the names of data objects and snake_text for the names of files and functions. I encourage the students to use long descriptive names and poke fun at people who call everything x or y; then I laugh at myself when I do it. I have one naming convention that I use in my own work that is very non-standard. In order to distinguish between a data frame and the columns contained within the data frame, I end all data frame names with the capital letters DF, as in patientDF$dateBirth. It might look odd, but I find it helpful.
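To illustrate these conventions with a made-up example (the patient data and the function are invented):

```r
# lowerCaseCamel for data objects; the DF suffix marks a data frame
patientDF <- data.frame(
  idPatient = c(101, 102),
  dateBirth = as.Date(c("1990-05-03", "1985-11-17"))
)

# the suffix makes it obvious which name is the data frame
# and which is a column within it
patientDF$dateBirth

# snake_text for the names of functions (and files)
count_patients <- function(df) nrow(df)
count_patients(patientDF)
```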
Another personal preference is for assignment with the right pointing arrow. It seems odd to me that people code with the pipe so that it can be read, left to right, as a series of ordered operations and then at the end of the pipe the reader is meant to jump back to line one to see where the end result is assigned and to read it right to left. It is illogical, so I make my assignments at the end of a pipe using a right pointing arrow. However, I have been using R since it was first released, so I started with base R and left pointing arrows. Out of habit, I still use left pointing arrows for assignment in lines that do not involve the pipe. This might seem inconsistent but it is my style, so the students just have to live with it.
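In code, that mixed style looks something like the sketch below (the data are invented; R's native |> pipe is used so the snippet is self-contained, but the same layout applies with magrittr's %>%):

```r
# A plain line gets the usual left-pointing arrow
heights <- c(1.62, 1.75, 1.68, 1.81)

# A pipe reads top to bottom as a series of operations,
# so the assignment goes at the end with a right-pointing arrow
heights |>
  sort() |>
  head(2) -> shortestTwo

shortestTwo
```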
Adding comments to scripts is a pain because there is no immediate return. You only see the benefit when you try working on an uncommented script that you started six months earlier. I suppose that I could present students with a piece of comment-free code and ask them to work out what it does, but in practice, I rely on teaching by example. I make sure that all of the scripts that I use in the demonstrations are well-commented. I include a standard header at the start of every script and I break the code into sections with brief descriptive section headers. I do not place comments at the ends of individual lines of code, since I’ve never found them to be very helpful and they look ugly.
For many years I have used three line section headers such as
```r
# ---------------------------------------------------
# Section title: what it does
#
```
However, RStudio’s Editor pane has a “show document outline” button in the top right and a “jump to” button below, both of which enable you to jump to a particular section. RStudio recognises the line of dashes as a section marker, but because that line contains no text, RStudio just calls it an “untitled” section, which is of no use if you want to move around your code. Consequently, I am in the process of changing my style so that section headers take the form
```r
# ---Section title----------------------------------
# what it does
#
```
With the new style, the title appears in RStudio’s outline and facilitates navigation. The benefits are so great that I am willing to live with the slightly less clear layout.
If you want to see examples of my style, my R scripts are included in the notes for the individual sessions that are available on my GitHub page.