This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found at https://github.com/thompson575/RCourse. If you have not already done so then you might start by reading my introduction to the blog at https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/.
In this post I’ll discuss the merits of problem-based teaching in which each session is based around the analysis of a single published study.
Everyone uses data for illustration when they teach students to use R, but what I want to suggest in this post goes beyond that. I want to advocate the problem-based teaching style in which every session is based around the analysis of a single study. Having identified the study that we will use, we cover those aspects of R and statistics that are needed for that particular analysis and nothing else. So the data drive the syllabus and the ordering of the material.
You will not be surprised to hear that I am convinced that the problem-based approach using real data is perfect for teaching both statistical analysis and R. With the possible exception of theoretical courses for computer scientists, R should not be taught in isolation from its applications, and the task-orientated design of the tidyverse makes problem-based teaching much more straightforward.
There are several websites that provide teaching datasets usually taken from text books. I know that there are useful examples to be found on these sites, but I want to suggest that you should not use this type of resource. A sure sign that a website is to be avoided is when you are able to search by statistical technique, for instance to find data suitable for teaching linear regression. These datasets are usually sanitised and rarely have sufficient backstory to enable a proper analysis.
The short course that motivated this blog was aimed at research students and post docs, so using data from published studies appealed to them because they could identify with the problems of the authors. What is more, students like finding out about the background to a study and they can immediately see that the R skills that they are learning have practical use. The formulation of the research questions, the selection of appropriate statistical methods and the interpretation of the results are all much easier and more meaningful when done in the context of a real study.
If, as I do, you see R as a tool for producing better data analyses, then you should not isolate the teaching of, say, ggplot2 from the choice of model for the analysis; the two things inter-relate. High-quality visualizations play an important role in model selection, so you cannot perform model selection without skills in ggplot, and you cannot create a good plot in R unless you understand how it will be used.
Real data enable you to discuss important issues that do not arise when artificial or sanitised data are used, such as what to do about missing values. It is all very well to discuss the theoretical merits of complete-case analysis or a particular form of imputation, but in practice the choice of approach will be context specific. Handled properly, this lets you get students to debate the options and to appreciate that the choice is subjective and may involve a compromise between what is desirable and what is practical; where practical might mean, available in an R package. Hopefully, this helps the students to appreciate that statistical analysis is not a matter of right and wrong and that some decisions about the analysis are critical, while many others do not materially affect the result.
For many years, I have used real data to illustrate my teaching. Sometimes I have used data from studies that I’ve worked on, but more often I’ve taken data from published papers that the authors have made available via a journal or an internet data repository. In the case of the R course that inspired this blog, I decided to use published data because the scope of my own work is not broad enough to cover the varied interests of my students, who come from the full range of the life sciences.
It could be argued that to teach the t-test, it helps to have data that are clearly independent and normally distributed; that way, one is free to concentrate on the statistical method. When I was a student on a mathematics course, we derived the formula for the sampling distribution of the t-statistic and then used a few hypothetical numbers to illustrate the calculation. That lecture concentrated on the topic at hand but made no attempt to be practical and it omitted all of the things that you actually need to know, such as when a t-test is appropriate, what question it addresses, how robust the result is, what the result means in the context of a real study, etc.
Even if your aim is to teach the use of R’s t.test() function to students who already know something about the t-test, it is wrong to do so in isolation from the way that the student will use that function when they come to analyse their own data. It is fine to tell them that there is a var.equal argument that can be set to TRUE or FALSE, but that information means very little if the student does not know how to use R to calculate the variances in the two groups, or cannot use R to plot histograms of the two groups and display them on the same scale. When the data drive the course, you teach these diverse skills as a set in the same session, so you have the chance to talk about how equal the variances need to be, or whether it is a better policy to always assume unequal variances, and at the same time, you can introduce the use of facets in ggplot.
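As a sketch of the sort of session I have in mind — the data below are simulated, and the variable names group and weight are invented for illustration:

```r
library(ggplot2)

# Simulated two-group data standing in for a real study;
# group B is given a larger spread so the variance check matters
set.seed(375)
df <- data.frame(
  group  = rep(c("A", "B"), each = 30),
  weight = c(rnorm(30, mean = 20, sd = 2),
             rnorm(30, mean = 22, sd = 4))
)

# Calculate the variance within each group before choosing var.equal
tapply(df$weight, df$group, var)

# Histograms of the two groups displayed on the same scale using facets
ggplot(df, aes(x = weight)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ group, ncol = 1)

# Welch's test (var.equal = FALSE) is arguably the safer default
t.test(weight ~ group, data = df, var.equal = FALSE)
```

The point is that the variance check, the faceted plot and the test itself all belong in the same session, because that is how they are used in practice.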
A particular issue with problem-based teaching is that it is harder to stick to a pre-defined syllabus and one can be led into discussing topics in what might seem to be a very unnatural order. A session intended to introduce the chisq.test() function with real data might lead students to suggest unforeseen research questions that are best answered using a sparse contingency table. As a result you could end up with an unplanned discussion of simulated p-values. Not only is it more difficult to stick to a lesson plan when using a problem-based approach but nobody remembers all of the tidyverse functions, so there is a good chance that you will be led to a point where you do not know what function to use. This is just the situation that the students are going to find themselves in when they analyse their own data, so it is helpful for them to see that their tutor can get by without knowing everything.
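To illustrate the kind of detour I mean, here is a minimal sketch with a sparse contingency table; the counts are hypothetical:

```r
# A hypothetical sparse 2x3 contingency table; the counts are invented
tab <- matrix(c(12, 3, 1,
                 9, 1, 0),
              nrow = 2, byrow = TRUE)

# The asymptotic test warns that the chi-squared approximation
# may be inaccurate when the expected counts are this small
chisq.test(tab)

# A Monte Carlo p-value sidesteps the asymptotic approximation
chisq.test(tab, simulate.p.value = TRUE, B = 10000)
```

The warning from the first call is exactly what prompts the unplanned discussion of simulated p-values.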
If you do decide to use published data then you will need to hunt for an appropriate study in a data repository. There are two types of data repository available via the internet. One type is for work in progress, and these repositories make it easy for team members to share data during an analysis. The other type is for archiving data once a project has been completed and a report or paper has been published. Naturally enough, some repositories combine both functions and allow sharing within the team during the analysis, after which some or all of the data can be made public. Repositories that provide public access to data may be specialised, in the sense of being for particular types of study, or they may be general.
The total number of data repositories is quite large and the best list of science repositories that I know of is kept by Nature at https://www.nature.com/sdata/policies/repositories. Increasingly, scientific journals ask their authors to deposit the raw data on which the article is based in one of these data repositories. It is all part of the drive for reproducible and open research. So it is now much easier to find data from published articles around which we can base particular lessons, lectures or practicals. My personal favourite repository is Dryad, which can be found at https://datadryad.org/; it is large, varied, yet very easy to search.
The journal PLoS One has a data sharing policy that explicitly mentions Dryad so it is not surprising that many authors submitting to the journal choose to use that repository. The journal’s website at https://journals.plos.org/plosone/s/data-availability says
“Authors must share the “minimal data set” for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods.”
and later it says,
“PLOS partners with repositories to support data sharing and compliance with the PLOS data policy. Our submission system is integrated with partner repositories to ensure that the article and its underlying data are paired, published together and linked. Current partners include Dryad and FlowRepository.”
I have spent many a happy hour searching Dryad for suitable studies based on titles that suggest a topic that is not too specialised and study sizes that are neither too large nor too small. The articles themselves provide the necessary background on study design and data collection and, at least in principle, the repository provides the data that you need.
It is important to appreciate that my aim is not to reproduce the analysis in the published paper. Often I use the data to answer a very specific question that might have been given very little weight in the article, or perhaps not have been mentioned at all.
I have had lots of good experiences using Dryad but difficulties do occur. The most common problem arises when authors choose to withhold part of the data. Quite often they withhold information in such a way as to make it impossible to recreate exactly the analysis published in the article. Presumably this is deliberate, since it stops anyone from challenging the article, but it defeats the main object of open research. Much less common are instances where I have started to prepare an analysis for an upcoming course only to find an error in the data. Often one cannot be 100% sure that the data are wrong, but I still feel uncomfortable about using questionable data for teaching. I guess that I ought to report these findings to the authors, but I must admit that I never do. It’s easier just to look for a new example.
More common than errors in the data are papers that use statistical methods that I would not have selected myself. It is rare to be able to say that a statistical analysis is wrong, but researchers do sometimes use methods that make assumptions that seem hard to justify given their own study design. Once again, I choose not to get into a debate over these issues; they usually come down to a matter of opinion and, anyway, as I have already said, my objective is not to reproduce the results reported in the paper.
My impression is that the majority of the moderately sized data sets in Dryad are provided in Excel files, but you will also see csv files, Word documents and all manner of formats linked to statistical programs. The Excel files in particular can be very challenging. Researchers will often mix their data with text, graphs and summary statistics, all on the same spreadsheet, and they pay little respect to the ideas of tidy data (https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf). Reading such data and getting them into a form suitable for analysis can be quite challenging, and it is sometimes tempting to re-format the data oneself before providing them to the students. This is a temptation that I have always resisted. Handling messy data is a vital skill for any data analyst, and it is difficult for students to appreciate how they should format their own data if they never see how difficult it is to handle other people’s poorly-formatted spreadsheets.
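The kind of tidying involved can be sketched with tidyr; the spreadsheet-style data below are invented for illustration:

```r
library(tidyr)

# Invented data laid out spreadsheet-style: one row per subject,
# one column per visit, as is common in Excel files on Dryad
messy <- data.frame(
  subject = c("s1", "s2"),
  visit1  = c(5.1, 4.8),
  visit2  = c(5.6, 5.0),
  visit3  = c(6.2, 5.3)
)

# Reshape into tidy form: one row per subject-visit measurement
tidy <- pivot_longer(messy,
                     cols      = starts_with("visit"),
                     names_to  = "visit",
                     values_to = "value")
```

In a real session the reading step itself (for instance, readxl::read_excel() with its skip and range arguments to step around text and graphs on the sheet) is usually the harder part, and well worth showing.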
In a more specialised R course it might be better to use one of the repositories that is dedicated to a particular type of study. In my own field I have found the GEO repository (https://www.ncbi.nlm.nih.gov/geo/) to be particularly useful. This repository specialises in gene expression data from micro-arrays or, more recently, from sequencing. I have had a class use R to view the digital images of a micro-array and then to follow the analysis through pre-processing and normalization to statistical testing. The students learn to appreciate the impact of errors in the data and to understand how changes to the normalization affect the statistical analysis. These issues would be almost impossible to get across without real data. The students also like the fact that their results can be linked to annotation databases that tell them about the genes that are differentially expressed and in the process they learn a lot of useful R skills.
Having chosen a study that is suitable for the skills and interests of my students, I next prepare my own analysis of those data. This analysis will eventually form the demonstration that lies at the heart of each session. My analysis will dictate the new R and statistical skills that I will need to introduce in the lecture, and so it defines the syllabus. Of course, this is not a one-way process: if I judge that the lecture would prove too challenging, then I modify my analysis. When I discuss the course materials you will see the type of study that I like to use; I’ll make a point of explaining why I chose each example and I’ll describe how the choice of data dictated the structure of the course.