This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found at https://github.com/thompson575/RCourse. If you have not already done so, you might start by reading my introduction to the blog at https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/.
In this post I discuss the workflow that I use in the demonstrations.
When I was taught statistics, nobody would have dared use the word “workflow” for fear of being laughed at. The organisation of one’s work was not discussed, because it was thought to be a matter of common sense and personal style. Fortunately, those days are over. Under pressure from data scientists, it is now much more acceptable for statisticians to talk about workflows. Over the years, my own work has become increasingly structured, and I start my short course in data analysis with R by presenting a simplified version of my workflow and explaining how it will be used in the demonstrations and exercises.
For the students, I motivate the importance of structure through the idea of error control. The argument is that data analysis is complex, and so is coding in R; as a result, errors are inevitable. I stress that what matters is not whether you make mistakes, but how quickly you notice them. A structured approach to data analysis helps to minimise errors and makes them easier to spot.
Although I do not say this to the students, my feeling is that structured data analysis has a secondary advantage in that it helps the students to learn R. People pick up complex ideas much more quickly when those ideas are placed within a context or a framework. Once the students have the overall picture of a data analysis, they find it easier to flip between seemingly unrelated ideas. So you can go from readr functions to rmarkdown functions in the same lecture and know that the students will see how each fits within the workflow.
When Hadley Wickham describes the logic behind the tidyverse, he often uses a diagram to explain how a data analysis can be divided into a series of tasks each with its own tidyverse packages (e.g. https://r4ds.had.co.nz/explore-intro.html). Here is a version of Hadley’s workflow.
The terminology, as so often with the tidyverse, is slightly unusual but the structure is still useful. We import the data, reorganise it into a form suitable for analysis (tidy), then we iterate around a loop in which we pre-process the data to select important variables or to create derived variables (transform), plot the data (visualise) and build models. When we are happy with the analysis, we prepare a report of our findings (communicate). The diagram explains the structure of the tidyverse and it serves as a good starting point for thinking about data analysis in general.
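As a sketch of how those stages might map onto tidyverse code (the data here are invented, and read from a literal string, simply to keep the example self-contained):

```r
library(tidyverse)

# import: read_csv also accepts a literal string, used here so the
# example runs without an external file
pollution <- read_csv("date,siteA,siteB
2020-01-01,10,12
2020-01-02,11,14")

# tidy: reshape to one row per measurement
tidy_pollution <- pollution %>%
  pivot_longer(-date, names_to = "site", values_to = "pm25")

# transform and visualise: a derived variable, then a plot
tidy_pollution %>%
  mutate(log_pm25 = log(pm25)) %>%
  ggplot(aes(x = date, y = log_pm25, colour = site)) +
  geom_line()
```

In a real analysis the import step would read from a file, and the transform/visualise/model stages would be repeated many times before anything is communicated.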
My own work as a biostatistician requires a slightly different workflow, as shown in the next diagram.
The key points of difference are that
Perhaps there should be an extra box in my diagram for “clarifying the research questions”. When a statistician acts as a statistical consultant, it is not at all unusual to spend the first session talking through the aims of the research and trying to turn those aims into specific questions that can be answered with the data that are, or will be, available. It is not always the case, but this discussion ought to take place prior to data collection, so I think of “clarifying the research questions” as part of the design.
Design
Statisticians have an important role in study design and both experimental design and survey design have their own specialist statistical literature. Statisticians help with questions of sample size, power, randomization, measurement error and so on.
In contrast, data scientists rarely have the luxury of being involved in the design because they usually analyse large datasets collected for other purposes. So it could be argued that design is less important in data science, but I think that this would be a mistake for two reasons: first, data scientists may not be able to change the design, but they still need to understand it; second, in large studies they have the option to sample rather than analyse the whole dataset.
As an example, imagine a study that analyses hourly air pollution measurements collected over the last year by monitors placed at numerous points across a city.
Tempting as it is, I must be careful not to caricature data scientists, but it is not too much of an exaggeration to suggest that many would start by tidying the data and preparing some exploratory visualisations. I’d argue that a more appropriate first step would be to ask a series of questions about the design.
Unless the data analyst asks these questions, they run a real risk of making poor decisions about the form of the analysis.
With very large data sets, precision is usually much less of a problem than bias. So having understood the way that the data were collected, the analyst should next consider whether to analyse all of the data or to concentrate on a sample. Here are three examples of situations in which sampling might be appropriate.
Another, quite different, aspect of design that is important to statisticians is the design of the analysis. In a clinical trial it is usual for the statisticians to prepare a statistical analysis plan (SAP) in advance of the data collection. The idea is to avoid the situation in which the researchers try many different analyses and then report the one that is most favourable to their preferred theory. A SAP helps focus the research questions and limits the tendency to engage in the type of research that says, we have these data, let’s see what we can find.
In my opinion, data scientists tend to underestimate the importance of subjectivity in data analysis.
Tidying and Pre-processing
I have kept the data tidying step in my workflow diagram and freely acknowledge that the organisation of the data prior to analysis is a vital skill that is neglected in many statistics courses. On my short course, I make a lot of use of the ideas in Hadley Wickham’s paper on tidy data (https://www.jstatsoft.org/article/view/v059i10). In essence, this approach stores the data in tables much as one would in a relational database. Since the paper is very clear, I’ll leave it there and move on to the next stage in the workflow, the pre-processing.
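To illustrate the idea with an invented example, tidyr can reshape a table with one column per year into the tidy form of one row per observation:

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: one column per year
untidy <- tibble(country = c("A", "B"),
                 `2019`   = c(10, 20),
                 `2020`   = c(12, 25))

# Tidy form: one row per country-year observation
tidy <- pivot_longer(untidy, cols = -country,
                     names_to = "year", values_to = "cases")
```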
Pre-processing is represented in my diagram by a single box but in reality it usually expands into a cycle of three stages, as shown below.
During the pre-processing, we check the data for errors, duplications, inconsistencies, missing values etc. Then we clean the data, by which I mean correct the obvious errors. In the transform stage, we might do anything from simple recoding or categorisation through to complex feature selection, data reduction or multiple imputation. After this, the data will have changed, so we need to check them once again for errors and inconsistencies, cycling until we are happy with the data quality.
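A minimal sketch of one pass around this cycle, with invented data and an invented error code (999 for a missing age):

```r
library(dplyr)

# Hypothetical raw data containing a duplicated id and an impossible age
raw <- tibble::tibble(id     = c(1, 2, 2, 3),
                      age    = c(34, 999, 999, 28),
                      weight = c(70, 82, 82, NA))

# check: count missing values and look for duplicated ids
raw %>% summarise(across(everything(), ~ sum(is.na(.x))))
raw %>% count(id) %>% filter(n > 1)

# clean: drop exact duplicates and recode the impossible value as missing
clean <- raw %>%
  distinct() %>%
  mutate(age = if_else(age == 999, NA_real_, age))

# transform: add a derived variable, after which the checks are repeated
processed <- clean %>% mutate(heavy = weight > 75)
```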
A key point is that all of this pre-processing must be completed before the analyses are started. If not, there is a real danger that two analyses, or even two parts of the same analysis, will be based on subtly different versions of the same data.
Analysis
Just like the pre-processing, the analysis step can be expanded into a cycle with three stages, as shown in the next diagram.
The first stage involves summarising to produce a basic description of the processed data. The summaries together with the research questions will probably suggest models, so we move on to model selection. Typically, this will require several models to be fitted with some criterion, or performance metric, used to choose between them. Next comes model checking to ensure that the assumptions of the chosen model are satisfied and that the model is sufficient to answer whatever questions we started with. We might continue around the cycle by summarising the results of the model fit and perhaps this will suggest new models that we can fit and check.
Within this cycle, data scientists usually avoid over-fitting by dividing the data into training, validation and test sets, or training and test sets if they plan to use cross-validation. Statisticians rarely reserve data for testing, so it is interesting to ask why.
There seem to be two main reasons. First, until recently statisticians tended to work on small or moderately sized data sets, so the loss of precision that would result from data splitting would have had a major impact on their analyses. Second, statisticians tend to use simple models, perhaps assuming linearity and no interactions; as a result, over-fitting is not a major problem, and these simple models are typically compared using formal model selection methods, such as the AIC, that penalise model complexity.
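Using the built-in mtcars data, a minimal sketch of this kind of AIC-based comparison might look like the following (the choice of models is mine, purely for illustration):

```r
# Two candidate linear models for fuel consumption (built-in mtcars data)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# AIC penalises the extra parameter; the smaller value is preferred
AIC(m1, m2)

# check the assumptions of the preferred model
par(mfrow = c(2, 2))
plot(m2)     # residual diagnostic plots
```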
In biostatistics, it is certainly the case that replication in a second, independent dataset is considered important. Here the second sample is used differently and acts as an indicator of whether or not the findings will generalise. When a single data set is randomly split, the many small biases present in the training data are also present in the test data, and so performance in the test data gives no information about generalisability.
You will probably have noticed that visualisation does not feature in my workflow. For me, visualisation is just a tool that I might use at any stage of the analysis. For example, visualisation can help with data checking, or with the data summary, or as part of model checking. It is not an end in itself.
Simplifying the workflow
My workflow diagram helps me to think about the process of data analysis but for the purpose of teaching, a simpler picture is likely to be more helpful. My diagram represents the place that I would like the students to reach, but it does not follow that it should be presented to them at the very beginning. The examples in my introductory course can be tackled with a much simpler workflow and students who have a limited experience of data analysis can be baffled by the inclusion of seemingly redundant stages. Here is a simpler picture that works for one of the early examples in the course.
This workflow relates to a study of serum calcium levels in African buffalo and in this diagram I show the stages in the analysis in aquamarine, the R scripts in ivory and the data files in cyan. This is what I present to the students.
When I first ran this short course, I did not include this type of diagram. I talked in general terms about workflow, but thought that a diagram was unnecessary. I was wrong. Many of my students had never before seen a data analysis broken down into stages. For them, an analysis was all one thing, made up of reading data, recoding, model fitting and so on. So when I had separate scripts for data import and data analysis, or when I saved the tidy data before analysing it, a sizeable minority of them got confused. The second time that I taught the course, I included the diagram and I had no such problems.
There are many other aspects of structured working, such as how we organise the project folders, how we name files, how we document, how we archive and so on. I will cover these topics in future posts.
In this post I discuss the principles that underpin the course and how those principles are drawn from both data science and statistics.
Many years ago, when I started teaching statistics, I spent a lot of time on facts. I provided students with formulae and, depending on the students’ background, gave more or less detail about their derivation. Today the situation is completely transformed. No set of notes that I could provide on, say, the gamma distribution, would come close to what Wikipedia makes available to the students via their phones (https://en.wikipedia.org/wiki/Gamma_distribution).
The availability of factual information on the internet extends to R coding; there are fantastic sites where you can look up the syntax of any given R function, find examples of code, or read answers to “how to” questions about coding. It would be foolish to try to duplicate what is available on those sites or to design a course that pretended that they do not exist.
So, if we are no longer the primary source for the facts underpinning data analysis or the facts about R, what is our purpose as teachers? Clearly, we must teach students to use the R help system and the mass of internet sites that supply the factual information that they need, but for a generation brought up with smart phones these skills come very naturally. As a result, we are left with more time to teach the less tangible but arguably more important topics, such as the logic behind the way that R works, the culture of statistical analysis and the best way to organise your work.
The apprenticeship model of teaching, in which students learn by copying the way that a tutor tackles a problem, is the basis for teaching research students in the UK. Now that we are liberated from the responsibility of being the primary source of facts, it is possible to use this apprenticeship model in lower level courses. In my R short course, I do this through the demonstrations in which I present my analysis of that session’s dataset.
As I’ve explained in previous posts, I divide all half-day sessions into a lecture, a demonstration and an exercise. If I had to rank the three in order of importance, I would certainly place the demonstration at the top; it is the heart of the session. The lecture introduces the essential skills, but the demonstration shows how to put those skills to practical use.
My impression is that students learn a lot from the demonstrations, but they do so by a process of subconscious absorption. Many of the things that I hope that they will pick up are difficult to cover in a lecture; they include:
How to structure your work
How to formulate the research question
How to choose the best methods for the analysis
How to guarantee reproducibility
How to document your work
How to spot and correct errors
How to problem solve when coding
How to work collaboratively
How to relate the results back to the original problem
How to communicate your conclusions
Answers to these “how to” questions are the heart of the culture that underpins data analysis. In the demonstrations, I try to hand on my enthusiasm and my curiosity as well as my answers to these “how to” questions, but in doing so I hope that I make it clear that the demonstrations illustrate my approach and not the approach.
These considerations have led me to think about the principles that underpin my teaching and the way that those principles have been influenced by both traditional statistics and data science. However, do not be under any illusions, I did not plan this course with a list of principles pinned to my wall; rather I designed the course instinctively and in a rush and I only thought about the underlying principles when I came to prepare this blog. What you are reading is mostly post hoc rationalisation.
Famously, Robert Tibshirani, a very well-respected statistician, is said to have referred to machine learning as “glorified statistics” and judging by the many blogs that have argued against this view, the idea is not very popular with the machine learning community. Dismissing machine learning in this way is not exactly tactful, but it does have an element of truth about it.
The discipline of statistics has always been an uneasy mix of probability-based methods and algorithmic methods. The former lead to distribution theory, inference and measures of the strength of the evidence, while the latter are a mixed bag of techniques including principal components, cluster analysis, cross-validation, the lasso and so on. Obviously, statisticians have tried to link the two strands, creating hybrids such as the Bayesian lasso, but essentially the two approaches are based on quite different philosophies. Much more recently, the machine learning community has adopted the algorithmic approach to statistics and scaled it to work with big data by making the algorithms more efficient and more automatic, so reducing the need for intervention by the researcher.
One could argue that a statistical course on R for data analysis has very little to learn from data science, and from a methodological viewpoint this is probably true. The improved algorithms created for machine learning could all be taught within the general framework of a statistics course. Indeed, many of the most popular machine learning algorithms were first proposed by statisticians and, in general, it is statisticians who develop the theory that explains why the algorithms work.
Even though statisticians can lay prior claim to most of the methods used in machine learning, data science is more than just algorithms. Data scientists have a wide range of backgrounds, usually with a strong element of computing and they have brought with them the best practices of their previous disciplines, including ideas of workflow, reproducibility, version control, literate programming, documentation, openness and scalability. These ideas have been sadly neglected by the statistics community, but should underpin the teaching of any good data analysis course.
I start my short course on data analysis with R with a 15-minute introductory lecture in which I set out my guiding principles. In this lecture I emphasise three key points: workflow, reproducibility and literate programming. I do not try for depth in this discussion; instead I flag up ideas that I expand on in the demonstrations.
Workflow fits naturally with the tidyverse, which is designed to break down the process of data analysis into stages ranging from data import to report writing. Thinking in terms of workflow encourages a standard way of working, for example, by having the same standard directory structure for every data analysis, by structuring all data in what Hadley Wickham calls tidy format and by using the project management system in RStudio. I think that workflow is so important that I will dedicate my next post to that topic.
Reproducibility is as much an attitude of mind as anything else. Most scientists are familiar with the use of reproducible methods in the lab and they all have a lab-book in which they keep detailed records of their experiments. For some reason, they do not carry this across to the data analysis, so I encourage them to do so, even to the point of keeping a data analysis diary in which they record what they do, note ideas for future analyses and so on. I suspect that the problem is that their scientific supervisors were brought up in a different era and have not kept up to date. For instance, the message that it is wrong to work interactively in Excel, because of the lack of reproducibility, is one that most students find logical but surprising.
The importance of literate programming is easy to get across by showing examples of very concise base R code and contrasting them with code written with the tidyverse and the pipe. When I analyse the data in my demonstrations I present the students with lots of examples of R scripts written in what I hope is a fairly literate style. I do not try to enforce any particular rules of style. Instead I stress how important it is that they develop a consistent style of their own; although, of course, most students start by copying my approach.
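As an illustration of the contrast (my own example, using the built-in mtcars data), the same summary can be written in a dense base R style or in a more literate, piped style:

```r
# Dense base R: mean mpg by cylinder count, manual-gearbox cars only
base_res <- aggregate(mpg ~ cyl, data = mtcars[mtcars$am == 1, ], FUN = mean)

# The same calculation in a piped style that reads as a sequence of steps
library(dplyr)
tidy_res <- mtcars %>%
  filter(am == 1) %>%            # keep manual gearboxes
  group_by(cyl) %>%              # one group per cylinder count
  summarise(mean_mpg = mean(mpg))
```

Both produce the same numbers; the difference lies in how easily a reader can follow the intent.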
If these ideas come from data science, what are the principles that come from statistics? Here are a few topics that seem to me to get more attention in statistics than in machine learning.
The importance of study design
Being explicit about the analysis method and its assumptions
Understanding the properties of the methods that you use
The dangers of black-box methods
Quantifying the uncertainty of the findings
Considering whether the findings will generalise
Being explicit about the use of knowledge that is external to the data
Remembering that simplicity is a virtue
These topics are all second nature to me after working as a statistician for so long, but they are not easy to teach in a formal way. My hope is that they are implicit in the demonstrations. These are all examples of the types of issue that make face-to-face learning so much better than reading a book.
In an attempt to capture the cultural difference between a statistician and a data scientist, let’s consider the dangers of a black-box method such as a neural network. Why am I cautious about using neural networks, when there is plenty of evidence to suggest that they make very good predictions? Let’s list a few reasons.
Black-box methods:
do not provide understanding of how the prediction was arrived at
do not make one’s assumptions explicit
may or may not generalise; it is rarely possible to say which
tend to be complex and often only offer marginal gains over simpler methods
may not be robust to small changes in the data
These criticisms of black-box methods reveal a deeper set of prejudices that reflect my cultural background as a statistician. For me:
understanding the structure of the data is an important aim of analysis
generalisation is important
minor gains in a specific measure of performance are rarely important
understanding the assumptions behind an analysis is critically important
you are less likely to make mistakes if you keep the analysis simple
Many of these values come from experience and therefore reflect the type of work that I have done in the past. As a young statistician I would let my enthusiasm run away with me and I’d launch into unnecessarily complex analyses. With experience, I now know that it is often better to calculate a table of averages.
Unfortunately you cannot teach experience and you have to accept that your students must be free to make their own mistakes and to learn from them in their turn. So it would be futile to tell students that they must not use black-box methods, but this does not mean that it is wrong to raise questions such as, are simple methods preferable? A discussion of this topic will help students to realise that it is an issue about which people disagree, and of course, students usually give a lot of weight to their tutor’s opinion.
I’ll return to these ideas in future posts when I’ll try to explain how my views on the principles that underpin a good data analysis have influenced the course.
In this post I’ll discuss the merits of problem-based teaching in which each session is based around the analysis of a single published study.
Everyone uses data for illustration when they teach students to use R, but what I want to suggest in this post goes beyond that. I want to advocate the problem-based teaching style in which every session is based around the analysis of a single study. Having identified the study that we will use, we cover those aspects of R and statistics that are needed for that particular analysis and nothing else. So the data drive the syllabus and the ordering of the material.
You will not be surprised to hear that I am convinced that the problem-based approach using real data is perfect for teaching both statistical analysis and R. With the possible exception of theoretical courses for computer scientists, R should not be taught in isolation from its application and the task-orientated design of the tidyverse makes problem-based teaching much more straightforward.
There are several websites that provide teaching datasets usually taken from text books. I know that there are useful examples to be found on these sites, but I want to suggest that you should not use this type of resource. A sure sign that a website is to be avoided is when you are able to search by statistical technique, for instance to find data suitable for teaching linear regression. These datasets are usually sanitised and rarely have sufficient backstory to enable a proper analysis.
The short course that motivated this blog was aimed at research students and post docs, so using data from published studies appealed to them because they could identify with the problems of the authors. What is more, students like finding out about the background to a study and they can immediately see that the R skills that they are learning have practical use. The formulation of the research questions, the selection of appropriate statistical methods and the interpretation of the results are all much easier and more meaningful when done in the context of a real study.
If, as I do, you see R as a tool for producing better data analyses, then you should not isolate the teaching of say, ggplot2 and the choice of model for the analysis; the two things inter-relate. High-quality visualizations play an important role in model selection, so you cannot perform model selection without skills in ggplot, and you cannot create a good plot in R unless you understand how it will be used.
Real data enable you to discuss important issues that do not arise when artificial or sanitised data are used, such as what to do about missing values. It is all very well to discuss the theoretical merits of complete-case analysis or a particular form of imputation, but in practice the choice of approach will be context specific. Handled properly, you can get students to debate the options and appreciate that the choice is subjective and that it may involve a compromise between what is desirable and what is practical; where practical might mean, available in an R package. Hopefully, this helps the students to appreciate that statistical analysis is not a matter of right and wrong and that some decisions about the analysis are critical, while many others do not materially affect the result.
For many years, I have used real data to illustrate my teaching; sometimes I have used data from studies that I’ve worked on, but more often I’ve taken data from published papers that the authors have made available via a journal or an internet data repository. In the case of the R course that inspired this blog, I decided to use published data because the scope of my own work is not broad enough to cover the varied interests of my students, who come from the full range of the life sciences.
It could be argued that to teach the t-test, it helps to have data that are clearly independent and normally distributed; that way, one is free to concentrate on the statistical method. When I was a student on a mathematics course, we derived the formula for the sampling distribution of the t-statistic and then used a few hypothetical numbers to illustrate the calculation. That lecture concentrated on the topic at hand but made no attempt to be practical and it omitted all of the things that you actually need to know, such as when a t-test is appropriate, what question it addresses, how robust the result is, what the result means in the context of a real study, etc.
Even if your aim is to teach the use of R’s t.test() function to students who already know something about the t-test, it is wrong to do so in isolation from the way that the student will use that function when they come to analyse their own data. It is fine to tell them that there is a var.equal argument that can be set to TRUE or FALSE, but that information means very little if the student does not know how to use R to calculate the variances in the two groups, or cannot use R to plot histograms of the two groups and display them on the same scale. When the data drive the course, you teach these diverse skills as a set in the same session, so you have the chance to talk about how equal the variances need to be, or whether it is a better policy to always assume unequal variances, and at the same time, you can introduce the use of facets in ggplot.
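A sketch of this way of working, using R’s built-in sleep data rather than any dataset from the course:

```r
library(ggplot2)

# Built-in sleep data: extra hours of sleep under two drugs
with(sleep, tapply(extra, group, var))   # how unequal are the variances?

# Welch's unequal-variance t-test; var.equal = FALSE is the default
t.test(extra ~ group, data = sleep, var.equal = FALSE)

# Histograms of the two groups on a common scale, one facet per group
ggplot(sleep, aes(x = extra)) +
  geom_histogram(bins = 8) +
  facet_wrap(~ group, ncol = 1)
```

In a class, the variance calculation, the choice of var.equal and the faceted plot would all be discussed together, as parts of the same decision.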
A particular issue with problem-based teaching is that it is harder to stick to a pre-defined syllabus and one can be led into discussing topics in what might seem to be a very unnatural order. A session intended to introduce the chisq.test() function with real data might lead students to suggest unforeseen research questions that are best answered using a sparse contingency table. As a result you could end up with an unplanned discussion of simulated p-values. Not only is it more difficult to stick to a lesson plan when using a problem-based approach but nobody remembers all of the tidyverse functions, so there is a good chance that you will be led to a point where you do not know what function to use. This is just the situation that the students are going to find themselves in when they analyse their own data, so it is helpful for them to see that their tutor can get by without knowing everything.
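For example, with a sparse table of hypothetical counts:

```r
# A sparse 2 x 3 contingency table (invented counts)
tab <- matrix(c(12, 3, 1,
                 9, 2, 0),
              nrow = 2, byrow = TRUE)

# Small expected counts: R warns that the chi-squared
# approximation may be incorrect
chisq.test(tab)

# A simulated (Monte Carlo) p-value avoids the approximation
chisq.test(tab, simulate.p.value = TRUE, B = 10000)
```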
If you do decide to use published data then you will need to hunt for an appropriate study in a data repository. There are two types of data repository available via the internet. One type is for work in progress; these repositories make it easy for team members to share data during an analysis. The other type is for archiving data once a project has been completed and a report or paper has been published. Naturally enough, some repositories combine both functions, allowing sharing within the team during the analysis, after which some or all of the data can be made public. Repositories that provide public access to data may be specialised, in the sense of being for particular types of study, or they may be general.
The total number of data repositories is quite large and the best list of science repositories that I know of is kept by Nature at https://www.nature.com/sdata/policies/repositories. Increasingly, scientific journals ask their authors to deposit the raw data on which the article is based in one of these data repositories. It is all part of the drive for reproducible and open research. So it is now much easier to find data from published articles around which we can base particular lessons, lectures or practicals. My personal favourite repository is Dryad, which can be found at https://datadryad.org/; it is large, varied, yet very easy to search.
The journal PLoS One has a data sharing policy that explicitly mentions Dryad so it is not surprising that many authors submitting to the journal choose to use that repository. The journal’s website at https://journals.plos.org/plosone/s/data-availability says
and later it says,
I have spent many a happy hour searching Dryad for suitable studies based on titles that suggest a topic that is not too specialised and study sizes that are neither too large nor too small. The articles themselves provide the necessary background on study design and data collection and, at least in principle, the repository provides the data that you need.
It is important to appreciate that my aim is not to reproduce the analysis in the published paper. Often I use the data to answer a very specific question that might have been given very little weight in the article, or perhaps not have been mentioned at all.
I have had lots of good experiences using Dryad but difficulties do occur. The most common problem arises when authors choose to withhold part of the data. Quite often they withhold information in such a way as to make it impossible to recreate exactly the analysis published in the article. Presumably this is deliberate, as it stops anyone from challenging the article, thereby defeating the main object of open research. Much less common are instances where I have started to prepare an analysis for an upcoming course only to find an error in the data. Often one cannot be 100% sure that the data are wrong, but I still feel uncomfortable about using questionable data for teaching. I guess that I ought to report these findings to the authors, but I must admit that I never do. It’s easier just to look for a new example.
More common than errors in the data are papers that use statistical methods that I would not have selected myself. It is rare to be able to say that a statistical analysis is wrong, but researchers do sometimes use methods that make assumptions that seem hard to justify given their own study design. Once again, I choose not to get into a debate over these issues; they usually come down to a matter of opinion and, in any case, as I have already said, my objective is not to reproduce the results reported in the paper.
My impression is that the majority of the moderately sized data sets in Dryad are provided as Excel files, but you will also see csv files, Word documents and all manner of formats linked to statistical programs. The Excel files in particular can be very challenging. Researchers will often mix their data with text, graphs and summary statistics, all on the same spreadsheet, and they pay little respect to the ideas of tidy data (https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf). Reading such data and getting them into a form suitable for analysis can be quite challenging, and it is sometimes tempting to re-format the data oneself before providing them to the students. This is a temptation that I have always resisted. Handling messy data is a vital skill for any data analyst, and it is difficult for students to appreciate how they should format their own data if they never see how difficult it is to handle other people’s poorly-formatted spreadsheets.
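To give a flavour of the problem, here is a sketch of the kind of code needed to pull a clean rectangle of data out of such a spreadsheet; the file name, sheet name, cell range and missing-value codes are all hypothetical.

```r
library(readxl)

# Hypothetical example: the real data occupy a rectangle within the
# sheet, surrounded by the author's titles, notes and summary rows.
trial <- read_excel("field_trial.xlsx",       # hypothetical file
                    sheet = "Results",        # hypothetical sheet name
                    range = "B4:F120",        # skip the non-data cells
                    na    = c("", "NA", "-")) # this author's missing codes
```

Even after this, the column names usually need cleaning and some columns arrive with the wrong type, so the import step alone can carry a surprising amount of teaching material.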
In a more specialised R course it might be better to use one of the repositories that is dedicated to a particular type of study. In my own field I have found the GEO repository (https://www.ncbi.nlm.nih.gov/geo/ ) to be particularly useful. This repository specialises in gene expression data from micro-arrays or, more recently, from sequencing. I have had a class use R to view the digital images of a micro-array and then to follow the analysis through pre-processing and normalization to statistical testing. The students learn to appreciate the impact of errors in the data and to understand how changes to the normalization affect the statistical analysis. These issues would be almost impossible to get across without real data. The students also like the fact that their results can be linked to annotation databases that tell them about the genes that are differentially expressed and in the process they learn a lot of useful R skills.
Having chosen a study that is suitable for the skills and interests of my students, I next prepare my own analysis of those data. This analysis will eventually form the demonstration that lies at the heart of each session. My analysis will dictate the new R and statistical skills that I will need to introduce in the lecture and so it defines the syllabus. Of course, this is not a one-way process: if I judge that the lecture would prove too challenging, then I modify my analysis. When I discuss the course materials you will see the type of study that I like to use; I’ll make a point of explaining why I chose each example and I’ll describe how the choice of data dictated the structure of the course.
This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found at https://github.com/thompson575/RCourse. You can find the introduction to this blog at https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/.
In this post I’ll discuss teaching the tidyverse without first introducing base R.
David Robinson’s excellent blog, Variance Explained, has a post that discusses the advantages of teaching the tidyverse to beginners (http://varianceexplained.org/r/teach-tidyverse/). David has also given talks on starting R with the tidyverse that you can watch on YouTube (https://www.youtube.com/watch?v=dT5A0sAWc2I). In discussing the merits and problems associated with introducing R via the tidyverse, I’ll try not to duplicate what has already been said, but a degree of overlap is inevitable, not least because those sources played a role in persuading me to adopt this style of teaching.
The tidyverse, as I am sure that you know, is a collection of R packages inspired by the work of Hadley Wickham; indeed in the early days the packages were often referred to as the Hadleyverse. Hadley Wickham was motivated by a desire to create a set of high-level R functions that together cover all of the stages of a data analysis, so there are packages of functions for importing data, functions for manipulating data, functions for graphics and so on. Of course, these functions do nothing that you could not already do with base R, but they are well-written and they are in a fairly consistent style, which makes them efficient and relatively easy to use.
When designing a course, it is not enough to say that you are “starting with the tidyverse”; after all, you could teach the tidyverse by beginning with a thorough description of the readr package before moving on to a comprehensive look at dplyr and then tidyr. Instead, when I talk of starting with the tidyverse, I mean introducing a few carefully chosen tidyverse functions that work together to enable you to do something practical, and in addition, I mean presenting those functions without any prior mention of base R.
At least as important as the range of tasks carried out by the tidyverse is the fact that its functions are designed to be combined using the pipe. Personally, I doubt whether the tidyverse would have had nearly as much impact were it not for the pipe. The pipe enables a style of coding in which functions are linked by the %>% symbol, so that the output from one function becomes the first argument in the next function. This allows you to chain together a series of functions in the order in which they are to be executed, and it creates very neat and easy-to-read R code (https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html).
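As a minimal sketch of the idea, here is a nested call rewritten with magrittr’s pipe so that it reads left to right in the order of execution:

```r
library(magrittr)

x <- c(2.3, 4.7, 9.1)

# nested form: read from the inside out
round(sqrt(x), 1)

# piped form: read left to right, in the order of execution;
# the output of sqrt() becomes the first argument of round()
x %>% sqrt() %>% round(1)

# both print: [1] 1.5 2.2 3.0
```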
Any type of data object could be passed on in a pipe, but for beginners, I find that it helps to start with the idea of “data frame in, data frame out”. That is, to envisage the functions in the pipe as completing a series of operations on a single data frame. This simplification helps beginners to build up a mental picture of what their code does and it fits neatly with their experience of working with data in a spreadsheet.
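A sketch of “data frame in, data frame out”, assuming dplyr is loaded and using a small made-up data frame:

```r
library(dplyr)

# a small made-up data frame of plant measurements
plants <- tibble(
  species = c("A", "A", "B", "B", "B"),
  height  = c(12, 15, 22, 19, 25)
)

# data frame in, data frame out: each step takes a data frame
# and passes a modified data frame on to the next step
plants %>%
  filter(height > 12) %>%
  group_by(species) %>%
  summarise(mean_height = mean(height))

# returns a tibble with one row per species: A 15, B 22
```

Every intermediate result in this chain is itself a data frame, which is exactly the mental picture that I want beginners to form.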
For better or for worse, most of my students come to R with previous experience of Excel, and an early ambition of my R course is to convince them to stop using it; Excel is limited in its capabilities, and the way that it is commonly used violates most of the principles of reproducibility. Despite this, picturing functions as operating on a rectangular data structure creates a comfortable starting point for most students.
When I teach base R, I introduce the idea of a list before I mention the data frame; in that way I can explain that a data frame is a special type of list made up of vectors of the same length, so operations on data frames are just special cases of operations on lists. When starting with the tidyverse, the approach has to be very different. I am forced to introduce the data frame by analogy with a rectangular spreadsheet, with rows corresponding to subjects and columns corresponding to variables. This is a kind of ‘provisional truth’ and is typical of the approach that has to be adopted when you teach practical skills without the theoretical underpinning. We deliberately present a simplified version of the truth and develop the underlying concepts as we progress.
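For the curious, the base R truth behind the provisional one takes only a few lines to demonstrate:

```r
# a data frame really is a list of equal-length vectors
df <- data.frame(id = 1:3, weight = c(60, 72, 55))

is.list(df)        # TRUE: a data frame is a special kind of list
length(df)         # 2: one list element per column
is.data.frame(df)  # TRUE
df[["weight"]]     # columns can be extracted exactly as from a list
```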
Slowly developing understanding is the way that young children learn; we do not define the system of integers before we teach children to count. However, with adults we tend to move away from this style of teaching towards a more logically ordered method. In my experience, most adults are happy to revert to the simplified form of learning, especially when their motivation is practical. However, you need to be prepared for awkward questions, especially from students who have been taught base R before, or who have read a book on R. How will you reply if during the very first session a student interrupts and says, “but I thought that a data frame was a type of list”?
Handling questions from students who have some knowledge beyond what you are currently trying to teach is tricky and we all develop our own way of coping. Obviously we want to encourage discussion and questioning, so we must take care not to be dismissive of a seemingly unnecessary question, as that could stop others from asking really important questions. Perhaps the first thing to do when presented with such a question is to judge whether it is being asked so that the student can show off their knowledge, or because they really are confused. In the former case you can direct your answer at the individual and keep it short, while confusion is more likely to be shared and requires a fuller answer directed at the whole class.
When we provide provisional truths, it is quite likely that the point raised by a student’s question will be covered in a subsequent session, so we can always point the student to the relevant part of their notes (I give students the entire set of course materials in advance of the course, rather than release it as we go along – I want them to read ahead). However, “we will come to that” is slightly dismissive, so I always try to say something positive in response. Handling questions that go beyond the material that you are currently trying to teach is an issue in all forms of teaching; it is just that in this style of course, the problem arises more often.
Over the years I’ve found that when I teach provisional truth to adults it pays to be honest, that is, to tell the students in advance that you are going to present a simplified version and to explain why you are doing it. Then, after you have covered the point, you can tell them how you plan to develop their understanding in future sessions. Generally, I find that the more context you give, the better the students will understand. In my introduction to the course, I make a point of telling the students about the two approaches to teaching R and I explain that we will be jumping into the tidyverse without any mention of base R. Explaining your approach aids understanding and it helps you avoid those time-consuming questions from students who have previous experience of being taught base R.
When starting with the tidyverse and the pipe applied to a data frame, there is a natural order for the ideas that we need to get across in the first session.
Being explicit about this structure is very important to beginners because it gives them a mental picture of what we are trying to achieve. I try to get these ideas across in Session 1 while introducing a very limited range of functions. For my short course I chose,
You can see that together these functions enable you to import some data, calculate summary statistics, plot the data and report the findings. Although these functions will not provide a very sophisticated analysis, the students will have covered the entire process of a data analysis. In subsequent sessions, I can extend their knowledge, for example by replacing qplot() with ggplot(), but when I teach an extension, the student is able to see how that topic fits into the overall structure.
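The original list of functions is not reproduced here, so the following is only a plausible sketch of such a first-session analysis, with a hypothetical csv file and hypothetical column names; read_csv() comes from readr and qplot() from ggplot2:

```r
library(tidyverse)

# hypothetical csv file of data from a published study
trial <- read_csv("trial_data.csv")

# summary statistics for one (hypothetical) treatment group
trial %>%
  filter(group == "treated") %>%
  summarise(n = n(),
            mean_response = mean(response, na.rm = TRUE))

# a quick plot of the raw data
qplot(x = group, y = response, data = trial, geom = "boxplot")
```

Unsophisticated, certainly, but it takes the students from raw file to summary and plot within a single session.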
Starting with a small set of unrelated tidyverse functions could appear to the student as being rather scatter-gun. Two minutes on read_csv(), three minutes on filter() and so on. Students need structure if they are to learn and, since we are not providing the structure that comes from the logic of the language, it is vital that we replace it with some other structure. In my course, I hope this comes from the workflow that underpins a good data analysis and an early appreciation of the way in which R works with functions that operate on data objects. I will return to these issues in future posts.
I suggested in my introduction to this blog that “starting with the data” is more fundamental than “starting with the tidyverse” (https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/). In my next post I’ll develop this idea and discuss the implications of letting the data drive the syllabus.
This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found on my github page at https://github.com/thompson575/RCourse.
This is a new blog in which I will discuss my experiences of teaching R. I anticipate that there will be about twenty posts after which the blog will close.
Over the years I have tried to teach R in many ways, but I was inspired to start this blog by a tidyverse-based short course that I first ran in the autumn of 2018. As you are reading a blog on teaching R, you will know that there are two popular ways to introduce R. The traditional approach is to treat R as a computer language and to start by discussing data objects and the many base R functions that form the building blocks from which you eventually create a data analysis. The more fashionable alternative is to start with the tidyverse packages of functions; these enable the students to complete useful data analysis tasks from the very first session. In the past, I have tried different combinations of the two approaches, but in the course that motivated this blog, I switched entirely to the tidyverse. There is something satisfying about being up to date, but it is not always true that popular is best.
The students who enrolled on my course did so because they wanted to use R as a tool, and for them the ability to do something useful from the very beginning was definitely motivating; but the danger with task-orientated teaching is that the students learn to do things without really understanding why they work. When this happens, the students become very good at repeating what they have already been shown, but they have difficulty when they need to extend the ideas to related problems with slightly different features. The students lack a mental picture of what R is doing and so they have little feel for how they should modify their R scripts to tackle the new problem. The challenge in any task-orientated teaching is to ensure that the students develop an understanding of the foundations of the subject.
In planning this tidyverse-based course, I was very conscious of the need to build up the students’ basic understanding of R, but this is not easy to do without discussing the structure of the language as one would when teaching base R and consequently there were times when I got it wrong. In fact, I have run the course a couple of times and made a number of changes, which on the whole have helped. In this blog, I’ll point out the things that did not work first time and describe the changes that I made.
My introductory short course consisted of four half-day sessions and was advertised as being for researchers in the life sciences who had no prior knowledge of R. Of course, most, but not all, of the people who signed up for the course had used R before; some had even tried to teach themselves or been on other R courses without making much progress. This created two problems that will be familiar to anyone who teaches courses to non-specialists: the students start the course with widely varying levels of knowledge and, even more critically, they start with very different expectations. Perhaps one of the most challenging groups consists of the students whose previous experiences have convinced them that the topic is difficult and they will never understand. As the tutor, you have to build the confidence of those who find the subject demanding, while not making the more knowledgeable students feel that they are wasting their time. It can be quite a tough balance to achieve.
A second aspect of this short course that distinguished it from any of my previous R courses was that I decided to make it entirely problem-based. Each of the four half-day sessions centred on the analysis of a single real study taken from a published research paper and, within each session, I made a point of covering all aspects of the analysis from importing the data through to preparing a report on the findings. To enable me to do this, I divided each session into three parts. In the first part, I gave a short lecture covering the tidyverse skills needed for that particular data analysis. In the second part, I presented my analysis of the data in the form of a demonstration based on a set of prepared R scripts, then in the final part I set an exercise that invited the students to create a slightly different analysis of the same dataset.
Obviously, if you want to cover a wide range of skills in one session then you cannot go into much detail about any one of them. This needs careful judgement. The ordering of material and the gauging of the right level of detail are critical to the success of a problem-based course.
I found the problem-based style of teaching to be much better for motivation than my usual approach of using small datasets chosen to illustrate the use of specific R functions, but it too has its dangers. Problem-based teaching requires you to find real data that the students can understand and which they can analyse using the limited skill set that they currently possess. There are obvious dangers if you choose a study that can only be understood by people with specialist scientific knowledge, or if you choose a study with hidden complexities that require you either to introduce advanced R skills before the students are ready for them, or to omit obviously important aspects of the analysis. Once again, I found that I made some good choices and some poor choices; this will provide yet more material for the blog.
It is interesting to speculate on whether using the tidyverse to create a task-orientated approach to teaching R was more or less important to the success of the course than the problem-based approach of basing each session on real data from a single study. My feeling is that choosing to make the course problem-based was the key, because that dictated the structure of the sessions and the way that R was used. However, it is true that the problem-based approach would have been very difficult, if not impossible, without the tidyverse, because I would have needed to teach so much base R before I could undertake even the simplest stage of the analysis. Perhaps we should stop trying to persuade teachers of R to “start with the tidyverse” and instead ask them to “start with the data”; that way they would soon find that it is impractical to limit oneself to base R.
There is one further aspect of my teaching that raises interesting issues for discussion. I am a biostatistician and not a data scientist. My short course was aimed at scientists who were more interested in the t-test than in neural networks. Now, I have become a strong advocate of the idea that statisticians should learn from data scientists and this means much more than just being aware of the latest machine learning algorithms. Most statistical analyses would be greatly improved if they incorporated ideas such as workflow, reproducibility and literate programming that are mainstream in data science.
Whilst acknowledging the lessons that statisticians can learn from data science, there are perhaps even more lessons that data scientists could learn from statisticians. I know that it is a little harsh to say it, but when I read articles and blogs on data analysis in R that are written by data scientists, they frequently give me the impression that the author’s understanding of statistics is both rudimentary and out of date. This is unfortunate for those data scientists, but it has broader implications because data scientists are currently driving the development of R. In particular, the tidyverse has been created by people, most of whom call themselves data scientists. There is no doubt in my mind that the tidyverse would have looked quite different had it been designed by a group of statisticians.
In recent years, data scientists have rather taken over the R community and consequently the needs of traditional statistical analysis have been pushed into the background. There is a strange contradiction in this development, because R was originally designed for statistical analysis and it is not ideally suited to the needs of data science. There is a good argument that if you want to do data science, then you should use python or some similar language. Yet we are where we are: R is currently dominated by data scientists and the needs of statistical analysis are being neglected. It is time for statisticians to step forward and redress the balance.
In the last few years the tidymodels project has sought to extend the tidyverse into the area of modelling and as such tidymodels ought to have addressed many of the problems associated with using the tidyverse for statistical analysis. In my opinion, tidymodels has taken a giant step in the wrong direction and as a consequence I do not use it in my teaching. I’ll explain why I don’t use tidymodels and I’ll outline what I think are better ways of handling statistical models within the framework of the tidyverse.
In summary then, I’ve identified three themes relating to teaching R that I want to explore in this blog
I will discuss these issues and at the same time make available my course material and comment on what worked well and what did not. The blog is primarily aimed at teachers of R, so I will not be explaining how particular functions work; I’ll assume that either you know already, or you know where to look it up.