This blog discusses the teaching of data analysis with R. It was inspired by a short course that I first ran in the autumn of 2018. The notes for that course can be found at https://github.com/thompson575/RCourse. If you have not already done so then you might start by reading my introduction to the blog at https://staffblogs.le.ac.uk/teachingr/2020/08/24/teaching-r/.
In this post I discuss the principles that underpin the course and how those principles are drawn from both data science and statistics.
Many years ago, when I started teaching statistics, I spent a lot of time on facts. I provided students with formulae and, depending on the students’ background, I gave more or less detail about their derivation. Today the situation is completely transformed. No set of notes that I could provide on, say, the gamma distribution, would come close to what Wikipedia makes available to the students via their phones (https://en.wikipedia.org/wiki/Gamma_distribution).
The availability of factual information on the internet extends to R coding; there are fantastic sites where you can look up the syntax of any given R function, find examples of code, or read answers to “how to” coding questions. It would be foolish to try to duplicate what is available on those sites or to design a course that pretends that they do not exist.
So, if we are no longer the primary source for the facts underpinning data analysis or the facts about R, what is our purpose as teachers? Clearly, we must teach students to use the R help system and the mass of internet sites that supply the factual information that they need, but for a generation brought up with smartphones these skills come very naturally. As a result, we are left with more time to teach the less tangible but arguably more important topics, such as the logic behind the way that R works, the culture of statistical analysis and the best way to organise your work.
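For readers new to R, here is a minimal sketch of the built-in look-up tools I have in mind; the gamma distribution and dplyr examples are chosen purely for illustration and assume the dplyr package is installed.

```r
# Open the help page for a single function
?mean

# Look up the gamma distribution functions (dgamma, pgamma, ...)
?dgamma

# Search the installed documentation for a topic
help.search("gamma distribution")

# Run the worked examples from a function's help page
example(mean)

# List the long-form guides (vignettes) supplied with a package
vignette(package = "dplyr")
```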
The apprenticeship model of teaching, in which students learn by copying the way that a tutor tackles a problem, is the basis for teaching research students in the UK. Now that we are liberated from the responsibility of being the primary source of facts, it is possible to use this apprenticeship model in lower level courses. In my R short course, I do this through the demonstrations in which I present my analysis of that session’s dataset.
As I’ve explained in previous posts, I divide all half-day sessions into a lecture, a demonstration and an exercise. If I had to rank the three in order of importance, I would certainly place the demonstration at the top; it is the heart of the session. The lecture introduces the essential skills, but the demonstration shows how to put those skills to practical use.
My impression is that students learn a lot from the demonstrations, but they do so by a process of subconscious absorption. Many of the things that I hope they will pick up are difficult to cover in a lecture; they include:
- How to structure your work
- How to formulate the research question
- How to choose the best methods for the analysis
- How to guarantee reproducibility
- How to document your work
- How to spot and correct errors
- How to problem solve when coding
- How to work collaboratively
- How to relate the results back to the original problem
- How to communicate your conclusions
Answers to these “how to” questions are the heart of the culture that underpins data analysis. In the demonstrations, I try to hand on my enthusiasm and my curiosity as well as my answers to these “how to” questions, but in doing so I hope that I make it clear that the demonstrations illustrate my approach and not the approach.
These considerations have led me to think about the principles that underpin my teaching and the way that those principles have been influenced by both traditional statistics and data science. However, do not be under any illusions, I did not plan this course with a list of principles pinned to my wall; rather I designed the course instinctively and in a rush and I only thought about the underlying principles when I came to prepare this blog. What you are reading is mostly post hoc rationalisation.
Famously, Robert Tibshirani, a very well-respected statistician, is said to have referred to machine learning as “glorified statistics” and judging by the many blogs that have argued against this view, the idea is not very popular with the machine learning community. Dismissing machine learning in this way is not exactly tactful, but it does have an element of truth about it.
The discipline of statistics has always been an uneasy mix of probability-based methods and algorithmic methods. The former lead to distribution theory, inference and measures of the strength of the evidence, while the latter are a mixed bag of techniques including principal components, cluster analysis, cross-validation, the lasso and so on. Obviously, statisticians have tried to link the two strands, creating hybrids such as the Bayesian lasso, but essentially the two approaches are based on quite different philosophies. Much more recently, the machine learning community has adopted the algorithmic approach to statistics and scaled it to work with big data by making the algorithms more efficient and more automatic, so reducing the need for intervention by the researcher.
One could argue that a statistical course on R for data analysis has very little to learn from data science, and from a methodological viewpoint this is probably true. The improved algorithms created for machine learning could all be taught within the general framework of a statistics course. Indeed, many of the most popular machine learning algorithms were first proposed by statisticians and, in general, it is statisticians who develop the theory that explains why the algorithms work.
Even though statisticians can lay prior claim to most of the methods used in machine learning, data science is more than just algorithms. Data scientists have a wide range of backgrounds, usually with a strong element of computing and they have brought with them the best practices of their previous disciplines, including ideas of workflow, reproducibility, version control, literate programming, documentation, openness and scalability. These ideas have been sadly neglected by the statistics community, but should underpin the teaching of any good data analysis course.
I start my short course on data analysis with R with a 15-minute introductory lecture in which I set out my guiding principles. In this lecture I emphasise three key points: workflow, reproducibility and literate programming. I do not try for depth in this discussion; instead, I flag up ideas that I expand on in the demonstrations.
Workflow fits naturally with the tidyverse, which is designed to break down the process of data analysis into stages ranging from data import to report writing. Thinking in terms of workflow encourages a standard way of working, for example, by having the same standard directory structure for every data analysis, by structuring all data in what Hadley Wickham calls tidy format and by using the project management system in RStudio. I think that workflow is so important that I will dedicate my next post to that topic.
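To make the idea of a staged workflow concrete, here is a minimal sketch of the kind of pipeline that I have in mind; the file names, the columns and the data subdirectory are all invented for illustration.

```r
# An illustrative tidyverse workflow: import -> tidy -> transform -> summarise
# (data/heights.csv and its columns are hypothetical)
library(tidyverse)

read_csv("data/heights.csv") %>%
  # tidy: one row per observation, one column per variable
  pivot_longer(cols = starts_with("visit"),
               names_to = "visit", values_to = "height") %>%
  # transform: derive new variables and drop missing values
  mutate(height_cm = 2.54 * height) %>%
  filter(!is.na(height_cm)) %>%
  # summarise: a small table that feeds into the report
  group_by(visit) %>%
  summarise(mean_height = mean(height_cm), n = n()) %>%
  write_csv("data/height_summary.csv")
```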
Reproducibility is as much an attitude of mind as anything else. Most scientists are familiar with the use of reproducible methods in the lab and they all have a lab-book in which they keep detailed records of their experiments. For some reason, they do not carry this across to their data analysis, so I encourage them to do so, even to the point of keeping a data analysis diary in which they record what they do, note ideas for future analyses and so on. I suspect that the problem is that their scientific supervisors were brought up in a different era and have not kept up to date. For instance, the message that it is wrong to work interactively in Excel, because of the lack of reproducibility, is one that most students find logical but surprising.
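The point is easy to make in code. In the hypothetical sketch below, every correction to the raw data is an explicit, re-runnable line of a script, so the cleaned file can always be rebuilt from the original and nothing is ever edited by hand.

```r
# Reproducible data cleaning: the raw file is never touched,
# every change is recorded as code that can be re-run and audited
# (data/bp.csv and its columns are hypothetical)
library(tidyverse)

raw <- read_csv("data/bp.csv")

clean <- raw %>%
  # set an impossible blood pressure to missing rather than
  # over-typing it in a spreadsheet
  mutate(sbp = if_else(sbp > 300, NA_real_, sbp)) %>%
  # drop rows with no identifier, leaving a record of the rule used
  filter(!is.na(id))

write_csv(clean, "data/bp_clean.csv")
```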
The importance of literate programming is easy to get across by showing examples of very concise base R code and contrasting them with code written with the tidyverse and the pipe. When I analyse the data in my demonstrations, I present the students with lots of examples of R scripts written in what I hope is a fairly literate style. I do not try to enforce any particular rules of style. Instead, I stress how important it is that they develop a consistent style of their own; although, of course, most students start by copying my approach.
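A typical contrast, here using the built-in mtcars data, might look something like this; both versions produce the same group means, but the piped version reads almost like a description of the analysis.

```r
library(tidyverse)

# Concise base R: correct, but the intent has to be decoded
aggregate(mpg ~ cyl, data = mtcars[mtcars$am == 1, ], FUN = mean)

# The same summary written as a pipeline, one step per line
mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```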
If these ideas come from data science, what are the principles that come from statistics? Here is a list of a few topics that seem to me to get more attention in statistics than in machine learning.
- The importance of study design
- Being explicit about the analysis method and its assumptions
- Understanding the properties of the methods that you use
- The dangers of black-box methods
- Quantifying the uncertainty of the findings
- Considering whether the findings will generalise
- Causality
- Being explicit about the use of knowledge that is external to the data
- Remembering that simplicity is a virtue
These topics are all second nature to me after working as a statistician for so long, but they are not easy to teach in a formal way. My hope is that they are implicit in the demonstrations. These are all examples of the types of issue that make face-to-face learning so much better than reading a book.
In an attempt to capture the cultural difference between a statistician and a data scientist, let’s consider the dangers of a black-box method such as a neural network. Why am I cautious about using neural networks, when there is plenty of evidence to suggest that they make very good predictions? Let’s list a few reasons.
Black-box methods:
- do not provide understanding of how the prediction was arrived at
- do not make one’s assumptions explicit
- may or may not generalise; it is rarely possible to say
- tend to be complex and often only offer marginal gains over simpler methods
- may not be robust to small changes in the data
These criticisms of black-box methods reveal a deeper set of prejudices that reflect my cultural background as a statistician. For me:
- understanding the structure of the data is an important aim of analysis
- generalisation is important
- minor gains in a specific measure of performance are rarely important
- understanding the assumptions behind an analysis is critically important
- you are less likely to make mistakes if you keep the analysis simple
Many of these values come from experience and therefore reflect the type of work that I have done in the past. As a young statistician I would let my enthusiasm run away with me and I’d launch into unnecessarily complex analyses. With experience, I now know that it is often better to calculate a table of averages.
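As an illustration of how little code a table of averages needs, here is a sketch using the built-in ToothGrowth data as a stand-in for a real problem.

```r
library(tidyverse)

# A simple table of group means and standard deviations,
# often more informative than a complicated model
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean_len = mean(len),
            sd_len   = sd(len),
            n        = n(),
            .groups  = "drop")
```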
Unfortunately, you cannot teach experience, and you have to accept that your students must be free to make their own mistakes and to learn from them in their turn. So it would be futile to tell students that they must not use black-box methods, but this does not mean that it is wrong to raise questions such as: are simple methods preferable? A discussion of this topic will help students to realise that it is an issue about which people disagree and, of course, students usually give a lot of weight to their tutor’s opinion.
I’ll return to these ideas in future posts when I’ll try to explain how my views on the principles that underpin a good data analysis have influenced the course.