I’ve been getting a bit concerned that the EEF’s evaluations of educational methods, which were meant to provide a more solid evidence base for teaching, are actually producing the same sort of unreliable research and hype that we have seen all too often in education. The following guest post is by Matthew Inglis (@mjinglis), who kindly offered to comment on a big problem with the recent, widely-reported study showing the effectiveness of Philosophy for Children (P4C).
On Friday the Independent newspaper tweeted that the “best way to boost children’s maths scores” is to “teach them philosophy”. A highly implausible claim, one might think: surely teaching them mathematics would be better? The study which gave rise to this remarkable headline was conducted by Stephen Gorard, Nadia Siddiqui and Beng Huat See of Durham University. Funded by the Education Endowment Foundation (EEF), they conducted a year-long investigation of the ‘Philosophy for Children’ (P4C) teaching programme. The children who participated in P4C engaged in group dialogues on important philosophical issues – the nature of truth, fairness, friendship and so on.
I have a lot of respect for philosophy and philosophers. Although it is not my main area of interest, I regularly attend philosophy conferences, I have active collaborations with a number of philosophers, and I’ve published papers in philosophy journals and edited volumes. Encouraging children to engage in philosophical conversations sounds like a good idea to me. But could it really improve their reading, writing and mathematics achievement? Let alone be the best way of doing this? Let’s look at the evidence Gorard and colleagues presented.
Gorard and his team recruited 48 schools to participate in their study. About half were randomly allocated to the intervention: they received the P4C programme. The others formed the control group. The primary outcome measures were Key Stage 1 and 2 results for reading, writing and mathematics. Because different tests were used at KS1 and KS2, the researchers standardised the scores from each test so that they had a mean of 0 and a standard deviation of 1.
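Standardising in this way just means converting each raw score to a z-score, so that results from different tests are on a common scale. A minimal sketch (the raw scores below are made up for illustration, not taken from the study):

```python
import statistics

def standardise(scores):
    """Convert raw test scores to z-scores (mean 0, SD 1)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return [(s - mean) / sd for s in scores]

# Hypothetical raw marks from one test
ks1_raw = [12, 15, 9, 20, 14]
z = standardise(ks1_raw)
```

After this transformation a score of +1 means “one standard deviation above the mean for that test”, whichever test it came from.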
The researchers reported that the intervention had yielded greater gains for the treatment group than the control group, with effect sizes of g = +0.12, +0.03 and +0.10 for reading, writing and mathematics respectively. In other words, the rate of improvement was around a tenth of a standard deviation greater in the treatment group than in the control group. These effect sizes are trivially small, but the sample was extremely large (N = 1529), so perhaps they are meaningful. But before we start to worry about issues of statistical significance*, we need to take a look at the data. I’ve plotted the means of the groups here.
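For readers unfamiliar with the g statistic: it is a standardised mean difference (Hedges’ g), the gap between the two group means divided by the pooled standard deviation, with a small-sample correction. A sketch of the calculation, using invented gain scores rather than the study’s data:

```python
import statistics

def hedges_g(treatment, control):
    """Hedges' g: standardised mean difference with small-sample correction."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = statistics.mean(treatment), statistics.mean(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    d = (m1 - m2) / pooled_sd          # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # Hedges' correction factor
    return d * correction

# Hypothetical gain scores for two small groups
g = hedges_g([0.5, 0.2, 0.3, 0.4], [0.1, 0.2, 0.0, 0.3])
```

A g of +0.12 therefore means the treatment group’s mean gain was about 12% of a standard deviation larger than the control group’s.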
Any researcher who sees these graphs should immediately spot a rather large problem: there were substantial group differences at pre-test. In other words, the process of allocating students to groups, by randomising at the school level, did not result in equivalent groups.
Why is this a problem? Because of a well-known statistical phenomenon called regression to the mean. If a variable is extreme on its first measurement, then it will tend to be closer to the mean on its second measurement. This is a general phenomenon that will occur any time two successive measurements of the same variable are taken.
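You can see the phenomenon in a toy simulation. Suppose each pupil has a fixed “true score” and each test adds independent noise. If we select pupils who did unusually well on the first test, their second-test mean drifts back towards the population mean, with no intervention at all (the numbers below are simulated, not from any real study):

```python
import random

random.seed(0)

# Each pupil has a fixed 'true score'; each test adds independent noise.
true_scores = [random.gauss(0, 1) for _ in range(10_000)]
test1 = [t + random.gauss(0, 1) for t in true_scores]
test2 = [t + random.gauss(0, 1) for t in true_scores]

# Select pupils who scored very highly on test 1...
top = [i for i, s in enumerate(test1) if s > 1.5]
mean_t1 = sum(test1[i] for i in top) / len(top)
mean_t2 = sum(test2[i] for i in top) / len(top)
# ...their test-2 mean falls back towards the population mean of 0,
# because part of their high test-1 scores was just lucky noise.
```

The selected group’s second-test mean sits roughly halfway between their first-test mean and zero, purely because the noise that inflated their first scores does not repeat.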
Here’s an example from one of my own research studies (Hodds, Alcock & Inglis, 2014, Experiment 3). We took two achievement measurements after an educational intervention (the details don’t really matter), one immediately and one two weeks later. Here I’ve split the group of participants into two – a high-achieving group and a low-achieving group – based on their scores on the immediate post-test.
As you can see, the high achievers in the immediate post-test performed worse in the delayed post-test, and the low achievers performed better. Both groups regressed towards the mean. In this case we can be absolutely sure that the low-achieving group’s ‘improvement’ wasn’t due to an intervention because there wasn’t one: the intervention took place before the first measurement.
Regression to the mean is a threat to validity whenever two groups differ on a pre-test. And, unfortunately for Gorard and colleagues, their treatment group performed quite a bit worse than their control group at pre-test. So the treatment group was always going to regress upwards, and the control group was always going to regress downwards. It was inevitable that there would be a between-groups difference in gain scores, simply because there was a between-groups difference on the pre-test.
So what can we conclude from this study? Very little. Given the pre-test scores, if the P4C intervention had no effect whatsoever on reading, writing or mathematics, then this pattern of data is exactly what we would expect to see.
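We can check this intuition with another toy simulation. Below, nobody receives any intervention at all, but the ‘treatment’ group starts out lower at pre-test (here exaggerated by construction, to make the mechanism obvious). Regression to the mean alone then produces a larger gain for the ‘treatment’ group:

```python
import random

random.seed(1)

# No intervention effect at all: everyone is drawn from one population.
true_scores = [random.gauss(0, 1) for _ in range(2_000)]
pre = [t + random.gauss(0, 0.7) for t in true_scores]
post = [t + random.gauss(0, 0.7) for t in true_scores]

# Mimic an unlucky allocation: the 'treatment' group happens to
# contain the pupils with the lowest pre-test scores.
order = sorted(range(len(pre)), key=lambda i: pre[i])
treatment = order[:1000]   # lower half at pre-test
control = order[1000:]     # upper half at pre-test

def mean_gain(idx):
    return sum(post[i] - pre[i] for i in idx) / len(idx)

# The 'treatment' group shows the larger gain, purely because its
# pre-test scores were depressed by noise that does not recur.
```

The ‘treatment’ group’s gain is positive and the control group’s is negative, despite zero true effect – exactly the pattern that a pre-test imbalance plus regression to the mean predicts.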
What is most curious about this incident is that this obvious account of the data was not presented as a possible (let alone a highly probable) explanation in the final report, or in any of the EEF press releases about the study. Instead, the Director of the EEF was quoted as saying “It’s absolutely brilliant that today’s results give us evidence of [P4C]’s positive impact on primary pupils’ maths and reading results”, and Stephen Gorard remarked that “these philosophy sessions can have a positive impact on pupils’ maths, reading and perhaps their writing skills.” Neither of these claims is justified.
That such weak evidence can result in a national newspaper reporting that the “best way to boost children’s maths scores” is to “teach them philosophy” should be of concern to everyone who cares about education research and its use in schools. The EEF ought to pause and reflect on the effectiveness of their peer review system and on whether they include sufficient caveats in their press releases.
*The comment about “statistical significance” reflects additional concerns others had expressed about the methodology, for instance: here.