The EEF were even more wrong about ability grouping than I realised

April 2, 2018

For quite some time now, if I mentioned my support for setting, people would refer me to the EEF toolkit. This supposedly neutral source looked at the meta-analyses and found that setting or streaming had a negative effect, supported by evidence of “moderate strength”. In fact, the EEF found a negative effect size of -0.09. Leaving aside all issues related to whether this is a good way to evaluate the issue, this was a surprising result given that John Hattie, in his book Visible Learning, had also looked at the meta-analyses, and found a positive effect of 0.12.

I explored this in this blogpost and found that the toolkit referenced the following meta-analyses.

Meta-analysis | Effect size
Gutierrez, R., & Slavin, R. E. (1992) | -0.34 (mixed-age attainment vs non-graded classes)
Kulik, C.-L. C., & Kulik, J. A. (1982) | 0.10 (on secondary pupils)
Kulik, C.-L. C., & Kulik, J. A. (1984) | 0.10 (on elementary/primary pupils)
Lou, Y., Abrami, P. C., Spence, J. C., Poulsen, C., Chambers, B., & d'Apollonia, S. (1996) | -0.12 (on low attainers)
Puzio, K., & Colby, G. (2010) |
Slavin, R. E. (1990) | -0.06 (on low attainers)
Indicative effect size (on low attainers) | -0.09

It was noticeable that this seemed to hinge on the three meta-analyses that found a negative result. There were issues with all three, but the biggest was that the most negative of the effect sizes, Gutierrez and Slavin's (1992) finding of -0.34, was wrong. They actually found a positive effect size of +0.34. One would assume that this would make a huge difference to the -0.09 figure. I pointed this out.
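To illustrate the scale of that error: as a rough sketch, suppose for the moment that the headline figure were a simple unweighted mean of the five effect sizes listed in the technical appendix (the EEF's actual calculation is not published in full, so this is an assumption, not their method). Correcting the sign flips the average from negative to positive:

```python
# Effect sizes from the technical appendix (Puzio & Colby lists none).
as_published = [-0.34, 0.10, 0.10, -0.12, -0.06]  # with the sign error
corrected    = [+0.34, 0.10, 0.10, -0.12, -0.06]  # Gutierrez & Slavin fixed

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(as_published), 3))  # -0.064: negative, like the headline -0.09
print(round(mean(corrected), 3))     #  0.072: positive, closer to Hattie's 0.12
```

On that naive reading, the sign error is the difference between a negative and a positive headline figure, which is why one would expect the correction to matter.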

I was a little surprised at the response.

In the news section of their website, 6 days after my post they published an item entitled EEF Blog: Setting and streaming in schools – what does the evidence say? in which the following was claimed:

However, setting and streaming is a unique Toolkit topic in having meta-analytical evidence showing a split positive/negative impact depending on pupils’ level of attainment. Presenting an overall average, which is the usual Toolkit approach when there is evidence of varying effects, would mask the fact that the impact was actually negative for lower attaining pupils, who are disproportionately from disadvantaged backgrounds.

Indeed, in the first version developed by Durham University for the Sutton Trust as the Pupil Premium Toolkit (2011), the impact estimate was presented as “+1 / -1”, in order to communicate this variability of impact according to pupil attainment. This approach had merits, but risked being confusing for teachers and senior leaders.

We therefore present the headline estimate for low-attainers, and then use the Toolkit text to explain the variation (and clearly state the source of the estimate). This is something we continue to review as we develop the Toolkit.

In the interests of transparency, by the way, we do want to highlight that we have corrected a ‘typo’ in the online version of the Toolkit – one of the meta-analyses referenced in the technical appendix has been incorrectly shown as -0.34 when it should have read +0.34. Our thanks to Andrew Old, whose blog highlighted the mistake. We’re happy to reassure users of our Toolkit that this was a transposition error only and so does not in any way affect the impact figures we report on setting and streaming.

It took me a while to work out what they were claiming, as it seemed so unlikely and the implications so ridiculous. It appears to be that:

  1. The -0.09 figure is only based on the figures for low attainers.
  2. This was clearly stated.

This would mean that the mistake with the -0.34 figure would not affect the result.

It would also mean the -0.09 figure was even more ridiculous than I had thought.

Considering the first point: if the figure for setting and streaming was based only on the figures for low attainers, then we are talking about evidence from only two meta-analyses. One of these, the one with the larger negative effect size (Lou et al., 1996), was not about setting or streaming at all; it was about grouping children within classes, a form of mixed-ability teaching. That leaves one negative effect size of -0.06 (described by the author of that study as "close to zero") being combined with an effect size of -0.12 that isn't actually for setting or streaming, to produce -0.09 for setting or streaming. We have a completely bogus figure which is rated as evidence of "moderate strength" against setting and streaming, while all the evidence that found positive effect sizes is ignored. And this has been achieved by cherry-picking the data to exclude studies which did not specify "low attainers".
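On that reading, the arithmetic behind the headline figure appears to be nothing more than the mean of those two numbers. A minimal sketch, assuming a simple unweighted mean (the EEF does not publish its exact calculation):

```python
# The only two meta-analyses in the list reporting a "low attainers" effect size.
low_attainers = {
    "Lou et al. (1996)": -0.12,  # within-class grouping, not setting/streaming
    "Slavin (1990)":     -0.06,  # described by Slavin as "close to zero"
}

headline = sum(low_attainers.values()) / len(low_attainers)
print(round(headline, 2))  # -0.09, the toolkit's headline figure
```

Two data points, one of them off-topic, averaged into a single headline number: that is the entire evidential basis on this reading.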

Now we should consider the claim that the EEF "clearly state the source of the estimate". This seems to be utterly false. There is no mention, on the page summarising results from the toolkit, that the figure was based only on studies of low attainers. The Technical Appendix says, when discussing the strength of the evidence:

There are six meta-analyses suggesting that setting or streaming appears to benefit higher attaining pupils and be detrimental to the learning of mid-range and lower attaining learners. [my emphasis]

Within two weeks of that blogpost, the EEF released a (really pretty terrible) report on maths teaching, which included a section on ability grouping which repeated the -0.09 figure and claimed:

The EEF ‘Setting or Streaming’ toolkit draws on six meta-analyses (in addition to a range of single studies and reviews).

I did look at the original version of the Toolkit from 2011 mentioned in the blogpost. This cleared up one of the things that had most confused me, namely, how some unusually sensible academics from the CEM at Durham could have been involved in such flawed figures. The answer is that, while they were in my view too negative about ability grouping, they did, as described above, give two figures, a positive one for the average effect of setting (one that actually agreed with Hattie) and a negative one for low attainers. They also referred to “ability grouping” not “setting and streaming”, making the inclusion of Lou et al (1996) less of an obvious mistake.

So to summarise the origins of the negative effect size for setting and streaming in the EEF toolkit, we appear to have the following events:

  1. Researchers in 2011 find a positive effect size for "ability grouping" but a negative effect size for low attainers from the two meta-analyses which specified low attainers. They report both figures. Their studies include within-class ability grouping, rather than just setting and streaming.
  2. This result is subsequently attributed to “setting and streaming” not “ability grouping” on the EEF website despite the figure for low attainers being dependent on a study which didn’t look at setting.
  3. Also subsequently to point 1, the positive effect size from the 6 studies is removed and only the result for low attainers reported as a headline figure, without making it clear that only 2 meta-analyses (only one of which was relevant) were used for this figure.
  4. The EEF website continues to claim incorrectly that the figure is based on 6 meta-analyses.
  5. A "typo" which adds an extra, larger negative effect size to the six studies serves to conceal the fact that the figure does not reflect the six studies it is attributed to.
  6. The EEF blog inaccurately claims that they “clearly state the source of the estimate”. Within a fortnight of this they publish a report claiming again to have used all 6 studies.

Now, to be honest, every one of the errors listed above could be an honest mistake. But every single one of them either misrepresented the research on setting and streaming, or obscured the source of the claims made. And that’s quite a few mistakes in the same general direction to have been made without some form of bias over the issue of setting, as well as a lot of carelessness.

I think the time has come for the EEF to admit they have really screwed up here, and misled a lot of people. I have heard their incorrect figure quoted in schools. I have seen it quoted in blogs. I have seen it quoted on Twitter. It is probably the most widely publicised result they have. And it’s wrong. And they still include it on their website. It needs to be withdrawn and replaced with the positive effect size of 0.12 for ability grouping that their researchers actually found.

  1. It’s even worse than this. Until a few months ago, they had another typo: they had recorded the Slavin result as -0.6 instead of -0.06. At the same time as they corrected this typo, they ‘miscorrected’ +0.34 to -0.34, which mysteriously yet conveniently obscures the issue you raise: that this is the one and only EEF toolkit area which reports results for a restricted range of pupils’ ability. That’s a huge coincidence we have to swallow if we are to maintain a belief in the EEF toolkit authors’ fairness.

    In addition, by renaming ‘ability grouping’ to ‘setting and streaming’, they have also failed to pick up that many of the underlying studies are within-class ability grouping (therefore not appropriate to be classified as ‘setting and streaming’).

    I agree that the only honest response is to replace the summary effect size with one calculated in the same way as the other areas.

    All of this, of course, should be read in the context that the whole effect size enterprise is fundamentally flawed: it simply does not measure the impact of an intervention and areas which appear higher on the toolkit are not areas with better educational impact, but areas where it is easier to conduct more precise research. See http://bit.ly/2F3204F

    So the really honest thing would be to take the toolkit offline completely and stop misinforming teachers.

  2. […] for a minute pause to reflect on the implications of the information dug out by Andrew Old in his blog. I’m not a mathematician, but I was able to understand the […]

  3. […] was thus with great interest and a certain awe that I perused Andrew Old’s thorough, meticulous dissection of a misleading report from England’s Education Endowment Foundation, a body which appears […]

  4. […] Teaching in British schools « The EEF were even more wrong about ability grouping than I realised […]

  5. A very interesting commentary.
    I like the EEF for what they’re attempting to do, but your investigation has highlighted a real issue that I never really considered – I’ve been guilty of taking a lot of what the site says at face value.
    Also, a lot of the research into streaming is old – data from the early nineties isn’t particularly relevant today.

    All the best,
    Mr H.

  6. […] Andrew, a UK blogger, recently wrote a post about the UK’s Education Endowment Foundation (EEF) and the evidence it presents on setting and streaming (between-class ability grouping). It was a […]

  7. Thanks Andrew for an excellent analysis. I’m trying to do a similar analysis of Hattie’s work and finding the same mistakes – effect sizes represented as negative when they were positive in the research, combining studies that are not really measuring the influence in question, lots of misrepresentation, etc. I’m looking for others to contribute here – https://visablelearning.blogspot.com.au/

  8. […] use it to inform our practice and make us better teachers.  But, as Andrew Old postulates here, what if the research is […]

  9. Andrew, I’m surprised you haven’t devoted a blog post yet to this little gem from the EEF (unless of course I’ve missed it?):
    “EEF publishes new review of evidence on Maths teaching”
    (see https://educationendowmentfoundation.org.uk/news/eef-publishes-new-review-of-evidence-on-maths-teaching/ ).

    It includes a lovely quote from Josh Hillman, Director of Education at the Nuffield Foundation:

    “This research is valuable because it synthesises a huge range of international evidence on what works and what doesn’t when it comes to teaching maths. For instance, it tells us that collaborative learning has a positive effect on attainment, but that setting or streaming students by ability generally does not.”

    I think the issue here is not just what this report says, but the mixed messages that it seems to send as well. Take these two quotes:

    “It’s often said that calculators can harm students’ arithmetic skills. What this review finds is that they can actually boost pupils’ fluency and understanding of maths – but that to do so, teachers should ensure they are used in a considered and thoughtful way, particularly with younger students.”

    And …

    “Today’s report also finds that teachers should help pupils to use a range of mental and other methods and be able to recall number facts efficiently and quickly. The evidence suggests that those who are unable to do this may have difficulty with harder maths later in school.”

    The latter of these quotes implicitly supports the use of times table tests to aid the learning of higher level maths concepts (a point I made on my blog) while the former comes with so many caveats as to be virtually worthless. Yet guess which message gets highlighted by the BBC, The Guardian and the EEF itself?

    “Calculators ‘a plus’ for young mathematicians after all” ( http://www.bbc.co.uk/news/education-43500274 ).

    “I’ve never known my times tables. Frankly, who needs them?” by Peter Bradshaw ( https://www.theguardian.com/commentisfree/2018/feb/16/times-tables-multiplication-learning-by-rote ).

    “Using calculators in maths lessons can boost pupils’ calculation and problem-solving skills, but they need to be used in a thoughtful and considered way, according to a review of the evidence published by the Education Endowment Foundation (EEF) today.” (First line of EEF web page.)

    Personally I think you’re putting too much faith in the EEF here.

    • I did mention that report in the above post. We know it is wrong about setting and streaming. I’m not sure it’s worth investigating what else it’s wrong about.

  10. […] through the lens of differing perspectives and levels of reliability and validity. The excellent recent post by Andrew shows that the research bandwaggon often has political and other hidden agendas. In […]

  11. […] The second issue was to be edited by Jonathon Sharples of the EEF (the organisation I blogged about here). Among the topics they requested articles on was “Metacognition, self-regulation” […]

  12. The process the EEF undertake for these reviews is flawed – even in their own terms. They keep on relying on ‘effect size’ to tell them what is and what is not a better intervention. It’s just not a measure of this – they’re simply misinterpreting a statistical term and they’re risking huge harm by making recommendations on that basis. There’s an easy to listen to explanation of all this in a recent podcast from the Education Research Reading Room (https://bit.ly/2w8eyXx)

  13. […] Many thanks to Mura Nava from EFL Notes for the comment below (after my initial posting) pointing me towards this other recent post by Andrew Old on much more worrying mis-reporting, and mis-interpretation of the research on the complex issue of ability grouping by EEF:  https://teachingbattleground.wordpress.com/2018/04/02/the-eef-were-even-more-wrong-about-ability-gro… […]
