Archive for August, 2020


The tragedy of grades based on predictions

August 16, 2020

When I wrote about an exam announcement last week it was out of date before I’d finished typing. This post too may now be out of date if the appeals system allows major changes, but I have seen so much false information that I thought I’d better get this out there.

Exams were not sat this year. The decision was made instead to predict what grades would have been given. This is probably the decision that should have been debated. Instead the debate has centred on how grades were predicted, with much talk of an evil algorithm crushing children’s hopes. Some wished to predict grades deliberately inaccurately in order to allow grade inflation to hide the problems. Because opportunities such as university places and employment are finite, grade inflation doesn’t actually solve any problem. What it does is make sure that when people lose out on opportunities, it will not be clear that this year’s grades were the problem. I argued against the idea that grade inflation solves problems here and will not be going into it again now, but it is worth noting that most disagreement with any opinions I express in this post will come from advocates of using grade inflation to solve problems, rather than anything else. In particular, it needs to be acknowledged that the use of teacher assessment would, on average, have led to more grade inflation.

However, because people seemed to think inaccuracy in grades would justify grade inflation, and because people objected to specific grades when they arrived, there has now been huge debate about how grades were given. Much of this has been ill-informed. 

I intend to explain the following:

  1. How grades are predicted.
  2. Why predicted grades are inaccurate.
  3. What claims about the process are false or unproven.

Normally, I’d split this into 3 posts, but things are moving so fast I assumed people would want all this at once in one long post.

How grades are predicted.

Ofqual produced a statistical model that would predict the likeliest grades for each centre (usually a school or college). This used all the available data (past performance and past grades of the current cohort) to predict what this year’s performance would have been. This was done in accordance with what previous data showed would predict grades accurately. A lot of comment has assumed that if people are now unhappy with these predictions or individual results, then there must have been a mistake in this statistical model. However, this is not something where one can simply point at things one doesn’t like and say “fix it”. You can test statistical models using old data, e.g. predict 2019 grades from the years before 2019. If you have a model that predicts better than Ofqual’s then you win, you are right. If you don’t, and you don’t know why the Ofqual model predicts how it does, then you are probably wrong. In the end, proportions of grades were calculated from grades given in recent years, then adjusted in light of GCSE information about current students, then the number of expected A-levels in each subject at each grade was calculated for each centre. Centres were given information about what happened in this process in their case.
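To give a sense of the shape of this (and only the shape), here is a rough sketch in Python of predicting a centre’s grade distribution from its recent history, nudged by the prior attainment of the current cohort. The grades, proportions and the adjustment rule are all invented for illustration; this is not Ofqual’s actual model, which was far more sophisticated.

```python
# A minimal, illustrative sketch of centre-level grade prediction, NOT Ofqual's
# actual model. It assumes (hypothetically) that a centre's historical grade
# proportions are shifted according to how this year's cohort's GCSE results
# compare with those of previous cohorts.

GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

def predict_distribution(historical_props, prior_shift):
    """historical_props: dict grade -> proportion from recent years.
    prior_shift: fraction of each grade's share to move up (positive) or
    down (negative) one grade, reflecting stronger/weaker GCSE results."""
    adjusted = {g: 0.0 for g in GRADES}
    for i, g in enumerate(GRADES):
        share = historical_props.get(g, 0.0)
        moved = share * abs(prior_shift)
        adjusted[g] += share - moved
        # Move the shifted share one grade up or down, clamped at the ends.
        target = i - 1 if prior_shift > 0 else i + 1
        target = min(max(target, 0), len(GRADES) - 1)
        adjusted[GRADES[target]] += moved
    return adjusted

# Example: a centre whose cohort has slightly better GCSEs than usual.
history = {"A*": 0.05, "A": 0.20, "B": 0.30, "C": 0.25,
           "D": 0.12, "E": 0.06, "U": 0.02}
print(predict_distribution(history, prior_shift=0.10))
```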

Although the model came up with the grades at centre level, which students got which grades was decided by the centres. Centres ranked their students in each subject and grades were given in rank order. Some commentary has overlooked this, talking as if the statistical model decided every student’s grade. It did not. It determined what grades were available to be given (with an exception to be discussed in the next paragraph), not which student should get which grade. As a result the majority of grades were not changed and where they were, it would often have been a result of the ranking as well as the statistical model.
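The allocation step is easier to picture. As a sketch only (the student names, grade counts and exact mechanics are invented), it amounts to walking down the centre’s ranking and handing out the grades the model has made available:

```python
# A minimal sketch of rank-order allocation, assuming the statistical model
# has already fixed how many of each grade the centre gets. The names and
# counts below are made up for illustration.

def allocate_by_rank(ranked_students, grade_counts):
    """ranked_students: list of students, best first (the centre's ranking).
    grade_counts: list of (grade, count) pairs, highest grade first."""
    results = {}
    i = 0
    for grade, count in grade_counts:
        for student in ranked_students[i:i + count]:
            results[student] = grade
        i += count
    return results

ranking = ["Asha", "Ben", "Chloe", "Dev", "Ella", "Femi"]
counts = [("A", 1), ("B", 2), ("C", 2), ("D", 1)]
print(allocate_by_rank(ranking, counts))
# -> {'Asha': 'A', 'Ben': 'B', 'Chloe': 'B', 'Dev': 'C', 'Ella': 'C', 'Femi': 'D'}
```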

Finally, there was an exception because of the problem of “small cohorts” taking exams i.e. where centres had very few students taking a particular exam (or very few had taken it in the past). This is because where there was less data, it would be harder to predict what grades were likely to be given. Centres had also been asked to predict grades (Centre Assessed Grades or CAGs) for each student and for the smallest cohorts these were accepted. Slightly larger cohorts were given a compromise between the CAGs and the statistical model, and for cohorts that were larger still, the statistical model alone was used.
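Purely for illustration, a taper of that kind might look like the sketch below. The thresholds of 5 and 15 and the linear weighting are my own inventions, not the cut-offs actually used:

```python
# An illustrative taper between CAGs and the statistical model, based purely
# on cohort size. The thresholds and the linear weighting are invented for
# the example; the real scheme used its own cut-offs and weights.

def model_weight(cohort_size, small=5, large=15):
    """Return the weight given to the statistical model (0 = CAGs only,
    1 = model only), increasing linearly between the two thresholds."""
    if cohort_size <= small:
        return 0.0
    if cohort_size >= large:
        return 1.0
    return (cohort_size - small) / (large - small)

for n in (3, 8, 12, 30):
    w = model_weight(n)
    print(f"cohort of {n:>2}: {w:.0%} statistical model, {1 - w:.0%} CAGs")
```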

It is important to understand this process if you think a particular grade is wrong. Without knowing whether the cohort was small; why the statistical model would have predicted what it did; how the distribution was calculated for a centre; and where a student was in the ranking, you do not know how a grade came to be given. For some reason, people have jumped to declare the evils of an “algorithm”. Didn’t get your result? It’s the result of an algorithm.

As a maths teacher, I quite like algorithms. Algorithms are the rules and processes used to solve a problem, perhaps best seen as the recipe for getting an answer. Every year algorithms are used after exams to decide grade boundaries and give grades. A mark scheme is also an algorithm. The alternative to algorithms deciding things is making arbitrary judgements that don’t follow rules. This year is different in that CAGs, a statistical model (also a type of algorithm) and centre rankings have replaced exams. The first thing that people need to do to discuss this sensibly is to stop talking about an algorithm that decided everything. If you mean the statistical model then say “the statistical model”. There are other algorithms involved in the process, but they are more like the algorithms used every year: rules that turn messy information into grades. Nobody should be arguing that the process of giving grades should not happen according to rules. Nobody in an exam board should be making it up as they go along.

Why predicted grades are inaccurate.

Predicted grades, whether from teachers or from a statistical model, are not likely to be accurate. That’s why exams are taken every year. The grades given will not have been the same as those that would have been given had exams been sat. Exam results are always influenced by what seem like random factors that nobody can predict (I will discuss this further in the next section). We can reasonably argue over what is the most accurate way to predict grades, but we cannot claim that there is a very accurate method. There are also situations where exam results are very hard to predict. Here is why I think this year’s results will be depressingly inaccurate.

Some students are exceptional. Some will get an A* in a school that’s never had an A*. Some will get a U in a school that’s never had a U. Predicting who these students are is incredibly difficult and remains difficult even where historic A-level results are adjusted to account for the GCSE data of current students. Students will often have unfairly missed out (or unfairly gained) wherever very high or low grades were on the table (i.e. if students were at the top and the bottom of rankings). This is the most heartbreaking aspect of what’s happened. The exceptional is unpredictable. The statistical model will not pick up on these students. If a school normally gets some Us (or it gets Es but this cohort is weaker than usual) the model will predict Us. If a school doesn’t normally get A*s (or it does but this year’s cohort is weaker than usual) the model will not predict A*s. This will be very inaccurate in practice. You might then think that CAGs should be used to identify these students. However, just as a statistical model won’t pick up an A* or U student where normally there are none, a teacher who has never taught an A* or U student will not be able to be sure they have taught one this time. In the case of U it might be more obvious, but why even enter a student for the exam if it was completely obvious they’d get U? The inaccuracy in the CAGs for extreme grades was remarkable. In 2019, 7.7% of grades were A*; in 2020, 13.9% of CAGs were A*. In 2019, 2.5% of grades were Us; in 2020, 0.3% of CAGs were Us. Both the CAGs and the statistical models were likely to be wrong. There’s no easy way to sort this out; it’s a choice between two bad options.

As well as exceptional students, there are exceptional schools. There are schools that do things differently now, and their results will be different. Like exceptional students, these are hard to predict. Ofqual found that looking at the recent trajectory of schools did not tell them which were going to improve and so the statistical model didn’t use that information. Some of us (myself included) are very convinced we work in schools that are on the right track and likely to do better. However, no school is going to claim otherwise and few schools will admit grades are going to get worse, so again, CAGs are not a solution. Because exceptional schools and exceptional students are by their very nature unpredictable, this is where we can expect to find the biggest injustices in predicted grades.

Perhaps the biggest source of poor predictions is the one that people seem to be reluctant to mention. The rankings rely on the ability of centres to compare students. There is little evidence that schools are good at this, and I can guarantee that some schools I’ve worked at would do a terrible job. However, if we removed this part of the process, grades given in line with the statistical model would be ignoring everything that happened during the course. Few people would argue that this should happen, so this hasn’t been debated anywhere near as much as other sources of error. But for individual students convinced their grades are wrong, this is likely to be incredibly important. Despite what I said about the problems with A*s and Us, a lot of students who missed out on their CAG of A* will have done so because they were not highly ranked, and a lot of students who have got Us will have done so because they were ranked bottom and any “error” could be attributable to their school rather than an algorithm. 

Finally, we have the small cohorts problem. There’s no real way round this, although obviously plenty of technical debate is possible about how it should be dealt with. If the cohort was so small that the statistical model would not work, something else needs to be done. The decision was to use CAGs fully or partially, despite the fact that these are likely to have been inflated. Inflated grades are probably better than random ones or ones based on GCSE results. But this is also a source of inaccuracy. It also favours centres with small cohorts in a subject and, therefore, it will allow systematic inaccuracy that will affect some institutions very differently to others. It is the likely reason that CAGs have not been adjusted downwards equally in all types of school. Popular subjects in large sixth forms are likely to have ended up with grades further below CAGs than obscure subjects in small sixth forms.

Which claims about the process are false or unproven

Much of what I have observed of the debate about how grades were given has consisted of calls for grade inflation disguised as complaints about inaccuracy, or emotive tales of students’ thwarted ambitions that assume that this was unfair or unusual without addressing the cause of the specific disappointment. As mentioned above, much debate has blamed everything on an “algorithm” rather than identifying what choices were made and why. Having accepted the problems with predicting grades and acknowledged the suffering caused by inaccuracies, it’s still worth trying to dispense with mistaken, misleading or inaccurate claims that I have seen on social media and heard on the news. Here are the biggest myths about what’s happened.

Myth 1: Exam grades are normally very accurate. A lot of attempts to emphasise the inaccuracies in the statistical model have assumed that there is more precision in exam grades than there actually is. In reality, the difference between a B grade student and a C grade student can be far less than the difference between two B grade students. Some types of exam marking (not maths, obviously) are quite subjective and carry a significant margin of error, making luck a huge factor in what grades are given. Add to that the amount of luck involved in revising the right topics, or having a good day or a bad day in the exam, and it’s no wonder grades are hard to predict with accuracy. It’s not comforting to think that a student may miss out on a university offer because of bad luck, but that is not unique to this year; it is normal. The point of exam grades is not to distinguish between a B grade and a C grade, but between a B grade and a D grade or even an E grade. It’s not that every A* grade reflects the top 7.7% of ability; it’s more a way of ensuring that anyone in the top 1%, say, should get an A*. All grades are a matter of probability, not a definitive judgement. That does not make them useless or mean that there are better alternatives to exams, but it does mean everyone should interpret grades carefully every year.
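To see how much noise there can be around a boundary, here is a toy simulation with invented grade boundaries and an invented amount of marking and exam-day noise. It is not calibrated to any real exam; it just shows that a student sitting near a boundary is far from certain to get the “right” grade:

```python
# A toy simulation of marking/exam-day noise, to illustrate why a grade is a
# probability statement rather than a precise measurement. The boundaries,
# mark scale and noise level are all invented for the example.

import random

BOUNDARIES = [(80, "A"), (70, "B"), (60, "C"), (50, "D"), (40, "E"), (0, "U")]

def grade(mark):
    for boundary, g in BOUNDARIES:
        if mark >= boundary:
            return g
    return "U"

def simulate(true_mark, noise_sd=5, trials=10_000):
    """How often does a student with a given underlying mark land on each grade?"""
    counts = {}
    for _ in range(trials):
        observed = true_mark + random.gauss(0, noise_sd)
        g = grade(observed)
        counts[g] = counts.get(g, 0) + 1
    return {g: n / trials for g, n in sorted(counts.items())}

# A student sitting just above the B/C boundary is far from certain to get a B.
print(simulate(true_mark=72))
```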

Myth 2: CAGs would have been more accurate.

As mentioned above, CAGs were higher than they should have been, based on the reasonable assumption that a year group with an interrupted year 13 is unlikely to end up far more able than all previous year groups. There’s been a tendency for people to claim that aggregate errors don’t tell us anything about inaccuracies at the level of individual students. This is getting things backwards. It is possible to have inaccuracies for individual students that cancel each other out and aren’t visible at the aggregate level. So you could have half of grades being too high, and half too low, and on average the distribution of grades seems fair. You could even argue that this happens every year. But this does not work the other way. If, on average, grades were too high, it does tell us something about individual grades. It tells us that they are more likely to be too high than too low. This is reason enough to adjust downwards if you want to make the most accurate predictions.
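A toy simulation with invented numbers makes the asymmetry clear: unbiased errors can cancel out in aggregate while still being wrong for individuals, but once the errors are biased upwards, any individual grade is more likely to be too high than too low:

```python
# A toy check of the argument above. All numbers are invented for
# illustration: errors are measured in "grades", and anything more than half
# a grade out is counted as wrong in that direction.

import random

def error_summary(bias, sd=1.0, n=100_000):
    errors = [random.gauss(bias, sd) for _ in range(n)]
    too_high = sum(e > 0.5 for e in errors) / n   # more than half a grade too high
    too_low = sum(e < -0.5 for e in errors) / n   # more than half a grade too low
    return too_high, too_low

for bias in (0.0, 0.7):
    high, low = error_summary(bias)
    print(f"bias={bias}: {high:.1%} too high, {low:.1%} too low")
```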

Myth 3: Individual students we don’t know getting unpredicted Us and not getting predicted A*s are examples of how the statistical model was inaccurate.

As argued above, the statistical model is likely to have been inaccurate with respect to the extremes. However, because we know CAGs are also inaccurate, and that bad rankings can also explain anomalies, we cannot blindly accept every story about this from kids we don’t know. I mention this because so much commentary and news coverage has been anecdotal in this way. If there were no disappointed school leavers, that would merely tell us that the results this year were way out compared to what they should have been, because disappointed school leavers are normal when exam grades are given out. Obviously, the better you know a student, the more likely you are to know a grade is wrong, but even then you need to know their ranking and the justification for the grade distribution to know the statistical model is the problem.

Myth 4: The system was particularly unfair on poor bright children.

This myth seems to have come from two sources, so I’ll deal with each in turn.

Firstly, it has been assumed that, as schools which normally get no A*s would not be predicted A*s (not quite true), poor bright kids in badly performing schools will have lost out. This misses the fact that, even with little history of getting A*s previously, they might still be predicted if the cohort has better GCSE results than usual, so the error is less likely if the poor bright kid had good GCSEs. It also assumes that it is normal for poor kids to do A-levels in institutions that get no A*s, which is unlikely for big institutions. Additionally, schools are not uniform in their intake. The bright kid at a school full of poor kids who misses out is not necessarily poor; in fact, because disadvantaged kids are likely to get worse results, they often won’t be. Finally, it’s not just low achieving schools whose A* students are hard to predict. While a school that usually gets no A*s in a subject, but which would have got one this year, makes for a more dramatic story, the situation of that child is no different from that of the lowest ranked child in a school that normally gets 20 A*s in a subject and this year would have got 21.

The second cause of this myth is statistics about downgrading from CAGs like these.

Although really this shows there’s not a huge difference between children of different socioeconomic status (SES), it has been used to claim that poorer students were harder hit by downgrading and, therefore, that it is poor bright kids who will have been hit worse than wealthier bright kids. (Other arguments have looked at type of school, but I’ll deal with that next.) Whether this figure is a result of the problem of small cohorts, or of the fact that it is harder to overestimate higher achieving students, I don’t know. However, we do know the claim that these figures reflect what happened to the highest achieving kids is incorrect. If we look at the top two grades, the proportion of kids who had a high CAG and had them downgraded is smaller for lower SESs (although, because fewer students received those grades overall, the chance of being downgraded given that you had a high CAG would show the opposite pattern).
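A made-up numerical example shows how both things can be true at once: a smaller share of the whole lower SES group can have a high CAG downgraded, even while the downgrading rate among those who actually had a high CAG is higher for that group. All the numbers below are invented:

```python
# Invented numbers to illustrate the distinction drawn above: the share of a
# whole group whose high CAG was downgraded can be lower for the poorer group,
# while the chance of downgrading *given* a high CAG is higher for it.

groups = {
    # group: (cohort size, high CAGs awarded, high CAGs downgraded)
    "lower SES":  (1000, 100, 40),
    "higher SES": (1000, 300, 90),
}

for name, (cohort, high_cags, downgraded) in groups.items():
    share_of_group = downgraded / cohort      # what a headline chart tends to show
    given_high_cag = downgraded / high_cags   # the conditional rate
    print(f"{name}: {share_of_group:.1%} of all students had a high CAG downgraded; "
          f"{given_high_cag:.1%} of their high CAGs were downgraded")
```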


Myth 5: The system was deliberately rigged to downgrade the CAGs of some types of students more than others

I suppose it’s probably worth saying that it’s impossible to prove beyond all doubt that this is a myth, but I can note the evidence is against it. The statistical model should not have discriminated at all. The problem of small cohorts and the fact it is easier to over-estimate low-achieving students and harder to over-estimate high achieving students seem to provide a plausible explanation of what we can observe about discrepancies in downgrading. Also, if we compare results over time, we would expect those types of institutions who on average had a fall in results last time to have a rise this year. Take those three factors into account and nobody should be surprised to see the following or to think it sinister (although it would be useful to know to what extent each type of school was affected by downgrading and by small cohort size).

If you see anyone using only one of the above two sets of data, ignoring the change from 2018 to 2019, or deciding to pick and choose which types of centre matter (like comparing independent schools with FE colleges) suspect they are being misleading. Also, recall that these are averages and individual subjects and centres will differ a lot. You cannot pick a single school like, say, Eton and claim it will have done well in avoiding downgrading in all subjects this year.

Now for some general myth-busting.

The evidence shows students were affected by rounding errors. False. Suggestions like this, often used to explain unexpected Us, seem entirely speculative and not necessary to explain why students have got Us.

Some students got higher results in further maths than maths. True. Still a tiny minority, but much higher than normal.

No students at Eton were downgraded. Almost certainly false. This claim, which was all over Twitter, is extremely unlikely; it has been denied anecdotally and there is no evidence for it. We would expect large independent schools to have been downgraded in popular subjects.

Something went wrong on results day. False. Things seem to have gone according to plan. If what happened was wrong it was because it was the wrong plan. Nothing surprising happened at the system level.

Students were denied the grades they needed by what happened. True for some students, but on average there is no reason to think it would have been more common to miss out on an offer than if exams had taken place, and some institutions might become more generous, if they can, due to the reduced reliability of the grades.

Results were given according to a normal distribution. False.

Rankings were changed by the statistical model. False. Or at least if it did happen, it wasn’t supposed to and an error has been made.

The stressful events of this year where exams were cancelled show that we shouldn’t have exams. False. Your logic does not resemble our earth logic.

And one final point. So many of the problems above come down to small cohort size, that next week’s GCSE results should be far more accurate. Fingers crossed. And good luck.


Grade inflation is not the way to resolve an exam kerfuffle

August 13, 2020

This year, it was decided that exams would be cancelled due to COVID-19, and grades for years 11 and 13 in England (and, as I now know from the news, for Higher students in Scotland) would be decided by a mixture of centre assessed grades (CAGs) and a statistical model based on rankings provided by centres. Both elements of this have their limitations, and that is why a combination is necessary. It remains to be seen how effectively this will be done. In England, I suspect it will work well for GCSEs, but I’m not sure about A Levels. In Scotland, the Scottish government gave in to pressure and accepted CAGs as grades, despite them being much higher, with results this year now being massively different from previous years. There is a widespread misconception that in normal years exams represent an objective standard and luck does not play a role in allocating grades. For people who believe this, this year’s system is completely broken no matter how accurately it might predict what students would have got. Moreover, there is also a belief that when an exam system has a problem, grade inflation is a solution.

I would argue that inaccurate grades create their own problems, and that honesty, by which I mean maximising accuracy in predictions, is the best policy. I am aware that there are unavoidable difficulties. Schools and individuals whose success (or failure) this year is unprecedented will not get the grades they would have got. I’ve also worked in schools where assessment was poor, and I hate to think how their rankings will be compiled. But for large cohorts, CAGs will not be more accurate than a model that corrects the tendency towards over-estimation. It flies in the face of mathematics to deny that if grades are inflated, they are less likely to be accurate, although there appear to be many involved in education who claim a large systematic bias in a single direction is not a source of inaccuracy. It’s been reported that A-level grades at A-A* would have gone up from 27% to 38% if CAGs had been used. Nobody can argue that such grades would have been accurate.

Grade inflation is not a victimless crime. It does have real, negative effects. Firstly, devalued grades take opportunities away from those who have received them in the past, as their grades start to be interpreted according to lower standards. Secondly, inflated grades create inconvenience for employers and educational institutions who will find them harder to interpret. Thirdly, some of those who receive grades they never would have achieved without grade inflation will find themselves in courses and jobs for which they are unsuitable. Fourthly, if the rate of grade inflation is not uniform across the system, some will lose out relative to their peers. This is particularly noticeable in Scotland, where there is evidence that grades were inflated more for some socio-economic groups than others. Finally, students in the following year will lose out if the higher pass rates are not maintained, particularly if students can defer for a year before going to university. I would expect there to be pressure in Scotland to keep the much higher pass rates from this year for next year – although a cynic might wonder whether such pressure is easier to resist further away from an election.

There is also a bigger picture here. This might seem like a one-off event, but this is not the first exam kerfuffle for which some have advocated massive grade inflation as a solution. When a new modular English GCSE exam resulted in grade boundaries moving drastically in 2012, there were those who advocated a huge shift in C grade pass rates. When grades are revalued, the direction is almost always the same: more passes without any underlying improvement in achievement or ability. Recent stability in pass rates is the exception, not the norm. It has only been achieved through a deliberate policy effort to hold the line after years of constant grade inflation. If we discard this policy this year, it will be easier to abandon it in other years too.

Whether or not grading goes well today and next Thursday (and I know some will inevitably lose out compared with exams), we would be fools to give up on maintaining the value of grades.

An additional couple of notes.

Firstly, good luck to all students (and their teachers) getting results today and next week. Secondly, the grade allocation might go completely wrong, but remember, anomalies will be reported from schools even if it goes really well. Don’t jump to conclusions when the first angry school leaders appear on the news or on social media. We won’t know if there’s a problem for certain until somebody checks the maths for those schools, which is easier said than done.


Mock results are not a good prediction of final exam grades

August 12, 2020

The government has announced last minute plans to let students use their mock exam result as a grade this year following the cancellation of exams. However, I have just heard Nick Gibb say mocks could be used for an appeal, so maybe the proposal is not what we thought. Just in case, I’ll explain now why it would be insane to allow mocks to count, for the following reasons.

  1. There is no consistent system of doing and recording mock exam results, with schools doing drastically different things. Schools would definitely have done them differently if this had been on the cards.
  2. Mock exams don’t have official grade boundaries. Schools just make up the boundaries.
  3. Some schools deliberately play down mock results; some even play them up. It’s completely unfair for such arbitrary decisions to have any effect on students.
  4. Some students with private tutors “accidentally” see the paper before sitting the mock exam. Schools then have to sort out how a child surprisingly got almost everything right, sometimes on topics that they’ve never studied.
  5. This new system creates a precedent. Schools will want to have dodgy over-inflated mock results on the system in future.
  6. Schools do mocks at completely different times of the year so they are not comparable between schools.
  7. Nobody wanted this. I’d bet Ofqual don’t want this.
  8. Some subjects, like A level English literature and language, have very long exams which might not be practical to do rigorously as mocks. (And let’s not even mention art A-level)
  9. Schools have already done teacher assessed grades. While these are unlikely to be reliable, there is no reason they should be less accurate than mock exams.
  10. Making last minute decisions like this makes the job harder for everyone.

Update: It does appear to be the case that mocks will only be used for appeals. Looks like last night’s announcement was incorrect, thank goodness.
