Formal assessment is not a subject I have ever studied in any detail, and I have only had reason to think about it because of the bizarre arguments over English GCSE. These have made me realise how differently some people see exams. So, to begin with, I think it’s worth going over these fundamental issues, but be warned: this is based entirely on personal experience of the debates and no actual research on the theory or practice of formal assessment.
As far as I am concerned, exams are to test ability. I accept that it is both inevitable and perfectly acceptable that they will reflect effort as well and when I talk about ability in this blog, you can assume that I have accepted the point that this actually means ability mediated by effort. This, however, is all that they should be looking for. If an exam allows comparison of ability then it functions. If outcomes become sufficiently detached from ability then the system is breaking down.
Now an exam system requires standardisation, a way of comparing ability levels even when people have taken different exams. If no way is found to do this from one year to the next then we have grade inflation (in theory we could also have grade deflation, but that does not actually seem to happen). The key priority, therefore, has to be that the difficulty involved in getting a particular grade is consistent between exams. This, perhaps obvious, point is often missed in debate over exams. People talk as though they can identify what deserves a C without identifying how difficult it might be to pull off at a particular time or in particular circumstances or what level of genuine ability it will require. Debate becomes easier once we accept that it is difficulty that must remain constant. And this is not just about results from year to year; within our system we also require standardisation between exam boards and tiers of entry. There are several different approaches to these issues.
The classic method is norm-referencing. This is where roughly the same proportions get each grade each year. This does not really help us standardise between tiers or boards, but it is objective and does help address standards over time. The problem is that it does not work if genuine changes in cohort ability occur. People often use this as an excuse to dismiss it out of hand. Countless people have asked questions along the lines of: “well what if everyone simultaneously became as smart as last year’s A* students?” Yes, this would make the effects of norm-referencing unfair, but it is a bizarre way to approach anything. You might as well ask an economist if their measures of inflation would work if everyone spent all their money on radishes, or complain that a thermometer wouldn’t work if you threw it into the heart of the sun. It is no argument against anything that it wouldn’t work in an impossible situation. The fact is that ability levels don’t change much from year to year. There are, nevertheless, difficulties that may occur. Firstly, in a complicated exam system there may be changes in the cohort entered for particular exams. This is best dealt with by looking at other data, say KS2 results, and adjusting results accordingly. This is the basis of the “comparable outcomes” approach used in our exams this year. Secondly, it does not give a huge amount of guidance for comparing between different exam boards and tiers. Again, other data like KS2 scores can be used. Thirdly, it is likely to be less effective over the long term. While there may be little change in cohort ability from one year to the next, there are likely to be changes from one decade to the next. Even IQ scores change on that basis. Despite this, if other data is used to check for changes in ability, it is the most objective method of maintaining the level of difficulty, and the easiest method to use in the short term.
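The mechanics of norm-referencing can be made concrete with a short sketch. This is a minimal illustration, not how any exam board actually implements it, and the grade shares used are invented for the example: candidates are ranked by raw mark and grades are handed out so that each grade takes a fixed share of the cohort, whatever the raw marks happen to be.

```python
def norm_reference(marks, proportions):
    """Assign grades by rank so each grade gets a fixed share of the cohort.

    marks: list of raw scores, one per candidate
    proportions: list of (grade, share) pairs, best grade first
    """
    # Rank candidates from highest raw mark to lowest.
    ranked = sorted(range(len(marks)), key=lambda i: marks[i], reverse=True)
    grades = [None] * len(marks)
    start = 0
    for grade, share in proportions:
        count = round(share * len(marks))
        for i in ranked[start:start + count]:
            grades[i] = grade
        start += count
    # Any candidates left over by rounding fall into the lowest band.
    for i in ranked[start:]:
        grades[i] = proportions[-1][0]
    return grades

# Two cohorts with very different raw marks get the same grade distribution:
easy_year = [90, 85, 80, 70, 60, 55, 50, 45, 40, 30]
hard_year = [60, 55, 50, 40, 35, 30, 25, 20, 15, 10]
shares = [("A", 0.2), ("B", 0.3), ("C", 0.3), ("U", 0.2)]
print(norm_reference(easy_year, shares))  # same grades despite higher marks
print(norm_reference(hard_year, shares))
```

The point the sketch makes is exactly the one above: the grade boundaries float with the cohort, which keeps proportions stable year on year but, by construction, cannot register a genuine shift in cohort ability.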
There are a couple of alternative approaches which seem to make sense but actually do not maintain consistency in the level of difficulty, and are behind a lot of the problems with exams. Firstly, we have the idea that there is a particular performance that simply has to be replicated in order to get a grade. This seems to make intuitive sense and its advocates will usually use a series of analogies. The 100m world record will always be for running 100m. There will not be days when it is okay to run a couple of people over on the driving test. These are simply tests of performance, not difficulty. However, there are key differences. Both athletic world records and driving tests set a simple standard which you only need to beat once. If circumstances make them difficult to achieve on one occasion, then you can try again. The standard is one to be met on your best day, and so it doesn’t matter if, say, the weather makes it harder to achieve on one particular day. Exams are not like that. Expense and the numbers involved mean that there is a huge problem with having an exam where everybody is disadvantaged, whereas there is no problem in a 100 metres race where all the competitors are performing well below their personal best because of conditions. Even more importantly, it is in the nature of intellectual accomplishments that they involve more than repetition. Watching Usain Bolt run will not make you a world record breaker. Watching somebody else pass a driving test will be of only marginal advantage in helping you pass yours. Watching somebody else answer an exam question tells you how to answer it (provided you understand what they wrote). This is why questions change each year. There can never be a simple, repeatable grade C performance; if there were, it would become easier to achieve the more it was repeated and the more information about it was released, and this would erode standards over time.
Despite this, I have repeatedly heard people defend grade inflation on the grounds that, because teachers have become better able to prepare students for exams (i.e. teaching to the test), grades should go up to reflect this better performance. This is to miss the importance of measuring ability.
The second problematic alternative to norm-referencing is to have a list of criteria which have to be met for a grade. This is very common and is embedded in the National Curriculum, which has some pretty arbitrary attempts to match grades with criteria. This box-ticking approach is unworkable for two reasons. Firstly, the same criteria can be met in difficult or easy ways, meaning they do not preserve difficulty. There are difficult and easy books to read and understand. There are difficult and easy long multiplication questions. There are difficult and easy analyses of historical events. Criteria that were capable of distinguishing between all levels of difficulty within a topic would be unmanageable. Secondly, if the initial criteria are botched then they tend to be ignored. So, for instance, the maths National Curriculum says “mental recall of multiplication facts up to 10×10 and quick derivation of corresponding division facts” is Level 4, i.e. the level of the average 11 year old, which I believe is equivalent to grade F. However, if you ask a few 16 year olds for the factors of 56, you will quickly realise that it is not actually achieved by even the average 16 year old, and that includes many with grade C in maths. Criteria are convenient lies used mainly by those who don’t believe in grades and levels and would sooner have endless lists of what students can or cannot do used for assessment. Nevertheless, the myths that exams measure the genuine meeting of criteria, or the repetition of a particular performance, have been trotted out again and again by the apologists for grade inflation, making sensible debate about what is going on in the exam system impossible.
Finally, there is another method that works. This is getting experienced and academically able teachers to look over the exam and judge how difficult it is. I suspect this, rather than norm-referencing, is actually the key to those exams in the world that have maintained their standards, even those that claim to be criteria-based. It is the only thing that is going to make a difference over time. The subjective judgement of the genuine expert is probably the gold standard when it comes to maintaining the level of difficulty in exams. The problem is that, while some exams taken by small numbers of students might be managed by experienced experts, the mass of the exam system, particularly the GCSEs like maths and English which everybody takes, is not managed in this way. This is because we have an education system where authority does not depend on either experience or academic ability. There are people running schools, quangos and consultancies, or working in education departments at universities, who will swear on their mothers’ lives that an easy question is hard and a hard question is easy. In this respect the inability to maintain standards in exams is a symptom of the inability to maintain professional standards in a bureaucracy. This may be the root cause of our current difficulties, and until it is resolved, a variation on norm-referencing, such as “comparable outcomes”, is the best we can hope for.