A Note on Exams

September 22, 2012


Formal assessment is not a subject I have ever studied, and I have only had reason to think about it in detail because of the bizarre arguments over English GCSE. These have made me realise how differently some people see exams. So, to begin with, I think it’s worth going over these fundamental issues, but be warned: this is based entirely on personal experience of the debates, not on any actual research into the theory or practice of formal assessment.

As far as I am concerned, exams are to test ability. I accept that it is both inevitable and perfectly acceptable that they will reflect effort as well and when I talk about ability in this blog, you can assume that I have accepted the point that this actually means ability mediated by effort. This, however, is all that they should be looking for. If an exam allows comparison of ability then it functions. If outcomes become sufficiently detached from ability then the system is breaking down.

Now an exam system requires standardisation, a way of comparing ability levels even when people have taken different exams. If no way is found to do this from one year to the next then we have grade inflation (in theory we could also have grade deflation, but that does not actually seem to happen). The key priority, therefore, has to be that the difficulty involved in getting a particular grade is consistent between exams. This, perhaps obvious, point is often missed in debate over exams. People talk as though they can identify what deserves a C without identifying how difficult it might be to pull off at a particular time or in particular circumstances or what level of genuine ability it will require. Debate becomes easier once we accept that it is difficulty that must remain constant. And this is not just about results from year to year; within our system we also require standardisation between exam boards and tiers of entry. There are several different approaches to these issues.

The classic method is norm-referencing. This is where roughly the same proportions get each grade each year. This does not really help us standardise between tiers or boards, but it is objective and does help address standards over time. The problem is that it does not work if genuine changes in cohort ability occur. People often use this as an excuse to dismiss it out of hand. Countless people have asked questions along the lines of: “well what if everyone simultaneously became as smart as last year’s A* students?” Yes, this would make the effects of norm-referencing unfair, but it is a bizarre way to approach anything. You might as well ask an economist if their measures of inflation would work if everyone spent all their money on radishes or complain that a thermometer wouldn’t work if you threw it into the heart of the sun. It is no argument against anything that it wouldn’t work in an impossible situation. The fact is that ability levels don’t change much from year to year. There are, nevertheless, difficulties that may occur. Firstly, in a complicated exam system there may be changes in the cohort entered for particular exams. This is best dealt with by looking at other data, say KS2 results, so as to adjust results. This is the basis of the “comparable outcomes” approach used in our exams this year. Secondly, it does not give a huge amount of guidance for comparing between different exam boards and tiers. Again, other data like KS2 scores can be used. Thirdly, it is likely to be less effective over the long term. While there may be little change in cohort ability from one year to the next, there are likely to be changes from one decade to the next. Even IQ scores change on that basis. Despite this, if other data is used to check for changes in ability, it is the most objective method of maintaining the level of difficulty, and the easiest method to use in the short term.
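For anyone who finds it easier to see the mechanics written down, here is a minimal sketch of plain norm-referencing in Python. The band percentages and the function names are my own illustrative assumptions, not the proportions or procedures any exam board actually uses, and the sketch leaves out the KS2-style adjustment that “comparable outcomes” adds.

```python
from typing import Dict, List, Tuple

# Hypothetical bands: (grade, percentage of the cohort), best grade first.
# These percentages are illustrative assumptions and must sum to 100.
BANDS: List[Tuple[str, int]] = [
    ("A*", 5), ("A", 10), ("B", 20), ("C", 30), ("D", 20), ("E", 15),
]


def grade_boundaries(marks: List[int],
                     bands: List[Tuple[str, int]] = BANDS) -> List[Tuple[str, int]]:
    """Return the lowest mark that earns each grade for this cohort."""
    if not marks:
        raise ValueError("cannot set boundaries for an empty cohort")
    ranked = sorted(marks, reverse=True)
    n = len(ranked)
    boundaries = []
    cumulative = 0
    for grade, share in bands:
        cumulative += share
        # Number of candidates who should reach this grade or better,
        # rounded up using integer arithmetic and kept within 1..n.
        count = max(1, min(-(-cumulative * n // 100), n))
        boundaries.append((grade, ranked[count - 1]))
    return boundaries


def assign_grades(marks: Dict[str, int]) -> Dict[str, str]:
    """Give each candidate the best grade whose boundary their mark reaches."""
    boundaries = grade_boundaries(list(marks.values()))
    grades = {}
    for name, mark in marks.items():
        for grade, boundary in boundaries:
            if mark >= boundary:
                grades[name] = grade
                break
    return grades


# A made-up cohort of 20 candidates with marks 1 to 20.
cohort = {f"s{i}": i for i in range(1, 21)}
grades = assign_grades(cohort)

# An "easier paper": everyone scores 10 marks more, so the rank order is unchanged.
easier_paper = {name: mark + 10 for name, mark in cohort.items()}
grades_on_easier_paper = assign_grades(easier_paper)
```

Because the boundaries are set from the rank order, the easier paper produces exactly the same grades as the original one. Candidates on the same mark always share a grade, so a band can end up slightly larger than its nominal percentage.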

There are a couple of alternative approaches which seem to make sense but actually do not maintain consistency in the level of difficulty and are behind a lot of the problems with exams. Firstly, we have the idea that there is a particular performance that simply has to be replicated in order to get a grade. This seems to make intuitive sense and its advocates will usually use a series of analogies. The 100m world record will always be for running 100m. There will not be days when it is okay to run a couple of people over on the driving test. These are simply tests of performance, not difficulty. However, there are key differences. Both athletic world records and driving tests set a simple standard which you only need to beat once. If circumstances make them difficult to achieve on one occasion, then you can do them again. The standard is one to be met on your best day and so it doesn’t matter if, say, the weather makes it harder to achieve on one particular day. Exams are not like that. Expense and the numbers involved mean that there is a huge problem with having an exam where everybody is disadvantaged, whereas there is no problem in a 100 metres race where all the competitors are performing well below their personal best because of conditions. Even more importantly, it is in the nature of intellectual accomplishments that they involve more than repetition. Watching Usain Bolt run will not make you a world record breaker. Watching somebody else pass a driving test will be of only marginal advantage in helping you pass yours. Watching somebody else answer an exam question tells you how to answer it (providing you understand what they wrote). This is why questions change each year. There can never be a simple, repeatable grade C performance; if there were, it would become easier to achieve the more it was repeated and the more information about it was released, and this would erode standards over time.

Despite this, I have repeatedly heard people defend grade inflation on the grounds that, as teachers have been able to better prepare students for exams (i.e. teaching to the test), grades should go up to reflect this better performance. This is to miss the importance of measuring ability.

The second problematic alternative to norm-referencing is to have a list of criteria which have to be met for a grade. This is very common and is embedded in the National Curriculum, which has some pretty arbitrary attempts to match grades with criteria. This box-ticking approach is unworkable for two reasons. Firstly, the same criteria can be met in difficult or easy ways, meaning they do not preserve difficulty. There are difficult and easy books to read and understand. There are difficult and easy long multiplication questions. There are difficult and easy analyses of historical events. Criteria that were capable of distinguishing between all levels of difficulty within a topic would be unmanageable. Secondly, if the initial criteria are botched then they tend to be ignored. So, for instance, the maths National Curriculum says “mental recall of multiplication facts up to 10×10 and quick derivation of corresponding division facts” is Level 4, i.e. the level of the average 11-year-old, which I believe is equivalent to grade F. However, if you ask a few 16-year-olds for the factors of 56, you will quickly realise that it is not actually achieved by even the average 16-year-old, and that includes many with grade C in maths. Criteria are convenient lies used mainly by those who don’t believe in grades and levels and would sooner have endless lists of what students can or cannot do used for assessment. Nevertheless, the myths that exams measure the genuine meeting of criteria, or the repetition of a particular performance, have been trotted out again and again by the apologists for grade inflation, making sensible debate about what is going on in the exam system impossible.

Finally, there is another method that works. This is getting experienced and academically able teachers to look over the exam and judge how difficult it is. I suspect this, rather than norm-referencing, is actually the key to those exams in the world that have maintained their standards, even those that claim to be criteria-based. It is the only thing that is going to make a difference over time. The subjective judgement of the genuine expert is probably the gold standard when it comes to maintaining the level of difficulty in exams. The problem is that, while some exams taken by small numbers of students might be managed by experienced experts, the mass of the exam system, particularly the GCSEs like maths and English which everybody takes, is not managed in this way. This is because we have an education system where authority does not depend on either experience or academic ability. There are people running schools, quangos and consultancies or working in education departments at universities, who will swear on their mothers’ lives that an easy question is hard and a hard question is easy. In this respect the inability to maintain standards in exams is a symptom of the inability to maintain professional standards in a bureaucracy. This may be the root cause of our current difficulties, and until it is resolved a variation on norm-referencing, such as “comparable outcomes”, is the best we can hope for.


  1. Stop worrying about this. Just give the top 5% of candidates an A*, the next 10% an A, etc.

  2. Yes OA, a good explanation.

    My preference would be similar to Jonathan’s but I would allow experienced experts to have the authority to alter the % groups by +/- 5%.

    This would retain difficulty but hopefully overcome the decade to decade intelligence variations you mentioned.

    And yes, let’s ditch criteria-based assessment once and for all, as it’s time-consuming, misleading nonsense at its worst.

  3. Now I have to disagree with a return to norm referencing. That is not the way it works in France, Germany or Denmark for example. There is an expectation that as many children as possible will achieve a certain standard, albeit at 18 rather than 16. There should be an expected standard in English for example that children can spell, punctuate and write a coherent sentence.

    • Just to give some idea of what you are up against. About the time the current slide began I was still teaching. I recall very well sitting through a presentation by the English Adviser for a major UK local authority who showed a series of sample “essays” which were illiterate and illegible on any level. Step by step, she pointed out what the child was “trying” to say and instructed us how they should be rewarded accordingly. This politically motivated determination to find/imagine anything to reward, instead of setting appropriate goals and helping children reach them, has become endemic in UK education. Until the profession acknowledges it has completely lost its way: that teachers are not social engineers, the mess will continue.

  4. I do think Gove will need to say how he will measure rises in standards while actively stopping grade inflation. I’m happy for him to stop grade inflation but I’m not sure how he plans to measure the ever hoped for improvement in real standards, especially in Maths and English.

  5. Heather & Chestnut,

    You both make good points I think.

    I believe our GCSE boundaries should result in:

    10% A, 20% B, 40% C, 20% D, 10% E-G

    but that we make the C grade sufficiently demanding so it’s somewhere around the old O-level grade C.

    Exemplar Indicators/criteria to be decided by experienced experts;

    Students, under exam conditions, should be able to:

    In English: Write a fluent fictional essay with reasonable standards of grammar and spelling. To be able to write a flawless job application. To be able to answer questions about a classical novel or poem with reasonable accuracy. To be able to complete comprehension exercises with reasonable skill.

    In maths: Be able to do longhand addition, subtraction, multiplication and division of large numbers and decimals. To be able to use a scientific calculator. To be able to do simple algebra involving 5 variables, brackets and factorisation. To be able to do simple calculations involving graphs, matrices, %, statistics. To be able to use basic geometry and sin/cos/tan. To be able to apply maths to a wide variety of real world examples.

    Science: well I shan’t go on, you get the picture.

    It’s my belief that at least 70% of UK children can innately reach these standards at least.

    The fact that they do not means something has gone wrong.

    So they should make the exam rigorous and make a C a true C, denoting independent competence in a subject. And if that makes the national pass rate drop by 30% then so be it.

    Then Gove can measure success by seeing how many kids can be pushed to ‘new C’ standards under the new system.

    Eventually, if successful, we will have a system that will perpetuate the approximate cohort divisions representing normal distribution of talents and also have a rigorous system of assessment.

  6. I agree with you. We need exams to genuinely differentiate the ability of those who take them. Steady grade inflation has made this impossible with GCSEs. A drastic change is needed, and norm referencing is a good starting point. But the elephant in the room is what you do with the tail end of the bell curve. These are real students. GCSEs meant that there was a chance that they could come out of education with something to show for all of that double maths agony.

  7. “norm referencing. That is not the way it works in France…”
    No, but there are selection criteria that would make norm referencing blush.
    First, at collège (14/15) exit level there is an examination (Brevet des collèges). The next stage would be to proceed to either a vocational education institution or a lycée. It is at the lycée at 18 that the students will sit their baccalauréat, of which I think there are currently five versions. Around 80% pass the bac, with which comes the right to go on to university. However, after the first year there is a clear out. A friend who marked the first year exams at a fashionable institution told me that 256 students failed their first year exams (not sure of the percentage but it was very high), of which only four passed their resit in September.
    University is only one part of HE in France. There are also the Grandes écoles, entrance to which is normally via a “prépa”. Only top scoring students will attend these highly competitive schools. A son of a friend who was ranked 7th in the county obtained a place at one of these schools and is finding it tough going. He will have to score well to continue on to a grande école.
    And teachers? They are chosen by concours (competitive exam), the CAPES for lower education and the agrégation for higher; not many pass this, around 10% I believe.

    • Oh dear, sorry about the grammar.

  8. My take on all of this is that we are in the business of teaching kids (and/or adults) and assessment simply needs to indicate to us what kids know about a particular concept/topic, whether they understand a particular concept/topic/procedure and whether they are able to apply what they know/understand to problems. Maybe it is useful to compare one kid with another to see who knows more, who understands more or who is better at solving problems but maybe not.

    Assessment is assessment, always has been and always will be. Whether the kid knows/can do is a fact(ish) with or without norm or criterion referencing. I sort of agree with much of what OA has said, although my understanding is that the research on relying on the views of acknowledged experts, when looking at both examination papers and marking scripts, does not provide much support for this approach.

    I don’t much like norm referencing as a way to describe how much a kid knows/can do and given the choice would come down on the side of criterion referencing. I would have thought that in 2012 we could come up with some sort of computer managed system for describing in precise detail what a kid knows and can do.

    Finding ways to hold teachers accountable and enable administrators to manage the process in minute detail in a way that describes the performance of a kid, teacher, school, county, country as a single number in a way that is meaningful and informative is however I feel a different matter. It is in the grey area where these two issues intertwine, where the politics meets the learning transaction that the whole thing founders.

    I fear that since Plato it has always been thus and that in the future it will always be so. Unless you can afford a decent private education for your kids.

    Another fascinating post from OA about the things that really matter in education which has made me think, and at my age and in my current teaching role that is quite an achievement.

  9. Would there be any merit in making part of the exam the same as last year and part of it new? The bit that was the same could be used to normalise the results of the bit that was different.
    There would be no advantage in schooling the students in last year’s questions since that would only serve to raise the bar for the new part of the test.
    Just a thought.

  10. This is essential background for the claim that GCSE is criteria based

