Statistical Data and the Education Debate Part 2: Why we can reach conclusions from limited data

June 13, 2013
I have brought this post forward as I have just seen a number of people react to this OFSTED report by making some of the errors described here.
As I said last time, people often think probability can be left out of evidence-based decision making entirely. The most common version of this is when we dismiss people’s descriptions of their experiences as unrepresentative or (perversely) anecdotal. Probability is at the heart of how we reason from evidence (particularly limited evidence) to more general conclusions. If we see something happen, then (unless we are mistaken) it is impossible that it actually never happens; it is less probable that it is rare; and it is more probable that it is common. The more often we see something happen, and the more people we know who also see it happen, the more unlikely it is to be rare and the more likely it is to be common.
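To make that updating logic concrete, here is a quick sketch in Python. The figures are purely illustrative assumptions of mine: I suppose the thing in question either happens “rarely” (on 5% of the occasions we might observe it) or “commonly” (50%), with even odds between the two hypotheses to start with.

```python
# Illustrative sketch: updating two rival hypotheses -- "rare" vs
# "common" -- each time we observe the event. The 5% and 50% rates and
# the 50/50 prior are assumptions for illustration, not measured values.

def probability_rare(sightings, p_rare=0.05, p_common=0.5, prior_rare=0.5):
    """P(the event is rare | we saw it on `sightings` separate occasions)."""
    like_rare = p_rare ** sightings      # how surprising the sightings are if rare
    like_common = p_common ** sightings  # how surprising they are if common
    evidence = prior_rare * like_rare + (1 - prior_rare) * like_common
    return prior_rare * like_rare / evidence

for n in [1, 3, 5, 10]:
    print(n, round(probability_rare(n), 8))
# Each extra sighting makes "it is rare" less credible, and "it is
# common" correspondingly more credible.
```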
The use of probability to go from a limited set of data to a more general claim is part of how opinion polling works. Although opinion polls don’t ask everyone in the relevant population for their opinions, if they ask enough people, and there is no reason to think those people are unrepresentative of the wider population, then the opinions they express to the pollsters are likely to be close to the opinions of the population as a whole. Opinion polls usually give a margin of error indicating just how close to the opinions of the entire population their numbers are likely to be, and this is calculated from the number of people in the pollsters’ sample.
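For an estimated proportion, the usual 95% margin of error is roughly 1.96 × √(p(1−p)/n), which is at its widest when p = 0.5. A short sketch (my own illustration, not any particular pollster’s method):

```python
import math

# The textbook 95% margin of error for an estimated proportion. Note
# that only the sample size n appears in the formula, not the size of
# the population being polled.

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(1000), 3))  # ~0.031: the familiar "plus or minus 3%"
```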
Now, our reasoning is affected if our tendency to see things happen, or the pollsters’ way of finding people to poll, is not random: the results are likely to be less accurate. But while some sort of bias in how the sample was chosen might affect the probabilities involved, it remains a matter of probability; bias changes the probability, it doesn’t mean we know nothing at all. Biased sampling makes polling less reliable, but not necessarily so unreliable that it tells us nothing. The same goes for small samples. While asking fewer than a thousand people might make it far less likely that an opinion poll represents the opinions of the whole population to the nearest 3%, it might still tell us within 10% or 20%, and if it is claimed that “nobody” or “hardly anybody” or “only a minority” of people think something, then that might be enough to settle the matter.
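Using the same formula as above, a sample of 100 gives a margin of about 10% and a sample of 25 about 20%; and an extreme claim can be tested directly. For instance (my figures, purely illustrative): if only 5% of people really thought something, a sample of 50 in which 15 or more said they did would be astonishing.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(100), 2))  # ~0.10: within 10%
print(round(margin_of_error(25), 2))   # ~0.20: within 20%

# Exact binomial tail: the chance of 15 or more "yes" answers in a
# sample of 50 if only 5% of the population would really say yes.
p, n = 0.05, 50
tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(15, n + 1))
print(tail)  # around 1e-8 -- small enough to settle the matter
```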
Once the role of probability in interpreting data is understood, we need to be careful about how easily opinions or experiences are dismissed. There are those who reject any survey evidence outright because it covers only a tiny proportion of those who could have been asked. This is a big mistake. Because polling is based on probability, then, given random sampling, the size of the total population is not usually a major factor in the accuracy of a poll: 3,000 people is a very good sample of 5 million people or of 2 billion people.
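The reason the population size drops out is that the margin of error formula involves only n. There is a “finite population correction” of √((N−n)/(N−1)), but it is almost exactly 1 whenever the population N dwarfs the sample. A quick check:

```python
import math

# Margin of error with the finite population correction applied: the
# correction barely moves the answer when N is much larger than n.

def moe(n, N, p=0.5, z=1.96):
    fpc = math.sqrt((N - n) / (N - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

print(round(moe(3000, 5_000_000), 5))      # ~0.01789
print(round(moe(3000, 2_000_000_000), 5))  # ~0.01789: indistinguishable
```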
A more common error is to assume that a small number of data points must tell us nothing, which brings us back to how easily the genuine experiences of real people are dismissed as “anecdotal” or “unrepresentative” because they are not based on information gathered from a large random sample. Apparently, seeing something frequently (like poor behaviour or bad management in schools) is no reason to think it is commonplace.
I think the best example I can give of the usefulness of even a small sample is to imagine testing a coin to see whether it is biased towards landing on a particular side when thrown. Now, the population of possible coin throws is potentially infinite, and if the coin were kept for the purpose of coin throwing then the actual number of throws could be enormous. Yet if you were testing it for bias and it landed on heads every single time, how many throws would it take to convince you it was biased? A million throws would not prove it with certainty; there is a tiny probability that an unbiased coin could land on heads a million times. But you could be sure beyond reasonable doubt long before that. You wouldn’t even need a sample on the scale of an opinion poll’s, say 1,000 throws. The chance of getting heads every time when throwing an unbiased coin 10 times is 1 in 1,024. The chance of getting heads every time when throwing an unbiased coin 5 times is 1 in 32, which, statistically speaking (about 3%, below the conventional 5% significance threshold), makes throwing 5 heads out of 5 a reliable indicator that the coin is biased.

Now, all this hinges on the strength of the result. It would take a lot more throws to determine whether the coin was biased if there were a minority of tails among the heads; but the more consistent a result is, the less likely it is to have occurred by chance. On the other hand, if the claim to be disproved was not that the coin was unbiased, but that it was biased towards tails to some stated degree, it would take even fewer throws to show this to be implausible. It also hinges on there being nothing biased about which throws are recorded. If the person writing down the throws is more likely to notice when the coin lands on heads than when it lands on tails, then it might take more throws to get reliable evidence. However, if we know the probability of missing a tails, then that can be factored into the calculations too. That sort of bias, if understood, does not ruin the experiment.
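The arithmetic is easy to check for yourself:

```python
# Chance of an unbiased coin landing heads on every one of n throws.
for n in [5, 10, 20, 1000]:
    print(n, 0.5 ** n)
# 5 -> 1/32, 10 -> 1/1024, and by 1,000 throws the probability is
# vanishingly small -- long before the million throws mentioned above.

# Testing against a *stated* bias is quicker still: if the coin were
# biased 80/20 towards tails (an illustrative figure of mine), even 3
# straight heads would have probability 0.2**3, under 1%.
print(0.2 ** 3)  # 0.008
```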
Now let’s imagine a teacher chooses to teach at 5 different schools over their career and sees that behaviour is really bad in all 5. By the same maths as above, we can be reasonably confident (assuming no bias we haven’t accounted for creeps into the calculation) that behaviour is really bad in at least half of the schools that this teacher could have chosen to work at. Depending on how the schools were selected, and the opportunities the teacher had, this could also tell us about many more schools, possibly a whole sector or all schools. On this basis it is simply not unreasonable for teachers to draw conclusions about the whole system from just a handful of experiences, if those experiences are likely to be representative. Slightly different results (say, one good school) or the possibility of non-random choices of school might make the result less reliable, but going to more schools, or listening to other teachers, will increase the reliability again. And if the claim is that schools with really bad behaviour are rare (rather than making up 50% or fewer) then the reliability of that teacher’s experience as evidence against the claim goes up (or, equivalently, the number of schools needed to show the claim is unlikely goes down).
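The sum behind that claim is the same one as for the coin: if at most a proportion p of schools had really bad behaviour, the chance that 5 independently chosen schools all turn out to be bad is at most p^5 (the independence of the choices being the assumption doing the work here):

```python
# Chance that all 5 schools are bad if only a proportion p of schools
# are bad and the 5 were chosen independently of behaviour.
for p in [0.5, 0.25, 0.1]:
    print(p, p ** 5)
# 0.5 -> 1/32, 0.25 -> ~1/1000, 0.1 -> 1/100,000: the rarer bad schools
# are claimed to be, the harder it is to explain 5 out of 5 by luck.
```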
Now, the reason I have focussed on this is that one of the most common responses from the various denialists who infest the education debate is to dismiss personal experience. If the claim were that personal experiences told us about all schools, or perhaps even most schools, then there would be a problem. However, if I merely claim that the sort of thing I have seen is common, then, provided I am not deluded, I am wholly justified, speaking from my own experiences, in claiming that the sort of things I describe in my blog are common in our secondary schools.