
Statistical Data and the Education Debate Part 2: Why we can reach conclusions from limited data.

June 13, 2013

I have brought this post forward as I have just seen a number of people react to this OFSTED report by making some of the errors described here.

As I said last time, people often think probability can be left out of evidence-based decision making entirely. The most common version of this is when we dismiss people’s descriptions of their experiences as unrepresentative or (perversely) anecdotal. Probability is at the heart of how we reason from evidence (particularly limited evidence) to more general observations. If we see something happen, then (unless we are mistaken) it is impossible that it actually never happens, it is less probable that it is rare, and it is more probable that it is common. The more often we see something happen, and the more people we know who also see it happen, the more unlikely it is to be rare and the more likely it is to be common.
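To make that probabilistic reasoning concrete, here is a minimal sketch of my own (the event rates and the 50/50 prior are invented for illustration, not taken from the post) showing how repeated sightings shift belief between a “rare” and a “common” hypothesis:

```python
# A minimal Bayesian sketch: repeated sightings of an event shift belief between
# "it is rare" and "it is common". The rates and the 50/50 prior are illustrative
# assumptions, not figures from the post.
hypotheses = {"rare (5% of cases)": 0.05, "common (70% of cases)": 0.70}
prior = {name: 0.5 for name in hypotheses}

def posterior_after(sightings):
    """Posterior probability of each hypothesis after `sightings` independent observations."""
    weights = {name: prior[name] * rate ** sightings for name, rate in hypotheses.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

for n in (1, 3, 5):
    print(n, {name: round(p, 4) for name, p in posterior_after(n).items()})
# After one sighting "common" is already at ~0.93; after three it is ~0.9996.
```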

The use of probability to go from a limited set of data to a more general claim is part of how opinion polling works. Although opinion polls don’t ask everyone in the appropriate population for their opinions, if they ask enough people and there is no reason to think those people were unrepresentative of the wider population, then the opinions they express to the pollsters are likely to be close to the opinions of the wider population. Opinion polls usually give a margin of error, calculated from the number of people in the pollsters’ sample, indicating just how close to the opinions of the entire population their numbers are likely to be.

Now, our reasoning is affected if our tendency to see things happen, or the pollsters’ way of finding people to poll, is not random. The results are likely to be less accurate. But while some sort of bias in how the sample of people was selected might affect the probabilities involved, it still remains a matter of probability, and it can only change the probability; it doesn’t mean we know nothing at all. Biased sampling makes polling less reliable, but not necessarily so unreliable that it tells us nothing. The same goes for small samples. While asking fewer than a thousand people might make it far less likely that an opinion poll represents the opinions of the whole population to the nearest 3%, it might still tell us within 10% or 20%, and if it is claimed that “nobody” or “hardly anybody” or “only a minority” of people think something then that might be enough to settle the matter.
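As a rough sketch of my own of how the margin of error shrinks with sample size (using the standard 95% formula for a proportion near 50%; only the sample sizes and percentages above come from the post):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (25, 100, 1000):
    print(f"n = {n:4d}: about +/-{margin_of_error(n) * 100:.0f}%")
# n =   25: about +/-20%
# n =  100: about +/-10%
# n = 1000: about +/-3%
```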

Once the role of probability in interpreting data is understood, we need to be careful about how easily opinions or experience are dismissed. There are those who reject any survey evidence outright for covering only a tiny proportion of those who could have been asked. This is a big mistake. Because polling is based on probability, then, given random sampling, the size of the total population is not usually a major factor in the accuracy of a poll. Three thousand people is a very good sample whether the population is 5 million people or 2 billion.
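A sketch of my own, assuming simple random sampling, of why the total population size barely matters: applying the standard finite population correction to a sample of 3,000 gives almost exactly the same margin of error whether the population is 5 million or 2 billion.

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """95% margin of error for a sample of n drawn from a population of N,
    including the finite population correction."""
    fpc = math.sqrt((N - n) / (N - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

for N in (5_000_000, 2_000_000_000):
    print(f"population {N:>13,}: +/-{margin_of_error(3000, N) * 100:.2f}%")
# population     5,000,000: +/-1.79%
# population 2,000,000,000: +/-1.79%
```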

A more common error is to assume that a small number of data points must tell us nothing, which brings us back to how easily the genuine experiences of real people are dismissed as “anecdotal” or “unrepresentative” because they are not based on information gathered from a large random sample. Apparently, seeing something frequently (like poor behaviour or bad management in schools) is no reason to think it is commonplace.

I think the best example I can give of the usefulness of even a small sample is to imagine testing a coin to see if it is biased towards landing on a particular side when thrown. Now the population of possible coin throws is potentially infinite. If the coin was to be kept for the purpose of coin throwing then the actual number of throws could be enormous. Yet if you were testing it for bias and it landed on heads every single time, how many throws would it take to convince you it was biased? A million throws would not be enough to prove it for certain; there is a tiny probability that an unbiased coin could land on heads a million times. But you could be sure beyond reasonable doubt long before that. You wouldn’t even need the scale of an opinion poll sample, say 1000 throws. The chance of getting heads every time when throwing an unbiased coin 10 times is 1 in 1024. The chance of getting heads every time when throwing an unbiased coin 5 times is 1 in 32, which, statistically speaking, makes throwing 5 heads out of 5 throws a reliable indicator that a coin is biased.

Now all this hinges on the strength of the result. It would need a lot more throws to determine whether the coin was biased if there were a minority of tails among the heads. But the more consistent a result is, the less likely it is to have occurred by chance. On the other hand, if the claim to be disproved was not that the coin was unbiased, but that it was biased towards tails to some stated degree, it would take even fewer throws to show this to be implausible. This also hinges on there being nothing biased about the throws which are recorded. If there is a chance that the person writing down the throws is more likely to notice when the coin lands on heads than when it lands on tails, then it might take more throws to get reliable evidence. However, if we know the probability of missing a tails then that can be factored into the calculations too. That sort of bias, if understood, does not ruin the experiment.
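For anyone who wants to check those figures, the arithmetic is just 0.5 raised to the power of the number of throws. A quick sketch of mine, not part of the original post:

```python
# Probability that a fair coin lands heads on every one of n throws.
def p_all_heads(n, p_heads=0.5):
    return p_heads ** n

for n in (5, 10, 20):
    print(f"{n:2d} throws: 1 in {round(1 / p_all_heads(n)):,}")
# 5 throws: 1 in 32
# 10 throws: 1 in 1,024
# 20 throws: 1 in 1,048,576
```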

Now let’s imagine a teacher chooses to teach at 5 different schools in their career and sees that behaviour is really bad in all 5. We can be reasonably confident, by the same maths as above (assuming no bias we haven’t accounted for creeps into the calculation), that behaviour is really bad in at least half of the schools that this teacher could have chosen to work at. Depending on the way the schools were selected, and the opportunities the teacher had, this could also tell us about a lot more schools, possibly a whole sector or all schools. On this basis it is simply not unreasonable for teachers to conclude things about the whole system from just a handful of experiences, if those experiences are likely to be representative. Slightly different results (say one good school) or the possibility of non-random choices of school might make the result less reliable, but going to more schools, or listening to other teachers, will increase the reliability again. And if the claim is that schools with really bad behaviour are rare (rather than just 50% or less) then the reliability of that teacher’s experience as evidence against the claim goes up (or equivalently, the number of schools needed to indicate the claim is unlikely can go down).
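A sketch of my own spelling out the maths being appealed to, treating the 5 schools as if they were chosen at random (which is the assumption the paragraph flags):

```python
# If at most a fraction `share` of schools really had behaviour that bad, the chance
# of a randomly placed teacher seeing it in all 5 of their schools is at most share**5.
for share in (0.5, 0.25, 0.1):
    print(f"if only {share:.0%} of schools were like that: "
          f"P(all 5 bad) <= {share ** 5:.5f} (about 1 in {round(1 / share ** 5):,})")
# if only 50% of schools were like that: P(all 5 bad) <= 0.03125 (about 1 in 32)
# if only 25% of schools were like that: P(all 5 bad) <= 0.00098 (about 1 in 1,024)
# if only 10% of schools were like that: P(all 5 bad) <= 0.00001 (about 1 in 100,000)
```

So if really bad behaviour were confined to a small minority of schools, a run of 5 bad schools out of 5 would be a very unlikely coincidence, which is the sense in which even a handful of experiences carries evidential weight.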

Now the reason I focussed in on this is because one of the most common responses from the various forms of denialists who infest the education debate is to dismiss personal experiences. Now if the claim was that personal experiences told us about all schools, perhaps even most schools, then there would be a problem. However, if I merely claim that the sort of thing I have seen is common then, if I am not deluded, I can feel wholly justified, speaking from my own experiences, in claiming that the sort of things I describe in my blog are common in our secondary schools.

31 comments

  1. Reblogged this on The Echo Chamber.


  2. I’m hoping that in part 3 you’ll talk about cognitive errors and biases in observers. That’s because I was with you up to the words ‘bad behaviour’.

    Because behaviour is usually a response to an environment, it is entirely possible that a teacher could be observing behaviour that was ‘bad’ because of his/her assumptions about what constitutes ‘good’ behaviour, his/her teaching, the organisation of the school or the structure of the education system, rather than because the behaviour was ‘bad’ per se.

    If observers are using categories, the categories need to be clearly defined and when making comparisons between observers, we need to ensure they’re using the same definitions.

    Anecdotal evidence is certainly not worthless, but the failure to operationalise the metrics observers use makes it unreliable.


    • “I’m hoping that in part 3 you’ll talk about cognitive errors and biases in observers. That’s because I was with you up to the words ‘bad behaviour’.”

      I’m not going to talk about cognitive biases because, while they are worth looking for in one’s self to avoid errors, when used as a criticism of somebody else’s argument they are as invalid as any other circumstantial ad hominem.

      “Because behaviour is usually a response to an environment,”

      Serious ambiguity there. All behaviour? Or bad behaviour? Do you merely mean our decisions take account of the environment? Or do you mean our behaviour is determined by the environment?

      “it is entirely possible that a teacher could be observing behaviour that was ‘bad’ because of his/her assumptions about what constitutes ‘good’ behaviour, his/her teaching, the organisation of the school or the structure of the education system, rather than because the behaviour was ‘bad’ per se.”

      Well, yes, but so what?

      All I’m arguing is that it is statistically likely that what I call bad behaviour is common in the sort of school I can get to work in.

      That’s all. Obviously, what I call bad behaviour might be acceptable to somebody else, but that’s another discussion (and another excuse).


      • “I’m not going to talk about cognitive biases because, while they are worth looking for in one’s self to avoid errors, when used as a criticism of somebody else’s argument they are as invalid as any other circumstantial ad hominem.”

        I think we’re talking about two distinct but related issues – one is anecdotal evidence, and the other is how reliable, valid, general conclusions can be drawn from anecdotal evidence. The blogpost appears to be about the latter; your comments about cognitive bias are about the former.

        We are all biased in one way or another, and I agree that questioning the validity of somebody else’s argument on the grounds that they might be biased is the equivalent of an ad hominem criticism.

        However, when it comes to drawing general conclusions from anecdotal evidence, we know that people are susceptible to several well-documented biases in their perception of events, so it’s important to take those into account. For example, there’s attribution error – people tend to attribute successful outcomes to themselves, and unsuccessful outcomes to circumstances. And confirmation bias – we overemphasise the importance of evidence that confirms our pre-existing perceptions. We are also subject to primacy, recency and salience biases, to name but three others. That’s why human memory is so fallible.

        Those biases don’t invalidate anecdotal evidence, but they do need to be taken into account when extrapolating from a small sample to the population level.

        ‘ “Because behaviour is usually a response to an environment,”
        Serious ambiguity there. All behaviour? Or bad behaviour? Do you merely mean our decisions that account of the environment? Or do you mean our behaviour is determined by the environment?’

        Ok, behaviour is always a response to an environment but it isn’t only a response to an environment. Is that less ambiguous?

        ‘ “it is entirely possible that a teacher could be observing behaviour that was ‘bad’ because of his/her assumptions about what constitutes ‘good’ behaviour, his/her teaching, the organisation of the school or the structure of the education system, rather than because the behaviour was ‘bad’ per se.”
        Well, yes, but so what?’

        So… what would be the point of knowing only that n teachers thought there was ‘bad’ behaviour in x schools? I can’t see much point if we don’t look at what’s considered to be ‘bad’ behaviour, what causes it, and what eradicates it.

        “All I’m arguing is that it is statistically likely that what I call bad behaviour is common in the sort of school I can get to work in.”

        So why was your blog post about sampling, statistical analysis and Ofsted’s general conclusions? Why not just say that from what you’ve seen, you’re making an informed guess that you’d see the same in a lot of other schools?

        “That’s all. Obviously, what I call bad behaviour might be acceptable to somebody else, but that’s another discussion (and another excuse).”

        What’s acceptable behaviour and what isn’t, is a key issue in drawing general conclusions about ‘bad’ behaviour. Isn’t it?


        • “So why was your blog post about sampling, statistical analysis and Ofsted’s general conclusions? Why not just say that from what you’ve seen, you’re making an informed guess that you’d see the same in a lot of other schools?”

          Because I’m arguing against the same fallacy in both cases, that a small sample tells you nothing. I’m actually making fairly limited claims about what it does tell you, but I can say that claims cannot be dismissed entirely due to small sample size alone.


  3. When my sister told me that at the school she taught in she was told to f*** off most days and shot at with an air pistol, it’s important that she realises that she cannot call this ‘bad behaviour’. After all she MIGHT not be using the same definition of bad behaviour as others. Perhaps her assumptions of what constitutes good behaviour were not shared by all. And when I make assumptions that bad behaviour is common because I hear so much about schools where the teachers are regularly sworn at, I must realise I can only make assumptions of bad behaviour if I have used clearly defined categories and therefore my observations are unreliable…


  4. @ Heather F. I’m not saying that people can’t call behaviour ‘bad’; but the blog post was about statistical sampling. If everyone is using different definitions of ‘bad’ it’s difficult to draw reliable conclusions from stats.


    • Are you serious? You seem to be implying that we can’t draw conclusions about anything that isn’t clearly defined. If you feel so strongly about this, at least reassure me that you dismiss every judgement that every OFSTED inspector ever made or will make as unreliable. How, for instance, would you nail down “engaged” or “over-reliant”…or better still “disruptive”?

      Next time I’m observed I might just give them all a piece of paper with the simple instruction “make outstanding progress”. Now, I’m sure at some point they’d discuss what they were meant to be up to which would tick a few group discussion/ peer assessment boxes. I’d be demonstrating high expectations (explicitly through the three word instruction) and implicitly by trusting them to ‘independently engage’ in a complex problem solving activity ie. just wtf they were meant to be doing. But most importantly, I’d be veering safely on the side of avoiding talking to the class, which as we all know…apparently…is a huge barrier to progress.

      As for the ensuing chaos, for which I’d hopefully be taken to task, I’d be ready with my ‘obviously your judgement is flawed since you can’t actually provide me with a cast-iron, non-subjective definition of “bad behaviour”‘.

      Incidentally, do you ever go to the doctor…or would his/her inability to accurately define the concept ‘healthy’ render the trip superfluous? After all, such an explanation might involve such ideas of ‘quality of life’ and ‘normal’ and you can imagine the minefield you’d be walking into.

      By all means respond, but before you do, remember that if you want to include ‘sarcastic’, ‘ironic’, ‘absurdity’, ‘extreme’ or suchlike I shall of course require objective definitions before I consider your post reliable.


      • The various doctors I have visited over the course of my life for myself, my children and wider family have never ‘reached conclusions on limited data’ statistical or otherwise. They have also often struggled to accurately diagnose conditions. I don’t go to the doctor to be told weather I am ‘healthy’ or not as this is a very broad concept which is again open to interpretation. I also take responsibility to a certain degree to my own state of healthiness based upon my life decisions.

        Of course doctors use anecdotal evidence too – the stories and histories of their patients are vital – but these need to be put in wider contexts and weighed up with all the data available. And even then one will often not get an absolute diagnosis; the conclusion will still be based on probabilities. The blog post is about ‘reach[ing] conclusions from limited data’ couched in the language of research. One should neither deny the significance of anecdotal evidence nor base one’s conclusions or generalisations upon it alone.


        • Oh dear ‘whether’ not ‘weather.’ I blame the contextual conditions while writing. I was gazing out of the window looking at the clouds!


        • Are you taking issue with my post? I can’t actually tell.


          • To clarify: you use the example of visiting the doctor as a rebuff against logicalincrementalism’s argument. This is a weak rebuff for the reasons I give in my post.


          • Not sure it is really. I’ve seen this many times before; what one might term the ‘semantic approach’ to behaviour management, namely, getting rid of bad behaviour by calling it something else, (generally in order to place the fault with teachers…and, naturally, the less senior the teacher, the greater the degree of blame.)

            logicalincrementalism’s argument seems to me to rest upon the somewhat quotidian and blindingly obvious observation that words don’t have rigid and universally accepted meanings. That fact doesn’t actually prevent communication, however, nor does it stop us tackling a problem such as bad behaviour simply because ‘bad’ is a subjective concept. My post tried, in vain it seems, to depict the absurdity of such a world-view…by drawing an absurd analogy.


      • I’m not saying we *can’t* draw conclusions about anything that isn’t clearly defined: what I’m saying is that the conclusions we can draw will depend on how clearly defined it is. Obviously, we’ve each got our own unique definition of concepts like ‘engaged’, ‘disruptive’, ‘healthy’ or ‘normal’, but if we are pooling the views of lots of different people using those concepts without defining them clearly, there’s a limit to the conclusions we can draw. If Ofsted are looking at ‘engagement’ or ‘disruption’ and it turns out they are using a different definition to the one used by a teacher, the outcomes would be unhelpful, to say the least.

        Doctors are quite familiar with the minefield that results from not defining things clearly. Whether or not someone gets the best treatment for a medical condition often depends on precise definitions of a body part or a precise chemical compound at a specific dosage. As Keith Turvey implies, the degree of certainty doctors exercise depends on the data available.


        • Well, fair enough, but that’s the way the world is; not just education. We just have to get on with it…we can reach consensus. I can confidently assert that Einstein was a genius or the Third Reich was abhorrent in the full knowledge that ‘genius’ or ‘abhorrent’ don’t mean exactly the same thing to different people.
          I am NO LESS confident in asserting that behaviour in this country’s state schools is a significant problem and in my (not inconsidtable) is THE major obstacle to learning (with poor management and OFSTED) giving it a run for its money.

          Incidentally, I know the three factors above are intertwined and I’m aware that it may not be every school. Extrapolating from the half dozen I’ve worked in however means it’s highly probable that it’s the majority…( as I’m sure you know, the probability of 6 successive heads is 1 in 64)


          • Not inconsiderable experience…it should have said. I’ve hurt my hand.


          • I’ve really got to get off here but I can’t believe what I just omitted from my list of learning barriers: academisation. I’m not even a wholehearted dissenter; there are, after all, some truly incompetent local authorities. That said, the speed of the process and, in many cases, the dubious motives of some of those involved have already produced some disasters and we’re only in the early stages.

            I think once TESCO step in we can safely assume the narrative of liberal education in the UK is effectively over and, at the denouement, it will become clear we’d been reading a Gothic horror. Mind you, they’ll have a ready supply of low-skilled otherwise unemployable labour they can force into workfare or minimum wage drudgery. In fact, for those kids, ‘learning for life’ will at last be a reality. They’ll never have to leave school at all.


          • It’s interesting to note how we ‘just got on with it’ and ‘reached consensus’ before the development of the scientific method. Admittedly we haven’t been unremittingly successful since, but beforehand…


  5. My understanding was that OA was talking about whether it is reasonable for an individual (not a researcher doing formal research) to accept anecdotal evidence. In this context the debate isn’t so much about whether that behaviour was actually bad (although, yes, it is mind boggling what some people would argue isn’t bad behaviour) as whether someone’s anecdotal evidence can be dismissed out of hand, as often happens on twitter and similar forums.


  6. Yes, I think there is value in personal experience but, like the poster above, I agree it is important to define categories. Bad behaviour identified by one observer may be classed as high spirits by another. Obviously I’m not talking about extreme indicators such as telling a teacher to F**** off. Defining categories and checking they are applied consistently is known as inter-rater reliability and there are various statistical tests for it.

    I remember teachers telling pupils to f*** off in my secondary school. It was mostly part of the lively banter exchanged between pupils and teachers and no indicator of a lack of respect, although I do also remember incidents of breakdowns of behaviour too, with the odd kid throwing a desk at a teacher. I wouldn’t try to generalise too widely from this experience about comps in inner city Birmingham in the 70s.


  7. Allow me to set up some definitions then…

    Ahem..

    Bad behaviour is any act counter to the school rules.

    There… fixed.


  8. @rob: that would be fine if all schools had the same rules.


    • well then it is fine, as most schools have near identical rules.

      the difference is that some schools insist on adherence and some do not.


  9. What I have seen on twitter is people challenging assertions as anecdotal and expecting research. They don’t seem to be in conflict over the definition but rather over whether one person’s experience is of any worth. I agree that if what constitutes bad behaviour (for example) can’t be agreed by those in debate then this is a problem, but it is not anything like as common as the assertions that someone’s experience ‘is just anecdotal’ and therefore irrelevant.


  10. Absolutely, logicalincrementalism. Heatherf, I don’t think of the issue of different interpretations as a problem. It is the reality of the social world that perception and interpretation cannot be eliminated from the thing you are studying, researching or simply wanting to observe and say something about. As for ‘anecdotes’ being dismissed as irrelevant, well, that depends on the philosophical position of the person doing the dismissing; something else that one can’t really eliminate, but you can put a counter argument, as is going on here.


  11. I appreciate the principles on display here, that small samples can be representative of a larger population. But if I might suggest an improvement, I think you could pick some different examples to make the point stronger. A teacher choosing to work at 5 schools might have huge biases in where they choose to work: for example, they might work at 5 schools in the same town or conurbation. Additionally, schools with bad behaviour are likely to have a higher staff turnover, so job postings are disproportionately more likely to be in troubled schools. Perhaps a better example might be a teacher’s experience of OFSTED inspections; if a teacher sees five bad judgements of other teachers, they can start to develop a reliable idea that OFSTED often make bad judgements.

    On similar lines, a teacher might see bad behaviour frequently, but if they only observe their own school, they are really only a sample size of one (one school). I suspect this, in particular, is why a lot of observations by teachers tend to be dismissed as anecdotal — that, and the partial observer/bias issue. Examples of issues where a single teacher’s observations can be more trusted are those that are not as affected by the variable of school, e.g. consistency of marking by exam boards across years, or the success of different styles of personal statements in UCAS applications.


  12. The point is not that seeing the same things in five schools is indicative that significant numbers of schools are like that.

    It’s about fantasists dismissing personal testimony simply because it is personal testimony.

    If you teach, you attend meetings with teachers from other schools a fair bit, and talk to them. If they are all telling you the same story it bumps the numbers up even more. I think he moderates all the posts here, which quite often are teachers saying “yep, me too”. He also (sometimes) participates in the TES forums, which have teachers discussing the same sort of stuff. Even if you write 75% of it off as hyperbole it’s still supporting evidence.

    Old Andrew clearly isn’t stupid. He’s not so naive that he doesn’t look at other classes, talk to other teachers, and so on. If he was a lousy teacher who couldn’t control a paper bag he wouldn’t write about behaviour at all (if he worked in schools, those who don’t are ever willing to pronounce the magic)

    The denialists try to adopt the OFSTED mentality that if behaviour is bad then it is the teachers’ fault, nothing to do with the system. The fact that most of the teachers in most of the schools encountered have similar problems is wished away.

    IME most of the denialists work either in schools in leafy areas with 90% 5A-C passes or more commonly don’t work in schools at all but are paid (god alone knows why) to tell teachers what to do.

    You can see this most obviously in the discussions about phones in schools (see earlier posts). Those in tougher schools talk about distractions etc. Those in nice easy jobs write those teachers off as luddites or whatever because their classrooms aren’t full of iPads.

    It is difficult to produce research because schools lie about numbers. I suspect the most dishonest document in any school is the H&S Injury log.


    • “The point is not that seeing the same things in five schools is indicative that significant numbers of schools are like that.

      It’s about fantasists dismissing personal testimony simply because it is personal testimony. ”

      So why is the post entitled “Statistical Data and the Education Debate Part 2: Why we can reach conclusions from limited data”
      and why is it about probability and sampling? If you’re right, Paul, the whole post seems a bit misleading.


  13. With regard to the definition of “bad behaviour”, I’m not sure why it matters to the point being made here. I’m not actually suggesting that any study of behaviour just uses the term “bad behaviour” and leaves it at that, I’m merely pointing out that seeing what you consider to be bad behaviour in several places does give you reason (with appropriate caveats) to think it common; something which is often denied even before anyone gets onto the issue of whether that behaviour is bad or not.


  14. It’s quite likely that one teacher working in 5 schools all of which have a problem with ‘bad behaviour’ will conclude that ‘bad behaviour’ is common. However, all that those data points tell you is that that particular teacher considered behaviour to be bad in five schools. The teacher’s conclusion is not unreasonable, but your post claims to be about drawing statistical conclusions from limited data, and you can’t be sure conclusions based on one teacher’s opinions about 5 schools are reliable or valid.


  15. Disclaimer: Stumbled across this post while trying to find a good online introduction to all the issues surrounding sampling (I may actually have to write one myself despite only knowing enough to know how little I know :( ).

    You’re forgetting one important thing in your example:
    How the teacher was selected and brought to your attention.
    If she was mentioned in a news article about the terrible education system she may have specifically been selected from a potential sample of thousands of teachers BECAUSE of her experience, in which case the fact all five schools she went to were bad really DOES, for all intents and purposes, indicate nothing (because if you get a sample of 1000 people, each with five binary properties, it would frankly be surprising not to have one with all five of their binary properties being positive).

    While it wouldn’t be unreasonable (in the absence of more, larger scale, data) for the teacher in question to conclude a prevalence of bad teaching, it wouldn’t be reasonable for someone reading the news article mentioning her to do so (probability judgements are, in fact, subjective, but they’re also incredibly useful, even vital). The laws of probability mean that following them will generally give you good conclusions, but there’s always that one in a million person who is exposed to only outliers.

    There’s also distortion by the observer to consider: accessing a human memory is essentially (based on a moderately large number of interesting experiments on memory, famously including implanting false memories of traumatic childhood experiences) reconstructing some /distinctly/ non-randomly sampled events based on a few loose associations.


