Common statistical terms


  • Definitions of Average: Mean, Median and Mode

    These definitions refer to this set of numbers: 5, 5, 5, 8, 12, 14, 21, 33, 38.

    Mean (arithmetic mean)

    The mean is the most commonly used type of average. It is the total of all the numbers divided by how many numbers there are. In this case there are 9 numbers with a total of 5+5+5+8+12+14+21+33+38 = 141. The mean is therefore 141 divided by 9, or 15.6667 (rounded to 15.7).

    Median

    For an odd number of numbers, the median is the middle number when they are arranged in order. The above set has 9 numbers, and they are in order: the middle number, i.e. the median, is 12. (For an even number of numbers the median is mid-way between the two middle numbers. So if we had just 8 numbers, 5, 5, 5, 8, 12, 14, 21, 33, the median would be mid-way between 8 and 12, i.e. 10.)

    Mode

    The mode is the number that occurs most frequently in a data set. For this set, the number 5 occurs most often, so the mode is 5. (If two or more numbers occur jointly most often the data set is bi-modal or multi-modal. In this case the mode may be of limited usefulness.)
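
    The three averages can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the variable names and the small tied dataset [1, 1, 2, 2, 3] are illustrative and not part of the original example.

```python
import statistics

data = [5, 5, 5, 8, 12, 14, 21, 33, 38]

mean = statistics.mean(data)      # total (141) divided by the count (9)
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# With an even number of values (the last one dropped here), the median
# is mid-way between the two middle values.
median_even = statistics.median(data[:-1])

# multimode returns every value that ties for most frequent.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(round(mean, 2), median, mode, median_even, modes)
```

    Note that statistics.multimode needs Python 3.8 or later.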

  • Statistical Significance (including hypothesis testing)

    In everyday speech, an event is significant if it is important or meaningful in some way. For example, the election of Barack Obama was significant because he was the first black US president.

    In statistics, however, significance is a technical concept – and one that is quite commonly misunderstood.

    According to a current Gallup opinion poll, 45% of American adults think Barack Obama is doing a good job – his ‘approval rating’ is 45%. Suppose that in a week’s time Gallup carry out a new poll and it gives Obama an approval rating of 46%. A statistician might do a calculation or two and say that the rise from 45% to 46% is not significant. But what does that mean?

    If you take a series of opinion polls the figures you get will vary even if people as a whole have not changed their views. Opinion polls use random samples, and in random samples the numbers vary. So what we want to know is whether a change (45% to 46% in this case) is the sort of thing you would expect to see from one sample to another, or indicates a genuine shift in public opinion. And that is what a statistician can calculate.
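
    A small simulation shows this sampling variation. It is a sketch under stated assumptions: the true approval rating is fixed at 45% throughout, and each poll questions 1000 people (a made-up figure, not Gallup's actual sample size).

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

TRUE_APPROVAL = 0.45  # assumed: opinion in the population never changes
SAMPLE_SIZE = 1000    # assumed number of people questioned in each poll

def run_poll():
    """Question SAMPLE_SIZE random people; return the share who approve."""
    approvals = sum(random.random() < TRUE_APPROVAL for _ in range(SAMPLE_SIZE))
    return approvals / SAMPLE_SIZE

polls = [run_poll() for _ in range(10)]
print([f"{p:.1%}" for p in polls])
```

    The ten results differ from one another even though nothing in the simulated population has changed.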

    Without going into the calculation, the shift from 45% to 46% is well within the sort of variation you would expect from one sample to another – so we say it is not significant. It is the sort of change that could easily happen in a sample without there being any real increase in Obama’s approval rating in the population as a whole.

    A change from 45% to 50%, however, is bigger than would be expected in a typical opinion poll. It could happen by chance, but the probability of it doing so is small. So a change like that would be described as significant, and if it happened we would suspect that there had been an increase in the approval of Obama in the American population.
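
    For readers who do want to see a calculation, here is one common approach, a two-proportion z-test. It is a sketch, not necessarily the method a polling organisation would use, and it assumes that each poll questions 1000 people. The function name prob_by_chance is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist

def prob_by_chance(p1, p2, n1, n2):
    """Probability of a change at least this large if opinion is unchanged."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)  # combined approval estimate
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / std_err                   # size of the change in standard errors
    return 2 * (1 - NormalDist().cdf(abs(z))) # two-sided probability

N = 1000  # assumed sample size for each poll
p_small = prob_by_chance(0.45, 0.46, N, N)
p_large = prob_by_chance(0.45, 0.50, N, N)
print(f"45% to 46%: {p_small:.2f}")
print(f"45% to 50%: {p_large:.3f}")
```

    Under these assumptions the first probability is well above 5% and the second is below it. With different sample sizes the figures would change.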

    If you see a result described as significant at the 5% level, it means that, if there were no real underlying change, there would be at most a 5% chance of getting a result at least that extreme purely from random variation. You might suspect, therefore, that there is some real underlying change responsible for the result. If the significance level is given as 1%, the result is even less likely to have arisen by chance, so you will have a correspondingly stronger suspicion that there has been some real underlying change.

    Finally, there is a potential source of confusion in the terminology. A smaller significance level indicates greater significance! But that does actually make sense: a result at the 1% level is more likely to indicate a real change than a result at the 5% level. The 1% level gives stronger evidence than the 5% level; that is, it is more significant.

    More on significance: hypothesis testing

    In statistical tests we usually work with two hypotheses, the null and the alternative. The null hypothesis is something like the status quo; it is the assumption we would make unless there was sufficient evidence to suggest otherwise. The alternative hypothesis represents a new state of affairs that we suspect (or perhaps hope) might be true.

    For example, suppose we are testing a new medical treatment to see if it performs better than the existing standard treatment. The null hypothesis would be that the new treatment is no better than the old (or is worse); the alternative would be that it performs better.

    A significance test assesses the evidence in relation to the two competing hypotheses. A significant result is one which favours the alternative rather than the null hypothesis. A highly significant result strongly favours the alternative.

    The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (sometimes called the observed significance level), and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that, if the null hypothesis were true, O would fall within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.
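
    A worked example may make this concrete. The numbers are hypothetical: suppose the standard treatment helps 50% of patients, and in a trial the new treatment helps 15 out of 20. The sketch below adds up the probabilities of every outcome at least as favourable to the new treatment as the one observed, calculated on the assumption that the null hypothesis is true.

```python
from math import comb

def one_sided_p_value(successes, trials, null_rate):
    """P(at least `successes` out of `trials`) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical trial: the standard treatment helps 50% of patients,
# and the new treatment helps 15 patients out of 20.
p_value = one_sided_p_value(15, 20, 0.5)
print(f"P value = {p_value:.4f}")
```

    Here the P value is about 0.02, so this made-up result would count as significant at the 5% level but not at the 1% level.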

    Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.

    There are interesting parallels here with criminal cases in a court of law. The null hypothesis in court is that I am not guilty; this is the assumption we start with, the assumption we hold to unless there is sufficient evidence otherwise. The alternative is that I am guilty, and the court accepts that conclusion only if my guilt is shown ‘beyond reasonable doubt’. But if the prosecution fails to obtain a guilty verdict this does not show that I am innocent. Perhaps I am innocent; or perhaps I am guilty but the evidence is not strong enough to show that beyond reasonable doubt. In the latter case, additional evidence may emerge later and I may face a re-trial.

    Likewise, if the evidence in favour of the new medical treatment is strong enough, we will want to adopt it. But if the evidence is weak we will stick with the standard treatment, at least until additional experimental evidence emerges to suggest that the new treatment may be better.

  • Standard Deviation

    The standard deviation (SD) is a measure of how spread out a dataset is. It tells us how far items in the dataset are, on average, from the mean. So the larger the SD, the more spread out the data are.

    Though the SD is an average, it is an average calculated in a somewhat unusual way – and if you don’t want the technical detail you should skip to the next paragraph now. For the dataset {2, 3, 5, 8, 12} the mean is 6. So the deviations from the mean are {–4, –3, –1, 2, 6}. We square these deviations, {16, 9, 1, 4, 36}, and take their average, 13.2. Finally we take the square root of 13.2 to give us the SD, 3.63.
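
    The same steps can be followed in Python. This is a minimal sketch; the variable names are illustrative.

```python
from math import sqrt
import statistics

data = [2, 3, 5, 8, 12]

mean = sum(data) / len(data)             # the mean, 6
deviations = [x - mean for x in data]    # distance of each item from the mean
squares = [d ** 2 for d in deviations]   # squared deviations
variance = sum(squares) / len(squares)   # average of the squares, 13.2
sd = sqrt(variance)                      # square root of that average

# The standard library's pstdev ("population" SD) follows the same recipe.
print(round(sd, 2), round(statistics.pstdev(data), 2))
```

    Be aware that statistics.stdev (without the p) divides by one less than the number of items. This is a common variant used when the data are a sample from a larger population, and it gives about 4.06 here rather than 3.63.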

    [Graph: the bell-shaped Normal curve]

    One of the best ways to think about the standard deviation is in terms of some ‘rules of thumb’. Suppose we have a large dataset, or a whole population, and suppose that the distribution is the common bell-shape or Normal curve as shown in the graph above. Then the following statements are generally pretty accurate.
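
      • About 2/3 of the values lie within 1 SD of the mean.
      • About 95% of the values lie within 2 SDs of the mean.
      • Very few values lie more than 3 SDs from the mean.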

    As an example, consider intelligence quotient or IQ (which remains popular despite the doubts of many psychologists and educationalists). IQ is usually measured on a scale with mean 100 and standard deviation 15, and the distribution of IQs in a population is, to a good approximation, Normal. We can therefore say that about 2/3 of people will have an IQ within 1 SD, that is 15 units, of the mean; so 2/3 of IQs will be between 85 and 115. About 95% of people will have IQs within 2 SDs of the mean, that is between 70 and 130. And it will be very rare to have an IQ more than 3 SDs from the mean, that is below 55 or above 145 (it’s about 1/8 of 1% in each case). The requirement to join Mensa is an IQ in the top 2% of the distribution; that amounts to an IQ just a little more than 2 SDs above the mean – about 131.
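
    These figures can be checked against an exact Normal distribution. The sketch below uses NormalDist from Python's standard library and assumes, as above, a mean of 100 and an SD of 15.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

within_1_sd = iq.cdf(115) - iq.cdf(85)  # share of IQs between 85 and 115
within_2_sd = iq.cdf(130) - iq.cdf(70)  # share of IQs between 70 and 130
above_3_sd = 1 - iq.cdf(145)            # share of IQs above 145
mensa_cutoff = iq.inv_cdf(0.98)         # IQ that 98% of people fall below

print(f"{within_1_sd:.1%}, {within_2_sd:.1%}, {above_3_sd:.3%}, {mensa_cutoff:.0f}")
```

    The results (roughly 68%, 95%, a little over 0.1% and 131) agree with the rules of thumb.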

  • Regression to the mean

    Children of tall parents are, on average, shorter than their parents. This is a simple example of a common phenomenon: regression to the mean or the tendency for extremes to be pulled back towards the average. And for the same reason children of short parents are, on average, taller than their parents.

    In the case of heights, regression to the mean is just a fact and nobody worries too much about it. However, there are many other cases in which regression to the mean is seriously misunderstood.

    One famous example is the Sports Illustrated ‘jinx’.

    Appearing on the cover of Sports Illustrated is said to be a ‘kiss of death’: it is frequently followed by a loss of form, by failure of some sort. But of course the jinx is just regression to the mean. You only get to be on the cover if you are doing exceptionally well. And if you are doing exceptionally well now, then when regression to the mean kicks in you will be doing worse. All good runs come to an end. If Sports Illustrated started to feature those who were doing exceptionally badly on the front cover then people would be queuing up to appear, because if you are having a bad run then you will improve when regression to the mean occurs. (Of course good and bad runs don’t come to an end in a predictable way and some runs last longer than others. As ever in statistics we are talking about averages and randomness here.)
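
    The effect is easy to reproduce in a simulation. The sketch below rests on a deliberately simple, assumed model: each player's result in a season is a lasting skill plus that season's luck, both drawn at random. Nothing in the model makes a good season cause a bad one.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_PLAYERS = 10_000

# Assumed model: each season's result = lasting skill + that season's luck.
skill = [random.gauss(0, 1) for _ in range(N_PLAYERS)]
season_1 = [s + random.gauss(0, 1) for s in skill]
season_2 = [s + random.gauss(0, 1) for s in skill]

# Rank players by their season 1 result, best first.
ranked = sorted(range(N_PLAYERS), key=lambda i: season_1[i], reverse=True)
stars = ranked[:100]        # the 100 best performers ("cover stars")
strugglers = ranked[-100:]  # the 100 worst performers

def average(values):
    return sum(values) / len(values)

stars_before = average([season_1[i] for i in stars])
stars_after = average([season_2[i] for i in stars])
strugglers_before = average([season_1[i] for i in strugglers])
strugglers_after = average([season_2[i] for i in strugglers])

print(f"stars:      {stars_before:.2f} -> {stars_after:.2f}")
print(f"strugglers: {strugglers_before:.2f} -> {strugglers_after:.2f}")
```

    In this model the stars are still better than average in the second season, but on average they do worse than in the first, and the strugglers do better, purely because extreme luck tends not to repeat.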

    The Sports Illustrated ‘jinx’ is a case of mistaking correlation for causation. Appearing on the cover may be correlated with a loss of form but it is not the cause of a loss of form. This misinterpretation is much more worrying when it affects serious areas such as health, where the fallacious reasoning can appear to defend quackery.

    ‘I was seriously ill and tried all sorts of things to no avail. But then I tried psychic surgery and I recovered. Therefore psychic surgery works.’ No: many people who are seriously ill get better eventually, whatever they try. It is regression to the mean that is responsible, not the treatment.