In everyday speech, an event is significant if it is important or meaningful in some way. So, the election of Barack Obama was significant because he was the first black US president.
In statistics, however, significance is a technical concept – and one that is quite commonly misunderstood.
According to a current Gallup opinion poll, 45% of American adults think Barack Obama is doing a good job – his ‘approval rating’ is 45%. Suppose that in a week’s time Gallup carry out a new poll and it gives Obama an approval rating of 46%. A statistician might do a calculation or two and say that the rise from 45% to 46% is not significant. But what does that mean?
If you take a series of opinion polls, the figures you get will vary even if people as a whole have not changed their views. Opinion polls use random samples, and in random samples the numbers vary. So what we want to know is whether a change (45% to 46% in this case) is the sort of thing you would expect to see from one sample to another, or whether it indicates a genuine shift in public opinion. And that is what a statistician can calculate.
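To get a feel for this sampling variation, here is a minimal sketch in Python. It assumes a hypothetical poll of 1,000 adults (a typical size for a national poll, not a figure taken from Gallup) and a true approval rating fixed at 45%, and simulates many such polls to see how much the sample percentage moves around by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000          # assumed poll size; national polls often sample roughly 1,000 adults
true_rate = 0.45   # suppose the population's true approval is 45% and never changes
trials = 10_000    # number of simulated polls

# Each simulated poll counts how many of the n sampled adults approve,
# then converts that count into a percentage.
approvals = rng.binomial(n, true_rate, size=trials)
percentages = 100 * approvals / n

print(f"mean of simulated polls: {percentages.mean():.1f}%")
print(f"middle 95% of simulated polls: "
      f"{np.percentile(percentages, 2.5):.1f}% to {np.percentile(percentages, 97.5):.1f}%")
```

With these assumptions the simulated figures mostly fall between about 42% and 48%, even though the underlying population never changes its mind, so a single poll reading of 46% tells us very little on its own.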
Without going into the calculation, the shift from 45% to 46% is well within the sort of variation you would expect from one sample to another – so we say it is not significant. It is the sort of change that could easily happen in a sample without there being any real increase in Obama’s approval rating in the population as a whole.
A change from 45% to 50%, however, is bigger than would be expected in a typical opinion poll. It could happen by chance, but the probability of it doing so is small. So a change like that would be described as significant, and if it happened we would suspect that there had been a genuine increase in Obama’s approval among the American population.
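Without reproducing the statistician’s workings exactly, a rough version of the calculation can be sketched as follows. It again assumes two independent polls of 1,000 adults each (an illustrative figure, not one from Gallup) and uses the standard normal approximation for the difference between two sample proportions.

```python
from math import sqrt
from scipy.stats import norm

def poll_shift_p_value(p1, p2, n1=1_000, n2=1_000):
    """Two-sided p-value for the shift between two independent polls,
    using the normal approximation (poll sizes of 1,000 are assumed)."""
    # Pool the two polls under the assumption that nothing has really changed
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * norm.sf(abs(z))

print(f"45% -> 46%: p = {poll_shift_p_value(0.45, 0.46):.2f}")   # well above 0.05: not significant
print(f"45% -> 50%: p = {poll_shift_p_value(0.45, 0.50):.3f}")   # well below 0.05: significant
```

Under these assumptions the one-point shift gives a p-value of about 0.65, comfortably consistent with pure sampling variation, while the five-point shift gives a p-value of about 0.025, small enough to be called significant at the 5% level.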
If you see a result described as significant at the 5% level, it means that, if nothing had really changed, there would be only a 5% chance of getting a result at least that large purely from random variation. You might suspect, therefore, that there is some real underlying change responsible for the result. If the significance level is given as 1%, the result is even less likely to have arisen by chance, so you will have a correspondingly stronger suspicion that there has been some real underlying change.
Finally, there is a potential source of confusion in the terminology. A smaller significance level indicates greater significance! But that does actually make sense: a result at the 1% level is more likely to indicate a real change than a result at the 5% level. The 1% level gives stronger evidence than the 5% level; that is, it is more significant.
More on significance: hypothesis testing
In statistical tests we usually work with two hypotheses, the null and the alternative. The null hypothesis is something like the status quo; it is the assumption we would make unless there was sufficient evidence to suggest otherwise. The alternative hypothesis represents a new state of affairs that we suspect (or perhaps hope) might be true.
For example, suppose we are testing a new medical treatment to see if it performs better than the existing standard treatment. The null hypothesis would be that the new treatment performs no better than the old (it may even be worse); the alternative would be that it performs better.
A significance test assesses the evidence in relation to the two competing hypotheses. A significant result is one which favours the alternative rather than the null hypothesis. A highly significant result strongly favours the alternative.
The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that O falls within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.
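To make the P value concrete, here is a small illustrative calculation for the treatment example above. The patient numbers and recovery counts are invented purely for illustration, and the sketch uses a simple normal approximation for comparing two proportions, with a one-sided alternative because the question is whether the new treatment performs better.

```python
from math import sqrt
from scipy.stats import norm

# Invented trial data, purely for illustration
standard_n, standard_recovered = 200, 120   # 60% recover on the standard treatment
new_n, new_recovered = 200, 140             # 70% recover on the new treatment

p_std = standard_recovered / standard_n
p_new = new_recovered / new_n

# Null hypothesis: the new treatment is no better, so pool the two groups
pooled = (standard_recovered + new_recovered) / (standard_n + new_n)
se = sqrt(pooled * (1 - pooled) * (1 / standard_n + 1 / new_n))
z = (p_new - p_std) / se

# One-sided P value: the probability of an outcome at least this favourable to
# the new treatment if it were really no better than the standard one
p_value = norm.sf(z)
print(f"P value = {p_value:.3f}")
```

With these made-up figures the P value comes out at roughly 0.02: the trial outcome lies within the 2% or so of possible results giving the strongest evidence for the alternative, so it would count as significant at the 5% level, though not quite at the 1% level.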
Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.
There are interesting parallels here with criminal cases in a court of law. The null hypothesis in court is that I am not guilty; this is the assumption we start with, the assumption we hold to unless there is sufficient evidence otherwise. The alternative is that I am guilty, and the court accepts that conclusion only if my guilt is shown ‘beyond reasonable doubt’. But if the prosecution fails to obtain a guilty verdict this does not show that I am innocent. Perhaps I am innocent; or perhaps I am guilty but the evidence is not strong enough to show that beyond reasonable doubt. In the latter case, additional evidence may emerge later and I may face a re-trial.
Likewise, if the evidence in favour of the new medical treatment is strong enough, we will want to adopt it. But if the evidence is weak we will stick with the standard treatment, at least until additional experimental evidence emerges to suggest that the new treatment may be better.