The so-called ‘replication [or reproducibility] crisis’ in science has, in part, been blamed on the use of significance tests, or p-values, to determine whether or not a scientific discovery is ‘significant’.
RSS Conference programme lead, Daniel Farewell, introduced the keynote session, which offered a variety of perspectives on significance tests from leading experts on the subject. Deborah Mayo, a statistical scientist and philosopher from the Department of Philosophy at Virginia Tech, is the author of the book Statistical Inference as Severe Testing. Fellow Virginia Tech professor Aris Spanos, who has authored many papers on the subject, including ‘Severe testing as a basic concept in a Neyman–Pearson philosophy of induction’ with Deborah Mayo back in 2006, brought his perspective from the field of econometrics. Richard Morey, from the School of Psychology at Cardiff University, who has just published the paper ‘Beyond statistics: accepting the null hypothesis in mature sciences’, spoke from his background in Bayesian statistics and experimental psychology. Last, but certainly not least, David Cox brought his unique perspectives as one of the most influential thinkers in modern statistics.
Speaking first, Deborah Mayo gave some background context. The replication crisis has cast doubt on the use of p-value null-hypothesis significance testing (NHST). However, Deborah pointed out, the problem arises when data are cherry-picked. Quoting Wagenmakers, ‘p-values can only be used once the sampling plan is known’, Deborah emphasised that details of sample sizes and data manipulations must be disclosed along with the p-value. Don’t oust the method, Deborah warned – but be prepared to respond to challenges.
Richard Morey addressed a question asked by many: Should statistical significance be redefined? In his talk, he set out why he did not think that it should. He described NHST as a ‘scientific hammer’ which only focuses on a single effect, where statistical rejection equals scientific support. The idea that you can base scientific discovery on a p-value is unrealistic, Richard said. Using Bayes factor methods can produce more realistic results.
Drawing on his econometrics background, Aris Spanos talked about the importance of statistical model design, and of testing both inside and outside of the model. He pointed out that a lot of bad evidence is easily reproducible if you use the same bad methodology; you need to test all of your assumptions.
In David Cox’s talk, titled ‘In Gentle Praise of Significance Tests’, he identified a variety of contexts where significance tests are helpful, including the discovery of the Higgs boson, where the stringent five-sigma threshold was used to establish its existence. The p-value is a simple idea, David said; it carries clear information. But how we use it is a multi-faceted thing, and confusion stems from the different objectives it is used for.
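As an aside, the five-sigma threshold David mentioned corresponds to a strikingly small p-value. A minimal sketch, assuming a one-sided test on a standard normal statistic, shows the conversion using only Python's standard library:

```python
import math

def sigma_to_p(sigma):
    """One-sided p-value for a z-score of `sigma`:
    P(Z > sigma) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

p_five_sigma = sigma_to_p(5)
print(f"{p_five_sigma:.2e}")  # roughly 2.87e-07
```

In other words, the particle-physics convention demands a result that would occur by chance under the null hypothesis only about once in 3.5 million trials – far stricter than the 0.05 threshold common in other fields.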
A lively panel discussion followed, covering, among other things, the quality of reporting and the distinction between model refinement and ‘p-hacking’.
This RSS 2018 keynote conference session was titled ‘Significance Tests: Rethinking the Controversy’. A video of the session will be available on our YouTube channel in due course.
The above photo shows (l-r) David Cox, Aris Spanos, Deborah Mayo and Richard Morey taking part in this session's panel discussion. Photos courtesy of Rich Gray.