Missing data is a very common challenge when working with real-world datasets and something everyone working with data should be mindful of. As Emeritus Professor David J Hand from Imperial College London reminded us, we live in a golden age for statistics and associated disciplines. Statisticians are able to perform extraordinary feats in analysing data to extract understanding and make predictions. But we must remember that data are but a shadow of the real world, and to the extent that the shadow is distorted in unknown ways, any conclusions we draw can be compromised. In this talk, Professor Hand examined different ways in which data can be inadequate, failing to properly represent the object of study. He described the taxonomy of missing data developed in his latest book, Dark Data, illustrating some of the consequences for our inferences and understanding with real examples.
This was followed by Dr Robin Mitra, from Cardiff University and the Data Science Campus of the ONS, who presented how to account for missing values when constructing optimal designs ahead of conducting experiments. The difficulty is that, at the design stage, it is not known which values will be missing. When data are missing at random, it is possible to incorporate this information into the optimality criterion that is used to find designs.
However, when data are not missing at random this framework can lead to inefficient designs. Dr Mitra also described some of the specific challenges to finding optimal designs for linear regression models with ‘not missing at random’ values present, when one’s aim is to estimate the model parameters with high precision.
In the final talk, Dr Levente Klein from IBM explained how physics constraints can help reconstruct missing data in satellite-based observations of the global distribution of CH4, SO2 and NO2, when images are acquired sporadically and have patchy spatial coverage. By using the laws of physics to inform the deep learning models, the results obtained outperformed standard, purely data-driven models while requiring less data for training.
The session, with all the speakers attending virtually from different locations, was live streamed simultaneously to and from the main auditorium in Manchester Central. It was well attended, with interesting questions from both the in-person and online audiences. The discussion centred on the impossibility, when data are missing, of ever knowing the ground truth, no matter how much data we have access to.
Author
Camille Szmaragd is a committee member of the RSS Computational Statistics and Machine Learning section.