In this ‘big data’ era, understanding how to use a variety of data sources in a given analysis has taken on increasing importance.
For example, high-quality probability surveys have struggled with increasing costs and nonresponse rates. Integrating multiple data sources to assist in social research has thus become a research priority. Government agencies can link surveys with administrative records and geospatial measures. Large non-probability surveys can be combined with probability surveys to provide information about comparatively uncommon but important segments of the population. Web scraping, Twitter feeds, and retail sales can be used to better forecast and understand economic trends and forces. In medical settings, pulling together data from electronic health records, ecological momentary assessments, environmental measures, and more traditional sources such as controlled clinical trials has become a major focus of interest.
While computer science plays a large role in the development of the computational aspects of data integration, statistics plays a key role in taking this work into the scientific realm.
This special issue of the Journal of the Royal Statistical Society, Series A, is dedicated to finding solutions to these challenges through innovative methodological developments and applications that bring together data science and statistics. As usual with Series A, the focus is on the development and/or evaluation of innovative methodology that is directly motivated by, and substantially increases our understanding of, real-world data problems in social and medical settings. Topics of interest might include:
- Methodological issues surrounding match rates across data sources using fuzzy matching techniques
- The production and use of weighting schemes to account for data issues around representativeness and attrition
- The development and diffusion of new or modified statistical techniques required to exploit new data, for example, in relation to the representation of spatial data, dealing with very big data that might require relying on random subsample analysis, or replication studies
Motivating examples might include:
- Enhancing the statistical power of analyses of data from controlled clinical trials in drug development by integrating summary information from historical studies, which may be readily accessible through publications in the relevant literature
- Improving model fitting when predicting the risk of developing certain cancers, where individual-level data from different hospitals cannot be shared and repeated communications between hospitals may therefore be necessary to coordinate the sharing of aggregate data
We particularly encourage interdisciplinary submissions that involve collaboration between statisticians and other scientists.
Prospective authors are invited to email their proposals to the Guest Editors, Peisong Han (peisong@umich.edu) and Yajuan Si (yajuan@umich.edu). Please note that, in line with the remit of Series A, contributions of a principally technical nature will not be acceptable. The deadline for manuscript submissions is midnight on 31 January 2024. Submissions, which should clearly indicate "JRSS-A Frontiers in Data Integration Special Issue" in the cover letter, should be made in the usual way, online at ScholarOne Manuscripts (manuscriptcentral.com), where further guidance about the structure, length and format of manuscripts may be found. All manuscripts will be peer reviewed in line with the journal’s standard policy. However, in order to produce the special issue in a timely manner, authors will be asked to complete revisions within eight weeks of receiving referee reports.