Diagnosing Disease with Shopping Data

Retailers' loyalty card data is an underused and under-explored dataset in health research, despite containing large-scale, longitudinal behavioural information on the population's diet, product use and self-medication.

GDPR now gives individuals the right to access a copy of the personal data that commercial companies hold on them. As more studies provide evidence of how what we consume affects our health, the opportunity to link datasets on our purchasing habits to health outcomes should not be missed.

  • Funder


  • Duration

    Sept 2019 – Sept 2023

  • Investigators

    Elizabeth Dolan, James Goulding, Anya Skatova

  • Partners

    ALSPAC, Boots, ONS, JBC

Project Description

Personal commercial transactional data is the information stored when an exchange occurs between an individual and a business, including customer shopping data. This research will connect loyalty card data (customer shopping information held by a retailer) to COVID-19 incidents and to information from women with ovarian cancer. The linked datasets will be used to investigate whether shopping data can help women with ovarian cancer get diagnosed earlier, and/or whether it can help inform public health decisions in a pandemic.

This project aims to create a framework for using shopping data in medical research. It asks the question:

How can personal transactional data be collected and analysed for the purposes of health research in a way that is acceptable to society, and that works for both infectious and chronic disease?

The project is connected to a wider project by partners ALSPAC at Bristol University and the Alan Turing Institute: “donating personal transactional data for research: investigating the public acceptability of using commercial transactional data in public health research”.


A collection of studies will be carried out to iteratively create machine learning models whose predictions could help in the earlier diagnosis of ovarian cancer and/or the understanding of outbreaks of influenza-like illness (ILI).

The project will use a mixed-methods approach, collecting and analysing both qualitative and quantitative data for integrated interpretation. The studies will inform each model's schema creation and feature engineering, and will be used to understand and validate its outputs and any interpretations made from them. The iterative design will allow the model to be adjusted for successful implementation in a clinical setting.
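As a rough illustration of what feature engineering on shopping data might look like, the sketch below aggregates transactions into weekly purchase counts per customer and product category. It is not the project's actual pipeline: the record layout, the customer identifiers, the category names and the function `weekly_category_counts` are all hypothetical, and the data is made up.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, made-up records: (customer_id, purchase_date, product_category).
# Real loyalty card data would have a different schema.
transactions = [
    ("c1", date(2020, 1, 6), "pain_relief"),
    ("c1", date(2020, 1, 8), "indigestion"),
    ("c1", date(2020, 1, 15), "pain_relief"),
    ("c2", date(2020, 1, 7), "cough_cold"),
]


def weekly_category_counts(records):
    """Count purchases per (customer, ISO year, ISO week, category)."""
    counts = defaultdict(int)
    for customer_id, purchase_date, category in records:
        # isocalendar() yields (ISO year, ISO week number, ISO weekday).
        iso_year, iso_week, _ = purchase_date.isocalendar()
        counts[(customer_id, iso_year, iso_week, category)] += 1
    return dict(counts)


features = weekly_category_counts(transactions)
for key in sorted(features):
    print(key, features[key])
```

Weekly counts of this kind are one possible input to a predictive model; the choice of time window and of product categories is an assumption here and would in practice be guided by the qualitative and quantitative studies described above.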


Watch this space.

Associated Publications

Psychology of personal data donation 2019

Advances in digital technology have led to large amounts of personal data being recorded and retained by industry, constituting an invaluable asset to private organizations. The implementation of the General Data Protection Regulation in the EU … enables the general public to access data collected about them by organisations, opening up the possibility of this data being used for research that benefits the public themselves; for example, to uncover lifestyle causes of poor health outcomes….


Value of Commercial Product Sales Data in Healthcare Prediction

The technical report and code for the above project, conducted with the NHS, can be viewed at


Media, Blogs and News Stories


Applying a novel variable importance technique, MCR (model class reliance), to machine learning models in order to assess the Value of Commercial Product Sales Data in Healthcare Prediction