Bundle entropy as an optimised measure of consumers’ systematic product choice combinations in mass transactional data

In this work, a novel entropy-based measure is developed to directly measure the predictability of basket composition: bundle entropy, where zero denotes a bundle’s total predictability and one its total unpredictability.

  • Funder


  • Duration

    Oct 2020 – Oct 2023

  • Investigators

    Roberto Mansilla, Gavin Smith, Andrew Smith, and James Goulding

  • Partners


Project Description

Understanding and measuring the predictability of consumer purchasing (basket) behaviour is of significant value. While predictability measures such as entropy have been well studied and leveraged in other sectors, their development and application to the very large multi-dimensional data sets found in the retailing sector are less common. A small number of methods exist, but we demonstrate that they fail to accord with intuition, which can lead to misunderstandings between those who conduct the analysis and those who act on the insights. To demonstrate these issues in context, we delineate the requirements for such a measure in this domain. A novel measure is then developed based on entropy to directly measure the predictability of basket composition. The measure is designated bundle entropy (zero denotes a bundle’s total predictability, one its total unpredictability). We empirically compare the proposed bundle entropy against existing measures using two large-scale real-world transactional data sets, each including more than 2,000 households (frequent shoppers) over two years. First, we demonstrate that the proposed measure is the only one that behaves according to the desired properties. Second, we show empirically that bundle entropy differs noticeably from the other measures. Finally, we consider some use case analyses and discuss the utility of the proposed measure in practice.


This research paper outlines the conditions that a measurement tool must satisfy to accurately capture consumers’ systematic purchasing behaviours. The study introduces a novel metric called “bundle entropy”, which uses the concept of entropy to measure how predictable the composition of a consumer’s shopping baskets is. The proposed bundle entropy is compared against established measures through empirical analysis of two large real-world transactional data sets, each covering more than 2,000 frequent-shopper households across two years.
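To make the zero-to-one scale concrete, the sketch below computes a normalised Shannon entropy over one household's basket compositions. It is an illustration under stated assumptions, not the bundle entropy formula from the paper: the function name `normalised_basket_entropy`, the choice to treat each basket as an unordered set of product identifiers, and the normalisation by the logarithm of the number of baskets are all hypothetical choices made here for clarity.

```python
from collections import Counter
from math import log2


def normalised_basket_entropy(baskets):
    """Normalised Shannon entropy of a household's basket compositions.

    ASSUMPTION: illustrative only; this is NOT the paper's bundle entropy.
    Each basket is treated as an unordered set of product identifiers.
    Returns 0.0 when every basket is identical and 1.0 when no two match.
    """
    n = len(baskets)
    if n < 2:
        return 0.0  # a single basket (or none) carries no uncertainty

    # Count how often each distinct basket composition occurs.
    counts = Counter(frozenset(basket) for basket in baskets)

    # Shannon entropy in bits, written as sum p * log2(1/p).
    entropy = sum((c / n) * log2(n / c) for c in counts.values())

    # Divide by the maximum possible entropy for n baskets (all distinct).
    return entropy / log2(n)


# Hypothetical toy households, invented for this example.
habitual = [["milk", "bread"], ["bread", "milk"], ["milk", "bread"]]
erratic = [["milk"], ["bread"], ["eggs"], ["tea"]]

print(normalised_basket_entropy(habitual))  # 0.0: fully predictable
print(normalised_basket_entropy(erratic))   # 1.0: fully unpredictable
```

Comparing whole baskets for exact equality is the simplest possible choice and is deliberately crude: two baskets that differ by a single item count as entirely different compositions, which is one reason a purpose-built measure is needed for real transactional data.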

The evaluations are based on two different real-world, mass transactional data sets. The first is Dunnhumby’s “The Complete Journey”, a freely available data set. It includes household-level grocery purchases over two years from 2,500 frequent shoppers, providing a cohort for tracking systematic choices over time, and contains over 2.5 million records. The second is a large transactional data set from 2,181 loyalty card holders over 20 months (between 2014 and 2016) from a large UK grocery retailer.


The empirical analysis reveals that the proposed metric, bundle entropy, is the only measure that satisfies the desired properties, and that it differs noticeably from the existing measures. Practical use case scenarios are also considered, with a discussion of the potential utility of the proposed measure in real-world applications.

Associated Publications

A refined limit on the predictability of human mobility
It has been recently claimed that human movement is highly predictable. While an upper bound of 93% predictability was shown, this was based upon human movement trajectories of very high spatiotemporal granularity. Recent studies reduced this spatiotemporal granularity down to the level of GPS data, and under a similar methodology results once again suggested a high predictability upper bound (i.e. 90% when movement was quantized down to a spatial resolution approximately the size of a large building). In this work we reconsider the derivation of the upper bound to movement predictability. By considering…

Media, Blogs and News Stories