Automatic Lifestate Identification and Clustering

While numerous methods to segment and/or summarise time series exist, their properties often do not align with the needs of those who consume the summaries, or they require parameters that cannot realistically be set in advance. To address this, we define a set of broad properties that lead to high utility across a wide class of domains. We then propose a model, with complexity controlled by a holdout procedure, that automatically produces a summarisation meeting these properties based on an information-theoretic notion of optimality. This work defines the concept of a lifestate and introduces a three-stage pipeline: (1) feature selection, (2) feature clustering, and (3) Automatic Lifestate Identification.

Funder
EPSRC
Duration
September 2021 – December 2025
Investigators
Samuel Smith, Gavin Smith, John Harvey
Partners
RSSB

Project Description

Short Summary

This project aims to provide a method to answer the following question: how can we identify states that are common across multiple time series, without prior knowledge of the states themselves?

Summarising high-dimensional time series data across multiple entities is an increasingly prevalent problem, because mass data collection is now routine in most domains. Examples include regular survey collection, consumer purchasing histories from transactional data (where the number of possible items to choose from is high), and other repeatedly sampled data. Summarisation in this context means reducing both the high-dimensional observations and the large number of temporal points.

While numerous methods to segment and/or summarise time series exist, their properties often do not align with the needs of those who consume the summaries, or they require parameters that cannot realistically be set in advance. To address this, we define a set of broad properties that lead to high utility across a wide class of domains. Intuitively, these properties reflect the summarisation of such data into lifestates where (1) the number of states is limited and shared across entities to allow interpretation and comparison, and (2) the number of state transitions is jointly controlled to provide a parameterless, optimal summarisation of both the high sample dimensionality and the high temporal dimensionality. Specifically, the aim is a segmentation that optimally trades off the number of states and segments that humans must interpret against the need to capture salient state changes.

This work addresses three problems: feature selection (removing features that do not contain meaningful temporal behaviour), feature clustering (grouping features with similar temporal behaviour), and Automatic Lifestate Identification (finding the optimal segmentation for each cluster of features).
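The first two stages can be illustrated with a minimal sketch. The criteria used here are assumptions chosen for illustration only and are not the project's actual method: lag-1 autocorrelation stands in for "meaningful temporal behaviour", and Pearson correlation with a cluster's first member stands in for "similar temporal behaviour". The function names and thresholds are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lag1_autocorr(x):
    """Correlation between a series and itself shifted by one time step."""
    return pearson(x[:-1], x[1:])


def select_features(features, min_autocorr=0.5):
    """Stage 1 (illustrative): keep features whose values persist over time."""
    return {name: x for name, x in features.items()
            if lag1_autocorr(x) >= min_autocorr}


def cluster_features(features, min_corr=0.8):
    """Stage 2 (illustrative): greedily group features that move together."""
    clusters = []
    for name, x in features.items():
        for cluster in clusters:
            representative = features[cluster[0]]
            if pearson(representative, x) >= min_corr:
                cluster.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([name])
    return clusters
```

Each resulting cluster would then be passed to the third stage, which segments it into lifestates. A greedy grouping is used only to keep the sketch short; any clustering method that respects temporal similarity could take its place.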

Method

Current methods of summarising high-dimensional time series employ Minimum Description Length (MDL) and information criteria principles for complexity control. This project follows a similar approach but has additional constraints on how datasets can be segmented. These constraints significantly change the problem and mean new methodologies must be created.
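As a rough illustration of information-criterion complexity control, the sketch below segments a single one-dimensional series into piecewise-constant segments and chooses the number of segments with a BIC-style penalty. It is a generic baseline under stated assumptions (Gaussian noise, one series, no states shared across entities) and does not implement the additional constraints this project introduces. The function names are hypothetical.

```python
import math


def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared errors of series[i:j] around its own mean."""
    n = j - i
    total = prefix[j] - prefix[i]
    total_sq = prefix_sq[j] - prefix_sq[i]
    return max(total_sq - total * total / n, 0.0)


def best_segmentation(series, max_segments):
    """Return the interior boundaries of the BIC-preferred segmentation."""
    n = len(series)
    prefix, prefix_sq = [0.0], [0.0]
    for x in series:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    # dp[k][j] = minimum squared error when series[:j] is split into k segments.
    inf = float("inf")
    dp = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    dp[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = dp[k - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < dp[k][j]:
                    dp[k][j] = cost
                    back[k][j] = i

    # BIC-style score: fit term plus a penalty for k means and k - 1 boundaries.
    best_k, best_score = 1, inf
    for k in range(1, max_segments + 1):
        fit = n * math.log(dp[k][n] / n + 1e-12)  # small constant guards log(0)
        score = fit + (2 * k - 1) * math.log(n)
        if score < best_score:
            best_k, best_score = k, score

    # Walk the back-pointers to recover the segment start positions.
    boundaries, j = [], n
    for k in range(best_k, 0, -1):
        j = back[k][j]
        boundaries.append(j)
    return sorted(b for b in boundaries if b > 0)
```

For a series made of 20 zeros followed by 20 fives, `best_segmentation(series, max_segments=4)` selects two segments with a single boundary at index 20: extra segments cannot improve the fit, so the penalty term rules them out.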

Empirical analysis on both synthetic and real-world datasets is used to test the validity of the developed methods. Synthetic datasets provide a controlled ‘ground truth’, while real-world datasets test the efficacy of the method where the ‘ground truth’ is less controlled.
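A minimal sketch of how a synthetic benchmark with known ground truth might be built and scored is given below. The generator (constant levels plus Gaussian noise), the tolerance-based matching, and the F1 score are assumptions made for illustration; the project's actual datasets and evaluation metrics are not described on this page.

```python
import random


def make_synthetic(lengths, levels, noise=0.5, seed=0):
    """Build a noisy piecewise-constant series and return it with its true boundaries."""
    rng = random.Random(seed)
    series, boundaries, t = [], [], 0
    for length, level in zip(lengths, levels):
        series.extend(level + rng.gauss(0.0, noise) for _ in range(length))
        t += length
        boundaries.append(t)
    # The final entry is the end of the series, not a state change.
    return series, boundaries[:-1]


def boundary_f1(true_bounds, found_bounds, tolerance=2):
    """F1 score where a found boundary counts if it lies within `tolerance` of an unmatched true one."""
    unmatched = list(true_bounds)
    hits = 0
    for b in found_bounds:
        match = next((t for t in unmatched if abs(t - b) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(found_bounds)
    recall = hits / len(true_bounds)
    return 2 * precision * recall / (precision + recall)
```

Any segmentation method can then be compared against baselines by running it on the generated series and passing its boundaries to `boundary_f1` together with the true ones.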

Results

Results on synthetic data show improvements over current methods and established baselines. Furthermore, application to real-world data demonstrates accurate identification of ‘parenthood’ lifestates, COVID effects seen in rail customer satisfaction, and work/social lifestates identified in mobile application usage.

Associated Publications

Automatic Lifestate Identification and Clustering.
Summarising high-dimensional time series data across multiple entities is an increasingly prevalent problem because mass data collection has become routine in most domains. We propose a method of automatically summarising high-dimensional data…

International Journal of Population Data Science, 2023. Smith, S., Smith, G. & Harvey, J.

Automatic Lifestate Identification for High-Dimensional Time Series Data.
Time series summarisation methods that account for temporal structure in high-dimensional data are important for analysis in a wide variety of domains, yet current statistical and machine-learning tools offer limited support for this task. We introduce ALI (Automatic Lifestate Identification), a parameter-free algorithm that…

IEEE International Conference on Big Data, 2025. Smith, S., Smith, G. & Harvey, J.
