Subset Modeling

A Domain Partitioning Strategy for Data-efficient Machine Learning

Raphael Saldanha

Inria

Motivation

  • Complex data present internal diversity

  • ML systems may present a good overall performance

  • But it is not uniformly equal on all parts of the input

Objective

  • Propose a ML framework that accounts for shared characteristics and regional variations across a domain dataset

Method

  1. Given a dataset \(D\), train a global ML model \(G\)

  2. Identify a number of subsets \(S_k\) on \(D\)

  3. Train ML models on each \(S_k\)

  4. For inference, assign the incoming sample to the corresponding \(S_k\)

  5. Compare the performance observed on \(G\) and \(S\) models for each unit

Method

Workflow

flowchart LR
A[Domain data] --> G[Global Model]
A --> C(Clustering)
C --> K1[Subset model 1]
C --> K2[Subset model 2]
C --> Kn[Subset model k]
G --> P1[Inference]
K1 --> P2[Inference]
K2 --> P2
Kn --> P2
P1 --> AC[Performance comparison]
P2 --> AC

Subsets identification

  • The subset may have a priori definitions

  • Identified with data-driven methods, like clustering techniques

Preliminary results

  • Dengue dataset. A 10 year weekly time series of dengue cases incidence on Brazilian municipalities and related predictors, as temperature and rainfall.
    • \(k=5\) subsets were identified (multivariate DTW)
    • The subsets models rendered a better performance on 116 out of 333 municipalities in comparison with the global model. An improvement of 34.83%.

For SIGMOD2025

  • Increase the number of experiments, and include more municipalities
  • We would like to test the approach on another datasets to help generalize our findings.

GeoLifeCLEF 2023 competition

  • Large-scale training dataset of 5M plant occurrences
  • Validation set of 5K plots
  • Test set with 20K plots
  • Baseline model Spatial Random Forest (PA), trained with Presence-Absence data and longitude/latitude as covariates