Subset Modeling

A Domain Partitioning Strategy for Data-efficient Machine Learning

Raphael Saldanha

Inria

Motivation

Complex data present internal diversity
ML systems may show good overall performance
But that performance is not uniform across all parts of the input domain

Objective

Propose an ML framework that accounts for both shared characteristics and regional variations across a domain dataset

Method

Given a dataset \(D\), train a global ML model \(G\)
Identify a number of subsets \(S_k\) of \(D\)
Train an ML model on each \(S_k\)
For inference, assign each incoming sample to its corresponding \(S_k\) and predict with that subset's model
Compare the performance of the \(G\) and \(S_k\) models for each unit (e.g. each municipality)
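The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the implementation behind the reported results: the data generator, the least-squares learner, the k-means routine with farthest-first initialisation, and every name (`profile`, `fit`, `kmeans`, `nearest`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic domain: 20 units (e.g. municipalities), two hidden regimes.
n_units, n_obs, n_train = 20, 50, 40
regime = np.arange(n_units) % 2
slope = np.where(regime == 0, 3.0, -3.0)  # regime-specific input/output relation
# Unit-level descriptors, used only to form the subsets.
profile = np.where(regime == 0, -2.0, 2.0)[:, None] + rng.normal(0, 0.3, (n_units, 2))
X = rng.normal(0, 1, (n_units, n_obs, 1))
y = slope[:, None] * X[:, :, 0] + rng.normal(0, 0.1, (n_units, n_obs))

def fit(Xs, ys):
    """Least-squares linear model with intercept; a stand-in for any learner."""
    A = np.column_stack([Xs, np.ones(len(Xs))])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

def predict(coef, Xs):
    return np.column_stack([Xs, np.ones(len(Xs))]) @ coef

def nearest(P, centers):
    """Index of the closest center for each row of P."""
    return np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)

def kmeans(P, k, n_iter=20):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centers = [P[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(P - c, axis=1) for c in centers], axis=0)
        centers.append(P[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = nearest(P, centers)
        centers = np.array([P[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

Xtr, ytr = X[:, :n_train], y[:, :n_train]
Xte, yte = X[:, n_train:], y[:, n_train:]

# Step 1: global model G trained on all units.
G = fit(Xtr.reshape(-1, 1), ytr.reshape(-1))

# Step 2: identify k subsets of units.
k = 2
centers = kmeans(profile, k)
labels = nearest(profile, centers)

# Step 3: one model per subset S_k.
S = {j: fit(Xtr[labels == j].reshape(-1, 1), ytr[labels == j].reshape(-1))
     for j in range(k)}

# Steps 4-5: route each unit to its subset, then compare held-out error per unit.
mse_global = np.empty(n_units)
mse_subset = np.empty(n_units)
for u in range(n_units):
    j = int(nearest(profile[u:u + 1], centers)[0])
    mse_global[u] = np.mean((predict(G, Xte[u]) - yte[u]) ** 2)
    mse_subset[u] = np.mean((predict(S[j], Xte[u]) - yte[u]) ** 2)

wins = mse_subset < mse_global
print(f"subset models win on {wins.sum()} of {n_units} units")
```

The synthetic regimes are deliberately opposed, so the global model averages them out and the subset models win easily; on real data the comparison is expected to be mixed, which is why it is made unit by unit.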

Method

Workflow

flowchart LR
A[Domain data] --> G[Global Model]
A --> C(Clustering)
C --> K1[Subset model 1]
C --> K2[Subset model 2]
C --> Kn[Subset model k]
G --> P1[Inference]
K1 --> P2[Inference]
K2 --> P2
Kn --> P2
P1 --> AC[Performance comparison]
P2 --> AC

Subsets identification

Subsets may be defined a priori
Or identified with data-driven methods, such as clustering techniques

Preliminary results

Dengue dataset. A 10-year weekly time series of dengue case incidence in Brazilian municipalities, with related predictors such as temperature and rainfall.
- \(k=5\) subsets were identified (multivariate DTW)
- The subset models outperformed the global model in 116 of 333 municipalities (34.83% of the municipalities).
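For readers unfamiliar with multivariate dynamic time warping (DTW), the distance can be sketched as follows. This is an illustrative version under stated assumptions: it uses the "dependent" variant (all variables compared jointly at each time step) with a Euclidean step cost, and the function names and toy series are hypothetical. The slides do not state which DTW variant or clustering algorithm was used.

```python
import numpy as np

def dtw_distance(a, b):
    """Dependent multivariate DTW between two series of shape (T, n_vars)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Euclidean cost between every pair of time steps, over all variables jointly.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor alignments.
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m]

def pairwise_dtw(series):
    """Symmetric distance matrix over a list of multivariate series."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dtw_distance(series[i], series[j])
    return D

# Toy weekly series with two variables (hypothetical stand-ins for cases and rainfall).
t = np.linspace(0, 4 * np.pi, 52)
seasonal = np.column_stack([np.sin(t), np.cos(t)])
shifted = np.column_stack([np.sin(t - 0.3), np.cos(t - 0.3)])  # same shape, small lag
flat = np.zeros((52, 2))                                       # no seasonality

D = pairwise_dtw([seasonal, shifted, flat])
print(np.round(D, 2))
```

A matrix like `D` can be passed to any clustering method that accepts precomputed distances. The double loop is quadratic in series length, so a dedicated library would be preferable beyond toy sizes.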

For SIGMOD 2025

Increase the number of experiments, and include more municipalities

We would like to test the approach on other datasets to help generalize our findings.

GeoLifeCLEF 2023 competition

Large-scale training dataset of 5M plant occurrences
Validation set of 5K plots
Test set with 20K plots
Baseline model: Spatial Random Forest (PA), trained on Presence-Absence (PA) data with longitude/latitude as covariates