Advances in climate features engineering and subsets modeling

for Dengue forecasting

Raphael Saldanha

HPDaSc Workshop on Data Driven Science - Inria

2024-05-31

Context of Climate-Sensitive diseases

  • Direct relationship: floods, droughts, heat waves…

  • Indirect relationship

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart LR

climate(Climate) --> vector(Disease vectors) --> health(Human health)
climate --> health
climate --> social(Social & economic \n determinants) --> health

A time-lagged relationship

  • Vector life cycle from a time perspective
  • Past climate conditions lead to tomorrow's disease incidence
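One common way to encode a time-lagged relationship is to pair each period's case count with climate values from earlier periods. The sketch below is illustrative only: the `lagged` helper, the weekly values, and the choice of lags 1 and 2 are assumptions, not the lags or data used in this work.

```python
def lagged(series, lag):
    """Return the series shifted by `lag` steps; positions without a past value are None."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return [series[t - lag] if t - lag >= 0 else None for t in range(len(series))]


# Hypothetical weekly values for a single municipality (not real data).
weekly_rain_mm = [12.0, 80.5, 45.2, 3.1, 0.0, 22.4]
weekly_cases = [3, 4, 9, 15, 11, 6]

rain_lag1 = lagged(weekly_rain_mm, 1)
rain_lag2 = lagged(weekly_rain_mm, 2)

# Keep only the weeks where every lagged predictor is available.
rows = [
    {"week": t, "rain_lag1": l1, "rain_lag2": l2, "cases": y}
    for t, (l1, l2, y) in enumerate(zip(rain_lag1, rain_lag2, weekly_cases))
    if l1 is not None and l2 is not None
]
```

Each row then holds the rainfall of the one and two preceding weeks as predictors and the current week's cases as the target.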

ERA5-Land reanalysis

  • Copernicus, ECMWF
  • Global coverage
  • Hourly data
  • 1950 to the present (one week delay)
  • Spatial resolution ~9 km
  • Several climate indicators

Challenge on data structures

  • Climate indicators: grid data
  • Disease incidence: tabular, individual cases aggregated by spatial regions and time spans

Zonal statistics
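Zonal statistics bridge the two data structures by summarizing the grid cells that fall inside each region. The following is a minimal sketch under a simplifying assumption: every grid cell is assigned to at most one zone, with no area weighting for cells that straddle a boundary. The function name `zonal_stats` and the toy grids are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def zonal_stats(values, zones):
    """Aggregate grid-cell values per zone.

    `values` and `zones` are 2-D lists of the same shape; a zone of None
    marks a cell that lies outside every region.
    """
    cells = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is not None:
                cells[zone].append(value)
    return {
        zone: {"mean": mean(v), "min": min(v), "max": max(v), "count": len(v)}
        for zone, v in cells.items()
    }


# Toy 2 x 3 temperature grid (degrees Celsius) and the zone of each cell.
temperature_c = [
    [24.0, 25.0, 27.0],
    [23.0, 26.0, 28.0],
]
municipality = [
    ["A", "A", "B"],
    ["A", "B", None],
]

stats = zonal_stats(temperature_c, municipality)
```

The result is tabular (one record per zone and statistic), which matches the structure of the disease incidence data.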

Resulting products

  • ERA5-Land daily datasets
    • 7,105 files, 658.7 GB
    • 24,242 downloads on Zenodo
  • Daily zonal statistics of climate indicators
    • 8 selected indicators, 5,570 municipalities
    • 6,085,749,761 records covering 1950-2023

Precipitation

Rio de Janeiro municipalities. January 1, 2010.

Angra dos Reis

Publications & Products

  • Paper in the Environmental Data Science journal (Saldanha et al. 2024)
  • Datasets on Zenodo: more than 34,000 downloads
  • brclimr package to retrieve climate data for Brazilian municipalities: almost 4,000 downloads on CRAN

Subset models for multivariate time series forecast

  • Data may contain intrinsically diverse samples, which affects a model’s performance on different parts of the input
  • Global models: use all available time series
  • Local models: use only time series pertaining to each sample
  • Data subsets models: our proposal
  • Paper at ICDE 2024, Multivariate Time Series Analytics workshop
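The difference between the three strategies comes down to which time series feed each model. The sketch below only illustrates that grouping; the function `training_pools`, the series ids, and the subset labels are hypothetical, and no actual model is trained.

```python
def training_pools(series_ids, subset_of, strategy):
    """Return {model key: list of series ids used to train that model}."""
    if strategy == "global":
        # One model trained on every available series.
        return {"global": list(series_ids)}
    if strategy == "local":
        # One model per series, trained on that series alone.
        return {sid: [sid] for sid in series_ids}
    if strategy == "subset":
        # One model per subset, trained on the series assigned to it.
        pools = {}
        for sid in series_ids:
            pools.setdefault(subset_of[sid], []).append(sid)
        return pools
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical series ids and subset assignments.
series_ids = ["m1", "m2", "m3", "m4", "m5"]
subset_of = {"m1": "g1", "m2": "g1", "m3": "g2", "m4": "g2", "m5": "g2"}

global_pools = training_pools(series_ids, subset_of, "global")
local_pools = training_pools(series_ids, subset_of, "local")
subset_pools = training_pools(series_ids, subset_of, "subset")
```

With five series and two subsets, this yields one global model, five local models, and two subset models.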

Case example

  • Dengue is transmitted by mosquitoes and is a public health concern. Brazil registered a record number of cases in 2024, and incidence tends to increase with global warming
  • A typical forecasting model aims to predict the number of cases from climate indicators (rain and temperature)
  • A global model would use data from all municipalities, facing difficulties related to distinct temporal and spatial disease transmission patterns

Experimental setup

  • Identify data subsets from dengue case and covariate patterns across municipalities, using dynamic time warping (DTW) distance.
    • Select the optimal number of subsets (\(k\)) using the silhouette score
  • Train a random forest global model with and without the subset id as a feature
  • Train random forest Subsets Models
  • Evaluate the forecasting models’ performance on test data
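For readers unfamiliar with DTW, the distance used in the first step can be sketched with the classic dynamic programming recurrence. This is a minimal univariate version with absolute difference as the local cost and no warping window; those choices are assumptions for illustration, and the study's own configuration (for example, how cases and covariates are combined) is not specified here.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays on the same point
                cost[i][j - 1],      # b advances, a stays on the same point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two hypothetical epidemic curves with the same shape, one starting later.
early_peak = [0, 1, 2, 1, 0]
late_peak = [0, 0, 1, 2, 1, 0]
shifted_gap = dtw_distance(early_peak, late_peak)
```

Because DTW allows one point to align with several points of the other series, the two curves above are at distance zero even though they differ in length and timing, which is why it suits municipalities whose outbreaks have similar shapes but different onsets.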

Clustering results

  • \(k = 5\) returned the highest silhouette score
  • Partition sizes: \(g_1 = 69\), \(g_2 = 62\), \(g_3 = 82\), \(g_4 = 102\), \(g_5 = 18\)
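The silhouette score used to choose \(k\) can be computed directly from a pairwise distance matrix, which fits a DTW-based workflow. The sketch below is a plain implementation of the standard definition, with singleton clusters scored as zero by convention; the toy points and candidate labelings are hypothetical and unrelated to the partitions reported above.

```python
def silhouette_score(dist, labels):
    """Mean silhouette over all samples, given a precomputed distance matrix."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Four toy points on a line, with absolute difference as the distance.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_score(dist, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_score(dist, [0, 1, 0, 1])   # mixes distant points
```

Scores closer to 1 indicate compact, well-separated partitions; repeating the calculation for each candidate \(k\) and keeping the highest score mirrors the selection rule described in the experimental setup.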

Model results

Conclusions and next steps

  • Subsets models performed better than global models in 116 of 333 municipalities (34.83%)
  • The overall performance of subsets models is related to partition size: bigger partitions (more municipalities) have more training data.
  • We are working on different clustering strategies (size constraints and feature-based approaches) and on applying different learners for model training