Advances in climate features engineering and subsets modeling for Dengue forecasting

brclim
Talk at the HPDaSc Workshop on Data Driven Science.
Author: Raphael Saldanha
Affiliation: HPDaSc Workshop on Data Driven Science - Inria
Published: May 31, 2024

Context of climate-sensitive diseases

  • Direct relationship: floods, droughts, heat waves…

  • Indirect relationship

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart LR

climate(Climate) --> vector(Disease vectors) --> health(Human health)
climate --> health
climate --> social(Social & economic \n determinants) --> health

A time-lagged relationship

  • Vector life cycle from a time perspective
  • Past climate conditions lead to future disease incidence
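
The time-lagged relationship is usually encoded as lagged features. The following is a minimal sketch, not the talk's actual pipeline: the function name `lag_features`, the weekly temperature values, and the chosen lags are all illustrative assumptions.

```python
def lag_features(values, lags):
    """Return one row per time step holding the value observed `lag` steps earlier.

    Time steps without enough history get None for that lag.
    """
    rows = []
    for t in range(len(values)):
        row = {}
        for lag in lags:
            row[f"lag_{lag}"] = values[t - lag] if t - lag >= 0 else None
        rows.append(row)
    return rows


# Hypothetical weekly mean temperature (degrees Celsius) for one municipality.
weekly_temperature = [24.1, 25.3, 26.0, 27.2, 26.8, 25.9]

# Each row describes the climate 1, 2 and 4 weeks before the week being predicted.
features = lag_features(weekly_temperature, lags=[1, 2, 4])
```

Each row can then be paired with the case count of the same week, so the model only sees climate conditions that precede the incidence it predicts.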

ERA5-Land reanalysis

  • Copernicus, ECMWF
  • Global coverage
  • Hourly data
  • 1950 to the present (one week delay)
  • Spatial resolution ~9km
  • Several climate indicators

Challenge on data structures

  • Climate indicators: grid data
  • Disease incidence: tabular, individual cases aggregated by spatial regions and time spans

Zonal statistics
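
Zonal statistics bridge the two data structures by summarizing the grid cells that fall inside each region. The sketch below is a simplified illustration that assumes every cell has already been assigned to at most one municipality; real workflows intersect cells with polygon boundaries and may weight cells by coverage. The function name, grid values, and zone labels are hypothetical.

```python
from collections import defaultdict


def zonal_statistics(grid, zones):
    """Aggregate grid cell values by zone label.

    grid  : 2D list of cell values (e.g. daily precipitation)
    zones : 2D list of the same shape holding a zone label per cell,
            or None for cells outside every zone
    """
    cells = defaultdict(list)
    for grid_row, zone_row in zip(grid, zones):
        for value, zone in zip(grid_row, zone_row):
            if zone is not None:
                cells[zone].append(value)
    return {
        zone: {
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for zone, values in cells.items()
    }


# Hypothetical 2 x 3 precipitation grid and the municipality covering each cell.
precipitation = [
    [0.0, 2.0, 4.0],
    [1.0, 3.0, 5.0],
]
municipality = [
    ["A", "A", "B"],
    ["A", None, "B"],
]
stats = zonal_statistics(precipitation, municipality)
```

The result is tabular, one record per municipality and statistic, which matches the structure of the disease incidence data.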

Resulting products

  • ERA5-Land daily datasets
    • 7,105 files, 658.7 GB
    • 24,242 downloads on Zenodo
  • Daily zonal statistics of climate indicators
    • 8 selected indicators, 5,570 municipalities
    • 6,085,749,761 records covering 1950-2023

Precipitation

Rio de Janeiro municipalities. January 1, 2010.

Angra dos Reis

Publications & Products

  • Paper in the Environmental Data Science journal (Saldanha et al. 2024)
  • Datasets on Zenodo: more than 34,000 downloads
  • brclimr package to retrieve climate data for Brazilian municipalities: almost 4,000 downloads on CRAN

Subset models for multivariate time series forecast

  • Data may contain intrinsically diverse samples, so a model’s performance can vary across different parts of the input
  • Global models: use all available time series
  • Local models: use only time series pertaining to each sample
  • Data subsets models: our proposal
  • Paper at ICDE 2024, Multivariate Time Series Analytics workshop
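
The difference between the three strategies is which time series are pooled to train each model. The sketch below only illustrates that grouping; the function name `training_groups`, the municipality labels, and the subset ids are hypothetical and not taken from the paper.

```python
def training_groups(municipalities, subset_of, strategy):
    """Map each model to the municipalities whose series it is trained on.

    municipalities : list of municipality identifiers
    subset_of      : dict mapping a municipality to its data subset id
    strategy       : "global", "local" or "subset"
    """
    if strategy == "global":
        # One model trained on every available series.
        return {"global": list(municipalities)}
    if strategy == "local":
        # One model per municipality, trained on its own series only.
        return {m: [m] for m in municipalities}
    if strategy == "subset":
        # One model per data subset, trained on the series of its members.
        groups = {}
        for m in municipalities:
            groups.setdefault(subset_of[m], []).append(m)
        return groups
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical municipalities and subset assignments.
municipalities = ["A", "B", "C", "D"]
subset_of = {"A": 1, "B": 1, "C": 2, "D": 2}

global_groups = training_groups(municipalities, subset_of, "global")
local_groups = training_groups(municipalities, subset_of, "local")
subset_groups = training_groups(municipalities, subset_of, "subset")
```

Subsets models sit between the two extremes: each model sees more data than a local model, but only from municipalities with similar patterns.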

Case example

  • Dengue is transmitted by mosquitoes and is a public health concern. Brazil registered a record number of cases in 2024, and incidence tends to increase with global warming
  • A typical forecasting model predicts the number of cases from climate indicators (rain and temperature)
  • A global model would use data from all municipalities, facing difficulties related to distinct temporal and spatial disease transmission patterns

Experimental setup

  • Identify data subsets from the patterns of dengue cases and covariates across municipalities, using DTW distance
    • Select the optimal number of subsets (\(k\)) by silhouette score
  • Train a random forest global model with and without the subset id as a feature
  • Train random forest subsets models
  • Evaluate the forecasting models’ performance on test data
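
The first step relies on dynamic time warping (DTW), which compares two series while allowing them to be stretched or shifted in time. Below is a minimal sketch of the textbook DTW recurrence for univariate series, using absolute difference as the local cost and no window constraint. The experiments may well use a library implementation and multivariate series; the example series are made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two hypothetical case curves with the same shape, one week apart.
early_peak = [0, 1, 2, 1, 1]
late_peak = [0, 0, 1, 2, 1]
shifted = dtw_distance(early_peak, late_peak)

# A curve with a different level cannot be fully matched by warping.
offset = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because warping absorbs the one-week shift, municipalities with the same epidemic shape but different timing end up close to each other, which a point-by-point distance would miss.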

Clustering results

  • \(k = 5\) returned the highest silhouette score
  • Partition sizes: \(g_1 = 69\), \(g_2 = 62\), \(g_3 = 82\), \(g_4 = 102\), \(g_5 = 18\)
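
The silhouette score used to pick \(k\) compares, for each sample, the mean distance to its own cluster with the mean distance to the nearest other cluster. The following is a minimal sketch of the standard definition computed from a precomputed distance matrix (which could hold DTW distances). The four points and both labelings are toy values, not the study's data.

```python
def silhouette_score(dist, labels):
    """Mean silhouette coefficient from a pairwise distance matrix.

    dist   : square matrix (list of lists) of pairwise distances
    labels : cluster label of each sample
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster scores 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the closest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Toy example: four points on a line and their absolute differences.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_score(dist, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score(dist, [0, 1, 0, 1])   # clusters mixed together
```

Scores close to 1 indicate compact, well-separated subsets, so repeating the computation for each candidate \(k\) and keeping the highest score gives the selection rule described above.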

Model results

Conclusions and next steps

  • Subsets models performed better than global models in 116 of 333 municipalities (34.83%)
  • The overall performance of subsets models is related to partition size: bigger partitions (more municipalities) have more training data
  • We are working on different clustering strategies (size constraints and feature-based approaches) and on applying different learners for model training