Advances in climate features engineering and subsets modeling
for Dengue forecasting
brclim
Talk at the HPDaSc Workshop on Data Driven Science.
Context of climate-sensitive diseases
Direct relationship: floods, droughts, heat waves…
Indirect relationship
A time-lagged relationship
- Vector life cycle from a time perspective
- Past climate conditions lead to tomorrow’s disease incidence
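A minimal sketch of how a time-lagged relationship can be turned into model features: each week's case count is paired with climate values observed in earlier weeks. The function name `make_lagged_rows`, the chosen lags, and the weekly values are hypothetical and serve only as illustration; they do not come from the talk.

```python
def make_lagged_rows(climate, cases, lags):
    """Pair each period's case count with climate values from earlier periods."""
    max_lag = max(lags)
    rows = []
    # Start at max_lag so that every requested lag has an observed value.
    for t in range(max_lag, len(cases)):
        features = [climate[t - lag] for lag in lags]
        rows.append((features, cases[t]))
    return rows


# Hypothetical weekly values, for illustration only.
weekly_rain_mm = [10.0, 80.0, 120.0, 30.0, 5.0, 60.0]
weekly_cases = [2, 3, 4, 15, 22, 9]

# Features are rainfall one and two weeks before the target week.
rows = make_lagged_rows(weekly_rain_mm, weekly_cases, lags=[1, 2])
for features, target in rows:
    print(features, "->", target)
```

The first `max(lags)` periods are dropped because they have no complete climate history; suitable lag lengths depend on the vector life cycle and are not specified here.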
ERA5-Land reanalysis
- Copernicus, ECMWF
- Global coverage
- Hourly data
- 1950 to the present (one week delay)
- Spatial resolution ~9km
- Several climate indicators
Challenge on data structures
- Climate indicators: grid data
- Disease incidence: tabular, individual cases aggregated by spatial regions and time spans
Zonal statistics
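As a rough illustration of the idea, zonal statistics reduce gridded values to one value per spatial region. The sketch below assumes the region boundaries have already been rasterized into a grid of zone ids aligned with the climate grid; a real workflow would typically overlay municipality polygons on the grid and may weight cells by coverage. The function `zonal_mean`, the grid values, and the zone ids are hypothetical.

```python
def zonal_mean(grid, zones):
    """Average grid-cell values per zone id; cells whose zone is None are skipped."""
    sums, counts = {}, {}
    for value_row, zone_row in zip(grid, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None:  # cell outside every region
                continue
            sums[zone] = sums.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Hypothetical 2 x 3 precipitation grid (mm) and matching zone ids.
precip_mm = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
municipality_of_cell = [
    ["A", "A", "B"],
    ["A", None, "B"],
]

daily = zonal_mean(precip_mm, municipality_of_cell)
print(daily)
```

The result is tabular (one row per region and day), which matches the structure of aggregated disease incidence data.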
Resulting products
- ERA5-Land daily datasets
- 7,105 files, 658.7 GB
- 24,242 downloads on Zenodo
- Daily zonal statistics of climate indicators
- 8 selected indicators, 5,570 municipalities
- 6,085,749,761 records covering 1950-2023
Precipitation
Rio de Janeiro municipalities. January 1, 2010.
Angra dos Reis
Publications & Products
- Paper in the Environmental Data Science journal (Saldanha et al. 2024)
- Datasets on Zenodo: more than 34,000 downloads
- brclimr package to retrieve climate data for Brazilian municipalities: almost 4,000 downloads on CRAN
Subset models for multivariate time series forecast
- Data may present intrinsic diversity among samples, affecting a model’s performance on different parts of the input
- Global models: use all available time series
- Local models: use only time series pertaining to each sample
- Data subset models: our proposal
- Paper at ICDE 2024, Multivariate Time Series Analytics workshop
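The three strategies differ in which time series feed the training of the model used for a given unit. The sketch below only illustrates that selection step, under the assumption that each unit has already been assigned a subset id; the function `training_pools`, the unit names, and the subset ids are hypothetical.

```python
def training_pools(series_by_unit, subset_of_unit, target_unit):
    """Return the units whose series each strategy would train on for one target unit."""
    subset_id = subset_of_unit[target_unit]
    return {
        # Global model: every available series.
        "global": sorted(series_by_unit),
        # Local model: only the target unit's own series.
        "local": [target_unit],
        # Subset model: series of all units sharing the target's subset id.
        "subset": sorted(u for u in series_by_unit if subset_of_unit[u] == subset_id),
    }


# Hypothetical series and subset assignments, for illustration only.
series_by_unit = {
    "m1": [3, 5, 8],
    "m2": [2, 6, 9],
    "m3": [40, 10, 0],
    "m4": [35, 12, 1],
}
subset_of_unit = {"m1": 0, "m2": 0, "m3": 1, "m4": 1}

pools = training_pools(series_by_unit, subset_of_unit, target_unit="m3")
print(pools)
```

A subset model sits between the two extremes: it sees more data than a local model while avoiding series whose patterns differ strongly from the target's.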
Case example
- Dengue is transmitted by mosquitoes and is a public health concern. Brazil had a record number of cases in 2024, with a tendency to increase with global warming
- A typical forecasting model aims to predict the number of cases based on climate indicators (rain and temperature)
- A global model would use data from all municipalities, facing difficulties related to distinct temporal and spatial disease transmission patterns
Experimental setup
- Identify data subsets by comparing patterns of dengue cases and covariates across municipalities using the dynamic time warping (DTW) distance
- Select the optimal number of subsets (\(k\)) using the silhouette score
- Train a random forest global model with and without the subset id as a feature
- Train random forest subset models
- Evaluate the forecasting models’ performance on test data
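The first two steps can be sketched as follows: compute pairwise DTW distances, cluster on the resulting distance matrix, and keep the \(k\) with the highest silhouette score. The talk does not name the clustering algorithm, so the simple k-medoids routine here is an assumption chosen because it works directly on a precomputed distance matrix. All function names and the toy series are hypothetical, and the random forest training steps are not shown.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def distance_matrix(series):
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, n_iter=20):
    """Simple alternating k-medoids (an assumed choice, not the talk's method)."""
    n = len(dist)
    medoids = [0]  # deterministic farthest-first initialisation
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(n_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids)


def silhouette_score(dist, labels):
    """Mean silhouette over all points; points in singleton clusters score 0."""
    n = len(dist)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
                for c in clusters if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical toy series: three low-incidence and three high-incidence units.
series = [
    [0, 1, 0, 1, 0], [0, 1, 1, 0, 0], [1, 0, 1, 0, 1],
    [10, 12, 15, 12, 10], [10, 13, 15, 11, 10], [11, 12, 14, 12, 10],
]
dist = distance_matrix(series)
scores = {k: silhouette_score(dist, k_medoids(dist, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
labels = k_medoids(dist, best_k)
print(best_k, labels)
```

The resulting labels play the role of the subset id: they can be added as a feature to a global model or used to split the training data into one model per subset.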
Clustering results
- \(k = 5\) returned the highest silhouette score
- Partition sizes: \(g_1 = 69\), \(g_2 = 62\), \(g_3 = 82\), \(g_4 = 102\), \(g_5 = 18\)
Model results
Conclusions and next steps
- Subset models outperformed global models in 116 of 333 municipalities (34.83%)
- The overall performance of subset models is related to partition size: bigger partitions (more municipalities) have more training data
- We are working on different clustering strategies (size constraints and feature-based approaches) and on applying different learners in model training