Disease and climate data fusion for modeling

An application case for Brazil

Raphael Saldanha

Inria

2024-03-21

Introduction

Postdoctoral researcher at Inria, Zenith team
Master’s at Public Health, Doctorate on Health Information
Data Science applied to Public Health

Climate sensitive diseases

Direct relationship: floods, droughts, heat waves…
Indirect relationship

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart LR

climate(Climate) --> vector(Disease vectors) --> health(Human health)
climate --> health
climate --> social(Social & economic \n determinants) --> health

A time-lagged relationship

Vector life cycle from a time perspective
Climate conditions from the past leads to the disease incidence of tomorrow

Climate data

Data sources
- In situ: Weather stations, rain gauges
- Remote: Satellites, drones
Data products
- Statistical surface interpolations
- Model reanalysis

ERA5-Land reanalysis

Copernicus, ECMWF
Global coverage
Hourly data
1950 to the present (one week delay)
Spatial resolution ~9km
Several climate indicators

Data structures

Climate indicators: grid data
Disease incidence: tabular, individual cases aggregated by spatial regions and time spans

Fusioning data

Case example

Zonal Statistics of Climate Indicators from ERA5-Land for Brazilian Municipalities

8 climate indicators: maximum, minimum and average temperature, total precipitation, surface pressure, dewpoint, \(u\) and \(v\) components of wind
6 zonal statistics computation for the 5,570 Brazilian municipalities
- Minimum, maximum, average, sum, standard deviation, cell count
Time coverage: 1950-2022, daily data

Workflow

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart TD

era5(ERA5-Land \n indicators) --> hdata(Hourly data)
bb(Latin America \n bounding box) --> hdata

hdata --> agg(Aggregation to \n daily data)

agg --> mun(Municipal boundaries)
mun --> zs(Zonal statistics)

ERA5-Land Daily datasets

Open data, available at Zenodo
7,105 files, 658.7 GB
Reproducible R scripts
Plans to continuously update this dataset and add more indicators

https://rfsaldanha.github.io/data-projects/era5land-daily-latin-america.html

Zonal statistics

Challenges to handle the amount of data and computational tasks
Strategy
- Group the tasks into chunks and compute in parallel
- DAG (Directed-Acyclic Graph) approach to orchestrate computation, with the {targets} package
- Save results into columnar-oriented databases for fast data retrieval (duckdb and parquet)

6,085,749,761 records

Temperature

Precipitation

Rio de Janeiro municipalities. January 1, 2010.

Angra dos Reis

Resolution and spatial variability

Brazilian municipalities size variation
- Altamira (PA): 159,533 km2
- Santa Cruz de Minas (MG): 3 km2
ERA5-Land cell: ~ 100 km2

Published paper

Environmental Data Science journal
Published on February 8, 2024
Journal’s most read paper of the month
More than 9,000 datasets downloads on Zenodo so far
Inria has an agreement for APC fees with the Cambridge University Press
Swift submission process

brclimr R package

Package to retrieve climate data of Brazilian municipalities
Query remote parquet files stored on a S3 system
Avoid huge dataset downloads when user wants only a subset of the data
Available on CRAN. More than 3,000 downloads

brclimr::fetch_data(
    code_muni = 3304557,
    product = "brdwgd",
    indicator = "rh",
    statistics = "mean",
    date_start = as.Date("2010-10-15"),
    date_end = as.Date("2010-10-20")
  )

Usage

Training multivariate machine learning models to forecast dengue incidence in Brazil with a subsets strategy

\[ {\small \begin{aligned} D_t = \mu + & \theta_1 D_{t-1} + \cdots + \theta_p D_{t-p} + \\ & \lambda_1 C_{t-1} + \cdots + \lambda_p C_{t-p} + \\ & \varepsilon_1 e_{t-1} + \cdots + \varepsilon_p e_{t-p} \end{aligned}} \]

Cluster municipalities based on dengue spread and climate regimes
Train different models for each partition

Accepted paper on ICDE2024, Multivariate Time Series Analytics workshop

Next steps…

Continuous update
Human settlements, population-weighted zonal statistics
Compute climate time-series features: heat waves, persistent rains, etc.
Adopt climate products with finer resolutions when possible (CHIRPS)
Expand results to other countries

Thanks

Backup slides

Climate reanalysis

ERA5-Land hourly to daily aggregation

Usage of {KrigR} package to access the Copernicus Climate Data Store API, crop data at server side, download and perform the time aggregation.

download_ERA(Variable = "2m_temperature", DataSet = "era5-land", 
             DateStart = "2022-12-01", DateStop = "2022-12-31",
             TResolution = "day", TStep = 1,
             FUN = "max",
             Extent = extent(c(-118.47,-34.1,-56.65, 33.28)), 
             Dir = "dir_name", FileName = "file_name.nc", 
             API_User = "api_user", API_Key = "api_key")

Took ~15 days to download and process the data from 8 climate indicators covering the Latin America region

Computation

Zonal statistics weighted by the fraction of the cell that is covered, with the {exactextractr} package

exact_extract(
  x = rst,
  y = pol,
  fun = "mean"
)

Results

ERA5-Land indicators	Daily time-aggregating functions	Spatial zonal statistics
Temperature (2m)	mean, max, min	max, min, stdev, count
Dewpoint temp. (2m)	mean	max, min, stdev, count
\(u\) component of wind	mean	max, min, stdev, count
\(v\) component of wind	mean	max, min, stdev, count
Surface pressure	mean	max, min, stdev, count
Total precipitation	sum	max, min, stdev, count, sum