Disease and climate data fusion for modeling

An application case for Brazil

Raphael Saldanha

Inria

2024-03-21

Introduction

  • Postdoctoral researcher at Inria, Zenith team
  • Master’s in Public Health, doctorate in Health Information
  • Data Science applied to Public Health

Climate sensitive diseases

  • Direct relationship: floods, droughts, heat waves…

  • Indirect relationship

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart LR

climate(Climate) --> vector(Disease vectors) --> health(Human health)
climate --> health
climate --> social(Social & economic \n determinants) --> health

A time-lagged relationship

  • Vector life cycle from a time perspective
  • Past climate conditions lead to tomorrow’s disease incidence
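A minimal sketch of how a time-lagged relationship can be encoded as model inputs, assuming a daily series for one municipality. The data frame, its values, and the helper `lag_k()` are hypothetical and only illustrate the idea; the lags of 1 and 7 days are arbitrary choices.

```r
# Hypothetical daily series for one municipality (values are made up).
df <- data.frame(
  date  = as.Date("2020-01-01") + 0:9,
  temp  = c(25, 26, 27, 28, 27, 26, 25, 24, 26, 27),
  cases = c(3, 4, 4, 6, 7, 9, 8, 7, 6, 8)
)

# Shift a vector k >= 1 steps forward in time, padding the start with NA,
# so that row t holds the value observed at t - k.
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

# Temperature observed 1 and 7 days before each case count.
df$temp_lag1 <- lag_k(df$temp, 1)
df$temp_lag7 <- lag_k(df$temp, 7)
```

Each row now pairs the case count of a given day with climate conditions from earlier days, which is the shape a forecasting model needs.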

Climate data

  • Data sources
    • In situ: Weather stations, rain gauges
    • Remote: Satellites, drones
  • Data products
    • Statistical surface interpolations
    • Model reanalysis

ERA5-Land reanalysis

  • Copernicus, ECMWF
  • Global coverage
  • Hourly data
  • 1950 to the present (one week delay)
  • Spatial resolution ~9 km
  • Several climate indicators

Data structures

  • Climate indicators: grid data
  • Disease incidence: tabular, individual cases aggregated by spatial regions and time spans

Fusing data

Case example

Zonal Statistics of Climate Indicators from ERA5-Land for Brazilian Municipalities

  • 8 climate indicators: maximum, minimum and average temperature, total precipitation, surface pressure, dewpoint, \(u\) and \(v\) components of wind
  • 6 zonal statistics computed for each of the 5,570 Brazilian municipalities
    • Minimum, maximum, average, sum, standard deviation, cell count
  • Time coverage: 1950-2022, daily data

Workflow

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'fontSize': '30px'
    }
  }
}%%
flowchart TD

era5(ERA5-Land \n indicators) --> hdata(Hourly data)
bb(Latin America \n bounding box) --> hdata

hdata --> agg(Aggregation to \n daily data)

agg --> mun(Municipal boundaries)
mun --> zs(Zonal statistics)
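The hourly-to-daily aggregation step of the workflow can be sketched in base R for a single grid cell. The hourly values below are synthetic, and grouping by UTC calendar day is an assumption; the actual processing operates on full raster grids.

```r
# Synthetic hourly 2m temperature for one grid cell over two days.
hours <- seq(as.POSIXct("2022-12-01 00:00", tz = "UTC"),
             by = "hour", length.out = 48)
temp <- 20 + 5 * sin(2 * pi * (seq_along(hours) - 1) / 24)

# Calendar day of each hourly observation (UTC is an assumption).
day <- as.Date(hours, tz = "UTC")

# One value per day for each time-aggregating function.
tmax  <- tapply(temp, day, max)
tmin  <- tapply(temp, day, min)
tmean <- tapply(temp, day, mean)

daily <- data.frame(
  day   = as.Date(names(tmax)),
  tmax  = as.numeric(tmax),
  tmin  = as.numeric(tmin),
  tmean = as.numeric(tmean)
)
```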

ERA5-Land Daily datasets

  • Open data, available at Zenodo
  • 7,105 files, 658.7 GB
  • Reproducible R scripts
  • Plans to continuously update this dataset and add more indicators

https://rfsaldanha.github.io/data-projects/era5land-daily-latin-america.html

Zonal statistics

  • Challenge: handling the volume of data and the number of computational tasks
  • Strategy
    • Group the tasks into chunks and compute in parallel
    • DAG (directed acyclic graph) approach to orchestrate the computation, with the {targets} package
    • Save results in column-oriented formats for fast data retrieval (DuckDB and Parquet)
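A minimal sketch of the chunk-and-parallelize idea using only the {parallel} package that ships with R. It does not use {targets}; the task table, the chunk size, and `process_chunk()` are hypothetical placeholders for the real zonal statistics jobs.

```r
# Hypothetical task list: one task per indicator and year.
tasks <- expand.grid(
  indicator = c("t2m_max", "t2m_min", "tp"),
  year = 2020:2022,
  stringsAsFactors = FALSE
)

# Group the tasks into chunks of at most 4 rows.
chunk_size <- 4
chunks <- split(tasks, ceiling(seq_len(nrow(tasks)) / chunk_size))

# Placeholder worker: the real job would compute zonal statistics here.
process_chunk <- function(chunk) {
  chunk$status <- "done"
  chunk
}

# Forked workers are not available on Windows, so fall back to one core.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
results <- parallel::mclapply(chunks, process_chunk, mc.cores = n_cores)

# Combine the per-chunk results into one table.
out <- do.call(rbind, results)
```

Chunking keeps each worker's memory footprint bounded and lets a failed chunk be rerun without repeating the others.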

6,085,749,761 records

Temperature

Precipitation

Rio de Janeiro municipalities. January 1, 2010.

Angra dos Reis

Resolution and spatial variability

  • Brazilian municipalities size variation
    • Altamira (PA): 159,533 km²
    • Santa Cruz de Minas (MG): 3 km²
  • ERA5-Land cell: ~100 km²
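A back-of-the-envelope check of what these sizes imply, taking the approximate cell area of 100 km² at face value:

```r
# Areas in square kilometers, as listed above.
cell_area <- 100
muni_area <- c("Altamira" = 159533, "Santa Cruz de Minas" = 3)

# Approximate number of ERA5-Land cells needed to cover each municipality.
cells_per_muni <- muni_area / cell_area
```

The largest municipality spans on the order of 1,600 cells, while the smallest covers only about 3% of a single cell, so its zonal statistics reflect one cell's value rather than any within-municipality variability.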

Published paper

  • Environmental Data Science journal
  • Published on February 8, 2024
  • Journal’s most read paper of the month
  • More than 9,000 dataset downloads on Zenodo so far
  • Inria has an agreement with Cambridge University Press covering APC fees
  • Swift submission process

brclimr R package

  • Package to retrieve climate data of Brazilian municipalities
  • Queries remote Parquet files stored on an S3 system
  • Avoids huge dataset downloads when the user wants only a subset of the data
  • Available on CRAN. More than 3,000 downloads
brclimr::fetch_data(
    code_muni = 3304557,
    product = "brdwgd",
    indicator = "rh",
    statistics = "mean",
    date_start = as.Date("2010-10-15"),
    date_end = as.Date("2010-10-20")
  )

Usage

  • Training multivariate machine learning models to forecast dengue incidence in Brazil with a subset-based strategy

\[ {\small \begin{aligned} D_t = \mu + & \theta_1 D_{t-1} + \cdots + \theta_p D_{t-p} + \\ & \lambda_1 C_{t-1} + \cdots + \lambda_p C_{t-p} + \\ & \varepsilon_1 e_{t-1} + \cdots + \varepsilon_p e_{t-p} \end{aligned}} \]

  • Cluster municipalities based on dengue spread and climate regimes
  • Train different models for each partition
  • Paper accepted at ICDE 2024, Multivariate Time Series Analytics workshop
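A minimal sketch of the cluster-then-train idea in base R, on simulated data. It swaps in k-means and an ordinary linear model with a single lag (p = 1, no moving-average term) purely for illustration; it is not the method of the accepted paper, and every value and column name is hypothetical.

```r
set.seed(42)

# Hypothetical municipality-level summaries (made-up numbers).
muni <- data.frame(
  code       = 1:6,
  mean_temp  = c(27, 28, 26.5, 18, 19, 17.5),
  mean_cases = c(120, 140, 110, 10, 15, 8)
)

# Partition municipalities on standardized climate and disease features.
km <- kmeans(scale(muni[, c("mean_temp", "mean_cases")]),
             centers = 2, nstart = 10)
muni$cluster <- km$cluster

# Simulate a short series per municipality and build lag-1 predictors.
panel <- do.call(rbind, lapply(muni$code, function(i) {
  n <- 30
  clim  <- muni$mean_temp[i] + rnorm(n)
  cases <- muni$mean_cases[i] + 2 * clim + rnorm(n)
  data.frame(
    code    = i,
    cluster = muni$cluster[i],
    D       = cases[-1],   # D_t
    D_lag1  = cases[-n],   # D_{t-1}
    C_lag1  = clim[-n]     # C_{t-1}
  )
}))

# Train one model per partition.
models <- lapply(split(panel, panel$cluster), function(d) {
  lm(D ~ D_lag1 + C_lag1, data = d)
})
```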

Next steps…

  • Continuous update
  • Human settlements, population-weighted zonal statistics
  • Compute climate time-series features: heat waves, persistent rains, etc.
  • Adopt climate products with finer resolutions when possible (CHIRPS)
  • Expand results to other countries

Thanks

Backup slides

Climate reanalysis

ERA5-Land hourly to daily aggregation

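library(KrigR)   # provides download_ERA()
library(raster)  # provides extent()

# Daily maximum of hourly 2m temperature for December 2022,
# restricted to the Latin America bounding box (xmin, xmax, ymin, ymax)
download_ERA(Variable = "2m_temperature", DataSet = "era5-land",
             DateStart = "2022-12-01", DateStop = "2022-12-31",
             TResolution = "day", TStep = 1,
             FUN = "max",
             Extent = extent(c(-118.47, -34.1, -56.65, 33.28)),
             Dir = "dir_name", FileName = "file_name.nc",
             API_User = "api_user", API_Key = "api_key")
  • It took ~15 days to download and process the data for the 8 climate indicators covering Latin America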

Computation

  • Zonal statistics weighted by the fraction of each cell covered by the polygon, with the {exactextractr} package
library(exactextractr)

# rst: raster with a climate indicator; pol: municipality polygons
exact_extract(
  x = rst,
  y = pol,
  fun = "mean"
)
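To make the coverage-fraction weighting concrete, here is a small worked example in base R with made-up numbers: four cells intersecting one polygon, each with a value and the fraction of its area that falls inside the polygon.

```r
# Made-up cell values (e.g. temperature) and coverage fractions.
values   <- c(24.1, 25.3, 26.0, 23.8)
coverage <- c(1, 0.5, 0.25, 0)

# Coverage-weighted mean: fully covered cells count the most,
# cells outside the polygon (coverage 0) do not count at all.
zonal_mean <- weighted.mean(values, coverage)

# The same quantity written out explicitly.
zonal_mean_manual <- sum(values * coverage) / sum(coverage)
```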

Results

ERA5-Land indicator      | Daily time-aggregating functions | Spatial zonal statistics
Temperature (2m)         | mean, max, min                   | max, min, stdev, count
Dewpoint temp. (2m)      | mean                             | max, min, stdev, count
\(u\) component of wind  | mean                             | max, min, stdev, count
\(v\) component of wind  | mean                             | max, min, stdev, count
Surface pressure         | mean                             | max, min, stdev, count
Total precipitation      | sum                              | max, min, stdev, count, sum