%%{
init: {
'theme': 'base',
'themeVariables': {
'fontSize': '30px'
}
}
}%%
flowchart LR
climate(Climate) --> vector(Disease vectors) --> health(Human health)
climate --> health
climate --> social(Social & economic \n determinants) --> health
Disease and climate data fusion for modeling
An application case for Brazil
brclim
Talk at the Inria Zenith team seminar.
Introduction
- Postdoctoral researcher at Inria, Zenith team
- Master’s in Public Health, doctorate in Health Information
- Data Science applied to Public Health
Climate-sensitive diseases
Direct relationship: floods, droughts, heat waves…
Indirect relationship
A time-lagged relationship
- Vector life cycle from a time perspective
- Past climate conditions lead to tomorrow's disease incidence
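
A minimal sketch of how a time-lagged climate predictor can be built, assuming weekly aggregated data. The data frame `weekly`, its values, the helper `lag_vec()`, and the two-week lag are all hypothetical and only illustrate the idea.

```r
# Shift a series k steps into the past, padding the start with NA.
lag_vec <- function(x, k) {
  stopifnot(k >= 0, k <= length(x))
  c(rep(NA, k), head(x, length(x) - k))
}

# Hypothetical weekly series for one municipality (made-up numbers).
weekly <- data.frame(
  week  = 1:8,
  temp  = c(24, 25, 27, 28, 27, 26, 25, 24),
  cases = c(3, 4, 4, 6, 9, 12, 11, 8)
)

# Temperature observed two weeks before each case count.
weekly$temp_lag2 <- lag_vec(weekly$temp, 2)

# Association between current cases and past temperature,
# ignoring the first rows where the lag is undefined.
lag_cor <- cor(weekly$cases, weekly$temp_lag2, use = "complete.obs")
```

In practice the lag would be chosen from the vector life cycle and checked against the data rather than fixed in advance.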

Climate data
- Data sources
  - In situ: weather stations, rain gauges
  - Remote: satellites, drones
- Data products
  - Statistical surface interpolations
  - Model reanalysis
ERA5-Land reanalysis
- Copernicus, ECMWF
- Global coverage
- Hourly data
- 1950 to the present (one week delay)
- Spatial resolution ~9 km
- Several climate indicators

Data structures
- Climate indicators: grid data
- Disease incidence: tabular, individual cases aggregated by spatial regions and time spans


Fusing data

Case example
Zonal Statistics of Climate Indicators from ERA5-Land for Brazilian Municipalities
- 8 climate indicators: maximum, minimum and average temperature, total precipitation, surface pressure, dewpoint, \(u\) and \(v\) components of wind
- 6 zonal statistics computed for the 5,570 Brazilian municipalities
  - Minimum, maximum, average, sum, standard deviation, cell count
- Time coverage: 1950-2022, daily data
Workflow
%%{
init: {
'theme': 'base',
'themeVariables': {
'fontSize': '30px'
}
}
}%%
flowchart TD
era5(ERA5-Land \n indicators) --> hdata(Hourly data)
bb(Latin America \n bounding box) --> hdata
hdata --> agg(Aggregation to \n daily data)
agg --> mun(Municipal boundaries)
mun --> zs(Zonal statistics)
ERA5-Land Daily datasets
- Open data, available at Zenodo
- 7,105 files, 658.7 GB
- Reproducible R scripts
- Plans to continuously update this dataset and add more indicators

https://rfsaldanha.github.io/data-projects/era5land-daily-latin-america.html
Zonal statistics
- Challenges in handling the volume of data and the number of computational tasks
- Strategy
  - Group the tasks into chunks and compute them in parallel
  - Directed acyclic graph (DAG) approach to orchestrate the computation, with the {targets} package
  - Save results in column-oriented storage for fast data retrieval (DuckDB and Parquet)
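
The chunking step can be sketched in base R as follows. This is an illustration only, not the project's {targets} pipeline: `make_chunks()`, `process_chunk()`, the chunk size of 500, and the placeholder municipality ids are assumptions made for the example.

```r
# Split a vector of task ids into chunks of at most `chunk_size` elements.
make_chunks <- function(ids, chunk_size) {
  split(ids, ceiling(seq_along(ids) / chunk_size))
}

# Placeholder ids standing in for the 5,570 municipalities.
muni_ids <- seq_len(5570)
chunks <- make_chunks(muni_ids, chunk_size = 500)

# Stand-in for the real per-chunk work (e.g. zonal statistics).
process_chunk <- function(ids) {
  data.frame(id = ids, value = ids * 2)
}

# Sequential map over chunks. On Unix-like systems, lapply() could be
# swapped for parallel::mclapply(chunks, process_chunk, mc.cores = 2).
results <- do.call(rbind, lapply(chunks, process_chunk))
```

Because each chunk is independent, a failed chunk can be recomputed without redoing the others, which is the property a DAG-based orchestrator relies on.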

6,085,749,761 records
Temperature


Precipitation


Rio de Janeiro municipalities. January 1, 2010.
Angra dos Reis

Resolution and spatial variability
- Brazilian municipalities vary widely in size
  - Altamira (PA): 159,533 km²
  - Santa Cruz de Minas (MG): 3 km²
- ERA5-Land cell: ~100 km²
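
A rough back-of-the-envelope check of what these sizes imply, using only the figures listed above. The column name `approx_cells` is introduced here for illustration, and the ratio ignores municipality shape and cell alignment.

```r
# Approximate ERA5-Land cell area, as stated above.
cell_area_km2 <- 100

muni <- data.frame(
  name     = c("Altamira (PA)", "Santa Cruz de Minas (MG)"),
  area_km2 = c(159533, 3)
)

# Municipality area expressed in grid cells.
muni$approx_cells <- muni$area_km2 / cell_area_km2
```

Altamira spans on the order of 1,600 cells, while Santa Cruz de Minas covers about 3% of a single cell, so its zonal statistics reflect one or very few cell values.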


Published paper
- Environmental Data Science journal
- Published on February 8, 2024
- Journal’s most read paper of the month
- More than 9,000 dataset downloads on Zenodo so far
- Inria has an agreement with Cambridge University Press covering APC fees
- Swift submission process
brclimr R package
- Package to retrieve climate data of Brazilian municipalities
- Queries remote Parquet files stored on an S3 system
- Avoids huge dataset downloads when the user wants only a subset of the data
- Available on CRAN. More than 3,000 downloads
# Fetch a daily zonal statistic for one municipality and date range
brclimr::fetch_data(
  code_muni = 3304557,
  product = "brdwgd",
  indicator = "rh",
  statistics = "mean",
  date_start = as.Date("2010-10-15"),
  date_end = as.Date("2010-10-20")
)
Usage
- Training multivariate machine learning models to forecast dengue incidence in Brazil using a subset strategy
\[ {\small \begin{aligned} D_t = \mu + & \theta_1 D_{t-1} + \cdots + \theta_p D_{t-p} + \\ & \lambda_1 C_{t-1} + \cdots + \lambda_p C_{t-p} + \\ & \varepsilon_1 e_{t-1} + \cdots + \varepsilon_p e_{t-p} \end{aligned}} \]
- Cluster municipalities based on dengue spread and climate regimes
- Train different models for each partition
- Paper accepted at ICDE 2024, Multivariate Time Series Analytics workshop
Next steps…
- Continuous update
- Human settlements, population-weighted zonal statistics
- Compute climate time-series features: heat waves, persistent rains, etc.
- Adopt climate products with finer resolutions when possible (CHIRPS)
- Expand results to other countries

Thanks
Backup slides
Climate reanalysis
ERA5-Land hourly to daily aggregation
- The {KrigR} package is used to access the Copernicus Climate Data Store API, crop the data on the server side, download it, and perform the time aggregation.
library(KrigR)
library(raster)  # provides extent()

# Download hourly 2 m temperature and aggregate it to daily maxima
download_ERA(
  Variable = "2m_temperature", DataSet = "era5-land",
  DateStart = "2022-12-01", DateStop = "2022-12-31",
  TResolution = "day", TStep = 1,
  FUN = "max",
  Extent = extent(c(-118.47, -34.1, -56.65, 33.28)),  # xmin, xmax, ymin, ymax
  Dir = "dir_name", FileName = "file_name.nc",
  API_User = "api_user", API_Key = "api_key"
)
- It took ~15 days to download and process the data for the 8 climate indicators covering Latin America
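
The hourly-to-daily aggregation itself can be illustrated with base R on a toy series. This sketch works on a plain data frame, not on the NetCDF grids handled by {KrigR}; the `hourly` data frame and its temperature values are invented for the example.

```r
# Two days of made-up hourly 2 m temperature for a single grid cell.
hourly <- data.frame(
  time = seq(as.POSIXct("2022-12-01 00:00", tz = "UTC"),
             by = "hour", length.out = 48),
  t2m  = c(seq(20, 31.5, by = 0.5), seq(18, 29.5, by = 0.5))
)

# Calendar day of each hourly record.
hourly$date <- as.Date(hourly$time, tz = "UTC")

# Group hourly values by day and apply the daily aggregating functions.
by_day <- split(hourly$t2m, hourly$date)
daily <- data.frame(
  date     = as.Date(names(by_day)),
  t2m_mean = vapply(by_day, mean, numeric(1)),
  t2m_max  = vapply(by_day, max, numeric(1)),
  t2m_min  = vapply(by_day, min, numeric(1)),
  row.names = NULL
)
```

The same pattern applies per grid cell; only the aggregating function changes between indicators (for example, a sum for precipitation).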
Computation
- Zonal statistics weighted by the fraction of the cell that is covered, with the {exactextractr} package
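library(exactextractr)
# rst: raster of a daily climate indicator; pol: municipality polygons
exact_extract(
  x = rst,
  y = pol,
  fun = "mean"  # mean weighted by the covered fraction of each cell
)
Results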
| ERA5-Land indicators | Daily time-aggregating functions | Spatial zonal statistics |
|---|---|---|
| Temperature (2m) | mean, max, min | max, min, stdev, count |
| Dewpoint temp. (2m) | mean | max, min, stdev, count |
| \(u\) component of wind | mean | max, min, stdev, count |
| \(v\) component of wind | mean | max, min, stdev, count |
| Surface pressure | mean | max, min, stdev, count |
| Total precipitation | sum | max, min, stdev, count, sum |
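
The coverage weighting described under Computation can be illustrated by hand for a single polygon. The cell values and coverage fractions below are invented, and the variable names are hypothetical; the sketch mirrors the idea of weighting by covered fraction rather than reproducing {exactextractr} internals.

```r
# Values of three grid cells touching a polygon (made-up numbers)
# and the fraction of each cell that the polygon covers.
values   <- c(20, 22, 24)
coverage <- c(1, 0.5, 0.25)

# Mean weighted by covered fraction.
zonal_mean <- sum(values * coverage) / sum(coverage)

# Effective cell count: sum of the covered fractions.
zonal_count <- sum(coverage)

# Sum weighted by covered fraction (relevant for precipitation).
zonal_sum <- sum(values * coverage)
```

A fully covered cell contributes four times as much to the mean as a cell covered by only a quarter, which matters most for small municipalities that intersect few cells.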
