This is a general-use tutorial for the package. The showProgress argument is set to FALSE in the examples to reduce clutter in the output.

Package installation and loading

To install the package, use the remotes package:

remotes::install_github("rfsaldanha/opendenguedata")

After installation, load the package. We will also use functions from the dplyr package.

library(opendenguedata)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Reading data

The OpenDengue project provides dengue data in three different extracts: national, spatial and temporal. Use the read_data function to download and read data from one of these extracts.

res <- read_data(extract = "temporal", as_data_frame = TRUE, showProgress = FALSE)

dplyr::glimpse(res)
#> Rows: 2,484,356
#> Columns: 16
#> $ adm_0_name                   <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ adm_1_name                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ adm_2_name                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ full_name                    <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ ISO_A0                       <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ FAO_GAUL_code                <int> 1011446, 1011446, 1011446, 1011446, 10114…
#> $ RNE_iso_code                 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ IBGE_code                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ calendar_start_date          <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021…
#> $ calendar_end_date            <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021…
#> $ Year                         <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
#> $ dengue_total                 <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99…
#> $ case_definition_standardised <chr> "Suspected", "Suspected", "Suspected", "S…
#> $ S_res                        <chr> "Admin0", "Admin0", "Admin0", "Admin0", "…
#> $ T_res                        <chr> "Week", "Week", "Week", "Week", "Week", "…
#> $ UUID                         <chr> "WHOEMRO-ALL-2021-Y01-05", "WHOEMRO-ALL-2…

The as_data_frame argument

The as_data_frame argument of the read_data function defaults to FALSE. This means that the package will return an Arrow Table object. The advantage of this format is that the data is not loaded directly into the computer's memory after download. This is particularly useful when the dataset is too big to fit in memory.

We recommend setting this argument to FALSE so that you can filter rows and apply other dplyr verbs before loading the data into memory.

res <- read_data(extract = "temporal", as_data_frame = FALSE, showProgress = FALSE) %>%
  filter(Year == 2021) %>%
  collect()

glimpse(res)
#> Rows: 214,136
#> Columns: 16
#> $ adm_0_name                   <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ adm_1_name                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ adm_2_name                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ full_name                    <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ ISO_A0                       <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ FAO_GAUL_code                <int> 1011446, 1011446, 1011446, 1011446, 10114…
#> $ RNE_iso_code                 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ IBGE_code                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ calendar_start_date          <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021…
#> $ calendar_end_date            <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021…
#> $ Year                         <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
#> $ dengue_total                 <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99…
#> $ case_definition_standardised <chr> "Suspected", "Suspected", "Suspected", "S…
#> $ S_res                        <chr> "Admin0", "Admin0", "Admin0", "Admin0", "…
#> $ T_res                        <chr> "Week", "Week", "Week", "Week", "Week", "…
#> $ UUID                         <chr> "WHOEMRO-ALL-2021-Y01-05", "WHOEMRO-ALL-2…

The collect function prepares the query, sends it to the Arrow Table object, and collects the results, returning a tibble.
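
The filter-then-collect pattern can be tried on a small in-memory table without downloading anything. The sketch below assumes the arrow package is installed; the toy data frame and its values are invented for illustration and are not OpenDengue data.

```r
library(arrow)
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  Year = c(2020L, 2021L, 2021L),
  dengue_total = c(10, 20, 30)
)

# Convert to an Arrow Table, the kind of object the text says
# read_data() returns when as_data_frame = FALSE
tbl <- arrow::arrow_table(toy)

# filter() only builds a query here; no rows are materialised yet
query <- tbl %>% filter(Year == 2021)

# collect() runs the query and returns a tibble
res_toy <- collect(query)
```

Only the rows that pass the filter are converted to an R data frame, which is why filtering before collect keeps memory use low on large extracts.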

The columns argument

A vector of column names may be supplied to select a subset of the dataset's columns.

res <- opendenguedata::read_data(
  extract = "spatial",
  as_data_frame = TRUE,
  showProgress = FALSE,
  columns = c("full_name", "calendar_start_date",
              "calendar_end_date", "dengue_total")
)

dplyr::glimpse(res)
#> Rows: 2,476,894
#> Columns: 4
#> $ full_name           <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGH…
#> $ calendar_start_date <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021-01-24, 2…
#> $ calendar_end_date   <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021-01-30, 2…
#> $ dengue_total        <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99, 112, 70…

The cache argument

The read_data function downloads data from the OpenDengue project repository. To avoid repeated downloads, the cache argument stores the downloaded data in a temporary folder on the computer. After the first download, the file is cached in this folder and reused on subsequent calls to the function.

To force a new download and refresh the package cache, set this argument to FALSE.

res <- opendenguedata::read_data(extract = "spatial", cache = FALSE)
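
The general idea behind this kind of caching can be sketched in base R. The helper below is hypothetical and is not the package's implementation: cached_fetch, fake_download and the folder name are assumptions made for illustration, and the "download" just writes a placeholder file so the example runs offline.

```r
# Hypothetical helper, NOT part of opendenguedata: fetch a file once and
# reuse the copy on disk unless cache = FALSE.
cached_fetch <- function(name, fetch, cache = TRUE, cache_dir = tempdir()) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, name)
  # Fetch when caching is disabled or no cached copy exists yet
  if (!cache || !file.exists(path)) {
    fetch(path)
  }
  path
}

# Stand-in for a real download: counts calls and writes a placeholder file
n_fetches <- 0
fake_download <- function(path) {
  n_fetches <<- n_fetches + 1
  writeLines("placeholder content", path)
}

# A fresh folder so the demonstration starts with an empty cache
demo_dir <- tempfile("cache_sketch_")

# First call fetches the file
p1 <- cached_fetch("extract.txt", fake_download, cache_dir = demo_dir)
# Second call reuses the cached file
p2 <- cached_fetch("extract.txt", fake_download, cache_dir = demo_dir)
# cache = FALSE forces a new fetch
p3 <- cached_fetch("extract.txt", fake_download, cache = FALSE, cache_dir = demo_dir)
```

After these three calls the stand-in download has run twice: once to populate the cache and once for the forced refresh.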