Package general use tutorial
This is a general-use tutorial for the package. The showProgress argument is set to FALSE in the examples to reduce clutter in the output.
Package installation and loading
For the package installation, use the remotes package:
remotes::install_github("rfsaldanha/opendenguedata")
After installation, load the package. We will also use functions from the dplyr package.
library(opendenguedata)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Reading data
The OpenDengueData project provides dengue data in three different
extracts: national, spatial and temporal. Use the read_data
function to download and read one of these extracts.
res <- read_data(extract = "temporal", as_data_frame = TRUE, showProgress = FALSE)
dplyr::glimpse(res)
#> Rows: 2,484,356
#> Columns: 16
#> $ adm_0_name <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ adm_1_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ adm_2_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ full_name <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ ISO_A0 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ FAO_GAUL_code <int> 1011446, 1011446, 1011446, 1011446, 10114…
#> $ RNE_iso_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ IBGE_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ calendar_start_date <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021…
#> $ calendar_end_date <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021…
#> $ Year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
#> $ dengue_total <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99…
#> $ case_definition_standardised <chr> "Suspected", "Suspected", "Suspected", "S…
#> $ S_res <chr> "Admin0", "Admin0", "Admin0", "Admin0", "…
#> $ T_res <chr> "Week", "Week", "Week", "Week", "Week", "…
#> $ UUID <chr> "WHOEMRO-ALL-2021-Y01-05", "WHOEMRO-ALL-2…
The as_data_frame argument
The as_data_frame argument of the read_data function defaults to FALSE. This means that the package
returns an Arrow Table object. The advantage of this format is that
the data is not loaded into the computer's memory right after download.
This is particularly useful when the dataset is too big to fit in
memory.
It is recommended to keep this argument set to FALSE so that you can
filter rows and apply other dplyr verbs before loading the
data into memory.
res <- read_data(extract = "temporal", as_data_frame = FALSE, showProgress = FALSE) %>%
  filter(Year == 2021) %>%
  collect()
glimpse(res)
#> Rows: 214,136
#> Columns: 16
#> $ adm_0_name <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ adm_1_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ adm_2_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ full_name <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTA…
#> $ ISO_A0 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ FAO_GAUL_code <int> 1011446, 1011446, 1011446, 1011446, 10114…
#> $ RNE_iso_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
#> $ IBGE_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ calendar_start_date <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021…
#> $ calendar_end_date <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021…
#> $ Year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
#> $ dengue_total <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99…
#> $ case_definition_standardised <chr> "Suspected", "Suspected", "Suspected", "S…
#> $ S_res <chr> "Admin0", "Admin0", "Admin0", "Admin0", "…
#> $ T_res <chr> "Week", "Week", "Week", "Week", "Week", "…
#> $ UUID <chr> "WHOEMRO-ALL-2021-Y01-05", "WHOEMRO-ALL-2…
The collect function sends the query to the Arrow Table object and
collects its results, returning a tibble object.
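The lazy behaviour described above can be tried without downloading anything. The sketch below is only an illustration: it assumes the arrow and dplyr packages are installed, and it uses a small in-memory Arrow Table with made-up values (the column names are borrowed from the output above) in place of a real extract.

```r
library(arrow)
library(dplyr)

# Small in-memory Arrow Table standing in for a downloaded extract
# (made-up values, for illustration only)
toy <- arrow_table(
  adm_0_name = c("AFGHANISTAN", "AFGHANISTAN", "BRAZIL"),
  Year = c(2020L, 2021L, 2021L),
  dengue_total = c(10, 20, 30)
)

# dplyr verbs on an Arrow Table only build a query; no rows are returned yet
query <- toy %>%
  filter(Year == 2021) %>%
  select(adm_0_name, dengue_total)
class(query)

# collect() runs the query and returns the result as a tibble in memory
toy_res <- collect(query)
toy_res
```

Only the rows and columns that survive the query end up in the tibble, which is why filtering before collect keeps memory use low.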
The columns argument
A vector of column names may be supplied to subset the dataset.
res <- opendenguedata::read_data(extract = "spatial", as_data_frame = TRUE,
                                 showProgress = FALSE,
                                 columns = c("full_name", "calendar_start_date",
                                             "calendar_end_date", "dengue_total"))
dplyr::glimpse(res)
#> Rows: 2,476,894
#> Columns: 4
#> $ full_name <chr> "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGH…
#> $ calendar_start_date <date> 2021-01-03, 2021-01-10, 2021-01-17, 2021-01-24, 2…
#> $ calendar_end_date <date> 2021-01-09, 2021-01-16, 2021-01-23, 2021-01-30, 2…
#> $ dengue_total <dbl> 101, 151, 201, 202, 100, 251, 101, 61, 99, 112, 70…
The cache argument
The read_data function downloads data from the OpenDengue
project repository. To avoid repeated downloads, the cache argument creates a temporary folder on the
computer to store the downloaded data. After the first download, the file is
cached in this folder and reused on subsequent runs of the function.
To force a new download and refresh the package cache, set this
argument to FALSE.
res <- opendenguedata::read_data(extract = "spatial", cache = FALSE)
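To make the caching idea concrete, here is a minimal sketch of a generic download cache in base R. It is not the package's implementation: cached_fetch and fake_download are hypothetical helpers written for this illustration, and the "download" only writes a small file locally.

```r
# Hypothetical helper, not part of opendenguedata: a generic file cache.
cached_fetch <- function(name, fetch, cache = TRUE, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, name)
  # Reuse the cached file only when caching is on and the file already exists
  if (!cache || !file.exists(path)) {
    fetch(path)
  }
  path
}

# Stand-in for a real download: writes a tiny file and counts its calls
n_downloads <- 0
fake_download <- function(path) {
  n_downloads <<- n_downloads + 1
  writeLines("full_name,dengue_total", path)
}

# Fresh temporary folder used as the cache for this session
cache_dir <- tempfile("example_cache_")

p1 <- cached_fetch("spatial.csv", fake_download, cache_dir = cache_dir)  # first run: downloads
p2 <- cached_fetch("spatial.csv", fake_download, cache_dir = cache_dir)  # second run: reuses the file
p3 <- cached_fetch("spatial.csv", fake_download, cache = FALSE,
                   cache_dir = cache_dir)                                # forces a new download
n_downloads
```

In this sketch, the second call skips the download because the file is already in the cache folder, while cache = FALSE overwrites the cached file with a fresh copy.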