- 1
-
The
{tidyverse}
load several packages like{dplyr}
that will help us to handle the data. - 2
-
The
{arrow}
package present efficient ways to connect to parquet files.
After releasing parquet files with zonal statistics of climate indicators for Brazilian municipalities, I received some inquiries about how to query the files in an efficient way, avoiding loading all data to the memory. This blog post shows some examples about how to do it.
After downloading the desired data, load the following packages on R.
Packages
Access the data
Small datasets
The most straightforward way to read a parquet file is with the read_parquet()
command. But this will load all parquet data to the memory. Use this command only with small parquet files (and remember that parquet files are usually compressed).
Large datasets
For large datasets, we can use the open_dataset()
command. This will create a virtual link to the parquet file but will not load it to the memory.
<- open_dataset(sources = "~/Downloads/2m_temperature_mean.parquet")
temp_mean
print(temp_mean)
FileSystemDataset with 1 Parquet file
code_muni: int64
date: date32[day]
name: string
value: float
With the sources
argument, you can point to a specific parquet file or to a folder with several files. Check the help to learn more about this command ?open_dataset
.
When we print the temp_mean
object, the console shows how many parquet files are linked and the available variables.
Dimension
To check the dimension (number of lines and columns), you can use the dim()
command.
dim(temp_mean)
[1] 742164605 4
That is a lot of rows!
Print first rows
To take a look at the file, we can print the first rows.
- 1
- The object that points to the parquet file.
- 2
-
R base command to access the first rows of a data frame. You can use
head(10)
to access more lines if you want. - 3
- Collect the data from the parquet file to the memory.
- 4
- Present the result in a nice table.
code_muni | date | name | value |
---|---|---|---|
1100015 | 1950-01-01 | 2m_temperature_mean_mean | 298.4606 |
1100015 | 1950-01-02 | 2m_temperature_mean_mean | 298.1530 |
1100015 | 1950-01-03 | 2m_temperature_mean_mean | 298.6412 |
1100015 | 1950-01-04 | 2m_temperature_mean_mean | 297.9100 |
1100015 | 1950-01-05 | 2m_temperature_mean_mean | 298.3538 |
1100015 | 1950-01-06 | 2m_temperature_mean_mean | 297.4383 |
Query
Now we are ready to query the data. We can use some dplyr
verbs to filter the data we want.
<- temp_mean |>
res1 1filter(code_muni %in% c(3303401, 3136702, 3304557)) |>
2filter(year(date) >= 2000) |>
3filter(name == "2m_temperature_mean_mean") |>
4collect()
- 1
- Filter some municipalities (Nova Friburgo, Juiz de Fora and Rio de Janeiro IBGE codes).
- 2
- Filter dates where the year is 2000 or greater.
- 3
- Filter the indicator.
- 4
- Collect the data from the parquet to memory.
Let’s inspect the results.
dim(res1)
[1] 25203 4
|> head() |> gt::gt() res1
code_muni | date | name | value |
---|---|---|---|
3136702 | 2000-01-01 | 2m_temperature_mean_mean | 294.2239 |
3136702 | 2000-01-02 | 2m_temperature_mean_mean | 293.3698 |
3136702 | 2000-01-03 | 2m_temperature_mean_mean | 293.7604 |
3136702 | 2000-01-04 | 2m_temperature_mean_mean | 295.7519 |
3136702 | 2000-01-05 | 2m_temperature_mean_mean | 296.5222 |
3136702 | 2000-01-06 | 2m_temperature_mean_mean | 296.7838 |
As you can see, the command collect()
does all the magic. Filters and other verbs are translated and executed with the file on disk and only the results are returned.
We can also make some aggregations before loading the data into memory. By example, let’s convert this data from daily to weekly.
<- temp_mean |>
res2 filter(code_muni %in% c(3303401, 3136702, 3304557)) |>
filter(year(date) >= 2000) |>
filter(name == "2m_temperature_mean_mean") |>
1mutate(date_w = ceiling_date(date, unit = "week")) |>
2group_by(code_muni, date_w, name) |>
3summarise(value = mean(value, na.rm = TRUE)) |>
4ungroup() |>
5collect()
- 1
-
First, we create a date variable with the first date of the week. Take a look on
?ceiling_date
to learn more about this function. - 2
- Now we will group or data by municipality code, week start date and indicator name.
- 3
- Summarize value by the mean of values.
- 4
- Ungroup the data.
- 5
- Collect the data from the parquet file to memory.
dim(res2)
[1] 3603 4
|> head() |> gt::gt() res2
code_muni | date_w | name | value |
---|---|---|---|
3136702 | 2000-01-02 | 2m_temperature_mean_mean | 294.2239 |
3136702 | 2000-01-09 | 2m_temperature_mean_mean | 295.3651 |
3136702 | 2000-01-16 | 2m_temperature_mean_mean | 296.8024 |
3136702 | 2000-01-23 | 2m_temperature_mean_mean | 295.6669 |
3136702 | 2000-01-30 | 2m_temperature_mean_mean | 293.7877 |
3136702 | 2000-02-06 | 2m_temperature_mean_mean | 295.9360 |
And a simple plot.
ggplot(data = res2, aes(x = date_w, y = value, group = code_muni, color = as.character(code_muni))) +
geom_line(stat = "identity", alpha = .5) +
theme_bw() +
theme(legend.position = "bottom", legend.direction = "horizontal") +
labs(title = "Average temperature", color = "Municipality",
x = "Week start", y = "Temp (k)")
Bonus: indicator meaning
What the indicator 2m_temperature_mean_mean means?
2m_temperature
is the base indicator, it is the atmospheric temperature at 2m height._mean
(the first mean): we are using the average temperature of the day. It could be the maximum (max
) or minimum (min
) for example._mean
(the second mean): we are dealing with zonal statistics. This is the zonal statistic that was applied on all spatial cells that intersects the municipality. We also have themax
,min
,sd
(standard deviation) andcount
(cells count). For volume indicators there is also thesum
.
Usually, for temperature we will use the mean
, and for precipitation the sum
.
Session info
::session_info() sessioninfo
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.3 (2024-02-29)
os Ubuntu 22.04.4 LTS
system x86_64, linux-gnu
ui X11
language en_US:en
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2024-04-04
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow * 15.0.1 2024-03-12 [1] CRAN (R 4.3.3)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.1)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.1)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.1)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.3)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.1)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1)
ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.2)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2)
gt 0.10.1 2024-01-17 [1] CRAN (R 4.3.2)
gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.1)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.3.3)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.2)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.2)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.3)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.1)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.2)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2)
rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.3)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.3)
sass 0.4.9 2024-03-15 [1] CRAN (R 4.3.3)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.1)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.1)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2)
xfun 0.43 2024-03-25 [1] CRAN (R 4.3.3)
xml2 1.3.6 2023-12-04 [1] CRAN (R 4.3.2)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
[1] /home/raphael/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────