Query local parquet files – Raphael Saldanha

After releasing parquet files with zonal statistics of climate indicators for Brazilian municipalities, I received some inquiries about how to query the files in an efficient way, avoiding loading all data to the memory. This blog post shows some examples about how to do it.

After downloading the desired data, load the following packages on R.

Packages

1library(tidyverse)
2library(arrow)

1: The {tidyverse} load several packages like {dplyr} that will help us to handle the data.
2: The {arrow} package present efficient ways to connect to parquet files.

Access the data

Small datasets

The most straightforward way to read a parquet file is with the read_parquet() command. But this will load all parquet data to the memory. Use this command only with small parquet files (and remember that parquet files are usually compressed).

Large datasets

For large datasets, we can use the open_dataset() command. This will create a virtual link to the parquet file but will not load it to the memory.

temp_mean <- open_dataset(sources = "~/Downloads/2m_temperature_mean.parquet")

print(temp_mean)

FileSystemDataset with 1 Parquet file
code_muni: int64
date: date32[day]
name: string
value: float

With the sources argument, you can point to a specific parquet file or to a folder with several files. Check the help to learn more about this command ?open_dataset.

When we print the temp_mean object, the console shows how many parquet files are linked and the available variables.

Dimension

To check the dimension (number of lines and columns), you can use the dim() command.

dim(temp_mean)

[1] 742164605         4

That is a lot of rows!

Print first rows

To take a look at the file, we can print the first rows.

1temp_mean |>
2  head() |>
3  collect() |>
4  gt::gt()

1: The object that points to the parquet file.
2: R base command to access the first rows of a data frame. You can use head(10) to access more lines if you want.
3: Collect the data from the parquet file to the memory.
4: Present the result in a nice table.

code_muni	date	name	value
1100015	1950-01-01	2m_temperature_mean_mean	298.4606
1100015	1950-01-02	2m_temperature_mean_mean	298.1530
1100015	1950-01-03	2m_temperature_mean_mean	298.6412
1100015	1950-01-04	2m_temperature_mean_mean	297.9100
1100015	1950-01-05	2m_temperature_mean_mean	298.3538
1100015	1950-01-06	2m_temperature_mean_mean	297.4383

Query

Now we are ready to query the data. We can use some dplyr verbs to filter the data we want.

res1 <- temp_mean |>
1  filter(code_muni %in% c(3303401, 3136702, 3304557)) |>
2  filter(year(date) >= 2000) |>
3  filter(name == "2m_temperature_mean_mean") |>
4  collect()

1: Filter some municipalities (Nova Friburgo, Juiz de Fora and Rio de Janeiro IBGE codes).
2: Filter dates where the year is 2000 or greater.
3: Filter the indicator.
4: Collect the data from the parquet to memory.

Let’s inspect the results.

dim(res1)

[1] 25203     4

res1 |> head() |> gt::gt()

code_muni	date	name	value
3136702	2000-01-01	2m_temperature_mean_mean	294.2239
3136702	2000-01-02	2m_temperature_mean_mean	293.3698
3136702	2000-01-03	2m_temperature_mean_mean	293.7604
3136702	2000-01-04	2m_temperature_mean_mean	295.7519
3136702	2000-01-05	2m_temperature_mean_mean	296.5222
3136702	2000-01-06	2m_temperature_mean_mean	296.7838

As you can see, the command collect() does all the magic. Filters and other verbs are translated and executed with the file on disk and only the results are returned.

We can also make some aggregations before loading the data into memory. By example, let’s convert this data from daily to weekly.

res2 <- temp_mean |>
  filter(code_muni %in% c(3303401, 3136702, 3304557)) |>
  filter(year(date) >= 2000) |>
  filter(name == "2m_temperature_mean_mean") |>
1  mutate(date_w = ceiling_date(date, unit = "week")) |>
2  group_by(code_muni, date_w, name) |>
3  summarise(value = mean(value, na.rm = TRUE)) |>
4  ungroup() |>
5  collect()

1: First, we create a date variable with the first date of the week. Take a look on ?ceiling_date to learn more about this function.
2: Now we will group or data by municipality code, week start date and indicator name.
3: Summarize value by the mean of values.
4: Ungroup the data.
5: Collect the data from the parquet file to memory.

dim(res2)

[1] 3603    4

res2 |> head() |> gt::gt()

code_muni	date_w	name	value
3136702	2000-01-02	2m_temperature_mean_mean	294.2239
3136702	2000-01-09	2m_temperature_mean_mean	295.3651
3136702	2000-01-16	2m_temperature_mean_mean	296.8024
3136702	2000-01-23	2m_temperature_mean_mean	295.6669
3136702	2000-01-30	2m_temperature_mean_mean	293.7877
3136702	2000-02-06	2m_temperature_mean_mean	295.9360

And a simple plot.

ggplot(data = res2, aes(x = date_w, y = value, group = code_muni, color = as.character(code_muni))) +
  geom_line(stat = "identity", alpha = .5) +
  theme_bw() +
  theme(legend.position = "bottom", legend.direction = "horizontal") +
  labs(title = "Average temperature", color = "Municipality",
       x = "Week start", y = "Temp (k)")

Bonus: indicator meaning

What the indicator 2m_temperature_mean_mean means?

2m_temperature is the base indicator, it is the atmospheric temperature at 2m height.
_mean (the first mean): we are using the average temperature of the day. It could be the maximum (max) or minimum (min) for example.
_mean (the second mean): we are dealing with zonal statistics. This is the zonal statistic that was applied on all spatial cells that intersects the municipality. We also have the max, min, sd (standard deviation) and count (cells count). For volume indicators there is also the sum.

Usually, for temperature we will use the mean, and for precipitation the sum.

Session info

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en_US:en
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Paris
 date     2024-04-04
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 arrow       * 15.0.1  2024-03-12 [1] CRAN (R 4.3.3)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.3.1)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.1)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.1)
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.2)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
 digest        0.6.35  2024-03-11 [1] CRAN (R 4.3.3)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.2)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.2)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.1)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
 ggplot2     * 3.5.0   2024-02-23 [1] CRAN (R 4.3.2)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.2)
 gt            0.10.1  2024-01-17 [1] CRAN (R 4.3.2)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.1)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.1)
 htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.3.3)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.2)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.2)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
 labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.1)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
 munsell       0.5.1   2024-04-01 [1] CRAN (R 4.3.3)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
 readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.2)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.2)
 rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.3.3)
 rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.3.3)
 sass          0.4.9   2024-03-15 [1] CRAN (R 4.3.3)
 scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.2)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.2)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.2)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
 tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.3.2)
 tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.3.3)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.1)
 timechange    0.3.0   2024-01-18 [1] CRAN (R 4.3.2)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.1)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.2)
 withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.2)
 xfun          0.43    2024-03-25 [1] CRAN (R 4.3.3)
 xml2          1.3.6   2023-12-04 [1] CRAN (R 4.3.2)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.2)

 [1] /home/raphael/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────