<- "https://github.com/rfsaldanha/releases/releases/download/v1/flights.parquet" parquet_url
DuckDB has a very interesting extension called httpfs that allows to query CSV and parquet files remotely, including S3 storage.
I will gave here an example about how to query a parquet file stored in a GitHub repository release.
Remote parquet file
First, I recommend to create an object with the URL of the parquet file. On the example bellow, the parquet file contains the nycflights13
flights database.
This is a real URL. If you try to open it with your browser, a download will start.
DuckDB database on memory
Next, we create a DuckDB connection. Considering that the data is stored remotely, we can create this DuckDB database on memory and not on disk.
<- DBI::dbConnect(
conn ::duckdb(),
duckdbdbdir = ":memory:"
)
httpfs extension
Now, we need to load the httpfs
extension.
::dbExecute(conn, "INSTALL httpfs;") DBI
[1] 0
::dbExecute(conn, "LOAD httpfs;") DBI
[1] 0
The last time I checked, the httpfs
extension was not working on Windows.
Query
We are ready to execute a query over the parquet file!
<- DBI::dbGetQuery(
res
conn, ::glue("SELECT carrier, flight, tailnum, year FROM '{parquet_url}' WHERE year = 2013 LIMIT 100")
glue
)
::glimpse(res) dplyr
Rows: 100
Columns: 4
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N…
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20…
This query selects some variables and filter the year, returning the first 100 rows. The query is carried out by DuckDB accessing the remote parquet file, but the file is not downloaded. That’s great!!
Queries that needs more data and return more rows takes longer to run, especially transmitting data over the Internet. Craft carefully your queries with this in mind.
If you want to use dplyr
verbs with the connection, you can create a view query.
::dbExecute(conn, glue::glue("CREATE VIEW flights AS SELECT * FROM PARQUET_SCAN('{parquet_url}')")) DBI
[1] 0
::dbListTables(conn) DBI
[1] "flights"
library(dplyr)
tbl(conn, "flights") %>%
group_by(month) %>%
summarise(freq = n()) %>%
ungroup() %>%
collect()
# A tibble: 12 × 2
month freq
<int> <dbl>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
Now we can close the connection.
::dbDisconnect(conn, shutdown = TRUE) DBI
Extra: duckdbfs
package
The duckdbfs offers a neat way to connect to remote parquet files and other connections. The same example using the package:
<- duckdbfs::open_dataset(parquet_url) ds
%>%
ds group_by(month) %>%
summarise(freq = n()) %>%
ungroup() %>%
collect()
# A tibble: 12 × 2
month freq
<int> <dbl>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
Session info
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.3
loaded via a namespace (and not attached):
[1] vctrs_0.6.4 cli_3.6.1 knitr_1.45 rlang_1.1.2
[5] xfun_0.41 DBI_1.1.3 purrr_1.0.2 duckdbfs_0.0.3
[9] generics_0.1.3 jsonlite_1.8.7 glue_1.6.2 dbplyr_2.4.0
[13] htmltools_0.5.7 fansi_1.0.5 rmarkdown_2.25 evaluate_0.23
[17] tibble_3.2.1 fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.3
[21] duckdb_0.9.1-1 compiler_4.3.2 blob_1.2.4 htmlwidgets_1.6.2
[25] pkgconfig_2.0.3 rstudioapi_0.15.0 digest_0.6.33 R6_2.5.1
[29] tidyselect_1.2.0 utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3
[33] tools_4.3.2