Query remote parquet files with DuckDB

database
duckdb
parquet
Published

October 24, 2023

DuckDB has a very interesting extension called httpfs that allows to query CSV and parquet files remotely, including S3 storage.

I will gave here an example about how to query a parquet file stored in a GitHub repository release.

Remote parquet file

First, I recommend to create an object with the URL of the parquet file. On the example bellow, the parquet file contains the nycflights13 flights database.

parquet_url <- "https://github.com/rfsaldanha/releases/releases/download/v1/flights.parquet"

This is a real URL. If you try to open it with your browser, a download will start.

DuckDB database on memory

Next, we create a DuckDB connection. Considering that the data is stored remotely, we can create this DuckDB database on memory and not on disk.

conn <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = ":memory:"
)

httpfs extension

Now, we need to load the httpfs extension.

DBI::dbExecute(conn, "INSTALL httpfs;")
[1] 0
DBI::dbExecute(conn, "LOAD httpfs;")
[1] 0
Windows users

The last time I checked, the httpfs extension was not working on Windows.

Query

We are ready to execute a query over the parquet file!

res <- DBI::dbGetQuery(
  conn, 
  glue::glue("SELECT carrier, flight, tailnum, year FROM '{parquet_url}' WHERE year = 2013 LIMIT 100")
)

dplyr::glimpse(res)
Rows: 100
Columns: 4
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B…
$ flight  <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N…
$ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20…

This query selects some variables and filter the year, returning the first 100 rows. The query is carried out by DuckDB accessing the remote parquet file, but the file is not downloaded. That’s great!!

Tip

Queries that needs more data and return more rows takes longer to run, especially transmitting data over the Internet. Craft carefully your queries with this in mind.

If you want to use dplyr verbs with the connection, you can create a view query.

DBI::dbExecute(conn, glue::glue("CREATE VIEW flights AS SELECT * FROM PARQUET_SCAN('{parquet_url}')"))
[1] 0
DBI::dbListTables(conn)
[1] "flights"
library(dplyr)

tbl(conn, "flights") %>%
  group_by(month) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  collect()
# A tibble: 12 × 2
   month  freq
   <int> <dbl>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135

Now we can close the connection.

DBI::dbDisconnect(conn, shutdown = TRUE)

Extra: duckdbfs package

The duckdbfs offers a neat way to connect to remote parquet files and other connections. The same example using the package:

ds <- duckdbfs::open_dataset(parquet_url)
ds %>%
  group_by(month) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  collect()
# A tibble: 12 × 2
   month  freq
   <int> <dbl>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135

Session info

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.3

loaded via a namespace (and not attached):
 [1] vctrs_0.6.4       cli_3.6.1         knitr_1.45        rlang_1.1.2      
 [5] xfun_0.41         DBI_1.1.3         purrr_1.0.2       duckdbfs_0.0.3   
 [9] generics_0.1.3    jsonlite_1.8.7    glue_1.6.2        dbplyr_2.4.0     
[13] htmltools_0.5.7   fansi_1.0.5       rmarkdown_2.25    evaluate_0.23    
[17] tibble_3.2.1      fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3  
[21] duckdb_0.9.1-1    compiler_4.3.2    blob_1.2.4        htmlwidgets_1.6.2
[25] pkgconfig_2.0.3   rstudioapi_0.15.0 digest_0.6.33     R6_2.5.1         
[29] tidyselect_1.2.0  utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
[33] tools_4.3.2      
Back to top