SQLite database conversion to DuckDB and Parquet files

database
sqlite
duckdb
parquet
Published

October 24, 2023

DuckDB is a relatively new database that works in a file, just like SQLite, but is very fast and designed for data science workflows.

I am writing this post to cover the following scenario: you already have a SQLite database and want to convert it to DuckDB, and also export it to a parquet file.

SQLite database

We need a SQLite database example to later convert it to DuckDB. Let’s use the mtcars dataset.

library(dplyr)
library(lubridate)

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

And write mtcars in a SQLite database.

sqlite_database_file <- tempfile()

conn_sqlite <- DBI::dbConnect(
  RSQLite::SQLite(), 
  sqlite_database_file, 
  extended_types = TRUE
)

DBI::dbWriteTable(conn_sqlite, name = "mtcars_table", value = mtcars, overwrite = TRUE)

Let’s take a look.

tbl(conn_sqlite, "mtcars_table") %>% head() %>% collect()
# A tibble: 6 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
Note

Note that this database could be written directly using DuckDB, but this is an example about database conversion.

From SQLite to DuckDB

First, we need to create our DuckDB database.

duckdb_database_file <- tempfile()

conn_duckdb <- DBI::dbConnect(
  duckdb::duckdb(), 
  duckdb_database_file
)

To import our data, we can use a DuckDB extension to read SQLite databases.

DBI::dbExecute(conn_duckdb, "INSTALL sqlite;")
[1] 0
DBI::dbExecute(conn_duckdb, "LOAD sqlite;")
[1] 0
DBI::dbExecute(conn_duckdb, glue::glue("CREATE TABLE mtcars_table AS SELECT * FROM sqlite_scan('{sqlite_database_file}', 'mtcars_table');"))
[1] 32

Great! Now we have the same database on DuckDB.

From DuckDB to Parquet

It is very simple to export a DuckDB table to a parquet file.

parquet_file <- tempfile()

DBI::dbExecute(conn_duckdb, glue::glue("COPY (SELECT * FROM 'mtcars_table') TO '{parquet_file}' (FORMAT 'PARQUET')"))
[1] 32

And that’s it! Let’s close the connections.

DBI::dbDisconnect(conn_sqlite)
DBI::dbDisconnect(conn_duckdb, shutdown = TRUE)

Session info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.9.3 dplyr_1.1.3    

loaded via a namespace (and not attached):
 [1] vctrs_0.6.4       cli_3.6.1         knitr_1.44        rlang_1.1.1      
 [5] xfun_0.40         DBI_1.1.3         purrr_1.0.2       generics_0.1.3   
 [9] jsonlite_1.8.7    bit_4.0.5         glue_1.6.2        dbplyr_2.3.4     
[13] htmltools_0.5.6.1 hms_1.1.3         fansi_1.0.5       rmarkdown_2.25   
[17] evaluate_0.22     tibble_3.2.1      fastmap_1.1.1     yaml_2.3.7       
[21] lifecycle_1.0.3   memoise_2.0.1     duckdb_0.9.1      compiler_4.3.1   
[25] blob_1.2.4        RSQLite_2.3.1     htmlwidgets_1.6.2 timechange_0.2.0 
[29] pkgconfig_2.0.3   rstudioapi_0.15.0 digest_0.6.33     R6_2.5.1         
[33] tidyselect_1.2.0  utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
[37] bit64_4.0.5       tools_4.3.1       cachem_1.0.8     
Back to top