Some tips to work with SQLite database

database
sqlite
Published

October 19, 2023

Databases are very useful for handling large-than-memory datasets, a common problem in Data Science. Several database engines work very well with R and Posit has a nice guide overview of them.

SQLite is a very popular engine due its simplicity. You do not need to install a database server on your environment because SQLite stores the database in a simple single file that you can modify, copy, store at Google Drive etc.

I have been using it for some time and collected some practical tips for some practical situations.

First, lets have some data to use at the examples.

library(dplyr)
library(lubridate)
library(nycflights13)

flights <- flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_date(year, month, day))

glimpse(flights)
Rows: 336,776
Columns: 6
$ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ hour      <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, …
$ minute    <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, …
$ departure <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01,…

I specifically created a variable called departure to have a date type on our example dataset.

Dates

Natively, SQLite databases does not handle dates and this may be difficult in the beginning. Let’s explore some options to handle dates with SQLite.

First, we need to create a database in a temporary file using two packages: DBI and RSQLite.

library(DBI)
library(RSQLite)

database_file <- tempfile()
conn <- dbConnect(RSQLite::SQLite(), database_file)

Observe that the conn object is NOT the database, is just a connection instruction to the database stored in a file.

Lets write the flights tibble to the database.

dbWriteTable(conn, name = "flights_table", value = flights)

Now, let’s take a look how the tibble was stored.

tbl(conn, "flights_table") %>%
  head() %>% 
  collect()
# A tibble: 6 × 6
   year month   day  hour minute departure
  <int> <int> <int> <dbl>  <dbl>     <dbl>
1  2013     1     1     5     15     15706
2  2013     1     1     5     29     15706
3  2013     1     1     5     40     15706
4  2013     1     1     5     45     15706
5  2013     1     1     6      0     15706
6  2013     1     1     5     58     15706

The departure variable is no more a human-readable date, it is now a integer value in Unix time. That means: “the number of seconds since 1970-01-01 00:00:00 UTC”. Now very practical…

Thus, there as two options for this: you may convert your date variable to a string variable (as.character(...)) or use an argument called extended_types with the dbConnect command.

If you store the date variable as string, you will need to reconvert it to date wherever your collect data from your database, what is not very practical. Let’s see how the extented_types option works.

First, let’s close our connection.

dbDisconnect(conn)

We will modify our connection using this argument.

conn <- dbConnect(RSQLite::SQLite(), database_file, extended_types = TRUE)

And now we will overwrite the data at the same table.

dbWriteTable(conn, name = "flights_table", value = flights, overwrite = TRUE)

Let’s see the result.

tbl(conn, "flights_table") %>% head() %>% collect()
# A tibble: 6 × 6
   year month   day  hour minute departure 
  <int> <int> <int> <dbl>  <dbl> <date>    
1  2013     1     1     5     15 2013-01-01
2  2013     1     1     5     29 2013-01-01
3  2013     1     1     5     40 2013-01-01
4  2013     1     1     5     45 2013-01-01
5  2013     1     1     6      0 2013-01-01
6  2013     1     1     5     58 2013-01-01

Voilà! Now we can now see human-readable dates with a UNIX time variable.

Parallel write

One nice thing about databases is parallel writing. Imagine a function being executed in parallel and writing the results at the same database and even at the same table.

There are some nice tricks to allow it with SQLite. Basically, those are the options that I use.

conn <- DBI::dbConnect(RSQLite::SQLite(), database_file, extended_types = TRUE, synchronous = NULL)
dbExecute(conn, "PRAGMA busy_timeout = 5000")
[1] 0
dbExecute(conn, "BEGIN IMMEDIATE TRANSACTION")
[1] 0
dbWriteTable(conn = conn, name = "flights_table", value = flights, append = TRUE)
dbExecute(conn, "COMMIT TRANSACTION")
[1] 0
dbDisconnect(conn)

These options will secure that your connection waits other connections to conclude and immediately commit the transaction to the database.

Delete table

One odd thing, after you delete a table in a database you need to vacuum it to get the free space.

Lets delete the file database and do some testing.

unlink(database_file)

We will create the database with two equal tables.

conn <- dbConnect(RSQLite::SQLite(), database_file, extended_types = TRUE)
dbWriteTable(conn = conn, name = "flights_table_1", value = flights)
dbWriteTable(conn = conn, name = "flights_table_2", value = flights)
dbDisconnect(conn)

What’s is the size of the file?

fs::file_size(database_file)
13.4M

Now, lets delete one of the tables.

conn <- dbConnect(RSQLite::SQLite(), database_file, extended_types = TRUE)
dbRemoveTable(conn = conn, name = "flights_table_1")
dbDisconnect(conn)

fs::file_size(database_file)
13.4M

Same size… so, lets vacuum it!

conn <- dbConnect(RSQLite::SQLite(), database_file, extended_types = TRUE)
dbExecute(conn, "VACUUM;")
[1] 0
dbDisconnect(conn)

fs::file_size(database_file)
6.69M

And we have a smaller size!

Session info

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.3.1      DBI_1.1.3          nycflights13_1.0.2 lubridate_1.9.3   
[5] dplyr_1.1.3       

loaded via a namespace (and not attached):
 [1] pillar_1.9.0      compiler_4.2.0    dbplyr_2.3.4      tools_4.2.0      
 [5] digest_0.6.33     bit_4.0.5         jsonlite_1.8.7    evaluate_0.22    
 [9] memoise_2.0.1     lifecycle_1.0.3   tibble_3.2.1      timechange_0.2.0 
[13] pkgconfig_2.0.3   rlang_1.1.1       cli_3.6.1         rstudioapi_0.15.0
[17] yaml_2.3.7        xfun_0.40         fastmap_1.1.1     withr_2.5.1      
[21] knitr_1.44        fs_1.6.3          hms_1.1.3         generics_0.1.3   
[25] vctrs_0.6.4       htmlwidgets_1.6.2 bit64_4.0.5       tidyselect_1.2.0 
[29] glue_1.6.2        R6_2.5.1          fansi_1.0.5       rmarkdown_2.25   
[33] purrr_1.0.2       blob_1.2.4        magrittr_2.0.3    htmltools_0.5.6.1
[37] utf8_1.2.3        cachem_1.0.8     
Back to top