Databases are very useful for handling large-than-memory datasets, a common problem in Data Science. Several database engines work very well with R and Posit has a nice guide overview of them.
SQLite is a very popular engine due its simplicity. You do not need to install a database server on your environment because SQLite stores the database in a simple single file that you can modify, copy, store at Google Drive etc.
I have been using it for some time and collected some practical tips for some practical situations.
First, lets have some data to use at the examples.
The departure variable is no more a human-readable date, it is now a integer value in Unix time. That means: “the number of seconds since 1970-01-01 00:00:00 UTC”. Now very practical…
Thus, there as two options for this: you may convert your date variable to a string variable (as.character(...)) or use an argument called extended_types with the dbConnect command.
If you store the date variable as string, you will need to reconvert it to date wherever your collect data from your database, what is not very practical. Let’s see how the extented_types option works.
First, let’s close our connection.
dbDisconnect(conn)
We will modify our connection using this argument.
Voilà! Now we can now see human-readable dates with a UNIX time variable.
Parallel write
One nice thing about databases is parallel writing. Imagine a function being executed in parallel and writing the results at the same database and even at the same table.
There are some nice tricks to allow it with SQLite. Basically, those are the options that I use.
dbWriteTable(conn = conn, name ="flights_table", value = flights, append =TRUE)dbExecute(conn, "COMMIT TRANSACTION")
[1] 0
dbDisconnect(conn)
These options will secure that your connection waits other connections to conclude and immediately commit the transaction to the database.
Delete table
One odd thing, after you delete a table in a database you need to vacuum it to get the free space.
Lets delete the file database and do some testing.
unlink(database_file)
We will create the database with two equal tables.
conn <-dbConnect(RSQLite::SQLite(), database_file, extended_types =TRUE)dbWriteTable(conn = conn, name ="flights_table_1", value = flights)dbWriteTable(conn = conn, name ="flights_table_2", value = flights)dbDisconnect(conn)
What’s is the size of the file?
fs::file_size(database_file)
13.4M
Now, lets delete one of the tables.
conn <-dbConnect(RSQLite::SQLite(), database_file, extended_types =TRUE)dbRemoveTable(conn = conn, name ="flights_table_1")dbDisconnect(conn)fs::file_size(database_file)