Interacting with Data using the filehash Package for R
Abstract
The filehash package for R implements a simple key-value style database where character string keys
are associated with data values that are stored on the disk. A simple interface is provided for inserting,
retrieving, and deleting data from the database. Utilities are provided that allow filehash databases to be
treated much like environments and lists are in R, in order to encourage interactive and exploratory
analysis of large datasets. Two different file formats for representing the database are currently available,
and new formats can easily be incorporated by third parties for use in the filehash framework.
1 Overview
While the package itself does not include a specific implementation, some examples are provided on the
package’s website.
The filehash package provides a full read-write implementation of a key-value database for R. The pack-
age does not depend on any external packages (beyond those provided in a standard R installation) or
software systems and is written entirely in R, making it readily usable on most platforms. The filehash
package can be thought of as a specific implementation of the database concept described in [Cha91], taking
a slightly different approach to the problem. Both [Tem02] and [Cha91] focus on generalizing the notion of
“attach()-ing” a database in an R/S session so that variable names can be looked up automatically via the
search list. The filehash package represents a database as an instance of an S4 class and operates directly
on the S4 object via various methods.
Key-value databases are sometimes called hash tables and indeed, the name of the package comes from
the idea of having a “file-based hash table”. With filehash the values are stored in a file on the disk rather
than in memory. When a user requests the values associated with a key, filehash finds the object on the disk,
loads the value into R and returns it to the user. The package offers two formats for storing data on the disk:
The values can be stored (1) concatenated together in a single file or (2) separately as a directory of files.
2 Related R packages
There are other packages on CRAN designed specifically to help users work with large datasets. Two packages
that come immediately to mind are the g.data package by David Brahm [Bra02] and the biglm package
by Thomas Lumley. The g.data package takes advantage of the lazy evaluation mechanism in R via the
delayedAssign function. Briefly, objects are loaded into R as promises to load the actual data associated
with an object name. The first time an object is requested, the promise is evaluated and the data are loaded.
From then on, the data reside in memory. The mechanism used in g.data is similar to the one used by the
lazy-loaded databases described in [Rip04]. The biglm package allows users to fit linear models on datasets
that are too large to fit in memory. However, the biglm package does not provide methods for dealing with
large datasets in general. The filehash package also draws inspiration from Luke Tierney’s experimental
gdbm package which implements a key-value database via the GNU dbm (GDBM) library. The use of GDBM
creates an external dependence since the GDBM C library has to be compiled on each system. In addition,
I encountered a problem where databases created on 32-bit machines could not be transferred to and read
on 64-bit machines (and vice versa). However, as 64-bit machines become increasingly common, this
problem should eventually disappear.
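The promise mechanism that g.data relies on can be sketched directly with delayedAssign (the object name and the expression below are just for illustration):

```r
## The expression is not evaluated at assignment time; it is stored as a
## promise and only forced the first time 'x' is accessed.
delayedAssign("x", {
    message("loading data now")
    rnorm(100)
})

length(x)  ## forces the promise; prints "loading data now"
length(x)  ## no message this time: the data now reside in memory
```

After the first access, the data stay in memory, which is exactly the behavior described above.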
The R Special Interest Group on Databases has developed a number of packages that provide an R
interface to commonly used relational database management systems (RDBMS) such as MySQL (RMySQL),
PostgreSQL (RPgSQL), and Oracle (ROracle). These packages use the S4 classes and generics defined in the
DBI package and have the advantage that they offer much better database functionality, inherited via the use
of a true database management system. However, this benefit comes with the cost of having to install and
use third-party software. While installing an RDBMS may not be an issue—many systems have them pre-
installed and the RSQLite package comes bundled with the source for the RDBMS—the need for the RDBMS
and knowledge of structured query language (SQL) nevertheless adds some overhead. This overhead may
serve as an impediment for users in need of a database for simpler applications.
2
When creating a database with the dbCreate function, you can also specify the type argument, which
controls how the database is represented on the backend. We will discuss the different backends in further
detail later. For now, we use the default backend, which is called “DB1”.
Once the database is created, it must be initialized in order to be accessed. The dbInit function returns
an S4 object inheriting from class “filehash”. Since this is a newly created database, there are no objects in
it.
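As a minimal sketch of creating and initializing a database (the database name "exampleDB" is just an illustration):

```r
library(filehash)

dbCreate("exampleDB", type = "DB1")  ## returns TRUE on success
db <- dbInit("exampleDB")            ## S4 object inheriting from "filehash"
dbList(db)                           ## character(0): a new database has no keys
```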
The primary interface to filehash databases consists of the functions dbFetch, dbInsert, dbExists,
dbList, and dbDelete.
> dbInsert(db, "a", rnorm(100))
> value <- dbFetch(db, "a")
> mean(value)
[1] 0.002912563
The function dbList lists all of the keys that are available in the database, dbExists tests whether a
given key is in the database, and dbDelete deletes a key-value pair from the database.
> dbInsert(db, "b", 123)
> dbDelete(db, "a")
> dbList(db)
[1] "b"
> dbExists(db, "a")
[1] FALSE
While using functions like dbInsert and dbFetch is straightforward, it can often be easier on the fingers
to use standard R subset and accessor functions like $, [[, and [. Filehash databases have methods for these
functions so that objects can be accessed in a more compact manner. Similarly, replacement methods for
these functions are also available. The [ function can be used to access multiple objects from the database,
in which case a list is returned.
> db$a <- rnorm(100, 1)
> mean(db$a)
[1] 1.011141
> mean(db[["a"]])
[1] 1.011141
For all of the accessor functions, only character indices are allowed. Numeric indices are caught and an error
is given.
> e <- local({
+ err <- function(e) e
+ tryCatch(db[[1]], error = err)
+ })
> conditionMessage(e)
Finally, there is a method for the with generic function which operates much like using with on lists or
environments.
The following three statements all return the same value.
> with(db, c(a = mean(a), b = mean(b)))
a b
1.011141 2.012793
When using with, the values of “a” and “b” are looked up in the database.
> sapply(db[c("a", "b")], mean)
a b
1.011141 2.012793
Here, using [ on db returns a list with the values associated with “a” and “b”. Then sapply is applied in
the usual way on the returned list.
> unlist(lapply(db, mean))
a b
1.011141 2.012793
In the last statement we call lapply directly on the “filehash” object. The filehash package defines a method
for lapply that allows the user to apply a function on all the elements of a database directly. The method
essentially loops through all the keys in the database, loads each object separately and applies the supplied
function to each object. lapply returns a named list with each element being the result of applying the
supplied function to an object in the database. There is an argument keep.names to the lapply method
which, if set to FALSE, will drop all the names from the list.
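For example, assuming a small database (the database name "lapplyDB" and its contents are illustrative):

```r
library(filehash)

dbCreate("lapplyDB")
db <- dbInit("lapplyDB")
db$a <- 1:10
db$b <- 11:20

lapply(db, mean)                      ## named list: $a = 5.5, $b = 15.5
lapply(db, mean, keep.names = FALSE)  ## same values, with the names dropped
```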
Using the dbLoad function, the keys of a database can be loaded into an environment as active bindings,
so that database values can be accessed by name like ordinary R objects. First, we create a new database:
> dbCreate("testDB")
[1] TRUE
> db <- dbInit("testDB")
> db$x <- rnorm(100)
> db$y <- runif(100)
> db$a <- letters
> dbLoad(db)
> ls()
[1] "a" "db" "x" "y"
Notice that we appear to have some additional objects in our workspace. However, the values of these objects
are not stored in memory—they are stored in the database. When one of the objects is accessed, the value is
automatically loaded from the database.
> mean(y)
[1] 0.5118129
> sort(a)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
[15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
If I assign a different value to one of these objects, its associated value is updated in the database via the
active binding mechanism.
> y <- rnorm(100, 2)
> mean(y)
[1] 2.010489
If I subsequently remove the database and reload it later, the updated value for “y” persists.
> rm(list = ls())
> db <- dbInit("testDB")
> dbLoad(db)
> ls()
[1] "a" "db" "x" "y"
> mean(y)
[1] 2.010489
Perhaps one disadvantage of the active binding approach taken here is that whenever an object is ac-
cessed, the data must be reloaded into R. This behavior is distinctly different from the delayed-assignment
approach taken in g.data, where an object need only be loaded once and then remains in memory.
However, when using delayed assignments, if one cycles through all of the objects in the database, one could
eventually exhaust the available memory.
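The difference can be seen with R's active bindings (the makeActiveBinding function): the binding function runs on every access, standing in here for a reload from disk (the names below are illustrative, not filehash internals):

```r
accesses <- 0
makeActiveBinding("z", function() {
    accesses <<- accesses + 1  ## stands in for "re-read the value from disk"
    rnorm(10)
}, environment())

invisible(z)
invisible(z)
accesses  ## 2: the value was "reloaded" once per access
```

By contrast, a delayedAssign promise would run its expression only on the first access.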
7 Filehash database backends
Currently, the filehash package can represent databases in two different formats. The default format is called
“DB1” and it stores the keys and values in a single file. From experience, this format works well overall but
can be a little slow to initialize when there are many thousands of keys. Briefly, the “filehash” object in R
stores a map which associates keys with a byte location in the database file where the corresponding value
is stored. Given the byte location, we can seek to that location in the file and read the data directly. Before
reading in the data, a check is made to ensure that the map is up to date. This format depends critically
on having a working ftell at the system level, and a crude check for this is made when trying to initialize
a database of this format.
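The byte-offset idea behind the “DB1” format can be sketched in plain R; this illustrates the concept only, not the actual filehash file layout:

```r
f <- tempfile()
con <- file(f, "w+b")

## Write several serialized values, recording the byte offset at which
## each one starts -- this plays the role of the key-to-offset map.
map <- numeric(0)
for (key in c("a", "b", "c")) {
    map[key] <- seek(con)  ## current byte position in the file
    serialize(paste("value of", key), con)
}

## To fetch "b", seek straight to its recorded offset and unserialize
## from there, without reading any of the other values.
seek(con, map["b"])
val <- unserialize(con)
close(con)
```

Note that this sketch, like the “DB1” format itself, relies on the file connection supporting reliable seeking.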
The second format is called “RDS” and it stores objects as separate files on the disk in a directory with
the same name as the database. This format is the most straightforward and simple of the available formats.
When a request is made for a specific key, filehash finds the appropriate file in the directory and reads the
file into R. The only catch is that on operating systems that use case-insensitive file names, objects whose
names differ only in case will collide on the filesystem. To work around this, object names with capital letters
are stored with mangled names on the disk. An advantage of this format is that most of the organizational
work is delegated to the filesystem.
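The “RDS” idea in miniature, using base R's serialization functions (the directory and key names are illustrative):

```r
## One file per key, in a directory that represents the database.
dbdir <- tempfile("rdsDB")
dir.create(dbdir)

saveRDS(rnorm(100), file.path(dbdir, "a"))
saveRDS(letters,    file.path(dbdir, "b"))

## Fetching a key is just reading back the corresponding file.
val <- readRDS(file.path(dbdir, "b"))
```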
8 Extending filehash
The filehash package has a mechanism for developing new backend formats, should the need arise. The
function registerFormatDB can be used to make filehash aware of a new database format that may be
implemented in a separate R package or a file. registerFormatDB takes two arguments: a name for the
new format (like “DB1” or “RDS”) and a list of functions. The list should contain two functions: one function
named “create” for creating a database, given the database name, and another function named “initialize”
for initializing the database. In addition, one needs to define methods for dbInsert, dbFetch, etc.
A list of available backend formats can be obtained via the filehashFormats function. Upon register-
ing a new backend format, the new format will be listed when filehashFormats is called.
The interface for registering new backend formats is still experimental and could change in the future.
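A hypothetical skeleton of such a registration, where the format name “TOY”, the class filehashToy, and both functions are purely illustrative stubs:

```r
library(filehash)
library(methods)

## A stub class for the new backend. A real backend would also define
## methods for this class, e.g.
##   setMethod("dbInsert", "filehashToy",
##             function(db, key, value, ...) ...)
setClass("filehashToy", representation(name = "character"))

createToy <- function(dbName, ...) {
    dir.create(dbName)  ## set up whatever on-disk structure is needed
    TRUE
}
initToy <- function(dbName, ...) {
    new("filehashToy", name = dbName)
}

registerFormatDB("TOY", list(create = createToy, initialize = initToy))
"TOY" %in% names(filehashFormats())  ## TRUE once the format is registered
```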
9 Discussion
The filehash package has been designed to be useful in both a programming setting and an interactive setting.
Its main purpose is to allow for simpler interaction with large datasets where simultaneous access to the
full dataset is not needed. While the package may not be optimal for all settings, one goal was to write a
simple package in pure R that users could install with minimal overhead. In the future I hope to add
functionality for interacting with databases stored on remote computers and perhaps incorporate a “real”
database backend. Some work has already begun on developing a backend based on the RSQLite package.
References
[Bra02] David E. Brahm. Delayed data packages. R News, 2(3):11–12, December 2002.
[Cha91] John M. Chambers. Data management in S. Technical Report 99, AT&T Bell Laboratories Statistics
Research, December 1991. http://stat.bell-labs.com/doc/93.15.ps.
[Cha98] John M. Chambers. Programming with Data: A Guide to the S Language. Springer, 1998.
[Rip04] Brian D. Ripley. Lazy loading and packages in R 2.0.0. R News, 4(2):2–4, September 2004.
[Tem02] Duncan Temple Lang. RObjectTables: User-level attach()’able table support, 2002. R package version
0.3-1.