data.table is not so compatible after all

The documentation of the popular {data.table} states that data.table objects are compatible with R functions and packages that accept only data frames, but is this really true?

Here is a quote from the {data.table} package documentation (see ?data.table in R or [[1]]):

Since a data.table is a data.frame, it is compatible with R functions and packages that accept only data.frames.

That's not right and I will show you why. Here is an example in which we select a column named a in a data frame df:

> df <- data.frame(a = 1:3, b = letters[1:3])
> df
  a b
1 1 a
2 2 b
3 3 c
> df["a"]
  a
1 1
2 2
3 3

Now here's the same example, but using a data.table object:

> library("data.table")
data.table 1.16.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt <- data.table(a = 1:3, b = letters[1:3])
> dt
       a      b
   <int> <char>
1:     1      a
2:     2      b
3:     3      c
> dt["a"]
Error in `[.data.table`(dt, "a") :
  When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

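If we try to get column a using the command dt["a"], where dt is an object of class data.table, then we get an error. Why? An R data.frame is actually just a list [[2]], so df["a"] selects the list element (that is, the column) named a. The {data.table} package defines its own method for the `[` function, `[.data.table`, which treats a single character argument as the row argument i (here, the values to join on) rather than as a column name. This effectively removes the ability to select columns in a table as we would select elements in a list.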
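The relationship between the two classes can be inspected directly. The following is a minimal sketch that rebuilds the df and dt objects from the examples above and looks at their class attributes; it only illustrates the dispatch order and is not a full account of how `[.data.table` works.

```r
library(data.table)

df <- data.frame(a = 1:3, b = letters[1:3])
dt <- data.table(a = 1:3, b = letters[1:3])

# A data frame is a list of equal-length columns with a class attribute.
is.list(df)                  # TRUE
class(df)                    # "data.frame"

# A data.table inherits from data.frame but lists its own class first,
# so `[` dispatches to `[.data.table` before `[.data.frame`.
class(dt)                    # "data.table" "data.frame"
inherits(dt, "data.frame")   # TRUE

# List-style extraction of a single column with `[[` gives the same
# vector for both objects.
identical(df[["a"]], dt[["a"]])   # TRUE
```

So the inheritance claim in the documentation holds at the class level; the difference lies in what the `[` method does with its arguments.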

You can work around this by indexing tables as if they were matrices, using the [i, j] indexing notation. For example:

> dt[, "a"]
       a
   <int>
1:     1
2:     2
3:     3

Alas, this approach again breaks down when our column names are stored in a variable:

> colA <- "a"
> dt[, colA]
Error in `[.data.table`(dt, , colA) :
  j (the 2nd argument inside [...]) is a single symbol but column name 'colA' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..colA]. The .. prefix conveys one-level-up similar to a file system path.
> dt[, colA, with = FALSE]
       a
   <int>
1:     1
2:     2
3:     3
> dt[, ..colA]
       a
   <int>
1:     1
2:     2
3:     3

As you can see, you have to add the argument with = FALSE or the special .. prefix for `[.data.table` to interpret the value provided to the j argument as a variable that holds column names.
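Two further ways to select a column whose name is stored in a variable are sketched below, reusing dt and colA from the example above. The names v and sub are chosen here for illustration only. Note that the two forms return different things: a plain vector versus a one-column data.table.

```r
library(data.table)

dt <- data.table(a = 1:3, b = letters[1:3])
colA <- "a"

# `[[` extracts a single column as a plain vector, as it would for a list.
v <- dt[[colA]]

# .SDcols restricts .SD (the "subset of data") to the named columns,
# so this returns a one-column data.table.
sub <- dt[, .SD, .SDcols = colA]

v            # the integer vector 1 2 3
names(sub)   # "a"
```

Which form fits best depends on whether the calling code expects a vector or a table.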

The {data.table} package brings some very nice performance benefits to working with tabular data in R, and there can be situations where your data is simply too large to handle using R's data frames. So if you decide to use the {data.table} package in your work, be aware that it is an all-or-nothing decision, because all your code will need to handle data.table objects correctly.
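To give an idea of what handling both kinds of objects can look like, here is a minimal sketch. The helper select_columns is hypothetical and not part of any package; it branches on the class of its input. The last lines show the other option, converting to a plain data frame with as.data.frame() before calling code that relies on list-style indexing, at the cost of making a copy.

```r
library(data.table)

# Hypothetical helper: select columns by name from either kind of table.
select_columns <- function(x, cols) {
  if (is.data.table(x)) {
    x[, cols, with = FALSE]   # data.table: treat `cols` as column names
  } else {
    x[, cols, drop = FALSE]   # data.frame: keep the result a data frame
  }
}

df <- data.frame(a = 1:3, b = letters[1:3])
dt <- data.table(a = 1:3, b = letters[1:3])

from_df <- select_columns(df, "a")   # one-column data.frame
from_dt <- select_columns(dt, "a")   # one-column data.table

# Alternative: convert to a plain data frame first, after which
# list-style indexing works again.
plain <- as.data.frame(dt)
plain["a"]
```

Either way, the code has to know about data.table somewhere, which is the point made above.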

— M

Footnotes

[1] Online documentation for data.table at https://rdatatable.gitlab.io/data.table/reference/data.table.html
[2] The data.frame class in R is not actually a built-in data type, but rather a special compound object. This means that data frames are lists with the data.frame class attribute, and all elements in the list must have the same length. See https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Data-frame-objects.