Here is a quote from the {data.table} package documentation (see ?data.table in R or [[1]]):
Since a data.table is a data.frame, it is compatible with R functions and packages that accept only data.frames.
That is not quite right, and I will show you why. Here is an example in which we select a column named a from a data frame df:
> df <- data.frame(a = 1:3, b = letters[1:3])
> df
  a b
1 1 a
2 2 b
3 3 c
> df["a"]
  a
1 1
2 2
3 3
Now here's the same example, but using a data.table object:
> library("data.table")
data.table 1.16.2 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> dt <- data.table(a = 1:3, b = letters[1:3])
> dt
       a      b
   <int> <char>
1:     1      a
2:     2      b
3:     3      c
> dt["a"]
Error in `[.data.table`(dt, "a") :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
If we try to get column a with dt["a"], where dt is an object of class data.table, we get an error. Why? R's data.frame is actually just a list [[2]], so df["a"] selects a list element by name. The {data.table} package defines its own method for the `[` function, `[.data.table`, which treats a lone first argument as the row index i (for a character vector, a join). This effectively removes the ability to select columns in a table as we would select elements in a list.
The way around this is to index the table as if it were a matrix, using the [i, j] indexing notation. For example,
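To make the list point concrete, here is a minimal sketch. It recreates df and dt so that it runs on its own and assumes a top-level R session with {data.table} installed. Both objects are lists underneath, and list-style extraction with `[[` behaves the same for both; only single-bracket indexing with one argument diverges.

```r
library(data.table)

df <- data.frame(a = 1:3, b = letters[1:3])
dt <- data.table(a = 1:3, b = letters[1:3])

# Both objects are lists under the hood.
is.list(df)
is.list(dt)

# A data.table carries both classes, so `[` dispatches to `[.data.table` first.
class(dt)

# Double-bracket extraction is list-style for both and returns the column vector.
identical(df[["a"]], dt[["a"]])

# Single-bracket indexing with one argument is where the two classes diverge:
# for dt the call fails, so try() returns an object of class "try-error".
res <- try(dt["a"], silent = TRUE)
inherits(res, "try-error")
```

So code that only ever uses `[[` (or `$`) to pull out columns is unaffected; it is the single-bracket form that needs attention.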
> dt[, "a"]
       a
   <int>
1:     1
2:     2
3:     3
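The matrix-style call also returns something different for the two classes, which is worth checking before swapping one for the other. A short sketch, again recreating df and dt so it stands alone:

```r
library(data.table)

df <- data.frame(a = 1:3, b = letters[1:3])
dt <- data.table(a = 1:3, b = letters[1:3])

# data.frame: a single selected column is dropped to a plain vector by default.
is.data.frame(df[, "a"])                # FALSE
is.integer(df[, "a"])                   # TRUE

# data.table: the result stays a one-column data.table.
is.data.table(dt[, "a"])                # TRUE

# With drop = FALSE the data.frame call returns a one-column table as well.
is.data.frame(df[, "a", drop = FALSE])  # TRUE
```

Code written for data frames that relies on getting a vector back from df[, "a"] will therefore see a table instead when handed a data.table.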
Alas, this approach breaks down again when the column name is stored in a variable:
> colA <- "a"
> dt[, colA]
Error in `[.data.table`(dt, , colA) :
j (the 2nd argument inside [...]) is a single symbol but column name 'colA' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..colA]. The .. prefix conveys one-level-up similar to a file system path.
> dt[, colA, with = FALSE]
       a
   <int>
1:     1
2:     2
3:     3
> dt[, ..colA]
       a
   <int>
1:     1
2:     2
3:     3
As you can see, you have to add the option with = FALSE or the special .. prefix for `[.data.table` to interpret the value provided to the j argument as column names rather than as an expression evaluated inside the table.
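If the same code has to accept both plain data frames and data.tables, one option is to branch on the class. The sketch below is only an illustration: select_cols is a hypothetical helper name, not something provided by {data.table}, and it assumes that returning a table of the same class as the input is what you want.

```r
library(data.table)

df <- data.frame(a = 1:3, b = letters[1:3])
dt <- data.table(a = 1:3, b = letters[1:3])
colA <- "a"

# The two data.table idioms for column names held in a variable agree.
isTRUE(all.equal(dt[, colA, with = FALSE], dt[, ..colA]))

# Hypothetical helper: select columns by name from either class and
# return a table of the same class as the input.
select_cols <- function(x, cols) {
  if (is.data.table(x)) {
    x[, cols, with = FALSE]   # data.table: treat cols as column names
  } else {
    x[, cols, drop = FALSE]   # data.frame: keep a table even for one column
  }
}

select_cols(df, colA)   # one-column data.frame
select_cols(dt, colA)   # one-column data.table
```

This keeps the class-specific syntax in one place, at the cost of having to route every column selection through the helper.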
The {data.table} package brings some very nice performance benefits to working with tabular data in R, and there can be situations where your data is simply too large to handle using R's data frame. So if you decide to use the {data.table} package in your work, be aware that it is an all-or-nothing decision, because all your code will need to be able to handle data.table objects correctly.
— M
Footnotes
[1] | Online documentation for data.table at https://rdatatable.gitlab.io/data.table/reference/data.table.html |
[2] | The data.frame class in R is not actually a built-in data type, but rather a special compound object. This means that data frames are lists with the data.frame class attribute and all elements in the list must have the same length. See https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Data-frame-objects. |