The trouble with tibbles

Hadley Wickham’s dplyr package makes complex data manipulations easy to describe. However, dplyr functions all return “tibbles” rather than data.frames. Class tbl inherits from data.frame, so you can use tbls everywhere you use data.frames. Except when you can’t.

Here’s one example that tripped me up recently.

df <- data.frame(a = 1:26,
                 b = letters)
sapply(df,class)

##         a         b 
## "integer"  "factor"

sum(df[,"b"] == 'b')

## [1] 1

sum(as.character(df[,"b"],1,1) == 'b')

## [1] 1

But now with tbl_df

library(dplyr)
my_first_tbl <- tbl_df(df)
my_first_tbl

## Source: local data frame [26 x 2]
## 
##        a      b
##    <int> <fctr>
## 1      1      a
## 2      2      b
## 3      3      c
## 4      4      d
## 5      5      e
## 6      6      f
## 7      7      g
## 8      8      h
## 9      9      i
## 10    10      j
## ..   ...    ...

So I don’t have to do sapply(df, class) to see what is going on with the contents. This is good. tbls also print out only what fits on the console, which is also nice.

But check this out:

sum(my_first_tbl[,"b"] == 'b') ## works

## [1] 1

sum(as.character(my_first_tbl[,"b"]) == 'b') ## !!

## [1] 0

This threw me for longer than I care to admit. Especially embarrassing when a student comes with this problem and I don’t know the answer!

The reason is that [.tbl_df() has different default behavior from [.data.frame when extracting a single column.

class(my_first_tbl[,"b"])

## [1] "tbl_df"     "tbl"        "data.frame"

class(df[,"b"])

## [1] "factor"

Coercing a data.frame to character gives a different outcome than coercing a tbl_df. What gives? Turns out that [.tbl_df() has drop = FALSE while [.data.frame has drop = TRUE when the result has a single column. Never heard of drop you say? Check this out:

class(df[,"b", drop=FALSE])

## [1] "data.frame"

sum(as.character(df[,"b", drop=FALSE],1,1) == 'b')

## [1] 0

There are other differences too. For example, data_frame() by default does NOT convert strings to factors:

my_second_tbl <- data_frame(a = 1:26,
                            b = letters)
my_second_tbl

## Source: local data frame [26 x 2]
## 
##        a     b
##    <int> <chr>
## 1      1     a
## 2      2     b
## 3      3     c
## 4      4     d
## 5      5     e
## 6      6     f
## 7      7     g
## 8      8     h
## 9      9     i
## 10    10     j
## ..   ...   ...

I think I’m a fan of tibbles, but even if I’m not I am in love with dplyr, so I’d better get used to tibbles.