
Summary functions
As we have discussed earlier, all aggregating functions can take any valid R function to apply on the subsets of the data. Some R packages make this extremely easy for the user, while a few functions require you to fully understand the package concept, custom syntax, and options to get the most out of the high-performance opportunities.
For such more advanced topics, please see Chapter 4, Restructuring Data, and the further readings listed in the References section at the end of the book.
Now, we will concentrate on a very simple summary function, which is extremely common in any general data analysis project: counting the number of cases per group. This quick example will also highlight some of the differences among the referenced alternatives mentioned in this chapter.
Adding up the number of cases in subgroups
Let's focus on plyr, dplyr and data.table now, as I am pretty sure that you can construct the aggregate and tapply versions without any serious issues. On the basis of the previous examples, the current task seems fairly easy: instead of the mean function, we can simply call the length function to return the number of elements in the Diverted column:
> ddply(hflights, .(DayOfWeek), summarise, n = length(Diverted))
  DayOfWeek     n
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
Now we also know that relatively few flights leave Houston on Saturday. However, do we really have to type so much to answer such a simple question? Further, do we really have to name a variable just to count the number of cases? You already know the answer:
> ddply(hflights, .(DayOfWeek), nrow)
  DayOfWeek    V1
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
In short, there is no need to choose a variable from data.frame to determine its length, as it's a lot easier (and faster) to simply check the number of rows in the (sub)datasets.
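For completeness, the aggregate and tapply versions mentioned above might look like the following minimal sketch. It uses a tiny, made-up flights data frame as a stand-in for hflights (the values are hypothetical), so it runs with base R only:

```r
# Toy stand-in for hflights (hypothetical values, not the real dataset)
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 3, 3, 3),
  Diverted  = c(0, 0, 1, 0, 0, 1)
)

# aggregate: apply length() to Diverted within each DayOfWeek group;
# the result is a data frame with one row per group
n_aggregate <- aggregate(Diverted ~ DayOfWeek, data = flights, FUN = length)

# tapply: same counts, but returned as a named one-dimensional array
n_tapply <- tapply(flights$Diverted, flights$DayOfWeek, length)

print(n_aggregate)
print(n_tapply)
```

Note that aggregate keeps the name of the counted column (Diverted here), so you would probably want to rename it afterwards.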
However, we can also return the very same results in a much easier and quicker way. Probably, you have already thought of using the good old table function for such a straightforward task:
> table(hflights$DayOfWeek)

    1     2     3     4     5     6     7
34360 31649 31926 34902 34972 27629 32058
The only problem with the resulting object is that we have to transform it further, for example, to data.frame in most cases. Well, plyr already has a helper function to do this in one step, with a very intuitive name:
> count(hflights, 'DayOfWeek')
  DayOfWeek freq
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
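If you would rather stay with base R, the transformation of a table object to data.frame can be sketched as follows. The flights data frame and the column name n are illustrative assumptions, not part of the original example:

```r
# Toy stand-in for hflights (hypothetical values)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))

# Naming the argument of table() sets the name of the grouping column;
# responseName sets the name of the count column
counts <- as.data.frame(
  table(DayOfWeek = flights$DayOfWeek),
  responseName = "n",
  stringsAsFactors = FALSE
)

# table() stores the group labels as text, so convert them back to numbers
counts$DayOfWeek <- as.integer(counts$DayOfWeek)

print(counts)
```

The extra conversion step is the main cost of this route: the group labels come back as strings (or factors, by default), which is easy to overlook.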
Therefore, we end up with some rather simple examples for counting data, but let us also see how to implement summary tables with dplyr. If you simply try to modify our previous dplyr commands, you will soon realize that passing the length or nrow function, as we did in plyr, simply does not work. However, reading the manuals or some related questions on StackOverflow soon points our attention to a handy helper function called n:
> dplyr::summarise(hflights_DayOfWeek, n())
Source: local data frame [7 x 2]

  DayOfWeek   n()
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
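As a small variation, the count column can be given a proper name, and current dplyr releases also ship a count function that groups and counts in one call. The sketch below assumes a reasonably recent dplyr and uses a made-up flights data frame instead of hflights:

```r
library(dplyr)

# Toy stand-in for hflights (hypothetical values)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))

# Explicit grouping, then n() with a readable column name
by_day <- group_by(flights, DayOfWeek)
n_summarise <- dplyr::summarise(by_day, n = n())

# dplyr::count() groups and counts in a single step;
# the count column is called n by default
n_count <- dplyr::count(flights, DayOfWeek)

print(n_summarise)
print(n_count)
```

The functions are namespace-qualified here because plyr and dplyr both export summarise and count, and the one loaded last masks the other.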
However, to be honest, do we really need this relatively complex approach? If you remember the structure of hflights_DayOfWeek, you will soon realize that there is a much easier and quicker way to find out the overall number of flights on each weekday:
> attr(hflights_DayOfWeek, 'group_sizes')
[1] 34360 31649 31926 34902 34972 27629 32058
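Reading an attribute directly relies on the internal structure of the grouped object, which depends on the dplyr version in use; as far as I know, later releases (from around version 0.8.0) no longer store a group_sizes attribute. A more portable sketch, assuming a recent dplyr and a made-up flights data frame, uses the exported group_size and n_groups helpers:

```r
library(dplyr)

# Toy stand-in for hflights (hypothetical values)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))
by_day <- group_by(flights, DayOfWeek)

# Number of rows in each group, in the order of the sorted group keys
sizes <- group_size(by_day)

# Number of groups
groups <- n_groups(by_day)

print(sizes)
print(groups)
```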
Further, just to make sure that we do not forget the custom (yet pretty) syntax of data.table, let us compute the results with another helper function:
> hflights_dt[, .N, by = list(DayOfWeek)]
   DayOfWeek     N
1:         1 34360
2:         2 31649
3:         3 31926
4:         4 34902
5:         5 34972
6:         6 27629
7:         7 32058
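One detail worth keeping in mind: with by, data.table returns the groups in the order in which they first appear in the data, so the sorted output above presumably reflects a table that was keyed or sorted beforehand (this is my assumption). The following sketch, using a made-up flights_dt table, contrasts by with keyby, which sorts the result by the grouping column:

```r
library(data.table)

# Toy stand-in for hflights_dt (hypothetical values, deliberately unsorted)
flights_dt <- data.table(DayOfWeek = c(3, 1, 3, 2, 1, 3))

# by: groups appear in order of first occurrence (3, 1, 2)
n_by <- flights_dt[, .N, by = DayOfWeek]

# keyby: same counts, but the result is sorted (and keyed) by DayOfWeek
n_keyby <- flights_dt[, .N, keyby = DayOfWeek]

print(n_by)
print(n_keyby)
```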