Mastering Data Analysis with R

Summary functions

As discussed earlier, all of the aggregating functions can apply any valid R function to subsets of the data. Some R packages make this extremely easy, while a few functions require you to fully understand the package's concepts, custom syntax, and options to get the most out of their high-performance capabilities.

For these more advanced topics, please see Chapter 4, Restructuring Data, and the further reading listed in the References section at the end of the book.

Now, we will concentrate on a very simple summary task that is extremely common in any general data analysis project: counting the number of cases per group. This quick example will also highlight some of the differences among the alternatives covered in this chapter.

Adding up the number of cases in subgroups

Let's focus on plyr, dplyr, and data.table now, as I am pretty sure that you can construct the aggregate and tapply versions without any serious issues. Based on the previous examples, the current task seems fairly easy: instead of the mean function, we can simply call the length function to return the number of elements in the Diverted column:

> ddply(hflights, .(DayOfWeek), summarise, n = length(Diverted))
  DayOfWeek     n
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
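
For completeness, the aggregate and tapply versions mentioned above might look something like the following. This is only a minimal sketch: the tiny flights data frame is a hypothetical stand-in for hflights, so the counts differ from the real output.

```r
# Toy stand-in for hflights (hypothetical data, not the real dataset)
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 3, 3, 3),
  Diverted  = c(0, 0, 1, 0, 0, 1)
)

# aggregate: apply length() to Diverted within each DayOfWeek group
n_aggregate <- aggregate(Diverted ~ DayOfWeek, data = flights, FUN = length)
names(n_aggregate)[2] <- "n"  # rename the count column
print(n_aggregate)

# tapply: returns a named one-dimensional array instead of a data.frame
n_tapply <- tapply(flights$Diverted, flights$DayOfWeek, length)
print(n_tapply)
```

Note that aggregate returns a data.frame, while tapply returns an array named by group, so the latter usually needs a further conversion step.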

Now we also know that relatively few flights leave Houston on Saturdays (day 6). However, do we really have to type so much to answer such a simple question? Further, do we really have to name a variable just to count its elements? You already know the answer:

> ddply(hflights, .(DayOfWeek), nrow)
  DayOfWeek    V1
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058

In short, there is no need to pick a variable from the data.frame to determine its length, as it's a lot easier (and faster) to simply check the number of rows in the (sub)datasets.
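
The equivalence of the two approaches can be checked with base R alone. The following is a small sketch on a hypothetical toy data frame (flights is not the real hflights dataset), splitting the data by group and comparing nrow with the length of one column:

```r
# Toy stand-in for hflights (hypothetical data, not the real dataset)
flights <- data.frame(
  DayOfWeek = c(1, 1, 2, 3, 3, 3),
  Diverted  = c(0, NA, 1, 0, 0, 1)
)

# Split into one sub-data.frame per weekday
by_day <- split(flights, flights$DayOfWeek)

# Count the rows of each subset directly ...
n_rows <- sapply(by_day, nrow)

# ... or take the length of a single column (NA values are counted too)
n_len <- sapply(by_day, function(d) length(d$Diverted))

print(n_rows)
print(identical(n_rows, n_len))
```

Because length counts missing values as well, both versions agree even when the chosen column contains NA.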

However, we can also return the very same results in a much easier and quicker way. Probably, you have already thought of using the good old table function for such a straightforward task:

> table(hflights$DayOfWeek)

    1     2     3     4     5     6     7
34360 31649 31926 34902 34972 27629 32058

The only problem with the resulting object is that, in most cases, we have to transform it further, for example, into a data.frame. Well, plyr already has a helper function that does this in one step, with a very intuitive name:

> count(hflights, 'DayOfWeek')
  DayOfWeek  freq
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
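
For comparison, the manual transformation of a table object into a data.frame, which the plyr helper saves us from, can be sketched in base R as follows. The flights data frame is again a hypothetical toy example, and the freq column name is chosen here only to mirror the plyr output:

```r
# Toy stand-in for hflights (hypothetical data, not the real dataset)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))

# Name the table dimension so that the resulting column is called DayOfWeek,
# and rename the default "Freq" column via responseName
counts <- as.data.frame(table(DayOfWeek = flights$DayOfWeek),
                        responseName = "freq")
print(counts)

# Unlike in the plyr result, DayOfWeek is now a factor
print(class(counts$DayOfWeek))
```

One difference to keep in mind is that the grouping column comes back as a factor, so it may need converting before numeric use.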

So far, we have seen some rather simple examples of counting data, but let us also see how to build summary tables with dplyr. If you simply try to modify our previous dplyr commands, you will soon find that passing the length or nrow function, as we did in plyr, does not work. However, reading the manuals or some related questions on Stack Overflow soon draws our attention to a handy helper function called n:

> dplyr::summarise(hflights_DayOfWeek, n())
Source: local data frame [7 x 2]

  DayOfWeek   n()
1         1 34360
2         2 31649
3         3 31926
4         4 34902
5         5 34972
6         6 27629
7         7 32058
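
The result column above is literally called n(), which is awkward to refer to later. A minimal sketch of two tidier variants follows, assuming a reasonably recent dplyr release and using a hypothetical toy data frame in place of hflights: naming the column inside summarise, or letting dplyr's own count helper do the grouping and counting in one call.

```r
library(dplyr)

# Toy stand-in for hflights (hypothetical data, not the real dataset)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))

# Group first, then give the count column an explicit name
by_day <- dplyr::group_by(flights, DayOfWeek)
n_summarise <- dplyr::summarise(by_day, n = n())
print(n_summarise)

# dplyr::count groups and counts in one step; the result column is named n
n_count <- dplyr::count(flights, DayOfWeek)
print(n_count)
```

The dplyr:: prefix is used on purpose here: plyr also exports a count function, so the explicit namespace avoids picking the wrong one when both packages are loaded.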

However, to be honest, do we really need this relatively complex approach? If you remember the structure of hflights_DayOfWeek, you will soon realize that there is a much easier and quicker way to find out the overall number of flights on each weekday:

> attr(hflights_DayOfWeek, 'group_sizes')
[1] 34360 31649 31926 34902 34972 27629 32058
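
Reading the attribute directly relies on how the grouped object is stored internally, and to my understanding more recent dplyr releases no longer keep a group_sizes attribute. A hedged sketch of the same idea using the exported group_size and n_groups functions, on a hypothetical toy data frame instead of hflights:

```r
library(dplyr)

# Toy stand-in for hflights (hypothetical data, not the real dataset)
flights <- data.frame(DayOfWeek = c(1, 1, 2, 3, 3, 3))
by_day <- dplyr::group_by(flights, DayOfWeek)

# Number of rows in each group, in the order of the group keys
sizes <- dplyr::group_size(by_day)
print(sizes)

# Number of groups
print(dplyr::n_groups(by_day))
```

As with the attribute, the result is a bare vector of counts, so the group labels have to be looked up separately if they are needed.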

Further, just to make sure that we do not forget the custom (yet pretty) syntax of data.table, let us compute the results with another helper, the special .N symbol:

> hflights_dt[, .N, by = list(DayOfWeek)]
   DayOfWeek     N
1:         1 34360
2:         2 31649
3:         3 31926
4:         4 34902
5:         5 34972
6:         6 27629
7:         7 32058
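
Two small variations on this call may be worth knowing. The following sketch uses a hypothetical toy table (flights_dt is not the real hflights_dt) to show how to name the count column and how the ordering of the result differs between by and keyby:

```r
library(data.table)

# Toy stand-in for hflights_dt (hypothetical data, deliberately unsorted)
flights_dt <- data.table(DayOfWeek = c(3, 1, 3, 2, 1, 3))

# by: groups appear in order of first occurrence; .(n = .N) names the column
n_by <- flights_dt[, .(n = .N), by = DayOfWeek]
print(n_by)

# keyby: same counts, but the result is sorted (and keyed) by DayOfWeek
n_keyby <- flights_dt[, .N, keyby = DayOfWeek]
print(n_keyby)
```

Whether the by result comes back sorted therefore depends on the row order of the input, while keyby always sorts by the grouping column.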