Getting a dplyr count of distinct values in a readable way

I'm new to dplyr and I need to count the number of distinct values within a group. Here's an example data frame:

data=data.frame(aa=c(1,2,3,4,NA), bb=c('a', 'b', 'a', 'c', 'c'))

I know I can do things like:

library(dplyr)
by_bb <- group_by(data, bb)
summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

But what if I want the number of unique elements?

I can do:

> summarise(by_bb, length(unique(unlist(aa))))

  bb length(unique(unlist(aa)))
1  a                          2
2  b                          1
3  c                          2

If I want to exclude NAs, I can do:

> summarise(by_bb, length(unique(unlist(aa[!is.na(aa)]))))

  bb length(unique(unlist(aa[!is.na(aa)])))
1  a                                      2
2  b                                      1
3  c                                      1

But this is a little hard for me to read. Is there a better way to do this kind of summary?


How about this option:

data %>%                    # take the data.frame "data"
  filter(!is.na(aa)) %>%    # Using "data", filter out all rows with NAs in aa
  group_by(bb) %>%          # Then, with the filtered data, group it by "bb"
  summarise(Unique_Elements = n_distinct(aa))   # Now summarise with unique elements per group


#Source: local data frame [3 x 2]
#
#  bb Unique_Elements
#1  a               2
#2  b               1
#3  c               1

Use filter to filter out any rows where aa has NAs, then group the data by column bb and then summarise by counting the number of unique elements of column aa by group of bb.

As you can see, I'm making use of the pipe operator %>%, which you can use to "pipe" or "chain" commands together when using dplyr. This helps you write easily readable code because it's more natural: you write code from left to right and top to bottom, not deeply nested from the inside out (as in your example code).
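As a variation on the pipeline above, here is a minimal sketch that skips the filter step and lets n_distinct() drop the NAs itself through its na.rm argument. It assumes a dplyr version that provides na.rm on n_distinct(); the name `result` is only illustrative and not part of the original answer.

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c("a", "b", "a", "c", "c"))

result <- data %>%
  group_by(bb) %>%                                             # group by "bb" without filtering first
  summarise(Unique_Elements = n_distinct(aa, na.rm = TRUE))    # ignore NAs while counting distinct values

result
```

One difference from the filter() version: a group whose aa values are all NA is still returned here, with a count of 0, whereas filtering first removes such a group from the result entirely.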

Edit:

In the first part of your question, you wrote:

I know I can do things like:

by_bb <- group_by(data, bb)
summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

Here's another option to do that (applying a number of functions to the same column(s)):

data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise_each(funs(mean, max, sum, n_distinct), aa)


#Source: local data frame [3 x 5]
#
#  bb mean max sum n_distinct
#1  a    2   3   4          2
#2  b    2   2   2          1
#3  c    4   4   4          1
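In more recent dplyr releases, summarise_each() and funs() are deprecated. A rough equivalent, assuming dplyr 1.0.0 or later, uses across() with a named list of functions. This is a sketch rather than part of the original answer; the name `stats` is illustrative, and the result columns are named after both the column and the function (aa_mean, aa_max, and so on) instead of the bare function names shown above.

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c("a", "b", "a", "c", "c"))

stats <- data %>%
  filter(!is.na(aa)) %>%     # drop rows where aa is NA
  group_by(bb) %>%           # group by "bb"
  summarise(across(aa, list(mean = mean, max = max, sum = sum,
                            n_distinct = n_distinct)))   # apply each function to aa per group

stats
```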