How to select the rows with the maximum value in each group with dplyr?

小开

Best answer

Try this:

result <- df %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)

It seems to work:

# ddply() and .() come from the plyr package.
identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value), ])
)
#[1] TRUE

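As pointed out in the comments, if you strictly need only one row per group, slice may be preferable, as in @皇室's answer below. If several rows share the same maximum value, this answer returns all of them.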
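To make the tie behaviour concrete, here is a minimal sketch on a made-up data frame. The `toy` data and its values are assumptions for illustration only, not the asker's data; one group deliberately contains two rows tied at the maximum.

```r
library(dplyr)

# Hypothetical data: group (a, x) has two rows tied at the maximum value 5.
toy <- data.frame(
  A     = c("a", "a", "a", "b", "b"),
  B     = c("x", "x", "x", "y", "y"),
  C     = c(1, 2, 3, 1, 2),
  value = c(5, 5, 1, 2, 7)
)

# filter() keeps every row equal to the group maximum, so ties survive.
with_ties <- toy %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  ungroup()

# slice(which.max()) keeps only the first row that attains the maximum.
one_per_group <- toy %>%
  group_by(A, B) %>%
  slice(which.max(value)) %>%
  ungroup()

nrow(with_ties)      # 3: both tied rows in (a, x), plus one row in (b, y)
nrow(one_per_group)  # 2: exactly one row per group
```

Which of the two you want depends on whether tied rows are meaningful in your data or just noise.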

小开

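You can use top_n:

df %>% group_by(A, B) %>% top_n(n=1)

This ranks by the last column (value) and returns the top n=1 rows.

At the time of writing, you cannot change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).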
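In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`, which takes the ordering column explicitly instead of falling back to the last column. A minimal sketch, assuming dplyr >= 1.0.0 is installed; the `toy` data frame is made up for illustration:

```r
library(dplyr)

# Hypothetical data with a tie at the maximum in group (a, x).
toy <- data.frame(
  A     = c("a", "a", "a", "b", "b"),
  B     = c("x", "x", "x", "y", "y"),
  value = c(5, 5, 1, 2, 7)
)

# Default with_ties = TRUE: all rows tied for the maximum are returned.
top_ties <- toy %>%
  group_by(A, B) %>%
  slice_max(value, n = 1) %>%
  ungroup()

# with_ties = FALSE: at most n rows per group, even when values are tied.
top_one <- toy %>%
  group_by(A, B) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(top_ties)  # 3
nrow(top_one)   # 2
```

Naming the column explicitly also means the result no longer depends on column order.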

小开

df %>% group_by(A,B) %>% slice(which.max(value))

小开

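This more verbose solution gives greater control over what happens when the maximum value is duplicated (in this example, it takes one of the tied rows at random):

library(dplyr)
df %>%
  group_by(A, B) %>%
  mutate(the_rank = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank)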
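To see what the `ties.method` argument changes, it can help to call base R's `rank()` on a small vector directly. This is only an illustrative sketch; the `value` vector is made up and contains a tie at the maximum.

```r
# Hypothetical values with a tie at the maximum (two 9s).
value <- c(3, 9, 9, 1)

# "min": tied elements share the lowest rank, so BOTH 9s get rank 1
# and filtering on rank == 1 would keep two rows.
rank_min <- rank(-value, ties.method = "min")      # 3 1 1 4

# "first": ties are broken by position, so the earlier 9 gets rank 1.
rank_first <- rank(-value, ties.method = "first")  # 3 1 2 4

# "random": ties are broken at random, so exactly one of the 9s gets
# rank 1, but which one can differ between runs unless a seed is set.
set.seed(1)
rank_random <- rank(-value, ties.method = "random")
sum(rank_random == 1)  # exactly one element has rank 1
```

Swapping `"random"` for `"first"` in the pipeline above would therefore make the choice deterministic.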

小开

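More generally, I think you may want the "top" of the rows once they are sorted within a given group.

When a single value is being maximized, you are essentially sorting by just one column. However, sorting hierarchically by multiple columns is often useful (for example, a date column and a time-of-day column).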

# Answering the question of getting row with max "value".
df %>%
  # Within each grouping of A and B values.
  group_by(A, B) %>%
  # Sort rows in descending order by "value" column.
  arrange(desc(value)) %>%
  # Pick the top 1 value
  slice(1) %>%
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()


# Answering an extension of the question of
# getting row with the max value of the lowest "C".
df %>%
  # Within each grouping of A and B values.
  group_by(A, B) %>%
  # Sort rows in ascending order by C, and then within that by
  # descending order by "value" column.
  arrange(C, desc(value)) %>%
  # Pick the one top row based on the sort
  slice(1) %>%
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()
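The date and time-of-day case mentioned above can be sketched the same way. Everything in this example is hypothetical (the `readings` data frame, its column names, and its values); it assumes times are stored as zero-padded "HH:MM" strings so that they sort correctly as text.

```r
library(dplyr)

# Hypothetical log: one row per reading, with separate date and time columns.
readings <- data.frame(
  sensor = c("s1", "s1", "s1", "s2", "s2"),
  date   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02",
                     "2020-01-01", "2020-01-01")),
  time   = c("09:00", "08:30", "17:45", "12:00", "06:15"),
  value  = c(10, 20, 30, 40, 50)
)

latest <- readings %>%
  group_by(sensor) %>%
  # Latest date first; within the same date, latest time first.
  arrange(desc(date), desc(time)) %>%
  # The first row of each group is now its most recent reading.
  slice(1) %>%
  ungroup()

latest$value  # 30 for s1 (2020-01-02 17:45), 40 for s2 (2020-01-01 12:00)
```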

小开

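For me, it helped to count the number of values per group. Copy the count table into a new object, then filter for the maximum count within each level of the first grouping variable. For example: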

count_table <- df %>%
  group_by(A, B) %>%
  count() %>%
  arrange(A, desc(n))


count_table %>%
  group_by(A) %>%
  filter(n == max(n))

or

count_table %>%
  group_by(A) %>%
  top_n(1, n)