Extract the rows corresponding to the minimum value of a variable, by group

I want to (1) group my data by one variable (State), (2) within each group find the row with the minimum value of another variable (Employees), and (3) extract that entire row.

(1) and (2) are easy one-liners, and I feel (3) should be too, but I can't get it to work.

Here is a sample dataset:

> data
State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42


data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A",
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L,
104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company",
"Employees"), class = "data.frame", row.names = c(NA, -8L))

Computing the min by group is easy, using aggregate:

> aggregate(Employees ~ State, data, function(x) min(x))
State Employees
1    AK        24
2    RI        19

... or data.table:

> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
State Employees
1:    AK        24
2:    RI        19

But how do I extract the entire rows corresponding to these min values, i.e. also include Company in the result?


Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]


State Company Employees
1:    AK       D        24
2:    RI       E        19

Slightly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.
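For example, with the DT defined above, a sketch of both variants adapted with that substitution so every tied row is kept:

# Keep all rows tied for the per-group minimum (both forms return every tie)
DT[ , .SD[Employees == min(Employees)], by = State]
DT[DT[ , .I[Employees == min(Employees)], by = State]$V1]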

See also Subset rows corresponding to max value by group using data.table.

The base function by is often useful for working with block data in data.frames. For example

by(data, data$State, function(x) x[which.min(x$Employees), ] )

It does return the data in a list, but you can collapse that with

do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))

Here is a dplyr solution (note that I am not a regular dplyr user):

library(dplyr)
data %>%
  group_by(State) %>%
  slice(which.min(Employees))

As jazzurro notes in the comments, as of dplyr version 1.0.0, there is also now a built-in function slice_min:

data %>%
  group_by(State) %>%
  slice_min(order_by = Employees)

As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the unique rows per State.

Either using data.table

library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Alternatively, we could also order first and then subset .SD. Both of these operations were optimized in recent data.table versions: order seemingly triggers data.table:::forderv, while .SD[1L] triggers GForce.

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Or dplyr

library(dplyr)
data %>%
  arrange(Employees) %>%
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24

Another interesting idea, borrowed from @Khashaa's awesome answer (with a small modification in the form of mult = "first" in order to handle multiple matches), is to first find the minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.table's gmin function (which skips the evaluation overhead) and the binary join feature.

tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = "first"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19

Some benchmarks

library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)


set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)


microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
"(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
"(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
"(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
"(data.table) self join (on):" = {
tmp <- DT[, .(Employees = min(Employees)), by = State]
DT[tmp, on = .(State, Employees), mult = "first"]},
"(data.table) self join (setkey):" = {
tmp <- DT2[, .(Employees = min(Employees)), by = State]
setkey(tmp, State, Employees)
setkey(DT2, State, Employees)
DT2[tmp, mult = "first"]},
"(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
"(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
"(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
"(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
"(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))




# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h

Corrected plyr solution:

ddply(data, .(State), function(x) x[which.min(x$Employees),])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19

Thanks to @joel.wilson.

In base R you can use ave to get the min per group, compare it with Employees, and use the resulting logical vector to subset the data.frame.

data[data$Employees == ave(data$Employees, data$State, FUN=min),]
#  State Company Employees
#4    AK       D        24
#5    RI       E        19

Or do the comparison already inside the function:

data[as.logical(ave(data$Employees, data$State, FUN=function(x) x==min(x))),]
#data[ave(data$Employees, data$State, FUN=function(x) x==min(x))==1,] #Variant
#  State Company Employees
#4    AK       D        24
#5    RI       E        19

Using collapse

library(collapse)
library(magrittr)
data %>%
  fgroup_by(State) %>%
  fsummarise(Employees = fmin(Employees))
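Note that fsummarise above returns only the group minima, not the full rows. One possible sketch for recovering the whole rows (assuming collapse's fmin with TRA = "replace_fill", which broadcasts each group's minimum back onto every row of that group):

# Broadcast the per-State minimum to all rows, then keep the rows that match it
data[data$Employees == fmin(data$Employees, g = data$State, TRA = "replace_fill"), ]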

This is an old question, but the previous solutions may give results that are not desired if the dataset contains a tie in the min value. For example:

> data
State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42
9    RI       H        19

In case someone wants to keep all the rows corresponding to the min, the desired output is:

  State Company Employees
<fct> <fct>       <int>
1 AK    D              24
2 RI    E              19
3 RI    H              19

A dplyr or data.table solution is as follows.

Using dplyr:

data %>%
  group_by(State) %>%
  slice_min(Employees)

Or possibly slower but more traditional:

data %>%
  group_by(State) %>%
  filter(Employees == min(Employees))

Using data.table: just replace which.min with which(Employees == min(Employees)):

data[, .SD[which(Employees == min(Employees))], by = State]

Or, probably faster, by using .I:

data[data[, .I[which(Employees == min(Employees))], by = State]$V1]

Another data.table solution:

DT[, E_min := min(Employees), by = State][Employees == E_min]

Pretty straightforward and among the fastest. Below I rerun David Arenburg's benchmarking with this and the other fastest data.table solutions.

library(data.table)
library(stringi)      # needed for stri_rand_strings() below
library(microbenchmark)


set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N * 10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)


microbenchmark(
  "(data.table) min column: " = DT3[, E_min := min(Employees), by = State][Employees == E_min],
  "(data.table) .I[which.min]: " = DT[DT[, .I[which.min(Employees)], by = State]$V1],
  "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
  "(data.table) self join (setkey):" = {
    tmp <- DT2[, .(Employees = min(Employees)), by = State]
    setkey(tmp, State, Employees)
    setkey(DT2, State, Employees)
    DT2[tmp, mult = "first"]
  }
)


expr      min       lq      mean   median        uq      max neval
(data.table) min column:  44.30078 52.17932  68.31826 58.65887  76.89786 184.0207   100
(data.table) .I[which.min]:  20.34116 26.31244  39.36874 34.01958  42.65439 124.9204   100
(data.table) order/unique:  70.07820 80.20577 109.71235 95.25586 114.87695 514.4456   100
(data.table) self join (setkey): 13.48105 16.06614  22.58310 17.35083  22.31206 161.9103   100

This solution is most useful if you want just a few columns returned among many, e.g. [Employees == E_min, ..columns_to_keep], in which case it can be even faster.
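For instance, a minimal sketch with a hypothetical columns_to_keep vector (substitute whichever columns you actually need):

# Hypothetical example: flag the group minima, then return only selected columns
columns_to_keep <- c("State", "Employees")
DT3[, E_min := min(Employees), by = State][Employees == E_min, ..columns_to_keep]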