Select columns based on string match - dplyr::select

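I have a data frame ("data") with a very large number of columns. Some of the column names contain a certain string ("search_string").

How can I use dplyr::select() to get a subset that includes only the columns whose names contain that string?

I have tried: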

# columns as boolean vector
select(data, grepl("search_string",colnames(data)))


# columns as vector of column names
select(data, colnames(data)[grepl("search_string",colnames(data))])

Neither of these works.

I know that select() accepts numeric vectors in place of column names, e.g.:

select(data,5,7,9:20)

But I don't know how to get a numeric vector of column IDs from my grepl() expression.


There is no need to use select; just use [ instead:

data[,grepl("search_string", colnames(data))]

Let's try it with the iris dataset (head() limits the output to the first six rows):

> head(iris[, grepl("Sepal", colnames(iris))])
Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
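
To address the question's last point directly, here is a minimal sketch of how the positions of the matching columns could be obtained and passed to select(). It assumes a reasonably recent dplyr (1.0.0 or later, where all_of() is available); the variable names idx, idx2, sepal_cols and petal_len are only illustrative.

```r
library(dplyr)

# grep() returns the integer positions of the matching column names
idx <- grep("Sepal", colnames(iris))

# which() applied to the logical vector from grepl() gives the same positions
idx2 <- which(grepl("Sepal", colnames(iris)))

# all_of() hands an external vector of positions (or names) to select()
sepal_cols <- select(iris, all_of(idx))

# With base [, drop = FALSE keeps a data frame even when only one column matches
petal_len <- iris[, grepl("Petal.Length", colnames(iris), fixed = TRUE), drop = FALSE]
```

Without drop = FALSE, the base R version would return a plain vector whenever exactly one column matches, which can surprise downstream code.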

Within the dplyr world, try:

select(iris,contains("Sepal"))

See the Selection section in ?select for numerous other helpers like starts_with, ends_with, etc.

You can try:

select(data, matches("search_string"))

It is more general than contains - you can use regex (e.g. "one_string|or_the_other").

For more examples, see: http://rpackages.ianhowson.com/cran/dplyr/man/select.html.
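
The difference between contains() and matches() can be illustrated with a short sketch on iris. This is only an example under the assumption that dplyr is installed; the result names are made up for illustration.

```r
library(dplyr)

# contains() treats its argument as a literal string:
# only names with an actual dot are kept (Species is dropped)
literal_dot <- select(iris, contains("."))

# matches() treats its argument as a regular expression:
# "." matches any character, so every column is kept
regex_dot <- select(iris, matches("."))

# Anchors and alternation work as in any regex
anchored <- select(iris, matches("^Sepal|Width$"))

# Both helpers ignore case by default (ignore.case = TRUE)
lower <- select(iris, contains("sepal"))
```

If the search string may contain regex metacharacters and should be matched literally, contains() is probably the safer choice.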

Building on Piotr Migdal's answer, here is an alternative that works with a vector of strings:

myVectorOfStrings <- c("foo", "bar")
matchExpression <- paste(myVectorOfStrings, collapse = "|")
# [1] "foo|bar"
df %>% select(matches(matchExpression))

This makes use of the regex OR operator (|).

ATTENTION: If you really have a plain vector of column names (and do not need the power of regular expressions), please see the comment below this answer, since it is the cleaner solution.
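
The comment mentioned above is not reproduced here, so the following is only a guess at what a cleaner solution for a plain vector of exact column names might look like. It assumes dplyr 1.0.0 or later, which provides all_of() and any_of(); the vector wanted and the name "not_a_column" are hypothetical.

```r
library(dplyr)

# Hypothetical vector of exact column names
wanted <- c("Sepal.Length", "Species")

# all_of() selects exactly these columns and errors if any is missing
exact <- iris %>% select(all_of(wanted))

# any_of() silently skips names that are not present in the data
lenient <- iris %>% select(any_of(c(wanted, "not_a_column")))
```

Unlike the paste(..., collapse = "|") approach, this does no pattern matching, so names containing regex metacharacters such as "." need no escaping.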

Alternatively, using a data frame with 22 columns:

library(plyr) # for the baseball dataset
library(dplyr)


baseball %>% colnames() %>% length()
[1] 22


baseball %>% colnames()
[1] "id"    "year"  "stint" "team"  "lg"    "g"     "ab"    "r"     "h"     "X2b"   "X3b"   "hr"    "rbi"
[14] "sb"    "cs"    "bb"    "so"    "ibb"   "hbp"   "sh"    "sf"    "gidp"

You can use starts_with("s") and ends_with("b"):

> baseball %>% select(starts_with("s")) %>% head(5)
stint sb so sh sf
4       1  6  1 NA NA
44      1  8  0 NA NA
68      1  2  0 NA NA
99      1  4  0 NA NA
102     1  3  0 NA NA


> baseball %>% select(ends_with("b")) %>% head(5)
ab X2b X3b sb bb ibb
4   120  11   3  6  2  NA
44  162   9   4  8  4  NA
68   89   3   1  2  2  NA
99  161   5   1  4  3  NA
102 128   3   7  3  1  NA


# contains("g") matches names that contain “g”.
> baseball %>% select(contains("g")) %>% head(5)
lg  g gidp
4      25   NA
44     32   NA
68     19   NA
99     33   NA
102    29   NA
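
The helpers shown above can also be combined. The sketch below uses iris rather than baseball so that it runs without plyr, and it assumes dplyr 1.0.0 or later, where selection helpers can be joined with &, | and !. The result names are illustrative only.

```r
library(dplyr)

# Columns that start with "Sepal" AND end with "Width"
sepal_width <- iris %>% select(starts_with("Sepal") & ends_with("Width"))

# Every column whose name does NOT contain "Sepal"
no_sepal <- iris %>% select(!contains("Sepal"))

# Columns that start with "Petal" OR end with "Width"
either <- iris %>% select(starts_with("Petal") | ends_with("Width"))
```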