grep using a character vector with multiple patterns


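I'd suggest writing a small script and running grep several times. I never found a way to search for multiple patterns in one go, and believe me, I looked!

Like so, with the string embedded in your shell file: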

#!/bin/bash
# Feed the string to grep on stdin; a bare string argument would be read as a file name.
grep 'A6' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
grep 'A7' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
grep 'A8' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"

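Then run it by typing myshell.sh (or ./myshell.sh if the script is not on your PATH).

If you want to be able to pass the string in on the command line, you can do it like this, with a shell argument (this is bash notation, by the way):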

#!/bin/bash
# The first command-line argument is the string to search.
stringtomatch="${1}"
grep 'A6' <<< "${stringtomatch}"
grep 'A7' <<< "${stringtomatch}"
grep 'A8' <<< "${stringtomatch}"

And so on.

If there are a lot of patterns to match, you can put the grep call in a for loop.


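In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector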

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"),
myfile$Letter, value=TRUE))
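To see what the pasted pattern looks like, here is a small self-contained sketch. The data frame is a stand-in for `myfile`, rebuilt from the example values shown in this thread, so treat its contents as an assumption.

```r
# Hypothetical stand-in for myfile, using the example values from this thread
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)

toMatch <- c("A1", "A9", "A6")

# paste() with collapse joins the vector into one alternation pattern
pattern <- paste(toMatch, collapse = "|")
print(pattern)

# value = TRUE returns the matching strings; unique() drops repeats
matches <- unique(grep(pattern, myfile$Letter, value = TRUE))
print(matches)
```

Keep in mind that each element of `toMatch` is interpreted as a regular expression, so elements containing regex metacharacters would need escaping first.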


Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)
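It may help to see what `match()` returns, since it behaves differently from `grep()`: it compares whole strings rather than regex patterns, and it reports only the first position of each value. A minimal sketch, with a plain vector `letters_col` standing in for `myfile$Letter` (an assumed name):

```r
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6")
wanted <- c("A1", "A9", "A6")

# match() gives the position of the FIRST exact match for each wanted value
first_pos <- match(wanted, letters_col)
print(first_pos)

# %in% flags every element that equals one of the wanted values
all_pos <- which(letters_col %in% wanted)
print(all_pos)
```

If every matching row is needed and the values are exact strings, `%in%` is probably the closer fit.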


Based on Brian Diggs's post, here are two useful functions for filtering lists:

#Returns all items in a list that are not contained in toMatch
#toMatch can be a single item or a list of items
exclude <- function(theList, toMatch) {
  return(setdiff(theList, include(theList, toMatch)))
}


#Returns all items in a list that ARE contained in toMatch
#toMatch can be a single item or a list of items
include <- function(theList, toMatch) {
  matches <- unique(grep(paste(toMatch, collapse="|"),
                         theList, value=TRUE))
  return(matches)
}
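A quick usage sketch for the two helpers, repeated here in condensed form so the snippet runs on its own. The vector `letters_col` is an assumed example input, not part of the original answer.

```r
include <- function(theList, toMatch) {
  unique(grep(paste(toMatch, collapse = "|"), theList, value = TRUE))
}
exclude <- function(theList, toMatch) {
  setdiff(theList, include(theList, toMatch))
}

# Hypothetical input vector
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6")

kept    <- include(letters_col, c("A1", "A9", "A6"))
dropped <- exclude(letters_col, c("A1", "A9", "A6"))
print(kept)
print(dropped)
```

Both helpers return unique values, so duplicates in the input are collapsed.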


Building on Brian Diggs's answer.

Here is another approach using grepl, which returns a data frame containing all of the matching rows.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

  Letter Firstname
1     A1      Alex
2     A6      Alex
4     A1       Bob
5     A9     Chris
6     A6     Chris

Maybe a bit cleaner... maybe?
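The row subsetting works because `grepl()` returns one TRUE/FALSE per element, whereas `grep()` returns the positions of the matches. A sketch of the difference, using a rebuilt `myfile` whose contents are assumed from the output shown above:

```r
# Hypothetical reconstruction of myfile
myfile <- data.frame(
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  Firstname = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  stringsAsFactors = FALSE
)
pattern <- "A1|A9|A6"

idx  <- grep(pattern, myfile$Letter)   # integer positions of the matches
keep <- grepl(pattern, myfile$Letter)  # logical vector, one entry per row
print(idx)
print(keep)

# Subsetting with the logical vector keeps the original row names
matches <- myfile[keep, ]
print(matches)
```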


Not sure whether this answer has already appeared...

For the particular patterns in the question, a single grep() call is enough:

grep("A[169]", myfile$Letter)
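The bracket expression `[169]` matches exactly one of the characters 1, 6 or 9, so `A[169]` is shorthand for `A1|A6|A9`. A small check on an assumed sample vector:

```r
# Hypothetical sample values
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6")

by_class <- grep("A[169]", letters_col)
by_alt   <- grep("A1|A6|A9", letters_col)
print(by_class)
print(identical(by_class, by_alt))
```

This shortcut only helps when the patterns differ in a single character; for arbitrary patterns, pasting them together with "|" is more general.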


Good answers, but don't forget about filter() from dplyr:

library(dplyr)

patterns <- c("A1", "A9", "A6")

> your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

> result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6


Using sapply

patterns <- c("A1", "A9", "A6")
df <- data.frame(name=c("A","Ale","Al","lex","x"), Letters=c("A1","A2","A9","A1","A9"))

df
  name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9

df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = FALSE)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9
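One thing worth knowing about the per-pattern approach: the row indices come back grouped by pattern rather than in row order, and a row matching more than one pattern appears more than once. The sketch below uses made-up data (this `df` is hypothetical) and swaps in `lapply()` for `sapply()`, so that the result is always a list before `unlist()` even when every pattern matches the same number of rows. It compares the result with a single combined pattern.

```r
patterns <- c("A9", "A1")
df <- data.frame(
  name    = c("A", "Ale", "Al", "lex"),
  Letters = c("A1", "A2", "A9", "A1,A9"),
  stringsAsFactors = FALSE
)

# One grep() per pattern: indices are concatenated in pattern order,
# so row 4 (which matches both patterns) shows up twice
per_pattern <- unlist(lapply(patterns, grep, df$Letters))
print(per_pattern)

# One combined pattern: each row appears once, in row order
combined <- grep(paste(patterns, collapse = "|"), df$Letters)
print(combined)
```

Wrapping the first result in `unique()` or `sort()` would be one way to bring the two into line, depending on which ordering is wanted.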


Take out the spaces and drop fixed=TRUE (with it, the "|" would be matched literally). So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))
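On `fixed = TRUE`, which an earlier answer warns against: with that argument the pattern is taken as a literal string, so `|` no longer means "or". A minimal sketch on an assumed sample vector:

```r
# Hypothetical sample values
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6")

# Literal search for the 8-character string "A1|A9|A6": nothing matches
literal <- grep("A1|A9|A6", letters_col, value = TRUE, fixed = TRUE)
print(literal)

# Regex search: "|" is alternation
regex <- unique(grep("A1|A9|A6", letters_col, value = TRUE))
print(regex)
```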


This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'


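Another option is to use a pattern like '\\b(A1|A9|A6)\\b'. The \\b is a regex word boundary, so each code is matched only as a whole word: a Letter value such as "A7,A1" still matches on A1, while a longer code such as "A10" would not. Here is a reproducible example of both options: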

df <- read.table(text="FirstName Letter
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6


df2 <- read.table(text="FirstName Letter
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
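To see where the word boundaries change the result, here is a short sketch. The vector `x` is made up for illustration and is not part of the reprex above.

```r
# Hypothetical values, chosen so that the two patterns disagree
x <- c("A1", "A10", "A7,A1", "BA6")

# Without boundaries, "A1" also matches inside "A10" and "A6" inside "BA6"
plain <- grepl("A1|A9|A6", x)
print(plain)

# With \\b, the match must start and end at a word boundary
bounded <- grepl("\\b(A1|A9|A6)\\b", x)
print(bounded)
```

The comma in "A7,A1" counts as a non-word character, which is why that value still matches with the bounded pattern.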

Created on 2022-07-16 by the reprex package (v2.0.1)

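Please note: in an ordinary R string the word boundary has to be written as \\b, because a single \b is read as a backspace character. With raw strings (available since R 4.0.0) you can write \b directly, for example r"(\b(A1|A9|A6)\b)".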
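For reference, a raw string (available since R 4.0.0) lets the same pattern be written with single backslashes. A quick check that both spellings produce the same pattern, assuming a recent enough R version:

```r
escaped  <- "\\b(A1|A9|A6)\\b"     # ordinary string, backslashes doubled
raw_form <- r"(\b(A1|A9|A6)\b)"    # raw string, backslashes written once

print(identical(escaped, raw_form))

result <- grepl(raw_form, c("A1", "A10"))
print(result)
```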