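Regex group capture in R with multiple capture groups

In R, is it possible to extract group captures from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the group captures.

I need to extract key-value pairs from strings that are encoded like this: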

\((.*?) :: (0\.[0-9]+)\)

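I can always do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I could do it all within R. Is there a function, or a package that provides such a function, to do this?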

gsub() does this. Using your example:

gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

You need to double the backslashes inside the quoted string, so that the regex engine receives a single backslash for each escape.

Hope this helps.
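
Because sub() is vectorized, the same backreference idea can put each group into its own vector instead of one combined string. This is a minimal sketch, assuming every input string matches the pattern; the sample vector and the names keys and values are made up for illustration.

```r
# Sample input (illustrative only)
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# Replace each whole match with a single backreference
keys   <- sub(pattern, "\\1", x)
values <- as.numeric(sub(pattern, "\\2", x))

result <- data.frame(key = keys, value = values, stringsAsFactors = FALSE)
```

Strings that do not match are returned unchanged by sub(), so it is worth checking with grepl() first if the input is not guaranteed to be well formed.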

This is how I ended up working around the problem. I used two separate regexes, one for each capture group, ran a gregexpr call for each, and then pulled out the matched substrings:

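# `str` is the input string; this value is only an example
str <- "(sometext :: 0.1231313213)(moretext :: 0.111222)"

# Lookarounds make each pattern match only the text of one group
regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

# Lookarounds need the PCRE engine, hence perl=TRUE
match.string <- gregexpr(regex.string, str, perl=TRUE)[[1]]
match.number <- gregexpr(regex.number, str, perl=TRUE)[[1]]

# Cut out each match from its start position and length
strings <- mapply(function (start, len) substr(str, start, start+len-1),
                  match.string,
                  attr(match.string, "match.length"))
numbers <- mapply(function (start, len) as.numeric(substr(str, start, start+len-1)),
                  match.number,
                  attr(match.number, "match.length"))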
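
The substr() bookkeeping in this workaround can probably be shortened with base R's regmatches(), which extracts the matched text from a gregexpr() result. A sketch using the same two lookaround patterns; the value of str is an assumed example.

```r
# Example input; the value of `str` is an assumption for illustration
str <- "(sometext :: 0.1231313213)(moretext :: 0.111222)"

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

# regmatches() returns a list with one element per input string
strings <- regmatches(str, gregexpr(regex.string, str, perl = TRUE))[[1]]
numbers <- as.numeric(regmatches(str, gregexpr(regex.number, str, perl = TRUE))[[1]])
```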

gsub() can do this and return only the capture groups.

However, for this to work, the pattern must also match the text outside the capture groups, because, as the gsub() help notes:

(...) elements of character vectors 'x' which are not substituted will be returned unchanged.

So if the text you want lies in the middle of a longer string, adding .* before and after the capture groups lets you return only the captured text:

gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
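
To make the effect of the surrounding .* visible, here is a small sketch with text around the pair and with one string that does not match at all. The inputs are invented, and perl = TRUE is my own choice here so that greedy and lazy quantifiers follow PCRE rules; the answer above uses the default engine.

```r
# Illustrative inputs: one with surrounding text, one with no match
x <- c("prefix (sometext :: 0.1231313213) suffix", "no match here")
pattern <- ".*\\((.*?) :: (0\\.[0-9]+)\\).*"

out <- gsub(pattern, "\\1 \\2", x, perl = TRUE)

# Non-matching elements come back unchanged, so mask them explicitly
matched <- grepl(pattern, x, perl = TRUE)
out_clean <- ifelse(matched, out, NA_character_)
```

Here out keeps "no match here" as is, while out_clean replaces it with NA.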

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):

> library(stringr)
> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"

Try regmatches() and regexec():

regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"
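
Since regexec() is vectorized, the resulting list can be bound into a matrix much like the one str_match() returns. A minimal base R sketch, assuming every element matches (non-matching elements yield character(0) and would be dropped by rbind); the column names key and value are illustrative.

```r
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# One character vector per input: whole match, then each group
m <- regmatches(s, regexec(pattern, s))

# Stack into a matrix: column 1 is the whole match, columns 2+ are the groups
mat <- do.call(rbind, m)

result <- data.frame(key = mat[, 2], value = as.numeric(mat[, 3]),
                     stringsAsFactors = FALSE)
```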

I like perl compatible regular expressions. Probably someone else does too...

Here is a function that does perl compatible regular expressions and matches the functionality of functions in other languages that I am used to:

regexpr_perl <- function(expr, str) {
  # Find the first match using the PCRE engine
  match <- regexpr(expr, str, perl=TRUE)
  matches <- character(0)
  # match.length is -1 when there is no match
  if (attr(match, 'match.length') >= 0) {
    # Start position and length of each capture group
    capture_start <- attr(match, 'capture.start')
    capture_length <- attr(match, 'capture.length')
    total_matches <- 1 + length(capture_start)
    matches <- character(total_matches)
    # Element 1 is the whole match; the capture groups follow
    matches[1] <- substr(str, match, match + attr(match, 'match.length') - 1)
    for (i in seq_along(capture_start)) {
      matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
    }
  }
  matches
}
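
The attributes that the function above reads can also be used directly, and with named groups the result can be labelled. A short sketch for a single input string; the group names key and value are my own choice for illustration.

```r
x <- "(sometext :: 0.1231313213)"

# Named groups use the PCRE syntax (?<name>...)
m <- regexpr("\\((?<key>.*?) :: (?<value>0\\.[0-9]+)\\)", x, perl = TRUE)

starts  <- as.vector(attr(m, "capture.start"))
lengths <- as.vector(attr(m, "capture.length"))

# substring() is vectorized over the start and end positions
groups <- substring(x, starts, starts + lengths - 1)
names(groups) <- attr(m, "capture.names")
```

After this, groups[["key"]] and groups[["value"]] hold the two captured strings.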

As documented in the stringr package, this can be achieved using either str_match() or str_extract().

Adapted from the manual:

library(stringr)


strings <- c(" 219 733 8965", "329-293-8753 ", "banana",
"239 923 8115 and 842 566 4692",
"Work: 579-499-7527", "$1000",
"Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting and combining our groups:

str_extract_all(strings, phone, simplify = TRUE)
#      [,1]           [,2]
# [1,] "219 733 8965" ""
# [2,] "329-293-8753" ""
# [3,] ""             ""
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""
# [6,] ""             ""
# [7,] "543.355.3679" ""

Indicating groups with an output matrix (we're interested in columns 2+):

str_match_all(strings, phone)
# [[1]]
#      [,1]           [,2]  [,3]  [,4]
# [1,] "219 733 8965" "219" "733" "8965"
#
# [[2]]
#      [,1]           [,2]  [,3]  [,4]
# [1,] "329-293-8753" "329" "293" "8753"
#
# [[3]]
#      [,1] [,2] [,3] [,4]
#
# [[4]]
#      [,1]           [,2]  [,3]  [,4]
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
#
# [[5]]
#      [,1]           [,2]  [,3]  [,4]
# [1,] "579-499-7527" "579" "499" "7527"
#
# [[6]]
#      [,1] [,2] [,3] [,4]
#
# [[7]]
#      [,1]           [,2]  [,3]  [,4]
# [1,] "543.355.3679" "543" "355" "3679"
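
For comparison, recent versions of base R can produce a similar structure without stringr. This sketch assumes R 4.1.0 or later, where gregexec() is available; the two input strings are taken from the example above.

```r
strings <- c("239 923 8115 and 842 566 4692", "Work: 579-499-7527")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# One matrix per input string: rows are whole match and groups, columns are matches
all_matches <- regmatches(strings, gregexec(phone, strings))

# Transpose so that each row is one match, as in str_match_all()
by_row <- lapply(all_matches, t)
```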

A solution using strcapture() from the utils package:

x <- c("key1 :: 0.01",
"key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
x = x,
proto = list(key = character(), value = double()))
#>    key value
#> 1 key1  0.01
#> 2 key2  0.02
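
One thing worth knowing about strcapture() is what happens to inputs that do not match. The sketch below uses a data frame prototype and one deliberately malformed string (both invented for illustration); the non-matching row should come back as NA in every column.

```r
x <- c("key1 :: 0.01", "not a pair")

# The prototype sets the column names and types of the result
res <- strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
                  x = x,
                  proto = data.frame(key = character(), value = numeric()))

# Keep only the rows where the pattern matched
ok <- res[!is.na(res$key), ]
```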

This can be done using the package unglue, taking the example from the selected answer:

# install.packages("unglue")
library(unglue)


s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#>          x            y
#> 1 sometext 0.1231313213
#> 2 moretext     0.111222

Or, starting from a data frame:

df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})",remove = FALSE)
#>                          col        x            y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2     (moretext :: 0.111222) moretext     0.111222

You can get the raw regex from the unglue pattern, optionally with named capture groups:

unglue_regex("({x} :: {y})")
#>             ({x} :: {y})
#> "^\\((.*?) :: (.*?)\\)$"


unglue_regex("({x} :: {y})",named_capture = TRUE)
#>                     ({x} :: {y})
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md