How to get capturing group functionality in Go regular expressions

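I'm porting a library from Ruby to Go, and have just discovered that regular expressions in Ruby are not compatible with Go (Google RE2). It has come to my attention that Ruby and Java (plus other languages) use PCRE regular expressions (Perl compatible, which supports capturing groups), so I need to rewrite my expressions so that they compile in Go.

For example, I have the following regex: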

`(?<Year>\d{4})-(?<Month>\d{2})-(?<Day>\d{2})`

This should accept input such as:

2001-01-20

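Capture groups allow the year, month and day to be captured into variables. Getting the value of each group is very easy: you just index into the returned match data with the group name and you get the value back. So, for example, to get the year, something like this pseudocode: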

m=expression.Match("2001-01-20")
year = m["Year"]

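This is a pattern I use a lot in my expressions, so I have a lot of rewriting to do.

So, is there a way to get this kind of functionality in Go regexp, and how should I rewrite these expressions?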


how should I re-write these expressions?

Add some Ps, as defined here:

(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})

Cross reference capture group names with re.SubexpNames().

And use as follows:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
	fmt.Printf("%#v\n", r.FindStringSubmatch(`2015-05-27`))
	fmt.Printf("%#v\n", r.SubexpNames())
}
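If the goal is to build a string from the named groups rather than to read them one by one, the standard library's Regexp.ExpandString accepts a template that refers to groups by name. The following is a minimal sketch under that assumption; reformatDate and dateRe are hypothetical names chosen for illustration, not part of the answer above.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe uses the named groups from the answer above.
var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// reformatDate is a hypothetical helper: it rewrites the first date found
// in s as Day/Month/Year, referring to the groups by name in the template.
// It returns "" when s contains no date.
func reformatDate(s string) string {
	// loc holds the start/end index pairs of the match and of each group.
	loc := dateRe.FindStringSubmatchIndex(s)
	if loc == nil {
		return ""
	}
	return string(dateRe.ExpandString(nil, "$Day/$Month/$Year", s, loc))
}

func main() {
	fmt.Println(reformatDate("2015-05-27")) // 27/05/2015
}
```

In the template, a name such as $Day ends at the first character that is not a letter, digit, or underscore, so the slashes separate the names cleanly.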

If you need to replace based on a function while capturing groups you can use this:

import "regexp"

func ReplaceAllGroupFunc(re *regexp.Regexp, str string, repl func([]string) string) string {
	result := ""
	lastIndex := 0

	for _, v := range re.FindAllStringSubmatchIndex(str, -1) {
		groups := []string{}
		for i := 0; i < len(v); i += 2 {
			if v[i] < 0 {
				// A group that did not participate in the match has index -1.
				groups = append(groups, "")
				continue
			}
			groups = append(groups, str[v[i]:v[i+1]])
		}

		result += str[lastIndex:v[0]] + repl(groups)
		lastIndex = v[1]
	}

	return result + str[lastIndex:]
}

Example:

str := "abc foo:bar def baz:qux ghi"
re := regexp.MustCompile("([a-z]+):([a-z]+)")
result := ReplaceAllGroupFunc(re, str, func(groups []string) string {
	return groups[1] + "." + groups[2]
})
fmt.Printf("'%s'\n", result)

https://gist.github.com/elliotchance/d419395aa776d632d897
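For a replacement as simple as the one in this example, a callback may not be needed at all: the standard ReplaceAllString accepts a template in which ${name} or ${1} refers to a group. The sketch below assumes named groups key and val, and a helper called dotPairs; all three names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// pairRe names the two groups so the template can refer to them by name.
var pairRe = regexp.MustCompile(`(?P<key>[a-z]+):(?P<val>[a-z]+)`)

// dotPairs is a hypothetical helper that rewrites every key:val as key.val.
func dotPairs(s string) string {
	// ${key} and ${val} are expanded for each match; the braces keep a
	// name from running into the characters that follow it.
	return pairRe.ReplaceAllString(s, "${key}.${val}")
}

func main() {
	fmt.Printf("'%s'\n", dotPairs("abc foo:bar def baz:qux ghi"))
	// 'abc foo.bar def baz.qux ghi'
}
```

A function such as ReplaceAllGroupFunc is still the better fit when the replacement has to be computed from the group values.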

I had created a function for handling url expressions but it suits your needs too. You can check this snippet but it simply works like this:

// getParams parses url with the given regular expression and returns
// the values of the named groups defined in the expression.
func getParams(regEx, url string) (paramsMap map[string]string) {
	compRegEx := regexp.MustCompile(regEx)
	match := compRegEx.FindStringSubmatch(url)

	paramsMap = make(map[string]string)
	for i, name := range compRegEx.SubexpNames() {
		// Index 0 is the whole match; match is nil when url does not match.
		if i > 0 && i < len(match) {
			paramsMap[name] = match[i]
		}
	}
	return paramsMap
}

You can use this function like:

params := getParams(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`, `2015-05-27`)
fmt.Println(params)

and the output will be:

map[Day:27 Month:05 Year:2015]

(Since Go 1.12, fmt prints maps in sorted key order.)

To reduce RAM and CPU usage, avoid calling anonymous functions inside the loop and avoid copying slices in memory inside the loop with the "append" function. See the next example:

You can capture more than one subgroup from multiline text without concatenating strings with '+' and without nesting one for loop inside another (as other examples posted here do).

txt := `2001-01-20
2009-03-22
2018-02-25
2018-06-07`

regex := regexp.MustCompile(`(?s)(\d{4})-(\d{2})-(\d{2})`)
res := regex.FindAllStringSubmatch(txt, -1)
for i := range res {
	// like Java: match.group(1), match.group(2), etc.
	fmt.Printf("year: %s, month: %s, day: %s\n", res[i][1], res[i][2], res[i][3])
}

Output:

year: 2001, month: 01, day: 20
year: 2009, month: 03, day: 22
year: 2018, month: 02, day: 25
year: 2018, month: 06, day: 07

Note: res[i][0] corresponds to match.group(0) in Java (the whole match).

If you want to store this information use a struct type:

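type date struct {
	y, m, d int
}
...
func main() {
	...
	dates := make([]date, len(res))
	for index := range res {
		// The pattern only matches digits, so Atoi errors are ignored here.
		y, _ := strconv.Atoi(res[index][1])
		m, _ := strconv.Atoi(res[index][2])
		d, _ := strconv.Atoi(res[index][3])
		dates[index] = date{y: y, m: m, d: d}
	}
}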
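A complete, runnable version of the struct idea might look like the following sketch. The helper name parseDates and the package-level dateRe are assumptions made for illustration; the submatches are strings, so they are converted with strconv.Atoi before being stored in the int fields.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

type date struct {
	y, m, d int
}

// dateRe uses unnamed groups, as in the example above.
var dateRe = regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)

// parseDates is a hypothetical helper that collects every date found in txt.
func parseDates(txt string) ([]date, error) {
	res := dateRe.FindAllStringSubmatch(txt, -1)
	// The number of matches is known, so the capacity is reserved up front
	// and append never has to grow the slice.
	dates := make([]date, 0, len(res))
	for _, m := range res {
		var nums [3]int
		for i := range nums {
			n, err := strconv.Atoi(m[i+1]) // m[0] is the whole match
			if err != nil {
				return nil, err
			}
			nums[i] = n
		}
		dates = append(dates, date{y: nums[0], m: nums[1], d: nums[2]})
	}
	return dates, nil
}

func main() {
	dates, err := parseDates("2001-01-20\n2009-03-22")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", dates) // [{y:2001 m:1 d:20} {y:2009 m:3 d:22}]
}
```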

It's better to use anonymous groups (performance improvement)

Using the "ReplaceAllGroupFunc" posted on GitHub is a bad idea because:

  1. it uses a loop inside a loop
  2. it calls an anonymous function inside the loop
  3. it needs a lot of code
  4. it calls the "append" function inside the loop, which can be costly: whenever the slice's capacity is exceeded, append copies the backing array to a new memory location

A simple way to determine group names, based on @VasileM's answer.

Disclaimer: this is not about memory/CPU/time optimization.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`^(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})$`)

	res := r.FindStringSubmatch(`2015-05-27`)
	names := r.SubexpNames()
	for i := range res {
		if i != 0 {
			fmt.Println(names[i], res[i])
		}
	}
}

https://play.golang.org/p/Y9cIVhMa2pU

You can use the regroup library for that: https://github.com/oriser/regroup

Example:

package main

import (
	"fmt"
	"github.com/oriser/regroup"
)

func main() {
	r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
	matches, err := r.Groups("2015-05-27")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", matches)
}

Will print: map[Day:27 Month:05 Year:2015]

Alternatively, you can use it like this:

package main

import (
	"fmt"
	"github.com/oriser/regroup"
)

type Date struct {
	Year  int `regroup:"Year"`
	Month int `regroup:"Month"`
	Day   int `regroup:"Day"`
}

func main() {
	date := &Date{}
	r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
	if err := r.MatchToTarget("2015-05-27", date); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", date)
}

Will print: &{Year:2015 Month:5 Day:27}

As of Go 1.15, you can simplify the process by using Regexp.SubexpIndex. You can check the release notes at https://golang.org/doc/go1.15#regexp.

Based on your example, you'd have something like the following:

re := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
matches := re.FindStringSubmatch("Some random date: 2001-01-20")
yearIndex := re.SubexpIndex("Year")
fmt.Println(matches[yearIndex])

You can check and execute this example at https://play.golang.org/p/ImJ7i_ZQ3Hu.
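Two edge cases are worth guarding against when using this approach: SubexpIndex returns -1 when the pattern has no group with the given name, and FindStringSubmatch returns nil when the input does not match. A minimal sketch of a wrapper that handles both follows; groupByName and dateRe are hypothetical names, and the code assumes Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// groupByName is a hypothetical helper that returns the text captured by
// the named group, and false if the name is unknown or s does not match.
func groupByName(re *regexp.Regexp, s, name string) (string, bool) {
	idx := re.SubexpIndex(name) // -1 if the pattern has no group with that name
	if idx < 0 {
		return "", false
	}
	matches := re.FindStringSubmatch(s) // nil if s does not match
	if matches == nil {
		return "", false
	}
	return matches[idx], true
}

func main() {
	year, ok := groupByName(dateRe, "Some random date: 2001-01-20", "Year")
	fmt.Println(year, ok) // 2001 true
}
```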

A function for getting regexp parameters with nil checking. It returns a nil map (printed as map[]) if the string does not match.

// GetRxParams - Get all regexp params from string with provided regular expression
func GetRxParams(rx *regexp.Regexp, str string) (pm map[string]string) {
	if !rx.MatchString(str) {
		return nil
	}
	p := rx.FindStringSubmatch(str)
	n := rx.SubexpNames()
	pm = map[string]string{}
	for i := range n {
		if i == 0 {
			continue
		}

		if n[i] != "" && p[i] != "" {
			pm[n[i]] = p[i]
		}
	}
	return
}