如何在正则表达式中编号嵌套捕获组?

对于正则表达式应该如何处理嵌套括号的捕获行为,是否有定义的行为?更具体地说,是否可以合理地期望不同的引擎在第一个位置捕获外括号,在后续位置捕获嵌套括号?

考虑下面的 PHP 代码(使用 PCRE 正则表达式)

<?php
$test_string = 'I want to test sub patterns';
preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
print_r($matches);
?>


Array
(
[0] => I want to test sub patterns  //entire pattern
[1] => I want to test           //entire outer parenthesis
[2] => want             //first inner
[3] => to               //second inner
[4] => patterns             //next parentheses set
)

首先捕获整个括号化的表达式(我想测试一下) ,然后捕获内部括号化的模式(“ want”和“ to”)。这在逻辑上是有意义的,但是我可以看到一个同样合乎逻辑的例子: 首先捕获子括号,然后捕获整个模式。

那么,这种“首先捕获整个事物”的行为是在正则表达式引擎中定义的,还是取决于模式的上下文和/或引擎的行为(PCRE 不同于 C # ,Java 不同于等等) ?

63279 次浏览

The order of capturing in the order of the left paren is standard across all the platforms I've worked in. (perl, php, ruby, egrep)

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Caveat: Excluding non-capture group opening parenthesis (?=)

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain number for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:

Regex.Replace(@"one two three four",
@"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
@"$1 $2 $3 $4")


// result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

^[+-]?[0-9]{1,3}
(?:
(?:(?<thousand>\,)[0-9]{3})*
(?:(?<decimal>\.)[0-9]{2})?
|
(?:(?<thousand>\.)[0-9]{3})*
(?:(?<decimal>\,)[0-9]{2})?
|
[0-9]*
(?:(?<decimal>[\.\,])[0-9]{2})?
)$

If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D

Yeah, this is all pretty much well defined for all the languages you're interested in:

  • Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
    "Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
  • .Net - http://msdn.microsoft.com/en-us/library/bs2twtah(VS.71).aspx
    "Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern.")
  • PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
    "\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)
  • PCRE - http://www.pcre.org/pcre.txt
    To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:

    The  first  pair  of  integers, ovector[0] and ovector[1], identify the
    portion of the subject string matched by the entire pattern.  The next
    pair  is  used for the first capturing subpattern, and so on. The value
    returned by pcre_exec() is one more than the highest numbered pair that
    has  been  set.  For example, if two substrings have been captured, the
    returned value is 3. If there are no capturing subpatterns, the  return
    value from a successful match is 1, indicating that just the first pair
    of offsets has been set.
    
  • Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
    $1, $2 etc. match capturing groups as you'd expect (i.e. by occurrence of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.

You'll more than likely find similar results for other languages (Python, Ruby, and others).

You say that it's equally logical to list the inner capture groups first and you're right - it's just be a matter of indexing on closing, rather than opening, parens. (if I understand you correctly). Doing this is less natural though (for example it doesn't follow reading direction convention) and so makes it more difficult (probably not significantly) to determine, by insepection, which capturing group will be at a given result index.

Putting the entire match string being in position 0 also makes sense - mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number capturing groups from regex to regex and regardless of the number of capturing groups that actually match anything (Java for example will collapse the length of the matched groups array for each capturing group does not match any content (think for example something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages to Perl which dynamically create variables ($1, $2 etc.) (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :).