.NET regex: what is the word character \w?

Simple question:
What is the pattern for the word character \w in C#, .NET?

My first thought was that it matches [A-Za-z0-9_], and the documentation tells me:

Character class    Description                    Pattern    Matches
\w                 Matches any word character.    \w         "I", "D", "A", "1", "3" in "ID A1.3"

That is not very helpful.
And \w seems to match äöü as well. What else does it match? Is there a better (exact) definition?


Basically it matches everything that can be considered the intuitive definition of letter in various scripts – plus the underscore and a few other oddballs.

You can find a complete list (at least for the BMP) with the following tiny PowerShell snippet:

0..65535 | ?{([char]$_) -match '\w'} | %{ "$_`: " + [char]$_ }
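For anyone who prefers to stay in C#, here is a rough equivalent of the PowerShell one-liner above. It is only a sketch: the class and method names (WordCharScan, FindWordChars) are made up for illustration, and like the PowerShell version it only covers the BMP, one UTF-16 code unit at a time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordCharScan
{
    // Returns every UTF-16 code unit in U+0000..U+FFFF that \w matches
    // with the default (non-ECMAScript) regex options.
    public static List<char> FindWordChars()
    {
        var word = new Regex(@"\w");
        var result = new List<char>();

        // Use an int counter: a char counter would overflow at U+FFFF.
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (word.IsMatch(c.ToString()))
            {
                result.Add(c);
            }
        }

        return result;
    }
}
```

Calling WordCharScan.FindWordChars().Count gives the total for whatever runtime you are on; the exact number can differ between .NET versions because the underlying Unicode data changes.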

From the documentation:

Word Character: \w

\w matches any word character. A word character is a member of any of the Unicode categories listed in the following table.

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    • This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.

If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
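To see the difference between the two modes, a small sketch like the following can help. RegexOptions.ECMAScript is the real option; the class and method names (WordCharModes, MatchesDefault, MatchesEcmaScript) are just illustrative.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper comparing the two behaviors of \w.
public static class WordCharModes
{
    // Default behavior: \w is Unicode-aware.
    public static bool MatchesDefault(char c) =>
        Regex.IsMatch(c.ToString(), @"\w");

    // ECMAScript-compliant behavior: \w is restricted to [a-zA-Z_0-9].
    public static bool MatchesEcmaScript(char c) =>
        Regex.IsMatch(c.ToString(), @"\w", RegexOptions.ECMAScript);
}
```

With this, WordCharModes.MatchesDefault('ä') is true while WordCharModes.MatchesEcmaScript('ä') is false; plain ASCII letters, digits and the underscore match in both modes.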


After some research, it turns out that using '\w' in .NET is equivalent to the following:

    using System.Collections.Generic;
    using System.Globalization;

    public static class Extensions
    {
        /// <summary>
        /// The Unicode categories that '\w' matches.
        /// </summary>
        private static readonly HashSet<UnicodeCategory> _wordCategories = new HashSet<UnicodeCategory>(
            new[]
            {
                UnicodeCategory.DecimalDigitNumber,
                UnicodeCategory.UppercaseLetter,
                UnicodeCategory.ConnectorPunctuation,
                UnicodeCategory.LowercaseLetter,
                UnicodeCategory.OtherLetter,
                UnicodeCategory.TitlecaseLetter,
                UnicodeCategory.ModifierLetter,
                UnicodeCategory.NonSpacingMark, // not listed in the quoted documentation, but matched by '\w'
            });

        /// <summary>
        /// Determines whether the specified character is a word character (equivalent to '\w').
        /// </summary>
        /// <param name="c">The character to test.</param>
        public static bool IsWord(this char c) => _wordCategories.Contains(char.GetUnicodeCategory(c));
    }

I've written this as an extension method so that it is easy to use on any character c: just invoke c.IsWord(), which returns true if the character is a word character. This should be significantly quicker than using a Regex.

Interestingly, this doesn't appear to match the .NET specification: '\w' in fact matches 938 'NonSpacingMark' characters, which are not mentioned there.

In total this matches 49,760 of the 65,535 characters, so the simple regexes often shown on the web are incomplete.
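If you want to check the per-category numbers on your own runtime, something along these lines should work. It is a sketch under the same assumptions as the rest of this answer (BMP only, default regex options), and the names WordCharCategories and CountByCategory are invented for the example. The counts you get may not be the ones quoted above, since they depend on the Unicode data shipped with your .NET version.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordCharCategories
{
    // Counts, per Unicode category, how many UTF-16 code units
    // in U+0000..U+FFFF are matched by \w.
    public static SortedDictionary<UnicodeCategory, int> CountByCategory()
    {
        var word = new Regex(@"\w");
        var counts = new SortedDictionary<UnicodeCategory, int>();

        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (!word.IsMatch(c.ToString()))
            {
                continue;
            }

            UnicodeCategory category = char.GetUnicodeCategory(c);
            counts.TryGetValue(category, out int seen); // seen is 0 if the key is missing
            counts[category] = seen + 1;
        }

        return counts;
    }
}
```

Printing each key and value of the returned dictionary shows which categories '\w' actually touches, so it is easy to see whether NonSpacingMark shows up.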