Regex for matching whitespace

The Java API for regular expressions states that \s will match whitespace.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

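The aim of this is to replace all instances of two consecutive whitespace characters with a single space. However, this does not actually work.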

Am I having a grave misunderstanding of regexes or the term "whitespace"?


Works for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

This prints:

" a  b   c"

I think you meant to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = "";
if (matcher.find()) {
result = matcher.replaceAll(" ");
}


System.out.println(result);

Yes, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
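To illustrate the point above: Java strings are immutable, so matcher.replaceAll() returns a new string instead of modifying the input, and it already replaces every match, so no find() loop is needed. A minimal sketch follows; the class name CollapsePairs and the method replacePairs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CollapsePairs {
    // Same pattern as in the question: exactly two whitespace characters.
    private static final Pattern TWO_WS = Pattern.compile("\\s\\s");

    // Returns a new string; the input itself is never modified.
    static String replacePairs(String input) {
        Matcher matcher = TWO_WS.matcher(input);
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "  a   b      c";
        System.out.println("\"" + replacePairs(s) + "\"");
    }
}
```

Note that each pair is replaced only once per pass, so longer runs of whitespace shrink but do not collapse to a single space.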

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

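Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).

White space is a pretty stable property, and those same code points have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this: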

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
+ "\\u0009" // CHARACTER TABULATION
+ "\\u000A" // LINE FEED (LF)
+ "\\u000B" // LINE TABULATION
+ "\\u000C" // FORM FEED (FF)
+ "\\u000D" // CARRIAGE RETURN (CR)
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL)
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000" // IDEOGRAPHIC SPACE
;
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can just use whitespace_charclass + "+" as the pattern in your replaceAll.
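As a rough sketch of how such a hand-built character class can be plugged into replaceAll, the example below uses an abbreviated subset of the list (five code points only, chosen for illustration). The class name UnicodeWhitespaceSubset and the method collapse are assumptions, not part of the original answer.

```java
class UnicodeWhitespaceSubset {
    // Abbreviated subset of the full list, for illustration only.
    static final String WHITESPACE_CHARS = ""
            + "\\u0009" // CHARACTER TABULATION
            + "\\u0020" // SPACE
            + "\\u00A0" // NO-BREAK SPACE
            + "\\u2003" // EM SPACE
            + "\\u3000" // IDEOGRAPHIC SPACE
            ;

    static final String WHITESPACE_CHARCLASS = "[" + WHITESPACE_CHARS + "]";

    // Replaces each run of one or more listed characters with a single space.
    static String collapse(String s) {
        return s.replaceAll(WHITESPACE_CHARCLASS + "+", " ");
    }

    public static void main(String[] args) {
        String s = "a\u00A0\u3000 b\tc";
        System.out.println("\"" + collapse(s) + "\"");
    }
}
```

The doubled backslashes matter: the regex engine, not the Java compiler, interprets each \uXXXX escape inside the character class.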


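Sorry about all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

Spaces in REs are a pain, but I believe that they work. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use an RE (uncomment the println() to see how the matcher breaks up the string), here is some sample code: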

import java.util.regex.*;


public class Two21WS {
private String  str = "";
private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces


public Two21WS (String s) {
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher (s);
int startNext = 0;
while (matcher.find (startNext)) {
if (startNext == 0)
sb.append (s.substring (0, matcher.start()));
else
sb.append (s.substring (startNext, matcher.start()));
sb.append (" ");
startNext = matcher.end();
//System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
//                      ", sb: \"" + sb.toString() + "\"");
}
sb.append (s.substring (startNext));
str = sb.toString();
}


public String toString () {
return str;
}


public static void main (String[] args) {
String tester = " a    b      cdef     gh  ij   kl";
System.out.println ("Initial: \"" + tester + "\"");
System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
}}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"
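For comparison, the split() route mentioned above could look roughly like the sketch below. The class name SplitJoin and the method collapse are hypothetical; the limit of -1 is there so that a trailing run of whitespace is not silently dropped.

```java
class SplitJoin {
    // Splits on runs of two or more whitespace characters and rejoins
    // the pieces with a single space.
    static String collapse(String s) {
        // A negative limit keeps trailing empty strings, so a trailing
        // run of whitespace still turns into one space.
        return String.join(" ", s.split("\\s{2,}", -1));
    }

    public static void main(String[] args) {
        String tester = " a    b      cdef     gh  ij   kl";
        System.out.println("Initial:   \"" + tester + "\"");
        System.out.println("SplitJoin: \"" + collapse(tester) + "\"");
    }
}
```

A lone whitespace character (such as the leading space in the test string) does not match \s{2,} and is therefore left as it is.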

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);


boolean flag = true;
while(flag)
{
//Update your original search text with the result of the replace
modLine = matcher.replaceAll(" ");
//reset matcher to look at this "new" text
matcher = whitespace.matcher(modLine);
//search again ... and if no match , set flag to false to exit, else run again
if(!matcher.find())
{
flag = false;
}
}

For Java (not PHP, not JavaScript, not anything else):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")

When I sent a question to a RegexBuddy (regex developer application) forum, I got a more exact reply to my Java question:

“Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

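... \\s\\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in this question.”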
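A small check of the classes discussed in the quote, using a no-break space (U+00A0) and a tab as probes. The class name WhitespaceClasses and the helper matchesChar are made up for illustration. As far as I can tell from the Character.isSpaceChar documentation, \p{javaSpaceChar} covers only the separator categories, so it may differ from [\s\p{Z}] on control characters such as tab.

```java
class WhitespaceClasses {
    // True if the whole one-character string matches the given class.
    static boolean matchesChar(String regex, String ch) {
        return ch.matches(regex);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE
        String tab = "\t";

        // Default \s is ASCII-only, so it misses the no-break space.
        System.out.println("\\s vs NBSP:               " + matchesChar("\\s", nbsp));
        System.out.println("[\\s\\p{Z}] vs NBSP:        " + matchesChar("[\\s\\p{Z}]", nbsp));
        System.out.println("\\p{javaSpaceChar} vs NBSP: " + matchesChar("\\p{javaSpaceChar}", nbsp));

        // Tab is a control character, not a separator.
        System.out.println("[\\s\\p{Z}] vs tab:         " + matchesChar("[\\s\\p{Z}]", tab));
        System.out.println("\\p{javaSpaceChar} vs tab:  " + matchesChar("\\p{javaSpaceChar}", tab));
    }
}
```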

You can use the following snippet:

import org.apache.commons.lang3.StringUtils;


StringUtils.normalizeSpace(string);

This normalizes runs of whitespace to a single space and also strips leading and trailing whitespace.
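If pulling in Commons Lang is not an option, a plain-JDK approximation of the same idea might look like the sketch below. This is not the library's implementation, and its handling of non-ASCII whitespace may differ; the class name NormalizeSketch and the method normalize are hypothetical.

```java
class NormalizeSketch {
    // Trims both ends, then collapses each run of whitespace to one space.
    // Only an approximation of StringUtils.normalizeSpace.
    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("\"" + normalize("  Hello    world!  ") + "\"");
    }
}
```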

String sampleString = "Hello    world!";
sampleString.replaceAll("\\s{2}", " "); // replaces exactly two consecutive spaces
sampleString.replaceAll("\\s{2,}", " "); // replaces two or more consecutive white spaces
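Since replaceAll returns a new string, the result has to be assigned or used. The sketch below shows the difference between the two quantifiers on an input with four consecutive spaces; the class and method names are made up for illustration.

```java
class QuantifierDemo {
    // Replaces each block of exactly two whitespace characters.
    static String exactlyTwo(String s) {
        return s.replaceAll("\\s{2}", " ");
    }

    // Replaces each run of two or more whitespace characters.
    static String twoOrMore(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        String sample = "Hello    world!"; // four spaces between the words
        System.out.println("\"" + exactlyTwo(sample) + "\""); // two spaces remain
        System.out.println("\"" + twoOrMore(sample) + "\"");  // one space remains
    }
}
```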

Java has evolved since this issue was first brought up. You can match all manner of unicode space characters by using the \p{Zs} group.

So if you want to replace one or more exotic spaces with a plain space, you can do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

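Also worth knowing: if you already use the trim() string function, take a look at the (relatively new) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of odd whitespace characters. For more information on which characters count as whitespace, see Java's Character.isWhitespace() function.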
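A short comparison of trim() with the strip() family, assuming Java 11 or later. The input is wrapped in EM SPACE (U+2003) characters, which trim() ignores because it only removes characters up to U+0020. The class name StripDemo and the helper trimVariants are hypothetical.

```java
class StripDemo {
    // Collects the results of the four trimming calls for comparison.
    static String[] trimVariants(String s) {
        return new String[] {
            s.trim(),          // removes only characters <= U+0020
            s.strip(),         // uses Character.isWhitespace on both ends
            s.stripLeading(),  // leading side only
            s.stripTrailing()  // trailing side only
        };
    }

    public static void main(String[] args) {
        String s = "\u2003 hi \u2003"; // EM SPACE on both ends
        for (String variant : trimVariants(s)) {
            System.out.println("[" + variant + "] length=" + variant.length());
        }
    }
}
```

Non-breaking spaces such as U+00A0 are not whitespace according to Character.isWhitespace(), so strip() leaves them in place.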

To match any whitespace character, you can use

Pattern whitespace = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

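The Pattern.UNICODE_CHARACTER_CLASS option "Enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".

The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace or remove all Unicode whitespace in Java with a regex, you can use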

String result = text.replaceAll("(?U)\\s+", ""); // removes all whitespaces
String result = text.replaceAll("(?U)\\s", "-"); // replaces each single whitespace with -
String result = text.replaceAll("(?U)\\s+", "-"); // replaces chunks of one or more consecutive whitespaces with a single -
String result = text.replaceAll("(?U)\\G\\s", "-"); // replaces each single whitespace at the start of string with -

See the Java demo online:

String text = "\u00A0 \u00A0\tStart reading\u00A0here..."; // \u00A0 - non-breaking space
System.out.println("Text: '" + text + "'"); // => Text: '       Start reading here...'
System.out.println(text.replaceAll("(?U)\\s+", "")); // => Startreadinghere...
System.out.println(text.replaceAll("(?U)\\s", "-")); // => ----Start-reading-here...
System.out.println(text.replaceAll("(?U)\\s+", "-")); // => -Start-reading-here...
System.out.println(text.replaceAll("(?U)\\G\\s", "-")); // => ----Start reading here...
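Applied to the original question, the same flag can be passed to Pattern.compile to collapse runs of two or more Unicode whitespace characters into one space. A minimal sketch, with the class name UnicodeCollapse and the method collapse chosen only for illustration:

```java
import java.util.regex.Pattern;

class UnicodeCollapse {
    // \s is Unicode-aware here because of UNICODE_CHARACTER_CLASS.
    private static final Pattern WS_RUN =
            Pattern.compile("\\s{2,}", Pattern.UNICODE_CHARACTER_CLASS);

    // Replaces each run of two or more whitespace characters with one space.
    static String collapse(String s) {
        return WS_RUN.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "a\u00A0\u00A0b \u2003 c"; // no-break spaces and an EM SPACE
        System.out.println("\"" + collapse(text) + "\"");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would otherwise do.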

You can use a simpler approach:

String out = in.replaceAll(" {2}", " ");