Why in Java 8 split sometimes removes empty strings at start of result array?

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

the split mechanism would split in the places marked with |

|a|b|c|

because the empty string "" exists before and after each character. So as a result it would at first generate this array

["", "a", "b", "c", ""]

and would later remove trailing empty strings (because we didn't explicitly provide a negative value for the limit argument), so it would finally return

["", "a", "b", "c"]

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we will get ["a", "b", "c"] array instead of ["", "a", "b", "c"].

My first guess was that maybe now leading empty strings are also removed just like trailing empty strings.

But this theory fails, since

"abc".split("a")

returns ["", "bc"], so leading empty string was not removed.

Can someone explain what is going on here? How have the rules of split changed in Java 8?


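The documentation of split() changed slightly between Java 7 and Java 8. Specifically, the following statement was added:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

(emphasis mine)

Splitting on the empty string produces a zero-width match at the beginning, so, as specified above, no empty string is included at the start of the resulting array. By contrast, your second example, which splits on "a", produces a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.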
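The two cases can be checked with a minimal sketch like the one below, assuming a Java 8 or later runtime. The class name SplitLeadingEmptyDemo and the helper show are made up for illustration; String.split and Arrays.toString are the only real API calls used. The third call passes a negative limit to show that the trailing empty string is still produced and is only dropped when limit is 0.

```java
import java.util.Arrays;

public class SplitLeadingEmptyDemo {
    // Hypothetical helper: renders the result of input.split(regex, limit) as text.
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        // Zero-width match at the beginning: no leading empty string (Java 8+).
        System.out.println(show("abc", "", 0));   // [a, b, c]
        // Positive-width match at the beginning: leading empty string is kept.
        System.out.println(show("abc", "a", 0));  // [, bc]
        // Negative limit: the trailing empty string is kept as well.
        System.out.println(show("abc", "", -1));  // [a, b, c, ]
    }
}
```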

This has been specified in the documentation of split(String regex, int limit).

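When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

In "abc".split("") you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

In your second snippet, however, when you split on "a", you get a positive-width match (width 1 in this case), so the empty leading substring is included as expected.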

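(Irrelevant source code removed)

The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to String.split in Java 8, compared to Java 7.

Reference implementation

Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7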

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The following code, added in Java 8, excludes a zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
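To see which inputs trigger that condition, here is a small sketch that asks a Matcher directly whether the first match is a zero-width match at position 0. The class LeadingMatchCheck and the method firstMatchIsZeroWidthAtStart are hypothetical names used only for illustration; they are not part of the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeadingMatchCheck {
    // Hypothetical helper: true when the first match of regex in input starts
    // and ends at index 0, which is the case the Java 8 check skips.
    static boolean firstMatchIsZeroWidthAtStart(String regex, CharSequence input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() && m.start() == 0 && m.end() == 0;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchIsZeroWidthAtStart("", "abc"));   // true
        System.out.println(firstMatchIsZeroWidthAtStart("a", "abc"));  // false
        System.out.println(firstMatchIsZeroWidthAtStart("a*", "bcd")); // true
    }
}
```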

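Maintaining compatibility

Following the behavior of Java 8 and above

To make split behave consistently across versions and compatibly with the behavior in Java 8:

  1. If your regex can match a zero-length string, just add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string, do both of the actions in step 1.

(?!\A) checks that the match does not end at the beginning of the string; a match that ends there is necessarily an empty match at the beginning of the string.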
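The steps above can be sketched as a tiny wrapper. The class PortableSplit and the method excludeLeadingEmptyMatch are hypothetical names, and the expected outputs in the comments assume a Java 8 or later runtime; the intent of the rewritten regex is that Java 7 would produce the same arrays.

```java
import java.util.Arrays;

public class PortableSplit {
    // Hypothetical helper: wraps the regex in a non-capturing group and
    // appends (?!\A) so that a match can never end at the start of the input.
    static String excludeLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // A lookahead is a zero-width pattern that matches at index 0 of "AbcDef".
        String upper = excludeLeadingEmptyMatch("(?=[A-Z])");
        System.out.println(Arrays.toString("AbcDef".split(upper))); // [Abc, Def]

        // The empty regex from the question, rewritten the same way.
        String empty = excludeLeadingEmptyMatch("");
        System.out.println(Arrays.toString("abc".split(empty)));    // [a, b, c]
    }
}
```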

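Following the behavior of Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.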
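As a rough idea of what such a custom implementation might look like, the sketch below adapts the Java 7 listing above into a standalone static method. LegacySplit and splitJava7Style are hypothetical names, and this is only an illustration under the assumption that the Java 7 listing is reproduced faithfully; it is not a drop-in replacement that has been verified against a real Java 7 runtime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper modeled on the Java 7 Pattern.split listing:
    // a zero-width match at index 0 still yields a leading empty string.
    static String[] splitJava7Style(String input, String regex, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match found (no leading-match check).
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.substring(index, m.start()));
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                matchList.add(input.substring(index));
                index = m.end();
            }
        }

        // If no match was found, return the input unchanged.
        if (index == 0)
            return new String[] {input};

        // Add the remaining segment.
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.substring(index));

        // With limit == 0, drop trailing empty strings.
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty())
                resultSize--;
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Prints [, a, b, c], the pre-Java 8 result described in the question.
        System.out.println(java.util.Arrays.toString(splitJava7Style("abc", "", 0)));
    }
}
```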