从 JavaString 中去除所有不可打印字符的最快方法

在 Java 中,从 String中去除所有不可打印字符的最快方法是什么?

到目前为止,我尝试并测量了138字节、131字符的 String:

  • String 的 replaceAll()-< strong > Slow 方法
    • 517009结果/秒
  • 预编译一个模式,然后使用 Matcher 的 replaceAll()
    • 637836结果/秒
  • 使用 StringBuffer,使用 codepointAt()逐个获取代码点,并附加到 StringBuffer
    • 711946结果/秒
  • 使用 StringBuffer,使用 charAt()逐个获取字符,并附加到 StringBuffer
    • 1052964结果/秒
  • 预分配一个 char[]缓冲区,使用 charAt()逐个获取字符并填充该缓冲区,然后转换回 String
    • 2022653结果/秒
  • 预分配2个新旧 char[]缓冲区,使用 getChars()一次性获取现有 String 的所有字符,逐个迭代旧缓冲区并填充新缓冲区,然后将新缓冲区转换为 String-< strong > 我自己的最快版本
    • 2502502结果/秒
  • 同样的东西与2个缓冲区-只使用 byte[]getBytes()和指定编码为“ utf-8”
    • 857485结果/秒
  • 同样的东西与2个 byte[]缓冲区,但指定编码为一个常量 Charset.forName("utf-8")
    • 791076结果/秒
  • 使用2个 byte[]缓冲区的情况相同,但是将编码指定为1字节的本地编码(这几乎不是一件明智的事情)
    • 370164结果/秒

我最好的尝试如下:

    char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
char ch = oldChars[j];
if (ch >= ' ') {
newChars[newLen] = ch;
newLen++;
}
}
s = new String(newChars, 0, newLen);

有什么办法能让它更快吗?

回答一个非常奇怪的问题: 为什么直接使用“ utf-8”字符集名称比使用预分配的静态常量 Charset.forName("utf-8")产生更好的性能?

更新

  • 建议从 棘轮怪产生令人印象深刻的3105590结果/秒的性能,一个 + 24% 的提高!
  • Ed Staub的建议产生了另一个改进-3471017结果/秒,a + 12% 以上以前最好的。

更新2

我已经尽力收集了所有提出的解决方案及其交叉突变,并将其作为 Github 的小型基准测试框架发表。目前它运行着17种算法。其中之一是“特殊”-哇哦算法(由 SO 用户 Voo 提供)采用复杂的反射技巧,从而实现恒星速度,但它混乱了 JVM 字符串的状态,因此它是单独基准测试。

欢迎您检查它,并运行它,以确定您的盒子的结果。这是我得到的结果总结。是说明书:

  • Debian Sid
  • Linux 2.6.39-2-amd64(x86 _ 64)
  • Java 是从一个包 sun-java6-jdk-6.24-1安装的,JVM 将自己标识为
    • Java (商标)系统工程执行期函式库(build 1.6.0 _ 24-b07)
    • Java HotSpot (TM)64位服务器 VM (build 19.1-b02,混合模式)

不同的算法给出不同的输入数据集,最终显示不同的结果:

同一根绳子

这种模式作为常量在 StringSource类提供的同一个字符串上工作:

Ops / s  │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
994 727 │ ArrayOfByteUTF8String
918 611 │ ArrayOfByteUTF8Const
756 086 │ MatcherReplace
598 945 │ StringReplaceAll
460 045 │ ArrayOfByteWindows1251

以图表形式: Same single string chart
(来源: Greycat.ru)

多个字符串,100% 的字符串包含控制字符

源字符串提供程序使用(0。.127)字符集——因此几乎所有字符串都至少包含一个控制字符。算法以循环的方式从这个预生成的数组中接收字符串。

Ops / s  │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
893 176 │ ArrayOfByteUTF8String
817 127 │ ArrayOfByteUTF8Const
778 089 │ StringBuilderChar
734 754 │ StringBuilderCodePoint
377 829 │ ArrayOfByteWindows1251
224 140 │ MatcherReplace
211 104 │ StringReplaceAll

以图表形式: Multiple strings, 100% concentration
(来源: Greycat.ru)

多个字符串,1% 的字符串包含控制字符

与前面相同,但是只有1% 的字符串是使用控制字符生成的——其他99% 是使用[32]生成的。. 127]字符集,所以它们根本不能包含控制字符。这个合成负载最接近这个算法在我那里的实际应用。

Ops / s  │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
939 055 │ StringBuilderChar
907 194 │ ArrayOfByteUTF8Const
841 963 │ StringBuilderCodePoint
606 465 │ MatcherReplace
501 555 │ StringReplaceAll
381 185 │ ArrayOfByteWindows1251

以图表形式: Multiple strings, 1% concentration
(来源: Greycat.ru)

It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

参考文献

24740 次浏览

using 1 char array could work a bit better

int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;
newLen++;
}
}
s = new String(oldChars, 0, newLen);

and I avoided repeated calls to s.length();

another micro-optimization that might work is

int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
// if there are none it ends on the null char I appended
for (int  j = newLen; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
newLen++;
}
}
s = new String(oldChars, 0, newLen);

why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?

If you mean String#getBytes("utf-8") etc.: This shouldn't be faster - except for some better caching - since Charset.forName("utf-8") is used internally, if the charset is not cached.

One thing might be that you're using different charsets (or maybe some of your code does transparently) but the charset cached in StringCoding doesn't change.

IANA low-level java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPU's to perform checks in parallel.

Also, this has some fun ideas for optimizations.

You could split the task into a several parallel subtasks, depending of processor's quantity.

I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.

Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];


String stripControlChars(String s)
{
final int inputLen = s.length();
if ( oldChars.length < inputLen )
{
oldChars = new char[inputLen];
}
s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.

Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.

As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.

Anyways here we go:

    // Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
try {
value = String.class.getDeclaredField("value");
value.setAccessible(true);
offset = String.class.getDeclaredField("offset");
offset.setAccessible(true);
count = String.class.getDeclaredField("count");
count.setAccessible(true);
hash = String.class.getDeclaredField("hash");
hash.setAccessible(true);
}
catch (NoSuchFieldException e) {
throw new RuntimeException();
}


}


@Override
public String strip(final String old) {
final int length = old.length();
char[] chars = null;
int off = 0;
try {
chars = (char[]) value.get(old);
off = offset.getInt(old);
}
catch(IllegalArgumentException e) {
throw new RuntimeException(e);
}
catch(IllegalAccessException e) {
throw new RuntimeException(e);
}
int newLen = off;
for(int j = off; j < off + length; j++) {
final char ch = chars[j];
if (ch >= ' ') {
chars[newLen] = ch;
newLen++;
}
}
if (newLen - off != length) {
// We changed the internal state of the string, so at least
// be friendly enough to correct it.
try {
count.setInt(old, newLen - off);
// Have to recompute hash later on
hash.setInt(old, 0);
}
catch(IllegalArgumentException e) {
e.printStackTrace();
}
catch(IllegalAccessException e) {
e.printStackTrace();
}
}
// Well we have to return something
return old;
}

For my teststring that gets 3477148.18ops/s vs. 2616120.89ops/s for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.

It can go even faster. Much faster*. How? By leveraging System.arraycopy which is native method. So to recap:

  • Return the same String if it's "clean".

  • Avoid allocating a new char[] on every iteration

  • Use System.arraycopy for moving the elements x positions back

      public class SteliosAdamantidis implements StripAlgorithm {
    
    
    private char[] copy = new char[128];
    
    
    @Override
    public String strip(String s) throws Exception {
    int length = s.length();
    if (length > copy.length) {
    int newLength = copy.length * 2;
    while (length > newLength) newLength *= 2;
    copy = new char[newLength];
    }
    
    
    s.getChars(0, length, copy, 0);
    
    
    int start = 0;  //where to start copying from
    int offset = 0; //number of non printable characters or how far
    //behind the characters should be copied to
    
    
    int index = 0;
    //fast forward to the first non printable character
    for (; index < length; ++index) {
    if (copy[index] < ' ') {
    start = index;
    break;
    }
    }
    
    
    //string is already clean
    if (index == length) return s;
    
    
    for (; index < length; ++index) {
    if (copy[index] < ' ') {
    if (start != index) {
    System.arraycopy(copy, start, copy, start - offset, index - start);
    }
    ++offset;
    start = index + 1; //handling subsequent non printable characters
    }
    }
    
    
    if (length != start) {
    //copy the residue -if any
    System.arraycopy(copy, start, copy, start - offset, length - start);
    }
    return new String(copy, 0, length - offset);
    }
    }
    

This class is not thread safe but I guess that if one wants to handle a gazillion of strings on separate threads then they can afford 4-8 instances of the StripAlgorithm implementation inside a ThreadLocal<>

Trivia

  1. I used as reference the RatchetFreak2EdStaub1GreyCat2 solution. I was surprised that this wasn't performing any good on my machine. Then I wrongfully thought that the "bailout" mechanism didn't work and I moved it at the end. It skyrocketed performance. Then I though "wait a minute" and I realized that the condition works always it's just better at the end. I don't know why.

     ...
    6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93   3.54x   +3.9%
    ...
    2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06   6.12x   +13.9%
    
  2. The test is not 100% accurate. At first I was an egoist and I've put my test second on the array of algorithms. It had some lousy results on the first run and then I moved it at the end (let the others warm up the JVM for me :) ) and then it came first.

Results

Oh and of course the results. Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and or OS.

    Rankings: (1.000.000 strings)
17. StringReplaceAll                        990099.01   1.00x   +0.0%
16. ArrayOfByteWindows1251                  1642036.12  1.66x   +65.8%
15. StringBuilderCodePoint                  1724137.93  1.74x   +5.0%
14. ArrayOfByteUTF8Const                    2487562.19  2.51x   +44.3%
13. StringBuilderChar                       2531645.57  2.56x   +1.8%
12. ArrayOfByteUTF8String                   2551020.41  2.58x   +0.8%
11. ArrayOfCharFromArrayOfChar              2824858.76  2.85x   +10.7%
10. RatchetFreak2                           2923976.61  2.95x   +3.5%
9. RatchetFreak1                           3076923.08  3.11x   +5.2%
8. ArrayOfCharFromStringCharAt             3322259.14  3.36x   +8.0%
7. EdStaub1                                3378378.38  3.41x   +1.7%
6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93  3.54x   +3.9%
5. EdStaub1GreyCat1                        3787878.79  3.83x   +8.0%
4. MatcherReplace                          4716981.13  4.76x   +24.5%
3. RatchetFreak2EdStaub1GreyCat1           5319148.94  5.37x   +12.8%
2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06  6.12x   +13.9%
1. SteliosAdamantidis                      9615384.62  9.71x   +58.7%


Rankings: (10.000.000 strings)
17. ArrayOfByteWindows1251                  1647175.09  1.00x   +0.0%
16. StringBuilderCodePoint                  1728907.33  1.05x   +5.0%
15. StringBuilderChar                       2480158.73  1.51x   +43.5%
14. ArrayOfByteUTF8Const                    2498126.41  1.52x   +0.7%
13. ArrayOfByteUTF8String                   2591344.91  1.57x   +3.7%
12. StringReplaceAll                        2626740.22  1.59x   +1.4%
11. ArrayOfCharFromArrayOfChar              2810567.73  1.71x   +7.0%
10. RatchetFreak2                           2948113.21  1.79x   +4.9%
9. RatchetFreak1                           3120124.80  1.89x   +5.8%
8. ArrayOfCharFromStringCharAt             3306878.31  2.01x   +6.0%
7. EdStaub1                                3399048.27  2.06x   +2.8%
6. RatchetFreak2EdStaub1GreyCatEarlyBail   3494060.10  2.12x   +2.8%
5. EdStaub1GreyCat1                        3818251.24  2.32x   +9.3%
4. MatcherReplace                          4899559.04  2.97x   +28.3%
3. RatchetFreak2EdStaub1GreyCat1           5302226.94  3.22x   +8.2%
2. RatchetFreak2EdStaub1GreyCatLateBail    5924170.62  3.60x   +11.7%
1. SteliosAdamantidis                      9680542.11  5.88x   +63.4%

* Reflection -Voo's answer

I've put an asterisk on the Much faster statement. I don't think that anything can go faster than reflection in that case. It mutates the String's internal state and avoids new String allocations. I don't think one can beat that.

I tried to uncomment and run Voo's algorithm and I got an error that offset field doesn't exit. IntelliJ complains that it can't resolve count either. Also (if I'm not mistaken) the security manager might cut reflection access to private fields and thus this solution won't work. That's why this algorithm doesn't appear in my test run. Otherwise I was curious to see myself although I believe that a non reflective solution can't be faster.