Fastest way to strip all non-printable characters from a Java String

小开

Using a single char array could work a bit better:

int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

I also avoided repeated calls to s.length().

Another micro-optimization that might work is:

int length = s.length();
char[] oldChars = new char[length + 1];
s.getChars(0, length, oldChars, 0);
oldChars[length] = '\0'; // sentinel: avoids an explicit bounds check in the while loop
int newLen = -1;
while (oldChars[++newLen] >= ' '); // find the first non-printable character;
// if there are none, the loop ends on the null char appended above
for (int j = newLen; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch; // the while loop avoids redundant self-assignment here when newLen == j
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);
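
To sanity-check that the two snippets above behave the same, they can be wrapped as methods and compared on a few edge cases (empty string, already-clean string, all control characters). This is only a sketch: the class and method names are made up for illustration.

```java
// Sketch only: class and method names are illustrative, not from the answer.
final class SentinelStripSketch {

    // Variant 1: one buffer, compacted in place.
    static String stripSimple(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);
        int newLen = 0;
        for (int j = 0; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen++] = ch;
            }
        }
        return new String(chars, 0, newLen);
    }

    // Variant 2: a '\0' sentinel one slot past the data ends the initial scan
    // without a separate index check.
    static String stripWithSentinel(String s) {
        int length = s.length();
        char[] chars = new char[length + 1];
        s.getChars(0, length, chars, 0);
        chars[length] = '\0'; // new Java arrays are already zero-filled; kept for clarity
        int newLen = 0;
        while (chars[newLen] >= ' ') {
            newLen++;
        }
        for (int j = newLen; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen++] = ch;
            }
        }
        return new String(chars, 0, newLen);
    }

    public static void main(String[] args) {
        String[] samples = {"", "clean text", "a\tb\nc\u0001d", "\u0000\u0001\u0002"};
        for (String sample : samples) {
            // Both variants should agree on every input.
            System.out.println(stripSimple(sample).equals(stripWithSentinel(sample)));
        }
    }
}
```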

小开

Why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static constant Charset.forName("utf-8")?

If you mean String#getBytes("utf-8") etc.: this shouldn't be faster, apart from some better caching, since Charset.forName("utf-8") is used internally when the charset is not cached.

One thing might be that you're using different charsets (or maybe some of your code does transparently) but the charset cached in StringCoding doesn't change.
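
For reference, the two call shapes being compared look roughly like this. It is a sketch with made-up class and method names; it only shows that both produce the same bytes and says nothing about which is faster on a given JVM.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: class and method names are illustrative.
final class CharsetLookupSketch {

    // Charset resolved once and kept in a constant (StandardCharsets needs Java 7+).
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // Looks the charset up by name on every call.
    static byte[] encodeByName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Passes the pre-resolved Charset object.
    static byte[] encodeByConstant(String s) {
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo";
        System.out.println(Arrays.equals(encodeByName(sample), encodeByConstant(sample)));
    }
}
```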

小开

IANA low-level Java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPUs to perform checks in parallel.

Also, this has some fun ideas for optimizations.
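
A minimal sketch of what manual unrolling of the main loop could look like, processing four chars per iteration. The class name and the unroll factor of 4 are assumptions. The JIT may already unroll the plain loop, so whether this helps has to be measured.

```java
// Sketch only: class name and unroll factor are illustrative assumptions.
final class UnrolledStripSketch {

    static String strip(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);
        int newLen = 0;
        int j = 0;
        // Main loop: four independent reads per iteration, then four checks.
        // All reads happen before any write, and newLen never exceeds j,
        // so unread input is never overwritten.
        for (int limit = length - 3; j < limit; j += 4) {
            char c0 = chars[j];
            char c1 = chars[j + 1];
            char c2 = chars[j + 2];
            char c3 = chars[j + 3];
            if (c0 >= ' ') chars[newLen++] = c0;
            if (c1 >= ' ') chars[newLen++] = c1;
            if (c2 >= ' ') chars[newLen++] = c2;
            if (c3 >= ' ') chars[newLen++] = c3;
        }
        // Tail loop: the remaining 0 to 3 chars.
        for (; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') chars[newLen++] = ch;
        }
        return new String(chars, 0, newLen);
    }

    public static void main(String[] args) {
        System.out.println(strip("ab\tcd\nef\rg")); // prints "abcdefg"
    }
}
```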

小开

You could split the task into several parallel subtasks, depending on the number of processors.
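
One way to sketch this, assuming the work is a list of strings rather than one huge string: a Java 8 parallel stream splits the list across the common fork/join pool, which is sized from the number of available processors by default. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: names are illustrative; assumes many independent input strings.
final class ParallelStripSketch {

    // Plain single-threaded strip of one string.
    static String strip(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);
        int newLen = 0;
        for (int j = 0; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') chars[newLen++] = ch;
        }
        return new String(chars, 0, newLen);
    }

    // Strips every string in the list in parallel; the result keeps input order.
    static List<String> stripAll(List<String> inputs) {
        return inputs.parallelStream()
                .map(ParallelStripSketch::strip)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(stripAll(Arrays.asList("a\tb", "c\nd", "ef"))); // prints [ab, cd, ef]
    }
}
```

For short strings the coordination overhead can easily outweigh the gain, so this only makes sense for large batches.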

小开

I had some free time and wrote a small benchmark for the different algorithms. It's not perfect, but I take the minimum over 1000 runs, each of which applies a given algorithm 10000 times to a random string (with about 32/200 = 16% non-printables by default). That should take care of things like GC, initialization and so on: there's not so much overhead that any algorithm shouldn't get at least one run without much hindrance.

Not especially well documented, but oh well. Here we go: I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200-character string with uniformly distributed chars in the range [0, 200).
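
The benchmark code itself isn't reproduced here, so the following is only a rough sketch of the setup as described: a random 200-char string with chars uniform in [0, 200), and the minimum time over many runs. All names, and the reading of "1000 runs of 10000 calls", are assumptions.

```java
import java.util.Random;

// Sketch only: not the original benchmark; names and structure are assumed.
final class MinOfRunsBenchmarkSketch {

    // Random string with chars uniformly distributed in [0, charRange).
    static String randomString(Random rnd, int length, int charRange) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) rnd.nextInt(charRange);
        }
        return new String(chars);
    }

    // Algorithm under test: the basic char-array version.
    static String strip(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);
        int newLen = 0;
        for (int j = 0; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') chars[newLen++] = ch;
        }
        return new String(chars, 0, newLen);
    }

    // Smallest elapsed time (in ns) of `runs` runs, each making `callsPerRun` calls.
    static long minNanos(String input, int runs, int callsPerRun) {
        long best = Long.MAX_VALUE;
        long sink = 0; // consume results so the JIT cannot discard the calls
        for (int run = 0; run < runs; run++) {
            long start = System.nanoTime();
            for (int call = 0; call < callsPerRun; call++) {
                sink += strip(input).length();
            }
            best = Math.min(best, System.nanoTime() - start);
        }
        if (sink < 0) System.out.println(sink); // never true; keeps sink observable
        return best;
    }

    public static void main(String[] args) {
        String input = randomString(new Random(42), 200, 200);
        long ns = minNanos(input, 1000, 10000);
        System.out.println(10000 * 1e9 / Math.max(ns, 1) + " ops/s");
    }
}
```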

小开

Best answer

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char[] oldChars = new char[5];

String stripControlChars(String s)
{
    final int inputLen = s.length();
    if ( oldChars.length < inputLen )
    {
        oldChars = new char[inputLen];
    }
    s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win, 20% or so over the current best case, as I understand it.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
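
A minimal sketch of the weak-reference idea, assuming the same single-threaded use as above. The class name is made up, and the filtering loop is an assumed completion of the part the snippet elides with "etc...".

```java
import java.lang.ref.WeakReference;

// Sketch only: not thread-safe; class name and loop body are assumptions.
final class WeakBufferStripper {

    // The buffer is only weakly reachable between calls, so the GC may reclaim it.
    private WeakReference<char[]> bufferRef = new WeakReference<>(null);

    String stripControlChars(String s) {
        final int inputLen = s.length();
        // Hold a strong reference for the duration of this call.
        char[] oldChars = bufferRef.get();
        if (oldChars == null || oldChars.length < inputLen) {
            oldChars = new char[inputLen];
            bufferRef = new WeakReference<>(oldChars);
        }
        s.getChars(0, inputLen, oldChars, 0);
        // Assumed completion: compact the printable chars in place.
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') oldChars[newLen++] = ch;
        }
        return new String(oldChars, 0, newLen);
    }

    public static void main(String[] args) {
        WeakBufferStripper stripper = new WeakBufferStripper();
        System.out.println(stripper.stripControlChars("one\ttwo")); // prints "onetwo"
        System.out.println(stripper.stripControlChars("a\nb"));     // prints "ab"
    }
}
```

A weakly referenced buffer can be cleared at any garbage collection, so the reuse benefit may be lost often. A SoftReference, which is typically cleared only under memory pressure, might be a better fit, but that is a judgment call to verify by measurement.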

小开

Well, I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measurements. How? By selling my soul.

As I'm sure everyone who has followed the discussion so far knows, this violates pretty much every basic programming principle, but oh well. Anyway, the following only works if the string's underlying character array isn't shared with other strings. If it is, whoever has to debug this will have every right to kill you. (Without calls to substring(), and as long as you don't use this on literal strings, it should work, as I don't see why the JVM would intern unique strings read from an outside source.) Don't forget to make sure the benchmark code doesn't share character arrays either: that's extremely likely, and it would obviously help the reflection solution.

Anyways here we go:

// requires: import java.lang.reflect.Field;
// Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
    try {
        value = String.class.getDeclaredField("value");
        value.setAccessible(true);
        offset = String.class.getDeclaredField("offset");
        offset.setAccessible(true);
        count = String.class.getDeclaredField("count");
        count.setAccessible(true);
        hash = String.class.getDeclaredField("hash");
        hash.setAccessible(true);
    }
    catch (NoSuchFieldException e) {
        // keep the cause so a missing field is easy to diagnose
        throw new RuntimeException(e);
    }
}

@Override
public String strip(final String old) {
    final int length = old.length();
    char[] chars = null;
    int off = 0;
    try {
        chars = (char[]) value.get(old);
        off = offset.getInt(old);
    }
    catch (IllegalArgumentException e) {
        throw new RuntimeException(e);
    }
    catch (IllegalAccessException e) {
        throw new RuntimeException(e);
    }
    int newLen = off;
    for (int j = off; j < off + length; j++) {
        final char ch = chars[j];
        if (ch >= ' ') {
            chars[newLen] = ch;
            newLen++;
        }
    }
    if (newLen - off != length) {
        // We changed the internal state of the string, so at least
        // be friendly enough to correct it.
        try {
            count.setInt(old, newLen - off);
            // Have to recompute hash later on
            hash.setInt(old, 0);
        }
        catch (IllegalArgumentException e) {
            e.printStackTrace();
        }
        catch (IllegalAccessException e) {
            e.printStackTrace();
        }
    }
    // Well we have to return something
    return old;
}

For my test string that gets 3477148.18 ops/s vs. 2616120.89 ops/s for the old variant. I'm quite sure the only way to beat that would be to write it in C (probably not, though) or to use some completely different approach nobody has thought of so far. I'm not at all sure the timing is stable across platforms, but it produces reliable results on my box (Java7, Win7 x64) at least.

小开

It can go even faster. Much faster*. How? By leveraging System.arraycopy, which is a native method. So to recap:

Return the same String if it's "clean".
Avoid allocating a new char[] on every iteration.
Use System.arraycopy for moving the elements x positions back.

public class SteliosAdamantidis implements StripAlgorithm {

    private char[] copy = new char[128];

    @Override
    public String strip(String s) throws Exception {
        int length = s.length();
        if (length > copy.length) {
            int newLength = copy.length * 2;
            while (length > newLength) newLength *= 2;
            copy = new char[newLength];
        }

        s.getChars(0, length, copy, 0);

        int start = 0;  //where to start copying from
        int offset = 0; //number of non printable characters or how far
                        //behind the characters should be copied to

        int index = 0;
        //fast forward to the first non printable character
        for (; index < length; ++index) {
            if (copy[index] < ' ') {
                start = index;
                break;
            }
        }

        //string is already clean
        if (index == length) return s;

        for (; index < length; ++index) {
            if (copy[index] < ' ') {
                if (start != index) {
                    System.arraycopy(copy, start, copy, start - offset, index - start);
                }
                ++offset;
                start = index + 1; //handling subsequent non printable characters
            }
        }

        if (length != start) {
            //copy the residue -if any
            System.arraycopy(copy, start, copy, start - offset, length - start);
        }
        return new String(copy, 0, length - offset);
    }
}

This class is not thread-safe, but I guess that anyone who wants to handle a gazillion strings on separate threads can afford 4-8 instances of the StripAlgorithm implementation inside a ThreadLocal<>.
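
A minimal sketch of the ThreadLocal idea, so that each thread gets its own buffer-holding instance. BufferedStripper is a hypothetical stand-in for any non-thread-safe implementation with a reusable buffer; ThreadLocal.withInitial requires Java 8.

```java
// Sketch only: BufferedStripper is a hypothetical stand-in for a
// non-thread-safe stripper that keeps a reusable buffer.
final class ThreadLocalStripSketch {

    static final class BufferedStripper {
        private char[] buf = new char[128];

        String strip(String s) {
            int length = s.length();
            if (length > buf.length) {
                buf = new char[Math.max(length, buf.length * 2)];
            }
            s.getChars(0, length, buf, 0);
            int newLen = 0;
            for (int j = 0; j < length; j++) {
                char ch = buf[j];
                if (ch >= ' ') buf[newLen++] = ch;
            }
            // Return the original instance when nothing was removed.
            return newLen == length ? s : new String(buf, 0, newLen);
        }
    }

    // One stripper (and therefore one buffer) per thread, created lazily.
    private static final ThreadLocal<BufferedStripper> STRIPPER =
            ThreadLocal.withInitial(BufferedStripper::new);

    static String strip(String s) {
        return STRIPPER.get().strip(s);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(strip("x\ty\nz")); // each thread prints "xyz"
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```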

Trivia

I used the RatchetFreak2EdStaub1GreyCat2 solution as a reference. I was surprised that it wasn't performing well on my machine. Then I wrongly thought that the "bailout" mechanism didn't work and moved it to the end. Performance skyrocketed. Then I thought "wait a minute" and realized that the condition always works; it's just faster at the end. I don't know why.
```
 ...
6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93   3.54x   +3.9%
...
2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06   6.12x   +13.9%
```
The test is not 100% accurate. At first I was an egoist and put my test second in the array of algorithms. It had some lousy results on the first run, so I moved it to the end (let the others warm up the JVM for me :) ) and then it came first.
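
One way to make the ordering matter less is to warm up every algorithm before timing any of them. The following is only a sketch with made-up names, not the benchmark used for the numbers here; a dedicated harness such as JMH handles warm-up more rigorously.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch only: names are illustrative; not the benchmark used in this answer.
final class WarmupOrderSketch {

    // Runs the algorithm `calls` times and returns the elapsed nanoseconds.
    static long timeNanos(UnaryOperator<String> algorithm, String input, int calls) {
        long sink = 0; // consume results so the calls are not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += algorithm.apply(input).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // never true; keeps sink observable
        return elapsed;
    }

    // Warms up all algorithms first, then times them in insertion order.
    static Map<String, Long> measure(Map<String, UnaryOperator<String>> algorithms,
                                     String input, int warmupCalls, int timedCalls) {
        for (UnaryOperator<String> algorithm : algorithms.values()) {
            timeNanos(algorithm, input, warmupCalls);
        }
        Map<String, Long> results = new LinkedHashMap<>();
        for (Map.Entry<String, UnaryOperator<String>> e : algorithms.entrySet()) {
            results.put(e.getKey(), timeNanos(e.getValue(), input, timedCalls));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> algorithms = new LinkedHashMap<>();
        algorithms.put("replaceAll", s -> s.replaceAll("[\\x00-\\x1F]", ""));
        algorithms.put("identity", s -> s);
        System.out.println(measure(algorithms, "a\tb\nc", 20000, 100000));
    }
}
```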

Results

Oh, and of course the results: Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and/or OS.

Rankings: (1.000.000 strings)
17. StringReplaceAll                        990099.01   1.00x   +0.0%
16. ArrayOfByteWindows1251                  1642036.12  1.66x   +65.8%
15. StringBuilderCodePoint                  1724137.93  1.74x   +5.0%
14. ArrayOfByteUTF8Const                    2487562.19  2.51x   +44.3%
13. StringBuilderChar                       2531645.57  2.56x   +1.8%
12. ArrayOfByteUTF8String                   2551020.41  2.58x   +0.8%
11. ArrayOfCharFromArrayOfChar              2824858.76  2.85x   +10.7%
10. RatchetFreak2                           2923976.61  2.95x   +3.5%
 9. RatchetFreak1                           3076923.08  3.11x   +5.2%
 8. ArrayOfCharFromStringCharAt             3322259.14  3.36x   +8.0%
 7. EdStaub1                                3378378.38  3.41x   +1.7%
 6. RatchetFreak2EdStaub1GreyCatEarlyBail   3508771.93  3.54x   +3.9%
 5. EdStaub1GreyCat1                        3787878.79  3.83x   +8.0%
 4. MatcherReplace                          4716981.13  4.76x   +24.5%
 3. RatchetFreak2EdStaub1GreyCat1           5319148.94  5.37x   +12.8%
 2. RatchetFreak2EdStaub1GreyCatLateBail    6060606.06  6.12x   +13.9%
 1. SteliosAdamantidis                      9615384.62  9.71x   +58.7%


Rankings: (10.000.000 strings)
17. ArrayOfByteWindows1251                  1647175.09  1.00x   +0.0%
16. StringBuilderCodePoint                  1728907.33  1.05x   +5.0%
15. StringBuilderChar                       2480158.73  1.51x   +43.5%
14. ArrayOfByteUTF8Const                    2498126.41  1.52x   +0.7%
13. ArrayOfByteUTF8String                   2591344.91  1.57x   +3.7%
12. StringReplaceAll                        2626740.22  1.59x   +1.4%
11. ArrayOfCharFromArrayOfChar              2810567.73  1.71x   +7.0%
10. RatchetFreak2                           2948113.21  1.79x   +4.9%
 9. RatchetFreak1                           3120124.80  1.89x   +5.8%
 8. ArrayOfCharFromStringCharAt             3306878.31  2.01x   +6.0%
 7. EdStaub1                                3399048.27  2.06x   +2.8%
 6. RatchetFreak2EdStaub1GreyCatEarlyBail   3494060.10  2.12x   +2.8%
 5. EdStaub1GreyCat1                        3818251.24  2.32x   +9.3%
 4. MatcherReplace                          4899559.04  2.97x   +28.3%
 3. RatchetFreak2EdStaub1GreyCat1           5302226.94  3.22x   +8.2%
 2. RatchetFreak2EdStaub1GreyCatLateBail    5924170.62  3.60x   +11.7%
 1. SteliosAdamantidis                      9680542.11  5.88x   +63.4%

* Reflection -Voo's answer

I've put an asterisk on the "Much faster" statement. I don't think anything can go faster than reflection in this case: it mutates the String's internal state and avoids allocating a new String. I don't think one can beat that.

I tried to uncomment and run Voo's algorithm and got an error that the offset field doesn't exist; IntelliJ complains that it can't resolve count either. (Both fields were removed from java.lang.String in JDK 7 update 6, and I ran on jdk1.8.0_111.) Also, if I'm not mistaken, a security manager might block reflective access to private fields, in which case this solution won't work. That's why this algorithm doesn't appear in my test run. Otherwise I would have been curious to see the result for myself, although I believe a non-reflective solution can't be faster.
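
To check up front whether the fields the reflection approach depends on exist on a given JVM, a small probe like the following can be used. It is a sketch with made-up names; it only tests whether a field is declared, not whether setAccessible would be permitted.

```java
// Sketch only: class and method names are illustrative.
final class StringFieldProbe {

    // True if `type` itself declares a field with the given name.
    static boolean hasDeclaredField(Class<?> type, String name) {
        try {
            type.getDeclaredField(name);
            return true;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The four fields the reflection-based strip() looks up.
        for (String name : new String[] {"value", "offset", "count", "hash"}) {
            System.out.println(name + ": " + hasDeclaredField(String.class, name));
        }
    }
}
```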
