Why is StringBuilder#append(int) faster in Java 7 than in Java 8?

While investigating whether to use "" + n or Integer.toString(int) to convert an integer primitive to a string, I wrote this JMH microbenchmark:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Fork(1)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class IntStr {
    protected int counter;

    // @GenerateMicroBenchmark is the JMH 0.x annotation; later JMH releases renamed it @Benchmark.
    @GenerateMicroBenchmark
    public String integerToString() {
        return Integer.toString(this.counter++);
    }

    @GenerateMicroBenchmark
    public String stringBuilder0() {
        return new StringBuilder().append(this.counter++).toString();
    }

    @GenerateMicroBenchmark
    public String stringBuilder1() {
        return new StringBuilder().append("").append(this.counter++).toString();
    }

    @GenerateMicroBenchmark
    public String stringBuilder2() {
        return new StringBuilder().append("").append(Integer.toString(this.counter++)).toString();
    }

    @GenerateMicroBenchmark
    public String stringFormat() {
        return String.format("%d", this.counter++);
    }

    // Reset the counter before every measurement iteration.
    @Setup(Level.Iteration)
    public void prepareIteration() {
        this.counter = 0;
    }
}

I ran it with the default JMH options on both Java VMs installed on my Linux machine (up-to-date Mageia 4 64-bit, Intel i7-3770 CPU, 32 GB RAM). The first JVM was the one supplied with Oracle JDK 8u5 64-bit:

java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

With this JVM I got pretty much what I expected:

Benchmark                    Mode   Samples         Mean   Mean error    Units
b.IntStr.integerToString    thrpt        20    32317.048      698.703   ops/ms
b.IntStr.stringBuilder0     thrpt        20    28129.499      421.520   ops/ms
b.IntStr.stringBuilder1     thrpt        20    28106.692     1117.958   ops/ms
b.IntStr.stringBuilder2     thrpt        20    20066.939     1052.937   ops/ms
b.IntStr.stringFormat       thrpt        20     2346.452       37.422   ops/ms

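In other words, using the StringBuilder class is slower because of the additional overhead of creating the StringBuilder object and appending an empty string. Using String.format(String, ...) is even slower, by an order of magnitude or so.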

The distribution-provided JVM, on the other hand, is based on OpenJDK 1.7:

java version "1.7.0_55"
OpenJDK Runtime Environment (mageia-2.4.7.1.mga4-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

The results here were interesting:

Benchmark                    Mode   Samples         Mean   Mean error    Units
b.IntStr.integerToString    thrpt        20    31249.306      881.125   ops/ms
b.IntStr.stringBuilder0     thrpt        20    39486.857      663.766   ops/ms
b.IntStr.stringBuilder1     thrpt        20    41072.058      484.353   ops/ms
b.IntStr.stringBuilder2     thrpt        20    20513.913      466.130   ops/ms
b.IntStr.stringFormat       thrpt        20     2068.471       44.964   ops/ms

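Why does StringBuilder.append(int) appear so much faster with this JVM? Looking at the source code of the StringBuilder class revealed nothing particularly interesting: the method in question is almost identical to Integer#toString(int). Interestingly enough, appending the result of Integer.toString(int) (the stringBuilder2 microbenchmark) does not appear to be any faster.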

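Is this performance discrepancy a problem with the test harness? Or does my OpenJDK JVM contain optimizations that affect this particular code (anti-)pattern?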

EDIT:

For a more direct comparison, I installed Oracle JDK 1.7u55:

java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

The results are similar to those of OpenJDK:

Benchmark                    Mode   Samples         Mean   Mean error    Units
b.IntStr.integerToString    thrpt        20    32502.493      501.928   ops/ms
b.IntStr.stringBuilder0     thrpt        20    39592.174      428.967   ops/ms
b.IntStr.stringBuilder1     thrpt        20    40978.633      544.236   ops/ms

This seems to be a more general Java 7 vs. Java 8 issue. Perhaps Java 7 had more aggressive string optimizations?

EDIT 2:

For completeness, here are the string-related VM options for both of these JVMs:

For Oracle JDK 8u5:

$ /usr/java/default/bin/java -XX:+PrintFlagsFinal 2>/dev/null | grep String
     bool OptimizeStringConcat                = true     {C2 product}
     intx PerfMaxStringConstLength            = 1024     {product}
     bool PrintStringTableStatistics          = false    {product}
    uintx StringTableSize                     = 60013    {product}

For OpenJDK 1.7:

$ java -XX:+PrintFlagsFinal 2>/dev/null | grep String
     bool OptimizeStringConcat                = true     {C2 product}
     intx PerfMaxStringConstLength            = 1024     {product}
     bool PrintStringTableStatistics          = false    {product}
    uintx StringTableSize                     = 60013    {product}
     bool UseStringCache                      = false    {product}

The UseStringCache option was removed in Java 8 with no replacement, so I doubt it makes any difference. The remaining options appear to have the same settings.
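If you would rather compare these flags from inside a running JVM than by grepping -XX:+PrintFlagsFinal output, a minimal sketch using the com.sun.management.HotSpotDiagnosticMXBean API could look like the following. It assumes a HotSpot-based JVM (Java 7 or later) that exposes this MXBean; the class name StringFlags and its flagValue helper are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads VM flags of the running JVM via the HotSpot diagnostic MXBean.
final class StringFlags {
    // Returns the flag's current value as a string, or null if this VM has no such flag.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            VMOption option = bean.getVMOption(name);
            return option.getValue();
        } catch (IllegalArgumentException e) {
            // getVMOption throws this when the flag does not exist in this VM.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {"OptimizeStringConcat", "UseStringCache", "StringTableSize"};
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

On a Java 8 VM, UseStringCache should come back as null, since the flag no longer exists there.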

EDIT 3:

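A side-by-side comparison of the source code of the AbstractStringBuilder, StringBuilder and Integer classes from the src.zip file reveals nothing noteworthy. Apart from a lot of cosmetic and documentation changes, Integer now has some support for unsigned integers, and StringBuilder has been slightly refactored to share more code with StringBuffer. None of these changes seem to affect the code paths used by StringBuilder#append(int), although I may have missed something.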

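Comparing the assembly code generated for IntStr#integerToString() and IntStr#stringBuilder0() is far more interesting. The basic layout of the code generated for IntStr#integerToString() was similar for both JVMs, although Oracle JDK 8u5 seemed to be more aggressive about inlining some calls within the Integer#toString(int) code. There was a clear correspondence with the Java source code, even for someone with minimal assembly experience.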

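The assembly code for IntStr#stringBuilder0(), however, was radically different. The code generated by Oracle JDK 8u5 was once again directly related to the Java source code; I could easily recognise the same layout. In contrast, the code generated by OpenJDK 7 was almost unrecognisable to the untrained eye (like mine). The new StringBuilder() call had seemingly been removed, as had the creation of the array in the StringBuilder constructor. Additionally, the disassembler plugin was not able to provide as many references to the source code as it did with JDK 8.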
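To make that observation concrete, here is a hand-written Java approximation of what code without the StringBuilder and its growable array might do: compute the exact length of the result first, fill one exactly-sized char array, and build the String from it. This is only an illustration under the assumption that the optimized code works roughly this way; the class and method names are hypothetical, and unlike JIT-generated code, plain Java still copies the array inside new String(char[]).

```java
// Hypothetical sketch: format an int into a single, exactly-sized char array,
// with no intermediate StringBuilder and no buffer resizing.
final class DirectIntFormat {
    // Number of chars needed for the decimal form of v, including a '-' sign.
    static int decimalLength(int v) {
        long x = v;           // widen so that negating Integer.MIN_VALUE cannot overflow
        int len = 1;
        if (x < 0) {
            len++;
            x = -x;
        }
        while (x >= 10) {
            x /= 10;
            len++;
        }
        return len;
    }

    static String format(int v) {
        char[] buf = new char[decimalLength(v)]; // the one buffer, sized up front
        long x = v;
        boolean negative = x < 0;
        if (negative) {
            x = -x;
        }
        int pos = buf.length;
        do {                                     // write digits from the right end
            buf[--pos] = (char) ('0' + (x % 10));
            x /= 10;
        } while (x != 0);
        if (negative) {
            buf[--pos] = '-';
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(format(-12345));
        System.out.println(format(Integer.MIN_VALUE));
    }
}
```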

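I assume that this is either the result of a much more aggressive optimization pass in OpenJDK 7 or, more probably, the result of inserting hand-written low-level code for certain StringBuilder operations. I am unsure why this optimization does not happen in my JVM 8 implementation, or why the same optimizations were not implemented for Integer#toString(int) in JVM 7. I guess someone familiar with the relevant parts of the JRE source code would have to answer these questions...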


Best answer

TL;DR: Side effects in append apparently break StringConcat optimizations.

Very good analysis in the original question and updates!

For completeness, below are a few missing steps:

Look through the -XX:+PrintInlining output for both 7u55 and 8u5 (PrintInlining is a diagnostic flag, so it also needs -XX:+UnlockDiagnosticVMOptions). In 7u55, you will see something like this:

 @ 16   org.sample.IntStr::inlineSideEffect (25 bytes)   force inline by CompilerOracle
   @ 4   java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
   @ 18   java.lang.StringBuilder::append (8 bytes)   already compiled into a big method
   @ 21   java.lang.StringBuilder::toString (17 bytes)   inline (hot)

...and in 8u5:

 @ 16   org.sample.IntStr::inlineSideEffect (25 bytes)   force inline by CompilerOracle
   @ 4   java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
     @ 3   java.lang.AbstractStringBuilder::<init> (12 bytes)   inline (hot)
       @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
   @ 18   java.lang.StringBuilder::append (8 bytes)   inline (hot)
     @ 2   java.lang.AbstractStringBuilder::append (62 bytes)   already compiled into a big method
   @ 21   java.lang.StringBuilder::toString (17 bytes)   inline (hot)
     @ 13   java.lang.String::<init> (62 bytes)   inline (hot)
       @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
       @ 55   java.util.Arrays::copyOfRange (63 bytes)   inline (hot)
         @ 54   java.lang.Math::min (11 bytes)   (intrinsic)
         @ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)

You might notice that the 7u55 version is shallower: it looks like nothing is called after the StringBuilder methods, which is a good indication that the string optimizations are in effect. Indeed, if you run 7u55 with -XX:-OptimizeStringConcat, the subcalls reappear, and performance drops to 8u5 levels.

OK, so we need to figure out why 8u5 does not do the same optimization. Grep http://hg.openjdk.java.net/jdk9/jdk9/hotspot for "StringBuilder" to find where the VM handles the StringConcat optimization; this will lead you to src/share/vm/opto/stringopts.cpp.

Run hg log src/share/vm/opto/stringopts.cpp to see the latest changes there. One of the candidates would be:

changeset:   5493:90abdd727e64
user:        iveresov
date:        Wed Oct 16 11:13:15 2013 -0700
summary:     8009303: Tiered: incorrect results in VM tests stringconcat...

Look for the review threads on the OpenJDK mailing lists (easy enough to google for the changeset summary): http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2013-October/012084.html
Spot "String concat optimization optimization collapses the pattern [...] into a single allocation of a string and forming the result directly. All possible deopts that may happen in the optimized code restart this pattern from the beginning (starting from the StringBuffer allocation). That means that the whole pattern must me side-effect free." Eureka?
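Why a side effect inside the chain matters can be seen with a toy Java model. It imitates "restart this pattern from the beginning" by simply evaluating the chain a second time; it is not how C2 actually deoptimizes, and the names RestartHazard and runWithRestart are hypothetical. The point is only that replaying a region that contains counter++ runs the increment twice, while replaying a region that merely reads a local is harmless.

```java
import java.util.function.Supplier;

// Toy model only: a "restart" is modelled as re-running the whole region.
final class RestartHazard {
    int counter;

    // The side effect (counter++) lives inside the StringBuilder chain.
    String inlineSideEffect() {
        return new StringBuilder().append(counter++).toString();
    }

    // The side effect happened earlier; the chain only reads a local value.
    String spliceSideEffect(int cnt) {
        return new StringBuilder().append(cnt).toString();
    }

    // Evaluates the region; if bailOut is set, discards the result and replays it from scratch.
    static String runWithRestart(Supplier<String> region, boolean bailOut) {
        String result = region.get();
        if (bailOut) {
            result = region.get();
        }
        return result;
    }

    public static void main(String[] args) {
        RestartHazard a = new RestartHazard();
        String r1 = runWithRestart(a::inlineSideEffect, true);
        System.out.println(r1 + " counter=" + a.counter); // prints "1 counter=2": incremented twice

        RestartHazard b = new RestartHazard();
        int cnt = b.counter++;
        String r2 = runWithRestart(() -> b.spliceSideEffect(cnt), true);
        System.out.println(r2 + " counter=" + b.counter); // prints "0 counter=1": replay is harmless
    }
}
```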

Write out the contrasting benchmark:

package org.sample;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Fork(5)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class IntStr {
    private int counter;

    // Side effect (counter++) inside the StringBuilder chain.
    @GenerateMicroBenchmark
    public String inlineSideEffect() {
        return new StringBuilder().append(counter++).toString();
    }

    // Side effect moved out of the chain into a local variable.
    @GenerateMicroBenchmark
    public String spliceSideEffect() {
        int cnt = counter++;
        return new StringBuilder().append(cnt).toString();
    }
}

Measure it on JDK 7u55 and see the same performance for the inlined and spliced side effects:

Benchmark                       Mode   Samples         Mean   Mean error    Units
o.s.IntStr.inlineSideEffect     avgt        25       65.460        1.747    ns/op
o.s.IntStr.spliceSideEffect     avgt        25       64.414        1.323    ns/op

Measure it on JDK 8u5 and see the performance degradation with the inlined side effect:

Benchmark                       Mode   Samples         Mean   Mean error    Units
o.s.IntStr.inlineSideEffect     avgt        25       84.953        2.274    ns/op
o.s.IntStr.spliceSideEffect     avgt        25       65.386        1.194    ns/op

Submit the bug report (https://bugs.openjdk.java.net/browse/JDK-8043677) to discuss this behavior with the VM guys. The rationale for the original fix is rock solid; it is interesting, however, whether we can or should get this optimization back in some trivial cases like these.
???
PROFIT.

And yeah, I should post the results for the benchmark that moves the increment out of the StringBuilder chain, doing it before the entire chain. I also switched to average time and ns/op. This is JDK 7u55:

Benchmark                      Mode   Samples         Mean   Mean error    Units
o.s.IntStr.integerToString     avgt        25      153.805        1.093    ns/op
o.s.IntStr.stringBuilder0      avgt        25      128.284        6.797    ns/op
o.s.IntStr.stringBuilder1      avgt        25      131.524        3.116    ns/op
o.s.IntStr.stringBuilder2      avgt        25      254.384        9.204    ns/op
o.s.IntStr.stringFormat        avgt        25     2302.501      103.032    ns/op

And this is 8u5:

Benchmark                      Mode   Samples         Mean   Mean error    Units
o.s.IntStr.integerToString     avgt        25      153.032        3.295    ns/op
o.s.IntStr.stringBuilder0      avgt        25      127.796        1.158    ns/op
o.s.IntStr.stringBuilder1      avgt        25      131.585        1.137    ns/op
o.s.IntStr.stringBuilder2      avgt        25      250.980        2.773    ns/op
o.s.IntStr.stringFormat        avgt        25     2123.706       25.105    ns/op

stringFormat is actually a bit faster in 8u5, and all the other tests are the same. This solidifies the hypothesis that the side-effect breakage in SB chains is the major culprit in the original question.
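The answer does not show the modified benchmark code. Assuming "moving the increment" simply means bumping the counter into a local variable before the string expression, the five methods could look roughly like this. It is plain Java with the JMH annotations left out so it runs on its own, and the class name HoistedIncrement is hypothetical.

```java
// Hypothetical sketch of the original IntStr methods with the increment hoisted
// out of each string expression (JMH annotations omitted).
class HoistedIncrement {
    private int counter;

    public String integerToString() {
        int n = counter++;                 // side effect happens before any string code
        return Integer.toString(n);
    }

    public String stringBuilder0() {
        int n = counter++;
        return new StringBuilder().append(n).toString();
    }

    public String stringBuilder1() {
        int n = counter++;
        return new StringBuilder().append("").append(n).toString();
    }

    public String stringBuilder2() {
        int n = counter++;
        return new StringBuilder().append("").append(Integer.toString(n)).toString();
    }

    public String stringFormat() {
        int n = counter++;
        return String.format("%d", n);
    }
}
```

Each StringBuilder chain now only reads a local, which is the side-effect-free shape the StringConcat optimization requires.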