Why is long slower than int in x64 Java?

I'm running Windows 8.1 x64 on a Surface Pro 2 tablet, with Java 7 update 45 x64 (no 32-bit Java installed).

The code below takes 1688 ms when the type of i is long and 109 ms when i is an int. Why is long (a 64-bit type) an order of magnitude slower than int on a 64-bit JVM?

My only guess is that the CPU takes longer to add a 64-bit integer than a 32-bit one, but that seems unlikely. I doubt Haswell uses a ripple-carry adder.

Incidentally, I'm running this in Eclipse Kepler SR1.

public class Main {

    private static long i = Integer.MAX_VALUE;

    public static void main(String[] args) {
        System.out.println("Starting the loop");
        long startTime = System.currentTimeMillis();
        while (!decrementAndCheck()) {
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheck() {
        return --i < 0;
    }
}

Edit: Here are the results from the equivalent C++ code (below), compiled with VS2013 on the same system. Those results were in 32-bit debug mode.

In 64-bit release mode: long: 875ms, long long: 906ms, int: 1047ms

This suggests that what I observed is a quirk of JVM optimization rather than a CPU limitation.

#include "stdafx.h"
#include <iostream>
#include <windows.h>
#include <limits.h>

using namespace std;

long long i = INT_MAX;

bool decrementAndCheck() {
    return --i < 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "Starting the loop" << endl;

    // GetTickCount64 returns a 64-bit tick count, so use a 64-bit type
    unsigned long long startTime = GetTickCount64();
    while (!decrementAndCheck()) {
    }
    unsigned long long endTime = GetTickCount64();

    cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl;
    return 0;
}

Edit: Just tried this again on the Java 8 RTM; no significant change.


The basic unit of data in a Java Virtual Machine is the word. The word size is left to the JVM implementation, which must choose a size of at least 32 bits. It may pick a larger word size for efficiency, and there is no requirement that a 64-bit JVM use 64-bit words.

Nor does the word size have to match the underlying architecture's. The JVM reads and writes data word by word, which is why a long might take longer than an int.

Here you can find more on the same topic.

The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. The JVM implementer may use 64-bit words, but the bytecode can't rely on this, and so operations with long or double values have to be handled with extra care. In particular, the JVM integer branch instructions are defined on exactly the type int.

In the case of your code, disassembly is instructive. Here's the bytecode for the int version as compiled by the Oracle JDK 7:

private static boolean decrementAndCheck();
Code:
0: getstatic     #14  // Field i:I
3: iconst_1
4: isub
5: dup
6: putstatic     #14  // Field i:I
9: ifge          16
12: iconst_1
13: goto          17
16: iconst_0
17: ireturn

Note that the JVM will load the value of your static i (0), subtract one (3-4), duplicate the value on the stack (5), and push it back into the variable (6). It then does a compare-with-zero branch and returns.

The version with the long is a bit more complicated:

private static boolean decrementAndCheck();
Code:
0: getstatic     #14  // Field i:J
3: lconst_1
4: lsub
5: dup2
6: putstatic     #14  // Field i:J
9: lconst_0
10: lcmp
11: ifge          18
14: iconst_1
15: goto          19
18: iconst_0
19: ireturn

First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it's quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you'll notice that the branch logic is longer here. The JVM doesn't have an instruction to compare a long with zero, so it has to push a constant 0L onto the stack (9), do a general long comparison (10), and then branch on the value of that calculation.

Here are two plausible scenarios:

  • The JVM is following the bytecode path exactly. In this case, it's doing more work in the long version, pushing and popping several extra values, and these are on the virtual managed stack, not the real hardware-assisted CPU stack. If this is the case, you'll still see a significant performance difference after warmup.
  • The JVM realizes that it can optimize this code. In this case, it's taking extra time to optimize away some of the practically unnecessary push/compare logic. If this is the case, you'll see very little performance difference after warmup.

I recommend writing a correct microbenchmark to eliminate the effect of the JIT kicking in, and also trying this with a final condition that isn't zero, to force the JVM to do the same comparison on the int that it does with the long.
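A sketch of that second suggestion (names and start values are mine, not from the post): terminating at a non-zero bound means the int version can no longer use the JVM's single-operand compare-with-zero branches (iflt/ifge), so javac must emit a two-operand compare (if_icmplt), making the branch logic more comparable to the long version's lcmp sequence.

```java
public class NonZeroBound {
    private static int i;

    // Compare against -42 instead of 0: javac emits a two-operand
    // if_icmplt here, not the single-operand iflt used for "< 0".
    private static boolean decrementAndCheck() {
        return --i < -42;
    }

    // Run the countdown from the given start; returns the final value of i.
    static int run(int start) {
        i = start;
        while (!decrementAndCheck()) {
        }
        return i;
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        // A shortened countdown for a quick demonstration; the original
        // question counted down from Integer.MAX_VALUE.
        System.out.println("final i = " + run(100_000_000));
        System.out.println((System.currentTimeMillis() - t0) + "ms");
    }
}
```

The loop exits on the first decrement that takes i below -42, so the final value is always -43 regardless of the start.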

I don't have a 64 bit machine to test with, but the rather large difference suggests that there is more than the slightly longer bytecode at work.

I see very close times for long/int (4400 vs 4800ms) on my 32-bit 1.7.0_45.

This is only a guess, but I strongly suspect that it is the effect of a memory misalignment penalty. To confirm/deny the suspicion, try adding a public static int dummy = 0; before the declaration of i. That will push i down by 4 bytes in memory layout and may make it properly aligned for better performance. Confirmed to be not causing the issue.

EDIT: The reasoning behind this is that the VM may not be free to reorder fields or add padding for optimal alignment, since that could interfere with JNI (not the case, as it turns out).
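For concreteness, the probe suggested above could look like the following (a sketch; as noted, the experiment ruled this hypothesis out, since HotSpot is free to reorder and pad fields):

```java
public class AlignmentProbe {
    public static int dummy = 0; // candidate 4-byte shim declared before i
    private static long i = Integer.MAX_VALUE;

    private static boolean decrementAndCheck() {
        return --i < 0;
    }

    // Runs the countdown from the given start; returns the final value of i.
    static long countdown(long start) {
        i = start;
        long t0 = System.currentTimeMillis();
        while (!decrementAndCheck()) {
        }
        System.out.println((System.currentTimeMillis() - t0) + "ms");
        return i;
    }

    public static void main(String[] args) {
        // Shortened countdown for a quick run; the original used
        // Integer.MAX_VALUE.
        countdown(100_000_000);
    }
}
```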

For the record, this version does a crude "warmup":

public class LongSpeed {

    private static long i = Integer.MAX_VALUE;
    private static int j = Integer.MAX_VALUE;

    public static void main(String[] args) {
        for (int x = 0; x < 10; x++) {
            runLong();
            runWord();
        }
    }

    private static void runLong() {
        System.out.println("Starting the long loop");
        i = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while (!decrementAndCheckI()) {
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
    }

    private static void runWord() {
        System.out.println("Starting the word loop");
        j = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while (!decrementAndCheckJ()) {
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheckI() {
        return --i < 0;
    }

    private static boolean decrementAndCheckJ() {
        return --j < 0;
    }
}

The overall times improve about 30%, but the ratio between the two remains roughly the same.

My JVM does this pretty straightforward thing to the inner loop when you use longs:

0x00007fdd859dbb80: test   %eax,0x5f7847a(%rip)  /* fun JVM hack */
0x00007fdd859dbb86: dec    %r11                  /* i-- */
0x00007fdd859dbb89: mov    %r11,0x258(%r10)      /* store i to memory */
0x00007fdd859dbb90: test   %r11,%r11             /* unnecessary test */
0x00007fdd859dbb93: jge    0x00007fdd859dbb80    /* go back to the loop top */

It cheats, hard, when you use ints; first there's some screwiness that I don't claim to understand but looks like setup for an unrolled loop:

0x00007f3dc290b5a1: mov    %r11d,%r9d
0x00007f3dc290b5a4: dec    %r9d
0x00007f3dc290b5a7: mov    %r9d,0x258(%r10)
0x00007f3dc290b5ae: test   %r9d,%r9d
0x00007f3dc290b5b1: jl     0x00007f3dc290b662
0x00007f3dc290b5b7: add    $0xfffffffffffffffe,%r11d
0x00007f3dc290b5bb: mov    %r9d,%ecx
0x00007f3dc290b5be: dec    %ecx
0x00007f3dc290b5c0: mov    %ecx,0x258(%r10)
0x00007f3dc290b5c7: cmp    %r11d,%ecx
0x00007f3dc290b5ca: jle    0x00007f3dc290b5d1
0x00007f3dc290b5cc: mov    %ecx,%r9d
0x00007f3dc290b5cf: jmp    0x00007f3dc290b5bb
0x00007f3dc290b5d1: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b5d5: mov    %r9d,%r8d
0x00007f3dc290b5d8: neg    %r8d
0x00007f3dc290b5db: sar    $0x1f,%r8d
0x00007f3dc290b5df: shr    $0x1f,%r8d
0x00007f3dc290b5e3: sub    %r9d,%r8d
0x00007f3dc290b5e6: sar    %r8d
0x00007f3dc290b5e9: neg    %r8d
0x00007f3dc290b5ec: and    $0xfffffffffffffffe,%r8d
0x00007f3dc290b5f0: shl    %r8d
0x00007f3dc290b5f3: mov    %r8d,%r11d
0x00007f3dc290b5f6: neg    %r11d
0x00007f3dc290b5f9: sar    $0x1f,%r11d
0x00007f3dc290b5fd: shr    $0x1e,%r11d
0x00007f3dc290b601: sub    %r8d,%r11d
0x00007f3dc290b604: sar    $0x2,%r11d
0x00007f3dc290b608: neg    %r11d
0x00007f3dc290b60b: and    $0xfffffffffffffffe,%r11d
0x00007f3dc290b60f: shl    $0x2,%r11d
0x00007f3dc290b613: mov    %r11d,%r9d
0x00007f3dc290b616: neg    %r9d
0x00007f3dc290b619: sar    $0x1f,%r9d
0x00007f3dc290b61d: shr    $0x1d,%r9d
0x00007f3dc290b621: sub    %r11d,%r9d
0x00007f3dc290b624: sar    $0x3,%r9d
0x00007f3dc290b628: neg    %r9d
0x00007f3dc290b62b: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b62f: shl    $0x3,%r9d
0x00007f3dc290b633: mov    %ecx,%r11d
0x00007f3dc290b636: sub    %r9d,%r11d
0x00007f3dc290b639: cmp    %r11d,%ecx
0x00007f3dc290b63c: jle    0x00007f3dc290b64f
0x00007f3dc290b63e: xchg   %ax,%ax /* OK, fine; I know what a nop looks like */

then the unrolled loop itself:

0x00007f3dc290b640: add    $0xfffffffffffffff0,%ecx
0x00007f3dc290b643: mov    %ecx,0x258(%r10)
0x00007f3dc290b64a: cmp    %r11d,%ecx
0x00007f3dc290b64d: jg     0x00007f3dc290b640

then the teardown code for the unrolled loop, itself a test and a straight loop:

0x00007f3dc290b64f: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b652: jle    0x00007f3dc290b662
0x00007f3dc290b654: dec    %ecx
0x00007f3dc290b656: mov    %ecx,0x258(%r10)
0x00007f3dc290b65d: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b660: jg     0x00007f3dc290b654

So it goes 16 times faster for ints because the JIT unrolled the int loop 16 times, but didn't unroll the long loop at all.
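Morally, the JIT rewrote the int countdown into something like the following hand-unrolled Java (a sketch with names of my choosing; the real compiled code also handles the setup arithmetic shown above):

```java
public class UnrollSketch {
    // Scalar version: one decrement per iteration, as written in the source.
    static int decrementsScalar(int start) {
        int i = start;
        int count = 0;
        while (!(--i < 0)) { count++; }
        return count + 1; // include the final decrement that made i negative
    }

    // Sketch of the JIT's transformation: a main loop taking 16 decrements
    // per iteration, then a scalar teardown loop for the leftovers, mirroring
    // the "add -16" loop and the dec-based teardown in the disassembly.
    static int decrementsUnrolled(int start) {
        int i = start;
        int count = 0;
        while (i - 16 >= -1) { // safe to take 16 steps at once
            i -= 16;
            count += 16;
        }
        while (i > -1) {       // teardown, one step at a time
            i--;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(decrementsScalar(100) + " " + decrementsUnrolled(100));
        // prints "101 101": both versions perform the same number of decrements
    }
}
```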

For completeness, here is the code I actually tried:

public class foo136 {
    private static int i = Integer.MAX_VALUE;

    public static void main(String[] args) {
        System.out.println("Starting the loop");
        for (int foo = 0; foo < 100; foo++)
            doit();
    }

    static void doit() {
        i = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while (!decrementAndCheck()) {
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheck() {
        return --i < 0;
    }
}

The assembly dumps were generated using the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess around with your JVM installation to have this work for you as well; you need to put the hsdis disassembler shared library in exactly the right place or it will fail.

For the record:

if I use

static long lo = Integer.MAX_VALUE;

static boolean decrementAndCheckLong() {
    lo = lo - 1L;
    return lo < -1L;
}

(changed "lo--" to "lo = lo - 1L"), long performance improves by ~50%

I have just written a benchmark using caliper.

The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu or something very similar is going on.

timeIntDecrements         195,266,845.000
timeLongDecrements      2,321,447,978.000

This is my code; note that it uses a freshly-built snapshot of caliper, since I could not figure out how to code against their existing beta release.

package test;

import com.google.caliper.Benchmark;
import com.google.caliper.Param;

public final class App {

    @Param({"" + 1}) int number;

    private static class IntTest {
        public static int v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    private static class LongTest {
        public static long v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    @Benchmark
    int timeLongDecrements(int reps) {
        int k = 0;
        for (int i = 0; i < reps; i++) {
            LongTest.reset();
            while (!LongTest.decrementAndCheck()) { k++; }
        }
        return (int) LongTest.v | k;
    }

    @Benchmark
    int timeIntDecrements(int reps) {
        int k = 0;
        for (int i = 0; i < reps; i++) {
            IntTest.reset();
            while (!IntTest.decrementAndCheck()) { k++; }
        }
        return IntTest.v | k;
    }
}

It's likely due to the JVM checking for safepoints when long is used (an uncounted loop), and not doing so for int (a counted loop).

Some references: https://stackoverflow.com/a/62557768/14624235

https://stackoverflow.com/a/58726530/14624235

http://psy-lob-saw.blogspot.com/2016/02/wait-for-it-counteduncounted-loops.html
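A minimal sketch of the two loop shapes (illustrative names; the safepoint-poll difference is a property of the JIT-compiled code described in the links above, not something visible in the Java source):

```java
public class LoopShapes {
    // A counted loop: int induction variable with a simple bound and stride.
    // HotSpot can bound the trip count and omit per-iteration safepoint polls.
    static long sumInt(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i;
        return s;
    }

    // An uncounted loop: long induction variable. Historically HotSpot kept
    // a safepoint poll in every iteration here; newer JVMs mitigate this
    // with loop strip mining.
    static long sumLong(long n) {
        long s = 0;
        for (long i = 0; i < n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sumInt(10) + " " + sumLong(10)); // prints "45 45"
    }
}
```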