Why is this C++ program so incredibly fast?

I wrote a small benchmark to compare the performance of different interpreters/compilers for Python, Ruby, JavaScript and C++. As expected, (optimized) C++ beats the scripting languages, but the factor by which it does so is incredibly high.

The results are:

sven@jet:~/tmp/js$ time node bla.js              # * JavaScript with node *
0


real    0m1.222s
user    0m1.190s
sys 0m0.015s
sven@jet:~/tmp/js$ time ruby foo.rb              # * Ruby *
0


real    0m52.428s
user    0m52.395s
sys 0m0.028s
sven@jet:~/tmp/js$ time python blub.py           # * Python with CPython *
0


real    1m16.480s
user    1m16.371s
sys 0m0.080s


sven@jet:~/tmp/js$ time pypy blub.py             # * Python with PyPy *
0


real    0m4.707s
user    0m4.579s
sys 0m0.028s


sven@jet:~/tmp/js$ time ./cpp_non_optimized 1000 1000000 # * C++ with -O0 (gcc) *
0


real    0m1.702s
user    0m1.699s
sys 0m0.002s
sven@jet:~/tmp/js$ time ./cpp_optimized 1000 1000000     # * C++ with -O3 (gcc) *
0


real    0m0.003s # (!!!) <---------------------------------- WHY?
user    0m0.002s
sys 0m0.002s

I'm wondering whether anyone can explain why the optimized C++ code is more than three orders of magnitude faster than everything else.

The C++ benchmark takes its loop bounds as command-line arguments to prevent the result from being precomputed at compile time.
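To illustrate what I wanted to avoid: with the bounds hard-coded (a hypothetical sketch, not the benchmarked program), the compiler would be free to fold the whole computation into a constant at compile time:

#include <iostream>
#include <cstdint>

// Hypothetical variant with compile-time-known bounds; a
// constant-folding compiler can reduce this to printing 0.
int main() {
    uint32_t s = 0;
    const uint32_t outer = 1000;
    const uint32_t inner = 1000000;
    for (uint32_t i = 0; i < outer; ++i) {
        for (uint32_t j = 0; j < inner; ++j)
            ++s;
        s -= inner;
    }
    std::cout << s << std::endl;
    return 0;
}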

Below I've included the source code of the benchmark for each language; they should be semantically equivalent. I've also included the assembly produced for the optimized C++ build (using gcc). Looking at the optimized assembly, it seems the compiler merged the two loops of the benchmark into a single one, but a loop still remains!

JavaScript:

var s = 0;
var outer = 1000;
var inner = 1000000;

for (var i = 0; i < outer; ++i) {
    for (var j = 0; j < inner; ++j) {
        ++s;
    }
    s -= inner;
}
console.log(s);

Python:

s = 0
outer = 1000
inner = 1000000

for _ in xrange(outer):
    for _ in xrange(inner):
        s += 1
    s -= inner
print s

Ruby:

s = 0
outer = 1000
inner = 1000000

outer_end = outer - 1
inner_end = inner - 1

for i in 0..outer_end
  for j in 0..inner_end
    s = s + 1
  end
  s = s - inner
end
puts s

C++:

#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
    uint32_t s = 0;
    uint32_t outer = atoi(argv[1]);
    uint32_t inner = atoi(argv[2]);
    for (uint32_t i = 0; i < outer; ++i) {
        for (uint32_t j = 0; j < inner; ++j)
            ++s;
        s -= inner;
    }
    std::cout << s << std::endl;
    return 0;
}

Assembly (generated by compiling the C++ code above with gcc -S -O3 -std=c++0x):

    .file   "bar.cpp"
    .section    .text.startup,"ax",@progbits
    .p2align 4,,15
    .globl  main
    .type   main, @function
main:
.LFB1266:
    .cfi_startproc
    pushq   %r12
    .cfi_def_cfa_offset 16
    .cfi_offset 12, -16
    movl    $10, %edx
    movq    %rsi, %r12
    pushq   %rbp
    .cfi_def_cfa_offset 24
    .cfi_offset 6, -24
    pushq   %rbx
    .cfi_def_cfa_offset 32
    .cfi_offset 3, -32
    movq    8(%rsi), %rdi
    xorl    %esi, %esi
    call    strtol
    movq    16(%r12), %rdi
    movq    %rax, %rbp
    xorl    %esi, %esi
    movl    $10, %edx
    call    strtol
    testl   %ebp, %ebp
    je  .L6
    movl    %ebp, %ebx
    xorl    %eax, %eax
    xorl    %edx, %edx
    .p2align 4,,10
    .p2align 3
.L3:                                 # <--- Here is the loop
    addl    $1, %eax                 # <---
    cmpl    %eax, %ebx               # <---
    ja  .L3                          # <---
.L2:
    movl    %edx, %esi
    movl    $_ZSt4cout, %edi
    call    _ZNSo9_M_insertImEERSoT_
    movq    %rax, %rdi
    call    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
    popq    %rbx
    .cfi_remember_state
    .cfi_def_cfa_offset 24
    popq    %rbp
    .cfi_def_cfa_offset 16
    xorl    %eax, %eax
    popq    %r12
    .cfi_def_cfa_offset 8
    ret
.L6:
    .cfi_restore_state
    xorl    %edx, %edx
    jmp .L2
    .cfi_endproc
.LFE1266:
    .size   main, .-main
    .p2align 4,,15
    .type   _GLOBAL__sub_I_main, @function
_GLOBAL__sub_I_main:
.LFB1420:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $_ZStL8__ioinit, %edi
    call    _ZNSt8ios_base4InitC1Ev
    movl    $__dso_handle, %edx
    movl    $_ZStL8__ioinit, %esi
    movl    $_ZNSt8ios_base4InitD1Ev, %edi
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    jmp __cxa_atexit
    .cfi_endproc
.LFE1420:
    .size   _GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main
    .section    .init_array,"aw"
    .align 8
    .quad   _GLOBAL__sub_I_main
    .local  _ZStL8__ioinit
    .comm   _ZStL8__ioinit,1,1
    .hidden __dso_handle
    .ident  "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
    .section    .note.GNU-stack,"",@progbits

I didn't do a complete analysis of the assembly, but it looks like the compiler reduced the inner loop to a single addition of inner to s, and then figured out that, together with the subsequent subtraction of inner, the whole thing is a no-op.

All that remains in the assembly is the outer loop, which just increments a counter until outer is reached. The compiler could have optimized that away as well, but apparently it didn't.

The optimizer has worked out that the inner loop along with the subsequent line is a no-op, and eliminated it. Unfortunately it hasn't managed to eliminate the outer loop as well.

Note that the node.js example is faster than the unoptimized C++ example, indicating that V8 (node's JIT compiler) has managed to eliminate at least one of the loops. However, its optimization has some overhead, as (like any JIT compiler) it must balance the opportunities for optimization and profile-guided re-optimization against the cost of doing so.

Is there a way to cache the JIT compiled code after it optimizes it, or does it have to re-optimize the code every time the program is run?

If I were writing this in Python, I'd try to reduce the code down in size to get an "overhead" view of what it is doing. For example, try writing it like this (much easier to read, IMO):

for _ in xrange(outer):
    innerS = sum(1 for _ in xrange(inner))
    s += innerS
    s -= innerS

or even s = sum(inner - inner for _ in xrange(outer))

Consider the loops from the C++ benchmark:

for (uint32_t i = 0; i < outer; ++i) {
    for (uint32_t j = 0; j < inner; ++j)
        ++s;
    s -= inner;
}

The inner loop is equivalent to "s += inner; j = inner;", a transformation a good optimising compiler can make. Since the variable j is unused after the loop, the whole code is equivalent to

for (uint32_t i = 0; i < outer; ++i) {
    s += inner;
    s -= inner;
}

Again, a good optimising compiler can remove the two changes to s, then remove the now-unused variable i, leaving nothing at all. Judging from the assembly above, that is mostly what happened: the updates to s were eliminated, although the (now empty) outer loop itself survived.
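In other words, what actually runs is roughly this (a sketch of the end state inferred from the assembly above, not actual compiler output):

for (uint32_t i = 0; i < outer; ++i) {
    // empty: the increments and the subtraction cancelled out
}
std::cout << s << std::endl;  // s is still 0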

Now it's up to you to decide how often an optimisation like this happens, and whether it is of any real-life benefit.
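If you want to force the compiler to actually perform the additions, one option is to declare s volatile, making every write an observable side effect the optimizer must preserve. A minimal sketch (volatile is used here purely as an optimization barrier):

#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
    volatile uint32_t s = 0;  // volatile: each ++s must really happen
    uint32_t outer = atoi(argv[1]);
    uint32_t inner = atoi(argv[2]);
    for (uint32_t i = 0; i < outer; ++i) {
        for (uint32_t j = 0; j < inner; ++j)
            ++s;
        s -= inner;
    }
    std::cout << s << std::endl;
    return 0;
}

With that change the -O3 timing should land much closer to the -O0 result, since the loops can no longer be eliminated.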

Even though the loops have plenty of iterations, the programs probably still aren't long-running enough to escape the overhead of interpreter/JVM/shell/etc.'s startup times. In some environments these can vary massively - in some cases *cough*Java*cough* taking several seconds before it gets anywhere near your actual code.

Ideally you would time the execution within each piece of code. It may be tricky to do this accurately across all languages, but even printing out clock time in ticks before and after would be better than using time, and would do the job since you probably aren't concerned with super-accurate timing here.
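For the C++ version, a minimal in-process timing sketch using std::chrono (steady_clock, so wall-clock adjustments don't skew the measured interval):

#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();
    // ... run the benchmark loops here ...
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;  // in seconds
    std::cout << "elapsed: " << elapsed.count() << " s\n";
    return 0;
}

The scripting languages have similar facilities; wrapping the loops in a pair of timestamp calls and printing the difference excludes interpreter startup from the measurement.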

(I guess this doesn't really relate to why the C++ example is so much faster - but it could explain some of the variability in the other results. :) ).