std::vector performance regression when enabling C++11


Best answer

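I can reproduce your results on my machine with the options you wrote in your post.

However, if I also enable link-time optimization (I also pass the -flto flag to gcc 4.7.2), the results are identical:

(I am compiling your original code, with container.push_back(Item());)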

$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out


Performance counter stats for './a.out' (10 runs):


35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
19,801 page-faults               #    0.559 M/sec
99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
141,947,224 instructions              #    1.43  insns per cycle
#    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]


0.035943226 seconds time elapsed                                          ( +-  1.79% )






$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out


Performance counter stats for './a.out' (10 runs):


35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
141,427,211 instructions              #    1.44  insns per cycle
#    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]


0.035953916 seconds time elapsed
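
For readers who do not have the question at hand, here is a minimal sketch of the kind of code being measured. It is a reconstruction, not the original regr.cpp: the Item layout (two ints) is taken from the Item.a and Item.b names in the assembly comments, the reserve call is my assumption based on the remark that the reallocation helper is never actually called, and the helper name fill is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Assumed element type: two ints, matching the Item.a / Item.b fields
// that show up in the assembly comments.
struct Item {
    int a;
    int b;
};

// Hypothetical helper (not from the question): reserve once, then append
// n value-initialized items with push_back.
std::vector<Item> fill(std::size_t n) {
    std::vector<Item> container;
    container.reserve(n);  // assumption: capacity is never exceeded afterwards
    for (std::size_t i = 0; i < n; ++i) {
        container.push_back(Item());  // the call under discussion
    }
    return container;
}
```

Calling fill with a count in the millions from main and compiling once with -std=c++98 and once with -std=c++11 should give a comparable experiment, although the exact numbers will depend on the compiler version.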

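As for the reason, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function
void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)
fails in C++11 mode with the default inline-limit.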

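This failed inline has a domino effect. Not because the function is being called (it is not even called!) but because we have to be prepared: if it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to fairly messy code.

Here is the relevant part of the generated code for the case where inlining succeeds: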

.L42:
testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
je  .L3 #,
movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
subq    $1, %rbp    #, ivtmp.106
je  .L41    #,
.L14:
cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
jne .L42    #,
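
Read as C++, the loop above does roughly the following. This is a hand-written sketch, not compiler output: the function name append_while_capacity is hypothetical, and finish and end_of_storage stand in for the vector's _M_finish and _M_end_of_storage members. The testq/je .L3 pair in the assembly looks like the null check that goes with placement new; the sketch leaves it out.

```cpp
#include <cstddef>
#include <new>

struct Item {
    int a;
    int b;
};

// Hypothetical helper mirroring the inlined fast path. Returns the new
// finish pointer; stops early if the storage runs out before count does.
Item* append_while_capacity(Item* finish, Item* end_of_storage, std::size_t count) {
    while (count != 0 && finish != end_of_storage) {  // cmpq ... / jne .L42
        ::new (static_cast<void*>(finish)) Item();    // the two movl $0 stores
        ++finish;                                     // addq $8, %rbx
        --count;                                      // subq $1, %rbp
    }
    return finish;
}
```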

This is a nice and compact for loop. Now, let's compare it with the failed inline case:

.L49:
testq   %rax, %rax  # D.15772
je  .L26    #,
movq    16(%rsp), %rdx  # D.13379, D.13379
movq    %rdx, (%rax)    # D.13379, *D.15772_60
.L26:
addq    $8, %rax    #, tmp75
subq    $1, %rbx    #, ivtmp.117
movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
je  .L48    #,
.L28:
movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b
jne .L49    #,
leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

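This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (the last line shown), the arguments must be placed appropriately: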

leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

Even though this call is never actually executed, the loop arranges things for it beforehand on every iteration:

movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b

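If inlining succeeds, there is no function call, we have only 2 move instructions in the loop, and there is no messing around with %rsp (the stack pointer). If inlining fails, however, we get 6 moves and we mess with %rsp a lot.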
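
The shape of the problem can be sketched with a toy container. Everything here is hypothetical and only illustrates the call structure: TinyVec and grow_and_append are made-up names, not libstdc++ internals, and __attribute__((noinline)) is a GCC/Clang extension that I use to force the "inlining failed" situation. The point is that grow_and_append takes Item&&, so the caller has to pass an address, which means the temporary has to exist in memory before the call, even on paths where the call is never reached.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

struct Item {
    int a;
    int b;
};

// Hypothetical minimal container; not how std::vector is implemented.
struct TinyVec {
    Item* start = nullptr;
    Item* finish = nullptr;
    Item* end_of_storage = nullptr;

    TinyVec() = default;
    TinyVec(const TinyVec&) = delete;
    TinyVec& operator=(const TinyVec&) = delete;
    ~TinyVec() { std::free(start); }

    // Out-of-line slow path. The reference parameter is passed as an
    // address, so the argument must live in memory at the call site.
    __attribute__((noinline)) void grow_and_append(Item&& item);

    void push_back(Item&& item) {
        if (finish != end_of_storage) {
            // Fast path: construct in place, no call needed.
            ::new (static_cast<void*>(finish)) Item(std::move(item));
            ++finish;
        } else {
            grow_and_append(std::move(item));
        }
    }
};

void TinyVec::grow_and_append(Item&& item) {
    const std::size_t old_size = static_cast<std::size_t>(finish - start);
    const std::size_t new_cap = old_size ? 2 * old_size : 1;
    Item* fresh = static_cast<Item*>(std::malloc(new_cap * sizeof(Item)));
    if (!fresh) throw std::bad_alloc();
    for (std::size_t i = 0; i < old_size; ++i) {
        ::new (static_cast<void*>(fresh + i)) Item(start[i]);  // copy old items
    }
    ::new (static_cast<void*>(fresh + old_size)) Item(std::move(item));
    std::free(start);
    start = fresh;
    finish = fresh + old_size + 1;
    end_of_storage = fresh + new_cap;
}
```

Whether a particular compiler really hoists the stores for the temporary into the hot loop, as gcc 4.7.2 did above, depends on the version and flags; comparing the -S output with and without the noinline attribute is one way to check.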

Just to confirm my theory (note the -finline-limit), both runs are in C++11 mode:

$ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out


Performance counter stats for './a.out' (10 runs):


84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
19,801 page-faults               #    0.234 M/sec
266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
183,130,306 instructions              #    0.69  insns per cycle
#    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]


0.085359326 seconds time elapsed                                          ( +-  1.31% )


$ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out


Performance counter stats for './a.out' (10 runs):


37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
19,801 page-faults               #    0.524 M/sec
104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
140,669,773 instructions              #    1.34  insns per cycle
#    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]


0.038187580 seconds time elapsed                                          ( +-  2.05% )

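Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.

So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)

Great question, +1!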