> vs. >= in bubble sort causes significant performance difference

小开

I think this can indeed be explained by branch misprediction.

Consider, for example, LIMIT=11, and sortB. On first iteration of the outer loop, it will very quickly stumble upon one of elements equal to 10. So it will have a[j]=10, and therefore definitely a[j] will be >=a[next], as there are no elements that are greater than 10. Therefore, it will perform a swap, then do one step in j only to find that a[j]=10 once again (the same swapped value). So once again it will be a[j]>=a[next], and so one. Every comparison except several at the very beginning will be true. Similarly it will run on the next iterations of the outer loop.

Not the same for sortA. It will start roughly the same way, stumble upon a[j]=10, do some swaps in a similar manner, but only to a point when it finds a[next]=10 too. Then the condition will be false, and no swap will be done. An so on: every time it stumbles on a[next]=10, the condition is false and no swaps are done. Therefore, this condition is true 10 times out of 11 (values of a[next] from 0 to 9), and false in 1 case out of 11. Nothing strange that branch prediction fails.

小开

最佳答案

I think it may indeed be due to branch prediction. If you count the number of swaps compared to the number of inner sort iterations you find:

Limit = 10

A = 560M swaps / 1250M loops
B = 1250M swaps / 1250M loops (0.02% less swaps than loops)

Limit = 50000

A = 627M swaps / 1250M loops
B = 850M swaps / 1250M loops

So in the Limit == 10 case the swap is performed 99.98% of the time in the B sort which is obviously favourable for the branch predictor. In the Limit == 50000 case the swap is only hit randomly 68% so the branch predictor is less beneficial.

小开

Edit 2: This answer is probably wrong in most cases, lower when I say everything above is correct is still true, but the lower portion is not true for most processor architectures, see the comments. However, I will say that it's still theoretically possible there is some JVM on some OS/Architecture that does this, but that JVM is probably poorly implemented or it's a weird architecture. Also, this is theoretically possible in the sense that most conceivable things are theoretically possible, so I'd take the last portion with a grain of salt.

First, I am not sure about the C++, but I can talk some about the Java.

Here is some code,

public class Example {


public static boolean less(final int a, final int b) {
return a < b;
}


public static boolean lessOrEqual(final int a, final int b) {
return a <= b;
}
}

Running javap -c on it I get the bytecode

public class Example {
public Example();
Code:
0: aload_0
1: invokespecial #8                  // Method java/lang/Object."<init>":()V
4: return


public static boolean less(int, int);
Code:
0: iload_0
1: iload_1
2: if_icmpge     7
5: iconst_1
6: ireturn
7: iconst_0
8: ireturn


public static boolean lessOrEqual(int, int);
Code:
0: iload_0
1: iload_1
2: if_icmpgt     7
5: iconst_1
6: ireturn
7: iconst_0
8: ireturn
}

You'll notice the only difference is if_icmpge (if compare greater/equal) versus if_icmpgt (if compare greater than).

Everything above is fact, the rest is my best guess as to how if_icmpge and if_icmpgt are handled based on a college course I took on assembly language. To get a better answer you should look up how your JVM handles these. My guess is that C++ also compiles down to a similar operation.

Edit: Documentation on if_i<cond> is here

The way computers compare numbers is subtracting one from the other and checking if that number is 0 or not, so when doing a < b if subtracts b from a and sees if the result is less than 0 by checking the sign of the value (b - a < 0). To do a <= b though it has to do an additional step and subtract 1 (b - a - 1 < 0).

Normally this is a very miniscule difference, but this isn't any code, this is freaking bubble sort! O(n^2) is the average number of times we are doing this particular comparison because it's in the inner most loop.

Yes, it may have something to do with branch prediction I am not sure, I am not an expert on that, but I think this may also play a non-insignificant role.

小开

Using the C++ code provided (time counting removed) with the perf stat command I got results that confirm the brach-miss theory.

With Limit = 10, BubbleSortB highly benefits from branch prediction (0.01% misses) but with Limit = 50000 branch prediction fails even more (with 15.65% misses) than in BubbleSortA (12.69% and 12.76% misses respectively).

BubbleSortA Limit=10:

Performance counter stats for './bubbleA.out':


46670.947364 task-clock                #    0.998 CPUs utilized
73 context-switches          #    0.000 M/sec
28 CPU-migrations            #    0.000 M/sec
379 page-faults               #    0.000 M/sec
117,298,787,242 cycles                    #    2.513 GHz
117,471,719,598 instructions              #    1.00  insns per cycle
25,104,504,912 branches                  #  537.904 M/sec
3,185,376,029 branch-misses             #   12.69% of all branches


46.779031563 seconds time elapsed

BubbleSortA Limit=50000:

Performance counter stats for './bubbleA.out':


46023.785539 task-clock                #    0.998 CPUs utilized
59 context-switches          #    0.000 M/sec
8 CPU-migrations            #    0.000 M/sec
379 page-faults               #    0.000 M/sec
118,261,821,200 cycles                    #    2.570 GHz
119,230,362,230 instructions              #    1.01  insns per cycle
25,089,204,844 branches                  #  545.136 M/sec
3,200,514,556 branch-misses             #   12.76% of all branches


46.126274884 seconds time elapsed

BubbleSortB Limit=10:

Performance counter stats for './bubbleB.out':


26091.323705 task-clock                #    0.998 CPUs utilized
28 context-switches          #    0.000 M/sec
2 CPU-migrations            #    0.000 M/sec
379 page-faults               #    0.000 M/sec
64,822,368,062 cycles                    #    2.484 GHz
137,780,774,165 instructions              #    2.13  insns per cycle
25,052,329,633 branches                  #  960.179 M/sec
3,019,138 branch-misses             #    0.01% of all branches


26.149447493 seconds time elapsed

BubbleSortB Limit=50000:

Performance counter stats for './bubbleB.out':


51644.210268 task-clock                #    0.983 CPUs utilized
2,138 context-switches          #    0.000 M/sec
69 CPU-migrations            #    0.000 M/sec
378 page-faults               #    0.000 M/sec
144,600,738,759 cycles                    #    2.800 GHz
124,273,104,207 instructions              #    0.86  insns per cycle
25,104,320,436 branches                  #  486.101 M/sec
3,929,572,460 branch-misses             #   15.65% of all branches


52.511233236 seconds time elapsed