什么是最快的整数除法支持除以零不管结果是什么?

摘要:

我在寻找最快的计算方法

(int) x / (int) y

我只想要一个任意的结果。


背景:

在编写图像处理算法时,我经常需要除以(累积的) alpha 值。最简单的变体是带整数算术的普通 C 代码。我的问题是,对于使用 alpha==0的结果像素,我通常会得到一个除以零的误差。然而,这正是结果完全不重要的像素: 我不关心像素的颜色值与 alpha==0


详情:

我要找的是这样的东西:

result = (y==0)? 0 : x/y;

或者

result = x / MAX( y, 1 );

X 和 y 是正整数。代码在嵌套循环中执行了大量次数,因此我正在寻找一种方法来消除条件分支。

当 y 不超过字节范围时,我对解决方案感到满意

unsigned char kill_zero_table[256] = { 1, 1, 2, 3, 4, 5, 6, 7, [...] 255 };
[...]
result = x / kill_zero_table[y];

但这显然不适用于更大的范围。

我想最后一个问题是: 在保持所有其他值不变的情况下,把0改成任何其他整数值最快的速度是多少?


澄清

我不能百分之百肯定分支太贵。但是,使用了不同的编译器,所以我更喜欢基准测试,只进行少量优化(这确实值得怀疑)。

当然,编译器在处理比特的时候是很棒的,但是我不能用 C 语言表达“不在乎”的结果,所以编译器将永远不能使用所有的优化。

代码应该是完全 C 兼容,主要平台是 Linux 64位与 gcc & clang 和 MacOS。

6629 次浏览

Inspired by some of the comments I got rid of the branch on my Pentium and gcc compiler using

int f (int x, int y)
{
y += y == 0;
return x/y;
}

The compiler basically recognizes that it can use a condition flag of the test in the addition.

As per request the assembly:

.globl f
.type   f, @function
f:
pushl   %ebp
xorl    %eax, %eax
movl    %esp, %ebp
movl    12(%ebp), %edx
testl   %edx, %edx
sete    %al
addl    %edx, %eax
movl    8(%ebp), %edx
movl    %eax, %ecx
popl    %ebp
movl    %edx, %eax
sarl    $31, %edx
idivl   %ecx
ret

As this turned out to be such a popular question and answer, I'll elaborate a bit more. The above example is based on programming idiom that a compiler recognizes. In the above case a boolean expression is used in integral arithmetic and the use of condition flags are invented in hardware for this purpose. In general condition flags are only accessible in C through using idiom. That is why it so hard to make a portable multiple precision integer library in C without resorting to (inline) assembly. My guess is that most decent compilers will understand the above idiom.

Another way of avoiding branches, as also remarked in some of the above comments, is predicated execution. I therefore took philipp's first code and my code and ran it through the compiler from ARM and the GCC compiler for the ARM architecture, which features predicated execution. Both compilers avoid the branch in both samples of code:

Philipp's version with the ARM compiler:

f PROC
CMP      r1,#0
BNE      __aeabi_idivmod
MOVEQ    r0,#0
BX       lr

Philipp's version with GCC:

f:
subs    r3, r1, #0
str     lr, [sp, #-4]!
moveq   r0, r3
ldreq   pc, [sp], #4
bl      __divsi3
ldr     pc, [sp], #4

My code with the ARM compiler:

f PROC
RSBS     r2,r1,#1
MOVCC    r2,#0
ADD      r1,r1,r2
B        __aeabi_idivmod

My code with GCC:

f:
str     lr, [sp, #-4]!
cmp     r1, #0
addeq   r1, r1, #1
bl      __divsi3
ldr     pc, [sp], #4

All versions still need a branch to the division routine, because this version of the ARM doesn't have hardware for a division, but the test for y == 0 is fully implemented through predicated execution.

Without knowing the platform there is no way to know the exact most efficient method, however, on a generic system this may close to the optimum (using Intel assembler syntax):

(assume divisor is in ecx and the dividend is in eax)

mov ebx, ecx
neg ebx
sbb ebx, ebx
add ecx, ebx
div eax, ecx

Four unbranched, single-cycle instructions plus the divide. The quotient will be in eax and the remainder will be in edx at the end. (This kind of shows why you don't want to send a compiler to do a man's job).

Here are some concrete numbers, on Windows using GCC 4.7.2:

#include <stdio.h>
#include <stdlib.h>


int main()
{
unsigned int result = 0;
for (int n = -500000000; n != 500000000; n++)
{
int d = -1;
for (int i = 0; i != ITERATIONS; i++)
d &= rand();


#if CHECK == 0
if (d == 0) result++;
#elif CHECK == 1
result += n / d;
#elif CHECK == 2
result += n / (d + !d);
#elif CHECK == 3
result += d == 0 ? 0 : n / d;
#elif CHECK == 4
result += d == 0 ? 1 : n / d;
#elif CHECK == 5
if (d != 0) result += n / d;
#endif
}
printf("%u\n", result);
}

Note that I am intentionally not calling srand(), so that rand() always returns exactly the same results. Note also that -DCHECK=0 merely counts the zeroes, so that it is obvious how often appeared.

Now, compiling and timing it various ways:

$ for it in 0 1 2 3 4 5; do for ch in 0 1 2 3 4 5; do gcc test.cc -o test -O -DITERATIONS=$it -DCHECK=$ch && { time=`time ./test`; echo "Iterations $it, check $ch: exit status $?, output $time"; }; done; done

shows output that can be summarised in a table:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.612s | -        | -        | -         | -         | -
Check 2      | 0m0.612s | 0m6.527s | 0m9.718s | 0m13.464s | 0m18.422s | 0m22.871s
Check 3      | 0m0.616s | 0m5.601s | 0m8.954s | 0m13.211s | 0m19.579s | 0m25.389s
Check 4      | 0m0.611s | 0m5.570s | 0m9.030s | 0m13.544s | 0m19.393s | 0m25.081s
Check 5      | 0m0.612s | 0m5.627s | 0m9.322s | 0m14.218s | 0m19.576s | 0m25.443s

If zeroes are rare, the -DCHECK=2 version performs badly. As zeroes start appearing more, the -DCHECK=2 case starts performing significantly better. Out of the other options, there really isn't much difference.

For -O3, though, it is a different story:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.646s | -        | -        | -         | -         | -
Check 2      | 0m0.654s | 0m5.670s | 0m9.905s | 0m14.238s | 0m17.520s | 0m22.101s
Check 3      | 0m0.647s | 0m5.611s | 0m9.085s | 0m13.626s | 0m18.679s | 0m25.513s
Check 4      | 0m0.649s | 0m5.381s | 0m9.117s | 0m13.692s | 0m18.878s | 0m25.354s
Check 5      | 0m0.649s | 0m6.178s | 0m9.032s | 0m13.783s | 0m18.593s | 0m25.377s

There, check 2 has no drawback compared the other checks, and it does keep the benefits as zeroes become more common.

You should really measure to see what happens with your compiler and your representative sample data, though.

According to this link, you can just block the SIGFPE signal with sigaction() (I have not tried it myself, but I believe it should work).

This is the fastest possible approach if divide by zero errors are extremely rare: you only pay for the divisions by zero, not for the valid divisions, the normal execution path is not changed at all.

However, the OS will be involved in every exception that's ignored, which is expensive. I think, you should have at least a thousand good divisions per division by zero that you ignore. If exceptions are more frequent than that, you'll likely pay more by ignoring the exceptions than by checking every value before the division.