Fastest way to get a positive modulo in C/C++

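Often in my inner loops I need to index an array in a "wrap-around" way, so that (for example) if the array size is 100 and my code asks for element -2, it should be given element 98. In many high-level languages such as Python, one can do this simply with my_array[index % array_size], but for some reason C's integer arithmetic (usually) rounds toward zero instead of consistently rounding down, and consequently its modulo operator returns a negative result when given a negative first argument.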

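Often I know that index will not be less than -array_size, and in those cases I just do my_array[(index + array_size) % array_size]. However, sometimes this can't be guaranteed, and for those cases I would like to know the fastest way to implement an always-positive modulo function. There are several "clever" ways to do it without branching, such as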

inline int positive_modulo(int i, int n) {
return (n + (i % n)) % n;
}

or

inline int positive_modulo(int i, int n) {
return (i % n) + (n * (i < 0));
}

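Of course I can profile these to find out which is fastest on my system, but I can't help worrying that I might have missed a better one, or that what is fastest on my machine might be slow on a different one.

So is there a standard way to do this, or some clever trick I missed that is likely to be the fastest possible way?

Also, I know it's probably wishful thinking, but if there's a way of doing this that can be auto-vectorized, that would be amazing.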


Modulo a power of two, the following works (assuming two's complement representation):

return i & (n-1);
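As a minimal sketch of the masking trick above, wrapped in a function: the name wrap_pow2 and the assertion are illustrative assumptions, not part of the original answer, and the result relies on two's complement representation of negative values.

```c
#include <assert.h>

/* Hypothetical helper: wraps i into [0, n) when n is a power of two.
   Assumes negative ints are stored in two's complement. */
int wrap_pow2(int i, int n) {
    /* n must be positive and have exactly one bit set */
    assert(n > 0 && (n & (n - 1)) == 0);
    return i & (n - 1);
}
```

With an array of 128 elements, wrap_pow2(-2, 128) selects element 126 and wrap_pow2(130, 128) selects element 2.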

The standard way I learned is

inline int positive_modulo(int i, int n) {
return (i % n + n) % n;
}

This function is essentially your first variant without the abs (which, in fact, makes it return the wrong result). I wouldn't be surprised if an optimizing compiler could recognize this pattern and compile it to machine code that computes an "unsigned modulo".

Edit:

Moving on to your second variant: First of all, it contains a bug, too -- the n < 0 should be i < 0.

This variant may not look as if it branches, but on a lot of architectures, the i < 0 will compile into a conditional jump. In any case, it will be at least as fast to replace (n * (i < 0)) with i < 0 ? n : 0, which avoids the multiplication; in addition, it's "cleaner" because it avoids reinterpreting the bool as an int.

As to which of these two variants is faster, that probably depends on the compiler and processor architecture -- time the two variants and see. I don't think there's a faster way than either of these two variants, though.
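One caveat worth checking with the second variant: testing i < 0 adds n even when i is a negative multiple of n, so i = -100, n = 100 yields 100 rather than 0. A minimal sketch that tests the sign of the remainder instead, assuming n > 0 (the function name is hypothetical):

```c
#include <assert.h>

/* Hypothetical variant: conditions on the remainder, not on i,
   so negative multiples of n map to 0. Assumes n > 0. */
int positive_modulo_by_remainder(int i, int n) {
    assert(n > 0);
    int r = i % n;              /* r is in (-n, n) */
    return r < 0 ? r + n : r;   /* r + n cannot overflow because r < 0 < n */
}
```

Whether the conditional compiles to a branch or to a conditional move depends on the compiler and target, as discussed above.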

You can also do array[(i+array_size*N) % array_size], where N is an integer large enough to guarantee a non-negative argument, but small enough that the sum does not overflow.
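A sketch of this offset approach with its preconditions made explicit. The function name and the assertions are assumptions added for illustration; in real inner-loop code the bounds would normally be guaranteed by the caller rather than checked each time.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: wraps i into [0, array_size), valid only while
   -N*array_size <= i <= INT_MAX - N*array_size. */
int wrap_with_offset(int i, int array_size, int N) {
    assert(array_size > 0 && N > 0);
    assert(N <= INT_MAX / array_size);  /* N*array_size must not overflow */
    int offset = N * array_size;
    assert(i >= -offset);               /* keeps the % argument non-negative */
    assert(i <= INT_MAX - offset);      /* keeps i + offset from overflowing */
    return (i + offset) % array_size;
}
```

For example, wrap_with_offset(-250, 100, 4) computes (−250 + 400) % 100 and yields 50.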

When array_size is constant, there are techniques to compute the modulus without division. Besides the power-of-two approach, one can compute a weighted sum of bit groups, each multiplied by 2^i % n, where i is the position of the least significant bit in that group:

e.g. for a 32-bit integer, 0xaabbccdd % 100 is congruent (mod 100) to dd + cc*[2]56 + bb*[655]36 + aa*[167772]16, where the bracketed digits are the part of 256, 65536 and 16777216 that drops out modulo 100. This sum has a maximum value of (1+56+36+16)*255 = 27795. With repeated application and different subdivisions, one can reduce the operation to a few conditional subtractions.
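The byte-group reduction for n = 100 might be sketched as follows. This is one possible arrangement under stated assumptions, not a tuned implementation: the function name is hypothetical, and the folding loop is simply one way to do the "repeated application" step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x % 100 without a division instruction. */
unsigned mod100_bytes(uint32_t x) {
    /* Weights: 256 % 100 = 56, 65536 % 100 = 36, 16777216 % 100 = 16 */
    uint32_t s = (x & 0xff)
               + ((x >> 8) & 0xff) * 56u
               + ((x >> 16) & 0xff) * 36u
               + (x >> 24) * 16u;   /* s <= 27795, congruent to x mod 100 */
    /* Fold the high byte down again; s strictly decreases each pass */
    while (s > 255)
        s = (s & 0xff) + (s >> 8) * 56u;
    /* s <= 255 here, so at most 200 has to be removed */
    if (s >= 200) s -= 200;
    else if (s >= 100) s -= 100;
    assert(s < 100);
    return s;
}
```

Whether this beats the multiply-by-reciprocal code a compiler already emits for a constant divisor is something to measure on the target.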

Common practice also includes approximating the division by multiplying with the reciprocal 2^32 / n, which can usually handle a reasonably large range of arguments.

 i - ((i * 655)>>16)*100; // (gives 100*n % 100 == 100 requiring adjusting...)
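A sketch of the reciprocal expression above with the adjustment step filled in. It uses the 16-bit scale from the snippet (655 = floor(65536 / 100)) and, as an assumption made here, restricts the input to 16 bits so that the estimated quotient is never off by more than one; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: i % 100 via multiplication by an approximate reciprocal. */
unsigned mod100_reciprocal(uint16_t i) {
    /* 655/65536 is slightly below 1/100, so q is floor(i/100) or one less */
    uint32_t q = ((uint32_t)i * 655u) >> 16;
    uint32_t r = i - q * 100u;   /* 0 <= r < 200 for 16-bit inputs */
    if (r >= 100) r -= 100;      /* the adjustment mentioned above */
    assert(r < 100);
    return r;
}
```

For i = 100 the estimate q is 0, so r starts at 100 and the adjustment brings it to 0.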

An old-school way to get the optional addend using two's-complement sign-bit propagation:

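#include <limits.h> /* CHAR_BIT */

int positive_mod(int i, int m)
{
/* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
int r = i%m;
/* r>>shift is all ones when r is negative (arithmetic shift assumed), else 0 */
return r+ (r>>shift & m);
}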

Your second example is better than the first. A multiplication is a more complex operation than an if/else operation, so use this:

inline int positive_modulo(int i, int n) {
int tmp = i % n;
return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

If you can afford to promote to a larger type (and do your modulo on the larger type), this code does a single modulo and no if:

int32_t positive_modulo(int32_t number, int32_t modulo) {
return (number + ((int64_t)modulo << 32)) % modulo;
}

Fastest way to get a positive modulo in C/C++

Is the following fast? Maybe not as fast as others, yet it is simple and functionally correct for all a, b (with the exceptions noted in footnote 1), unlike others.

int modulo_Euclidean(int a, int b) {
int m = a % b;
if (m < 0) {
// m += (b < 0) ? -b : b; // avoid this form: -b is UB when b == INT_MIN
m = (b < 0) ? m - b : m + b;
}
return m;
}

Various other answers have mod(a,b) weaknesses especially when b < 0.

See Euclidean division for ideas about b < 0


inline int positive_modulo(int i, int n) {
return (i % n + n) % n;
}

Fails when i % n + n overflows (think large i, n) - Undefined behavior.


return i & (n-1);

Relies on n as a power of two. (Fair that the answer does mention this.)


int positive_mod(int i, int n)
{
/* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
int m = i%n;
return m+ (m>>shift & n);
}

Often fails when n < 0, e.g. positive_mod(-2,-3) --> -5


int32_t positive_modulo(int32_t number, int32_t modulo) {
return (number + ((int64_t)modulo << 32)) % modulo;
}

Obliges using 2 integer widths. (Fair that the answer does mention this.)
Fails with modulo < 0. positive_modulo(2, -3) --> -1.


inline int positive_modulo(int i, int n) {
int tmp = i % n;
return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

Often fails when n < 0, e.g. positive_modulo(-2,-3) --> -5


1 Exceptions: In C, a%b is not defined when a/b overflows as in a/0 or INT_MIN/-1.
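The two exceptions in the footnote can be checked before calling %. A minimal sketch, assuming the caller prefers an assertion to undefined behavior; both function names are hypothetical and not part of the answer above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical guard: true when a % b is defined in C. */
bool remainder_is_defined(int a, int b) {
    return b != 0 && !(a == INT_MIN && b == -1);
}

/* Euclidean modulo as in the answer, with the precondition asserted. */
int modulo_euclidean_checked(int a, int b) {
    assert(remainder_is_defined(a, b));
    int m = a % b;
    if (m < 0) {
        m = (b < 0) ? m - b : m + b;  /* adds |b| without negating b */
    }
    return m;
}
```

For instance, modulo_euclidean_checked(-2, -3) returns 1, while remainder_is_defined(INT_MIN, -1) returns false.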

Most of the time, compilers are very good at optimizing your code, so it is usually best to keep your code readable (for both compilers and other developers to know what you are doing).

Since your array size is always positive, I suggest declaring the divisor as unsigned. The compiler will optimize small if/else blocks into conditional instructions, which involve no branches:

unsigned modulo( int value, unsigned m) {
int mod = value % (int)m;
if (mod < 0) {
mod += m;
}
return mod;
}

This creates a very small function without branches:

modulo(int, unsigned int):
mov     eax, edi
cdq
idiv    esi
add     esi, edx
mov     eax, edx
test    edx, edx
cmovs   eax, esi
ret

For example modulo(-5, 7) returns 2.

Unfortunately, since the divisor is not known at compile time, the compiler must emit an integer division, which is slow compared to other integer operations. If you know that your array sizes are powers of two, I recommend keeping these function definitions in a header, so that the compiler can inline them and optimize them into something more efficient. Here is the function unsigned modulo256(int v) { return modulo(v,256); }:

modulo256(int):                          # @modulo256(int)
mov     edx, edi
sar     edx, 31
shr     edx, 24
lea     eax, [rdi+rdx]
movzx   eax, al
sub     eax, edx
lea     edx, [rax+256]
test    eax, eax
cmovs   eax, edx
ret

See assembly: https://gcc.godbolt.org/z/DG7jMw

See comparison with most voted answer: http://quick-bench.com/oJbVwLr9G5HJb0oRaYpQOCec4E4

[Image: benchmark comparison]

Edit: it turns out Clang is able to generate a function without any conditional move instructions (which cost more than regular arithmetic operations). This difference is completely negligible in the general case, because the integer division takes around 70% of the total time.

Basically, Clang shifts value right to extend its sign bit to the whole width of m (that is 0xffffffff when negative and 0 otherwise) which is used to mask the second operand in mod + m.

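#include <limits>

unsigned modulo (int value, unsigned m) {
int mod = value % (int)m;
// mod >> digits is all ones when mod is negative (arithmetic shift assumed), else 0
m &= mod >> std::numeric_limits<int>::digits;
return mod + m;
}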

If you want to avoid all conditional paths, including the conditional move generated above (for example, if you need this code to vectorize or to run in constant time), you can use the sign bit as a mask:

unsigned modulo(int value, unsigned m) {
int shift_width = sizeof(int) * 8 - 1;
int tweak = (value >> shift_width);
int mod = ((value - tweak) % (int) m) + tweak;
mod += (tweak & m);
return mod;
}

Here are the quickbench results. You can see that on gcc it's better in the generic case. For clang it's the same speed in the generic case, because clang already generates the branch-free code there. The technique is useful regardless, because the compiler can't always be relied on to produce this particular optimization, and you may have to roll it by hand for vector code.