为了安全地完成这项工作,在 C + + 或 C 中不使用未定义的行为,可以使用 memcpy代替指针强制转换来进行类型双关。编译器知道如何有效地内联它。
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
或者在 C99和更高版本中,使用 union {double d; uint32_t u[2];};。但是请注意,在 C + + 中,只有一些编译器支持联合类型双关,而在 ISO C + + 中不支持。
这通常比前导零计数指令的平台特定内部指令慢,但便携式 ISO C 没有这样的功能。一些 CPU 也缺少前置零计数指令,但是其中一些可以有效地将整数转换为 double。但是,将 FP 位模式返回整数的类型双击可能会比较慢(例如,在 PowerPC 上,它需要存储/重新加载,并且通常会导致加载到存储停滞)。
该算法可能对 SIMD 实现非常有用,因为具有 SIMD lzcnt的 CPU 较少。X86只有这样的指令 和 AVX512CD
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
(“未定义的结果”不是 C 未定义行为,只是一个未定义的值。它实际上是指令运行时目标寄存器中的内容。AMD 记录了这一点,而英特尔没有,但英特尔的 CPU 确实实现了这一行为。但是它是 没有,不管你之前给 C 变量赋值的是什么,当 gcc 把 C 转换成 asm 时,通常不是这样运行的。另见 为什么打破 LZCNT 的“输出依赖性”很重要?)
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
仅使用 -O2,差异会变得更大。决策树的速度几乎快了四倍。
我还以幼稚的位移代码为基准:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
正如人们所预料的那样,这只对少数人来说是快速的。在确定 n = = 1的最高位为1时,它的基准测试速度提高了80% 以上。然而,在63位空间中随机选择的数字中,有一半是设置为63位的!
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
请注意,您可以尝试一次搜索多于1位的内容。
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
这种方法使用二进制搜索
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
另一个二分法也许更易读,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
一些过于复杂的答案。只有当输入已经是2的幂时,才应该使用 Debruin 技术,否则会有更好的方法。对于2的输入功率,Debruin 是绝对最快的,甚至比我测试过的任何处理器上的 _BitScanReverse都要快。但是,在一般情况下,_BitScanReverse(或者在编译器中调用的任何内部函数)是最快的(在某些 CPU 上可以对它进行微编码)。
如果内部函数不是一个选项,这里有一个处理一般输入的最优软件解决方案。
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
请注意,这个版本不需要在最后查找德布鲁因,不像其他大多数答案。它计算位置。
但是,如果多次调用表,则表可能更可取,因为表的加速会使缓存丢失的风险相形见绌。
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
可以显示 1,对于任意位宽度,需要测试的平均位数最多为2。这转化为 分期付款时间复杂度相对于 < em > O (1) 的位数(!).
当然,最糟糕的情况仍然是 < em > O (n) ,比二进制搜索方法得到的 < em > O (log (n)) 更糟糕,但是因为最糟糕的情况很少,所以对于大多数应用程序来说可以忽略不计(< em > 更新 : 不完全是: 可能很少,但是它们可能发生的概率很高——参见下面的 < em > 更新 )。
这是我想出来的“天真”的方法,至少在我的机器上打败了大多数其他方法(对于32位 int 的二进制搜索方案总是需要 木头2(32) = 5个步骤,而这个愚蠢的算法平均需要不到2个步骤)——抱歉这是 C + + 而不是纯 C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
该表需要通过上面的代码进行一次性初始化。它是只读的,因此可以共享单个全局副本以进行并发访问。使用这个表,您可以快速查找整数 日志 < sub > 2 ,这就是我们在这里要查找的,查找所有不同的整数宽度(8、16、32和64位)。
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Uint (32位)版本
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
以上各种过载
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
如果你想让它工作而不考虑符号,你可以在循环之前添加一个额外的“ v < < = 1;”(并相应地将 r 值改为30)。
如果我忘了什么东西,请告诉我。我还没有测试过,但它应该可以正常工作。
if (1) { #TpositionOfMostSignificantBitIn64
my @m = ( # Test strings
#B0 1 2 3 4 5 6 7
#b0123456701234567012345670123456701234567012345670123456701234567
'0000000000000000000000000000000000000000000000000000000000000000',
'0000000000000000000000000000000000000000000000000000000000000001',
'0000000000000000000000000000000000000000000000000000000000000010',
'0000000000000000000000000000000000000000000000000000000000000111',
'0000000000000000000000000000000000000000000000000000001010010000',
'0000000000000000000000000000000000001000000001100100001010010000',
'0000000000000000000001001000010000000000000001100100001010010000',
'0000000000000000100000000000000100000000000001100100001010010000',
'1000000000000000100000000000000100000000000001100100001010010000',
);
my @n = (0, 1, 2, 3, 10, 28, 43, 48, 64); # Expected positions of msb
sub positionOfMostSignificantBitIn64($) # Find the position of the most significant bit in a string of 64 bits starting from 1 for the least significant bit or return 0 if the input field is all zeros
{my ($s64) = @_; # String of 64 bits
my $N = 128; # 128 bit operations
my $f = 0; # Position of first bit set
my $x = '0'x$N; # Double Quad Word set to 0
my $s = substr $x.$s64, -$N; # 128 bit area needed
substr(VPTESTMD($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 4) : ($f += 32); # Test 2 dwords
substr(VPTESTMW($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 2) : ($f += 16); # Test 2 words
substr(VPTESTMB($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 1) : ($f += 8); # Test 2 bytes
$s = substr($s, -8); # Last byte remaining
$s < $_ ? ++$f : last for # Search remaing byte
(qw(10000000 01000000 00100000 00010000
00001000 00000100 00000010 00000001));
64 - $f # Position of first bit set
}
ok $n[$_] eq positionOfMostSignificantBitIn64 $m[$_] for keys @m # Test
}
如果没有直接的硬件支持,那么在 w = 64位的情况下应该可以比 O (log (w))做得更好。实际上,在 O (log log w)中也可以这样做,只不过性能交叉要到 w > = 256位才会发生。
不管怎样,我尝试了一下,我能想到的最好的办法就是下面这些技巧的组合:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}