Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I noticed something odd: using the SSE scalar operations, it is significantly faster to take a reciprocal square root and multiply it to get the sqrt than it is to use the native sqrt opcode!

I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );


void TestFunc()
{
#define ARRAYSIZE 4096
#define NUMITERS 16386
float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache


cyclecounter.Start();
for ( int i = 0 ; i < NUMITERS ; ++i )
for ( int j = 0 ; j < ARRAYSIZE ; ++j )
{
flOut[j] = TestSqrtFunction( flIn[j] );
// unrolling this loop makes no difference -- I tested it.
}
cyclecounter.Stop();
printf( "%d loops over %d floats took %.3f milliseconds",
NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}

I've tried this with a few different bodies for TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24 ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }

Next I tried using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
_mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
// compiles to movss, sqrtss, movss
}

That was better, at 11.9 ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3 ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
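
For reference, the widely known bit-trick version of that approximation (a magic-constant initial guess plus one Newton-Raphson step) looks roughly like this; a sketch, not necessarily the exact routine used here:

#include <stdint.h>
#include <string.h>

// Approximate sqrt(x) via the classic "fast inverse square root" bit trick:
// reinterpret the float's bits, form a rough guess for 1/sqrt(x), refine it
// with one Newton-Raphson step, then multiply by x. Error ~1 part in 2^10.
inline float CarmackSqrt( float x )
{
    float y = x;
    float half = 0.5f * x;
    uint32_t i;
    memcpy( &i, &y, sizeof( i ) );      // view the float's bits as an integer
    i = 0x5f3759df - ( i >> 1 );        // magic initial guess for 1/sqrt(x)
    memcpy( &y, &i, sizeof( y ) );
    y = y * ( 1.5f - half * y * y );    // one Newton-Raphson refinement
    return x * y;                       // x * 1/sqrt(x) = sqrt(x)
}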

The kicker came when I tried the SSE op for the reciprocal square root, and then used a multiply to get the square root (x * 1/√x = √x). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24 ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
__m128 in = _mm_load_ss( pIn );
_mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
// compiles to movss, movaps, rsqrtss, mulss, movss
}

My question is basically: what gives? Why is SSE's built-into-hardware square root opcode slower than synthesizing it out of two other math operations?

I'm sure that this is really the cost of the op itself, because I've verified that:

  • all the data fits in cache, and accesses are sequential
  • the functions are inlined
  • unrolling the loop makes no difference
  • compiler flags are set to full optimization (and the assembly is good, I checked)

(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps; but the array data structure here is for testing purposes only: what I'm really trying to measure is scalar performance, for use in code that can't be vectorized.)


sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.

sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
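
A sketch of that sequence with intrinsics (the standard one-step refinement x' = 0.5 * x * (3 - n*x*x); Intel's exact recommended sequence may differ slightly, and the function name here is just for illustration):

#include <xmmintrin.h>

// Nearly-full-precision 1/sqrt(n): rsqrtss gives a ~12-bit estimate,
// then one Newton-Raphson step: x' = 0.5 * x * (3 - n*x*x).
inline void SSERSqrtNR( float * pOut, float * pIn )
{
    __m128 n = _mm_load_ss( pIn );
    __m128 x = _mm_rsqrt_ss( n );                       // initial estimate
    __m128 nxx = _mm_mul_ss( n, _mm_mul_ss( x, x ) );   // n*x*x
    x = _mm_mul_ss( _mm_mul_ss( _mm_set_ss( 0.5f ), x ),
                    _mm_sub_ss( _mm_set_ss( 3.0f ), nxx ) );
    _mm_store_ss( pOut, x );
}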

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
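
For example, a sketch of the packed form (assuming 16-byte-aligned pointers, a count that is a multiple of 4, and strictly positive inputs; the function name is made up):

#include <xmmintrin.h>

// Approximate sqrt over an array, four floats per iteration, using rsqrtps:
// sqrt(x) = x * 1/sqrt(x).
void SqrtArray( float * pOut, const float * pIn, int count )
{
    for ( int i = 0; i < count; i += 4 )
    {
        __m128 x = _mm_load_ps( pIn + i );
        _mm_store_ps( pOut + i, _mm_mul_ps( x, _mm_rsqrt_ps( x ) ) );
    }
}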

Instead of supplying an answer that might actually be incorrect (I'm also not going to check or argue about cache and other things; let's say they're identical), I'll try to point you to the source that can answer your question.
The difference might lie in how sqrt and rsqrt are computed. You can read more here: http://www.intel.com/products/processor/manuals/. I'd suggest starting by reading about the processor instructions you are using; there is some information there, especially about rsqrt (the CPU uses an internal lookup table with a rough approximation, which makes it much simpler to get the result). It may turn out that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) doesn't change the situation here.

Edit: A few facts that might be worth mentioning:
1. Once I was doing some micro-optimizations for my graphics library and I used rsqrt for computing the length of vectors (instead of sqrt, I multiplied the sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better.
2. Computing rsqrt from a simple lookup table may be easier because, as x goes to infinity, 1/sqrt(x) goes to 0, so for large x the function values don't change (much), whereas sqrt goes to infinity, so it isn't such a simple case ;).

Also, a clarification: I'm not sure where I found it in the books I linked, but I'm pretty sure I've read that rsqrt uses some lookup table, and that it should be used only when the result doesn't need to be exact, although I might be wrong as well, as it was some time ago :).

This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.

Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
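
A sketch of that division trick (an rcpss estimate refined with one Newton-Raphson step r' = r * (2 - b*r), then a multiply; not IEEE-754 correctly rounded, and the function name is just for illustration):

#include <xmmintrin.h>

// Approximate a / b as a * (1/b): rcpss gives ~12 bits, one Newton-Raphson
// step brings it close to full single precision.
inline float FastDiv( float a, float b )
{
    __m128 vb = _mm_set_ss( b );
    __m128 r  = _mm_rcp_ss( vb );                            // ~1/b estimate
    r = _mm_mul_ss( r, _mm_sub_ss( _mm_set_ss( 2.0f ),
                                   _mm_mul_ss( vb, r ) ) );  // refine
    return a * _mm_cvtss_f32( r );
}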

Newton-Raphson converges to the zero of f(x) using increments equal to -f/f', where f' is the derivative.

For x=sqrt(y), you can try to solve f(x) = 0 for x using f(x) = x^2 - y;

Then the increment is: dx = -f/f' = 1/2 (y/x - x) = 1/2 (y - x^2) / x, which has a slow divide in it.

You can try other functions (like f(x) = 1/y - 1/x^2) but they will be equally complicated.

Let's look at 1/sqrt(y) now. You can try f(x) = x^2 - 1/y, but it will be equally complicated: dx = (1 - y*x^2) / (2*x*y), for instance. One non-obvious alternate choice for f(x) is: f(x) = y - 1/x^2

Then: dx = -f/f' = -(y - 1/x^2) / (2/x^3) = 1/2 * x * (1 - y * x^2)

Ah! It's not a trivial expression, but you only have multiplies in it, no divide. => Faster!

And: the full update step new_x = x + dx then reads:

x *= 3/2 - y/2 * x * x which is easy too.
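
In code, seeding with the hardware estimate and applying that multiply-only step (a sketch; each step roughly doubles the number of correct bits):

#include <xmmintrin.h>

// Refine 1/sqrt(y) with the update derived above: x *= 3/2 - y/2 * x * x.
inline float RsqrtNR( float y, int steps )
{
    float x = _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( y ) ) );  // ~12-bit seed
    for ( int i = 0; i < steps; ++i )
        x *= 1.5f - 0.5f * y * x * x;
    return x;
}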

It is faster because these instructions ignore rounding modes, and do not handle floating-point exceptions or denormalized numbers. For these reasons it is much easier to pipeline, speculate, and execute other FP instructions out of order.

There are a number of other answers to this already from a few years ago. Here's what the consensus got right:

  • The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
  • It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
  • The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
  • There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.

Here's what the consensus got wrong:

  • SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.

The N-R algorithm to compute reciprocal square root has this update step, as others have noted:

x' = 0.5 * x * (3 - n*x*x);

That's a lot of data-dependent multiplications and one subtraction.

What follows is the algorithm that modern FPUs actually use.

Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:

x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]

Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).

We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:

b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])

Then:

x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]

and:

y[0] = Y[0]
y[i] = y[i-1] * Y[i]

The next key observation is that b[i] = x[i-1] * y[i-1]. So:

Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
= 1 + 0.5 * (1 - x[i-1] * y[i-1])

Then:

x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])

That is, given initial x and y, we can use the following update step:

r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r

Or, even fancier, we can set h = 0.5 * y. This is the initialisation:

Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5

And this is the update step:

r = 0.5 - x * h
x' = x + x * r
h' = h + h * r

This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.

In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed in 1999 to implement this inner loop in a fully-pipelined way without wasting a lot of hardware just on square root.

Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
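
For illustration, here is that iteration written with fmaf, seeded from rsqrtss (a sketch of the algorithm as described above, not of how any particular FPU microcodes it):

#include <math.h>        // fmaf
#include <xmmintrin.h>   // _mm_rsqrt_ss for the initial estimate

// Goldschmidt iteration: x converges to sqrt(n), h to 0.5/sqrt(n).
// The inner loop is three fused multiply-adds; the x and h updates are independent.
static float GoldschmidtSqrt( float n, int iterations )
{
    float Y = _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( n ) ) );  // approx 1/sqrt(n)
    float x = Y * n;
    float h = Y * 0.5f;

    for ( int i = 0; i < iterations; ++i )
    {
        float r = fmaf( -x, h, 0.5f );   // r  = 0.5 - x*h
        x = fmaf( x, r, x );             // x' = x + x*r
        h = fmaf( h, r, h );             // h' = h + h*r
    }
    return x;                            // (2*h is 1/sqrt(n), if needed)
}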