Floating point division vs floating point multiplication

Is there any (non-microoptimization) performance gain from coding

float f1 = 200f / 2;

as opposed to

float f2 = 200f * 0.5f;

A few years ago, a professor of mine told me that floating point division was slower than floating point multiplication, without elaborating why.

Does this claim hold for modern PC architectures?

Update 1

Regarding the comments, please also consider this case:

float f1;
float f2 = 2;
float f3 = 3;
for (int i = 0; i < 1e8; i++)
{
    f1 = (i * f2 + i / f3) * 0.5; // or divide by 2.0f, respectively
}

Quoting a comment:

I want to know what kind of algorithmic/architectural requirements make division so much more complicated in hardware than multiplication


Yes, many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).

If you look at this answer you will see that division can exceed 24 cycles.

Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.

Yes. Every FPU I am aware of performs multiplications much faster than divisions.

However, modern PCs are very fast. They also contain pipelined architectures that can make the difference negligible under many circumstances. To top it off, any decent compiler will perform the division operation you showed at compile time with optimizations turned on. For your updated example, any decent compiler would perform that transformation itself.

So generally you should worry about making your code readable, and let the compiler worry about making it fast. Only if you have a measured speed issue with that line should you worry about perverting your code for the sake of speed. Compilers are well aware of what is faster than what on their CPUs, and are generally much better optimizers than you can ever hope to be.

Division is inherently a much slower operation than multiplication.

And this may in fact be something that the compiler cannot (and you may not want to) optimize in many cases due to floating point inaccuracies. These two statements:

double d1 = 7 / 10.;
double d2 = 7 * 0.1;

are not semantically identical - 0.1 cannot be exactly represented as a double, so a slightly different value will end up being used - substituting the multiplication for the division in this case would yield a different result!

The answer depends on the platform for which you are programming.

For example, doing lots of multiplication on an array on x86 should be much faster than doing division, because the compiler should create assembly code which uses SIMD instructions. SIMD division instructions exist, but they are far slower than SIMD multiplies, so you would see great improvements using multiplication instead of division.

Think about what is required for multiplication of two n bit numbers. With the simplest method, you take one number x and repeatedly shift and conditionally add it to an accumulator (based on a bit in the other number y). After n additions you are done. Your result fits in 2n bits.

For division, you start with x of 2n bits and y of n bits, you want to compute x / y. The simplest method is long division, but in binary. At each stage you do a comparison and a subtraction to get one more bit of the quotient. This takes you n steps.

Some differences: each step of the multiplication only needs to look at 1 bit; each stage of the division needs to look at n bits during the comparison. Each stage of the multiplication is independent of all other stages (doesn't matter the order you add the partial products); for division each step depends on the previous step. This is a big deal in hardware. If things can be done independently then they can happen at the same time within a clock cycle.
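The two textbook algorithms can be sketched in C (a toy illustration of the data flow, not how real ALUs are wired): the multiply loop's partial products are all independent of each other, while each division step consumes the remainder produced by the previous step.

```c
#include <stdint.h>

// Shift-and-add multiplication: each of the n steps looks at ONE bit of y,
// and the partial products are independent, so hardware can form and sum
// them in parallel (e.g. with a Wallace tree of adders).
uint64_t mul_shift_add(uint32_t x, uint32_t y) {
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if (y & (1u << i))
            acc += (uint64_t)x << i;  // partial product for bit i
    return acc;
}

// Restoring long division: each step needs a full-width compare/subtract
// against y, and depends on the remainder from the PREVIOUS step, forming
// a serial chain of n dependent operations. Assumes y > 0.
uint32_t div_long(uint32_t x, uint32_t y, uint32_t *rem) {
    uint32_t q = 0;
    uint64_t r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((x >> i) & 1);  // bring down the next bit of x
        if (r >= y) {                   // full n-bit comparison
            r -= y;
            q |= 1u << i;               // one more quotient bit
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```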

Newton–Raphson solves integer division in O(M(n)) complexity via iterative approximation, faster than the otherwise O(n²) complexity.

In code, the method contains 10 multiplications, 9 additions, and 2 bitwise shifts.

This explains why a division takes roughly 12x as many CPU ticks as a multiplication.
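As a sketch of the Newton–Raphson idea (shown here for a floating-point reciprocal rather than the integer version this answer refers to): each iteration r' = r·(2 − y·r) uses only multiplies and subtracts, and roughly doubles the number of correct bits.

```c
#include <math.h>

// Newton-Raphson reciprocal: compute 1/y with no divide instruction.
// Starting from the classic linear estimate (good to ~4 bits), four
// iterations reach full double precision: ~4 -> 8 -> 16 -> 32 -> 64 bits.
double nr_reciprocal(double y) {
    int e;
    double m = frexp(y, &e);               // y = m * 2^e, with m in [0.5, 1)
    double r = 48.0/17.0 - 32.0/17.0 * m;  // initial estimate of 1/m
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - m * r);             // Newton step: only mul/sub
    return ldexp(r, -e);                   // undo the exponent: 1/y = (1/m) * 2^-e
}
```

Then x / y becomes x * nr_reciprocal(y); the quadratic convergence is why the operation count stays so small.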

Be very careful with division, and avoid it when possible. For example, hoist float inverse = 1.0f / divisor; out of a loop and multiply by inverse inside the loop. (If the rounding error in inverse is acceptable)

Usually 1.0/x will not be exactly-representable as a float or double. It will be exact when x is a power of 2. This lets compilers optimize x / 2.0f to x * 0.5f without any change in the result.

To let the compiler do this optimization for you even when the result won't be exact (or with a runtime-variable divisor), you need options like gcc -O3 -ffast-math. Specifically, -freciprocal-math (enabled by -funsafe-math-optimizations enabled by -ffast-math) lets the compiler replace x / y with x * (1/y) when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimization by default (I think it does, but I forget).

-ffast-math is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

Also note that C++ compilers can fold + and * into an FMA in some cases (when compiling for a target that supports it, like -march=haswell), but they can't do that with /.


Division has worse latency than multiplication or addition (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 40 (see footnote 1) for a tight loop doing only division instead of only multiplication.

The divide / sqrt unit is not fully pipelined, for reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have an arith.divider_active hardware performance counter to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution port bottlenecks. (Or more often, memory bottlenecks or long latency chains limiting instruction-level parallelism causing instruction throughput to be less than ~4 per clock).

However, FP division and sqrt on Intel and AMD CPUs (other than KNL) is implemented as a single uop, so it doesn't necessarily have a big throughput impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.

(Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code than integer multiply. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like idiv can cause alignment-sensitive front-end bottlenecks.)

So for example, this will be really bad:

for ()
    a[i] = b[i] / scale;  // division throughput bottleneck


// Instead, use this:
float inv = 1.0 / scale;
for ()
    a[i] = b[i] * inv;  // multiply (or store) throughput bottleneck

All you're doing in the loop is load/divide/store, and they're independent so it's throughput that matters, not latency.

A reduction like accumulator /= b[i] would bottleneck on divide or multiply latency, rather than throughput. But with multiple accumulators that you divide or multiply at the end, you can hide the latency and still saturate the throughput. Note that sum += a[i] / b[i] bottlenecks on add latency or div throughput, but not div latency because the division isn't on the critical path (the loop-carried dependency chain).
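A sketch of that multiple-accumulator idea (hypothetical function; assumes n is a multiple of 4 for brevity): four independent chains per array let out-of-order execution overlap the multiply latencies, and only one divide happens at the end.

```c
// Computes (a[0]*...*a[n-1]) / (b[0]*...*b[n-1]) with one divide total.
float product_of_ratios(const float *a, const float *b, int n) {
    float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    float q0 = 1.0f, q1 = 1.0f, q2 = 1.0f, q3 = 1.0f;
    for (int i = 0; i < n; i += 4) {
        p0 *= a[i];   q0 *= b[i];    // four independent dependency chains
        p1 *= a[i+1]; q1 *= b[i+1];  // per array, so each multiply only
        p2 *= a[i+2]; q2 *= b[i+2];  // waits on its own chain, not on the
        p3 *= a[i+3]; q3 *= b[i+3];  // previous iteration's other chains
    }
    // Combine at the end: divide latency is paid once, off the hot loop.
    return (p0 * p1 * p2 * p3) / (q0 * q1 * q2 * q3);
}
```

(Reassociating like this changes rounding slightly, which is exactly the kind of transformation -ffast-math licenses the compiler to do for you.)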


But in something like this (approximating a function like log(x) with a ratio of two polynomials), the divide can be pretty cheap:

for () {
    // (not shown: extracting the exponent / mantissa)
    float p = polynomial(b[i], 1.23, -4.56, ...);  // FMA chain for a polynomial
    float q = polynomial(b[i], 3.21, -6.54, ...);
    a[i] = p / q;
}

For log() over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating 2 in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dep chain, making things a LOT easier for out-of-order execution.

In this case, we don't bottleneck on divide latency because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.

We don't bottleneck on divide throughput as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real log() use case, there'd be a bunch of work extracting exponent / mantissa and combining things back together again, so there's even more work to do between divides.)


When you do need to divide, it's usually best to just divide instead of using rcpps

x86 has an approximate-reciprocal instruction (rcpps), which only gives you 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)

You can use this to do x / y = x * approx_recip(y) without using an actual divide instruction. (rcpps itself is fairly fast; usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table for a starting point.)

For most purposes, x * rcpps(y) is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If all you're doing is division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, maybe by doing the division as part of another loop that does other work.)

But if you're using division as part of a more complex function, the rcpps itself + the extra mul + FMA usually makes it faster to just divide with a divps instruction, except on CPUs with very low divps throughput.

(For example Knight's Landing, see below. KNL supports AVX512ER, so for float vectors the VRCP28PS result is already accurate enough to just multiply without a Newton-Raphson iteration. float mantissa size is only 24 bits.)


Specific numbers from Agner Fog's tables:

Unlike every other ALU operation, division latency/throughput is data-dependent on some CPUs. Again, this is because it's so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids write-back conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3 cycle instruction and then two 1-cycle operations).

Generally, the fastest cases are when the divisor is a "round" number like 2.0 or 0.5 (i.e. the base2 float representation has lots of trailing zeros in the mantissa).

float latency (cycles) / throughput (cycles per instruction, running just that back to back with independent inputs):

                   scalar & 128b vector        256b AVX vector
                   divss      |  mulss
                   divps xmm  |  mulps           vdivps ymm | vmulps ymm


Nehalem          7-14 /  7-14 | 5 / 1           (No AVX)
Sandybridge     10-14 / 10-14 | 5 / 1        21-29 / 20-28 (3 uops) | 5 / 1
Haswell         10-13 / 7     | 5 / 0.5       18-21 /   14 (3 uops) | 5 / 0.5
Skylake            11 / 3     | 4 / 0.5          11 /    5 (1 uop)  | 4 / 0.5


Piledriver       9-24 / 5-10  | 5-6 / 0.5      9-24 / 9-20 (2 uops) | 5-6 / 1 (2 uops)
Ryzen              10 / 3     | 3 / 0.5         10  /    6 (2 uops) | 3 / 1 (2 uops)


Low-power CPUs:
Jaguar(scalar)     14 / 14    | 2 / 1
Jaguar             19 / 19    | 2 / 1            38 /   38 (2 uops) | 2 / 2 (2 uops)


Silvermont(scalar)    19 / 17    | 4 / 1
Silvermont      39 / 39 (6 uops) | 5 / 2            (No AVX)


KNL(scalar)     27 / 17 (3 uops) | 6 / 0.5
KNL             32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)

double latency (cycles) / throughput (cycles per instruction):

                   scalar & 128b vector        256b AVX vector
                   divsd      |  mulsd
                   divpd xmm  |  mulpd           vdivpd ymm | vmulpd ymm


Nehalem         7-22 /  7-22 | 5 / 1        (No AVX)
Sandybridge    10-22 / 10-22 | 5 / 1        21-45 / 20-44 (3 uops) | 5 / 1
Haswell        10-20 /  8-14 | 5 / 0.5      19-35 / 16-28 (3 uops) | 5 / 0.5
Skylake        13-14 /     4 | 4 / 0.5      13-14 /     8 (1 uop)  | 4 / 0.5


Piledriver      9-27 /  5-10 | 5-6 / 1       9-27 / 9-18 (2 uops)  | 5-6 / 1 (2 uops)
Ryzen           8-13 /  4-5  | 4 / 0.5       8-13 /  8-9 (2 uops)  | 4 / 1 (2 uops)


Low power CPUs:
Jaguar            19 /   19  | 4 / 2            38 /  38 (2 uops)  | 4 / 2 (2 uops)


Silvermont(scalar) 34 / 32    | 5 / 2
Silvermont         69 / 69 (6 uops) | 5 / 2           (No AVX)


KNL(scalar)      42 / 42 (3 uops) | 6 / 0.5   (Yes, Agner really lists scalar as slower than packed, but fewer uops)
KNL              32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)

Ivybridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) has better divider performance, but its max clock speeds were lower.)

Atom, Silvermont, and even Knight's Landing (Xeon Phi based on Silvermont) have exceptionally low divide performance, and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has low power per-core, and packing lots of cores on a die gives it tighter die-area constraints than Skylake-AVX512. It seems that AVX512ER rcp28ps / pd is what you're "supposed" to use on KNL.

(See this InstLatx64 result for Skylake-AVX512 aka Skylake-X. Numbers for vdivps zmm: 18c / 10c, so half the throughput of ymm.)


Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.


Footnote 1: how I made up those div vs. mul performance ratios:

FP divide vs. multiply performance ratios are even worse than that in low-power CPUs like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).

Actual divide/multiply throughput ratios for scalar (non-vectorized) double: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely towards the 28 cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, but their 2-per-clock multiply throughput blows it away. (Even more so when your code can auto-vectorize with 256b AVX vectors). Also note that with the right compiler options, those multiply throughputs also apply to FMA.

Numbers from http://agner.org/optimize/ instruction tables for Intel Haswell/Skylake and AMD Ryzen, for SSE scalar (not including x87 fmul / fdiv) and for 256b AVX SIMD vectors of float or double. See also the tag wiki.