Different floating point result with optimization enabled: compiler bug?


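Different compilers have different optimization settings. Some of the faster settings do not maintain strict IEEE 754 floating-point rules. Visual Studio has a specific set of options, /fp:strict, /fp:precise and /fp:fast, of which /fp:fast relaxes what the standard allows. You might find that this flag is what controls the optimization in such settings. GCC has similar settings that change the behaviour.

If this is the case, then the only difference between the compilers is that GCC would pick the fastest floating-point behaviour by default at higher optimization levels, whereas Visual Studio does not change the floating-point behaviour at higher optimization levels. So it might not be an actual bug, but the intended behaviour of an option you didn't know you were turning on.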


Accepted answer

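Intel x86 processors use 80-bit extended precision internally, whereas double is normally 64 bits wide. Different optimization levels affect how often floating-point values get saved from the CPU into memory and are thereby rounded from 80-bit to 64-bit precision.

Use the -ffloat-store gcc option to get the same floating-point results at different optimization levels.

Alternatively, use the long double type, which is normally 80 bits wide on gcc, to avoid rounding from 80-bit to 64-bit precision.

man gcc says it all: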

-ffloat-store
    Do not store floating point variables in registers, and inhibit
    other options that might change whether a floating point value is
    taken from a register or memory.

    This option prevents undesirable excess precision on machines such
    as the 68000 where the floating registers (of the 68881) keep more
    precision than a "double" is supposed to have.  Similarly for the
    x86 architecture.  For most programs, the excess precision does
    only good, but a few programs rely on the precise definition of
    IEEE floating point.  Use -ffloat-store for such programs, after
    modifying them to store all pertinent intermediate computations
    into variables.

In x86_64 builds, compilers use SSE registers for float and double by default, so no extended precision is used and this issue doesn't occur.

The gcc compiler option -mfpmath controls that.
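
To see which of the two regimes a given build is in, the standard macro FLT_EVAL_METHOD from <cfloat> reports how intermediate results are evaluated. The following is only a minimal sketch; the helper names describe_eval_method and current_eval_method are made up for illustration.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: maps a FLT_EVAL_METHOD value to a short description.
std::string describe_eval_method(int method) {
    switch (method) {
        case 0:  return "each operation is evaluated in its own type";
        case 1:  return "float and double are evaluated as double";
        case 2:  return "everything is evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}

// Reports the evaluation method the current translation unit was built with.
std::string current_eval_method() {
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

A build that uses SSE math would be expected to report method 0 and an x87 build method 2, but that is worth confirming on your own toolchain rather than taking on trust.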


For those who can't reproduce the bug: do not uncomment the commented out debug stmts, they affect the result.

This implies that the problem is related to the debug statements. And it looks like there's a rounding error caused by loading the values into registers during the output statements, which is why others found that you can fix this with -ffloat-store

Further question:

I was wondering, should I always turn on the -ffloat-store option?

To be flippant, there must be a reason that some programmers don't turn on -ffloat-store, otherwise the option wouldn't exist (likewise, there must be a reason that some programmers do turn on -ffloat-store). I wouldn't recommend always turning it on or always turning it off. Turning it on prevents some optimizations, but turning it off allows for the kind of behavior you're getting.
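
A narrower alternative to the global flag is to force only the relevant intermediates through memory. The sketch below assumes a rounding helper along the lines of the one discussed in this thread; the name round_stored, its signature and the use of volatile are illustrative choices, not the asker's code.

```cpp
#include <cmath>

// Hypothetical helper: rounds v to `digits` decimal places (half up).
// Every intermediate is a volatile double, so the compiler has to store it
// to memory, which rounds any 80-bit x87 value down to a 64-bit double.
double round_stored(double v, int digits) {
    double scale = 1.0;
    for (int i = 0; i < digits; ++i) {
        scale *= 10.0;  // exact for small digit counts
    }
    volatile double scaled = v * scale;
    volatile double shifted = scaled + 0.5;
    volatile double floored = std::floor(shifted);
    volatile double result = floored / scale;
    return result;
}
```

With every step rounded to double, round_stored(4.45, 1) and round_stored(4.55, 1) give 4.5 and 4.6 respectively. The cost is a memory round trip per intermediate, but only in this one function.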

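However, in general there is some mismatch between binary floating-point numbers (which the computer uses) and decimal floating-point numbers (which people are familiar with), and that mismatch can cause behavior similar to what you're getting (to be clear, the behavior you're getting is not caused by this mismatch, but similar behavior can be). The thing is, since you already have some vagueness when dealing with floating point, I can't say that -ffloat-store makes it any better or any worse.

Instead, you may want to look into other solutions to the problem you're trying to solve (unfortunately, Koenig doesn't point to the actual paper, and I can't really find an obvious "canonical" place for it, so I'll have to send you to Google).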

If you're not rounding for output purposes, I would probably look at std::modf() (in cmath) and std::numeric_limits<double>::epsilon() (in limits). Thinking over the original round() function, I believe it would be cleaner to replace the call to std::floor(d + .5) with a call to this function:

// this still has the same problems as the original rounding function
int round_up(double d)
{
    // return value will be coerced to int, and truncated as expected
    // you can then assign the int to a double, if desired
    return d + 0.5;
}

I think that suggests the following improvement:

#include <cmath>
#include <limits>

// this won't work for negative d ...
// this may still round some numbers up when they should be rounded down
int round_up(double d)
{
    double floor;
    // split d into its integral part (floor) and fractional part (d)
    d = std::modf(d, &floor);
    return floor + (d + .5 + std::numeric_limits<double>::epsilon());
}

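A simple note: std::numeric_limits<T>::epsilon() is defined as "the smallest number added to 1 that creates a number not equal to 1." You usually need to use a relative epsilon (that is, scale epsilon somehow to account for the fact that you're working with numbers other than 1). The sum of d, .5 and std::numeric_limits<double>::epsilon() should be near 1, so grouping that addition means that std::numeric_limits<double>::epsilon() will be about the right size for what we're doing. If anything, std::numeric_limits<double>::epsilon() will be too large (when the sum of all three is less than one) and may cause us to round some numbers up when we shouldn't.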
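
To make the "relative epsilon" idea concrete, here is a minimal sketch of a comparison whose tolerance scales with the operands. The function name approx_equal and the default factor of 4 are assumptions for illustration, not a standard recipe.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: compares a and b with a tolerance proportional to
// their magnitude. `ulps` is an assumed safety factor, not a standard value.
bool approx_equal(double a, double b, double ulps = 4.0) {
    const double eps = std::numeric_limits<double>::epsilon();
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * eps * scale;
}
```

For example, approx_equal(0.1 + 0.2, 0.3) holds even though the two sides differ in the last bit, while approx_equal(1.0, 1.0001) does not. A purely relative test like this is unsuitable near zero, where an absolute tolerance is usually added.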

Nowadays, you should consider std::nearbyint().
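
As a small sketch of how std::nearbyint() differs from std::round(), the wrappers below (whose names are made up for illustration) show the two tie-breaking rules. std::nearbyint() follows the current rounding mode, which is round-half-to-even unless it has been changed with std::fesetround().

```cpp
#include <cmath>

// Rounds to the nearest integer using the current rounding mode
// (round-half-to-even by default).
double round_half_even(double d) {
    return std::nearbyint(d);
}

// std::round always rounds halfway cases away from zero.
double round_half_away(double d) {
    return std::round(d);
}
```

So round_half_even(2.5) is 2.0 and round_half_even(3.5) is 4.0, whereas round_half_away(2.5) is 3.0. Neither helps with values such as 4.45 that are not exact halves in binary to begin with.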


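Output should be: 4.5 4.6

That's what the output would be if you had infinite precision, or if you were working with a device that used a decimal-based rather than binary-based floating-point representation. But you aren't. Most computers use the binary IEEE floating-point standard.

As Maxim Yegorushkin already noted in his answer, part of the problem is that internally your computer is using an 80-bit floating-point representation. This is just part of the problem, though. The root of the problem is that no number of the form n.nn5 has an exact binary floating-point representation. Those corner cases are always inexact numbers.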
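
One way to see the inexactness is to print more digits than a double can carry. This is a minimal sketch; the helper name to_decimal is hypothetical, and the digits shown depend on the C library printing the stored value exactly (glibc does).

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats d with `places` digits after the decimal
// point, so the binary value actually stored becomes visible.
std::string to_decimal(double d, int places = 25) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", places, d);
    return std::string(buf);
}
```

With such a library, to_decimal(4.45) starts with 4.45000000000000017 and to_decimal(4.55) starts with 4.54999999999999982, so the stored 4.45 lies slightly above the decimal value and the stored 4.55 slightly below it.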

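If you really want your rounding to handle these corner cases reliably, you need a rounding algorithm that addresses the fact that n.n5, n.nn5, n.nnn5 and so on (but not n.5) are always inexact. Find the corner case that determines whether an input value rounds up or down, and return the rounded-up or rounded-down value based on a comparison with that corner case. You also need to take care that an optimizing compiler does not keep that corner case in an extended-precision register.

See "How does Excel successfully Rounds Floating numbers even though they are imprecise?" for such an algorithm.

Or you can just live with the fact that the corner cases will sometimes round erroneously.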


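Personally, I have hit the same problem going the other way, from gcc to VS. In most instances I think it is better to avoid optimisation. The only time it is worthwhile is when you're dealing with numerical methods involving large arrays of floating-point data. Even after disassembling, I'm often underwhelmed by the compiler's choices. Very often it's just easier to use compiler intrinsics or to write the assembly yourself.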


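The accepted answer is correct if you are compiling for an x86 target that doesn't include SSE2. All modern x86 processors support SSE2, so if you can take advantage of it, you should: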

-mfpmath=sse -msse2 -ffp-contract=off

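Let's break this down.

-mfpmath=sse -msse2. This performs rounding by using SSE2 registers, which is much faster than storing every intermediate result to memory. Note that this is already the default on GCC for x86-64. From the GCC wiki: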

On more modern x86 processors that support SSE2, specifying the compiler options -mfpmath=sse -msse2 ensures all float and double operations are performed in SSE registers and correctly rounded. These options do not affect the ABI and should therefore be used whenever possible for predictable numerical results.

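-ffp-contract=off. Controlling rounding isn't enough for an exact match, however. FMA (fused multiply-add) instructions can change the rounding behavior compared with their non-fused counterparts, so we need to disable them. This is the default on Clang, not GCC. As explained by this answer:

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

By disabling FMA, we get results that match exactly between debug and release, at the cost of some performance (and accuracy). We can still take advantage of the other performance benefits of SSE and AVX.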
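
The one-rounding versus two-roundings difference can be reproduced with std::fma from <cmath>. This is a sketch with hypothetical function names; the volatile store is there only to keep the compiler from contracting the unfused version into an FMA itself.

```cpp
#include <cmath>

// a * b + c with the product rounded to double first (two roundings).
// The volatile store keeps the compiler from fusing the two operations.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// a * b + c with a single rounding at the end.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

Take a = 1 + 2^-27 and c = -(1 + 2^-26). The exact product a * a is 1 + 2^-26 + 2^-54, and rounding it to double drops the 2^-54 term. So mul_then_add(a, a, c) returns 0, while fused(a, a, c) returns 2^-54.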


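I dug further into this problem and can add some detail. First, the exact representations of 4.45 and 4.55 according to gcc on x86_64 are the following (using libquadmath to print the last precision):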

float 32:   4.44999980926513671875
double 64:  4.45000000000000017763568394002504646778106689453125
doublex 80: 4.449999999999999999826527652402319290558807551860809326171875
quad 128:   4.45000000000000000000000000000000015407439555097886824447823540679418548304813185723105561919510364532470703125


float 32:   4.55000019073486328125
double 64:  4.54999999999999982236431605997495353221893310546875
doublex 80: 4.550000000000000000173472347597680709441192448139190673828125
quad 128:   4.54999999999999999999999999999999984592560444902113175552176459320581451695186814276894438080489635467529296875

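As Maxim said above, the problem is due to the 80-bit size of the FPU registers.

But why does the problem never occur on Windows? On IA-32, Windows configures the x87 FPU to use an internal mantissa precision of 53 bits (equivalent to a total size of 64 bits: double). On Linux and Mac OS, the default precision of 64 bits is used (equivalent to a total size of 80 bits: long double). So the problem should be possible, or not, on these different platforms by changing the control word of the FPU (assuming the sequence of instructions would trigger the bug). The issue was reported to gcc as bug 323 (read at least comment 92!).

To show the mantissa precision on Windows, you can compile this as 32-bit with VC++: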

#include "stdafx.h"
#include <stdio.h>
#include <float.h>


int main(void)
{
    char t[] = { 64, 53, 24, -1 };
    unsigned int cw = _control87(0, 0);
    printf("mantissa is %d bits\n", t[(cw >> 16) & 3]);
}

And on Linux/Cygwin:

#include <stdio.h>


int main(int argc, char **argv)
{
    char t[] = { 24, -1, 53, 64 };
    unsigned int cw = 0;
    __asm__ __volatile__ ("fnstcw %0" : "=m" (*&cw));
    printf("mantissa is %d bits\n", t[(cw >> 8) & 3]);
}

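Note that with gcc you can set the FPU precision with -mpc32/64/80, though it is ignored in Cygwin. Keep in mind that it changes the size of the mantissa but not that of the exponent, which leaves the door open to other kinds of differing behavior.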

On x86_64 architecture, SSE is used as said by Tmandry, so the problem will not occur unless you force the old x87 FPU for FP computing with -mfpmath=387, or unless you compile in 32 bits mode with -m32 (you will need multilib package). I could reproduce the problem on Linux with different combinations of flags and versions of gcc:

g++-5 -m32 floating.cpp -O1
g++-8 -mfpmath=387 floating.cpp -O1

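I tried a few combinations on Windows and Cygwin with VC++/gcc/tcc, but the bug never showed up. I suppose the generated instruction sequence is not the same.

Finally, note that an exotic way to prevent this problem with 4.45 or 4.55 would be to use _Decimal32/64/128, but support is really scarce... I spent a lot of time just to be able to do a printf with libdfp!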