Is it possible to get 0 by subtracting two unequal floating point numbers?

Is it possible to get division by 0 (or infinity) in the following example?

public double calculation(double a, double b)
{
    if (a == b)
    {
        return 0;
    }
    else
    {
        return 2 / (a - b);
    }
}

In normal cases it will not, of course. But what if a and b are very close: can (a - b) end up being 0 due to the precision of the calculation?

Note that this question is for Java, but I think it will apply to most programming languages.


You wouldn't get a division by zero regardless of the value of a - b, since floating point division by 0 doesn't throw an exception. It returns infinity.
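For instance, these statements (dropped into a main method) print special values rather than throwing:

System.out.println(2.0 / 0.0);  // Infinity
System.out.println(-2.0 / 0.0); // -Infinity
System.out.println(0.0 / 0.0);  // NaN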

Now, the only way a == b would return true is if a and b contain the exact same bits. If they differ by just the least significant bit, the difference between them will not be 0.

EDIT :

As Bathsheba correctly commented, there are some exceptions:

  1. "Not a number compares" false with itself but will have identical bit patterns.

  2. -0.0 is defined to compare true with +0.0, and their bit patterns are different.

So if both a and b are Double.NaN, you will reach the else clause, but since NaN - NaN also returns NaN, you will not be dividing by zero.
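A small demonstration of both exceptions (the class name is just for illustration):

public class EqualityEdgeCases {
    public static void main(String[] args) {
        // NaN compares false with itself, even with identical bit patterns.
        double nan1 = Double.NaN;
        double nan2 = Double.NaN;
        System.out.println(nan1 == nan2);              // false
        System.out.println(Double.doubleToRawLongBits(nan1)
                == Double.doubleToRawLongBits(nan2));  // true

        // -0.0 compares true with +0.0, but the bit patterns differ.
        System.out.println(-0.0 == 0.0);               // true
        System.out.println(Double.doubleToRawLongBits(-0.0)
                == Double.doubleToRawLongBits(0.0));   // false
    }
}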

You shouldn't ever compare floats or doubles for equality, because you can't really guarantee that the number you assign to the float or double is exact.

To compare floats for equality sanely, you need to check if the value is "close enough" to the same value:

if ((first >= second - error) && (first <= second + error))
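Or, as a sketch using Math.abs (the helper name and tolerance are just illustrative):

public class NearlyEqual {
    // Hypothetical helper: absolute-tolerance comparison.
    static boolean nearlyEqual(double first, double second, double error) {
        return Math.abs(first - second) <= error;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);                  // false
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3, 1e-9)); // true
    }
}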

In Java, a - b is never equal to 0 if a != b. This is because Java mandates IEEE 754 floating point operations which support denormalized numbers. From the spec:

In particular, the Java programming language requires support of IEEE 754 denormalized floating-point numbers and gradual underflow, which make it easier to prove desirable properties of particular numerical algorithms. Floating-point operations do not "flush to zero" if the calculated result is a denormalized number.

If an FPU works with denormalized numbers, subtracting unequal numbers can never produce zero (unlike multiplication); see also this question.

For other languages, it depends. In C or C++, for example, IEEE 754 support is optional.

That said, it is possible for the expression 2 / (a - b) to overflow, for example with a = 5e-308 and b = 4e-308.
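A quick sketch of that overflow (statements for a main method):

double a = 5e-308;
double b = 4e-308;
System.out.println(a - b);       // ~1e-308, a non-zero denominator
System.out.println(2 / (a - b)); // Infinity: the quotient exceeds Double.MAX_VALUE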

As a workaround, what about the following?

public double calculation(double a, double b) {
    double c = a - b;
    if (c == 0)
    {
        return 0;
    }
    else
    {
        return 2 / c;
    }
}

That way you don't depend on IEEE support in any language.

There is no case where a division by zero can happen here.

The SMT Solver Z3 supports precise IEEE floating point arithmetic. Let's ask Z3 to find numbers a and b such that a != b && (a - b) == 0:

(set-info :status unknown)
(set-logic QF_FP)
(declare-fun b () (FloatingPoint 8 24))
(declare-fun a () (FloatingPoint 8 24))
(declare-fun rm () RoundingMode)
(assert
(and (not (fp.eq a b)) (fp.eq (fp.sub rm a b) +zero) true))
(check-sat)

The result is UNSAT. There are no such numbers.

The above SMTLIB string also allows Z3 to pick an arbitrary rounding mode (rm). This means that the result holds for all possible rounding modes (of which there are five). The result also includes the possibility that any of the variables in play might be NaN or infinity.

a == b is implemented as fp.eq equality so that +0f and -0f compare equal. The comparison with zero is implemented using fp.eq as well. Since the question is aimed at avoiding a division by zero, this is the appropriate comparison.

If the equality test were implemented using bitwise equality, +0f and -0f would have been a way to make a - b zero. An incorrect previous version of this answer contains more details about that case for the curious.

Z3 Online does not yet support the FPA theory. This result was obtained using the latest unstable branch. It can be reproduced using the .NET bindings as follows:

using Microsoft.Z3;

var context = new Context();

var fpSort = context.MkFPSort32();
var aExpr = (FPExpr)context.MkConst("a", fpSort);
var bExpr = (FPExpr)context.MkConst("b", fpSort);
var rmExpr = (FPRMExpr)context.MkConst("rm", context.MkFPRoundingModeSort());
var fpZero = context.MkFP(0f, fpSort);
var subExpr = context.MkFPSub(rmExpr, aExpr, bExpr);

// a != b && (a - b) == 0, under an arbitrary rounding mode rm.
var constraintExpr = context.MkAnd(
    context.MkNot(context.MkFPEq(aExpr, bExpr)),
    context.MkFPEq(subExpr, fpZero),
    context.MkTrue()
);

// Produces the SMTLIB string shown above.
var smtlibString = context.BenchmarkToSMTString(null, "QF_FP", null, null, new BoolExpr[0], constraintExpr);

var solver = context.MkSimpleSolver();
solver.Assert(constraintExpr);

var status = solver.Check(); // expect UNSAT: no such a and b exist
Console.WriteLine(status);

Using Z3 to answer IEEE float questions is nice because it is hard to overlook cases (such as NaN, -0f, +-inf) and you can ask arbitrary questions. No need to interpret and cite specifications. You can even ask mixed float and integer questions such as "is this particular int log2(float) algorithm correct?".

I can think of a case where you might be able to cause this to happen. Here's an analogous sample in base 10 - really, this would happen in base 2, of course.

Floating point numbers are stored more or less in scientific notation - that is, instead of seeing 35.2, the number being stored would be more like 3.52e1.

Imagine for the sake of convenience that we have a floating point unit that operates in base 10 and has 3 digits of accuracy. What happens when you subtract 9.99 from 10.0?

1.00e1 - 9.99e0

Shift to give each value the same exponent:

1.00e1 - 0.999e1

Round to 3 digits:

1.00e1 - 1.00e1

Uh oh!

Whether this can happen ultimately depends on the FPU design. Since the range of exponents for a double is very large, the hardware has to round internally at some point, but in the case above, just 1 extra digit internally will prevent any problem.
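For what it's worth, Java's IEEE 754 doubles behave as if there were enough guard digits: subtracting adjacent values stays non-zero (statements for a main method):

double a = 1.0;
double b = Math.nextUp(1.0); // the smallest double greater than 1.0
System.out.println(a == b);  // false
System.out.println(b - a);   // 2.220446049250313E-16, not 0.0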

In a floating-point implementation that conforms to IEEE-754, each floating-point type can hold numbers in two formats. One ("normalized") is used for most floating-point values, but the second-smallest number it can represent is only a tiny bit larger than the smallest, and so the difference between them is not representable in that same format. The other ("denormalized") format is used only for very small numbers that are not representable in the first format.

Circuitry to handle the denormalized floating-point format efficiently is expensive, and not all processors include it. Some processors offer a choice between either having operations on really small numbers be much slower than operations on other values, or having the processor simply regard numbers which are too small for normalized format as zero.

The Java specifications imply that implementations should support the denormalized format, even on machines where doing so makes code run more slowly. On the other hand, some implementations might offer options that let code run faster in exchange for slightly sloppy handling of values which would, for most purposes, be far too small to matter. When values are too small to matter, it can be annoying to have calculations involving them take ten times as long as calculations that do matter, so in many practical situations flush-to-zero is more useful than slow-but-accurate arithmetic.
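A small sketch of gradual underflow in Java, which mandates denormalized support (statements for a main method):

double a = Double.MIN_NORMAL;              // smallest normalized double
double b = Math.nextUp(Double.MIN_NORMAL); // one ulp above it
System.out.println(b - a);                 // 4.9E-324, a denormalized number
System.out.println(b - a == 0.0);          // false: no flush to zero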

The supplied function can indeed return infinity:

public class Test {
    public static double calculation(double a, double b)
    {
        if (a == b)
        {
            return 0;
        }
        else
        {
            return 2 / (a - b);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        double d1 = Double.MIN_VALUE;
        double d2 = 2.0 * Double.MIN_VALUE;
        System.out.println("Result: " + calculation(d1, d2));
    }
}

The output is Result: -Infinity.

When the magnitude of the quotient is too big to be stored in a double, infinity is returned even though the denominator is non-zero.

Based on @malarres' response and @Taemyr's comment, here is my little contribution:

public double calculation(double a, double b)
{
    double c = 2 / (a - b);

    // Should not have a big cost.
    if (Double.isNaN(c) || Double.isInfinite(c))
    {
        return 0; // A 'whatever' value.
    }
    else
    {
        return c;
    }
}

My point is: the easiest way to know whether the result of the division is NaN or infinite is actually to perform the division.

In olden times before IEEE 754, it was quite possible that a != b didn't imply a-b != 0 and vice versa. That was one of the reasons to create IEEE 754 in the first place.

With IEEE 754 it is almost guaranteed. C or C++ compilers are allowed to do an operation with higher precision than needed. So if a and b are not variables but expressions, then (a + b) != c doesn't imply (a + b) - c != 0, because a + b could be calculated once with higher precision, and once without higher precision.

Many FPUs can be switched to a mode where they don't return denormalized numbers but replace them with 0. In that mode, if a and b are tiny normalised numbers whose difference is smaller than the smallest normalised number but greater than 0, a != b doesn't guarantee a - b != 0 either.

"Never compare floating-point numbers" is cargo cult programming. Among the people who have the mantra "you need an epsilon", most have no idea how to choose that epsilon properly.

Division by zero is undefined, since the limit from positive numbers tends to infinity while the limit from negative numbers tends to negative infinity.

Not sure if this is C++ or Java since there is no language tag.

double calculation(double a, double b)
{
    if (a == b)
    {
        return nan("");    // C++
        return Double.NaN; // Java
    }
    else
    {
        return 2 / (a - b);
    }
}

The core problem is that the computer representation of a double (a floating-point approximation of a real number) is inexact when a value has "too many" decimals, for instance any value that can't be written exactly as a binary fraction (pi, or the result of 1/3).

So a == b can't be relied on for arbitrary double values of a and b. How do you deal with a == b when a = 0.333 and b = 1/3? Depending on your OS, FPU, language, and the number of 3s after the 0, you will get true or false.

Anyway, if you do "double value calculation" on a computer, you have to deal with accuracy, so instead of a == b you have to do absolute_value(a - b) < epsilon, where epsilon is relative to what you are modeling at that point in your algorithm. You can't have a single epsilon value for all of your double comparisons.
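For example, a relative tolerance scaled to the operands (a sketch; the helper name and tolerance values are just illustrative):

public class ApproxEqual {
    // Hypothetical helper: tolerance scaled to the magnitude of the operands.
    static boolean approxEqual(double a, double b, double relEps) {
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= relEps * scale;
    }

    public static void main(String[] args) {
        System.out.println(approxEqual(0.333, 1.0 / 3.0, 1e-2)); // true
        System.out.println(approxEqual(0.333, 1.0 / 3.0, 1e-6)); // false
    }
}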

In brief, when you type a == b, you have a mathematical expression that cannot always be translated exactly onto a computer (for arbitrary floating-point numbers).

PS: everything I answer here is already more or less covered in other responses and comments.