In what order should floating point numbers be added to get the most precise result?

This is a question I was asked in a recent interview, and I'd like to know the answer (I don't actually remember my numerical analysis theory, so please help me :)

If we have a function that accumulates floating point numbers:

std::accumulate(v.begin(), v.end(), 0.0);

where v is, for example, a std::vector<float>.

  • Would it be better to sort these numbers before accumulating them?

  • Which order would give the most precise answer?

I suspect that sorting the numbers in ascending order would actually make the numerical error smaller, but unfortunately I cannot prove it myself.

P.S. I do realise this probably has nothing to do with real world programming; I'm just curious.


Your instinct is basically right: sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.

If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.

That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.

Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.

In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a moderate number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together and then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you might as well just do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.

It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway you probably don't care, for example if you're only adding together a small number of values in the first place and are only using a few significant figures of the sum.

There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.

According to Wikipedia,

The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).

In pseudocode, the algorithm is:

function kahanSum(input)
    var sum = input[1]
    var c = 0.0             // A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c    // So far, so good: c is zero.
        t = sum + y         // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y   // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t             // Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                  // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum

I tried out the extreme example in the answer supplied by Steve Jessop.

#include <iostream>
#include <iomanip>
#include <cmath>


int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    for (long i = 0; i < billion; ++i)
        sum += small;
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;
    sum += big;
    std::cout << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I got the following result:

1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371    (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933    (difference = 0.000000007460067)

The error in the first line is more than ten times bigger than in the second.

If I change the doubles to floats in the code above, I get:

1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000    (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000    (difference = 0.968750000000000)

Neither answer is even close to 2.0 (but the second is slightly closer).

Using the Kahan summation (with doubles) as described by Daniel Pryden:

#include <iostream>
#include <iomanip>
#include <cmath>


int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;
    for (long i = 0; i < billion; ++i) {
        double y = small - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    std::cout << "Kahan sum  = " << std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I get exactly 2.0:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

And even if I change the doubles to floats in the code above, I get:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

It would seem that Kahan is the way to go!

Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:

  1. Decide on a maximum difference in exponent between two numbers, above which you would consider too much precision to be lost.

  2. Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.

You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.

I think this will be quite slow if you have to calculate exponents all the time.

I had a quick go with a program, and the result was 1.99903.
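A rough sketch of this approach might look like the following (this isn't the program mentioned above; the frexp-based exponent check, the threshold values and the name thresholdSum are illustrative choices):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double thresholdSum(std::vector<double> values, int maxExponentDiff = 20)
{
    if (values.empty())
        return 0.0;
    while (values.size() > 1)
    {
        // Sort ascending by magnitude, as in Steve's answer.
        std::sort(values.begin(), values.end(),
                  [](double a, double b) { return std::fabs(a) < std::fabs(b); });

        std::vector<double> pending;          // the temporary queue of partial sums
        double acc = values.front();
        for (std::size_t i = 1; i < values.size(); ++i)
        {
            int accExp = 0, nextExp = 0;
            std::frexp(acc, &accExp);
            std::frexp(values[i], &nextExp);
            if (accExp - nextExp > maxExponentDiff)
            {
                pending.push_back(acc);       // accumulator too large for the next number
                acc = values[i];
            }
            else
            {
                acc += values[i];
            }
        }
        pending.push_back(acc);
        values.swap(pending);                 // repeat on the partial sums...
        maxExponentDiff += 10;                // ...allowing a larger exponent difference
    }
    return values.front();
}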

I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation the accumulator gets bigger and bigger. If you have a large number of similar values, you will start to lose precision quickly. Here is what I would suggest instead:

while the list has multiple elements
    remove the two smallest elements from the list
    add them and put the result back in
the single element in the list is the result

Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:

template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}

driver:

#include <iterator>
#include <queue>


template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::priority_queue<vt> positive_queue;
    positive_queue.push(0);
    std::priority_queue<vt> negative_queue;
    negative_queue.push(0);
    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
        {
            negative_queue.push(x);
        }
        else
        {
            positive_queue.push(-x);
        }
    }
    reduce(positive_queue);
    reduce(negative_queue);
    return negative_queue.top() - positive_queue.top();
}

The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.
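A small usage example, assuming the two reduce() overloads above are in scope (the sample values are just for illustration):

#include <iostream>

int main()
{
    // Using a plain array here keeps the call simple and unambiguous.
    float data[] = { 1.0f, -1.0f, 1e-9f, 2.5f, -0.5f };
    float total = reduce(data, data + 5);
    std::cout << total << '\n';
    return 0;
}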

There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.

In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.

Here is the abstract of a recent paper:

We present a novel, online algorithm for exact summation of a stream of floating-point numbers. By “online” we mean that the algorithm needs to see only one input at a time, and can take an arbitrary length input stream of such inputs while requiring only constant memory. By “exact” we mean that the sum of the internal array of our algorithm is exactly equal to the sum of all the inputs, and the returned result is the correctly-rounded sum. The proof of correctness is valid for all inputs (including nonnormalized numbers but modulo intermediate overflow), and is independent of the number of summands or the condition number of the sum. The algorithm asymptotically needs only 5 FLOPs per summand, and due to instruction-level parallelism runs only about 2--3 times slower than the obvious, fast-but-dumb “ordinary recursive summation” loop when the number of summands is greater than 10,000. Thus, to our knowledge, it is the fastest, most accurate, and most memory efficient among known algorithms. Indeed, it is difficult to see how a faster algorithm or one requiring significantly fewer FLOPs could exist without hardware improvements. An application for a large number of summands is provided.

Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.

This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know /how/ inaccurate your results are, and if you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
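As a sketch, the standard <cfenv> header can be used to switch the rounding mode around two plain accumulations (whether the mode is actually honoured may depend on compiler flags, so treat this as illustrative):

#include <cfenv>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v = { 1.0, 1e-9, 1e-9, 1e-9 };   // illustrative values

    std::fesetround(FE_DOWNWARD);                        // sum once rounding down
    double low = std::accumulate(v.begin(), v.end(), 0.0);

    std::fesetround(FE_UPWARD);                          // and once rounding up
    double high = std::accumulate(v.begin(), v.end(), 0.0);

    std::fesetround(FE_TONEAREST);                       // restore the default mode
    std::cout << "sum is between " << low << " and " << high << '\n';
    return 0;
}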

Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.

The simplest sort that improves accuracy is to sort by ascending absolute value. That lets the smallest magnitude values have a chance to accumulate or cancel before interacting with larger magnitude values that would trigger a loss of precision.
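As a minimal illustration (the function name is arbitrary; the vector is taken by value so the caller's data isn't reordered):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Sort by ascending magnitude, then accumulate.
double sortedSum(std::vector<float> v)
{
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(v.begin(), v.end(), 0.0);
}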

That said, you can do better by tracking multiple non-overlapping partial sums. Here is a paper describing the technique and presenting a proof-of-accuracy: www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps

That algorithm and other approaches to exact floating point summation are implemented in simple Python at: http://code.activestate.com/recipes/393090/ At least two of those can be trivially converted to C++.

Your floats should be added in double precision. That will give you more extra precision than any other technique can. For a bit more precision and significantly more speed, you can create, say, four sums, and add them up at the end.
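A sketch of that suggestion, with the floats going into four independent double accumulators (fourWaySum is just an illustrative name; the independent accumulators also give the compiler room to pipeline or vectorise the loop):

#include <cstddef>
#include <vector>

double fourWaySum(const std::vector<float>& v)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)
    {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i)        // handle any leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}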

If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
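With the std::accumulate call from the question, that amounts to passing a long double initial value, since the initial value determines the accumulator type (a minimal sketch; longDoubleSum is an illustrative name):

#include <numeric>
#include <vector>

// The 0.0L initial value makes the accumulator (and the result) long double.
long double longDoubleSum(const std::vector<double>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0L);
}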

For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:

/* clear array */
void clearsum(float asum[256])
{
    size_t i;
    for(i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
    size_t i;
    while(1){
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
        if(i == 0xff){          /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if(asum[i] == 0.f){     /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];           /* else add slot to f, clear slot */
        asum[i] = 0.f;          /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
    float sum = 0.f;
    size_t i;
    for(i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}

double precision example:

/* clear array */
void clearsum(double asum[2048])
{
    size_t i;
    for(i = 0; i < 2048; i++)
        asum[i] = 0.;
}

/* add a number into array */
void addtosum(double d, double asum[2048])
{
    size_t i;
    while(1){
        /* i = exponent of d */
        i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
        if(i == 0x7ff){         /* max exponent, could be overflow */
            asum[i] += d;
            return;
        }
        if(asum[i] == 0.){      /* if empty slot store d */
            asum[i] = d;
            return;
        }
        d += asum[i];           /* else add slot to d, clear slot */
        asum[i] = 0.;           /* and continue until empty slot */
    }
}

/* return sum from array */
double returnsum(double asum[2048])
{
    double sum = 0.;
    size_t i;
    for(i = 0; i < 2048; i++)
        sum += asum[i];
    return sum;
}
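An illustrative usage of the double precision version (sample values only; it assumes the three functions above are declared beforehand):

#include <cstdio>

int main()
{
    double asum[2048];
    double v[] = { 1.0, 1e-9, 1e-9, 1e-9 };   /* illustrative values */

    clearsum(asum);
    for (int i = 0; i < 4; i++)
        addtosum(v[i], asum);
    std::printf("%.17g\n", returnsum(asum));
    return 0;
}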

Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:

((-1 + 1) + 1e-20) will give 1e-20

but

((1e-20 + 1) - 1) will give 0

In the first equation the two large numbers are cancelled out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.

Also, pairwise summation is pretty decent for summing lots of numbers.
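A minimal recursive sketch of pairwise summation (the function name and the base-case size of 8 are arbitrary choices):

#include <cstddef>

double pairwiseSum(const double* a, std::size_t n)
{
    if (n <= 8)                        // small base case: plain left-to-right loop
    {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }
    std::size_t half = n / 2;          // split in two, sum each half, then combine
    return pairwiseSum(a, half) + pairwiseSum(a + half, n - half);
}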