Calculating mid in binary search

I'm reading a book on algorithms that gives the following binary search algorithm:

public class BinSearch {
    static int search(int[] A, int K) {
        int l = 0;
        int u = A.length - 1;
        int m;
        while (l <= u) {
            m = (l + u) / 2;
            if (A[m] < K) {
                l = m + 1;
            } else if (A[m] == K) {
                return m;
            } else {
                u = m - 1;
            }
        }
        return -1;
    }
}

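The author says: "The error is in the assignment m = (l+u)/2; it can lead to overflow and should be replaced by m = l + (u-l)/2."

I can't see how this would cause an overflow. When I ran the algorithm in my head on a few different inputs, I didn't see the mid value go out of the array index range.

So, in which cases would the overflow occur?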


The potential overflow is in the l+u addition itself.

This was actually a bug in early versions of binary search in the JDK.

The problem is that (l+u) is evaluated first, and could overflow int, so (l+u)/2 would return the wrong value.
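
To make this concrete, here is a minimal Java sketch comparing the two midpoint formulas from the question. The class name, method names, and the two bounds are made up for illustration; any pair of ints whose sum exceeds Integer.MAX_VALUE would show the same effect.

```java
class MidpointOverflowDemo {
    // The midpoint as written in the question: the sum l + u is formed first.
    static int naiveMid(int l, int u) {
        return (l + u) / 2;
    }

    // The rewritten form: for 0 <= l <= u, the difference u - l always fits in an int.
    static int safeMid(int l, int u) {
        return l + (u - l) / 2;
    }

    public static void main(String[] args) {
        int l = 1_500_000_000; // hypothetical bounds, each a valid int on its own
        int u = 2_000_000_000;
        System.out.println(naiveMid(l, u)); // prints -397483648
        System.out.println(safeMid(l, u));  // prints 1750000000
    }
}
```

The true sum, 3,500,000,000, does not fit in 32 signed bits, so Java wraps it to -794,967,296 before the division ever runs.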

This post covers this famous bug in a lot of detail. As others have said it's an overflow issue. The fix recommended on the link is as follows:

int mid = low + ((high - low) / 2);


// Alternatively
int mid = (low + high) >>> 1;

It is also probably worth mentioning that if negative indices are allowed, or if it's not even an array that's being searched (for example, searching some integer range for a value satisfying some condition), the code above may not be correct either. In this case, something as ugly as

(low < 0 && high >= 0) ? (low + high) / 2 : low + (high - low) / 2

may be necessary. One good example is searching for the median in an unsorted array without modifying it or using additional space, by simply performing a binary search on the whole Integer.MIN_VALUE to Integer.MAX_VALUE range.
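
As a rough sketch of that idea in Java (class and method names are hypothetical, and low <= high is assumed): the sum is used only when the bounds straddle zero, and the difference is used otherwise. This version tests high >= 0 rather than high > 0, so that the pair (Integer.MIN_VALUE, 0) is also covered, since 0 - Integer.MIN_VALUE does not fit in an int.

```java
class SignedMidpoint {
    // Midpoint of [low, high] for any ints with low <= high.
    static int mid(int low, int high) {
        if (low < 0 && high >= 0) {
            // Opposite signs (or high == 0): the sum cannot overflow.
            return (low + high) / 2;
        }
        // Both negative or both non-negative: the difference cannot overflow.
        return low + (high - low) / 2;
    }

    public static void main(String[] args) {
        System.out.println(mid(Integer.MIN_VALUE, Integer.MAX_VALUE)); // prints 0
        System.out.println(mid(Integer.MIN_VALUE, 0));                 // prints -1073741824
        System.out.println(mid(2_000_000_000, 2_000_000_001));         // prints 2000000000
        System.out.println(mid(-11, -5));                              // prints -8
    }
}
```

Note that the two branches round differently for negative sums (Java's / truncates toward zero), which is usually harmless for a search as long as the result stays within [low, high].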

Jeff suggested a really good post to read about this bug; here is a summary if you want a quick overview.

In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.

The following C++ program can show you how an overflow can happen with a 32-bit unsigned integer:

#include <iostream>
using namespace std;

int main()
{
    unsigned int low = 33,
                 high = 4294967290,
                 mid;

    cout << "The value of low is " << low << endl;
    cout << "The value of high is " << high << endl;

    mid = (low + high) / 2;

    cout << "The value of mid is " << mid << endl;

    return 0;
}

If you run it on a Mac:

$ g++ try.cpp && ./a.out
The value of low is 33
The value of high is 4294967290
The value of mid is 13

You might expect mid to be 2147483661, but low + high wrapped around because a 32-bit unsigned integer cannot hold the true sum (4294967323). The addition yields 27 instead, so mid becomes 13.

When the calculation of mid is changed to

mid = low + (high - low) / 2;

Then it will show

The value of mid is 2147483661

The simple answer is, the addition l + u can overflow, and has undefined behavior in some languages, as described in a blog post by Joshua Bloch, about a bug in the Java library for the implementation of binary search.

Some readers may not understand what it is about:

l + (u - l) / 2

Note that in some code, the variable names are different, and it is

low + (high - low) / 2

The answer: say you have two numbers, 200 and 210, and you want the "middle number". Suppose also that adding any two numbers whose result is greater than 255 overflows, with undefined behavior. What can you do? A simple way is to add half of the difference between them to the smaller value. The difference between 200 and 210 is 10 (you can think of it as the "difference" or the "length" between them), so you only need to add 10 / 2 = 5 to 200 to get 205. You never have to add 200 and 210 together first. That is how we arrive at the calculation: (u - l) is the difference, (u - l) / 2 is half of it, and adding that to l gives l + (u - l) / 2.

It is like, if we are looking at two trees, one is 200 feet tall and one is 210 feet tall, what is the "midpoint" or the "mean"? We don't have to add them together first. We can just tell the difference is 10 feet, and we can add half of that, which is 5, to 200, and we know it is 205 feet.
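
The 200 and 210 example can be tried out in Java by simulating the 255 limit. This is only a sketch: it assumes that overflow wraps around modulo 256 (one possible outcome, chosen here so there is something to print), and the class and method names are invented for the demonstration.

```java
class EightBitMidpoint {
    // Keep only the low 8 bits, simulating an unsigned 8-bit value (0..255).
    static int wrap8(int x) {
        return x & 0xFF;
    }

    // Adds first, so the intermediate sum is wrapped before halving.
    static int naiveMid8(int l, int u) {
        return wrap8(l + u) / 2;
    }

    // Adds half of the difference to the smaller value; nothing exceeds 255.
    static int safeMid8(int l, int u) {
        return wrap8(l + wrap8(u - l) / 2);
    }

    public static void main(String[] args) {
        System.out.println(naiveMid8(200, 210)); // prints 77
        System.out.println(safeMid8(200, 210));  // prints 205
    }
}
```

Here 200 + 210 = 410 wraps to 154, and half of that is 77, nowhere near the real midpoint.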

To put this in historical perspective, Robert Sedgewick mentioned that the first binary search was stated in 1946, and it wasn't correct until 1964. Jon Bentley described in his book Programming Pearls in 1988 that more than 90% of professional programmers could not write it correctly given a couple of hours. But even Jon Bentley himself had that overflow bug for 20 years. A study published in 1988 showed that accurate code for binary search was found in only 5 out of 20 textbooks. In 2006, Joshua Bloch wrote that blog post about the bug in calculating the mid value. So it took 60 years for this code to be correct. But next time in a job interview, remember to write it correctly within those 5 minutes.

I have created this video with an example where number overflow will happen.

https://youtu.be/fMgenZq7qls

Usually, for a simple binary search that finds an element in an array, this won't happen because of array size limits in languages like Java. Where the problem space is not limited to an array, however, the problem can occur. Please see my video for a practical example.

int mid=(l+h)/2; can lead to an integer overflow problem.

(l+h) can evaluate to a negative integer value, and half of that is returned. If we are searching for an element in an array, this leads to an "index out of range" error.

The issue can be resolved in either of these ways:

  • int mid=l+(h-l)/2;
  • Bit manipulation, for faster computation: int mid=((unsigned int)l+(unsigned int)h) >> 1;

where >> is the right shift operator.

Hope this helps :)

Here is an example. Suppose you had a very big array of size 2,000,000,010 (2 × 10^9 + 10), with the left index at 2,000,000,000 and the right index at 2,000,000,000 + 1.

lo + hi would sum to 2,000,000,000 + 2,000,000,001 = 4,000,000,001. Since the max value of an integer is 2,147,483,647, you won't get 4,000,000,001; you will get an integer overflow.

But low + ((high - low) / 2) will work. 2,000,000,000 + ((2,000,000,001 - 2,000,000,000) / 2) = 2,000,000,000

To avoid overflow, you can also do this: int midIndex = (int) (startIndex / 2.0 + endIndex / 2.0);

Dividing both indices by 2.0 gives two doubles that are each less than or equal to Integer.MAX_VALUE / 2, so their sum, also a double, is less than or equal to Integer.MAX_VALUE. The same holds for Integer.MIN_VALUE. Finally, you convert the sum to an int, and overflow is prevented ;)
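
A small Java sketch of this floating-point variant, with hypothetical class and method names. It relies on the fact that a double represents every int, and every int divided by 2.0, exactly; that would not hold for long indices.

```java
class DoubleMidpoint {
    // Halve each index as a double, add, then truncate back to int.
    static int midViaDouble(int startIndex, int endIndex) {
        return (int) (startIndex / 2.0 + endIndex / 2.0);
    }

    public static void main(String[] args) {
        System.out.println(midViaDouble(5, 11));                       // prints 8
        System.out.println(midViaDouble(2_000_000_000, 2_000_000_001)); // prints 2000000000
        System.out.println(midViaDouble(0, Integer.MAX_VALUE));         // prints 1073741823
    }
}
```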

Actually the following statement in calculating mid may result in INT range overflow.

mid = (start + end) /2

Suppose the given ordered input list is very large, so that start + end surpasses the int range (-2^31 to 2^31-1). The sum then overflows. To counter this, the following statement is written:

mid = start + (end-start)/2

Mathematically it yields the same value, but the overflow is averted by this trick.
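
Whether the overflow is reported at all depends on the language. In Java, for instance, plain int addition wraps around silently, while the standard Math.addExact method throws an ArithmeticException instead. A minimal sketch, where the class name, the helper sumOverflows, and the sample values are only illustrative:

```java
class AddOverflowCheck {
    // Returns true when start + end does not fit in an int.
    static boolean sumOverflows(int start, int end) {
        try {
            Math.addExact(start, end); // throws ArithmeticException on overflow
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int start = 2_000_000_000;
        int end = 2_000_000_001;
        System.out.println(start + end);              // prints -294967295 (silent wraparound)
        System.out.println(sumOverflows(start, end)); // prints true
        System.out.println(start + (end - start) / 2); // prints 2000000000
    }
}
```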

It is a very subtle error and easy to miss the first time. Most articles on the internet don't seem to clearly explain how this error occurs and how the optimized formula prevents overflow.

After a lot of digging I found this article, which has an excellent and detailed explanation of how the error occurs when the mid = (left+right)/2 formula is used and how it is overcome using mid = low + ((high - low) / 2). Most importantly, they explain it with an example, which makes it much easier to understand.

It also explains why mid = low + ((high - low) / 2) doesn't cause an overflow.

This answer gives a practical example of why the l + (r-l)/2 calculation is necessary.

In case you are curious how the two are equivalent mathematically, here is the proof. The key is adding 0 then splitting that into l/2 - l/2.

(l+r)/2 =
l/2 + r/2 =
l/2 + r/2 + 0 =
l/2 + r/2 + (l/2 - l/2) =
(l/2 + l/2) + (r/2 - l/2) =
l + (r-l)/2
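
The algebra above treats the divisions as exact. With Java's integer division the intermediate steps (such as l/2 + r/2) do not hold, but the first and last expressions still agree whenever 0 <= l <= r and l + r fits in an int. A brute-force check over a small range, sketched with a hypothetical class and method name:

```java
class MidpointEquivalence {
    // Compares (l + r) / 2 with l + (r - l) / 2 for all 0 <= l <= r < limit.
    static boolean formsAgree(int limit) {
        for (int l = 0; l < limit; l++) {
            for (int r = l; r < limit; r++) {
                if ((l + r) / 2 != l + (r - l) / 2) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(formsAgree(300)); // prints true
    }
}
```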

It is because if we compute [ mid = (low + high)/2 ] and both low and high are large, their sum may be out of the range of an integer.

As for why it is not [ mid = low/2 + high/2 ]: that is integer division, so if [ low = 5 and high = 11 ] then [ mid = low/2 + high/2 ] gives mid = 5/2 + 11/2 => 2 + 5 => 7, whereas the correct midpoint is 8. That leads to a wrong answer, which is why it is taken as mid = low + (high - low)/2;
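
The rounding problem with halving each bound separately is easy to reproduce. In this short Java sketch (class and method names are made up), both halves lose their remainder when low and high are odd:

```java
class HalvesMidpoint {
    // Halves each bound first; each integer division drops a remainder.
    static int halvesMid(int low, int high) {
        return low / 2 + high / 2;
    }

    // Halves only the difference, so at most one remainder is dropped.
    static int safeMid(int low, int high) {
        return low + (high - low) / 2;
    }

    public static void main(String[] args) {
        System.out.println(halvesMid(5, 11)); // prints 7
        System.out.println(safeMid(5, 11));   // prints 8
    }
}
```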