HashSet < T > . RemoveAll 方法出奇地慢

Jon Skeet 最近在他的博客上提出了一个有趣的编程话题: “亲爱的丽莎，我的抽象思维中有一个洞”(强调添加) :

事实上，我有一套 HashSet。我想从里面移除一些东西，很多东西可能根本不存在。实际上，在我们的测试用例中，“清除”集合中的项目的没有将位于原始集合中。这听起来——实际上是是——非常容易编码。毕竟，我们有 Set<T>.removeAll来帮助我们，对吗？

我们在命令行上指定“ source”集合的大小和“ remove”集合的大小，并构建它们。源集只包含非负整数; 删除集只包含负整数。我们使用 System.currentTimeMillis()测量移除所有元素所需的时间，它不是世界上最精确的秒表，但是在这种情况下已经足够了，正如您将看到的。密码是这样的:
import java.util.*;
public class Test
{
public static void main(String[] args)
{
int sourceSize = Integer.parseInt(args[0]);
int removalsSize = Integer.parseInt(args[1]);
        

Set<Integer> source = new HashSet<Integer>();
Collection<Integer> removals = new ArrayList<Integer>();
        

for (int i = 0; i < sourceSize; i++)
{
source.add(i);
}
for (int i = 1; i <= removalsSize; i++)
{
removals.add(-i);
}
        

long start = System.currentTimeMillis();
source.removeAll(removals);
long end = System.currentTimeMillis();
System.out.println("Time taken: " + (end - start) + "ms");
}
}
让我们从一个简单的工作开始: 100个项目的源集，以及100个删除项目:
c:UsersJonTest>java Test 100 100
Time taken: 1ms
好吧，我们没想到会这么慢... 显然我们可以加快进度。一个有100万个项目和30万个要删除项目的来源怎么样？
c:UsersJonTest>java Test 1000000 300000
Time taken: 38ms
嗯。看起来还是挺快的。现在我觉得我有点残忍，让它做那些移除工作。让我们把它变得简单一点——300,000个源文件和300,000个删除:
c:UsersJonTest>java Test 300000 300000
Time taken: 178131ms
你说什么？将近三个 几分钟？哎呀！当然它应该更容易删除项目从一个更小收集比一个我们管理在38毫秒？

有人能解释一下为什么会发生这种情况吗? 为什么 HashSet<T>.removeAll方法这么慢？

20801

小开

最佳答案

The behaviour is (somewhat) documented in the javadoc:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method.

What this means in practice, when you call source.removeAll(removals);:

if the removals collection is of a smaller size than source, the remove method of HashSet is called, which is fast.
if the removals collection is of equal or larger size than the source, then removals.contains is called, which is slow for an ArrayList.

Quick fix:

Collection<Integer> removals = new HashSet<Integer>();

Note that there is an open bug that is very similar to what you describe. The bottom line seems to be that it is probably a poor choice but can't be changed because it is documented in the javadoc.

For reference, this is the code of removeAll (in Java 8 - haven't checked other versions):

public boolean removeAll(Collection<?> c) {
Objects.requireNonNull(c);
boolean modified = false;


if (size() > c.size()) {
for (Iterator<?> i = c.iterator(); i.hasNext(); )
modified |= remove(i.next());
} else {
for (Iterator<?> i = iterator(); i.hasNext(); ) {
if (c.contains(i.next())) {
i.remove();
modified = true;
}
}
}
return modified;
}