Gcc std: : unorder_map 实现慢吗? 如果慢,原因是什么?

我们正在用 C + + 开发一个高性能的关键软件。在这里,我们需要一个并发的散列映射并实现它。因此,我们编写了一个基准来计算,与 std::unordered_map相比,并发散列映射的速度要慢多少。

但是,std::unordered_map似乎是难以置信的慢... 所以这是我们的微基准测试(对于并发映射,我们产生了一个新的线程,以确保锁不会被优化掉,并注意到我从来没有插入0,因为我也基准与 google::dense_hash_map,它需要一个空值) :

boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
for (int i = 0; i < SIZE; ++i) {
uint64_t val = 0;
while (val == 0) {
val = dist(rng);
}
vec[i] = val;
}
std::unordered_map<int, long double> map;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;
std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;

(编辑: 整个源代码可以在这里找到: http://pastebin.com/vPqf7eya)

std::unordered_map的结果是:

inserts: 35126
get    : 2959

对于 google::dense_map:

inserts: 3653
get    : 816

对于我们手工支持的并发 map (它执行锁定,尽管基准测试是单线程的——但是在一个单独的产生线程中) :

inserts: 5213
get    : 2594

如果我在不支持 pthread 的情况下编译基准测试程序,并在主线程中运行所有程序,我会得到以下手工支持的并发映射的结果:

inserts: 4441
get    : 1180

我用以下命令编译:

g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc

因此,特别是插入的 std::unordered_map似乎是非常昂贵的-35秒对其他地图3-5秒。此外,查找时间似乎相当高。

My question: why is this? I read another question on stackoverflow where someone asks, why std::tr1::unordered_map is slower than his own implementation. There the highest rated answer states, that the std::tr1::unordered_map needs to implement a more complicated interface. But I can not see this argument: we use a bucket approach in our concurrent_map, std::unordered_map uses a bucket-approach too (google::dense_hash_map does not, but than std::unordered_map should be at least as fast than our hand backed concurrency-safe version?). Apart from that I can not see anything in the interface that force a feature which makes the hash map perform badly...

所以我的问题是: std::unordered_map看起来很慢是真的吗?如果没有: 出了什么问题?如果是: 原因是什么。

我的主要问题是: 为什么在 std::unordered_map中插入一个值如此昂贵(即使我们在开始时预留了足够的空间,它也不会表现得更好——所以重新散列似乎不是问题) ?

编辑:

First of all: yes the presented benchmark is not flawless - this is because we played around a lot with it and it is just a hack (for example the uint64 distribution to generate ints would in practice not be a good idea, exclude 0 in a loop is kind of stupid etc...).

目前大多数注释都解释说,通过预先为 unorder _ map 分配足够的空间,可以使它运行得更快。在我们的应用程序中,这是不可能的: 我们正在开发一个数据库管理系统,需要一个散列映射来在事务期间存储一些数据(例如锁定信息)。因此,这个映射可以是从1(用户只进行一次插入并提交)到数十亿条目(如果发生完整的表扫描)的所有内容。在这里预分配足够的空间是不可能的(而且在开始时分配很多空间会消耗太多的内存)。

此外,我很抱歉,我没有说清楚我的问题: 我对快速制作 unorder _ map 并不感兴趣(使用谷歌密集散列映射对我们来说很好) ,我只是不明白这种巨大的性能差异来自何处。它不能仅仅是预分配(即使有足够的预分配内存,稠密映射也比无序映射快一个数量级,我们的手工支持并发映射以大小为64的数组开始——所以比无序映射小)。

那么 std::unordered_map表现如此糟糕的原因是什么呢?或者换一种问法: 是否可以编写一个 std::unordered_map接口的实现,它是标准的符合规范的,并且(几乎)和 Google 稠密散列映射一样快?或者标准中有什么东西迫使实现者选择一种低效的方式来实现它?

编辑2:

通过分析,我发现大量的时间被用于整数除法。std::unordered_map使用质数表示数组大小,而其他实现使用2的幂。为什么 std::unordered_map使用素数?如果散列不好,要表现得更好吗?对于好的散列,它没有什么区别。

编辑3:

以下是 std::map的数据:

inserts: 16462
get    : 16978

为什么插入到 std::map比插入到 std::unordered_map快... 我的意思是什么?std::map的局部性更差(树 vs 数组) ,需要进行更多的分配(每次插入 vs 每次 rehash + 每次冲突加上 ~ 1) ,最重要的是: 具有另一个算法复杂度(O (logn) vs O (1)) !

32994 次浏览

I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (smallest prime larger than) 100.

I didn't have chrono on my system, so I timed with times().

template <typename TEST>
void time_test (TEST t, const char *m) {
struct tms start;
struct tms finish;
long ticks_per_second;


times(&start);
t();
times(&finish);
ticks_per_second = sysconf(_SC_CLK_TCK);
std::cout << "elapsed: "
<< ((finish.tms_utime - start.tms_utime
+ finish.tms_stime - start.tms_stime)
/ (1.0 * ticks_per_second))
<< " " << m << std::endl;
}

I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note, I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.

Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, the DEPTH controls how many times the code will rehash.

#define SIZE 10000000
#define DEPTH 3
std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
std::numeric_limits<uint64_t>::max());
std::unordered_map<int, long double> map(SIZE/DEPTH);


void
test_insert () {
for (int i = 0; i < SIZE; ++i) {
map[vec[i]] = 0.0;
}
}


void
test_get () {
long double val;
for (int i = 0; i < SIZE; ++i) {
val = map[vec[i]];
}
}


int main () {
for (int i = 0; i < SIZE; ++i) {
uint64_t val = 0;
while (val == 0) {
val = dist(rng);
}
vec[i] = val;
}
time_test(test_insert, "inserts");
std::random_shuffle(vec.begin(), vec.end());
time_test(test_insert, "get");
}

Edit:

I modified the code so that I could change out DEPTH more easily.

#ifndef DEPTH
#define DEPTH 10000000
#endif

So, by default, the worst size for the hash table is chosen.

elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1

My conclusion is that there is not much significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order of magnitude performance difference that you are observing.

I found the reason: it is a Problem of gcc-4.7!!

With gcc-4.7

inserts: 37728
get    : 2985

With gcc-4.6

inserts: 2531
get    : 1565

So std::unordered_map in gcc-4.7 is broken (or my installation, which is an installation of gcc-4.7.0 on Ubuntu - and another installation which is gcc 4.7.1 on debian testing).

I will submit a bug report.. until then: DO NOT use std::unordered_map with gcc 4.7!

I have run your code using a 64 bit / AMD / 4 cores (2.1GHz) computer and it gave me the following results:

MinGW-W64 4.9.2:

Using std::unordered_map:

inserts: 9280
get: 3302

Using std::map:

inserts: 23946
get: 24824

VC 2015 with all the optimization flags I know:

Using std::unordered_map:

inserts: 7289
get: 1908

Using std::map:

inserts: 19222
get: 19711

I have not tested the code using GCC but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9 std::unordered_map it's still broken.

[EDIT]

So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I have the change I will be testing the code on GCC.

My answer is just to establish some kind of knowledge base to other answers.