Why is memmove faster than memcpy?

I am investigating a performance hotspot in an application that spends 50% of its time in memmove. The application inserts millions of 4-byte integers and uses memmove to shift the data "to the right" in order to make room for the inserted value.
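The insertion itself looks roughly like this (a simplified sketch, not the actual application code; the function name and types are my own):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Shift everything from position `pos` one slot to the right to make room,
 * then drop the new value in. Source and destination overlap, hence memmove.
 * The caller must have space for count + 1 elements. */
static void insert_at(uint32_t *buf, size_t count, size_t pos, uint32_t value)
{
    memmove(&buf[pos + 1], &buf[pos], (count - pos) * sizeof(uint32_t));
    buf[pos] = value;
}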

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it moves overlapping regions, which have to be handled in a tight loop instead of copying large pages of memory. So I wrote a small microbenchmark to find out whether there is a performance difference between memcpy and memmove, expecting memcpy to win easily.

I ran the benchmark on two machines (Core i5 and Core i7) and found that memmove is actually faster than memcpy, on the older Core i7 even nearly twice as fast! Now I am looking for an explanation.

Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination are overlapping. Various "distances" between source and destination are tried. Each test is run 10 times, and the average time is printed.

https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
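The actual code is in the gist; the following is only a condensed sketch of its structure as described above, so details (timer, loop order, output formatting) may differ from the gist:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFERSIZE (100 * 1024 * 1024)   /* ~100 MB */
#define RUNS       10

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *b1 = malloc(BUFFERSIZE);       /* note: neither buffer is touched */
    char *b2 = malloc(BUFFERSIZE);       /* before the timing runs           */
    double t;
    int r;
    size_t dist;

    /* memcpy: two completely separate buffers */
    t = seconds();
    for (r = 0; r < RUNS; r++)
        memcpy(b2, b1, BUFFERSIZE);
    printf("memcpy        %g\n", (seconds() - t) / RUNS);

    /* memmove: overlapping regions within one buffer, shifted by "dist" */
    for (dist = 2; dist <= 128; dist *= 2) {
        t = seconds();
        for (r = 0; r < RUNS; r++)
            memmove(b1 + dist, b1, BUFFERSIZE - dist);
        printf("memmove (%03zu) %g\n", dist, (seconds() - t) / RUNS);
    }

    free(b1);
    free(b2);
    return 0;
}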

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc is 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number in brackets is the distance (gap size) between source and destination:

memcpy        0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633

Memmove is implemented as SSE-optimized assembler code, copying from back to front. It uses hardware prefetch to load the data into the cache, and copies 128 bytes at a time into XMM registers, then stores them at the destination.

(memcpy-ssse3-back.S, lines 1650 ff)

L(gobble_ll_loop):
prefetchnta -0x1c0(%rsi)    # prefetch source and destination a few cache
prefetchnta -0x280(%rsi)    # lines ahead (the copy runs toward lower
prefetchnta -0x1c0(%rdi)    # addresses, hence the negative offsets)
prefetchnta -0x280(%rdi)
sub $0x80, %rdx             # 128 bytes fewer left to copy
movdqu  -0x10(%rsi), %xmm1  # load 8 x 16 bytes from the source
movdqu  -0x20(%rsi), %xmm2  # (unaligned loads)
movdqu  -0x30(%rsi), %xmm3
movdqu  -0x40(%rsi), %xmm4
movdqu  -0x50(%rsi), %xmm5
movdqu  -0x60(%rsi), %xmm6
movdqu  -0x70(%rsi), %xmm7
movdqu  -0x80(%rsi), %xmm8
movdqa  %xmm1, -0x10(%rdi)  # store them to the destination
movdqa  %xmm2, -0x20(%rdi)  # (aligned stores)
movdqa  %xmm3, -0x30(%rdi)
movdqa  %xmm4, -0x40(%rdi)
movdqa  %xmm5, -0x50(%rdi)
movdqa  %xmm6, -0x60(%rdi)
movdqa  %xmm7, -0x70(%rdi)
movdqa  %xmm8, -0x80(%rdi)
lea -0x80(%rsi), %rsi       # step both pointers back by 128 bytes
lea -0x80(%rdi), %rdi
jae L(gobble_ll_loop)       # keep looping while bytes remain

Why is memmove faster than memcpy? I would expect memcpy to copy whole memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.

PS: I know that I cannot replace memmove with memcpy in my code. I also know that the code sample mixes C and C++. This question is really just for academic purposes.

Update 1

Based on the various answers I ran a few more variations of the test.

  1. When running memcpy twice, the second run is faster than the first one.
  2. When "touching" the destination buffer of memcpy (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster.
  3. memcpy is still a little bit slower than memmove.

Here are the results:

memcpy        0.0118526
memcpy        0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this then please add an answer!). In addition, as @Mats Petersson said, memmove is cache friendlier than memcpy.
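One way to at least make that commit cost visible, if not strictly prove it: time a memcpy into a freshly malloc'ed buffer and compare it with a memcpy into a buffer that was pre-touched with memset. A minimal sketch (sizes and names are my own):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFERSIZE (100 * 1024 * 1024)

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *src  = malloc(BUFFERSIZE);
    char *cold = malloc(BUFFERSIZE);   /* pages faulted in during the memcpy */
    char *warm = malloc(BUFFERSIZE);   /* pages faulted in by memset below   */
    double t;

    memset(src, 1, BUFFERSIZE);
    memset(warm, 0, BUFFERSIZE);       /* "touch" this destination up front */

    t = seconds();
    memcpy(cold, src, BUFFERSIZE);
    printf("memcpy, cold destination: %g s\n", seconds() - t);

    t = seconds();
    memcpy(warm, src, BUFFERSIZE);
    printf("memcpy, warm destination: %g s\n", seconds() - t);

    free(src); free(cold); free(warm);
    return 0;
}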

Thanks for all the great answers and comments!


Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see memcpy ends up possibly a smidge faster, e.g. on ideone.com:

memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy        0.0537919

Hardly anything in it though - no evidence that writing back to an already faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that there's nothing wrong that makes memcpy unnecessarily slower when compared apples-for-apples.

Historically, memmove and memcpy are the same function. They worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping areas in any particular way.

The end result is that memmove was defined to handle overlapping regions in a particular way even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.
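A small illustration of that contract difference (the buffers and values are just an example):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[] = "123456789";
    char b[] = "123456789";

    /* Overlapping regions: memmove is required to behave as if the source
     * were first copied to a temporary buffer. */
    memmove(a + 2, a, 6);          /* well defined: a becomes "121234569" */

    /* The same call with memcpy is undefined behaviour; it may appear to
     * work, corrupt the data, or do something else entirely. */
    /* memcpy(b + 2, b, 6); */     /* don't do this on overlapping buffers */
    memcpy(b, "987654321", 9);     /* fine: regions do not overlap */

    printf("%s\n%s\n", a, b);
    return 0;
}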

The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance something as simple as having a different 'stride' in the memory layout can cause vastly different cache performance.

You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.

Edit: Oh, and one last thing; shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay)

Edit2: To summarise my expansion in the comments: The microbenchmark is the issue here, it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy the end results will not depend on which memory shifting function you use UNLESS the regions overlap.

When you are using memcpy, the writes need to go into the cache. When you use memmove to copy a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes (> 4 * cache size), and I suspect (but can't be bothered to test) that you'll get similar results.
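Something along these lines is what I mean, for anyone who wants to try it (the buffer and gap sizes are guesses; pick a gap well beyond roughly 4x your last-level cache):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFERSIZE (256 * 1024 * 1024)   /* must be larger than the gap */
#define GAP        (64 * 1024 * 1024)    /* assumed > 4x last-level cache */

int main(void)
{
    char *b = malloc(BUFFERSIZE);
    struct timespec t0, t1;

    memset(b, 1, BUFFERSIZE);            /* fault every page in up front */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* With a gap this large, the destination line being written is nowhere
     * near the source line that was just read, so it is not in the cache. */
    memmove(b + GAP, b, BUFFERSIZE - GAP);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("memmove (64 MB gap): %g s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    free(b);
    return 0;
}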

I guarantee that ALL is about cache maintenance when you do large memory operations.

"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.

In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are overlapping.

Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g Dr. Jerry Cain, (Stanford Intro Systems Lecture - 7) Time: 36:00