为什么 CUDA 固定内存的速度这么快?

当我使用固定存储器进行 CUDA 数据传输时,我观察到数据传输的大幅加速。在 linux 上,实现这一点的底层系统调用是 mlock。在 mlock 的手册页中,它声明锁定该页可以防止将其交换出去:

Mlock ()锁定地址范围内从 addr 开始并持续为 len 字节的页面。当调用成功返回时,所有包含指定地址范围的一部分的页面都保证驻留在 RAM 中;

在我的测试中,我的系统上有几千兆的空闲内存,所以从来没有任何风险的内存页面可以被交换出来,但我仍然观察到了加速。有人能解释一下这到底是怎么回事吗?如有任何见解或信息,我们将不胜感激。

36990 次浏览

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).

As described here http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.

CUDA use DMA to transfer pinned memory to GPU. Pageable host memory cannot be used with DMA because they may reside on the disk. If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA. So using the pinned memory you save the time to copy from pageable host memory to page-locked host memory.

A verbose note on copying non-locked pages to locked pages.

It could be extremely expensive if non-locked pages are swapped out by OS on a busy system with limited CPU RAM. Then page fault will be triggered to load pages into CPU RAM through expensive disk IO operations.

Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is precious. If thrashing happens, the throughput of CPU can be degraded a lot.