Memory bandwidth of multi-channel x86 systems

I am testing the memory bandwidth of a desktop and a server.

Skylake desktop 4 cores/8 hardware threads
Skylake server Xeon 8168 dual socket 48 cores (24 per socket) / 96 hardware threads

The peak bandwidths of the two systems are

Peak bandwidth desktop = 2-channels*8*2400 = 38.4 GB/s
Peak bandwidth server  = 6-channels*2-sockets*8*2666 = 255.94 GB/s
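
The arithmetic above can be sketched as a small helper. This is only an illustration, assuming each DDR4 channel is 64 bits wide (8 bytes per transfer) and that the speed figures are in MT/s; the function name `peak_bandwidth_gbs` is hypothetical and not part of the original code.

```c
#include <stdio.h>

/* Theoretical peak DRAM bandwidth in GB/s (hypothetical helper).
   channels: memory channels per socket
   sockets:  number of sockets
   mts:      transfer rate in MT/s, e.g. 2400 for DDR4-2400
   Assumption: each channel is 64 bits wide, i.e. 8 bytes per transfer. */
double peak_bandwidth_gbs(int channels, int sockets, double mts) {
    const double bytes_per_transfer = 8.0;
    /* MT/s * bytes = MB/s; the factor 1e-3 converts MB/s to GB/s. */
    return channels * sockets * bytes_per_transfer * mts * 1e-3;
}
```

Calling `peak_bandwidth_gbs(2, 1, 2400)` corresponds to the desktop line and `peak_bandwidth_gbs(6, 2, 2666)` to the server line.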

I am measuring the bandwidth with my own version of the triad function from STREAM (the full code is given later)

void triad(double *a, double *b, double *c, double scalar, size_t n) {
#pragma omp parallel for
for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}

Here are the results I get

         Bandwidth (GB/s)
threads  Desktop  Server
1             28      16
2(24)         29     146
4(48)         25     177
8(96)         24     189

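For 1 thread, I don't understand why the desktop is so much faster than the server. According to this answer, https://stackoverflow.com/a/18159503/2542702, SSE is sufficient to get the full bandwidth of a dual-channel system, and that is what I observe on the desktop: two threads help only slightly, and 4 and 8 threads give worse results. On the server, however, the single-threaded bandwidth is much lower. Why is this?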

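On the server I get the best result using 96 threads. I would have expected the bandwidth to saturate with far fewer threads than that. My results have a large margin of error, and I have not included an error estimate; I took the best result of several runs.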
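
Since the numbers above are the best of several runs with no error estimate, one way to make the spread visible is to record the best, mean, and worst time over repeated runs. The following is a minimal sketch, not the code used for the table: the type `triad_timing` and the functions `time_triad` and `bandwidth_gbs` are hypothetical, and it uses the C11 `timespec_get` wall clock instead of `omp_get_wtime` so that it compiles with or without `-fopenmp`.

```c
#include <stddef.h>
#include <time.h>

/* Same kernel as in the question; the pragma is ignored without -fopenmp. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) a[i] = b[i] + scalar * c[i];
}

/* Hypothetical result type: best, mean, and worst wall time in seconds. */
typedef struct { double best_s, mean_s, worst_s; } triad_timing;

/* Elapsed seconds between two timestamps, computed field by field
   to avoid losing precision on large tv_sec values. */
static double elapsed_s(struct timespec t0, struct timespec t1) {
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Run the kernel `runs` times and collect timing statistics. */
triad_timing time_triad(double *a, const double *b, const double *c,
                        double scalar, size_t n, int runs) {
    triad_timing t = {0.0, 0.0, 0.0};
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);
        triad(a, b, c, scalar, n);
        timespec_get(&t1, TIME_UTC);
        double s = elapsed_s(t0, t1);
        if (r == 0 || s < t.best_s)  t.best_s = s;
        if (r == 0 || s > t.worst_s) t.worst_s = s;
        t.mean_s += s / runs;
    }
    return t;
}

/* Convert bytes moved and elapsed seconds to GB/s. */
double bandwidth_gbs(double bytes, double seconds) {
    return bytes * 1e-9 / seconds;
}
```

Reporting `bandwidth_gbs(bytes, t.best_s)` next to `bandwidth_gbs(bytes, t.worst_s)` gives a rough range for each thread count, where `bytes` would be `3*sizeof(double)*n` or `4*sizeof(double)*n` depending on whether write-allocate traffic is counted, as in the full code.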

Code

//gcc -O3 -march=native triad.c -fopenmp
//gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 triad.c -fopenmp
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>


void triad_init(double *a, double *b, double *c, double k, size_t n) {
#pragma omp parallel for
for(size_t i=0; i<n; i++) a[i] = k, b[i] = k, c[i] = k;
}


void triad(double *a, double *b, double *c, double scalar, size_t n) {
#pragma omp parallel for
for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}


void triad_stream(double *a, double *b, double *c, double scalar, size_t n) {
#if defined ( __AVX512F__ ) || defined ( __AVX512__ )
__m512d scalarv = _mm512_set1_pd(scalar);
#pragma omp parallel for
for(size_t i=0; i<n/8; i++) {
__m512d bv = _mm512_load_pd(&b[8*i]), cv = _mm512_load_pd(&c[8*i]);
_mm512_stream_pd(&a[8*i], _mm512_add_pd(bv, _mm512_mul_pd(scalarv, cv)));
}
#else
__m256d scalarv = _mm256_set1_pd(scalar);
#pragma omp parallel for
for(size_t i=0; i<n/4; i++) {
__m256d bv = _mm256_load_pd(&b[4*i]), cv = _mm256_load_pd(&c[4*i]);
_mm256_stream_pd(&a[4*i], _mm256_add_pd(bv, _mm256_mul_pd(scalarv, cv)));
}
#endif
}


int main(void) {
size_t n = 1LL << 31LL;
double *a = _mm_malloc(sizeof *a * n, 64), *b = _mm_malloc(sizeof *b * n, 64), *c = _mm_malloc(sizeof *c * n, 64);
//double peak_bw = 2*8*2400*1E-3; // 2 channels * 8 bytes/transfer * 2400 MT/s
double peak_bw = 2*6*8*2666*1E-3; // 2 sockets * 6 channels * 8 bytes/transfer * 2666 MT/s
double dtime, mem, bw;
printf("peak bandwidth %.2f GB/s\n", peak_bw);


triad_init(a, b, c, 3.14159, n);
dtime = -omp_get_wtime();
triad(a, b, c, 3.14159, n);
dtime += omp_get_wtime();
mem = 4*sizeof(double)*n*1E-9, bw = mem/dtime;
printf("triad:       %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);


triad_init(a, b, c, 3.14159, n);
dtime = -omp_get_wtime();
triad_stream(a, b, c, 3.14159, n);
dtime += omp_get_wtime();
mem = 3*sizeof(double)*n*1E-9, bw = mem/dtime;
printf("triads:      %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);
}


The hardware prefetcher is tuned differently on server vs workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you're experiencing, but from the other side of the coin:

Hardware Prefetcher Aggressiveness Controllers: Do We Need Them All the Time?