nvidia-smi Volatile GPU-Util explanation?

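I know that nvidia-smi -l 1 reports GPU usage every second (similar to the output below). However, I would appreciate an explanation of what Volatile GPU-Util really means. Is it the number of SMs in use over the total number of SMs, or occupancy, or something else?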

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |
| 30%   41C    P0    53W / 225W |      0MiB /  4742MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 36%   49C    P0    95W / 225W |   4516MiB /  4742MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      5193    C   python                                        4514MiB |
+-----------------------------------------------------------------------------+

It is a sampled measurement over a time period. For a given time period, it reports the percentage of time during which one or more GPU kernels were active (i.e. running).

It doesn't tell you anything about how many SMs were used, or how "busy" the code was, or what it was doing exactly, or in what way it may have been using memory.
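To make the distinction concrete, here is a toy host-side model of a time-based utilization figure. It is only a sketch of the idea described above, not how the driver actually computes the number; the `KernelRun` struct and the `time_based_utilization` function are hypothetical names invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One kernel execution as seen by a hypothetical utilization sampler.
struct KernelRun {
    double start_ms;  // time the kernel began executing
    double end_ms;    // time the kernel finished
    int sms_used;     // SMs occupied; deliberately ignored by the metric below
};

// Percentage of the window [window_start_ms, window_end_ms) during which
// at least one kernel was running. Overlapping kernels are counted once.
double time_based_utilization(std::vector<KernelRun> runs,
                              double window_start_ms,
                              double window_end_ms) {
    assert(window_end_ms > window_start_ms);
    std::sort(runs.begin(), runs.end(),
              [](const KernelRun& a, const KernelRun& b) {
                  return a.start_ms < b.start_ms;
              });
    double busy_ms = 0.0;
    double covered_until = window_start_ms;  // time before this is already counted
    for (const KernelRun& r : runs) {
        double s = std::max(r.start_ms, covered_until);
        double e = std::min(r.end_ms, window_end_ms);
        if (e > s) {
            busy_ms += e - s;
            covered_until = e;
        }
    }
    return 100.0 * busy_ms / (window_end_ms - window_start_ms);
}
```

Under this model, a kernel that occupies one SM for 500 ms of a 1000 ms window and a kernel that occupies 13 SMs for the same 500 ms both yield 50, because `sms_used` never enters the calculation.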

The above claim(s) can be verified without too much difficulty using a microbenchmarking-type exercise (see below).

According to the NVIDIA documentation, the sample period may be between 1 second and 1/6 second, depending on the product. However, the period shouldn't make much difference to how you interpret the result.

Also, the word "Volatile" does not pertain to this data item in nvidia-smi. You are misreading the output format.

Here's a trivial program that supports my claim:

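#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

const long long tdelay = 1000000LL;  // kernel duration, in GPU clock cycles
const int loops = 10000;             // number of kernel launches
const int hdelay = 1;                // default host delay between launches, in microseconds

// Spin on the GPU clock until tdelay cycles have elapsed.
__global__ void dkern(){
  long long start = clock64();
  while (clock64() < start + tdelay);
}

int main(int argc, char *argv[]){
  // An optional first argument overrides the host delay (microseconds).
  int my_delay = hdelay;
  if (argc > 1) my_delay = atoi(argv[1]);
  for (int i = 0; i < loops; i++){
    dkern<<<1,1>>>();   // one block of one thread, so at most one SM is in use
    usleep(my_delay);   // host sleeps while the kernel runs asynchronously
  }
  cudaDeviceSynchronize();  // wait for any kernels still queued before exiting
  return 0;
}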

On my system, when I run the above code with a command line parameter of 100, nvidia-smi will report 99% utilization. When I run with a command line parameter of 1000, nvidia-smi will report ~83% utilization. When I run it with a command line parameter of 10000, nvidia-smi will report ~9% utilization.
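The trend in those numbers is what a simple duty-cycle argument would predict: kernel launches are asynchronous, so the host loop issues one kernel per iteration and the GPU is busy for roughly the kernel duration out of each loop period. The sketch below is a back-of-the-envelope estimate only. The function name `estimated_utilization_percent` and all parameter values are hypothetical, the kernel duration and the host overhead (for example, `usleep` sleeping longer than requested) are assumptions rather than measurements, and no attempt is made to reproduce the exact percentages reported above.

```cpp
#include <cassert>

// Rough estimate of the share of time a kernel is running when the host
// launches one kernel every (host_delay_us + host_overhead_us) microseconds
// and each kernel runs for kernel_us microseconds.
// All inputs are assumed values for illustration, not measured ones.
double estimated_utilization_percent(double kernel_us,
                                     double host_delay_us,
                                     double host_overhead_us) {
    assert(kernel_us > 0.0 && host_delay_us >= 0.0 && host_overhead_us >= 0.0);
    double launch_period_us = host_delay_us + host_overhead_us;
    // If kernels are launched at least as fast as they finish, the launch
    // queue backs up and the GPU always has a kernel running.
    if (launch_period_us <= kernel_us) return 100.0;
    return 100.0 * kernel_us / launch_period_us;
}
```

With an assumed 1000 µs kernel, a short host delay saturates the estimate at 100, while a 10000 µs delay with no overhead gives about 10. In both cases only a single thread on a single SM is doing any work.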

Although this answer is focused on GPU kernels, I have lately noticed that nvidia-smi will also report non-zero GPU utilization when, for example, cudaMemcpy operations are running (and nothing else). So the description above should be read as covering only how CUDA kernel activity is reported.

The 'Volatile' in the nvidia-smi output isn't part of GPU-Util; it's part of 'Volatile Uncorr. ECC', which shows the number of uncorrected ECC errors that have occurred on the GPU since the last driver load. There's a good writeup of everything in the nvidia-smi output here:

https://medium.com/analytics-vidhya/explained-output-of-nvidia-smi-utility-fc4fbee3b124