What do stalled-cycles-frontend and stalled-cycles-backend mean in the output of 'perf stat'?

Does anyone know what stalled-cycles-frontend and stalled-cycles-backend mean in the perf stat output? I searched online but didn't find an answer. Thanks

$ sudo perf stat ls


Performance counter stats for 'ls':


0.602144 task-clock                #    0.762 CPUs utilized
0 context-switches          #    0.000 K/sec
0 CPU-migrations            #    0.000 K/sec
236 page-faults               #    0.392 M/sec
768956 cycles                    #    1.277 GHz
962999 stalled-cycles-frontend   #  125.23% frontend cycles idle
634360 stalled-cycles-backend    #   82.50% backend  cycles idle
890060 instructions              #    1.16  insns per cycle
#    1.08  stalled cycles per insn
179378 branches                  #  297.899 M/sec
9362 branch-misses             #    5.22% of all branches         [48.33%]


0.000790562 seconds time elapsed

To translate the generic events exported by perf into the raw events documented for your CPU, you can run:

more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend

It will show you something like

event=0x0e,umask=0x01,inv,cmask=0x01
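You can feed those terms straight back to perf to count the raw event directly and compare it against the generic name (a sketch; this exact encoding is specific to my CPU):

perf stat -e stalled-cycles-frontend -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ ls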

According to the Intel SDM, volume 3B (I have a Core i5-2520):

UOPS_ISSUED.ANY:

  • Increments each cycle the # of Uops issued by the RAT to RS.
  • Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.

For the stalled-cycles-backend event, which translates to event=0xb1,umask=0x01 on my system, the same documentation says:

UOPS_DISPATCHED.THREAD:

  • Counts the total number of uops to be dispatched per-thread each cycle
  • Set Cmask = 1, INV = 1 to count stall cycles.
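The same check works for the back-end event (here I assume the inv and cmask=0x01 terms match what the documentation describes; look at your own sysfs events file for the exact encoding):

perf stat -e stalled-cycles-backend -e cpu/event=0xb1,umask=0x01,inv,cmask=0x01/ ls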

Usually, stalled cycles are cycles where the processor is waiting for something (for example, for memory to arrive after executing a load operation) and has nothing else to do. Moreover, the front-end of the CPU is the piece of hardware responsible for fetching and decoding instructions (converting them to uOps), whereas the back-end is responsible for actually executing the uOps.

A CPU cycle is “stalled” when the pipeline doesn't advance during it.

The processor pipeline is composed of many stages: the front-end is the group of stages responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between the front-end and the back-end, so when the former is stalled the latter can still have some work to do.

Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/

The theory:

Let's start from this: today's CPUs are superscalar, which means they can execute more than one instruction per cycle (IPC). The latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro/micro fusion into the discussion to complicate things further :).

Typically, workloads do not reach IPC=4 due to various resource contentions. This means the CPU is wasting cycles (the number of instructions is dictated by the software, and the CPU has to execute them in as few cycles as possible).

We can divide the total cycles spent by the CPU into 3 categories:

  1. Cycles where instructions get retired (useful work)
  2. Cycles being spent in the Back-End (wasted)
  3. Cycles spent in the Front-End (wasted).

To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that in this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps then you will have an overall IPC of 1.
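For example, assuming 4 retirement slots per cycle, with 25% of the cycles retiring 4 uOps and the remaining 75% retiring none:

IPC = 0.25 * 4 + 0.75 * 0 = 1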

The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or for long-latency instructions to finish (e.g. transcendentals - sqrt, reciprocals, divisions, etc.).

The cycles stalled in the front-end are a waste because it means the front-end does not feed the back-end with micro-operations. This can mean you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behavior.

Another stall reason is a branch prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the branch predictor (BP) predicted wrong.

The implementation in profilers:

How do you interpret the BE and FE stalled cycles?

Different profilers take different approaches to these metrics. In VTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable, because either your CPU is stalled (no uOps are retiring) or it is performing useful work (uOps retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm

In perf this usually does not happen. That's a problem, because when you see 125% of cycles stalled in the front-end, you don't really know how to interpret it. You could link the >1 metric to the fact that there are 4 decoders, but if you follow that reasoning the IPC won't match.

Worse, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
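Note where the percentage comes from: it is simply the stall counter divided by the cycles counter. With the numbers from the question above:

962999 stalled-cycles-frontend / 768956 cycles ≈ 1.2523, hence the "125.23% frontend cycles idle"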

I am personally a bit suspicious of perf's BE and FE stalled-cycle counts and hope this will get fixed.

We will probably get the definitive answer by digging into the code here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c

According to the author of these events, they are defined loosely and are approximated by whatever CPU performance counters are available. As far as I know, perf doesn't support formulas for computing a synthetic event from several hardware events, so it can't use the front-end/back-end stall-bound method from Intel's Optimization Manual (implemented in VTune), http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf, section "B.3.2 Hierarchical Top-Down Performance Characterization Methodology":

%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N);
%Bad_Speculation = 100 * ((UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / N);
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N);
%BE_Bound = 100 * (1 - (FE_Bound + Retiring + Bad_Speculation));
N = 4 * CPU_CLK_UNHALTED.THREAD   (for Sandy Bridge)
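As a sketch of gathering the raw inputs for those formulas yourself (assuming a perf build that knows the Intel symbolic event names for Sandy Bridge; ./your_workload is a placeholder), something like this should work, after which the percentages follow from the formulas above with N = 4 * cycles:

perf stat -e cycles,idq_uops_not_delivered.core,uops_issued.any,uops_retired.retire_slots,int_misc.recovery_cycles ./your_workload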

The right formulas can be used with some external scripting, as was done in Andi Kleen's pmu-tools (toplev.py): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):

% toplev.py -d -l2 numademo  100M stream
...
perf stat --log-fd 4 -x, -e
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles}
numademo 100M stream
...
BE      Backend Bound:                      72.03%
This category reflects slots where no uops are being delivered due to a lack
of required resources for accepting more uops in the Backend of the pipeline.
.....
FE      Frontend Bound:                     54.07%
This category reflects slots where the Frontend of the processor undersupplies
its Backend.
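To run the same analysis on your own workload (a sketch; ./your_program is a placeholder, and -l selects the level of the top-down hierarchy as described on the pages linked above):

toplev.py -l1 ./your_program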

The commit that introduced the stalled-cycles-frontend and stalled-cycles-backend events in place of the original universal stalled-cycles event:

http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a

author  Ingo Molnar <mingo@el...>   2011-04-29 11:19:47 (GMT)
committer   Ingo Molnar <mingo@el...>   2011-04-29 12:23:58 (GMT)
commit  8f62242246351b5a4bc0c1f00c0c7003edea128a (patch)
tree    9021c99956e0f9dc64655aaa4309c0f0fdb055c9
parent  ede70290046043b2638204cab55e26ea1d0c6cd9 (diff)

perf events: Add generic front-end and back-end stalled cycle event definitions

Add two generic hardware events: front-end and back-end stalled cycles.

These events measure conditions when the CPU is executing code but its capabilities are not fully utilized. Understanding such situations and analyzing them is an important sub-task of code optimization workflows.

Both events limit performance: most front end stalls tend to be caused by branch misprediction or instruction fetch cache misses, backend stalls can be caused by various resource shortages or inefficient instruction scheduling.

Front-end stalls are the more important ones: code cannot run fast if the instruction stream is not being kept up.

An over-utilized back-end can cause front-end stalls and thus has to be kept an eye on as well.

The exact composition is very program logic and instruction mix dependent.

We use the terms 'stall', 'front-end' and 'back-end' loosely and try to use the best available events from specific CPUs that approximate these concepts.

Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar

    /* Install the stalled-cycles event: UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
-       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES] = 0x1803fb1;
+       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;


-   PERF_COUNT_HW_STALLED_CYCLES        = 7,
+   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND   = 7,
+   PERF_COUNT_HW_STALLED_CYCLES_BACKEND    = 8,