为什么循环指令很慢? 英特尔不能有效地实现它吗?

循环(英特尔参考手册) 减量 ecx/rcx,如果非零则跳跃。它的速度很慢,但是英特尔不能廉价地让它变得更快吗?在 Sandybridge-family 上,dec/jnz已经是 宏融合成一个单独的 uop了; 唯一的区别在于它设置了标志。

loop on various microarchitectures, from Agner Fog 的指令表:

  • K8/K10:7m-ops

  • Bulldozer-family/Ryzen : 1 m-op (与宏融合测试和分支(或 jecxz)相同的成本)

  • P4:4 Uops (与 jecxz相同)

  • P6(PII/PIII) : 8 UOP

  • 奔腾 M,核心2:11

  • Nehalem: 6分钟(loope/loopne为11分钟)。吞吐量 = 4c (loop)或7c (loope/ne)。

  • SnB-family : 7 uops.(loope/loopne为11)。吞吐量 = 每5个周期一个,作为一个瓶颈,因为保持您的循环计数器在内存中!jecxz只有2个 uops,其吞吐量与常规 jcc相同

  • 西尔弗蒙特,七个人

  • AMD Jaguar (低功耗) : 8 uops,5c 吞吐量

  • 通过 Nano3000:2 uops


难道解码器不能像 lea rcx, [rcx-1]/jrcxz一样解码吗?一共是三个人。至少在没有地址大小前缀的情况下是这样的,否则它必须使用 ecx,并在跳转时将 RIP截断为 EIP; 也许地址大小的奇怪选择控制了减价的宽度,解释了为什么会有这么多支票?(有趣的事实: rep-string 指令与使用32位地址大小的 ecx具有相同的行为)

或者更好的方法是,把它解码为一个融合的不会设置标志的 dec-and-Branch?SnB 上的 dec ecx/jnz解码为一个 uop (它设置标志)。

我知道真正的代码不会使用它(因为它至少从 P5或者其他什么时候开始就一直很慢) ,但是 AMD 认为这是值得的,因为它可以让推土机变得更快。可能是因为太容易了。


  • 对于 SnB 家族来说,拥有快速的 loop是否容易?如果是这样,他们为什么不呢?如果不是,为什么这么难?很多解码器晶体管?或者在一个融合的十二月和分支上的额外位来记录它不设置标志?那7个人在干什么?这是个很简单的指令。

  • 推土机有什么特别之处使得快速 loop变得容易/值得?还是 AMD 浪费了一堆晶体管来让 loop变得更快?如果是这样,想必有人认为这是个好主意。


如果 loop是 fast ,那么对于 BigInteger 任意精度的 adc循环,以避免部分标志停滞/减速(请参阅我在答案中的评论)或者任何其他不需要触摸标志的循环来说都是完美的。与 dec/jnz相比,它还有一个小的代码大小优势。(dec/jnz只在 SnB- 家族上使用大熔丝)。

在 ADC 循环中 dec/jnz没问题的现代 CPU 上,loop仍然适合 ADCX/ADOX 循环(保留 OF)。

如果 loop很快的话,编译器早就把它当作一个窥视孔来优化 CPU 的代码大小 + 速度,而不需要进行宏融合了。


这并不能阻止我对每个循环都使用 loop的糟糕的16位代码的所有问题感到恼火,即使他们在循环中还需要另一个计数器。但至少不会是坏的 作为

12523 次浏览

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.

This is not a good or complete answer.

It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.


Good posts with different theories in that thread:

Robert

LOOP became slow on some of the earliest machines (circa 486) when significant pipelining started to happen, and running any but the simplest instruction down the pipeline efficiently was technologically impractical. So LOOP was slow for a number of generations. So nobody used it. So when it became possible to speed it up, there was no real incentive to do so, since nobody was actually using it.


Anton Ertl:

IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow.


(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)

@Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne which checks both the counter and ZF):

I could imagine a possibly sensible 6-µop version:

virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc

(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)

Then Paul said:

I agree that a shorter sequence should be possible, but I was trying to think of a bloated sequence that might make sense if minimal microarchitectural adjustments were permitted.

summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.

If a useless, compatibility-only instruction is handed to the microcode developers, they might reasonably not be able or willing to suggest minor changes to the internal microarchitecture to improve such an instruction. Not only would they rather use their "change suggestion capital" more productively but the suggestion of a change for a useless case would reduce the credibility of other suggestions.

(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)

... Paul continues:

The architects behind Nano may have found avoiding the special casing of LOOP simplified their design in terms of area or power. Or they may have had incentives from embedded users to provide a fast implementation (for code density benefits). Those are just WILD guesses.

If optimization of LOOP fell out of other optimizations (like fusion of compare and branch), it might be easier to tweak LOOP into a fast path instruction than to handle it in microcode even if the performance of LOOP was unimportant.

I suspect that such decisions are based on specific details of the implementation. Information about such details does not seem to be generally available and interpreting such information would be beyond the skill level of most people. (I am not a hardware designer--and have never played one on television or stayed at a Holiday Inn Express. :-)


The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.

But still, there are so many small things: setcc could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).

bts instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.

Please see the nice article by Abrash, Michael, published in Dr. Dobb's Journal March 1991 v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html

The summary of the article is the following:

Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is difficult because the chips use significantly different memory architectures and instruction execution times. Code cannot be optimized for the 80x86 family; rather, code must be designed to produce good performance on a range of systems or optimized for particular combinations of processors and memory. Programmers must avoid the unusual instructions supported by the 8088, which have lost their performance edge in subsequent chips. String instructions should be used but not relied upon. Registers should be used rather than memory operations. Branching is also slow for all four processors. Memory accesses should be aligned to improve performance. Generally, optimizing an 80486 requires exactly the opposite steps as optimizing an 8088.

By "unusual instructions supported by the 8088" the author also means "loop":

Any 8088 programmer would instinctively replace: DEC CX JNZ LOOPTOP with: LOOP LOOPTOP because LOOP is significantly faster on the 8088. LOOP is also faster on the 286. On the 386, however, LOOP is actually two cycles slower than DEC/JNZ. The pendulum swings still further on the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you, we're talking about what was originally perhaps the most obvious optimization in the entire 80x86 instruction set.

This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly highly relevant today.

But this article just gives advices, it encourages to test execution speed and choose faster variants. It doesn’t explain WHY some commands become very slow, so it doesn’t fully address your question.

The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.

Later processors have started to use instruction pipelining – initially, simple, for 804086, and, finally, Pentium Pro (released in 1995) introduced radically different internal pipeline, calling it the Out Of Order (OOO) core where instructions were transformed to small fragments of operations called micro-ops or µops, and then all micro-ops of different instructions were put to a large pool of micro-ops where they were supposed to execute simultaneously as long as they do not depend on one another. This OOO pipeline principle is still used, almost unchanged, on modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115

In order to simplify chip design, Intel decided to build processors in such a way that one instructions did transform to micro-ops in a very efficient way, while others are not.

Efficient conversion from instructions to micro-ops requires more transistors, so Intel have decided to save on transistors at a cost of slower decoding and execution of some “complex” or “rarely-used” instructions.

For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”

So, Intel somehow have decided that the “loop” instruction is “complex”, and, since then, it became very slow. However, there is no official Intel reference on instruction breakdown: how many micro-ops each instruction produces, and how many cycles are required to decode it.

You can also read about The Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2.

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.

His answer made sense. It had to do with paging.

LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.

A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.

Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:

The protection violations involved usually indicate a probable software bug and restart is not desired if one of these violations occurs. In a Protected Mode 80286 system with wait states during any bus cycles, when certain protection violations are detected by the 80286 component, and the component transfers control to the exception handling routine, the contents of the CX register may be unreliable. (Whether CX contents are changed is a function of bus activity at the time internal microcode detects the protection violation.)

To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.

It's this extra burden of saving CX that made LOOP so slow.

Intel, like everyone else at the time, was getting more and more RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.