为什么我的应用程序要花费24% 的生命来执行空检查？

小开

最佳答案

The tree is massive

By far the most expensive thing a processor ever does is not executing instructions, it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. A problem related to distance, the further an electrical signal has to travel, the harder it gets to get that signal delivered to the other end of the wire without it being corrupted. The only cure for that problem is to make it go slower. A big problem with the wires that connect the CPU to the RAM in your machine, you can pop the case and see the wires.

Processors have a counter-measure for this problem, they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. Small, allowing it to be close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically becomes faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever increasing number of transistors.

Those caches are however not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles is possible on a single instruction.

Tree structures are a problem, they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is by reading from sequential addresses. The unit of storage for the L1 cache is 64 bytes. Or in other words, once the processor reads one byte, the next 63 are very fast since they'll be present in the cache.

Which makes an array by far the most efficient data structure. Also the reason that the .NET List<> class isn't a list at all, it uses an array for storage. The same for other collection types, like Dictionary, structurally not remotely similar to an array, but internally implemented with arrays.

So your while() statement is very likely to be suffering from CPU stalls because it is dereferencing a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap, a processor uses a buffer for writes.

Not otherwise a simple problem to solve, flattening your tree into arrays is very likely to be unpractical. Not in the least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help, it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast you can hope for. And if you need it to go faster then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.

小开

To complement Hans' great answer about memory cache effects, I add a discussion of virtual memory to physical memory translation and NUMA effects.

With virtual memory computer (all current computer), when doing a memory access, each virtual memory address must be translated to a physical memory address. This is done by the memory management hardware using a translation table. This table is managed by the operating system for each process and it is itself stored in RAM. For each page of virtual memory, there is an entry in this translation table mapping a virtual to a physical page. Remember Hans' discussion about memory accesses that are expensive: if each virtual to physical translation needs a memory lookup, all memory access would cost twice as much. The solution is to have a cache for the translation table which is called the translation lookaside buffer (TLB for short). TLB are not large (12 to 4096 entries), and typical page size on x86-64 architecture are only 4 KB, which means that there is at most 16 MB directly accessible with TLB hits (it is probably even less than that, the Sandy Bridge having a TLB size of 512 items). To reduce the number of TLB misses, you can have the operating system and the application work together to use a larger page size like 2 MB, leading to a much larger memory space accessible with TLB hits. This page explain how to use large pages with Java which can greatly speedup memory accesses.

If your computer has many sockets, it is probably a NUMA architecture. NUMA means Non-Uniform Memory Access. In these architectures, some memory accesses cost more than others. As an example, with a 2 socket computer with 32 GB of RAM, each socket probably has 16 GB of RAM. On this example computer, local memory accesses are cheaper than accesses to memory of another socket (remote access are 20 to 100% slower, maybe even more). If on such computer, your tree uses 20 GB of RAM, at least 4 GB of your data is on the other NUMA node, and if accesses are 50% slower for remote memory, NUMA accesses slow down your memory accesses by 10%. In addition, if you only have free memory on a single NUMA node, all the processes needing memory on the starved node will be allocated memory from the other node which accesses are more expensive. Even worst, the operating system could think it is a good idea to swap out part of the memory of the starved node, which would cause even more expensive memory accesses. This is explained in more details in The MySQL “swap insanity” problem and the effects of the NUMA architecture where some solutions are given for Linux (spreading memory accesses on all NUMA nodes, biting the bullet on remote NUMA accesses to avoid swapping). I can also think of allocating more RAM to a socket (24 and 8 GB instead of 16 and 16 GB) and making sure your program is schedule on the larger NUMA node, but this needs physical access to the computer and a screwdriver ;-).

小开

This is not an answer per se but rather an emphasis on what Hans Passant wrote about delays in the memory system.

Really high performance software - such as computer games - is not only written to implement the game itself, it is also adapted such that code and data structures make the most of the cache and memory systems i e treat them as a limited resource. When I deal with cache issues I typically assume that the L1 will deliver in 3 cycles if the data is present there. If it isn't and I have to go to L2 I assume 10 cycles. For L3 30 cycles and for RAM memory 100.

There is an additional memory-related action that - if you need to use it - imposes an even greater penalty and that is a bus lock. Bus locks are called critical sections if you use the Windows NT functionality. If you use a home-grown variety you might call it a spinlock. Whatever the name it synchronizes down to the slowest bus-mastering device in the system before the lock is in place. The slowest bus-mastering device might be a classic 32-bit PCI card connected @ 33MHz. 33MHz is one hundredth of the frequency of a typical x86 CPU (@ 3.3 GHz). I assume not less than 300 cycles to complete a bus lock but I know they can take many times that long so if I see 3000 cycles I won't be surprised.

Novice multi-threading software developers will use bus locks all over the place and then wonder why their code is slow. The trick - as with everything that has to do with memory - is to economize on accesses.