Why does my application spend 24% of its life doing a null check?

I've got a performance-critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary tree iterator is below, along with the results from running performance analysis against it.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;

24.6%       while (node.BranchData != null)
            {
0.2%            BranchNodeData b = node.BranchData;
0.5%            node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
0.8%                node = b.Child1;
            }

0.4%        return node;
        }

BranchData is a field, not a property. I did that to avoid any risk of it not being inlined.

The BranchNodeData class is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The node's children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

As you can see, the while loop / null check has a huge impact on performance. The tree is massive, so I would expect searching for a leaf to take a while, but I'd like to understand the disproportionate amount of time spent on that one line.

Things I've tried:

  • Separating the null check out of the while; it's the null check that takes the hit.
  • Adding a boolean field to the object and checking against that instead; it made no difference. What is being compared doesn't matter; it is the comparison itself that is the issue (a minimal sketch of that variant follows this list).
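
For reference, a minimal sketch of the boolean-field variant mentioned above; the IsBranch field is hypothetical, added to ScTreeNode purely for this test and kept in sync with BranchData:

// Assumed addition to ScTreeNode, set wherever BranchData is assigned:
//     internal bool IsBranch;        // true iff BranchData != null

public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
{
    ScTreeNode node = RootNodes[rootIndex].TreeNode;

    // The loop condition now reads a plain bool instead of comparing
    // a reference against null; the traversal itself is unchanged.
    while (node.IsBranch)
    {
        BranchNodeData b = node.BranchData;
        node = b.Child2;
        if (inputs[b.SplitInputIndex] <= b.SplitValue)
            node = b.Child1;
    }

    return node;
}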

Is it a branch prediction issue? If so, what can I do about it? If anything?

I won't pretend to understand the CIL, but I'll post it for anyone who does, so they can try to scrape some information from it.

.method public hidebysig
    instance class OptimalTreeSearch.ScTreeNode GetNodeForState (
        int32 rootIndex,
        float32[] inputs
    ) cil managed
{
    // Method begins at RVA 0x2dc8
    // Code size 67 (0x43)
    .maxstack 2
    .locals init (
        [0] class OptimalTreeSearch.ScTreeNode node,
        [1] class OptimalTreeSearch.BranchNodeData b
    )

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode> OptimalTreeSearch.ScSearchTree::RootNodes
    IL_0006: ldarg.1
    IL_0007: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode>::get_Item(int32)
    IL_000c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.ScRootNode::TreeNode
    IL_0011: stloc.0
    IL_0012: br.s IL_0039
    // loop start (head: IL_0039)

    IL_0014: ldloc.0
    IL_0015: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
    IL_001a: stloc.1
    IL_001b: ldloc.1
    IL_001c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child2
    IL_0021: stloc.0
    IL_0022: ldarg.2
    IL_0023: ldloc.1
    IL_0024: ldfld int32 OptimalTreeSearch.BranchNodeData::SplitInputIndex
    IL_0029: ldelem.r4
    IL_002a: ldloc.1
    IL_002b: ldfld float32 OptimalTreeSearch.BranchNodeData::SplitValue
    IL_0030: bgt.un.s IL_0039

    IL_0032: ldloc.1
    IL_0033: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child1
    IL_0038: stloc.0

    IL_0039: ldloc.0
    IL_003a: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
    IL_003f: brtrue.s IL_0014
    // end loop

    IL_0041: ldloc.0
    IL_0042: ret
} // end of method ScSearchTree::GetNodeForState

EDIT: I decided to do a branch prediction test. I added an identical if inside the while, so we now have

while (node.BranchData != null)

还有

if (node.BranchData != null)

inside it. I then ran performance analysis against that, and it took six times longer to execute the first comparison as it did to execute the second comparison, which always returned true. So it looks like it is indeed a branch prediction issue, and I'm guessing there's nothing I can do about it?!
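
For clarity, the shape of that test looks roughly like this (a minimal sketch; the loop body is the traversal shown earlier, with only the redundant inner check added):

while (node.BranchData != null)          // reported as roughly six times slower
{
    if (node.BranchData != null)         // identical check, always true, much cheaper
    {
        BranchNodeData b = node.BranchData;
        node = b.Child2;
        if (inputs[b.SplitInputIndex] <= b.SplitValue)
            node = b.Child1;
    }
}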

ANOTHER EDIT:

Such a result would also occur if node.BranchData had to be loaded from RAM for the while check; it would then be cached for the if statement.


This is my third question on a similar topic, this time focusing on a single line of code. My other questions on this subject are:


"The tree is massive"

By far the most expensive thing a processor ever does is not executing instructions; it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. It is a problem related to distance: the further an electrical signal has to travel, the harder it gets to deliver that signal to the other end of the wire without it being corrupted. The only cure for that problem is to make it go slower. That is a big problem for the wires that connect the CPU to the RAM in your machine; you can pop the case and see the wires.

Processors have a counter-measure for this problem: they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. It is small, which allows it to sit close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically become faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever-increasing number of transistors.

Those caches are however not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles is possible on a single instruction.

Tree structures are a problem: they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is by reading from sequential addresses. The unit of storage for the L1 cache is 64 bytes. In other words, once the processor reads one byte, the next 63 are very fast since they'll be present in the cache.

That makes an array by far the most efficient data structure. It is also the reason the .NET List<> class isn't a linked list at all; it uses an array for storage. The same goes for other collection types, like Dictionary: it is structurally not remotely similar to an array, but it is internally implemented with arrays.

So your while() statement is very likely to be suffering from CPU stalls because it is dereferencing a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap; a processor uses a buffer for writes.

This is not otherwise a simple problem to solve; flattening your tree into arrays is very likely to be impractical, not least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help, but it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast as you can hope for, and if you need it to go faster then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.
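
To make the "flattening your tree into arrays" idea concrete, here is a minimal sketch under assumed simplifications: the tree is rebuilt once into a single contiguous array of small structs, a negative child index marks a leaf, and the names FlatBranch, FlatTree, Branches and Leaves are hypothetical, not part of the original code.

// Branch nodes stored as structs in one contiguous array: each step of the
// walk reads one small, densely packed record instead of chasing a reference
// to a heap-scattered object.
public struct FlatBranch
{
    public int SplitInputIndex;
    public float SplitValue;
    public int Child1;   // >= 0: index of another branch; < 0: ~index into Leaves
    public int Child2;
}

public sealed class FlatTree
{
    public FlatBranch[] Branches;   // built once from the original tree
    public ScTreeNode[] Leaves;     // the original leaf nodes

    public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
    {
        int index = rootIndex;
        while (index >= 0)
        {
            FlatBranch b = Branches[index];
            index = b.Child2;
            if (inputs[b.SplitInputIndex] <= b.SplitValue)
                index = b.Child1;
        }
        return Leaves[~index];      // ~index recovers the leaf slot
    }
}

Whether this pays off depends on the cost of the rebuild and on the visit pattern, which is exactly the caveat raised above.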

To complement Hans' great answer about memory cache effects, I will add a discussion of virtual-to-physical memory address translation and of NUMA effects.

On a computer with virtual memory (that is, all current computers), each virtual memory address must be translated to a physical memory address whenever memory is accessed. This is done by the memory management hardware using a translation table. The table is managed by the operating system for each process, and it is itself stored in RAM. For each page of virtual memory there is an entry in this translation table mapping a virtual page to a physical page. Remember Hans' discussion about memory accesses being expensive: if every virtual-to-physical translation required a memory lookup, all memory accesses would cost twice as much. The solution is a cache for the translation table, called the translation lookaside buffer (TLB for short). TLBs are not large (12 to 4,096 entries), and the typical page size on the x86-64 architecture is only 4 KB, which means there is at most 16 MB directly accessible through TLB hits (it is probably even less than that; Sandy Bridge has a TLB size of 512 entries). To reduce the number of TLB misses, you can have the operating system and the application work together to use a larger page size, such as 2 MB, giving a much larger memory space accessible through TLB hits. This page explains how to use large pages with Java, which can greatly speed up memory accesses.
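
As a back-of-the-envelope check of the "TLB reach" figures quoted above (the entry counts and page sizes are the assumptions from the paragraph, not measurements):

// TLB reach = number of TLB entries * page size.
const long FourKB = 4 * 1024;
const long TwoMB  = 2 * 1024 * 1024;

long reachMaxEntries = 4096 * FourKB;   // 16 MB with 4,096 entries and 4 KB pages
long reachSandy      =  512 * FourKB;   //  2 MB with a 512-entry TLB
long reachLargePages =  512 * TwoMB;    //  1 GB with the same TLB and 2 MB pages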

If your computer has several sockets, it is probably a NUMA architecture. NUMA means Non-Uniform Memory Access. In these architectures, some memory accesses cost more than others. As an example, on a 2-socket computer with 32 GB of RAM, each socket probably has 16 GB of RAM. On this example computer, local memory accesses are cheaper than accesses to the memory of the other socket (remote accesses are 20 to 100% slower, maybe even more). If, on such a computer, your tree uses 20 GB of RAM, at least 4 GB of your data is on the other NUMA node, and if remote accesses are 50% slower, NUMA slows your memory accesses down by about 10%. In addition, if you only have free memory on a single NUMA node, all the processes needing memory on the starved node will be allocated memory from the other node, whose accesses are more expensive. Even worse, the operating system could think it is a good idea to swap out part of the memory of the starved node, which would cause even more expensive memory accesses. This is explained in more detail in The MySQL "swap insanity" problem and the effects of the NUMA architecture, where some solutions are given for Linux (spreading memory accesses over all NUMA nodes, biting the bullet on remote NUMA accesses to avoid swapping). I can also think of allocating more RAM to one socket (24 and 8 GB instead of 16 and 16 GB) and making sure your program is scheduled on the larger NUMA node, but this needs physical access to the computer and a screwdriver ;-).
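
The ~10% figure works out as a simple weighted average; the inputs (20 GB tree, 16 GB per socket, 50% remote penalty, accesses spread uniformly over the tree) are the assumptions stated above:

// Fraction of the tree forced onto the remote NUMA node.
double remoteFraction = 4.0 / 20.0;              // 4 GB out of 20 GB = 20%

// Average slowdown if remote accesses are 50% slower than local ones.
double averageSlowdown = remoteFraction * 0.50;  // 0.20 * 0.50 = 0.10, i.e. ~10%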

This is not an answer per se but rather an emphasis on what Hans Passant wrote about delays in the memory system.

Really high-performance software, such as computer games, is not only written to implement the game itself; it is also adapted so that the code and data structures make the most of the cache and memory systems, i.e., treat them as a limited resource. When I deal with cache issues I typically assume that the L1 will deliver in 3 cycles if the data is present there. If it isn't and I have to go to L2, I assume 10 cycles. For L3, 30 cycles, and for RAM, 100.
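
To put rough numbers like these on your own machine, a pointer-chasing microbenchmark is the usual tool: each load depends on the previous one, so it exposes latency rather than bandwidth. A minimal sketch (the array size, iteration count and random seed are arbitrary assumptions):

using System;
using System.Diagnostics;

static class LatencyProbe
{
    static void Main()
    {
        const int count = 1 << 24;             // 16M ints = 64 MB, far larger than L3
        var next = new int[count];

        // Build a single random cycle so every load depends on the previous one
        // and the hardware prefetcher cannot guess the next address.
        var indices = new int[count];
        for (int i = 0; i < count; i++) indices[i] = i;
        var rng = new Random(12345);
        for (int i = count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        for (int i = 0; i < count; i++)
            next[indices[i]] = indices[(i + 1) % count];

        const int hops = 50_000_000;
        int p = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < hops; i++)
            p = next[p];                        // each iteration waits for one memory access
        sw.Stop();

        Console.WriteLine($"~{sw.Elapsed.TotalSeconds * 1e9 / hops:F1} ns per dependent load (p={p})");
    }
}

Divide the nanoseconds per load by your CPU's cycle time to get an approximate cycle count; shrinking the array until it fits in L2 or L1 shows the corresponding cache latencies.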

There is an additional memory-related action that, if you need to use it, imposes an even greater penalty, and that is a bus lock. Bus locks are called critical sections if you use the Windows NT functionality; if you use a home-grown variety you might call it a spinlock. Whatever the name, it synchronizes down to the slowest bus-mastering device in the system before the lock is in place. The slowest bus-mastering device might be a classic 32-bit PCI card connected at 33 MHz, which is one hundredth of the frequency of a typical x86 CPU (at 3.3 GHz). I assume no less than 300 cycles to complete a bus lock, but I know they can take many times that long, so if I see 3,000 cycles I won't be surprised.

Novice multi-threading software developers will use bus locks all over the place and then wonder why their code is slow. The trick - as with everything that has to do with memory - is to economize on accesses.
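
As one concrete way to economize in C# (Interlocked operations compile down to bus-locked instructions; the helper names below are hypothetical, introduced only for this illustration):

using System.Threading;

static class SharedTotals
{
    // Wasteful: one bus-locked instruction per element.
    public static void AddAllWasteful(int[] items, ref long sharedTotal)
    {
        for (int i = 0; i < items.Length; i++)
            Interlocked.Add(ref sharedTotal, items[i]);
    }

    // Economical: accumulate in a local, then publish with a single bus lock.
    public static void AddAllEconomical(int[] items, ref long sharedTotal)
    {
        long local = 0;
        for (int i = 0; i < items.Length; i++)
            local += items[i];
        Interlocked.Add(ref sharedTotal, local);
    }
}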