Benchmarking small code samples in C#, can this implementation be improved?

Quite often on SO I find myself benchmarking small chunks of code to see which implementation is fastest.

Quite often I see comments that benchmarking code does not take into account jitting or the garbage collector.

I have the following simple benchmarking function which I have slowly evolved:

    static void Profile(string description, int iterations, Action func) {
        // warm up
        func();
        // clean up
        GC.Collect();

        var watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < iterations; i++) {
            func();
        }
        watch.Stop();
        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
    }

Usage:

Profile("a descriptions", how_many_iterations_to_run, () =>
{
// ... code being profiled
});

Does this implementation have any flaws? Is it good enough to show that implementation X is faster than implementation Y over Z iterations? Can you think of any ways you would improve this?

EDIT: It's pretty clear that a time-based approach (as opposed to a fixed number of iterations) is preferred. Does anyone have any implementations where the time checks do not impact performance?


If you want to take GC interactions out of the equation, you may want to run your "warm up" call after the GC.Collect call, not before. That way you know .NET will already have enough memory allocated from the OS for the working set of your function.

Keep in mind that you are making a non-inlined method call for each iteration, so make sure you compare the things you are testing to an empty body. You also have to accept that you can only reliably time things that are several times longer than a method call.
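One way to quantify that overhead, assuming the Profile method from the question, is a sketch like the following: time an empty delegate with the same iteration count and treat that result as the noise floor.

    // Sketch: measure the delegate-call overhead itself with an empty body,
    // then run the real candidates with the same iteration count.
    Profile("Baseline (empty body)", how_many_iterations_to_run, () => { });
    Profile("Implementation X", how_many_iterations_to_run, () =>
    {
        // ... code being profiled
    });
    // If a candidate's time is not several times larger than the baseline,
    // the measurement is dominated by call overhead rather than the code itself.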

Also, depending on what kind of thing you're profiling, you may want to run for a certain amount of time rather than for a certain number of iterations - it tends to lead to more easily comparable numbers without having to use a very short run for the best implementation and/or a very long one for the worst.
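A rough sketch of such a duration-based runner (the name ProfileFor and the batch size are just illustrative; consulting the Stopwatch only once per batch keeps the time checks from affecting the measurement, which also addresses the question's EDIT):

    static void ProfileFor(string description, TimeSpan duration, Action func)
    {
        // warm up
        func();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        const int batchSize = 10000;
        long iterations = 0;
        var watch = Stopwatch.StartNew();
        while (watch.Elapsed < duration)
        {
            // run a whole batch between clock reads so the Stopwatch itself
            // contributes almost nothing to the result
            for (int i = 0; i < batchSize; i++)
            {
                func();
            }
            iterations += batchSize;
        }
        watch.Stop();
        Console.WriteLine("{0}: {1:N0} iterations in {2:F1} ms ({3:N0} per second)",
            description, iterations, watch.Elapsed.TotalMilliseconds,
            iterations / watch.Elapsed.TotalSeconds);
    }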

I would avoid passing the delegate at all:

  1. A delegate call is roughly a virtual method call, costing about 25% of the smallest memory allocation in .NET. If you're interested in the details, see e.g. this link.
  2. Anonymous delegates may lead to the use of closures, which you won't even notice. And accessing closure fields is noticeably slower than accessing a variable on the stack (a closure-free alternative is sketched below).

An example of code leading to closure usage:

    public void Test()
    {
        int someNumber = 1;
        Profiler.Profile("Closure access", 1000000,
            () => someNumber + someNumber);
    }

If you're not aware of closures, take a look at this method in .NET Reflector.
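If the closure itself should stay out of the measurement, one option (a sketch; the method and field names are purely illustrative) is to pass a method group that captures nothing:

    // Sketch: a delegate that captures no locals, so no closure object is
    // allocated and the timed body touches a static field rather than a closure field.
    static int _sink;

    static void NoClosureBody()
    {
        _sink++;
    }

    public void Test()
    {
        Profiler.Profile("No closure", 1000000, NoClosureBody);
    }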

You must also run a "warm up" pass prior to actual measurement to exclude the time the JIT compiler spends on jitting your code.

I think the most difficult problem to overcome with benchmarking methods like this is accounting for edge cases and the unexpected. For example - "How do these two code snippets behave under high CPU load / network usage / disk thrashing / etc." They're great for basic logic checks to see whether one algorithm works measurably faster than another. But to properly test most code's performance you would have to create a test that measures the specific bottlenecks of that particular code.

I would still argue that testing small blocks of code often has little return on investment and can encourage overly complex code instead of simple, maintainable code. Writing clear code that other developers, or myself six months down the line, can understand quickly will bring more performance benefits than highly optimized code.

Finalisation won't necessarily have completed before GC.Collect returns. Finalisation is queued and then run on a separate thread. This thread could still be active during your tests, affecting the results.

If you want to ensure that finalisation has completed before you start your tests, then you might want to call GC.WaitForPendingFinalizers, which will block until the finalisation queue has been cleared:

    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

Here is the modified function: as recommended by the community, feel free to amend this; it is a community wiki.

    static double Profile(string description, int iterations, Action func) {
        // Run at highest priority to minimize fluctuations caused by other processes/threads
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
        Thread.CurrentThread.Priority = ThreadPriority.Highest;

        // warm up
        func();

        var watch = new Stopwatch();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++) {
            func();
        }
        watch.Stop();
        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
        return watch.Elapsed.TotalMilliseconds;
    }

Make sure you compile in Release with optimizations enabled, and run the tests outside of Visual Studio. This last part is important because the JIT restricts its optimizations when a debugger is attached, even in Release mode.

I'd call func() several times for the warm-up, not just once.
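For example (a sketch; the warm-up count of 10 is arbitrary), the single warm-up call in the function above could become a short loop so that JIT compilation and any lazy initialization inside func have finished before timing starts:

    // warm up: several calls instead of one, so JIT compilation and any
    // lazy initialization inside func are done before the timed loop
    for (int i = 0; i < 10; i++)
    {
        func();
    }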

Depending on the code you are benchmarking and the platform it runs on, you may need to account for how code alignment affects performance. To do so would probably require an outer wrapper that runs the test multiple times (in separate app domains or processes?), some of the time first calling "padding code" to force it to be JIT-compiled, so as to cause the code being benchmarked to be aligned differently. A complete test result would give the best-case and worst-case timings for the various code alignments.

Suggestions for improvement

  1. Detecting if the execution environment is suitable for benchmarking (such as detecting whether a debugger is attached or whether JIT optimization is disabled, either of which would result in incorrect measurements).

  2. Measuring parts of the code independently (to see exactly where the bottleneck is).

  3. Comparing different versions/components/chunks of code (in your first sentence you say "... benchmarking small chunks of code to see which implementation is fastest").

Regarding #1:

  • To detect if a debugger is attached, read the property System.Diagnostics.Debugger.IsAttached (remember to also handle the case where the debugger is initially not attached, but is attached after some time). A combined sketch of both checks is shown after this list.

  • To detect if jit optimization is disabled, read property DebuggableAttribute.IsJITOptimizerDisabled of the relevant assemblies:

    private bool IsJitOptimizerDisabled(Assembly assembly)
    {
        return assembly.GetCustomAttributes(typeof (DebuggableAttribute), false)
            .Select(customAttribute => (DebuggableAttribute) customAttribute)
            .Any(attribute => attribute.IsJITOptimizerDisabled);
    }
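Combining both checks into a single guard could look roughly like this (a sketch; the method name ValidateBenchmarkEnvironment is illustrative):

    // Sketch: refuse to benchmark when a debugger is attached or when JIT
    // optimizations are disabled for the assembly containing the code under test.
    private static void ValidateBenchmarkEnvironment(Assembly assembly)
    {
        if (System.Diagnostics.Debugger.IsAttached)
            throw new InvalidOperationException(
                "A debugger is attached; measurements would be unreliable.");

        bool jitOptimizerDisabled = assembly
            .GetCustomAttributes(typeof(DebuggableAttribute), false)
            .Cast<DebuggableAttribute>()
            .Any(attribute => attribute.IsJITOptimizerDisabled);

        if (jitOptimizerDisabled)
            throw new InvalidOperationException(
                "JIT optimizations are disabled; compile in Release and run without a debugger.");
    }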
    

Regarding #2:

This can be done in many ways. One way is to allow several delegates to be supplied and then measure those delegates individually.
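A minimal sketch of that idea (the ProfileParts name and dictionary-based signature are just one possible shape; each part gets its own warm-up and cleanup so the parts do not disturb each other):

    // Sketch: measure several named code blocks individually so their
    // contributions can be compared side by side.
    static void ProfileParts(int iterations, IDictionary<string, Action> parts)
    {
        foreach (var part in parts)
        {
            // warm up and clean up per part
            part.Value();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                part.Value();
            }
            watch.Stop();
            Console.WriteLine("{0}: {1} ms", part.Key, watch.Elapsed.TotalMilliseconds);
        }
    }

It could be called with a Dictionary<string, Action> whose keys name the parts being compared ("parse", "transform", "write", and so on).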

Regarding #3:

This could also be done in many ways, and different use-cases would demand very different solutions. If the benchmark is invoked manually, then writing to the console might be fine. However if the benchmark is performed automatically by the build system, then writing to the console is probably not so fine.

One way to do this is to return the benchmark result as a strongly typed object that can easily be consumed in different contexts.
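A sketch of what that could look like (the type and member names are illustrative): a console runner can simply print the result, while a build step can compare MillisecondsPerIteration against a threshold.

    // Sketch: return the measurement instead of writing to the console, so the
    // caller decides whether to print it, log it, or fail a build on regression.
    public sealed class BenchmarkResult
    {
        public string Description { get; set; }
        public int Iterations { get; set; }
        public TimeSpan TotalElapsed { get; set; }
        public double MillisecondsPerIteration
        {
            get { return TotalElapsed.TotalMilliseconds / Iterations; }
        }
    }

    static BenchmarkResult ProfileToResult(string description, int iterations, Action func)
    {
        // warm up
        func();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        return new BenchmarkResult
        {
            Description = description,
            Iterations = iterations,
            TotalElapsed = watch.Elapsed
        };
    }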


Etimo.Benchmarks

Another approach is to use an existing component to perform the benchmarks. Actually, at my company we decided to release our benchmark tool to the public domain. At its core, it manages the garbage collector, jitter, warmups etc., just like some of the other answers here suggest. It also has the three features I suggested above. It manages several of the issues discussed in Eric Lippert's blog.

This is an example output where two components are compared and the results are written to the console. In this case the two components compared are called 'KeyedCollection' and 'MultiplyIndexedKeyedCollection':

Etimo.Benchmarks - Sample Console Output

There is a NuGet package, a sample NuGet package and the source code is available at GitHub. There is also a blog post.

If you're in a hurry, I suggest you get the sample package and simply modify the sample delegates as needed. If you're not in a hurry, it might be a good idea to read the blog post to understand the details.

If you're trying to take the impact of garbage collection out of the benchmark completely, is it worth setting GCSettings.LatencyMode?

If not, and you want the impact of garbage created in func to be part of the benchmark, then shouldn't you also force collection at the end of the test (inside the timer)?
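A sketch of the LatencyMode idea, assuming .NET 4.5 or later for GCLatencyMode.SustainedLowLatency; note that the setting only discourages, rather than prevents, collections:

    // Sketch: discourage blocking GC collections while the timed region runs,
    // then restore the previous latency mode. GCSettings lives in System.Runtime.
    GCLatencyMode oldMode = GCSettings.LatencyMode;
    try
    {
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();
        Console.WriteLine("Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    }
    finally
    {
        GCSettings.LatencyMode = oldMode;
    }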

The basic problem with your question is the assumption that a single measurement can answer all your questions. You need to measure multiple times to get a valid picture of the situation, especially in a garbage-collected language like C#.

Another answer gives a fine way of measuring the basic performance.

    static void Profile(string description, int iterations, Action func) {
        // warm up
        func();

        var watch = new Stopwatch();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++) {
            func();
        }
        watch.Stop();
        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    }

However, this single measurement does not account for garbage collection. A proper profile additionally accounts for the worst-case performance of garbage collection spread out over many calls (this number is sort of useless, since the VM can terminate without ever collecting leftover garbage, but it is still useful for comparing two different implementations of func).

    static void ProfileGarbageMany(string description, int iterations, Action func) {
        // warm up
        func();

        var watch = new Stopwatch();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++) {
            func();
        }
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        watch.Stop();

        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    }

One might also want to measure the worst-case garbage-collection cost of a method that is called only once.

    static void ProfileGarbage(string description, int iterations, Action func) {
        // warm up
        func();

        var watch = new Stopwatch();

        // clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++) {
            func();

            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
        watch.Stop();

        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    }

But more important than suggesting any particular additional measurements to profile is the idea that one should measure multiple different statistics, not just one kind of statistic.