Weird performance increase in a simple benchmark

Yesterday I found an article by Christoph Nahr titled ".NET Struct Performance" which benchmarked several languages (C++, C#, Java, JavaScript) for a method which adds two point structs (double tuples).

As it turned out, the C++ version takes about 1000 ms to execute (1e9 iterations), while C# cannot get below ~3000 ms on the same machine (and performs even worse in x64).

To test it myself, I took the C# code (slightly simplified to call only the method where parameters are passed by value) and ran it on an i7-3610QM machine (3.1 GHz single-core boost), 8 GB RAM, Win8.1, using .NET 4.5.2, Release build, 32-bit (x86 WoW64 since my OS is 64-bit). This is the simplified version:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

Point is simply defined as:

public struct Point
{
    private readonly double _x, _y;

    public Point(double x, double y) { _x = x; _y = y; }

    public double X { get { return _x; } }

    public double Y { get { return _y; } }
}

Running it produces results similar to those in the article:

Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms

The first strange discovery

Since the method should be inlined, I wondered how the code would perform if I removed structs altogether and simply inlined the whole thing together:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    public static void Main()
    {
        // not using structs at all here
        double ax = 1, ay = 1, bx = 1, by = 1;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
        {
            ax = ax + by;
            ay = ay + bx;
        }
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            ax, ay, sw.ElapsedMilliseconds);
    }
}

And got practically the same result (actually 1% slower after several retries), which means the JIT(ter) seems to be doing a good job optimizing away all the function calls:

Result: x=1000000001 y=1000000001, Time elapsed: 3200 ms

It also means that the benchmark doesn't really seem to measure any struct performance, and in fact only seems to measure basic double arithmetic (after everything else gets optimized away).

The weird stuff

Now comes the weird part. If I merely add another stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster:

public static void Main()
{
    var outerSw = Stopwatch.StartNew();     // <-- added

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    outerSw.Stop();                         // <-- added
}


Result: x=1000000001 y=1000000001, Time elapsed: 961 ms

That's ridiculous! It's not like the Stopwatch is giving me wrong results, because I can clearly see that it finishes after a single second.

Can someone tell me what is going on here?

(Update)

Here are two methods in the same program, which show that the reason is not JITting:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Test1();
        Test2();

        Console.WriteLine();

        Test1();
        Test2();
    }

    private static void Test1()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    private static void Test2()
    {
        var swOuter = Stopwatch.StartNew();

        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);

        swOuter.Stop();
    }
}

Output:

Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms


Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms

Here is the pastebin. You need to run it as a 32-bit Release build on .NET 4.x (there are a couple of checks in the code to ensure this).
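
For illustration only, checks of that kind could look something like the following sketch (this is a hypothetical example, not necessarily what the pastebin contains):

// Hypothetical sketch: bail out unless we run as a 32-bit process on a 4.x CLR.
if (Environment.Is64BitProcess)
    throw new InvalidOperationException("Run this as a 32-bit (x86) Release build.");
if (Environment.Version.Major != 4)
    throw new InvalidOperationException("Run this on the .NET 4.x runtime.");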

(Update 4)

As per @usr's comment on @Hans' answer, I checked the optimized disassembly for both methods, and they are rather different:

(Disassembly screenshot: Test1 on the left, Test2 on the right)

This seems to show that the difference might be due to the compiler acting funny in the first case, rather than to double field alignment?

Also, if I add two new variables (a total offset of 8 bytes), I still get the same speed boost, and it no longer seems to be related to the field alignment mentioned by Hans Passant:

// this is still fast?
private static void Test3()
{
    var magical_speed_booster_1 = "whatever";
    var magical_speed_booster_2 = "whatever";

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    GC.KeepAlive(magical_speed_booster_1);
    GC.KeepAlive(magical_speed_booster_2);
}

There seems to be some bug in the jitter, because the behavior is even weirder. Consider the following code:

public static void Main()
{
    Test1(true);
    Test1(false);
    Console.ReadLine();
}

public static void Test1(bool warmup)
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    if (!warmup)
    {
        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

This will run in 900 ms, the same as the outer-stopwatch case. However, if we remove the if (!warmup) condition, it will run in 3000 ms. What's even stranger is that the following code will also run in 900 ms:

public static void Test1()
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
        0, 0, sw.ElapsedMilliseconds);
}

Note that I've removed the a.X and a.Y references from the Console output.

I have no idea what's going on, but this smells pretty buggy to me, and it's not related to having an outer Stopwatch or not; the issue seems a bit more general.

Narrowed it down somewhat (it only seems to affect the 32-bit CLR 4.0 runtime).

Notice that the placement of var f = Stopwatch.Frequency; makes all the difference.

Slow (2700ms):

static void Test1()
{
    Point a = new Point(1, 1), b = new Point(1, 1);
    var f = Stopwatch.Frequency;

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
        a.X, a.Y, sw.ElapsedMilliseconds);
}

Fast (800ms):

static void Test1()
{
    var f = Stopwatch.Frequency;
    Point a = new Point(1, 1), b = new Point(1, 1);

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
        a.X, a.Y, sw.ElapsedMilliseconds);
}

There is a very simple way to always get the "fast" version of your program: Project > Properties > Build tab, untick the "Prefer 32-bit" option and ensure that the Platform target selection is AnyCPU.

You really don't prefer 32-bit; unfortunately it is always turned on by default for C# projects. Historically, the Visual Studio toolset worked much better with 32-bit processes, an old problem that Microsoft has been chipping away at. Time to get that option removed; VS2015 in particular addressed the last few real road-blocks to 64-bit code with a brand-new x64 jitter and universal support for Edit+Continue.

Enough chatter: what you discovered is the importance of alignment for variables. The processor cares about it a great deal. If a variable is misaligned in memory then the processor has to do extra work to shuffle the bytes into the right order. There are two distinct misalignment problems: one where the bytes are still inside a single L1 cache line, which costs an extra cycle to shift them into the right position; and the extra-bad one, the one you found, where part of the bytes are in one cache line and part in another. That requires two separate memory accesses and gluing them together. Three times as slow.
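
A minimal, self-contained sketch of the cache-line-split effect (my own illustration, not part of the original benchmark; it assumes the usual 64-byte cache lines and needs /unsafe, compiled as an x86 Release build) could look like this:

// Sketch: time a read-modify-write of a double that sits inside one cache line
// versus one that straddles a 64-byte cache-line boundary.
using System;
using System.Diagnostics;

public static class CacheLineSplitDemo
{
    private const int ITERATIONS = 100000000;

    public static unsafe void Main()
    {
        byte[] buffer = new byte[4096];
        fixed (byte* basePtr = buffer)
        {
            long aligned = ((long)basePtr + 63) & ~63L;   // round up to a 64-byte boundary
            double* inLine = (double*)aligned;            // lies fully inside one cache line
            double* splitLine = (double*)(aligned + 60);  // its last 4 bytes spill into the next line

            Console.WriteLine("within one line: {0} ms", Time(inLine));
            Console.WriteLine("split over two:  {0} ms", Time(splitLine));
        }
    }

    private static unsafe long Time(double* p)
    {
        *p = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            *p = *p + 1;                                  // a load and a store on every iteration
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}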

The double and long types are the trouble-makers in a 32-bit process. They are 64 bits in size and can thus get misaligned by 4; the CLR can only guarantee a 32-bit alignment. That is not a problem in a 64-bit process, where all variables are guaranteed to be aligned to 8. It is also the underlying reason why the C# language cannot promise them to be atomic, and why arrays of double are allocated in the Large Object Heap when they have more than 1000 elements: the LOH provides an alignment guarantee of 8. And it explains why adding a local variable solved the problem: an object reference is 4 bytes, so it moved the double variable by 4, which got it aligned. By accident.
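
One way to peek at this on the stack is a sketch like the following (again just an illustration, not from the original code; the jitter is free to lay out locals differently, so the exact output is not guaranteed). Compile with /unsafe as an x86 Release build:

// Sketch: print the stack alignment of a local double, with and without a
// preceding 4-byte object reference local.
using System;
using System.Runtime.CompilerServices;

public static class StackAlignmentDemo
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void WithoutReferenceLocal()
    {
        double d = 1;
        Console.WriteLine("plain:          address mod 8 = {0}", (ulong)&d % 8);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void WithReferenceLocal()
    {
        var pad = "whatever";   // a 4-byte reference that may shift the frame layout
        double d = 1;
        Console.WriteLine("with reference: address mod 8 = {0}", (ulong)&d % 8);
        GC.KeepAlive(pad);
    }

    public static void Main()
    {
        WithoutReferenceLocal();
        WithReferenceLocal();
    }
}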

A 32-bit C or C++ compiler does extra work to ensure that a double cannot be misaligned. Not exactly a simple problem to solve: the stack can be misaligned when a function is entered, given that the only guarantee is that it is aligned to 4. The prologue of such a function needs to do extra work to get it aligned to 8. The same trick doesn't work in a managed program: the garbage collector cares a great deal about where exactly a local variable is located in memory. Necessary so it can discover that an object in the GC heap is still referenced. It cannot deal properly with such a variable getting moved by 4 because the stack was misaligned when the method was entered.

This is also the underlying problem with .NET jitters not easily supporting SIMD instructions. They have much stronger alignment requirements, the kind that the processor cannot solve by itself either. SSE2 requires an alignment of 16, AVX requires an alignment of 32. Can't get that in managed code.

Last but not least, also note that this makes the perf of a C# program that runs in 32-bit mode very unpredictable. When you access a double or long that's stored as a field in an object, perf can drastically change when the garbage collector compacts the heap, which moves objects in memory; such a field can then suddenly become misaligned. Very random of course, can be quite a head-scratcher :)
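
The same thing can be observed for heap objects with a sketch along these lines (my own illustration; compile with /unsafe as an x86 Release build, and expect the printed alignments to vary between objects, runs, and GC compactions):

// Sketch: check whether a double field stored in a heap object happens to be
// 8-byte aligned at this moment.
using System;

public class Holder
{
    public double Value;
}

public static class HeapAlignmentDemo
{
    public static unsafe void Main()
    {
        var holders = new Holder[8];
        for (int i = 0; i < holders.Length; i++)
            holders[i] = new Holder();

        foreach (var h in holders)
        {
            fixed (double* p = &h.Value)
                Console.WriteLine("field address mod 8 = {0}", (ulong)p % 8);
        }
    }
}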

Well, no simple fixes but one: 64-bit code is the future. Remove the jitter forcing as long as Microsoft won't change the project template. Maybe next version, when they feel more confident about RyuJIT.

Update 4 explains the problem: in the first case, the JIT keeps the calculated values (a, b) on the stack; in the second case, the JIT keeps them in registers.

In fact, Test1 works slowly because of the Stopwatch. I wrote the following minimal benchmark based on BenchmarkDotNet:

[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // fadd        qword ptr [ebp-14h]
            // fstp        qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}

The results on my computer:

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
Type=Jit_RegistersVsStack  Mode=Throughput  Platform=X86  Jit=HostJit  .NET=HostFramework


             Method |   AvrTime |    StdDev |       op/s |
------------------- |---------- |---------- |----------- |
   WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
      WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
 WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |

As we can see:

  • WithoutStopwatch works quickly (because a = a + b uses the registers)
  • WithStopwatch works slowly (because a = a + b uses the stack)
  • WithTwoStopwatches works quickly again (because a = a + b uses the registers)

The behavior of JIT-x86 depends on a large number of different conditions. For some reason, the first stopwatch forces JIT-x86 to use the stack, while the second stopwatch allows it to use the registers again.