Why can't GCC generate an optimal operator== for a struct of two int32s?

A colleague showed me some code that I thought shouldn't be necessary, but it turned out it was. I would expect most compilers to treat all three of these attempts at equality tests as equivalent:

#include <cstdint>
#include <cstring>


struct Point {
    std::int32_t x, y;
};


[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}


[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}


[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}




[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}

But to my surprise, only the memcpy and memcmp versions are turned into a single 64-bit comparison by GCC. Why? (https://godbolt.org/z/aP1ocs)

Isn't it obvious to the optimizer that checking equality on consecutive pairs of four bytes is the same as comparing all eight bytes at once?

An attempt to avoid booleanizing the two halves separately compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but it is still two separate 32-bit operations:

bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}

GCC and Clang both miss the same optimization when passing the structs by value (so a is in RDI and b is in RSI, because that's how the x86-64 System V calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy/memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
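
For reference, a minimal sketch of what the by-value variants look like (the exact signatures at the Godbolt link may differ slightly), reusing the Point definition and includes from above:

bool naiveEqualByValue(Point a, Point b) {
    // With the x86-64 System V ABI, each 8-byte Point arrives packed into one register.
    return a.x == b.x && a.y == b.y;
}

bool optimizedEqualByValue(Point a, Point b) {
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;  // per the link above, this is the version that becomes cmp rdi, rsi / sete al
}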

Perhaps surprisingly, struct alignas(uint64_t) Point still helps in the by-value case where the args are in registers: it lets GCC optimize both naiveEqual versions, but not the bithack XOR/OR. (https://godbolt.org/z/ofGa1f) Does that give us any hints about GCC's internals? Clang isn't helped by the alignment.


If you "fix" the alignment, all give the same assembly language output (with GCC):

struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};

Demo

As a note, one of the correct/legal ways to do certain things (such as type punning) is to use memcpy, so having a specific (or more aggressive) optimization around that function seems logical.
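
For illustration, a minimal sketch of memcpy-based type punning (the helper name u64_bits is made up for this example; C++20's std::bit_cast achieves the same thing), reusing the Point definition and includes from the question:

std::uint64_t u64_bits(const Point &p) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t));
    std::uint64_t out;
    std::memcpy(&out, &p, sizeof out);  // well-defined, unlike reinterpret_cast + dereference
    return out;
}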

Why can't the compiler generate [same assembly as memcpy version]?

The compiler "could" in the sense that it would be allowed to.

The compiler simply doesn't. Why it doesn't is beyond my knowledge, as that requires deep knowledge of how the optimiser has been implemented. But the answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume one output is faster than the other on all target CPUs".

If you use Clang instead of GCC, you'll notice that it produces the same output for naiveEqual and naiveEqual1, and that assembly has no jump. It is the same as for the "optimised" version except for using two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of Point as shown in Jarod42's answer has no effect on the optimiser.

MSVC behaves like Clang in the sense that it is unaffected by the alignment, but differs in that it doesn't get rid of the jump in naiveEqual.

For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for naiveEqual. For whatever reason, GCC opted to use jne instead of je for the jump.
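
For reference, a minimal sketch of the C++20 defaulted comparison being discussed (assuming the same members as the question's Point; the struct is renamed here only to avoid clashing with the earlier definition):

struct PointCpp20 {
    std::int32_t x, y;
    // C++20: the compiler generates member-wise equality.
    bool operator==(const PointCpp20 &) const = default;
};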

is this a missing compiler optimization

With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.

There's a performance cliff you risk falling off of when implementing this as a single 64-bit comparison:

You break store-to-load forwarding.

If the 32-bit numbers in the structs are written to memory by separate store instructions, and then quickly loaded back from memory with 64-bit load instructions (before the stores hit L1$), your execution will stall until the stores commit to globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the previous 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches cache. This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively: if the execution has to be rolled back, no other CPU can have seen the store, so the code on this CPU that used the loaded value was free to run speculatively.

If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
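
For illustration, a minimal sketch of one way to do that with memcpy (the helper name storePoint is made up for this example; whether it actually compiles to a single 64-bit store is worth verifying in the generated assembly):

void storePoint(Point &dst, const Point &src) {
    std::uint64_t bits;
    std::memcpy(&bits, &src, sizeof bits);
    // Intended to be one 64-bit store, so that a later 64-bit load of dst
    // can be satisfied by store-to-load forwarding.
    std::memcpy(&dst, &bits, sizeof bits);
}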