
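One may need to align to a cache line boundary, typically 64 bytes per cache line, when working with interrupts or high-performance data reads, and doing so is mandatory when working with interprocess sockets. With interprocess sockets there are control variables that cannot be spread across multiple cache lines or DDR RAM words, or else the L1, L2, etc. caches or the DDR RAM will function as a low-pass filter and filter out your interrupt data! THAT IS BAD!!! It means you get bizarre errors when your algorithm is good, and it has the potential to make you go insane!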

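DDR RAM almost always reads in 128-bit words (DDR RAM words), which is 16 bytes, so the ring buffer variables must not be spread across multiple DDR RAM words. Some systems do use 64-bit DDR RAM words, and technically you could get a 32-bit DDR RAM word on a 16-bit CPU, but one would use SDRAM in that situation.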
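As a minimal sketch of keeping such control variables from straddling a boundary, the struct below gives each ring buffer index its own cache line with alignas. The names RingBufferControl, head, tail and kCacheLineSize are hypothetical, and the 64-byte line size is an assumption that should be checked for the target CPU.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical but not universal.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical control block for a single-producer/single-consumer ring
// buffer. Each index starts on its own cache line, so neither index can be
// split across two lines and the two sides never write to the same line.
struct RingBufferControl {
  alignas(kCacheLineSize) std::atomic<std::size_t> head{0};  // consumer side
  alignas(kCacheLineSize) std::atomic<std::size_t> tail{0};  // producer side
};

// The compiler pads the struct so both members begin on a line boundary.
static_assert(alignof(RingBufferControl) == kCacheLineSize,
              "control block must start on a cache line");
static_assert(sizeof(RingBufferControl) == 2 * kCacheLineSize,
              "each index should occupy exactly one cache line");
```

Because 64 is a multiple of 16, the same layout also keeps each index inside a single 128-bit DDR RAM word.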

One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than the prior fastest algorithm), and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point algorithm. In order to print the floating-point number you must print the integer, so one optimization I have implemented for Grisu is to cache-line-align its Lookup Tables (LUTs) into exactly 15 cache lines; it is rather odd that they actually aligned like that. This takes the LUTs from the .bss section (i.e. static memory) and places them onto the stack (or the heap, but the stack is more appropriate). I have not benchmarked this, but it's good to bring up, and I learned a lot about it: the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only, and because it's read-only it has much larger cache lines (2KB was what a professor quoted me once). So you're actually going to degrade your performance with array indexing as opposed to loading a variable like this:

int faster_way = 12345678;

as opposed to the slower way:

int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];

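The difference is that int faster_way = 12345678 will get loaded from the i-cache lines by offsetting to the variable in the i-cache from the start of the function, while slower_way = variables[0] will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtlety, as I just discovered, is actually slowing down my integer-to-string algorithm and many others'. I say this because you may think you are optimizing by cache-aligning read-only data when you actually are not.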

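Typically in C++, you will use the std::align function. I would advise you not to use this function because it is not guaranteed to work optimally. Here is the fastest way to align to a cache line; to be up front, I am the author and this is a shameless plug: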

Memory Alignment Algorithm

#include <cstdint>

namespace _ {
/* Aligns the given pointer up to a power-of-two boundary with a premade mask.
@return An aligned pointer of typename T.
@brief Algorithm is a 2's complement trick that works by masking off
the desired number of bits of the negated address and adding them to the
pointer.
@param pointer The pointer to align.
@param mask The mask for the Least Significant bits to align, i.e. the
boundary minus one (63 for a 64-byte cache line). */
template <typename T = char>
inline T* AlignUp(void* pointer, intptr_t mask) {
  intptr_t value = reinterpret_cast<intptr_t>(pointer);
  value += (-value) & mask;  // Bytes needed to reach the next boundary.
  return reinterpret_cast<T*>(value);
}
}  //< namespace _


// Example calls using the faster mask technique.
enum { kSize = 256 };
char buffer[kSize + 64];  // 64 spare bytes leave room for the alignment.

char* aligned_to_64_byte_cache_line = _::AlignUp<>(buffer, 63);

char16_t* aligned_to_64_byte_cache_line2 = _::AlignUp<char16_t>(buffer, 63);
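To make the arithmetic behind (-value) & mask concrete, here is a small worked sketch on plain integers. PaddingFor is a hypothetical helper, not part of the toolkit, and it assumes the mask is a power-of-two boundary minus one.

```cpp
#include <cstdint>

// Hypothetical helper: the number of bytes to add to `address` to reach the
// next boundary, where `mask` is the boundary minus one (63 for 64 bytes).
// Unsigned subtraction from zero is the well-defined form of the negation.
constexpr std::uintptr_t PaddingFor(std::uintptr_t address,
                                    std::uintptr_t mask) {
  return (0 - address) & mask;
}

// 0x1000 is already a multiple of 64, so nothing is added.
static_assert(PaddingFor(0x1000, 63) == 0, "aligned address needs no padding");
// 0x1001 is one byte past a boundary, so 63 bytes reach 0x1040.
static_assert(PaddingFor(0x1001, 63) == 63, "one past a boundary");
// 0x103F is one byte short of 0x1040.
static_assert(PaddingFor(0x103F, 63) == 1, "one short of a boundary");
// The same trick works for a 16-byte boundary with mask 15.
static_assert(PaddingFor(0x1001, 15) == 15, "16-byte boundary");
```

The padding is never more than the mask itself, which is why the example buffer only needs 64 spare bytes.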

And here is the faster std::align replacement:

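#include <cstddef>
#include <cstdint>

inline void* align_kabuki(size_t align, size_t size, void*& ptr,
                          size_t& space) noexcept {
  // Begin Kabuki Toolkit Implementation
  // align must be a power of two.
  uintptr_t int_ptr = reinterpret_cast<uintptr_t>(ptr),
            offset = (0 - int_ptr) & (align - 1);
  // Compare before subtracting so the unsigned arithmetic cannot wrap.
  if (space < size || space - size < offset) return nullptr;
  space -= offset;
  // Like std::align, leave ptr at the aligned address.
  ptr = reinterpret_cast<void*>(int_ptr + offset);
  return ptr;
  // End Kabuki Toolkit Implementation
}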
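For comparison, this is roughly how the standard std::align from <memory> is called on the same kind of buffer. FirstCacheLine is a hypothetical wrapper written only for this illustration, and the 64-byte line size is again an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical wrapper: returns the first 64-byte-aligned address inside
// `buffer` that still leaves room for `size` bytes, or nullptr if the
// buffer is too small.
inline void* FirstCacheLine(void* buffer, std::size_t capacity,
                            std::size_t size) {
  void* ptr = buffer;            // std::align moves this to the aligned spot.
  std::size_t space = capacity;  // std::align subtracts the bytes it skipped.
  return std::align(64, size, ptr, space);
}
```

Calling it with a char buffer[256 + 64] and a size of 256 always succeeds, because at most 63 bytes are skipped to reach the boundary.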
