Matrix multiplication: Small difference in matrix size, large difference in timings

I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
for(j = 0; j < dimension; j++)
for(k = 0; k < dimension; k++)
C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrix is represented by dimension. Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the difference in no. of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, the difference in the timings is 447/147 = 3. Can someone explain why this happens? I expected it to scale linearly, which does not happen. I am not trying to make the fastest matrix multiply code, simply trying to understand why it happens.

Specs: AMD Opteron dual core node (2.2GHz), 2G RAM, gcc v 4.5.0

Program compiled as gcc -O3 simple.c

I have run this on Intel's icc compiler as well, and seen similar results.

EDIT:

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Heres the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>


/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv;


double timestamp()
{
double t;
gettimeofday(&tv, NULL);
t = tv.tv_sec + (tv.tv_usec/1000000.0);
return t;
}


int main(int argc, char *argv[])
{
int i, j, k;
double *A, *B, *C, start, end;


A = (double*)malloc(dimension*dimension*sizeof(double));
B = (double*)malloc(dimension*dimension*sizeof(double));
C = (double*)malloc(dimension*dimension*sizeof(double));


srand(292);


for(i = 0; i < dimension; i++)
for(j = 0; j < dimension; j++)
{
A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
C[dimension*i+j] = 0.0;
}


start = timestamp();
for(i = 0; i < dimension; i++)
for(j = 0; j < dimension; j++)
for(k = 0; k < dimension; k++)
C[dimension*i+j] += A[dimension*i+k] *
B[dimension*k+j];


end = timestamp();
printf("\nsecs:%f\n", end-start);


free(A);
free(B);
free(C);


return 0;
}
15337 次浏览

Here's my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache. Which is slighly less than the 32kb L1 cache. (while leaving room other necessary things)

But when you bump it up to 2048, it uses the entire cache (and you spill some because you need room for other things)

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power-of-two. Though I think that processor is 2-way L1 associative so I don't think it matters in this case. (but I'll throw the idea out there anyway)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated on the column. So the access is strided. Your total data size is 2k x 2k which is about 32 MB per matrix. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from ram.

There are several possible explanations. One likely explanation is what Mysticial suggests: exhaustion of a limited resource (either cache or TLB). Another likely possibility is a false aliasing stall, which can occur when consecutive memory accesses are separated by a multiple of some power-of-two (often 4KB).

You can start to narrow down what's at work by plotting time/dimension^3 for a range of values. If you have blown a cache or exhausted TLB reach, you will see a more-or-less flat section followed by a sharp rise between 2000 and 2048, followed by another flat section. If you are seeing aliasing-related stalls, you will see a more-or-less flat graph with a narrow spike upward at 2048.

Of course, this has diagnostic power, but it is not conclusive. If you want to conclusively know what the source of the slowdown is, you will want to learn about performance counters, which can answer this sort of question definitively.

A couple of answers mentioned L2 Cache problems.

You can actually verify this with a cache simulation. Valgrind's cachegrind tool can do that.

valgrind --tool=cachegrind --cache-sim=yes your_executable

Set the command line parameters so they match with your CPU's L2 parameters.

Test it with different matrix sizes, you'll probably see a sudden increase in L2 miss ratio.

You are definitely getting what I call a cache resonance. This is similar to aliasing, but not exactly the same. Let me explain.

Caches are hardware data structures that extract one part of the address and use it as an index in a table, not unlike an array in software. (In fact, we call them arrays in hardware.) The cache array contains cache lines of data, and tags - sometimes one such entry per index in the array (direct mapped), sometimes several such (N-way set associativity). A second part of the address is extracted and compared to the tag stored in the array. Together, the index and tag uniquely identify a cache line memory address. Finally, the rest of the address bits identifies which bytes in the cache line are addressed, along with the size of the access.

Usually the index and tag are simple bitfields. So a memory address looks like

  ...Tag... | ...Index... | Offset_within_Cache_Line

(Sometimes the index and tag are hashes, e.g. a few XORs of other bits into the mid-range bits that are the index. Much more rarely, sometimes the index, and more rarely the tag, are things like taking cache line address modulo a prime number. These more complicated index calculations are attempts to combat the problem of resonance, which I explain here. All suffer some form of resonance, but the simplest bitfield extraction schemes suffer resonance on common access patterns, as you have found.)

So, typical values... there are many different models of "Opteron Dual Core", and I do not see anything here that specifies which one you have. Choosing one at random, the most recent manual I see on AMD's website, Bios and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh, March 12, 2012.

(Family 15h = Bulldozer family, the most recent high end processor - the BKDG mentions dual core, although I don't know the product number that is exactly what you describe. But, anyway, the same idea of resonance applies to all processors, it is just that the parameters like cache size and associativity may vary a bit.)

From p.33:

The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128- bit ports. This is a write-through cache that supports up to two 128 Byte loads per cycle. It is divided into 16 banks, each 16 bytes wide. [...] Only one load can be performed from a given bank of the L1 cache in a single cycle.

To sum up:

  • 64 byte cache line => 6 offset bits within cache line

  • 16KB/4-way => the resonance is 4KB.

    I.e. address bits 0-5 are the cache line offset.

  • 16KB / 64B cache lines => 2^14/2^6 = 2^8=256 cache lines in the cache.
    (Bugfix: I originally miscalculated this as 128. that I have fixed all dependencies.)

  • 4 way associative => 256/4 = 64 indexes in the cache array. I (Intel) call these "sets".

    i.e. you can consider the cache to be an array of 32 entries or sets, each entry containing 4 cache lines ad their tags. (It's more complicated than this, but that's okay).

(By the way, the terms "set" and "way" have varying definitions.)

  • there are 6 index bits, bits 6-11 in the simplest scheme.

    This means that any cache lines that have exactly the same values in the index bits, bits 6-11, will map to the same set of the cache.

Now look at your program.

C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Loop k is the innermost loop. The base type is double, 8 bytes. If dimension=2048, i.e. 2K, then successive elements of B[dimension*k+j] accessed by the loop will be 2048 * 8 = 16K bytes apart. They will all map to the same set of the L1 cache - they will all have the same index in the cache. Which means that, instead of there being 256 cache lines in the cache available for use there will only be 4 - the "4-way associativity" of the cache.

I.e. you will probably get a cache miss every 4 iterations around this loop. Not good.

(Actually, things are a little more complicated. But the above is a good first understanding. The addresses of entries of B mentioned above is a virtual address. So there might be slightly different physical addresses. Moreover, Bulldozer has a way predictive cache, probably using virtual addresses bits so that it doesn't have to wait for a virtual to physical address translation. But, in any case: your code has a "resonance" of 16K. The L1 data cache has a resonance of 16K. Not good.)]

If you change the dimension just a little bit, e.g. to 2048+1, then the addresses of array B will be spread across all of the sets of the cache. And you will get significantly fewer cache misses.

It is a fairly common optimization to pad your arrays, e.g. to change 2048 to 2049, to avoid this srt of resonance. But "cache blocking is an even more important optimization. http://suif.stanford.edu/papers/lam-asplos91.pdf


In addition to the cache line resonance, there are other things going on here. For example, the L1 cache has 16 banks, each 16 bytes wide. With dimension = 2048, successive B accesses in the inner loop will always go to the same bank. So they can't go in parallel - and if the A access happens to go to the same bank, you will lose.

I don't think, looking at it, that this is as big as the cache resonance.

And, yes, possibly, there may be aliasing going. E.g. the STLF (Store To Load Forwarding buffers) may be comparing only using a small bitfield, and getting false matches.

(Actually, if you think about it, resonance in the cache is like aliasing, related to the use of bitfields. Resonance is caused by multiple cache lines mapping the same set, not being spread arond. Alisaing is caused by matching based on incomplete address bits.)


Overall, my recommendation for tuning:

  1. Try cache blocking without any further analysis. I say this because cache blocking is easy, and it is very likely that this is all you would need to do.

  2. After that, use VTune or OProf. Or Cachegrind. Or ...

  3. Better yet, use a well tuned library routine to do matrix multiply.

I know this is waaaay too old, but I'll take a bite. It's (as it has been said) a cache issue what causes the slowdown at around powers of two. But there is another problem with this: it's too slow. If you look at your compute loop.

for(i = 0; i < dimension; i++)
for(j = 0; j < dimension; j++)
for(k = 0; k < dimension; k++)
C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

The inner-most loop changes k by 1 each iteration, meaning you access just 1 double away from the last element you used of A but a whole 'dimension' doubles away from the last element of B. This doesn't take any advantage of the caching of the elements of B.

If you change this to:

for(i = 0; i < dimension; i++)
for(j = 0; j < dimension; j++)
for(k = 0; k < dimension; k++)
C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];

You get the exact same results (modulo double addition associativity errors), but it's a whole lot more cache-friendly (local). I tried it and it gives substantial improvements. This can be summarized as

Don't multiply matrices by definition, but rather, by rows


Example of the speed-up (I changed your code to take the dimension as an argument)

$ diff a.c b.c
42c42
<               C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
---
>               C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];
$ make a
cc     a.c   -o a
$ make b
cc     b.c   -o b
$ ./a 1024


secs:88.732918
$ ./b 1024


secs:12.116630

As a bonus (and what makes this related to this question) is that this loop doesn't suffer from the previous problem.

If you already knew all of this, then I apologize!