I have a matrix multiply code that looks like this:
for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
Here, the size of the matrices is represented by dimension.
Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size is 2048 it takes 447 seconds. The ratio of the number of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, so if the time scaled with the operation count I would expect roughly 147 * 1.073 ≈ 158 seconds; instead the ratio of the timings is 447/147 ≈ 3. Can someone explain why this happens? I am not trying to write the fastest matrix multiply code; I simply want to understand why this happens.
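For reference, here is the same loop parameterized over the size, so the jump at 2048 is easy to reproduce (a minimal sketch; the size list and the helper names now/multiply are my own choices, not part of the program below):

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* run the same naive triple loop at size n and return the elapsed time */
static double multiply(int n)
{
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);   /* zero-initialized */
    int i, j, k;
    double t;

    for (i = 0; i < n * n; i++) {
        A[i] = rand() / (RAND_MAX + 1.0);
        B[i] = rand() / (RAND_MAX + 1.0);
    }

    t = now();
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[n*i + j] += A[n*i + k] * B[n*k + j];
    t = now() - t;

    free(A); free(B); free(C);
    return t;
}

int main(void)
{
    int sizes[] = { 2000, 2048, 2060 };   /* sizes bracketing 2048 */
    size_t s;

    srand(292);
    for (s = 0; s < sizeof sizes / sizeof sizes[0]; s++)
        printf("n=%d secs=%f\n", sizes[s], multiply(sizes[s]));
    return 0;
}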
Specs: AMD Opteron dual-core node (2.2 GHz), 2 GB RAM, gcc v4.5.0
The program is compiled with gcc -O3 simple.c.
I have also compiled it with Intel's icc compiler and seen similar results.
EDIT:
As suggested in the comments/answers, I ran the code with dimension = 2060, and it takes 145 seconds. (A padded-stride variant of the same experiment is sketched after the listing below.)
Here's the complete program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv;

double timestamp()
{
    double t;
    gettimeofday(&tv, NULL);
    t = tv.tv_sec + (tv.tv_usec/1000000.0);
    return t;
}

int main(int argc, char *argv[])
{
    int i, j, k;
    double *A, *B, *C, start, end;

    A = (double*)malloc(dimension*dimension*sizeof(double));
    B = (double*)malloc(dimension*dimension*sizeof(double));
    C = (double*)malloc(dimension*dimension*sizeof(double));

    /* fill A and B with random values in [0, 1), zero C */
    srand(292);
    for(i = 0; i < dimension; i++)
        for(j = 0; j < dimension; j++)
        {
            A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
            B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
            C[dimension*i+j] = 0.0;
        }

    /* time only the triple loop, not the setup */
    start = timestamp();
    for(i = 0; i < dimension; i++)
        for(j = 0; j < dimension; j++)
            for(k = 0; k < dimension; k++)
                C[dimension*i+j] += A[dimension*i+k] *
                                    B[dimension*k+j];
    end = timestamp();

    printf("\nsecs:%f\n", end-start);

    free(A);
    free(B);
    free(C);
    return 0;
}
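And here is the padded-stride variant mentioned in the edit (a sketch of my own, not something given in the comments): it multiplies 2048x2048 matrices stored with a row stride of 2060, so the operation count is identical to the dimension = 2048 case while the stride is no longer a power of two.

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define N      2048   /* logical matrix size */
#define STRIDE 2060   /* padded leading dimension (my own choice) */

int main(void)
{
    double *A = malloc((size_t)N * STRIDE * sizeof *A);
    double *B = malloc((size_t)N * STRIDE * sizeof *B);
    double *C = calloc((size_t)N * STRIDE, sizeof *C);  /* zeroed */
    struct timeval tv;
    double start, end;
    int i, j, k;

    srand(292);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[STRIDE*i + j] = rand() / (RAND_MAX + 1.0);
            B[STRIDE*i + j] = rand() / (RAND_MAX + 1.0);
        }

    gettimeofday(&tv, NULL);
    start = tv.tv_sec + tv.tv_usec / 1000000.0;

    /* same N*N*N multiplications, but rows are STRIDE doubles apart */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[STRIDE*i + j] += A[STRIDE*i + k] * B[STRIDE*k + j];

    gettimeofday(&tv, NULL);
    end = tv.tv_sec + tv.tv_usec / 1000000.0;

    printf("secs:%f\n", end - start);
    free(A); free(B); free(C);
    return 0;
}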