How to get 100% CPU usage from a C program

This is quite an interesting question, so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL supercomputer from 1992 running, and we really want to see how fast it can go!

We decided the best way to do this was to write a simple C program that would calculate prime numbers and show how long it took to do so, then run the program on a fast modern desktop PC and compare the results.

We quickly came up with this code to count prime numbers:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;

    while (num <= 1000) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num)
            primes++;

        system("clear");
        printf("%d prime numbers calculated\n", primes);
        num++;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
    return 0;
}

On our dual-core laptop running Ubuntu (the Cray runs UNICOS), this worked perfectly: it reached 100% CPU usage and took about ten minutes or so. When I got home I decided to try it on my hex-core modern gaming PC, and this is where we get our first issues.

I first adapted the code to run on Windows, since that is what the gaming PC was using, but was saddened to find that the process was only getting about 15% of the CPU's power. I figured that must be Windows being Windows, so I booted into a live CD of Ubuntu, thinking Ubuntu would let the process run at its full potential as it had done earlier on my laptop.

However, I only got 5% usage! So my question is: how can I adapt the program to run at 100% CPU utilisation on my gaming machine, in either Windows 7 or live Linux? Another thing that would be great, but not necessary, is if the end product could be a single .exe that can be easily distributed and run on Windows machines.

Thanks a lot!

P.S. Of course this program didn't really work with the Cray's 8 specialist processors, and that is a whole other issue... If you know anything about optimising code to work on 90's Cray supercomputers, give us a shout too!


You're running one process on a multi-core machine - so it only runs on one core.

The solution is easy enough, since you're just trying to peg the processor - if you have N cores, run your program N times (in parallel, of course).

Example

Here is some code that runs your program NUM_OF_CORES times in parallel. It's POSIXy code - it uses fork - so you should run it under Linux. If what I'm reading about the Cray is correct, it might be easier to port this code than the OpenMP code in the other answer.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>
#include <sys/wait.h>

#define NUM_OF_CORES 8
#define MAX_PRIME 100000

void do_primes(void)
{
    unsigned long i, num, primes = 0;
    for (num = 1; num <= MAX_PRIME; ++num) {
        for (i = 2; (i <= num) && (num % i != 0); ++i)
            ;
        if (i == num)
            ++primes;
    }
    printf("Calculated %lu primes.\n", primes);
}

int main(int argc, char **argv)
{
    time_t start, end;
    unsigned long i;
    pid_t pids[NUM_OF_CORES];

    /* start of test */
    start = time(NULL);
    for (i = 0; i < NUM_OF_CORES; ++i) {
        if (!(pids[i] = fork())) {
            do_primes();
            exit(0);
        }
        if (pids[i] < 0) {
            perror("fork");
            exit(1);
        }
    }
    for (i = 0; i < NUM_OF_CORES; ++i) {
        waitpid(pids[i], NULL, 0);
    }
    end = time(NULL);
    printf("This machine calculated all prime numbers under %d %d times "
           "in %ld seconds\n", MAX_PRIME, NUM_OF_CORES, (long)(end - start));
    return 0;
}

Output

$ ./primes
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
This machine calculated all prime numbers under 100000 8 times in 8 seconds

Try to parallelize your program using, e.g., OpenMP. It is a very simple and effective framework for writing parallel programs.

The reason you're getting 15% on a hex core processor is because your code uses 1 core at 100%. 100/6 = 16.67%, which using a moving average with process scheduling (your process would be running under normal priority) could easily be reported as 15%.

Therefore, in order to use 100% CPU, you need to use all the cores of your CPU: launch 6 parallel execution paths for a hex-core CPU, and scale that right up to however many processors your Cray machine has :)

If you want 100% CPU, you need to use more than 1 core. To do that, you need multiple threads.

Here's a parallel version using OpenMP:

I had to increase the limit to 1000000 to make it take more than 1 second on my machine.

#include <stdio.h>
#include <omp.h>

int main() {
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1, primes = 0;

    int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) {
        int i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num)
            primes++;
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n", primes, limit, runTime);

    return 0;
}

Output:

This machine calculated all 78498 prime numbers under 1000000 in 29.753 seconds

And here's your 100% CPU: while this runs, every core sits at full load.

"we really want to see how fast it can go!"

Your algorithm for generating prime numbers is very inefficient. Compare it with primegen, which generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350.

To consume all CPUs easily, you could solve an embarrassingly parallel problem, e.g., compute the Mandelbrot set, or use genetic programming to paint the Mona Lisa in multiple threads (or processes).

Another approach is to take an existing benchmark program for the Cray supercomputer and port it to a modern PC.

For a quick improvement on one core, remove system calls to reduce context-switching. Remove these lines:

system("clear");
printf("%d prime numbers calculated\n",primes);

The first is particularly bad, as system() spawns a new shell process on every single iteration.

Also be very aware how you're loading the CPU. A CPU can do a lot of different tasks, and while many of them will be reported as "loading the CPU 100%" they may each use 100% of different parts of the CPU. In other words, it's very hard to compare two different CPUs for performance, and especially two different CPU architectures. Executing task A may favor one CPU over another, while executing task B it can easily be the other way around (since the two CPUs may have different resources internally and may execute code very differently).

This is the reason software is just as important as hardware in making computers perform optimally. That is indeed very true of "supercomputers" as well.

One measure of CPU performance could be instructions per second, but then again instructions aren't created equal across CPU architectures. Another measure could be cache I/O performance, but cache infrastructure is not equal either. Yet another could be instructions per watt, as power delivery and dissipation are often limiting factors when designing a cluster computer.

So your first question should be: Which performance parameter is important to you? What do you want to measure? If you want to see which machine gets the most FPS out of Quake 4, the answer is easy; your gaming rig will, as the Cray can't run that program at all ;-)

Cheers, Steen

TL;DR: The accepted answer is both inefficient and incompatible with some platforms. The following approach works about 100x faster.

The gcc available on macOS can't compile OpenMP code out of the box; I had to install LLVM (brew install llvm). Even then, I didn't see CPU idle time drop while the OpenMP version was running.


Alternatively, I used basic POSIX threads, which can be compiled with any C compiler, and saw almost the entire CPU used up when the number of threads equals the number of cores (4 on my MacBook Pro, 2.3 GHz Intel Core i5). Here is the program:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 10
#define THREAD_LOAD 100000

struct prime_range {
    int min;
    int max;
    int total;
};

void *findPrime(void *threadarg)
{
    struct prime_range *this_range = (struct prime_range *) threadarg;
    int primes = 0;

    for (int num = this_range->min; num <= this_range->max; num++) {
        if (num < 2)
            continue;                   /* 0 and 1 are not prime */
        int is_prime = 1;
        for (int i = 2; i * i <= num; i++) {
            if (num % i == 0) {
                is_prime = 0;
                break;
            }
        }
        if (is_prime)
            primes++;
    }
    this_range->total = primes;
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    struct timespec start, finish;
    double elapsed;
    pthread_t threads[NUM_THREADS];
    struct prime_range pr[NUM_THREADS];
    int rc;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int t = 0; t < NUM_THREADS; t++) {
        pr[t].min = t * THREAD_LOAD + 1;
        pr[t].max = (t + 1) * THREAD_LOAD;
        rc = pthread_create(&threads[t], NULL, findPrime, (void *) &pr[t]);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(1);
        }
    }

    int totalPrimesFound = 0;
    /* wait for the other threads */
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(threads[t], NULL);
        if (rc) {
            printf("Error: unable to join, %d\n", rc);
            exit(1);
        }
        totalPrimesFound += pr[t].total;
    }

    clock_gettime(CLOCK_MONOTONIC, &finish);
    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("This machine calculated all %d prime numbers under %d in %lf seconds\n",
           totalPrimesFound, NUM_THREADS * THREAD_LOAD, elapsed);
    return 0;
}

Notice how the entire CPU is used up while it runs.

P.S. If you increase the number of threads well beyond the core count (try 20), actual CPU usage goes down, as the system spends more time context-switching than computing.

By the way, my machine is not as beefy as @mystical's (the accepted answer), but this version with basic POSIX threading runs far faster than the OpenMP one. Here is the result:


P.S. Increase THREAD_LOAD to 2.5 million if you want time to watch the CPU usage, as otherwise it completes in under a second.

Simply try zipping and unzipping a big file; heavy I/O plus compression work like that will also load the CPU.

Linux (and other Unix systems) report CPU usage such that a single-threaded process keeping a single CPU core busy is reported as 100%. This is what you see on your Ubuntu system: the program is always ready to run, and actually running on one core or the other.

On a machine with N cores, with all cores busy Linux/Unix show that as N * 100%, or a load average of N. (Load average also includes tasks waiting for I/O, or for a CPU to run on, so it can be higher than the number of CPU cores).

e.g. 800% load on a machine with 8 logical cores, whether that's across 4 physical cores with hyperthreading or 8 separate physical cores.


On a Windows machine, all-cores-busy is reported as 100% load, so a single-threaded program maxing out one core shows as 100% / N. That is highly inconvenient on a many-core machine when you want to see whether a single-threaded program spends any of its time sleeping or waiting for I/O.


However your OS reports it, to max out all the cores at once, you need to either run N processes, or have a process that starts N threads, or some combination of those.