Threads

Review of Processes

To understand threads, it's important to review what a process is.

Processes

Recall the classic "fork" code:
#include <stdio.h>    /* printf       */
#include <unistd.h>   /* fork, getpid */
#include <stdlib.h>   /* exit         */
#include <sys/wait.h> /* wait         */

int main(void)
{
  pid_t pid = fork();

  if (pid == 0) /* child */
  {
    printf(" child pid: %i\n", getpid());

    /* do child stuff */

    exit(0);
  }
  else if (pid > 0) /* parent */
  {
    printf("parent pid: %i\n", getpid());

    /* do parent stuff */

    wait(NULL);
  }
  else
  {
    printf("fork failed\n");
    return 1;
  }

  return 0;
}
_______________________________________________
Output:
parent pid: 32491
 child pid: 32492

Threads Overview

Thread resources

Types of threads

Single CPU/core systems

With a single CPU/core:

Multi-CPU/core

With multiple CPUs/cores:

No parallelism (left diagram) vs. pseudo-parallelism (right diagram), in pseudocode:
//  Pseudocode for left diagram
int main()
{
  Function1(); /* Blocks until done */
  Function2(); /* Blocks until done */
  Function3(); /* Blocks until done */
  Function4(); /* Blocks until done */
}





// Pseudocode for right diagram 
int main()
{
  while (not done)
  {
    Function1(); /* Run for 5 ms */
    Function2(); /* Run for 5 ms */
    Function3(); /* Run for 5 ms */
    Function4(); /* Run for 5 ms */
  }
}
Pseudo-parallelism (detail)

Parallelism

// Pseudocode for above diagram 
int main()
{
  Function1(); /* Returns immediately and runs to completion */
  Function2(); /* Returns immediately and runs to completion */
  Function3(); /* Returns immediately and runs to completion */
  Function4(); /* Returns immediately and runs to completion */

   /* wait for the threads to complete here... */
}
Thread States

There are several states a thread can be in, and they are very similar to process states. The states are mutually exclusive, so a thread is in exactly one of them at any given time:

  1. (Admitted) The thread has been created and is now being put into the ready/runnable queue.
  2. (Dispatched) The OS has selected a thread to run and is executing it on the CPU.
  3. (Timeout) The thread has used up its allotted time slice and is put back in the queue for later execution.
  4. (Need I/O or event to happen) The thread has requested I/O or has requested to wait until a future event.
  5. (I/O done or event occurred) The I/O has completed or event has occurred that the thread was waiting on.
  6. (Ending) The thread has completed its task or it has been terminated.

Thread Libraries

POSIX Pthreads Notes:

A More Real-World Example

The previous examples were a good introduction to how threads work, but they were contrived. There was little to be gained by having a separate thread print those strings to the screen at the same time. In fact, using threads may have been worse because the output appeared chaotic. Let's do something that actually makes sense.

Remember, the goal of multi-threading is to keep the CPU busy. We never want it to be idle if there is work to be done.

In this example, we have 3 different ways to calculate the value of pi: (pi.c) Briefly, these methods work by iteratively calculating more and more precision for pi. The more iterations that are done, the closer the approximation to pi. For reference, we are going to do 200,000,000 iterations with each function, giving us several digits of precision. Also for reference, the time it takes for each function to calculate pi on a few reference machines is: (No optimizations were enabled in the compiler.)
olga (4 cores)
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds

rebecca (4 cores)
  Method          Time
  Circle method   3.4 seconds
  Leibniz method  1.9 seconds
  atan method     1.2 seconds

VM (4 cores)
  Method          Time
  Circle method   2.7 seconds
  Leibniz method  1.4 seconds
  atan method     0.5 seconds

To give you an idea of how these functions work, this is what the atan method looks like:

double atan_pi(int rectangles)
{
  double width = 1.0 / rectangles;
  double length;
  double area = 0.0;
  int i;

  double midpoint = width / 2;
  for (i = 0; i < rectangles; i++)
  {
    double area_of_rectangle;

    length = 4.0 / (1 + midpoint * midpoint);
    area_of_rectangle = length * width;
    area += area_of_rectangle;
    midpoint += width;
  }

  return area;
}
The details are not important. What is important is that this function is going to take a while to perform 200,000,000 iterations. The other two functions are similar in that they are going to loop 200,000,000 times.

Before we learned about processes, threads, and multi-core CPUs, we would have programmed it like this: (pi-nothread.c)

/* To compile under Linux, use -lm linker option */
#include <stdio.h>  /* printf */
#include <stdlib.h> /* atoi   */

double circle_pi(int rectangles);  /* Calculates PI using a quarter circle */
double leibniz_pi(int iterations); /* Calculates PI using a series         */
double atan_pi(int rectangles);    /* Calculates PI using a curve          */

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int method = 0;
  double pi_circle, pi_leibniz, pi_atan;

  if (argc > 1)
    method = atoi(argv[1]);

  if (argc > 2)
    iterations = 1000 * 1000 * atoi(argv[2]);

  switch (method)
  {
    case 1:
      pi_circle = circle_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf(" circle:%20.16f\n", pi_circle);
      break;
    case 2:
      pi_leibniz = leibniz_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf("leibniz:%20.16f\n", pi_leibniz);
      break;
    case 3:
      pi_atan = atan_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf("   atan:%20.16f\n", pi_atan);
      break;
    default:
      pi_circle = circle_pi(iterations);
      pi_leibniz = leibniz_pi(iterations);
      pi_atan = atan_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf(" circle:%20.16f\n", pi_circle);
      printf("leibniz:%20.16f\n", pi_leibniz);
      printf("   atan:%20.16f\n", pi_atan);
      break;
  }

  return 0;
}


Then, running it from the command line and timing it:
time ./pi-nothread 0 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m6.168s
user    0m6.150s
sys     0m0.010s
Reminder of the times for each function (olga):
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds
As expected, since we ran each function serially (one after the other), the total time is the sum of the times of each function. (The time to run the program without calling any pi function was 0.001 s.)

On a single-core CPU, that's the best we can do. Even putting each function in its own process or own thread would likely make the timings worse. Why?

BUT, on a multi-core system, we can certainly improve things. Let's run each function in its own process and hope that the operating system will give each process its own core to run on simultaneously. Now we program it like this: (pi-mpsh.c)

/* To compile under Linux, use -lm linker option */
#include <stdio.h>    /* printf                       */
#include <stdlib.h>   /* exit, atoi                   */
#include <string.h>   /* strcpy                       */
#include <unistd.h>   /* sleep, fork                  */
#include <sys/shm.h>  /* shmget, shmat, shmdt, shmctl */
#include <sys/wait.h> /* waitpid                      */

double circle_pi(int rectangles);  /* Calculates PI using a quarter circle */
double leibniz_pi(int iterations); /* Calculates PI using a series         */
double atan_pi(int rectangles);    /* Calculates PI using a curve          */

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;

  int child1, child2, child3;
  int shmid;
  double *buffer;  /* shared buffer */
  key_t key = 123; /* arbitrary key */

  if (argc > 1)
    iterations = 1000 * 1000 * atoi(argv[1]);

    /* Buffer will hold 3 doubles; one from each child process */
  shmid = shmget(key, 3 * sizeof(double), 0600 | IPC_CREAT);
  buffer = (double *) shmat(shmid, NULL, 0);

    /* Initialize the buffer */
  *buffer = 0.0;
  *(buffer + 1) = 0.0;
  *(buffer + 2) = 0.0;

  if ( (child1 = fork()) == 0 ) /* first child */
  {
    *buffer = circle_pi(iterations);
    shmdt(buffer); /* detach memory from child process */
    exit(0);
  }

    /* parent */
  if ( (child2 = fork()) == 0 ) /* second child */
  {
    *(buffer + 1) = leibniz_pi(iterations);
    shmdt(buffer); /* detach memory from second process */
    exit(0);
  }

    /* parent */
  if ( (child3 = fork()) == 0 ) /* third child */
  {
    *(buffer + 2) = atan_pi(iterations);
    shmdt(buffer); /* detach memory from third child process */
    exit(0);
  }

  /* parent */

  waitpid(child1, NULL, 0);
  waitpid(child2, NULL, 0);
  waitpid(child3, NULL, 0);

  printf("Iterations: %10i\n", iterations);
  printf(" circle:%20.16f\n", *buffer);
  printf("leibniz:%20.16f\n", *(buffer + 1));
  printf("   atan:%20.16f\n", *(buffer + 2));

  shmdt(buffer);              /* detach memory from parent process */
  shmctl(shmid, IPC_RMID, 0); /* delete memory block               */

  return 0;
}
Then, running it from the command line and timing it:
time ./pi-mpsh 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m2.405s
user    0m6.140s
sys     0m0.000s
Reminder:
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds
The single-process version took 6.15 seconds, so this is clearly a significant improvement. Why do you think the time is 2.4 seconds? (It was run on a machine with 4 cores.)

Now, let's run each function in its own thread and see how that goes. We program it like this: (pi-mt.c or pi-mt-int.c)

/* To compile under Linux, use -lm and -lpthread linker options */
#include <stdio.h>   /* printf                  */
#include <stdlib.h>  /* exit, atoi              */
#include <pthread.h> /* thread create/join/exit */

double circle_pi(int rectangles);  /* Find PI using a quarter circle */
double leibniz_pi(int iterations); /* Find PI using a series         */
double atan_pi(int rectangles);    /* Find PI using a curve          */

double pi_circle;
double pi_leibniz;
double pi_atan;

void *th1(void *p)
{
  int it = *(int *)p;
  pi_circle = circle_pi(it);
  return NULL;
}

void *th2(void *p)
{
  int it = *(int *)p;
  pi_leibniz = leibniz_pi(it);
  return NULL;
}

void *th3(void *p)
{
  int it = *(int *)p;
  pi_atan = atan_pi(it);
  return NULL;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;

  pthread_t thread1, thread2, thread3;

  if (argc > 1)
    iterations = 1000 * 1000 * atoi(argv[1]);

    /* Create threads with data passing */
  pthread_create(&thread1, NULL, th1, &iterations);
  pthread_create(&thread2, NULL, th2, &iterations);
  pthread_create(&thread3, NULL, th3, &iterations);

    /* Wait on the threads */
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  pthread_join(thread3, NULL);

  printf("Iterations: %10i\n", iterations);
  printf(" circle:%20.16f\n", pi_circle);
  printf("leibniz:%20.16f\n", pi_leibniz);
  printf("   atan:%20.16f\n", pi_atan);

  return 0;
}
Then, running it from the command line and timing it:
time ./pi-mt 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m2.398s
user    0m6.130s
sys     0m0.000s
Another version uses arrays for scaling. That example also shows how to return data from one thread to another thread via the thread-function argument.

Some interesting benchmarks, gathered on the various computers used for testing in this class:

Below are timings for calculating pi using the methods and systems described above. There are two primary factors that determine the overall performance of the calculations. One is the version of the GNU compiler used, and the other is the architecture of the processor, which includes the speed of the CPU itself, as well as the size, speed, and architecture of the caches. We will learn more about the details of cache memory later. Don't forget Mead's informal rule of thumb: Cache is King

Finally, realize that, although this is not quite a synthetic benchmark (Whetstone for floating point and Dhrystone for integers), it really doesn't represent all of the tasks and programs a system is going to run during an average day, so you can't just look at this one benchmark and conclude which system is best.

Below, each computer lists its individual method timings (secs, via pi-nothread.c), followed by the timings for one process/one thread (pi-nothread.c), multiple processes/one thread each (pi-mpsh.c), and one process/multiple threads (pi-mt.c). All runs use 200,000,000 iterations.

Maya (3.4 GHz) 4 cores, Linux 64-bit (3.2.0-23), gcc 4.6.3
  Individual:    circle: 0.8   Leibniz: 1.7   atan: 0.8   total: 3.3
  pi-nothread.c: real 0m3.299s   user 0m3.288s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.717s   user 0m3.420s   sys 0m0.008s
  pi-mt.c:       real 0m1.713s   user 0m3.680s   sys 0m0.000s

Clara (3.3 GHz) 4 cores, Linux 64-bit (3.2.0-23), gcc 4.6.3
  Individual:    circle: 1.0   Leibniz: 1.7   atan: 1.1   total: 3.8
  pi-nothread.c: real 0m3.730s   user 0m3.724s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.673s   user 0m3.724s   sys 0m0.000s
  pi-mt.c:       real 0m1.699s   user 0m3.764s   sys 0m0.000s

Server (3.1 GHz) 4 cores, Linux 64-bit, gcc 4.6.3
  Individual:    circle: 1.7   Leibniz: 1.0   atan: 1.7   total: 4.4
  pi-nothread.c: real 0m4.324s   user 0m4.304s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.743s   user 0m4.292s   sys 0m0.000s
  pi-mt.c:       real 0m1.704s   user 0m4.252s   sys 0m0.000s

Athena (2.83 GHz) 4 cores, Linux 32-bit (2.6.18-2), gcc 4.1.2
  Individual:    circle: 2.1   Leibniz: 1.6   atan: 1.6   total: 5.3
  pi-nothread.c: real 0m5.200s   user 0m5.196s   sys 0m0.004s
  pi-mpsh.c:     real 0m2.105s   user 0m5.252s   sys 0m0.000s
  pi-mt.c:       real 0m2.100s   user 0m5.256s   sys 0m0.000s

Lulu (2.7 GHz) 6 cores, Linux 64-bit, gcc 4.6.3
  Individual:    circle: 2.0   Leibniz: 2.1   atan: 1.9   total: 6.0
  pi-nothread.c: real 0m6.087s   user 0m6.084s   sys 0m0.000s
  pi-mpsh.c:     real 0m2.224s   user 0m6.288s   sys 0m0.000s
  pi-mt.c:       real 0m2.209s   user 0m6.244s   sys 0m0.000s

Olga (2.8 GHz) 4 cores, Linux 64-bit (2.6.32-24), gcc 4.4.3
  Individual:    circle: 2.2   Leibniz: 2.4   atan: 1.5   total: 6.1
  pi-nothread.c: real 0m6.135s   user 0m6.130s   sys 0m0.000s
  pi-mpsh.c:     real 0m2.405s   user 0m6.150s   sys 0m0.000s
  pi-mt.c:       real 0m2.408s   user 0m6.120s   sys 0m0.000s

Lisa (2.67 GHz) 2 cores, Linux 64-bit (2.6.38), gcc 4.5.2
  Individual:    circle: 2.2   Leibniz: 2.4   atan: 1.5   total: 6.1
  pi-nothread.c: real 0m6.131s   user 0m6.120s   sys 0m0.010s
  pi-mpsh.c:     real 0m3.007s   user 0m8.360s   sys 0m0.000s
  pi-mt.c:       real 0m2.984s   user 0m8.400s   sys 0m0.000s

Gina (3.0 GHz) 2 cores, Linux 64-bit, gcc 4.8.2
  Individual:    circle: 4.8   Leibniz: 2.2   atan: 1.0   total: 8.0
  pi-nothread.c: real 0m7.879s   user 0m7.813s   sys 0m0.008s
  pi-mpsh.c:     real 0m3.011s   user 0m4.026s   sys 0m0.000s
  pi-mt.c:       real 0m4.105s   user 0m5.550s   sys 0m0.021s

Digipen (2.4 GHz) 2 cores, Windows 32-bit (XP), gcc 4.3.2
  Individual:    circle: 6.3   Leibniz: 3.2   atan: 3.2   total: 12.7
  pi-nothread.c: real 0m12.516s   user 0m12.499s   sys 0m0.046s
  pi-mpsh.c:     real 0m7.406s    user 0m12.341s   sys 0m0.015s
  pi-mt.c:       real 0m5.281s    user 0m10.374s   sys 0m0.031s

Jessica (2.8 GHz) 1 core, Windows 32-bit (Server 2003), gcc 4.3.2
  Individual:    circle: 6.2   Leibniz: 3.4   atan: 3.7   total: 13.3
  pi-nothread.c: real 0m6.141s    user 0m5.953s    sys 0m0.015s
  pi-mpsh.c:     real 0m13.547s   user 0m23.046s   sys 0m0.015s
  pi-mt.c:       real 0m14.718s   user 0m26.093s   sys 0m0.015s

Sabrina (2.4 GHz) 1 core, Linux 32-bit (2.4.20), gcc 3.3
  Individual:    circle: 10.7   Leibniz: 3.9   atan: 4.0   total: 18.6
  pi-nothread.c: real 0m18.645s   user 0m18.200s   sys 0m0.000s
  pi-mpsh.c:     real 0m18.610s   user 0m18.180s   sys 0m0.000s
  pi-mt.c:       real 0m19.046s   user 0m18.590s   sys 0m0.000s

Maria (1.66 GHz) 2 cores, Linux 32-bit (2.6.31-16), gcc 4.4.1
  Individual:    circle: 17.7   Leibniz: 12.9   atan: 16.3   total: 46.9
  pi-nothread.c: real 0m46.806s   user 0m46.759s   sys 0m0.048s
  pi-mpsh.c:     real 0m27.072s   user 0m52.855s   sys 0m0.052s
  pi-mt.c:       real 0m27.306s   user 0m52.091s   sys 0m0.024s
Some real stress testing: (Do try this at home! pi-mtv.c)

This program runs many threads (up to 1024, but you can set it higher) and executes the function atan_pi for calculating pi. Under Windows, you can watch the program with Task Manager or Process Explorer and see all of the threads. The command line takes two parameters: The number of threads (default 1) and the number of iterations (times one million) for the function (default 100).
/* To compile under Linux, use -lm and -lpthread linker options */
#include <stdio.h>   /* printf                  */
#include <stdlib.h>  /* atoi                    */
#include <pthread.h> /* thread create/join/exit */

#define MAX_THREADS 1024

double atan_pi(int rectangles); /* Find PI using a curve */

void *th1(void *p)
{
  int it = *(int *)p;
  atan_pi(it);
  return NULL;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int thread_count = 1;
  int i;

  pthread_t threads[MAX_THREADS];

  if (argc > 1)
  {
    thread_count = atoi(argv[1]);
    thread_count = thread_count > MAX_THREADS ? MAX_THREADS : thread_count;
  }

  if (argc > 2)
    iterations = 1000 * 1000 * atoi(argv[2]);

  printf("   Threads: %i\n", thread_count);
  printf("Iterations: %i\n", iterations);

  for (i = 0; i < thread_count; i++)
    pthread_create(&threads[i], NULL, th1, &iterations);

  for (i = 0; i < thread_count; i++)
    pthread_join(threads[i], NULL);

  return 0;
}

Another perspective

The previous examples were all fairly trivial to run in multiple threads. The main reason is that each function implements a complete, stand-alone calculation. They can easily be run independently of each other without requiring any communication between the threads. This is ideal for multi-threading: One function, one thread, independent data. It doesn't get any easier than this.

However, what if we only had one function that calculated the value of pi? How would we take advantage of multiple threads? In this case, it's still pretty easy, because the computation is trivial to parallelize. For example, if we want to calculate the value of pi using 1,000,000 iterations, it will take a certain amount of time. But what if we run two threads in parallel, with one thread running half of the iterations and the other thread running the other half? Some examples, using this data:

#define SIZE 16
int array[SIZE];
int i;
We could set all elements in array using one loop in one thread:
  /* Thread #1  Sets elements 0-15 */
for (i = 0; i < 16; i++)
  array[i] = i;
We could essentially cut the time required in half (minus some thread overhead) if we used two threads:

Thread #1 and Thread #2:
  /* Thread #1  Sets elements 0-7 */
for (i = 0; i < 8; i++)
  array[i] = i;
  /* Thread #2  Sets elements 8-15*/
for (i = 8; i < 16; i++)
  array[i] = i;
Or, better yet, use 4 threads:

Thread #1 and Thread #2:
  /* Thread #1  Sets elements 0-3 */
for (i = 0; i < 4; i++)
  array[i] = i;
  /* Thread #2  Sets elements 4-7 */
for (i = 4; i < 8; i++)
  array[i] = i;

Thread #3 and Thread #4:
  /* Thread #3  Sets elements 8-11 */
for (i = 8; i < 12; i++)
  array[i] = i;
  /* Thread #4  Sets elements 12-15*/
for (i = 12; i < 16; i++)
  array[i] = i;
Or, use 8 or more threads (you get the picture). Yes, with only 16 elements as in the example above, multiple threads are too much overhead. But suppose you need to process millions or billions of elements, which is not uncommon with today's computers. Using threads can have an enormous impact on the processing time.

Aside: There are languages that will automagically divide the work into concurrent threads. One example is FORTRAN setting an array of 2 billion integers (gfortran init.f90):

Single thread:
! This program tests the concurrency
program TestConcurrentLoop

! Type declarations
   integer :: i, x(2000000000)

! Executable statements
   do i = 1, 2000000000
     x(i) = i * 3;
   enddo  

end program TestConcurrentLoop

Time to complete: 7.4 seconds

Concurrent threads:
! This program tests the concurrency
program TestConcurrentLoop

! Type declarations
   integer :: i, x(2000000000)

! Executable statements
   do concurrent (i = 1: 2000000000)
     x(i) = i * 3;
   enddo  

end program TestConcurrentLoop

Time to complete: 5.3 seconds
By specifying the concurrent keyword, the compiler will automagically create multiple loops to be executed simultaneously.

Getting back to the example of calculating digits of pi using 1,000,000 iterations: one thread could calculate the value using the range 0 - 500,000, and the second thread could calculate the value using the range 500,001 - 1,000,000. Then, we just add the two partial results together to get the total value. We could take it further and break the range into 4 intervals:

  1. Thread1: 0 - 250,000 (technically 249,999)
  2. Thread2: 250,000 - 500,000
  3. Thread3: 500,000 - 750,000
  4. Thread4: 750,000 - 1,000,000
Of course, we can take it further and divide it into 8, or 16, or more ranges. Basically, we'd like to have the number of threads match the number of CPUs (cores). This example (pi_atan.c) shows the function atan_pi modified to accept a range (start-end) so it knows which "portion" of the range to calculate. The main thread then sums up all of the results to produce the final value.


A note about processor affinity:

Win32 Threads

Windows has its own API for threads: The threaded stress test using the Win32 API. Use Windows Task Manager or Process Explorer to see the processes and threads. (wpi-mtv.c)


#include <stdio.h>   /* printf        */
#include <stdlib.h>  /* exit, atoi    */
#include <windows.h> /* Windows stuff */

#define MAX_THREADS 1024

double atan_pi(int rectangles); /* Find PI using a curve */

DWORD WINAPI th1(LPVOID p)
{
  int it = *(int *)p;
  atan_pi(it);
  return 0;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int thread_count = 1;
  int i;

  HANDLE threads[MAX_THREADS];

  if (argc > 1)
  {
    thread_count = atoi(argv[1]);
    thread_count = thread_count > MAX_THREADS ? MAX_THREADS : thread_count;
  }

  if (argc > 2)
  {
    iterations = 1000 * 1000 * atoi(argv[2]);
  }

  printf("   Threads: %i\n", thread_count);
  printf("Iterations: %i\n", iterations);

  for (i = 0; i < thread_count; i++)
    threads[i] = CreateThread(0, 0, th1, &iterations, 0, 0);

  WaitForMultipleObjects(thread_count, threads, TRUE, INFINITE);

  for (i = 0; i < thread_count; i++)
    CloseHandle(threads[i]);

  return 0;
}

Multi-threading Problems

So far, everything seems to be working out just fine. However, consider the following two programs:

Single-threaded version: (thread-safe.c)

#include <stdio.h> /* printf */

int Count = 0;

void Increment(void)
{
  int i;
  int loops = 1000;
  for (i = 0; i < loops; i++)
    Count++;
}

int main(void)
{
  #define N 10
  int i;

  for (i = 0; i < N; i++)
    Increment();

  printf("Count = %i\n", Count);

  return 0;
}
Output:
Count = 10000
Multi-threaded version: (thread-unsafe.c)
#include <stdio.h>   /* printf                  */
#include <pthread.h> /* thread create/join/exit */

int Count;

void *Increment(void *p)
{
  int i;
  int loops = 1000;
  for (i = 0; i < loops; i++)
    Count++;

  return NULL;
}

int main(void)
{
  #define N 10
  int i;
  pthread_t thread_id[N];

  for (i = 0; i < N; i++)
    pthread_create(thread_id + i, NULL, Increment, NULL);

  for (i = 0; i < N; i++)
    pthread_join(thread_id[i], NULL);

  printf("Count = %i\n", Count);

  return 0;
}
Output:
Count = 10000

Hey, we found a way to speed up increments! (Not really...) Let's see what happens when we change this line in the Increment function from this:
int loops = 1000;
to this:
int loops = 1000 * 100;
Results:

  Run   Single-threaded     Multi-threaded
  #1    Count = 1000000     Count = 591274
  #2    Count = 1000000     Count = 636767
  #3    Count = 1000000     Count = 591146
  #4    Count = 1000000     Count = 611036
  #5    Count = 1000000     Count = 625463
Even the first version, with only 10,000 total increments (10 threads x 1,000 loops), will fail occasionally if you run it a few times:
for val in {1..10}; do  ./thread-unsafe; done
Output:
Count = 10000
Count = 10000
Count = 10000
Count = 10000
Count = 10000
Count = 9170
Count = 10000
Count = 10000
Count = 8656
Count = 9203

This is the fundamental problem with multi-threaded programs accessing a shared resource. Later, we'll see various ways of dealing with this issue and others when we talk about synchronization.

And the Windows API has the exact same problem:

#include <stdio.h>   /* printf        */
#include <windows.h> /* Windows stuff */

int Count;

DWORD WINAPI Increment(LPVOID p)
{
  int i;
  int loops = 1000 * 100;
  for (i = 0; i < loops; i++)
    Count++;

  return 0;
}

int main(void)
{
  #define N 10
  int i;
  HANDLE thread[N];

  for (i = 0; i < N; i++)
    thread[i] = CreateThread(0, 0, Increment, 0, 0, 0);

  WaitForMultipleObjects(N, thread, TRUE, INFINITE);

  for (i = 0; i < N; i++)
    CloseHandle(thread[i]);

  printf("Count = %i\n", Count);

  return 0;
}
Output from five runs:
Count = 629261
Count = 652001
Count = 628380
Count = 743270
Count = 653191
Other points to make: multi-process vs. multi-thread trade-offs.