Threads

Review of Processes

To understand threads, it's important to review what a process is.

Processes

Recall the classic "fork" code:
#include <stdio.h>    /* printf       */
#include <unistd.h>   /* fork, getpid */
#include <stdlib.h>   /* exit         */
#include <sys/wait.h> /* wait         */

int main(void)
{
  pid_t pid = fork();

  if (pid == 0) /* child */
  {
    printf(" child pid: %i\n", getpid());

    /* do child stuff */

    exit(0);
  }
  else if (pid > 0) /* parent */
  {
    printf("parent pid: %i\n", getpid());

    /* do parent stuff */

    wait(NULL);
  }
  else
  {
    printf("fork failed\n");
    return 1;
  }

  return 0;
}
_______________________________________________
Output:
parent pid: 32491
 child pid: 32492

Threads Overview

Thread resources

Types of threads

Single CPU/core systems

With a single CPU/core:

Multi-CPU/core

With multiple CPUs/cores:

No parallelism (left diagram) vs. pseudo-parallelism (right diagram), in pseudocode:
//  Pseudocode for left diagram
int main()
{
  Function1(); /* Blocks until done */
  Function2(); /* Blocks until done */
  Function3(); /* Blocks until done */
  Function4(); /* Blocks until done */
}





// Pseudocode for right diagram 
int main()
{
  while (not done)
  {
    Function1(); /* Run for 5 ms */
    Function2(); /* Run for 5 ms */
    Function3(); /* Run for 5 ms */
    Function4(); /* Run for 5 ms */
  }
}
Pseudo-parallelism (detail)

Parallelism

// Pseudocode for above diagram 
int main()
{
  Function1(); /* Returns immediately and runs to completion */
  Function2(); /* Returns immediately and runs to completion */
  Function3(); /* Returns immediately and runs to completion */
  Function4(); /* Returns immediately and runs to completion */

   /* wait for the threads to complete here... */
}
Thread States

There are several states a thread can be in, and they are very similar to process states. The states are mutually exclusive, so a thread is in exactly one of them at any given time:

  1. (Admitted) The thread has been created and is now being put into the ready/runnable queue.
  2. (Dispatched) The OS has selected a thread to run and is executing it on the CPU.
  3. (Timeout) The thread has used up its allotted time slice and is put back in the queue for later execution.
  4. (Need I/O or event to happen) The thread has requested I/O or has requested to wait until a future event.
  5. (I/O done or event occurred) The I/O has completed or event has occurred that the thread was waiting on.
  6. (Ending) The thread has completed its task or it has been terminated.

Thread Libraries

POSIX Pthreads Notes:

A More Real-World Example

The previous examples were a good introduction to how threads work, but they were contrived. There was little to be gained by having a separate thread print those strings to the screen at the same time. In fact, using threads may have been worse because the output appeared chaotic. Let's do something that actually makes sense.

Remember, the goal of multi-threading is to keep the CPU busy. We never want it to be idle if there is work to be done.

In this example, we have 3 different ways to calculate the value of pi: (pi.c) Briefly, these methods work by iteratively calculating more and more precision for pi. The more iterations that are done, the closer the approximation to pi. For reference, we are going to do 200,000,000 iterations with each function, giving us several digits of precision. Also for reference, the time it takes for each function to calculate pi on a few reference machines is: (No optimizations were enabled in the compiler.)
olga (4 cores)
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds

rebecca (4 cores)
  Method          Time
  Circle method   3.4 seconds
  Leibniz method  1.9 seconds
  atan method     1.2 seconds

VM (4 cores)
  Method          Time
  Circle method   2.7 seconds
  Leibniz method  1.4 seconds
  atan method     0.5 seconds

To give you an idea of how these functions work, this is what the atan method looks like:

double atan_pi(int rectangles)
{
  double width = 1.0 / rectangles;
  double length;
  double area = 0.0;
  int i;

  double midpoint = width / 2;
  for (i = 0; i < rectangles; i++)
  {
    double area_of_rectangle;

    length = 4.0 / (1 + midpoint * midpoint);
    area_of_rectangle = length * width;
    area += area_of_rectangle;
    midpoint += width;
  }

  return area;
}
The details are not important. What is important is that this function is going to take a while to perform 200,000,000 iterations. The other two functions are similar in that they are going to loop 200,000,000 times.

Before we learned about processes, threads, and multi-core CPUs, we would have programmed it like this: (pi-nothread.c)

/* To compile under Linux, use -lm linker option */
#include <stdio.h>  /* printf */
#include <stdlib.h> /* atoi   */

double circle_pi(int rectangles);  /* Calculates PI using a quarter circle */
double leibniz_pi(int iterations); /* Calculates PI using a series         */
double atan_pi(int rectangles);    /* Calculates PI using a curve          */

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int method = 0;
  double pi_circle, pi_leibniz, pi_atan;

  if (argc > 1)
    method = atoi(argv[1]);

  if (argc > 2)
    iterations = 1000 * 1000 * atoi(argv[2]);

  switch (method)
  {
    case 1:
      pi_circle = circle_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf(" circle:%20.16f\n", pi_circle);
      break;
    case 2:
      pi_leibniz = leibniz_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf("leibniz:%20.16f\n", pi_leibniz);
      break;
    case 3:
      pi_atan = atan_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf("   atan:%20.16f\n", pi_atan);
      break;
    default:
      pi_circle = circle_pi(iterations);
      pi_leibniz = leibniz_pi(iterations);
      pi_atan = atan_pi(iterations);
      printf("Iterations: %10i\n", iterations);
      printf(" circle:%20.16f\n", pi_circle);
      printf("leibniz:%20.16f\n", pi_leibniz);
      printf("   atan:%20.16f\n", pi_atan);
      break;
  }

  return 0;
}


Then, running it from the command line and timing it:
time ./pi-nothread 0 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m6.168s
user    0m6.150s
sys     0m0.010s
Reminder of the times for each function (olga):
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds
As expected, since we ran each function serially (one after the other), the total time is the sum of the times of each function. (The time to run the program without calling any pi function was 0.001 s.)

On a single-core CPU, that's the best we can do. Even putting each function in its own process or own thread would likely make the timings worse. Why?

BUT, on a multi-core system, we can certainly improve things. Let's run each function in its own process and hope that the operating system will give each process its own core to run on simultaneously. Now we program it like this: (pi-mpsh.c)

/* To compile under Linux, use -lm linker option */
#include <stdio.h>    /* printf                       */
#include <stdlib.h>   /* exit, atoi                   */
#include <string.h>   /* strcpy                       */
#include <unistd.h>   /* sleep, fork                  */
#include <sys/shm.h>  /* shmget, shmat, shmdt, shmctl */
#include <sys/wait.h> /* waitpid                      */

double circle_pi(int rectangles);  /* Calculates PI using a quarter circle */
double leibniz_pi(int iterations); /* Calculates PI using a series         */
double atan_pi(int rectangles);    /* Calculates PI using a curve          */

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;

  int child1, child2, child3;
  int shmid;
  double *buffer;  /* shared buffer */
  key_t key = 123; /* arbitrary key */

  if (argc > 1)
    iterations = 1000 * 1000 * atoi(argv[1]);

    /* Buffer will hold 3 doubles; one from each child process */
  shmid = shmget(key, 3 * sizeof(double), 0600 | IPC_CREAT);
  buffer = (double *) shmat(shmid, NULL, 0);

    /* Initialize the buffer */
  *buffer = 0.0;
  *(buffer + 1) = 0.0;
  *(buffer + 2) = 0.0;

  if ( (child1 = fork()) == 0 ) /* first child */
  {
    *buffer = circle_pi(iterations);
    shmdt(buffer); /* detach memory from child process */
    exit(0);
  }

    /* parent */
  if ( (child2 = fork()) == 0 ) /* second child */
  {
    *(buffer + 1) = leibniz_pi(iterations);
    shmdt(buffer); /* detach memory from second process */
    exit(0);
  }

    /* parent */
  if ( (child3 = fork()) == 0 ) /* third child */
  {
    *(buffer + 2) = atan_pi(iterations);
    shmdt(buffer); /* detach memory from third child process */
    exit(0);
  }

  /* parent */

  waitpid(child1, NULL, 0);
  waitpid(child2, NULL, 0);
  waitpid(child3, NULL, 0);

  printf("Iterations: %10i\n", iterations);
  printf(" circle:%20.16f\n", *buffer);
  printf("leibniz:%20.16f\n", *(buffer + 1));
  printf("   atan:%20.16f\n", *(buffer + 2));

  shmdt(buffer);              /* detach memory from parent process */
  shmctl(shmid, IPC_RMID, 0); /* delete memory block               */

  return 0;
}
Then, running it from the command line and timing it:
time ./pi-mpsh 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m2.405s
user    0m6.140s
sys     0m0.000s
Reminder:
  Method          Time
  Circle method   2.2 seconds
  Leibniz method  2.4 seconds
  atan method     1.5 seconds
The single-process version took 6.15 seconds, so this is clearly a significant improvement. Why do you think the time is 2.4 seconds? (It was run on a machine with 4 cores.)

Now, let's run each function in its own thread and see how that goes. We program it like this: (pi-mt.c or pi-mt-int.c)

/* To compile under Linux, use -lm and -lpthread linker options */
#include <stdio.h>   /* printf                  */
#include <stdlib.h>  /* exit, atoi              */
#include <pthread.h> /* thread create/join/exit */

double circle_pi(int rectangles);  /* Find PI using a quarter circle */
double leibniz_pi(int iterations); /* Find PI using a series         */
double atan_pi(int rectangles);    /* Find PI using a curve          */

double pi_circle;
double pi_leibniz;
double pi_atan;

void *th1(void *p)
{
  int it = *(int *)p;
  pi_circle = circle_pi(it);
  return NULL;
}

void *th2(void *p)
{
  int it = *(int *)p;
  pi_leibniz = leibniz_pi(it);
  return NULL;
}

void *th3(void *p)
{
  int it = *(int *)p;
  pi_atan = atan_pi(it);
  return NULL;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;

  pthread_t thread1, thread2, thread3;

  if (argc > 1)
    iterations = 1000 * 1000 * atoi(argv[1]);

    /* Create threads with data passing */
  pthread_create(&thread1, NULL, th1, &iterations);
  pthread_create(&thread2, NULL, th2, &iterations);
  pthread_create(&thread3, NULL, th3, &iterations);

    /* Wait on the threads */
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  pthread_join(thread3, NULL);

  printf("Iterations: %10i\n", iterations);
  printf(" circle:%20.16f\n", pi_circle);
  printf("leibniz:%20.16f\n", pi_leibniz);
  printf("   atan:%20.16f\n", pi_atan);

  return 0;
}
Then, running it from the command line and timing it:
time ./pi-mt 200
The output:
Iterations:  200000000
 circle:  3.1415926568498080
leibniz:  3.1415926485894077
   atan:  3.1415926536631549

real    0m2.398s
user    0m6.130s
sys     0m0.000s
Another version uses arrays for scaling. That example also shows how to return data from one thread to another thread via the thread-function argument.

Some interesting benchmarks, gathered on the various computers used for testing in this class:

Below are timings for calculating pi using the methods and systems described above. There are two primary factors that determine the overall performance of the calculations. One is the version of the GNU compiler used, and the other is the architecture of the processor, which includes the speed of the CPU itself, as well as the size, speed, and architecture of the caches. We will learn more about the details of cache memory later. Don't forget Mead's informal rule of thumb: Cache is King

Finally, realize that, although this is not quite a synthetic benchmark (Whetstone for floating point and Dhrystone for integers), it really doesn't represent all of the tasks and programs a system is going to run during an average day, so you can't just look at this one benchmark and conclude which system is best.

Below, each computer lists its individual method timings (secs, via pi-nothread.c), followed by the timings for one process/one thread (pi-nothread.c), multiple processes/one thread each (pi-mpsh.c), and one process/multiple threads (pi-mt.c). All runs use 200,000,000 iterations.

Maya (3.4 GHz) 4 cores, Linux 64-bit (3.2.0-23), gcc 4.6.3
  Individual:    circle: 0.8   Leibniz: 1.7   atan: 0.8   total: 3.3
  pi-nothread.c: real 0m3.299s   user 0m3.288s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.717s   user 0m3.420s   sys 0m0.008s
  pi-mt.c:       real 0m1.713s   user 0m3.680s   sys 0m0.000s

Clara (3.3 GHz) 4 cores, Linux 64-bit (3.2.0-23), gcc 4.6.3
  Individual:    circle: 1.0   Leibniz: 1.7   atan: 1.1   total: 3.8
  pi-nothread.c: real 0m3.730s   user 0m3.724s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.673s   user 0m3.724s   sys 0m0.000s
  pi-mt.c:       real 0m1.699s   user 0m3.764s   sys 0m0.000s

Server (3.1 GHz) 4 cores, Linux 64-bit, gcc 4.6.3
  Individual:    circle: 1.7   Leibniz: 1.0   atan: 1.7   total: 4.4
  pi-nothread.c: real 0m4.324s   user 0m4.304s   sys 0m0.000s
  pi-mpsh.c:     real 0m1.743s   user 0m4.292s   sys 0m0.000s
  pi-mt.c:       real 0m1.704s   user 0m4.252s   sys 0m0.000s

Athena (2.83 GHz) 4 cores, Linux 32-bit (2.6.18-2), gcc 4.1.2
  Individual:    circle: 2.1   Leibniz: 1.6   atan: 1.6   total: 5.3
  pi-nothread.c: real 0m5.200s   user 0m5.196s   sys 0m0.004s
  pi-mpsh.c:     real 0m2.105s   user 0m5.252s   sys 0m0.000s
  pi-mt.c:       real 0m2.100s   user 0m5.256s   sys 0m0.000s

Lulu (2.7 GHz) 6 cores, Linux 64-bit, gcc 4.6.3
  Individual:    circle: 2.0   Leibniz: 2.1   atan: 1.9   total: 6.0
  pi-nothread.c: real 0m6.087s   user 0m6.084s   sys 0m0.000s
  pi-mpsh.c:     real 0m2.224s   user 0m6.288s   sys 0m0.000s
  pi-mt.c:       real 0m2.209s   user 0m6.244s   sys 0m0.000s

Olga (2.8 GHz) 4 cores, Linux 64-bit (2.6.32-24), gcc 4.4.3
  Individual:    circle: 2.2   Leibniz: 2.4   atan: 1.5   total: 6.1
  pi-nothread.c: real 0m6.135s   user 0m6.130s   sys 0m0.000s
  pi-mpsh.c:     real 0m2.405s   user 0m6.150s   sys 0m0.000s
  pi-mt.c:       real 0m2.408s   user 0m6.120s   sys 0m0.000s

Lisa (2.67 GHz) 2 cores, Linux 64-bit (2.6.38), gcc 4.5.2
  Individual:    circle: 2.2   Leibniz: 2.4   atan: 1.5   total: 6.1
  pi-nothread.c: real 0m6.131s   user 0m6.120s   sys 0m0.010s
  pi-mpsh.c:     real 0m3.007s   user 0m8.360s   sys 0m0.000s
  pi-mt.c:       real 0m2.984s   user 0m8.400s   sys 0m0.000s

Gina (3.0 GHz) 2 cores, Linux 64-bit, gcc 4.8.2
  Individual:    circle: 4.8   Leibniz: 2.2   atan: 1.0   total: 8.0
  pi-nothread.c: real 0m7.879s   user 0m7.813s   sys 0m0.008s
  pi-mpsh.c:     real 0m3.011s   user 0m4.026s   sys 0m0.000s
  pi-mt.c:       real 0m4.105s   user 0m5.550s   sys 0m0.021s

Digipen (2.4 GHz) 2 cores, Windows 32-bit (XP), gcc 4.3.2
  Individual:    circle: 6.3   Leibniz: 3.2   atan: 3.2   total: 12.7
  pi-nothread.c: real 0m12.516s   user 0m12.499s   sys 0m0.046s
  pi-mpsh.c:     real 0m7.406s    user 0m12.341s   sys 0m0.015s
  pi-mt.c:       real 0m5.281s    user 0m10.374s   sys 0m0.031s

Jessica (2.8 GHz) 1 core, Windows 32-bit (Server 2003), gcc 4.3.2
  Individual:    circle: 6.2   Leibniz: 3.4   atan: 3.7   total: 13.3
  pi-nothread.c: real 0m6.141s    user 0m5.953s    sys 0m0.015s
  pi-mpsh.c:     real 0m13.547s   user 0m23.046s   sys 0m0.015s
  pi-mt.c:       real 0m14.718s   user 0m26.093s   sys 0m0.015s

Sabrina (2.4 GHz) 1 core, Linux 32-bit (2.4.20), gcc 3.3
  Individual:    circle: 10.7   Leibniz: 3.9   atan: 4.0   total: 18.6
  pi-nothread.c: real 0m18.645s   user 0m18.200s   sys 0m0.000s
  pi-mpsh.c:     real 0m18.610s   user 0m18.180s   sys 0m0.000s
  pi-mt.c:       real 0m19.046s   user 0m18.590s   sys 0m0.000s

Maria (1.66 GHz) 2 cores, Linux 32-bit (2.6.31-16), gcc 4.4.1
  Individual:    circle: 17.7   Leibniz: 12.9   atan: 16.3   total: 46.9
  pi-nothread.c: real 0m46.806s   user 0m46.759s   sys 0m0.048s
  pi-mpsh.c:     real 0m27.072s   user 0m52.855s   sys 0m0.052s
  pi-mt.c:       real 0m27.306s   user 0m52.091s   sys 0m0.024s
Some real stress testing: (Do try this at home! pi-mtv.c)

This program runs many threads (up to 1024, but you can set it higher) and executes the function atan_pi for calculating pi. Under Windows, you can watch the program with Task Manager or Process Explorer and see all of the threads. The command line takes two parameters: The number of threads (default 1) and the number of iterations (times one million) for the function (default 100).
/* To compile under Linux, use -lm and -lpthread linker options */
#include <stdio.h>   /* printf                  */
#include <stdlib.h>  /* atoi                    */
#include <pthread.h> /* thread create/join/exit */

#define MAX_THREADS 1024

double atan_pi(int rectangles); /* Find PI using a curve */

void *th1(void *p)
{
  int it = *(int *)p;
  atan_pi(it);
  return NULL;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int thread_count = 1;
  int i;

  pthread_t threads[MAX_THREADS];

  if (argc > 1)
  {
    thread_count = atoi(argv[1]);
    thread_count = thread_count > MAX_THREADS ? MAX_THREADS : thread_count;
  }

  if (argc > 2)
    iterations = 1000 * 1000 * atoi(argv[2]);

  printf("   Threads: %i\n", thread_count);
  printf("Iterations: %i\n", iterations);

  for (i = 0; i < thread_count; i++)
    pthread_create(&threads[i], NULL, th1, &iterations);

  for (i = 0; i < thread_count; i++)
    pthread_join(threads[i], NULL);

  return 0;
}

Another perspective

The previous examples were all fairly trivial to run in multiple threads. The main reason is that each function implements a complete, stand-alone calculation. They can easily be run independently of each other without requiring any communication between the threads. This is ideal for multi-threading: One function, one thread, independent data. It doesn't get any easier than this.

However, what if we only had one function that calculated the value of pi? How would we take advantage of multiple threads? In this case, it's still pretty easy, because the computation is trivial to parallelize. For example, if we want to calculate the value of pi using 1,000,000 iterations, it will take a certain amount of time. But what if we run two threads in parallel, with one thread running half of the iterations and the other thread running the other half? Some examples, using this data:

#define SIZE 16
int array[SIZE];
int i;
We could set all elements in array using one loop in one thread:
  /* Thread #1  Sets elements 0-15 */
for (i = 0; i < 16; i++)
  array[i] = i;
We could essentially cut the time required in half (minus some thread overhead) if we used two threads:

Thread #1 and Thread #2:
  /* Thread #1  Sets elements 0-7 */
for (i = 0; i < 8; i++)
  array[i] = i;
  /* Thread #2  Sets elements 8-15*/
for (i = 8; i < 16; i++)
  array[i] = i;
Or, better yet, use 4 threads:

Thread #1 and Thread #2:
  /* Thread #1  Sets elements 0-3 */
for (i = 0; i < 4; i++)
  array[i] = i;
  /* Thread #2  Sets elements 4-7 */
for (i = 4; i < 8; i++)
  array[i] = i;

Thread #3 and Thread #4:
  /* Thread #3  Sets elements 8-11 */
for (i = 8; i < 12; i++)
  array[i] = i;
  /* Thread #4  Sets elements 12-15*/
for (i = 12; i < 16; i++)
  array[i] = i;
Or, use 8 or more threads (you get the picture). Yes, with only 16 elements as in the example above, multiple threads are too much overhead. But suppose you need to process millions or billions of elements, which is not uncommon with today's computers. Using threads can have an enormous impact on the processing time.

Aside: There are languages that will automagically divide the work into concurrent threads. One example is FORTRAN setting an array of 2 billion integers (gfortran init.f90):

Single thread:
! This program tests the concurrency
program TestConcurrentLoop

! Type declarations
   integer :: i, x(2000000000)

! Executable statements
   do i = 1, 2000000000
     x(i) = i * 3;
   enddo  

end program TestConcurrentLoop

Time to complete: 7.4 seconds

Concurrent threads:
! This program tests the concurrency
program TestConcurrentLoop

! Type declarations
   integer :: i, x(2000000000)

! Executable statements
   do concurrent (i = 1: 2000000000)
     x(i) = i * 3;
   enddo  

end program TestConcurrentLoop

Time to complete: 5.3 seconds
By specifying the concurrent keyword, the compiler will automagically create multiple loops to be executed simultaneously.

Getting back to the example of calculating digits of pi using 1,000,000 iterations: one thread could calculate the value using the range 0 - 500,000, and the second thread could calculate the value using the range 500,001 - 1,000,000. Then, we just add the two partial results together to get the total value. We could take it further and break the range into 4 intervals:

  1. Thread1: 0 - 250,000 (technically 249,999)
  2. Thread2: 250,000 - 500,000
  3. Thread3: 500,000 - 750,000
  4. Thread4: 750,000 - 1,000,000
Of course, we can take it further and divide it into 8, or 16, or more ranges. Basically, we'd like to have the number of threads match the number of CPUs (cores). This example (pi_atan.c) shows the function atan_pi modified to accept a range (start-end) so it knows which "portion" of the range to calculate. The main thread then sums up all of the results to produce the final value.


A note about processor affinity:

Win32 Threads

Windows has its own API for threads: The threaded stress test using the Win32 API. Use Windows Task Manager or Process Explorer to see the processes and threads. (wpi-mtv.c)


#include <stdio.h>   /* printf        */
#include <stdlib.h>  /* exit, atoi    */
#include <windows.h> /* Windows stuff */

#define MAX_THREADS 1024

double atan_pi(int rectangles); /* Find PI using a curve */

DWORD WINAPI th1(LPVOID p)
{
  int it = *(int *)p;
  atan_pi(it);
  return 0;
}

int main(int argc, char **argv)
{
  int iterations = 1000 * 1000 * 100;
  int thread_count = 1;
  int i;

  HANDLE threads[MAX_THREADS];

  if (argc > 1)
  {
    thread_count = atoi(argv[1]);
    thread_count = thread_count > MAX_THREADS ? MAX_THREADS : thread_count;
  }

  if (argc > 2)
  {
    iterations = 1000 * 1000 * atoi(argv[2]);
  }

  printf("   Threads: %i\n", thread_count);
  printf("Iterations: %i\n", iterations);

  for (i = 0; i < thread_count; i++)
    threads[i] = CreateThread(0, 0, th1, &iterations, 0, 0);

  WaitForMultipleObjects(thread_count, threads, TRUE, INFINITE);

  for (i = 0; i < thread_count; i++)
    CloseHandle(threads[i]);

  return 0;
}

Multi-threading Problems

So far, everything seems to be working out just fine. However, consider the following two programs:

Single-threaded version: (thread-safe.c)

#include <stdio.h> /* printf */

int Count = 0;

void Increment(void)
{
  int i;
  int loops = 1000;
  for (i = 0; i < loops; i++)
    Count++;
}

int main(void)
{
  #define N 10
  int i;

  for (i = 0; i < N; i++)
    Increment();

  printf("Count = %i\n", Count);

  return 0;
}
Output:
Count = 10000
Multi-threaded version: (thread-unsafe.c)
#include <stdio.h>   /* printf                  */
#include <pthread.h> /* thread create/join/exit */

int Count;

void *Increment(void *p)
{
  int i;
  int loops = 1000;
  for (i = 0; i < loops; i++)
    Count++;

  return NULL;
}

int main(void)
{
  #define N 10
  int i;
  pthread_t thread_id[N];

  for (i = 0; i < N; i++)
    pthread_create(thread_id + i, NULL, Increment, NULL);

  for (i = 0; i < N; i++)
    pthread_join(thread_id[i], NULL);

  printf("Count = %i\n", Count);

  return 0;
}
Output:
Count = 10000

Hey, we found a way to speed up increments! (Not really...) Let's see what happens when we change this line in the Increment function from this:
int loops = 1000;
to this:
int loops = 1000 * 100;
Results:

  Run   Single-threaded     Multi-threaded
  #1    Count = 1000000     Count = 591274
  #2    Count = 1000000     Count = 636767
  #3    Count = 1000000     Count = 591146
  #4    Count = 1000000     Count = 611036
  #5    Count = 1000000     Count = 625463
Even the first version, with only 10,000 total increments (10 threads x 1,000 loops), will fail occasionally if you run it a few times:
for val in {1..10}; do  ./thread-unsafe; done
Output:
Count = 10000
Count = 10000
Count = 10000
Count = 10000
Count = 10000
Count = 9170
Count = 10000
Count = 10000
Count = 8656
Count = 9203

This is the fundamental problem with multi-threaded programs accessing a shared resource. Later, we'll see various ways of dealing with this issue and others when we talk about synchronization.

And the Windows API has the exact same problem:

#include <stdio.h>   /* printf        */
#include <windows.h> /* Windows stuff */

int Count;

DWORD WINAPI Increment(LPVOID p)
{
  int i;
  int loops = 1000 * 100;
  for (i = 0; i < loops; i++)
    Count++;

  return 0;
}

int main(void)
{
  #define N 10
  int i;
  HANDLE thread[N];

  for (i = 0; i < N; i++)
    thread[i] = CreateThread(0, 0, Increment, 0, 0, 0);

  WaitForMultipleObjects(N, thread, TRUE, INFINITE);

  for (i = 0; i < N; i++)
    CloseHandle(thread[i]);

  printf("Count = %i\n", Count);

  return 0;
}
Output from five runs:
Count = 629261
Count = 652001
Count = 628380
Count = 743270
Count = 653191
Other points to make: multi-process vs. multi-thread trade-offs.