Synchronization

Overview

Recall this program from the "Threads" lectures:

Single-threaded version:

#include <stdio.h> /* printf */

int Count = 0;

void Increment(void) 
{
  int i;
  int loops = 1000 * 100;
  for (i = 0; i < loops; i++) 
    Count++;
}

int main(void) 
{
  #define N 10
  int i;
 
  for (i = 0; i < N; i++)
    Increment();
    
  printf("Count = %i\n", Count);
  
  return 0;
}
Output:
Count = 1,000,000 
Multi-threaded version:

#include <stdio.h>   /* printf                  */
#include <pthread.h> /* thread create/join/exit */

int Count;

void *Increment(void *p) 
{
  int i;
  int loops = 1000 * 100;
  for (i = 0; i < loops; i++) 
    Count++;
    
  return 0;
}

int main(void) 
{
  #define N 10
  int i;
  pthread_t thread_id[N];
  
  for (i = 0; i < N; i++)
    pthread_create(thread_id + i, 0, Increment, 0);
    
  for (i = 0; i < N; i++)
    pthread_join(thread_id[i], 0);
    
  printf("Count = %i\n", Count);
  
  return 0;
}
Comparing Results:
 Single-threadedMulti-threaded
Run #1
Count = 1,000,000
Count = 591,274
Run #2
Count = 1,000,000
Count = 636,767
Run #3
Count = 1,000,000
Count = 591,146
Run #4
Count = 1,000,000
Count = 611,036
Run #5
Count = 1,000,000
Count = 625,463

Summarizing:

Critical Sections

What we need to do is to ensure only one thread can access the critical section at any time. We will use some kind of locking mechanism.

Here is the test program (with the black-box functions for locking and unlocking):

#include <stdio.h>   /* printf                  */
#include <pthread.h> /* thread create/join/exit */

int Count = 0;

void *Increment(void *p)
{
  int i, id = *(int *)p;

  for (i = 0; i < 1000 * 1000 * 100; i++)
  {
    Lock0(id);   /* Acquire the critical section */
      Count++;   /* This is the critical section */
    Unlock0(id); /* Release the critical section */
  }

  return NULL;
}

int main(void)
{
  pthread_t thread_id[2];
  int tid0 = 0, tid1 = 1;
  
  pthread_create(&thread_id[0], 0, Increment, &tid0);
  pthread_create(&thread_id[1], 0, Increment, &tid1);
    
  pthread_join(thread_id[0], 0);
  pthread_join(thread_id[1], 0);
  
  printf("Count = %i\n", Count);
  
  return 0;
}
For a locking solution to be accepted, it must provide three properties:
  1. Mutual exclusion is provided:
  2. Progress is made:
  3. The waiting is bounded (not unlimited):
Also, realize that critical sections of code are generally very short (code and runtime). This is simply because all other threads have to wait for the one thread that is in the critical section. If critical sections are long, the other threads are essentially blocked (doing nothing) until the critical section completes, turning an efficient multi-threaded program into an inefficient single-threaded program.

BadGood
Lock0(id);   /* Acquire the critical section */
for (i = 0; i < 1000 * 1000 * 100; i++)
{
  Count++;   /* This is the critical section */
}
Unlock0(id); /* Release the critical section */
for (i = 0; i < 1000 * 1000 * 100; i++)
{
  Lock0(id);   /* Acquire the critical section */
  Count++;     /* This is the critical section */
  Unlock0(id); /* Release the critical section */
}

And these are a few different Lock and Unlock functions to look at:

  1. Reference functions (do nothing). The variable me will be the thread ID, either 0 or 1.
  2. void Lock0(int me)
    {
    }
    
    void Unlock0(int me)
    {
    }
    
    And the output:
    Count = 117,004,575
    
    real    0m1.061s
    user    0m2.060s
    sys     0m0.000s
    
    Q. Why is this solution unacceptable?

    A. Obviously, no mutual exclusion. Both threads can be modifying Count.


  3. Using a lock variable:
  4. int lock = 0; /* global lock variable */
    
    void Lock1(int me)
    {
      while (lock == 1) /* spin wait (busy waiting) */
        continue;
                        /* both threads can be here */
      lock = 1;
    }
    
    void Unlock1(int me)
    {
      lock = 0;
    }
    
    And the output:
    Count = 164,908,686
    
    real    0m1.602s
    user    0m3.130s
    sys     0m0.000s
    
    Q. Why is this solution unacceptable?

    A. Again, no mutual exclusion on lock so both threads can see a value of 0 and skip the while loop.


  5. Alternating turns:
  6. int turn = 0;
    
    void Lock2(int me)
    {
      while (turn != me) /* busy wait */
        continue;
    }
    
    void Unlock2(int me)
    {
      turn = 1 - me; /* other thread */
    }
    
    And the output:
    Count = 200,000,000
    
    real    0m7.523s
    user    0m14.980s
    sys     0m0.010s
    
    Q. Why is this solution unacceptable? (The output is correct and is always correct with this example.)

    A. It relies on the fact that the second thread will take a turn and then set the flag back to the first thread. What if the second thread is doing something else (or has terminated)? The first thread waits and waits until the (unready) second thread does something else. The second thread may never want the lock again. It's a violation of property #2 above: No progress is made: A thread wants to enter the critical section, but can't.


  7. Yield to the other thread:
  8. int ready[2] = {0, 0};
    
    void Lock3(int me)
    {
      int you = 1 - me;
      ready[me] = 1;
    
      /* both threads can be here */
    
      while (ready[you]) /* busy wait */
        continue;
    }
    
    void Unlock3(int me)
    {
      ready[me] = 0;
    }
    
    And the output:
    No output, it hangs. (after about 1000 iterations)
    
    Q. Why is this solution unacceptable?

    A. It's easy to tell from the output that no progress is made. Both threads are able to set their ready flag to true causing an infinite while loop.

    In thread 0, you is 1 and ready[0] is 1.
    In thread 1, you is 0 and ready[1] is 1.

    The while loop hangs in both threads.


  9. Combining alternating and yielding: (Alternating almost worked)
  10. int ready[2] = {0, 0};
    int turn = 0;
    
    void Lock4(int me)
    {
      int you = 1 - me;  /* #1 */
      turn = you;        /* #2 */
      ready[me] = 1;     /* #3 */   
                         /* #4 */
      while (ready[you] && (turn == you))
        continue;   /* only one thread can be here */
    }
    
    void Unlock4(int me)
    {
      ready[me] = 0;
    }
    
    Count = 199,999,192
    
    real    0m18.366s
    user    0m36.380s
    sys     0m0.040s
    
    Notes:


  11. Using a sychronization primitive: __sync_synchronize()__sync_synchronize()
  12. void Lock5(int me)
    {
      int you = 1 - me;
      turn = you;
      ready[me] = 1;
      __sync_synchronize();
      while (ready[you] && (turn == you))
        continue;
    }
    
    void Unlock5(int me)
    {
      ready[me] = 0;
    }
    
    And the output:
    Count = 200,000,000
    
    real    0m20.981s
    user    0m41.660s
    sys     0m0.020s
    
    According to the GCC documentation:

    In most cases, these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.

    And the x86 specification says something like this:
    Loads and stores are not reordered with locked instructions.
    A closer look:

    The line in blue is the only difference between the two functions. (32-bit asm)

    Peterson's Solution (no sync)Peterson's Solution (w/sync)
    Lock4:
    .LFB8:
      pushl %ebp
      movl  %esp, %ebp
      subl  $16, %esp
      movl  $1, %eax            #
      subl  me, %eax            #
      movl  %eax, you           # you = 1 - me
      movl  you, %eax           #
      movl  %eax, turn          # turn = you
      movl  me, %eax            #
      movl  $1, ready(,%eax,4)  # ready[me] = 1
    
      nop
      jmp .L19
    .L22:
      nop
    .L19:
      movl  you, %eax
      movl  ready(,%eax,4), %eax
      testl %eax, %eax
      je  .L18
      movl  turn, %eax
      cmpl  you, %eax
      je  .L22
    .L18:
      leave
      ret
    
    Lock5:
    .LFB10:
      pushl %ebp
      movl  %esp, %ebp
      subl  $16, %esp
      movl  $1, %eax            #
      subl  me, %eax            #
      movl  %eax, you           # you = 1 - me
      movl  you, %eax           #
      movl  %eax, turn          # turn = you
      movl  me, %eax            #
      movl  $1, ready(,%eax,4)  # ready[me] = 1
      lock orl  $0, (%esp)
      nop
      jmp .L25
    .L28:
      nop
    .L25:
      movl  you, %eax
      movl  ready(,%eax,4), %eax
      testl %eax, %eax
      je  .L24
      movl  turn, %eax
      cmpl  you, %eax
      je  .L28
    .L24:
      leave
      ret
    
    Time comparison:

    It is interesting to note that the code without the synchronization appears to run about 50% slower. (This was under Linux in a virtual machine.)

    Peterson's Solution (no sync)Peterson's Solution (with sync)
    Count = 19,999,997     Count = 19,999,997
    
    real  0m3.455s       real  0m3.474s
    user  0m6.308s       user  0m6.344s
    sys   0m0.032s       sys   0m0.044s
    
    Count = 20,000,000     Count = 20,000,000
    
    real  0m2.179s       real  0m2.228s
    user  0m3.936s       user  0m3.960s
    sys   0m0.028s       sys   0m0.032s
    
    This solution works, but not all systems used to implement the synchronization. (Most modern systems/compilers today will implement this or something similar. so it's unlikely to be a problem these days.)

Other examples:
Running Lock4 (Peterson's solution) on my old Digipen computer (2 cores running Windows XP with Cygwin):
Count = 199,999,192
real 24.84
user 48.87
sys 0.01
Count = 199,999,998
real 24.64
user 48.97
sys 0.00
Count = 200,000,000
real 24.74
user 49.07
sys 0.00
Count = 199,999,999
real 24.90
user 49.48
sys 0.01

Running Lock5 (Peterson's solution w/synch) on my Digipen computer under Cygwin with a version of gcc (4.3.2) which doesn't support the synchronization function:

Count = 199,999,976
real 30.90
user 61.76
sys 0.01
GCC version 4.4.0 or later does support this synchronization.

Summary: (All tests run under Linux on olga, which supports the synchronization function, GCC 4.4.3.)

0 - No locking1 - Lock variable2 - Alternate turns3 - Yield to other4 - Peterson's Solution5 - Peterson's Solution w/synch
Count = 117,004,575

real    0m1.061s
user    0m2.060s
sys     0m0.000s
Count = 164,908,686

real    0m1.602s
user    0m3.130s
sys     0m0.000s
Count = 200,000,000

real    0m7.523s
user    0m14.980s
sys     0m0.010s
No output
Count = 187,677,882

real    0m18.366s
user    0m36.380s
sys     0m0.040s
Count = 200,000,000

real    0m20.981s
user    0m41.660s
sys     0m0.020s

Summary: (All tests run under Linux on maya, in a Windows 7 virtual machine with GCC version 4.5.3 .)

0 - No locking1 - Lock variable2 - Alternate turns3 - Yield to other4 - Peterson's Solution5 - Peterson's Solution w/synch
Count = 117,652,282

real    0m2.020s
user    0m3.609s
sys     0m0.046s
Count = 122,907,276

real    0m2.750s
user    0m5.250s
sys     0m0.046s
Count = 200,000,000

real    0m8.047s
user    0m15.375s
sys     0m0.077s
No output
Count = 199,996,702

real    0m16.678s
user    0m32.265s
sys     0m0.109s
Count = 200,000,000

real    0m14.561s
user    0m29.733s
sys     0m0.046s

Summary: (All tests run under Linux on maya, which supports the synchronization function, GCC 4.6.3.)

0 - No locking1 - Lock variable2 - Alternate turns3 - Yield to other4 - Peterson's Solution5 - Peterson's Solution w/synch
Count = 109,115,358

real    0m2.082s
user    0m4.148s
sys     0m0.000s
Count = 123,013,163

real    0m2.189s
user    0m4.236s
sys     0m0.004s
Count = 200,000,000

real    0m7.763s
user    0m15.505s
sys     0m0.004s
No output
Count = 198,261,472

real    0m9.374s
user    0m18.693s
sys     0m0.000s
Count = 200,000,000

real    0m8.653s
user    0m17.281s
sys     0m0.000s

Code to demonstrate the above output.

Fortunately, there are better ways to provide synchronization.

Synchronization Primitives

Suppose we had atomic functions called TestAndSetTestAndSet and SwapSwap that worked with a global variable (locked). This is how we could use them: (Assume locked is initialized to 0)

The atomic TestAndSet function provides mutual exclusion.

Client codeLibrary function (atomic)
void fn(void)
{
  do
  {
    while (TestAndSet(&locked))
      continue;
      
    /* critical section */
    
    locked = 0;
  } 
  while (1);
}
int TestAndSet(int *locked)
{
  int oldval = *locked;
  *locked = 1;
  return oldval;
}

The atomic Swap function also provides mutual exclusion:

Client codeLibrary function (atomic)
void fn2(void)
{
  do
  {
    int key = 1;
    while (key == 1)
      Swap(&locked, &key);
      
    /* critical section */
    
    locked = 0;
  } 
  while (1);
}
void Swap(int *a, int *b)
{
  int temp = *a;
  *a = *b;
  *b = temp;
}

Fortunately, we have these capabilities in a library using the PthreadsPthreads API.

The library (pthread.h) is quite extensive (about 75 functions) but we will only be looking at a very small portion of it.

Mutexes

The first synchronization mechanism we're going to look at is the mutex.

If two events are mutually exclusivemutually exclusive it means that only one or the other can happen, but not both. A mutex provides for mutual exclusionmutual exclusion between threads/processes.

A POSIX example using a global mutex: (modified from above)

/* mutex.c                                           */
/* to build on Linux, link with the pthread library: */
/* gcc mutex.c -lpthread                             */

#include <stdio.h>   /* printf       */
#include <pthread.h> /* thread stuff */

int Count = 0;
pthread_mutex_t count_mutex;               /* NEW */

void *Increment(void *p)
{
  int i;
  for (i = 0; i < 1000 * 100; i++)
  {
    pthread _mutex_lock(&count_mutex);     /* NEW */
      Count++;
    pthread _mutex_unlock(&count_mutex);   /* NEW */
  }
  return 0;
}

int main(void)
{
  #define N 10
  int i;
  pthread_t thread_id[N];
  
  pthread_mutex_init(&count_mutex, NULL);  /* NEW */

  for (i = 0; i < N; i++)
    pthread_create(&thread_id[i], 0, Increment, 0);
  
  for (i = 0; i < N; i++)
    pthread_join(thread_id[i], 0);

  pthread_mutex_destroy(&count_mutex);     /* NEW */
  
  printf("------------------\n");
  printf("Count = %i\n", Count);
  
  return 0;
}
Note:

You can also use a default static initializer for the mutex (instead of the call to pthread_mutex_init):

pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
instead of this function call in main:
pthread_mutex_init(&count_mutex, NULL); 

Here's a possible definition of pthread_mutex_t:
typedef struct 
{
  int __m_reserved;
  int __m_count;
  _pthread_descr __m_owner;
  int __m_kind;
  struct _pthread_fastlock __m_lock;
} pthread_mutex_t;
Output with 10 threads:
No synchronizationUsing a mutex
Count = 626,603
real 0.02
user 0.06
sys 0.00
Count = 1,000,000
real 3.08
user 0.95
sys 1.31

Output with 50 threads:

No synchronizationUsing a mutex
Count = 3,012,725
real 0.05
user 0.07
sys 0.01
Count = 5,000,000
real 15.21
user 3.56
sys 7.18

Output with 100 threads:

No synchronizationUsing a mutex
Count = 5,874,657
real 0.05
user 0.09
sys 0.04
Count = 10,000,000
real 30.27
user 6.07
sys 13.09

Factor out the mutex into a function without parameters (global mutex):

Client codeFunctions for Lock/Unlock
void *Increment(void *p)
{
  int i;
  for (i = 0; i < 1000 * 100; i++)
  {
    Lock();
      Count++;
    Unlock();
  }
  return 0;
}
void Lock(void)
{
  pthread_mutex_lock(&count_mutex);
}

void Unlock(void)
{
  pthread_mutex_unlock(&count_mutex);
}
Factor out the mutex into a function using parameters (non-global mutex):
Client codeFunctions for Lock/Unlock
void *Increment(void *p)
{
  int i;
  for (i = 0; i < 1000 * 100; i++)
  {
    Lock(&count_mutex);
      Count++;
    Unlock(&count_mutex);
  }
  return 0;
}
void Lock(pthread_mutex_t *mutex)
{
  pthread_mutex_lock(mutex);
}

void Unlock(pthread_mutex_t *mutex)
{
  pthread_mutex_unlock(mutex);
}

Note: Of course, in "normal, production code" you would check the return values of the system calls to ensure that they actually succeeded. Otherwise, you could spend unecessary time debugging your application.

A Win32 example that passes the mutex via a parameter to the thread function:
/* wmutex.c */
#include <windows.h>
#include <stdio.h>

struct ThreadParams
{
  HANDLE mutex;
  int tid;
};

DWORD WINAPI ThreadFn(LPVOID p)
{
  DWORD result; 
  int done = FALSE;
  struct ThreadParams *params = (struct ThreadParams *)p;
  HANDLE mutex = params->mutex;
  int tid = params->tid;
  
  while(!done)
  {
    /********************************************************************
      DO NOT set the timeout to zero, unless your threads have some other
      work to do. If the thread can do nothing until it obtains the mutex, 
      then don't keep checking that it is available. Use INFINITE instead
      of O (zero) for the timeout. If you use 0, and don't obtain the
      mutex immediately (the likely outcome) your program will be wasting 
      CPU cycles in a kind of busy wait loop. This is just an example
      of how you would be able to do other work, if necessary.

      If you don't understand the above, use INFINITE and forget about it.
    *********************************************************************/
    result = WaitForSingleObject(mutex, 0L);
    switch (result) 
    { 
        /* Acquired the mutex */
      case WAIT_OBJECT_0: 
        printf("ThreadId %i: acquired mutex\n", tid);
        done = TRUE;            

          /* Pretend to do some work */
        printf("ThreadId %i: working...\n", tid);
        Sleep(1000);

        if (!ReleaseMutex(mutex))
          printf("Error releasing mutex: %i\n", GetLastError());

        printf("ThreadId %i: released mutex\n", tid);
        break; 

        /* The mutex wasn't available yet */
      case WAIT_TIMEOUT: 
        /*printf("ThreadId %d: no mutex yet\n", tid);*/
        break; 
    }
  }
  return TRUE;
}

int main(int argc, char **argv)
{
  int i;
  HANDLE *threads;
  struct ThreadParams* params;  
  DWORD ThreadID;
  HANDLE mutex;
  int num_threads = 4;

    /* The user can specify the number of threads */  
  if (argc > 1)
    num_threads = atoi(argv[1]);

    /* Allocate memory for threads and parameters */    
  threads = malloc(num_threads * sizeof(HANDLE));
  params = malloc(num_threads * sizeof(struct ThreadParams));
  
    /* Create mutex */  
  mutex = CreateMutex(NULL, FALSE, NULL);
  if (mutex == NULL) 
  {
    printf("Error creating mutex: %d\n", GetLastError());
    return 1;
  }

    /* Create the threads */
  for(i = 0; i < num_threads; i++)
  {
    params[i].mutex = mutex;
    params[i].tid = i;
    
    threads[i] = CreateThread(NULL, 0, ThreadFn, &params[i], 0, &ThreadID); 
    if(threads[i] == NULL)
    {
      printf("Error creating thread: %d\n", GetLastError());
      return 1;
    }
  }

    /* Wait for each thread */
  WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
  
    /* Release thread handles */
  for(i = 0; i < num_threads; i++)
    CloseHandle(threads[i]);

    /* Release mutex */
  CloseHandle(mutex);
  
    /* Clean up memory */
  free(threads);
  free(params);

  return 0;
}
Parameters to CreateThread:
HANDLE WINAPI CreateThread
(
  __in_opt   LPSECURITY_ATTRIBUTES lpThreadAttributes,
  __in       SIZE_T dwStackSize,
  __in       LPTHREAD_START_ROUTINE lpStartAddress,
  __in_opt   LPVOID lpParameter,
  __in       DWORD dwCreationFlags,
  __out_opt  LPDWORD lpThreadId
);
Final note:

Semaphores

The second synchronization mechanism we're going to look at is the semaphoresemaphore.

A semaphore is an integer that can only be accessed using atomic operations.

A POSIX example:
This program allows the user to specify how many threads to create and how many to allow to run at any one time. For CPU-bound threads, you would want as many threads as CPU cores. For I/O-bound threads, you could launch many more (most would be in the blocked/waiting state).
/* sem.c                                             */
/* to build on Linux, link with the pthread library: */
/* gcc sem.c -lpthread                               */
  
#include <stdio.h>     /* printf             */
#include <fcntl.h>     /* O_CREAT, O_RDWR    */
#include <stdlib.h>    /* malloc, free, atoi */
#include <pthread.h>   /* thread stuff       */
#include <semaphore.h> /* semaphore stuff    */
#include <unistd.h>    /* sleep              */

struct ThreadParams
{
  sem_t *semaphore;
  int tid;
};

void *ThreadFn(void *p)
{
  struct ThreadParams *params = (struct ThreadParams *)p;
  sem_t *semaphore = params->semaphore;
  int tid = params->tid;
  
    /* Acquire semaphore */
  sem_wait(semaphore); 
    printf("ThreadId %i: acquired semaphore\n", tid);

      /* Pretend to do some work */
    printf("ThreadId %i: working...\n", tid);
    sleep(1);
  
      /* Release semaphore */
  sem_post(semaphore); 
  printf("ThreadId %i: released semaphore\n", tid);
  
  return 0;
}

int main(int argc, char **argv)
{
  int i;
  pthread_t *threads;
  struct ThreadParams* params;
  sem_t *semaphore;
  const char *sem_name = "/barney"; /* see online docs about naming: sem overview */
  int num_threads = 12;
  int sem_count = 4;
  
    /* The user can specify the number of threads and max semaphore count */
  if (argc > 1)
    num_threads = atoi(argv[1]);
  if (argc > 2)
    sem_count = atoi(argv[2]);
    
    /* Allocate memory for threads and parameters */
  threads = malloc(num_threads * sizeof(pthread_t *));
  params = malloc(num_threads * sizeof(struct ThreadParams));
  
    /* Create semaphore, see online docs for details */  
  semaphore = sem_open(sem_name, O_CREAT, O_RDWR, sem_count); 
  if (semaphore == SEM_FAILED)
  {
    printf("sem_open failed\n");
    return 1;
  }
  
    /* Mark the semaphore for deletion (will delete it even if we crash) */
  sem_unlink(sem_name);

    /* Create the threads */
  for(i = 0; i < num_threads; i++)
  {
    params[i].semaphore = semaphore;
    params[i].tid = i;
    
    pthread_create(&threads[i], NULL, ThreadFn, &params[i]);
  }

    /* Wait for each thread */
  for (i = 0; i < num_threads; i++)
    pthread_join(threads[i], NULL);

    /* Done with the semaphore */
  sem_close(semaphore); 

    /* Clean up memory */
  free(threads);
  free(params);

  return 0;
}
Output (program is called sem):
sem 4 1sem 8 3sem 12 4sem 12 8
ThreadId 0: acquired semaphore
ThreadId 0: working...
ThreadId 0: released semaphore
ThreadId 1: acquired semaphore
ThreadId 1: working...
ThreadId 1: released semaphore
ThreadId 2: acquired semaphore
ThreadId 2: working...
ThreadId 2: released semaphore
ThreadId 3: acquired semaphore
ThreadId 3: working...
ThreadId 3: released semaphore
ThreadId 0: acquired semaphore
ThreadId 0: working...
ThreadId 2: acquired semaphore
ThreadId 2: working...
ThreadId 1: acquired semaphore
ThreadId 1: working...
ThreadId 0: released semaphore
ThreadId 3: acquired semaphore
ThreadId 3: working...
ThreadId 4: acquired semaphore
ThreadId 4: working...
ThreadId 1: released semaphore
ThreadId 2: released semaphore
ThreadId 5: acquired semaphore
ThreadId 5: working...
ThreadId 3: released semaphore
ThreadId 6: acquired semaphore
ThreadId 6: working...
ThreadId 7: acquired semaphore
ThreadId 7: working...
ThreadId 5: released semaphore
ThreadId 4: released semaphore
ThreadId 6: released semaphore
ThreadId 7: released semaphore
ThreadId 1: acquired semaphore
ThreadId 1: working...
ThreadId 3: acquired semaphore
ThreadId 2: acquired semaphore
ThreadId 2: working...
ThreadId 3: working...
ThreadId 0: acquired semaphore
ThreadId 0: working...
ThreadId 1: released semaphore
ThreadId 2: released semaphore
ThreadId 5: acquired semaphore
ThreadId 5: working...
ThreadId 4: acquired semaphore
ThreadId 4: working...
ThreadId 3: released semaphore
ThreadId 6: acquired semaphore
ThreadId 6: working...
ThreadId 0: released semaphore
ThreadId 7: acquired semaphore
ThreadId 7: working...
ThreadId 6: released semaphore
ThreadId 8: acquired semaphore
ThreadId 8: working...
ThreadId 5: released semaphore
ThreadId 9: acquired semaphore
ThreadId 9: working...
ThreadId 4: released semaphore
ThreadId 10: acquired semaphore
ThreadId 10: working...
ThreadId 7: released semaphore
ThreadId 11: acquired semaphore
ThreadId 11: working...
ThreadId 8: released semaphore
ThreadId 9: released semaphore
ThreadId 10: released semaphore
ThreadId 11: released semaphore
ThreadId 0: acquired semaphore
ThreadId 0: working...
ThreadId 2: acquired semaphore
ThreadId 2: working...
ThreadId 3: acquired semaphore
ThreadId 5: acquired semaphore
ThreadId 5: working...
ThreadId 4: acquired semaphore
ThreadId 4: working...
ThreadId 3: working...
ThreadId 7: acquired semaphore
ThreadId 7: working...
ThreadId 6: acquired semaphore
ThreadId 1: acquired semaphore
ThreadId 1: working...
ThreadId 6: working...
ThreadId 0: released semaphore
ThreadId 3: released semaphore
ThreadId 5: released semaphore
ThreadId 4: released semaphore
ThreadId 7: released semaphore
ThreadId 8: acquired semaphore
ThreadId 8: working...
ThreadId 11: acquired semaphore
ThreadId 11: working...
ThreadId 2: released semaphore
ThreadId 10: acquired semaphore
ThreadId 10: working...
ThreadId 6: released semaphore
ThreadId 9: acquired semaphore
ThreadId 9: working...
ThreadId 1: released semaphore
ThreadId 8: released semaphore
ThreadId 11: released semaphore
ThreadId 10: released semaphore
ThreadId 9: released semaphore
Notes:


The same example using the Win32 API:

/* wsemp.c */
#include <windows.h>
#include <stdio.h>

struct ThreadParams
{
  HANDLE semaphore;
  int tid;
};

DWORD WINAPI ThreadFn(LPVOID p)
{
  DWORD result; 
  int done = FALSE;
  struct ThreadParams *params = (struct ThreadParams *)p;
  HANDLE semaphore = params->semaphore;
  int tid = params->tid;
  
  while(!done)
  {
    /********************************************************************
      DO NOT set the timeout to zero, unless your threads have some other
      work to do. If the thread can do nothing until it obtains the mutex, 
      then don't keep checking that it is available. Use INFINITE instead
      of O (zero) for the timeout. If you use 0, and don't obtain the
      mutex immediately (the likely outcome) your program will be wasting 
      CPU cycles in a kind of busy wait loop. This is just an example
      of how you would be able to do other work, if necessary.

      If you don't understand the above, use INFINITE and forget about it.
    *********************************************************************/
    result = WaitForSingleObject(semaphore, 0L);
    switch (result) 
    { 
        /* Acquired the semaphore */
      case WAIT_OBJECT_0: 
        printf("ThreadId %i: acquired semaphore\n", tid);
        done = TRUE;            

          /* Pretend to do some work */
        printf("ThreadId %i: working...\n", tid);
        Sleep(1000);

        if (!ReleaseSemaphore(semaphore, 1, NULL))
          printf("ReleaseSemaphore error: %i\n", GetLastError());

        printf("ThreadId %i: released semaphore\n", tid);
        break; 

        /* The semaphore wasn't available yet */
      case WAIT_TIMEOUT: 
        /*printf("ThreadId %d: no semaphore yet\n", GetCurrentThreadId());*/
        break; 
    }
  }
  return TRUE;
}

int main(int argc, char **argv)
{
  int i;
  HANDLE *threads;
  struct ThreadParams* params;  
  DWORD ThreadID;
  HANDLE semaphore;
  int num_threads = 12;
  int sem_count = 4;

    /* The user can specify the number of threads and max semaphore count */  
  if (argc > 1)
    num_threads = atoi(argv[1]);
  if (argc > 2)
    sem_count = atoi(argv[2]);

    /* Allocate memory for threads and parameters */    
  threads = malloc(num_threads * sizeof(HANDLE));
  params = malloc(num_threads * sizeof(struct ThreadParams));
  
    /* Create semaphore */  
  semaphore = CreateSemaphore(NULL, sem_count, sem_count, NULL);
  if (semaphore == NULL) 
  {
    printf("CreateSemaphore error: %d\n", GetLastError());
    return 1;
  }

    /* Create the threads */
  for(i = 0; i < num_threads; i++)
  {
    params[i].semaphore = semaphore;
    params[i].tid = i;
    
    threads[i] = CreateThread(NULL, 0, ThreadFn, &params[i], 0, &ThreadID); 
    if(threads[i] == NULL)
    {
      printf("CreateThread error: %d\n", GetLastError());
      return 1;
    }
  }

    /* Wait for each thread */
  WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
  
    /* Release thread handles */
  for(i = 0; i < num_threads; i++)
    CloseHandle(threads[i]);

    /* Release semaphore */
  CloseHandle(semaphore);
  
    /* Clean up memory */
  free(threads);
  free(params);

  return 0;
}
Output (program is called sem.exe) The numbers in parentheses in the second column are the number of threads in the critical section.
sem.exe 4 1sem.exe 8 3sem.exe 12 4sem.exe 12 8
ThreadId 0: acquired semaphore
ThreadId 0: working...
ThreadId 0: released semaphore
ThreadId 1: acquired semaphore
ThreadId 1: working...
ThreadId 1: released semaphore
ThreadId 2: acquired semaphore
ThreadId 2: working...
ThreadId 2: released semaphore
ThreadId 3: acquired semaphore
ThreadId 3: working...
ThreadId 3: released semaphore
ThreadId 0: acquired semaphore (1)
ThreadId 1: acquired semaphore (2)
ThreadId 0: working...
ThreadId 2: acquired semaphore (3)
ThreadId 1: working...
ThreadId 2: working...
ThreadId 0: released semaphore (2)
ThreadId 3: acquired semaphore (3)
ThreadId 2: released semaphore (2)
ThreadId 1: released semaphore (1)
ThreadId 4: acquired semaphore (2)
ThreadId 5: acquired semaphore (3)
ThreadId 3: working...
ThreadId 4: working...
ThreadId 5: working...
ThreadId 3: released semaphore (2)
ThreadId 6: acquired semaphore (3)
ThreadId 7: acquired semaphore (4)
ThreadId 4: released semaphore
ThreadId 5: released semaphore
ThreadId 6: working...
ThreadId 7: working...
ThreadId 6: released semaphore
ThreadId 7: released semaphore
ThreadId 0: acquired semaphore
ThreadId 1: acquired semaphore
ThreadId 2: acquired semaphore
ThreadId 0: working...
ThreadId 3: acquired semaphore
ThreadId 1: working...
ThreadId 2: working...
ThreadId 3: working...
ThreadId 0: released semaphore
ThreadId 1: released semaphore
ThreadId 4: acquired semaphore
ThreadId 3: released semaphore
ThreadId 5: acquired semaphore
ThreadId 6: acquired semaphore
ThreadId 2: released semaphore
ThreadId 7: acquired semaphore
ThreadId 4: working...
ThreadId 5: working...
ThreadId 6: working...
ThreadId 7: working...
ThreadId 5: released semaphore
ThreadId 4: released semaphore
ThreadId 9: acquired semaphore
ThreadId 8: acquired semaphore
ThreadId 7: released semaphore
ThreadId 10: acquired semaphore
ThreadId 6: released semaphore
ThreadId 11: acquired semaphore
ThreadId 9: working...
ThreadId 8: working...
ThreadId 10: working...
ThreadId 11: working...
ThreadId 9: released semaphore
ThreadId 8: released semaphore
ThreadId 11: released semaphore
ThreadId 10: released semaphore
ThreadId 0: acquired semaphore
ThreadId 1: acquired semaphore
ThreadId 2: acquired semaphore
ThreadId 0: working...
ThreadId 3: acquired semaphore
ThreadId 4: acquired semaphore
ThreadId 1: working...
ThreadId 6: acquired semaphore
ThreadId 5: acquired semaphore
ThreadId 2: working...
ThreadId 7: acquired semaphore
ThreadId 3: working...
ThreadId 4: working...
ThreadId 6: working...
ThreadId 5: working...
ThreadId 7: working...
ThreadId 0: released semaphore
ThreadId 9: acquired semaphore
ThreadId 9: working...
ThreadId 1: released semaphore
ThreadId 8: acquired semaphore
ThreadId 8: working...
ThreadId 2: released semaphore
ThreadId 10: acquired semaphore
ThreadId 3: released semaphore
ThreadId 11: acquired semaphore
ThreadId 5: released semaphore
ThreadId 6: released semaphore
ThreadId 4: released semaphore
ThreadId 7: released semaphore
ThreadId 10: working...
ThreadId 11: working...
ThreadId 9: released semaphore
ThreadId 8: released semaphore
ThreadId 10: released semaphore
ThreadId 11: released semaphore
A simpler Win32 thread function:
/* wsemp2.c */
DWORD WINAPI ThreadFn(LPVOID p)
{
  struct ThreadParams *params = (struct ThreadParams *)p;
  HANDLE semaphore = params->semaphore;
  int tid = params->tid;
  
  WaitForSingleObject(semaphore, INFINITE);
    
    printf("ThreadId %i: acquired semaphore\n", tid);

      /* Pretend to do some work */
    printf("ThreadId %i: working...\n", tid);
    Sleep(1000);

  ReleaseSemaphore(semaphore, 1, NULL);
  printf("ThreadId %i: released semaphore\n", tid);
  
  return TRUE;
}

Critical Sections (Windows-only)

The third synchronization mechanism we're going to look at is the Critical Section. This is specific to Windows.

A Win32 Example using a Critical Section. (Follow the links below for more details).

/* wcritical.c */
#include <windows.h>
#include <stdio.h>

struct ThreadParams
{
  CRITICAL_SECTION *critical_section;
  int tid;
};

DWORD WINAPI ThreadFn(LPVOID p)
{
  struct ThreadParams *params = (struct ThreadParams *)p;
  CRITICAL_SECTION *critical_section = params->critical_section;
  int tid = params->tid;
  
  EnterCriticalSection(critical_section);
  
      /* Pretend to do some work */
    printf("ThreadId %i: working...\n", tid);
    Sleep(1000);
  
  LeaveCriticalSection(critical_section);
  
  return TRUE;
}

int main(int argc, char **argv)
{
  int i;
  HANDLE *threads;
  struct ThreadParams* params;  
  DWORD ThreadID;
  CRITICAL_SECTION critical_section;
  int num_threads = 4;

    /* The user can specify the number of threads */  
  if (argc > 1)
    num_threads = atoi(argv[1]);

    /* Allocate memory for threads and parameters */    
  threads = malloc(num_threads * sizeof(HANDLE));
  params = malloc(num_threads * sizeof(struct ThreadParams));
  
    /* Setup critical section */  
  InitializeCriticalSection(&critical_section);

    /* Create the threads */
  for(i = 0; i < num_threads; i++)
  {
    params[i].critical_section = &critical_section;
    params[i].tid = i;
    
    threads[i] = CreateThread(NULL, 0, ThreadFn, &params[i], 0, &ThreadID); 
    if(threads[i] == NULL)
    {
      printf("Error creating thread: %d\n", GetLastError());
      return 1;
    }
  }

    /* Wait for each thread */
  WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
  
    /* Release thread handles */
  for(i = 0; i < num_threads; i++)
    CloseHandle(threads[i]);

    /* Release critical section */
  DeleteCriticalSection(&critical_section);
  
    /* Clean up memory */
  free(threads);
  free(params);

  return 0;
}


Synchronization Summary:
Mutex (wmutex2.c)Critical Section (wcritical2.c)

DWORD WINAPI ThreadFn(LPVOID p)
{
  int i;
  struct ThreadParams *params = (struct ThreadParams *)p;
  HANDLE mutex = params->mutex;
  
  for (i = 0; i < 1000 * 1000; i++)
  {
    WaitForSingleObject(mutex, INFINITE);
      Count++;
    ReleaseMutex(mutex);
  }
  
  return TRUE;
}
Output using 10 threads:
Value of Count is 10000000
real 27.23
user 0.01
sys 0.01

DWORD WINAPI ThreadFn(LPVOID p)
{
  int i;
  struct ThreadParams *params = (struct ThreadParams *)p;
  CRITICAL_SECTION *critical_section = params->critical_section;

  for (i = 0; i < 1000 * 1000; i++)
  {
    EnterCriticalSection(critical_section);
      Count++;
    LeaveCriticalSection(critical_section);
  }
  
  return TRUE;
}
Output using 10 threads:
Value of Count is 10000000
real 14.05
user 0.01
sys 0.00

Using a mutex between processes (Win32 example): This code is the parent process and this code is for the child processes. This example also shows how to create shared memory between processes.

Notes:
Spin lock (busy-waiting, CPU-bound)Blocked (likely I/O-bound)
void Lock5(int me)
{
  int you = 1 - me;
  turn = you;
  ready[me] = 1;
  __sync_synchronize();
  while (ready[you] && (turn == you))
    continue;
}

Lock5(id);
  Critical section...
Unlock5(id);
WaitForSingleObject(mutex, INFINITE);
  Critical section...
ReleaseMutex(mutex);

WaitForSingleObject(semaphore, INFINITE);
  Critical section...
ReleaseSemaphore(semaphore);

EnterCriticalSection(critical_section);
  Critical section...
LeaveCriticalSection(critical_section);
Final note:

Deadlock

The synchronization techniques still require the programmer to use them correctly. Even when used "correctly", bad things can happen:

    P0              P1
lock(res1);     lock(res2);
lock(res2);     lock(res1);
    
   /* use resources */

unlock(res1);   unlock(res2);
unlock(res2);   unlock(res1);

Can you see the problem? The two processes are deadlockeddeadlocked.

The deadlock depends on how the resources are acquired:

    P0              P1
lock(res1);       -------
lock(res2);       -------
use resources     blocked (try to lock res2)
use resources     blocked
use resources     blocked
use resources     blocked
unlock(res1);     blocked           
unlock(res2);     blocked           
-------           lock(res2);
-------           lock(res1);
-------           use resources
-------           use resources
-------           use resources
-------           use resources
-------           unlock(res2);
-------           unlock(res1);
-------           -------
-------           -------

There is no deadlock above. However, this will cause deadlock:

    P0                           P1
lock(res1);                  lock(res2);
blocked (try to lock res2)   blocked (try to lock res1)  
blocked (forever)            blocked (forever)
            
             DEADLOCK

Other problems:

Deadlock occurs when several processes are waiting for an event that can be caused by the other waiting processes.

There are 4 conditions that must hold for there to be a deadlock:

  1. Mutual exclusion - resources can only be held (used) by one process at a time, and must be released in order for others to use them.
  2. Hold and wait - A process must be holding a resource and waiting for another resource that is held by another process.
  3. No preemption - processes can only voluntarily release a held resource.
  4. Circular waiting - process P0 is waiting for process P1 to release a resource, P1 is waiting for P2, ... , Pn is waiting for P0

Looking back at this deadlock we see all 4 conditions:

    P0                           P1
lock(res1);                  lock(res2);
blocked (try to lock res2)   blocked (try to lock res1)  
blocked (forever)            blocked (forever)
            
             DEADLOCK

There is no deadlock here: (no process is holding/waiting at the same time)

    P0              P1
lock(res1);       -------
lock(res2);       -------
use resources     blocked (try to lock res2)
use resources     blocked
use resources     blocked
use resources     blocked
unlock(res1);     blocked           
unlock(res2);     blocked           
-------           lock(res2);
-------           lock(res1);
-------           use resources
-------           use resources
-------           use resources
-------           use resources
-------           unlock(res2);
-------           unlock(res1);
-------           -------
-------           -------

Resource allocation graphs

There are 3 methods for dealing with deadlocks:
  1. Deadlock prevention (Programmer)
  2. Deadlock detection and recovery (Operating system)
  3. Do nothing

Which one do you think is used by most operating systems?

C11 and the atomic_int

C11 added a new type called atomic_int that can do some operations atomically, such as increment:
#include <stdatomic.h> /* atomic_int */
atomic_int i = 0;
...
i++; /* This is now an atomic operation. */
Example (from before now using an atomic_int):
#include <stdio.h>     /* printf                  */
#include <pthread.h>   /* thread create/join/exit */
#include <stdatomic.h> /* atomic_int              */

atomic_int Count = 0; /* This is the only change. */

void *Increment(void *p)
{
  int i, id = *(int *)p;

  for (i = 0; i < 1000 * 1000 * 100; i++)
  {
    Count++;   /* This is the critical section */
  }

  return NULL;
}

int main(void)
{
  pthread_t thread_id[2];
  int tid0 = 0, tid1 = 1;
  
  pthread_create(&thread_id[0], 0, Increment, &tid0);
  pthread_create(&thread_id[1], 0, Increment, &tid1);
    
  pthread_join(thread_id[0], 0);
  pthread_join(thread_id[1], 0);
  
  printf("Count = %i\n", Count);
  
  return 0;
}
Results:
int (no locks)atomic_intint with mutex
Count = 112089636

real  0m2.075s
user  0m4.074s
sys   0m0.000s
Count = 200000000

real  0m5.332s
user  0m10.590s
sys   0m0.004s
Count = 200000000

real  0m11.353s
user  0m22.684s
sys   0m0.004s
More results:
int (no locks)atomic_intint with mutex
  2.07s
  2.29s
  2.20s
  2.19s
  2.09s
  2.26s
  2.26s
  2.05s
  2.28s
  2.14s
  5.20s
  4.53s
  5.80s
  5.87s
  6.04s
  5.56s
  5.94s
  5.53s
  5.58s
  5.09s
  11.02s
  11.08s
  10.70s
  11.52s
  10.90s
  11.10s
  10.97s
  10.87s
  10.87s
  10.37s
C code:
regular intatomic_int


int count;

int main(void)
{
  return ++count;
}
#include <stdatomic.h>

atomic_int count;

int main(void)
{
  return ++count;
}
Assembly output from C code:
regular intatomic_int
main:
  movl  count(%rip), %eax
  addl  $1, %eax
  movl  %eax, count(%rip)
  ret
main:
  movl  $1, %eax
  lock xaddl  %eax, count(%rip)
  addl  $1, %eax
  ret