Hashing

Introduction to Hashing

If we have to use a comparison function to identify an item (which could even be the built-in comparison operators), the answer to "Can we do better?" is no: the O(log N) of binary search is the best we can do. To do better, we need a way to compute an item's location directly from its key.

The Time vs. Space Trade-off

Unsorted 10-element array (full):

Sorted 10-element array (full):

Complexity of searching? Pros/Cons?
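As a concrete baseline, here is a minimal sketch (the array contents are made up for illustration): the unsorted array forces a linear scan, O(N) comparisons, while the sorted array supports O(log N) binary search, at the price of keeping it sorted on every insert.

#include <algorithm>
#include <iostream>

int main()
{
    // Hypothetical contents for the two 10-element arrays above
  int unsorted[10] = {42, 7, 19, 3, 88, 51, 26, 94, 60, 12};
  int sorted[10]   = {3, 7, 12, 19, 26, 42, 51, 60, 88, 94};

    // Unsorted: must scan linearly, O(N) comparisons
  bool inUnsorted = std::find(unsorted, unsorted + 10, 51) != unsorted + 10;

    // Sorted: binary search, O(log N) comparisons
  bool inSorted = std::binary_search(sorted, sorted + 10, 51);

  std::cout << inUnsorted << " " << inSorted << "\n"; // prints: 1 1
  return 0;
}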

A First Attempt

Storing Other Data Types (from very large sets)

Given this Student structure:
struct Student
{
  string Name; // Maybe last and/or first
  long Year;   // 1 - 4
  float GPA;   // 0.00 - 4.00
  long ID;     // social security number, 000-00-0000 .. 999-99-9999
};
The ID appears to be the best candidate.

Using a Portion of the ID

Hash Functions

Hash functions are functions in the normal (mathematical) sense of the word (i.e. given some input, produce one output):

In our Student example above, the key is the social security number and the index is the result of applying the hash function. If H is a hash function, and K is a key,

H(K) --> index
So, given a hash function SSHashFirst4, we can map social security numbers into a set of indices. (SSHashFirst4 simply extracts and returns the first 4 digits):
SSHashFirst4(525334568) --> 5253  !!!
SSHashFirst4(559304056) --> 5593
SSHashFirst4(167782345) --> 1677
SSHashFirst4(525387000) --> 5253  !!!
SSHashFirst4(552559999) --> 5525
Or, given a hash function SSHashLast4, we can map social security numbers into a different set of indices. (SSHashLast4 extracts the last 4 digits):
SSHashLast4(525334568) --> 4568
SSHashLast4(559304056) --> 4056
SSHashLast4(167782345) --> 2345
SSHashLast4(525387000) --> 7000
SSHashLast4(552559999) --> 9999
Graphically:

These trivial hash functions could be implemented like this:
unsigned SSHashFirst4(unsigned SSNumber)
{
  return SSNumber / 100000;  // return first 4 digits
}

unsigned SSHashLast4(unsigned SSNumber)
{
  return SSNumber % 10000;  // return last 4 digits
}
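A tiny driver (assuming the two functions above are in scope) makes the difference concrete: the sample keys that share a prefix collide under SSHashFirst4 but remain distinct under SSHashLast4.

#include <iostream>

int main()
{
    // Two different keys whose first 4 digits collide
  std::cout << SSHashFirst4(525334568) << "\n";  // 5253
  std::cout << SSHashFirst4(525387000) << "\n";  // 5253 -- collision!

    // The same keys under the other hash function
  std::cout << SSHashLast4(525334568) << "\n";   // 4568
  std::cout << SSHashLast4(525387000) << "\n";   // 7000 -- distinct
  return 0;
}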
A note about hash functions: A good hash function will map keys uniformly and randomly into the full range of indices. More on this later...

To be clear: The keys (input) used in a hash table must be unique. However, the resulting hashed values (output) do not have to be unique (and often are not).

Collision Resolution by Linear Probing

There are two parts to hash-based algorithms that implementations must deal with:
  1. Computing the hash function to produce an index from a key
  2. Dealing with the inevitable collisions

As a simple example, we'll use single letters (A - Z) as keys for some larger data structure. Also, to keep things simple, and to apply our hash function knowledge, we'll assume that our hash table has only 7 slots. (Aside: What if the hash table had 26 slots?)

Each letter (key) will be its position in the alphabet (e.g. A1, B2, ..., Z26)
Our hash function will simply be something like this:
  H(K) = K % 7   (the key modulo 7)
So A1 maps to 1, G7 maps to 0, H8 maps to 1, T20 maps to 6, etc.

Notice the significance of the value 7: it is the table size, so taking the key modulo 7 always yields a valid index (0 - 6). Modding by the table size will be part of all of our hash functions.

We want to insert the following keys into our hash table: S19, P16, I9, N14, A1, L12

Our hash function will map the keys to indices like this:

H(S) = 5   (19 % 7)
H(P) = 2   (16 % 7)
H(I) = 2   ( 9 % 7)
H(N) = 0   (14 % 7)
H(A) = 1   ( 1 % 7)
H(L) = 5   (12 % 7)
  1. Using linear probing to resolve collisions, what would the hash table look like during and after the insertions of the letters above?
  2. How many probes would be performed?
  3. What happens when we try to lookup (find) I9? How about B2?
  4. What if we inserted T20 into the table?
  5. What happens when we now try to lookup B2?

  6. Another example: Using linear probing to resolve collisions, what would the hash table look like during and after the insertions of the letters:
    F6, L12, E5, T20, C3, H8
    

Self-check: What does the hash table look like after inserting the 6 letters above? How many probes were required? (See the sketch below.)
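Here is a minimal sketch of insertion with linear probing that reproduces the first example. It stores each letter's alphabet position, uses 0 as the empty marker (safe, since the keys run 1 - 26), and assumes the table never fills; swap in the keys 6, 12, 5, 20, 3, 8 to check the self-check above.

#include <iostream>

const int SIZE = 7;

  // Insert a key with linear probing; returns the number of probes used
int Insert(int table[], int key)
{
  int index = key % SIZE; // H(K) = K % 7
  int probes = 1;

    // Probe forward (wrapping around) until an empty slot is found
  while (table[index] != 0)
  {
    index = (index + 1) % SIZE;
    probes++;
  }
  table[index] = key;
  return probes;
}

int main()
{
  int table[SIZE] = {0};
  int keys[] = {19, 16, 9, 14, 1, 12}; // S, P, I, N, A, L
  int total = 0;

  for (int key : keys)
    total += Insert(table, key);

  for (int i = 0; i < SIZE; i++)
    std::cout << i << ": " << table[i] << "\n";
  std::cout << "Total probes: " << total << "\n"; // 8 for these keys
  return 0;
}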

The hash tables after inserting the letters above.

Notes:

Performance Characteristics of Linear Probing

Probability of forming/growing clusters:

Clustering diagrams

Clusters affect probing (and should be avoided):

Average number of probes for a hit: 1/2 (1 + 1/(1 - x))

Average number of probes for a miss: 1/2 (1 + 1/(1 - x)^2)

(These are Knuth's approximations, where x is the load factor, N/M.)

Plugging in values for the load factor, x, in the equations leads to this table:

           Load      Number of Probes
          Factor       Hit      Miss
-------------------------------------
  Nearly   .05        1.03      1.05
  empty    .10        1.06      1.12
           .15        1.09      1.19
           .20        1.13      1.28
           .25        1.17      1.39
           .30        1.21      1.52
           .35        1.27      1.68
           .40        1.33      1.89
           .45        1.41      2.15
  Half     .50        1.50      2.50
  full     .55        1.61      2.97
           .60        1.75      3.62
           .65        1.93      4.58
           .67        2.02      5.09
           .70        2.17      6.06
           .75        2.50      8.50
           .80        3.00     13.00
           .85        3.83     22.72
  Nearly   .90        5.50     50.50
  full     .95       10.50    200.50
Note: As the load factor of the table approaches 1, the values tend to lose accuracy. However, in practice you would never allow a hash table to get this full. So the values are accurate enough when the table has a reasonable load factor.
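As a sanity check, a few lines of code (a sketch) reproduce the table directly from the two formulas above, stepping the load factor x in the same .05 increments:

#include <cstdio>

int main()
{
    // Evaluate Knuth's two approximations at each load factor
  for (double x = 0.05; x < 0.96; x += 0.05)
  {
    double hit  = 0.5 * (1.0 + 1.0 / (1.0 - x));
    double miss = 0.5 * (1.0 + 1.0 / ((1.0 - x) * (1.0 - x)));
    std::printf("%.2f  %6.2f  %7.2f\n", x, hit, miss);
  }
  return 0;
}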

Other Probing Methods

The fundamental problem with linear probing is that every key follows the same probe sequence (a fixed stride of 1), so keys that collide pile up into ever-larger clusters.

Double hashing performance: (formulas derived by Guibas and Szemerédi)

Average number of probes for a hit: (1/x) ln(1/(1 - x))

Average number of probes for a miss: 1/(1 - x)

                        Linear Probing       Double Hashing
             Load      Number of Probes     Number of Probes
            Factor%      Hit      Miss        Hit      Miss
------------------------------------------------------------
   Nearly      5        1.03      1.05       1.03      1.05
   empty      10        1.06      1.12       1.05      1.11
              15        1.09      1.19       1.08      1.18
              20        1.13      1.28       1.12      1.25
              25        1.17      1.39       1.15      1.33
              30        1.21      1.52       1.19      1.43
              35        1.27      1.68       1.23      1.54
              40        1.33      1.89       1.28      1.67
              45        1.41      2.15       1.33      1.82
   Half       50        1.50      2.50       1.39      2.00
   full       55        1.61      2.97       1.45      2.22
              60        1.75      3.62       1.53      2.50
              65        1.93      4.58       1.62      2.86
              70        2.17      6.06       1.72      3.33
              75        2.50      8.50       1.85      4.00
              80        3.00     13.00       2.01      5.00
              85        3.83     22.72       2.23      6.67
   Nearly     90        5.50     50.50       2.56     10.00
   full       95       10.50    200.50       3.15     20.00
            
Full table
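How double hashing achieves this, in a minimal sketch using the same 7-slot setup as before (the particular second hash, 1 + K % (SIZE - 2), is one common textbook choice, not prescribed by these notes): a second, key-dependent hash supplies the stride, so keys that collide on the first hash still trace different probe sequences. The stride must be non-zero and relatively prime to the table size, which is automatic when the table size is prime.

const int SIZE = 7; // prime, so every stride 1 - 6 is relatively prime to it

  // Return the slot where key lives or would be inserted (0 marks empty).
  // Unlike linear probing, the stride depends on the key itself.
int FindSlot(const int table[], int key)
{
  int index  = key % SIZE;           // first hash function
  int stride = 1 + key % (SIZE - 2); // second hash function, never 0

  while (table[index] != 0 && table[index] != key)
    index = (index + stride) % SIZE;

  return index;
}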

Expanding the Hash Table (Dynamic Hash Tables)

Run-time issues:
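The run-time issues center on what happens when the table fills. A common approach (a sketch; the 0.67 threshold, doubling policy, and 0-as-empty marker are illustrative choices) is to allocate a larger table when the load factor crosses a threshold and re-insert every key, since each key's home slot depends on the table size:

#include <vector>

  // A sketch of a growable linear-probing table (0 marks an empty slot)
struct HashTable
{
  std::vector<int> slots;
  int count;

  HashTable(int size) : slots(size, 0), count(0) {}

  void Insert(int key)
  {
      // Grow before the table gets too dense (0.67 is a common threshold)
    if ((count + 1.0) / slots.size() > 0.67)
      Grow();

    int index = key % (int)slots.size();
    while (slots[index] != 0)
      index = (index + 1) % (int)slots.size();
    slots[index] = key;
    count++;
  }

  void Grow()
  {
      // Every key must be re-hashed: its home slot depends on the table size
    std::vector<int> old = slots;
    slots.assign(old.size() * 2 + 1, 0); // roughly double (a prime is better)
    count = 0;
    for (int key : old)
      if (key != 0)
        Insert(key);
  }
};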

Deleting Items from a Linear Probing Hash Table

Given the table containing the keys:
F6, L12, E5, T20, C3, H8
What happens to our search algorithm if we delete an item?

Question: How many clusters are in the diagram on the left?


Handling Deletions (Two Policies)

1. Marking slots as deleted: (MARK)

2. Adjusting the table: (PACK)

Results of re-inserting the cluster:

For relatively sparse tables the number of re-insertions is small, simply because the clusters themselves are very small (maybe 2 or 3 elements).

Self-check: What does the hash table look like after deleting the letter F and re-inserting the cluster (PACK)?

Note: The performance of an open-addressed hash table is directly related to the size of its clusters. The larger the clusters, the more comparisons are required; smaller clusters require fewer.
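A sketch of the MARK policy (the sentinel values are illustrative, and it assumes at least one truly empty slot so the loops terminate): a tombstone keeps the probe chain intact. In the example above, if F's slot were simply emptied, a search for E (which probed past F's slot during insertion) would stop early and report a miss.

const int SIZE = 7;
const int EMPTY = 0;
const int DELETED = -1; // tombstone marker (any value that is never a key)

  // MARK policy: overwrite the key with a tombstone instead of emptying
  // the slot, so searches farther along the cluster still work
void Remove(int table[], int key)
{
  int index = key % SIZE;
  while (table[index] != EMPTY)
  {
    if (table[index] == key)
    {
      table[index] = DELETED;
      return;
    }
    index = (index + 1) % SIZE;
  }
}

  // Lookups treat tombstones as occupied slots and probe past them
bool Contains(const int table[], int key)
{
  int index = key % SIZE;
  while (table[index] != EMPTY)
  {
    if (table[index] == key)
      return true;
    index = (index + 1) % SIZE;
  }
  return false;
}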

Collision Resolution by Chaining

The previous example would create a table with chains that look like this:


S19, P16, I9, N14, A1, L12

H(S) = 5   (19 % 7)
H(P) = 2   (16 % 7)
H(I) = 2   ( 9 % 7)
H(N) = 0   (14 % 7)
H(A) = 1   ( 1 % 7)
H(L) = 5   (12 % 7)
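A minimal sketch that builds these chains (each new key is simply prepended to the list at its slot, so no probing is ever needed):

#include <forward_list>
#include <iostream>

int main()
{
  const int SIZE = 7;
  std::forward_list<int> table[SIZE]; // one linked chain per slot

  int keys[] = {19, 16, 9, 14, 1, 12}; // S, P, I, N, A, L
  for (int key : keys)
    table[key % SIZE].push_front(key); // a collision just extends the chain

    // Print each slot's chain (most recently inserted key first)
  for (int i = 0; i < SIZE; i++)
  {
    std::cout << i << ":";
    for (int key : table[i])
      std::cout << " " << key;
    std::cout << "\n";
  }
  return 0;
}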

Notes on Chaining

Complexity of Chaining

Recall the performance of linear probing:

With chaining:
  • There is no concept of "2/3 full".
  • Load factor is still calculated the same way as before (N/M), but with chaining it can exceed 1.
  • Think of the load factor as being the average length of the lists. (Given a good hash function, of course).
  • This example has a load factor of 2.86 (N/M = 20/7):
  • Complexity is the same as it was for singly-linked lists.
  • Why average complexity and not worst case here?
  • What looks better, linear probing or chaining? (Depends, pros and cons of both)
  • Self-check: Given a table of size M that contains N items, what is the average number of nodes visited in a successful search? Unsuccessful search?

Chaining demo GUI

Hashing Strings

Note: The size of the table in the following examples is 211.
A simple (and naive) hash function for hashing strings:
unsigned SimpleHash(const char *Key, unsigned TableSize)
{
    // Initial value of hash
  unsigned hash = 0;

    // Process each char in the string
  while (*Key)
  {
      // Add in current char
    hash += *Key;

      // Next char
    Key++;
  }

    // Modulo so hash is within the table
  return hash % TableSize;
}
Sample run (string, hash):
bat, 100
cat, 101
dat, 102
pam, 107
amp, 107
map, 107
tab, 100
tac, 101
tad, 102
DigiPen, 39
digipen, 103
DIGIPEN, 90
    
A better algorithm is Program 14.1 in the Sedgewick book:
int RSHash(const char *Key, int TableSize)
{
  int hash = 0;         // Initial value of hash
  int multiplier = 127; // Prevent anomalies

    // Process each char in the string
  while (*Key)
  {
      // Adjust hash total
    hash = hash * multiplier;

      // Add in current char and mod result
    hash = (hash + *Key) % TableSize;

      // Next char
    Key++;
  }

    // Hash is within 0 - (TableSize - 1)
  return hash;
}
Sample run (string, hash):
bat, 27
cat, 120
dat, 2
pam, 56
amp, 188
map, 202
tab, 206
tac, 207
tad, 208
DigiPen, 162
digipen, 140
DIGIPEN, 198
    
A more complex hash function, invented by P.J. Weinberger, is a classic example:
int PJWHash(const char *Key, int TableSize)
{
    // Initial value of hash (unsigned avoids signed-overflow issues)
  unsigned hash = 0;

    // Process each char in the string
  while (*Key)
  {
      // Shift hash left 4
    hash = (hash << 4);

      // Add in current char
    hash = hash + (*Key);

      // Get the four high-order bits
    unsigned bits = hash & 0xF0000000;

      // If any of the four bits are non-zero,
    if (bits)
    {
        // Shift the four bits right 24 positions (...bbbb0000)
        // and XOR them back in to the hash
      hash = hash ^ (bits >> 24);

        // Now, XOR the four bits back in (sets them all to 0)
      hash = hash ^ bits;
    }

      // Next char
    Key++;
  }

    // Modulo so hash is within the table
  return hash % TableSize;
}
Sample run (string, hash):
bat, 170
cat, 4
dat, 49
pam, 160
amp, 102
map, 28
tab, 118
tac, 119
tad, 120
DigiPen, 12
digipen, 154
DIGIPEN, 91
    
A "pseudo-universal" hash function (Program 14.2 in the Sedgewick book):
int UHash(const char *Key, int TableSize)
{
  int hash = 0;      // Initial value of hash
  int rand1 = 31415; // "Random" 1
  int rand2 = 27183; // "Random" 2

    // Process each char in string
  while (*Key)
  {
      // Multiply hash by random
    hash = hash * rand1;

      // Add in current char, keep within TableSize
    hash = (hash + *Key) % TableSize;

      // Update rand1 for next "random" number
    rand1 = (rand1 * rand2) % (TableSize - 1);

      // Next char
    Key++;
  }

    // Account for possible negative values
  if (hash < 0)
    hash = hash + TableSize;

    // Hash value is within 0 - (TableSize - 1)
  return hash;
}
Sample run (string, hash):
bat, 163
cat, 162
dat, 161
pam, 142
amp, 67
map, 148
tab, 127
tac, 128
tad, 129
DigiPen, 142
digipen, 64
DIGIPEN, 96
    
Comparisons (TableSize is 211):

  String     SimpleHash   RSHash   PJWHash   UHash
---------------------------------------------------
  bat            100         27       170      163
  cat            101        120         4      162
  dat            102          2        49      161
  pam            107         56       160      142
  amp            107        188       102       67
  map            107        202        28      148
  tab            100        206       118      127
  tac            101        207       119      128
  tad            102        208       120      129
  DigiPen         39        162        12      142
  digipen        103        140       154       64
  DIGIPEN         90        198        91       96
Other data:

Hash function performance and distribution

Choosing the "Right" Hash Table

Things to consider:

Probing with open-addressing:

Chaining with closed-addressing:

Final thoughts:

Interesting links:
  • This idea of randomization leads to randomized algorithms (used in skip lists) or Monte Carlo algorithms such as Buffon's Needles
  • Buffon.exe from efg's Computer Lab