11.5 ★ Perfect hashing

Although hashing is most often used for its excellent expected performance, hashing can be used to obtain excellent worst-case performance when the set of keys is static: once the keys are stored in the table, the set of keys never changes. Some applications naturally have static sets of keys: consider the set of reserved words in a programming language, or the set of file names on a CD-ROM. We call a hashing technique perfect hashing if the worst-case number of memory accesses required to perform a search is O(1).

The basic idea to create a perfect hashing scheme is simple. We use a two-level hashing scheme with universal hashing at each level. Figure 11.6 illustrates the approach.

Figure 11.6: Using perfect hashing to store the set K = {10, 22, 37, 40, 60, 70, 75}. The outer hash function is h(k) = ((ak + b) mod p) mod m, where a = 3, b = 42, p = 101, and m = 9. For example, h(75) = 2, so key 75 hashes to slot 2 of table T . A secondary hash table S_j stores all keys hashing to slot j . The size of hash table S_j is m_j , and the associated hash function is h_j (k) = ((a_j k + b_j) mod p) mod m_j. Since h₂(75) = 1, key 75 is stored in slot 1 of secondary hash table S₂. There are no collisions in any of the secondary hash tables, and so searching takes constant time in the worst case.

The first level is essentially the same as for hashing with chaining: the n keys are hashed into m slots using a hash function h carefully selected from a family of universal hash functions.

Instead of making a list of the keys hashing to slot j, however, we use a small secondary hash table S_j with an associated hash function h_j. By choosing the hash functions h_j carefully, we can guarantee that there are no collisions at the secondary level.

In order to guarantee that there are no collisions at the secondary level, however, we will need to let the size m_j of hash table S_j be the square of the number n_j of keys hashing to slot j. While having such a quadratic dependence of m_j on n_j may seem likely to cause the overall storage requirements to be excessive, we shall show that by choosing the first level hash function well, the expected total amount of space used is still O(n).

We use hash functions chosen from the universal classes of hash functions of Section 11.3.3. The first-level hash function is chosen from the class ℋ_p,m, where as in Section 11.3.3, p is a prime number greater than any key value. Those keys hashing to slot j are re-hashed into a secondary hash table S_j of size m_j using a hash function h_j chosen from the class .^[1]

We shall proceed in two steps. First, we shall determine how to ensure that the secondary tables have no collisions. Second, we shall show that the expected amount of memory used overall-for the primary hash table and all the secondary hash tables-is O(n).

Theorem 11.9

If we store n keys in a hash table of size m = n² using a hash function h randomly chosen from a universal class of hash functions, then the probability of there being any collisions is less than 1/2.

Proof There are pairs of keys that may collide; each pair collides with probability 1/m if h is chosen at random from a universal family ℋ of hash functions. Let X be a random variable that counts the number of collisions. When m = n², the expected number of collisions is

(Note that this analysis is similar to the analysis of the birthday paradox in Section 5.4.1.) Applying Markov's inequality (C.29), Pr{X ≥ t} ≤ E[X] /t, with t = 1 completes the proof.

In the situation described in Theorem 11.9, where m = n², it follows that a hash function h chosen at random from ℋ is more likely than not to have no collisions. Given the set K of n keys to be hashed (remember that K is static), it is thus easy to find a collision-free hash function h with a few random trials.

When n is large, however, a hash table of size m = n² is excessive. Therefore, we adopt the two-level hashing approach, and we use the approach of Theorem 11.9 only to hash the entries within each slot. An outer, or first-level, hash function h is used to hash the keys into m = n slots. Then, if n_j keys hash to slot j, a secondary hash table S_j of size is used to provide collision-free constant time lookup.

We now turn to the issue of ensuring that the overall memory used is O(n). Since the size m_j of the jth secondary hash table grows quadratically with the number n_j of keys stored, there is a risk that the overall amount of storage could be excessive.

If the first-level table size is m = n, then the amount of memory used is O(n) for the primary hash table, for the storage of the sizes m_j of the secondary hash tables, and for the storage of the parameters a_j and b_j defining the secondary hash functions h_j drawn from the class of Section 11.3.3 (except when n_j = 1 and we use a = b = 0). The following theorem and a corollary provide a bound on the expected combined sizes of all the secondary hash tables. A second corollary bounds the probability that the combined size of all the secondary hash tables is superlinear.

Theorem 11.10

If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, then

where n_j is the number of keys hashing to slot j.

Proof We start with the following identity, which holds for any nonnegative integer a:

(11.6)

We have

To evaluate the summation , we observe that it is just the total number of collisions. By the properties of universal hashing, the expected value of this summation is at most

since m = n. Thus,

Corollary 11.11

If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions and we set the size of each secondary hash table to for j = 0, 1, ..., m - 1, then the expected amount of storage required for all secondary hash tables in a perfect hashing scheme is less than 2n.

Proof Since for j = 0, 1, ..., m - 1, Theorem 11.10 gives

(11.7)

which completes the proof.

Corollary 11.12

If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions and we set the size of each secondary hash table to for j = 0, 1, ..., m - 1, then the probability that the total storage used for secondary hash tables exceeds 4n is less than 1/2.

Proof Again we apply Markov's inequality (C.29), Pr {X ≥ t} ≤ E [X] /t, this time to inequality (11.7), with and t = 4n:

From Corollary 11.12, we see that testing a few randomly chosen hash functions from the universal family will quickly yield one that uses a reasonable amount of storage.

Exercises 11.5-1: ⋆

Suppose that we insert n keys into a hash table of size m using open addressing and uniform hashing. Let p(n, m) be the probability that no collisions occur. Show that p(n, m) ≤ e^-n(n-1)/2m. (Hint: See equation (3.11).) Argue that when n exceeds , the probability of avoiding collisions goes rapidly to zero.

Problems 11-1: Longest-probe bound for hashing

A hash table of size m is used to store n items, with n ≤ m/2. Open addressing is used for collision resolution.

Assuming uniform hashing, show that for i = 1, 2, ..., n, the probability that the ith insertion requires strictly more than k probes is at most 2^-k.
Show that for i = 1, 2, ..., n, the probability that the ith insertion requires more than 2 lg n probes is at most 1/n².

Let the random variable X_i denote the number of probes required by the ith insertion. You have shown in part (b) that Pr {X_i > 2lg n} ≤ 1/n². Let the random variable X = max_1≤i≤n X_i denote the maximum number of probes required by any of the n insertions.

Show that Pr{X > 2lg n} ≤ 1/n.
Show that the expected length E[X] of the longest probe sequence is O(lg n).

Problems 11-2: Slot-size bound for chaining

Suppose that we have a hash table with n slots, with collisions resolved by chaining, and suppose that n keys are inserted into the table. Each key is equally likely to be hashed to each slot. Let M be the maximum number of keys in any slot after all the keys have been inserted. Your mission is to prove an O(lg n/lg lg n) upper bound on E[M], the expected value of M.

Argue that the probability Q_k that exactly k keys hash to a particular slot is given by
Let P_k be the probability that M = k, that is, the probability that the slot containing the most keys contains k keys. Show that P_k ≤ nQ_k.
Use Stirling's approximation, equation (3.17), to show that Q_k < e^k/k^k.
Show that there exists a constant c > 1 such that for k₀ = clg n/ lg lg n. Conclude that P_k < 1/n^2< for k ≥ k₀ = c lg n/ lg lg n.
Argue that

Conclude that E[M] = O(lg n/ lg lg n).

Problems 11-3: Quadratic probing

Suppose that we are given a key k to search for in a hash table with positions 0, 1, ..., m - 1, and suppose that we have a hash function h mapping the key space into the set {0, 1, ..., m - 1}. The search scheme is as follows.

Compute the value i ← h(k), and set j ← 0.
Probe in position i for the desired key k. If you find it, or if this position is empty, terminate the search.
Set j ← (j + 1) mod m and i ← (i + j) mod m, and return to step 2.

Assume that m is a power of 2.

Show that this scheme is an instance of the general "quadratic probing" scheme by exhibiting the appropriate constants c₁ and c₂ for equation (11.5).
Prove that this algorithm examines every table position in the worst case.

Problems 11-4: k-universal hashing and authentication

Let ℋ = {h} be a class of hash functions in which each h maps the universe U of keys to {0, 1, ..., m - 1}. We say that ℋ is k-universal if, for every fixed sequence of k distinct keys 〈x⁽¹⁾, x⁽²⁾, ..., x^(k)〉 and for any h chosen at random from ℋ, the sequence 〈h(x⁽¹⁾), h(x⁽²⁾), ..., h(x^(k))〉 is equally likely to be any of the m^k sequences of length k with elements drawn from {0, 1, ..., m - 1}.

Show that if ℋ is 2-universal, then it is universal.
Let U be the set of n-tuples of values drawn from Z_p, and let B = Z_p, where p is prime. For any n-tuple a = 〈a₀, a₁, ..., a_n-1〉 of values from Z_p and for any b ∈ Z_p, define the hash function h_a,b : U → B on an input n-tuple x = 〈x₀, x₁, ..., x_n-1〉 from U as

and let ℋ = {h_a,b}. Argue that ℋ is 2-universal.
Suppose that Alice and Bob agree secretly on a hash function h_a,b from a 2-universal family ℋ of hash functions. Later, Alice sends a message m to Bob over the Internet, where m ∈ U . She authenticates this message to Bob by also sending an authentication tag t = h_a,b(m), and Bob checks that the pair (m, t) he receives satisfies t = h_a,b(m). Suppose that an adversary intercepts (m, t) en route and tries to fool Bob by replacing the pair with a different pair (m', t'). Argue that the probability that the adversary succeeds in fooling Bob into accepting (m', t') is at most 1/p, no matter how much computing power the adversary has.

^[1]When n_j = m_j = 1, we don't really need a hash function for slot j; when we Choose a hash function h_a,b(k) = ((ak + b) mod p) mod m_j for such a slot, we just use a = b = 0.