Hash Tables
Hash functions
Positive integers
The most commonly used method for hashing integers is called **modular
hashing**: we choose the array size M to be prime and, for any positive
integer key k, compute the remainder when dividing k by M. This function
is very easy to compute (k % M, in Java) and is effective in dispersing the
keys evenly between 0 and M − 1. If M is not prime, it may be the case that not all of the bits of the key play a role, which amounts to missing an
opportunity to disperse the values evenly. For example, if the keys are
base-10 numbers and M is 10^k, then only the k least significant digits are
used.
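In code, this is a one-liner. A minimal sketch in Java, assuming an instance variable M that holds the (prime) table size:

private int hash(int key) {
    return key % M;   // M is the (prime) table size; the result lies between 0 and M-1 for positive keys
}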
Floating-point numbers
If the keys are real numbers between 0 and 1, we might just multiply by M and round off to the nearest integer to get an index between 0 and M − 1.
Although this approach is intuitive, it is defective because it gives more weight to the most significant bits of the keys; the least significant bits
play no role. One way to address this situation is to use modular hashing on the binary representation of the key (this is what Java does).
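For instance, here is a minimal sketch of modular hashing on the binary representation of a double key, assuming the same table-size variable M (the method name and the folding of the two halves are illustrative choices, not a prescribed API):

private int hash(double key) {
    long bits = Double.doubleToLongBits(key);   // 64-bit binary (IEEE 754) representation of the key
    int h = (int) (bits ^ (bits >>> 32));       // fold the high and low halves so all bits play a role
    return (h & 0x7fffffff) % M;                // strip the sign bit, then use modular hashing
}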
Strings
Modular hashing works for long keys such as strings, too: we simply treat them as huge integers.
int hash = 0;
for (int i = 0; i < s.length(); i++)
    hash = (R * hash + s.charAt(i)) % M;   // Horner's method: R is a small prime radix, M the table size
IN SUMMARY, WE HAVE THREE PRIMARY REQUIREMENTS in implementing a good
hash function for a given data type:
• It should be consistent—equal keys must produce the same hash value.
• It should be efficient to compute.
• It should uniformly distribute the set of keys.
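For a user-defined key type, one common way to meet these requirements is to combine the hash codes of the instance fields with a small prime multiplier. A sketch, using a hypothetical Transaction key type invented for illustration:

public final class Transaction {                       // hypothetical key type, for illustration only
    private final String who;
    private final double amount;

    public Transaction(String who, double amount) {
        this.who = who;
        this.amount = amount;
    }

    @Override
    public int hashCode() {                            // efficient, and uses every significant field
        int hash = 17;
        hash = 31 * hash + who.hashCode();
        hash = 31 * hash + Double.hashCode(amount);
        return hash;
    }

    @Override
    public boolean equals(Object other) {              // consistency: equal keys produce the same hash
        if (!(other instanceof Transaction)) return false;
        Transaction t = (Transaction) other;
        return who.equals(t.who) && amount == t.amount;
    }
}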
Software caching
If computing the hash code is expensive, it may be worthwhile to cache the
hash for each key. That is, we maintain an instance variable hash in the
key type that contains the value of hashCode() for each key object.
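A minimal sketch of such caching, using a hypothetical key type invented for illustration:

public final class CachedKey {
    private final String name;   // the key's data (a hypothetical field for this sketch)
    private int hash;            // cached value of hashCode(); 0 means "not yet computed"

    public CachedKey(String name) { this.name = name; }

    @Override
    public int hashCode() {
        if (hash == 0)                     // compute at most once (recomputed only if the hash happens to be 0)
            hash = name.hashCode();
        return hash;
    }

    @Override
    public boolean equals(Object other) {  // equal keys must still produce the same hash value
        return other instanceof CachedKey && name.equals(((CachedKey) other).name);
    }
}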
collision resolution
separate chaining
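With separate chaining, we keep, for each of the M table indices, a linked list of the key-value pairs whose keys hash to that index. A minimal sketch, assuming a SequentialSearchST list type (the one the Memory discussion below refers to) with get() and put() methods, and a fixed number of chains:

public class SeparateChainingHashST<Key, Value> {
    private int M = 997;                               // number of chains (fixed here; resizing omitted)
    private SequentialSearchST<Key, Value>[] st;       // array of linked-list symbol tables

    public SeparateChainingHashST() {
        st = (SequentialSearchST<Key, Value>[]) new SequentialSearchST[M];
        for (int i = 0; i < M; i++)
            st[i] = new SequentialSearchST<Key, Value>();
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;      // strip the sign bit, then modular hashing
    }

    public Value get(Key key)           { return st[hash(key)].get(key); }
    public void put(Key key, Value val) { st[hash(key)].put(key, val);   }
}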
linear probing
The simplest open-addressing method is called linear probing: when there is a collision (when we hash to a table index that is already occupied with a key different from the search key), we just check the next entry in the table (by incrementing the index). Linear probing is characterized by three possible outcomes:
• Key equal to search key: search hit
• Empty position (null key at indexed position): search miss
• Key not equal to search key: try next entry
The essential idea behind hashing with open addressing is this: rather than using memory space for references in linked lists, we use it for the empty entries in the hash table, which mark the ends of probe sequences.
We implement the table with parallel arrays, one for the keys and one for the values, and use the hash function as an index to access the data as just discussed.
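Putting these pieces together, here is a minimal sketch of search and insert with linear probing (field and method names are assumptions, the table size is fixed, and resizing is omitted):

public class LinearProbingHashST<Key, Value> {
    private int M = 16;                                // table size (fixed here; resizing omitted)
    private int N;                                     // number of key-value pairs in the table
    private Key[]   keys = (Key[])   new Object[M];   // parallel arrays: the keys...
    private Value[] vals = (Value[]) new Object[M];   // ...and the associated values

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M)   // probe until an empty position...
            if (keys[i].equals(key)) { vals[i] = val; return; } // ...unless we hit an equal key first
        keys[i] = key;                                          // empty position: insert the new pair
        vals[i] = val;
        N++;
    }

    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) return vals[i];            // key equal to search key: search hit
        return null;                                            // empty position: search miss
    }
}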
AS WITH SEPARATE CHAINING, the performance of hashing with open addressing depends on the ratio α = N / M, but we interpret it differently. We refer to α as the load factor of a hash table. For separate chaining, α is the average number of keys per list and is generally larger than 1; for linear probing, α is the percentage of table entries that are occupied; it cannot be greater than 1. In fact, we cannot let the load factor reach 1 (completely full table) in LinearProbingHashST because a search miss would go into an infinite loop in a full table. Indeed, for the sake of good performance, we use array resizing to guarantee that the load factor is between one-eighth and one-half. This strategy is validated by mathematical analysis, which we consider before we discuss implementation details.
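A sketch of how that resizing discipline might be wired into the linear-probing sketch above, assuming a constructor that takes the desired table size:

// In put(), before inserting:   if (N >= M/2) resize(2*M);          keep the table at most half full
// In delete(), after removing:  if (N > 0 && N <= M/8) resize(M/2);  keep it at least one-eighth full
private void resize(int capacity) {
    LinearProbingHashST<Key, Value> temp = new LinearProbingHashST<Key, Value>(capacity);
    for (int i = 0; i < M; i++)
        if (keys[i] != null)
            temp.put(keys[i], vals[i]);   // every key is rehashed into the new, differently sized table
    keys = temp.keys;
    vals = temp.vals;
    M    = temp.M;
}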
Clustering
The average cost of linear probing depends on the way in which the entries clump together into contiguous groups of occupied table entries, called clusters, when they are inserted.
Memory
Not counting the memory for keys and values,
our implementation SeparateChainingHashST uses memory for M
references to SequentialSearchST objects plus M
SequentialSearchST objects. Each SequentialSearchST object
has the usual 16 bytes of object overhead plus one 8-byte reference
(first), and there are a total of N Node objects, each with 24 bytes of
object overhead plus 3 references (key, value, and next). This compares
with an extra reference per node for binary search trees. With array
resizing to ensure that the table is between one-eighth and one-half full,
linear probing uses between 4N and 16N references.
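Assuming 8-byte references, these counts tally to roughly 8M + 24M + 48N = 48N + 32M bytes for separate chaining, and the 4N to 16N references for linear probing correspond to between 32N and 128N bytes, again not counting the memory for the keys and values themselves.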
Still, hashing is not a panacea, for several reasons, including:
• A good hash function for each type of key is required.
• The performance guarantee depends on the quality of the hash
function.
• Hash functions can be difficult and expensive to compute.
• Ordered symbol-table operations are not easily supported.