Hash Tables
Hash functions
Positive integers
The most commonly used method for hashing integers is called **modular
hashing**: we choose the array size M to be prime and, for any positive
integer key k, compute the remainder when dividing k by M. This function
is very easy to compute (k % M, in Java) and is effective in dispersing the
keys evenly between 0 and M − 1. If M is not prime, it may be the case that not all of the bits of the key play a role, which amounts to missing an
opportunity to disperse the values evenly. For example, if the keys are
base-10 numbers and M is 10^k, then only the k least significant digits are
used.
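In code, this is a one-liner. A minimal sketch in Java, assuming an instance variable M that holds the (prime) table size:

private int hash(int key) {
    return key % M;   // M is the (prime) table size; the result lies between 0 and M-1 for positive keys
}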
Floating-point numbers
If the keys are real numbers between 0 and 1, we might just multiply by M and round off to the nearest integer to get an index between 0 and M − 1.
Although this approach is intuitive, it is defective because it gives more weight to the most significant bits of the keys; the least significant bits
play no role. One way to address this situation is to use modular hashing on the binary representation of the key (this is what Java does).
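For instance, here is a minimal sketch of modular hashing on the binary representation of a double key, assuming the same table-size variable M (the method name and the folding of the two halves are illustrative choices, not a prescribed API):

private int hash(double key) {
    long bits = Double.doubleToLongBits(key);   // 64-bit binary (IEEE 754) representation of the key
    int h = (int) (bits ^ (bits >>> 32));       // fold the high and low halves so all bits play a role
    return (h & 0x7fffffff) % M;                // strip the sign bit, then use modular hashing
}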
Strings
Modular hashing works for long keys such as strings, too: we simply treat them as huge integers.
int hash = 0;
for (int i = 0; i < s.length(); i++)
    hash = (R * hash + s.charAt(i)) % M;   // Horner's method: R is a small prime radix, M the table size
IN SUMMARY, WE HAVE THREE PRIMARY REQUIREMENTS in implementing a good
hash function for a given data type:
• It should be consistent—equal keys must produce the same hash value.
• It should be efficient to compute.
• It should uniformly distribute the set of keys.
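For a user-defined key type, one common way to meet these requirements is to combine the hash codes of the instance fields with a small prime multiplier. A sketch, using a hypothetical Transaction key type invented for illustration:

public final class Transaction {                       // hypothetical key type, for illustration only
    private final String who;
    private final double amount;

    public Transaction(String who, double amount) {
        this.who = who;
        this.amount = amount;
    }

    @Override
    public int hashCode() {                            // efficient, and uses every significant field
        int hash = 17;
        hash = 31 * hash + who.hashCode();
        hash = 31 * hash + Double.hashCode(amount);
        return hash;
    }

    @Override
    public boolean equals(Object other) {              // consistency: equal keys produce the same hash
        if (!(other instanceof Transaction)) return false;
        Transaction t = (Transaction) other;
        return who.equals(t.who) && amount == t.amount;
    }
}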
Software caching
If computing the hash code is expensive, it may be worthwhile to cache the
hash for each key. That is, we maintain an instance variable hash in the
key type that contains the value of hashCode() for each key object.
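A minimal sketch of such caching, using a hypothetical key type invented for illustration:

public final class CachedKey {
    private final String name;   // the key's data (a hypothetical field for this sketch)
    private int hash;            // cached value of hashCode(); 0 means "not yet computed"

    public CachedKey(String name) { this.name = name; }

    @Override
    public int hashCode() {
        if (hash == 0)                     // compute at most once (recomputed only if the hash happens to be 0)
            hash = name.hashCode();
        return hash;
    }

    @Override
    public boolean equals(Object other) {  // equal keys must still produce the same hash value
        return other instanceof CachedKey && name.equals(((CachedKey) other).name);
    }
}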
collision resolution
separate chaining
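With separate chaining, we keep, for each of the M table indices, a linked list of the key-value pairs whose keys hash to that index. A minimal sketch, assuming a SequentialSearchST list type (the one the Memory discussion below refers to) with get() and put() methods, and a fixed number of chains:

public class SeparateChainingHashST<Key, Value> {
    private int M = 997;                               // number of chains (fixed here; resizing omitted)
    private SequentialSearchST<Key, Value>[] st;       // array of linked-list symbol tables

    public SeparateChainingHashST() {
        st = (SequentialSearchST<Key, Value>[]) new SequentialSearchST[M];
        for (int i = 0; i < M; i++)
            st[i] = new SequentialSearchST<Key, Value>();
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;      // strip the sign bit, then modular hashing
    }

    public Value get(Key key)           { return st[hash(key)].get(key); }
    public void put(Key key, Value val) { st[hash(key)].put(key, val);   }
}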
linear probing
The simplest open-addressing method is called linear probing: when there is a collision (when we hash to a table index that is already occupied with a key different from the search key), we just check the next entry in the table (by incrementing the index). Linear probing is characterized by three possible outcomes:
• Key equal to search key: search hit
• Empty position (null key at indexed position): search miss
• Key not equal to search key: try next entry
The essential idea behind hashing with open addressing is this: rather than using memory space for references in linked lists, we use it for the empty entries in the hash table, which mark the ends of probe sequences.
We implement the table with parallel arrays, one for the keys and one for the values, and use the hash function as an index to access the data as just discussed.
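Putting these pieces together, here is a minimal sketch of search and insert with linear probing (field and method names are assumptions, the table size is fixed, and resizing is omitted):

public class LinearProbingHashST<Key, Value> {
    private int M = 16;                                // table size (fixed here; resizing omitted)
    private int N;                                     // number of key-value pairs in the table
    private Key[]   keys = (Key[])   new Object[M];   // parallel arrays: the keys...
    private Value[] vals = (Value[]) new Object[M];   // ...and the associated values

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M)   // probe until an empty position...
            if (keys[i].equals(key)) { vals[i] = val; return; } // ...unless we hit an equal key first
        keys[i] = key;                                          // empty position: insert the new pair
        vals[i] = val;
        N++;
    }

    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) return vals[i];            // key equal to search key: search hit
        return null;                                            // empty position: search miss
    }
}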
AS WITH SEPARATE CHAINING, the performance of hashing with open addressing depends on the ratio α = N / M, but we interpret it differently. We refer to α as the load factor of a hash table. For separate chaining, α is the average number of keys per list and is generally larger than 1; for linear probing, α is the percentage of table entries that are occupied; it cannot be greater than 1. In fact, we cannot let the load factor reach 1 (completely full table) in LinearProbingHashST because a search miss would go into an infinite loop in a full table. Indeed, for the sake of good performance, we use array resizing to guarantee that the load factor is between one-eighth and one-half. This strategy is validated by mathematical analysis, which we consider before we discuss implementation details.
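A sketch of how that resizing discipline might be wired into the linear-probing sketch above, assuming a constructor that takes the desired table size:

// In put(), before inserting:   if (N >= M/2) resize(2*M);          keep the table at most half full
// In delete(), after removing:  if (N > 0 && N <= M/8) resize(M/2);  keep it at least one-eighth full
private void resize(int capacity) {
    LinearProbingHashST<Key, Value> temp = new LinearProbingHashST<Key, Value>(capacity);
    for (int i = 0; i < M; i++)
        if (keys[i] != null)
            temp.put(keys[i], vals[i]);   // every key is rehashed into the new, differently sized table
    keys = temp.keys;
    vals = temp.vals;
    M    = temp.M;
}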
Clustering
The average cost of linear probing depends on the way in which the entries clump together into contiguous groups of occupied table entries, called clusters, when they are inserted.
Memory
Not counting the memory for keys and values,
our implementation SeparateChainingHashST uses memory for M
references to SequentialSearchST objects plus M
SequentialSearchST objects. Each SequentialSearchST object
has the usual 16 bytes of object overhead plus one 8-byte reference
(first), and there are a total of N Node objects, each with 24 bytes of
object overhead plus 3 references (key, value, and next). This compares
with an extra reference per node for binary search trees. With array
resizing to ensure that the table is between one-eighth and one-half full,
linear probing uses between 4N and 16N references.
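Assuming 8-byte references, these counts tally to roughly 8M + 24M + 48N = 48N + 32M bytes for separate chaining, and the 4N to 16N references for linear probing correspond to between 32N and 128N bytes, again not counting the memory for the keys and values themselves.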
Still, hashing is not a panacea, for several reasons, including:
• A good hash function for each type of key is required.
• The performance guarantee depends on the quality of the hash
function.
• Hash functions can be difficult and expensive to compute.
• Ordered symbol-table operations are not easily supported.