CSC 3XX - Lecture Notes: Hashing and Hash Tables
Topic: Hashing and Hash Tables
Date: October 26, 2023
1. Introduction to Hashing
Hashing is a technique used to convert a large key (e.g., a string, a large integer, an object)
into a small, fixed-size integer value, which typically serves as an index in an array. The
primary goal of hashing is to enable very efficient data storage and retrieval operations,
ideally achieving constant average time complexity (O(1)).
The core data structure that leverages hashing is called a Hash Table (also known as a Hash
Map or Dictionary). It's an array-based data structure that stores key-value pairs, allowing for
quick lookups, insertions, and deletions based on the key.
Why Hashing? Consider a scenario where you need to store and quickly retrieve information
about millions of users based on their unique email addresses. If you used a simple array or
linked list, searching would be O(N). A balanced binary search tree could achieve O(log N).
Hashing aims to beat even O(log N) on average.
2. Key Concepts
2.1. The Hash Function
A hash function is the heart of any hash table. It takes an input key and returns an integer,
which is then mapped to an index within the hash table's underlying array.
Properties of a Good Hash Function:
1. Deterministic: The same input key must always produce the same hash value.
2. Fast Computation: The function should be efficient to compute, as it's called for every
operation (insert, search, delete).
3. Uniform Distribution: It should distribute keys as evenly as possible across the entire
range of possible hash values. This minimizes collisions and ensures good performance.
4. Low Collision Rate: While collisions are inevitable, a good hash function minimizes
their frequency.
Example Hash Functions:
• Simple Modulo (for integers): If keys are integers, a common approach is
hash_value = key % M, where M is the size of the hash table's array.
○ Problem: If keys have a common factor with M, or are clustered, distribution can
be poor.
○ Better: Choose M as a prime number.
• For Strings (Polynomial Rolling Hash): hash(S) = (S[0] * p^(k-1) + S[1]
* p^(k-2) + ... + S[k-1] * p^0) % M where S is the string, k is its length, p
is a prime number (e.g., 31 or 37 for lowercase English letters), and M is the table size.
This treats the string as a number in base p.
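As a concrete illustration, here is a small Python sketch of the polynomial rolling hash above, using Horner's rule so that p^k never has to be computed explicitly (the choices p = 31 and M = 101 are illustrative, not prescribed by these notes):

```python
def poly_hash(s: str, p: int = 31, M: int = 101) -> int:
    """Polynomial rolling hash: treat s as a number in base p, modulo M."""
    h = 0
    for ch in s:
        # Horner's rule: h accumulates S[0]*p^(k-1) + ... + S[k-1]*p^0 (mod M)
        h = (h * p + ord(ch)) % M
    return h
```

Because the reduction modulo M happens at every step, intermediate values stay small even for very long strings.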
2.2. The Hash Table Structure
A hash table fundamentally consists of:
1. An Array (Buckets): This is the main storage space. Each position in the array is called
a "bucket" or "slot."
2. A Mechanism for Collision Resolution: Since multiple keys can map to the same
bucket (a collision), a strategy is needed to handle this.
Basic Operations and Time Complexity:
• insert(key, value):
1. Compute index = hash_function(key) % M.
2. Place the (key, value) pair at index, handling collisions.
○ Average: O(1)
○ Worst: O(N) (if all keys hash to the same bucket)
• search(key):
1. Compute index = hash_function(key) % M.
2. Look for key at index, handling collisions.
○ Average: O(1)
○ Worst: O(N)
• delete(key):
1. Compute index = hash_function(key) % M.
2. Find and remove key at index, handling collisions.
○ Average: O(1)
○ Worst: O(N)
3. Collision Resolution Strategies
Collisions are inevitable due to the Pigeonhole Principle (mapping a larger set of possible
keys to a smaller set of array indices). Two main approaches exist:
3.1. Separate Chaining
In separate chaining, each bucket in the hash table array doesn't directly store an item, but
rather a reference to a data structure (typically a linked list, but can be a dynamic array or
even another hash table) that holds all key-value pairs that hash to that specific bucket.
• How it works:
1. To insert(key, value): Compute index = hash(key) % M. Add the
(key, value) pair to the linked list at table[index].
2. To search(key): Compute index. Traverse the linked list at table[index]
until the key is found or the list ends.
3. To delete(key): Compute index. Traverse the linked list at table[index]
and remove the node containing the key.
• Advantages:
○ Relatively simple to implement.
○ The hash table can never truly "fill up" (can always add more elements to linked
lists).
○ Performance degrades gracefully as the load factor increases.
○ Deletion is straightforward.
• Disadvantages:
○ Requires extra space for pointers in the linked lists.
○ Can suffer from poor cache performance if linked lists become long, as nodes
might not be contiguous in memory.
Example (Separate Chaining): Hash table size M = 5. Hash function hash(key) = key
% 5. Keys to insert: 10, 22, 5, 15, 7, 12
• 10 % 5 = 0: table[0] -> [10]
• 22 % 5 = 2: table[2] -> [22]
• 5 % 5 = 0: table[0] -> [10] -> [5]
• 15 % 5 = 0: table[0] -> [10] -> [5] -> [15]
• 7 % 5 = 2: table[2] -> [22] -> [7]
• 12 % 5 = 2: table[2] -> [22] -> [7] -> [12]
Resulting table:
table[0] -> [10] -> [5] -> [15]
table[1] -> NULL
table[2] -> [22] -> [7] -> [12]
table[3] -> NULL
table[4] -> NULL
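The worked example above can be reproduced with a minimal separate-chaining table in Python (the class name and the use of plain Python lists as chains are illustrative choices; for small integers Python's hash(key) equals key, so hash(key) % 5 matches key % 5):

```python
class ChainedHashTable:
    """Separate chaining: each bucket holds a list of (key, value) pairs."""
    def __init__(self, M: int = 5):
        self.M = M
        self.buckets = [[] for _ in range(M)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.M]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # append to the end of the chain

    def search(self, key):
        for k, v in self.buckets[hash(key) % self.M]:
            if k == key:
                return v
        return None                      # key not in the table

    def delete(self, key):
        bucket = self.buckets[hash(key) % self.M]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return True
        return False
```

Inserting 10, 22, 5, 15, 7, 12 into a table of size 5 produces exactly the chains shown above.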
3.2. Open Addressing
In open addressing, all elements are stored directly within the hash table array. When a
collision occurs, the algorithm "probes" for an alternative empty slot in the table using a
specific sequence.
• How it works:
1. To insert(key, value): Compute index = hash(key) % M. If
table[index] is empty, place the item there. If not, systematically search for
the next available empty slot.
2. To search(key): Compute index. If table[index] contains the key, return
it. If table[index] is occupied by a different key, follow the same probing
sequence as insertion until the key is found or an empty slot is encountered
(meaning the key is not in the table).
3. To delete(key): This is more complex. Simply removing an item can break the
search chain for other items that were placed later due to that item's presence. A
common solution is to mark the slot as "deleted" (a "tombstone") instead of truly
emptying it. Searches can then continue past the deleted slot, while insertions
may reuse tombstone slots; however, accumulated tombstones can degrade
performance over time until the table is rehashed.
• Advantages:
○ Better cache performance because elements are stored contiguously in memory.
○ No overhead for storing pointers.
• Disadvantages:
○ Sensitive to the load factor (table can become full).
○ Deletion is more complex.
○ Can suffer from clustering, where occupied slots form blocks, increasing probe
sequence lengths.
Probing Strategies for Open Addressing:
1. Linear Probing:
• Probe sequence: (hash(key) + i) % M, for i = 0, 1, 2, ...
• If table[index] is occupied, try table[(index + 1) % M], then
table[(index + 2) % M], and so on.
• Problem: Primary Clustering – long runs of occupied slots form, making future
insertions and searches take longer.
2. Quadratic Probing:
• Probe sequence: (hash(key) + i^2) % M, for i = 0, 1, 2, ...
• Helps reduce primary clustering.
• Problem: Secondary Clustering – keys that hash to the same initial index follow
the exact same quadratic probe sequence, still causing clusters. Also, not all slots
may be reachable unless M is chosen carefully; a standard guarantee is that if M is
prime and the table is less than half full, quadratic probing always finds an empty
slot.
3. Double Hashing:
• Probe sequence: (hash1(key) + i * hash2(key)) % M, for i = 0, 1, 2,
...
• Uses two hash functions: hash1 for the initial position, and hash2 for the step
size in the probing sequence.
• hash2(key) must never return zero, and its values should be relatively prime
to M (automatic when M is prime) so the probe sequence can reach every slot. A
common choice: hash2(key) = R - (key % R), where R is a prime number
smaller than M.
• Effectively eliminates both primary and secondary clustering by providing unique
probe sequences for each key.
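A short sketch of the double-hashing probe sequence, using the hash2 form suggested above (M = 11 and R = 7 are illustrative primes, not values fixed by these notes):

```python
def double_hash_probes(key: int, M: int = 11, R: int = 7):
    """Yield the probe sequence (hash1(key) + i * hash2(key)) % M."""
    h1 = key % M            # hash1: initial position
    h2 = R - (key % R)      # hash2: step size, always in 1..R, never zero
    for i in range(M):
        yield (h1 + i * h2) % M
```

Because M is prime, the step h2 is relatively prime to M, so the M probes visit every slot exactly once.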
Example (Linear Probing): Hash table size M = 5. Hash function hash(key) = key % 5.
Keys to insert: 10, 22, 5, 15, 7
• 10 % 5 = 0: table[0] = 10
• 22 % 5 = 2: table[2] = 22
• 5 % 5 = 0: table[0] is occupied. Try (0+1)%5 = 1. table[1] = 5
• 15 % 5 = 0: table[0] is occupied. Try (0+1)%5 = 1. table[1] is occupied. Try
(0+2)%5 = 2. table[2] is occupied. Try (0+3)%5 = 3. table[3] = 15
• 7 % 5 = 2: table[2] is occupied. Try (2+1)%5 = 3. table[3] is occupied. Try
(2+2)%5 = 4. table[4] = 7
Resulting table:
table[0] = 10
table[1] = 5
table[2] = 22
table[3] = 15
table[4] = 7
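The linear-probing example above can be replayed with a short Python sketch (a plain list of length M = 5 stands in for the table, with None marking an empty slot):

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing: try (hash + i) % M for i = 0, 1, 2, ..."""
    M = len(table)
    index = key % M
    for i in range(M):
        slot = (index + i) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

table = [None] * 5
for key in [10, 22, 5, 15, 7]:
    linear_probe_insert(table, key)
# table is now [10, 5, 22, 15, 7], matching the worked example
```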
4. Load Factor and Resizing (Rehashing)
The load factor (α) is a crucial metric for hash table performance. It is defined as
α = N / M, where N is the number of items currently stored in the hash table and M
is the total number of buckets (array size).
• Impact: A higher load factor means more collisions and longer collision resolution
chains/probes, degrading performance from O(1) towards O(N).
• Thresholds:
○ For separate chaining, α can exceed 1, but typically performance starts degrading
significantly above α = 1 or 2.
○ For open addressing, α must always be less than 1. Typically, a threshold of α =
0.5 to 0.7 is used before resizing.
Resizing (Rehashing): When the load factor exceeds a predefined threshold, the hash table
needs to be resized to maintain good performance. This involves:
1. Creating a new, larger array (e.g., double the size of the old array, and often choosing
a new prime number for M).
2. Rehashing all existing key-value pairs from the old table into the new table using the
new M. This is necessary because the modulo operation key % M will produce different
indices with a new M.
3. Discarding the old table.
• Cost: Resizing is an O(N) operation, as every item must be re-inserted.
• Amortized Analysis: While a single resize is expensive, if the table grows by a constant
factor, the cost of resizing is "amortized" over many insertions, resulting in an average
O(1) cost per insertion over a sequence of operations.
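The rehashing step can be sketched for a separate-chaining table as follows (the function name and the representation of the table as a list of bucket lists are illustrative choices):

```python
def rehash(old_table, new_M):
    """Move every (key, value) pair into a new, larger array of buckets.
    Each key gets a fresh index because key % M changes when M changes;
    the total work is O(N) over all stored items."""
    new_table = [[] for _ in range(new_M)]
    for bucket in old_table:
        for key, value in bucket:
            new_table[hash(key) % new_M].append((key, value))
    return new_table
```

In practice the new size is typically about double the old one, often rounded up to a prime.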
5. Applications of Hash Tables
Hash tables are one of the most widely used and versatile data structures in computer science
due to their average O(1) performance.
• Symbol Tables in Compilers/Interpreters: Store information about variables,
functions, and other identifiers for quick lookup during parsing and compilation.
• Database Indexing: Used to quickly locate records based on a key (e.g., primary key).
• Caches: Web browsers, CPU caches, and other caching systems use hash tables to store
frequently accessed data for fast retrieval.
• Implementing Set and Dictionary/Map Data Types: Most programming languages
(Python's dict, Java's HashMap, C++'s std::unordered_map) implement these
using hash tables.
• Routing Tables: Used in network routers to map IP addresses to outbound interfaces.
• Checksums and Data Integrity: While distinct from hash functions for data structures,
cryptographic hash functions generate fixed-size "fingerprints" for data blocks to detect
tampering.
• Spell Checkers: Store a dictionary of valid words for quick lookup.
• Graph Algorithms: Can be used to store visited nodes or edges efficiently.
6. Conclusion
Hashing and hash tables are powerful tools for achieving near-constant time complexity for
data storage and retrieval. Understanding hash functions, collision resolution strategies
(separate chaining, open addressing with its probing variants), and the concept of load factor
and resizing are fundamental to designing efficient and reliable software systems. While the
worst-case time complexity can be O(N), a well-designed hash table with good hash functions
and proper load factor management ensures excellent average-case performance in practice.