What is Hashing?
Hashing is the process of mapping large amount of data item to smaller
table with the help of hashing function.
Hashing is also known as Hashing Algorithm or Message Digest
Function.
It is a technique to convert a range of key values into a range of indexes of
an array.
It is used to facilitate the next level searching method when compared with
the linear or binary search.
Hashing allows to update and retrieve any data entry in a constant time
O(1).
Constant time O(1) means the operation does not depend on the size of the
data.
Hashing is used with a database to enable items to be retrieved more
quickly.
It is used in the encryption and decryption of digital signatures.
What is Hash Function?
A fixed process converts a key to a hash key is known as a Hash Function.
This function takes a key and maps it to a value of a certain length which is
called a Hash value or Hash.
Hash value represents the original string of characters, but it is normally
smaller than the original.
It transfers the digital signature and then both hash value and signature are
sent to the receiver. Receiver uses the same hash function to generate the
hash value and then compares it to that received with the message.
If the hash values are same, the message is transmitted without errors.
What is Hash Table?
Hash table or hash map is a data structure used to store key-value pairs.
It is a collection of items stored to make it easy to find them later.
It uses a hash function to compute an index into an array of buckets or slots
from which the desired value can be found.
It is an array of list where each list is known as bucket.
It contains value based on the key.
Hash table is used to implement the map interface and extends Dictionary
class.
Hash table is synchronized and contains only unique elements.
The above figure shows the hash table with the size of n = 10. Each position
of the hash table is called as Slot. In the above hash table, there are n slots
in the table, names = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Slot 0, slot 1, slot 2 and
so on. Hash table contains no items, so every slot is empty.
As we know the mapping between an item and the slot where item belongs
in the hash table is called the hash function. The hash function takes any
item in the collection and returns an integer in the range of slot names
between 0 to n-1.
Suppose we have integer items {26, 70, 18, 31, 54, 93}. One common
method of determining a hash key is the division method of hashing and the
formula is :
Hash Key = Key Value % Number of Slots in the Table
Division method or reminder method takes an item and divides it by the
table size and returns the remainder as its hash value.
Data Item Value % No. of Slots Hash Value
26 26 % 10 = 6 6
70 70 % 10 = 0 0
18 18 % 10 = 8 8
31 31 % 10 = 1 1
54 54 % 10 = 4 4
93 93 % 10 = 3 3
After computing the hash values, we can insert each item into the hash
table at the designated position as shown in the above figure. In the hash
table, 6 of the 10 slots are occupied, it is referred to as the load factor and
denoted by, λ = No. of items / table size. For example , λ = 6/10.
It is easy to search for an item using hash function where it computes the
slot name for the item and then checks the hash table to see if it is present.
Constant amount of time O(1) is required to compute the hash value and
index of the hash table at that location.
Collision in Hashing-
In hashing,
Hash function is used to compute the hash value for a key.
Hash value is then used as an index to store the key in the hash table.
Hash function may return the same hash value for two or more keys.
When the hash value of a key maps to an already occupied bucket of the hash table,
it is called as a Collision.
Collision Resolution Techniques-
Collision Resolution Techniques are the techniques used for resolving or handling the collision.
Collision resolution techniques are classified as-
1. Separate Chaining
2. Open Addressing
Separate Chaining-
To handle the collision,
This technique creates a linked list to the slot for which collision occurs.
The new key is then inserted in the linked list.
These linked lists to the slots appear like chains.
That is why, this technique is called as separate chaining.
Time Complexity-
For Searching-
In worst case, all the keys might map to the same bucket of the hash table.
In such a case, all the keys will be present in a single linked list.
Sequential search will have to be performed on the linked list to perform the search.
So, time taken for searching in worst case is O(n).
For Deletion-
In worst case, the key might have to be searched first and then deleted.
In worst case, time taken for searching is O(n).
So, time taken for deletion in worst case is O(n).
Load Factor (α)-
Load factor (α) is defined as-
If Load factor (α) = constant, then time complexity of Insert, Search, Delete = Θ(1)
PRACTICE PROBLEM BASED ON SEPARATE
CHAINING-
Problem-
Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101
Use separate chaining technique for collision resolution.
Solution-
The given sequence of keys will be inserted in the hash table as-
Step-01:
Draw an empty hash table.
For the given hash function, the possible range of hash values is [0, 6].
So, draw an empty hash table consisting of 7 buckets as-
Step-02:
Insert the given keys in the hash table one by one.
The first key to be inserted in the hash table = 50.
Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
So, key 50 will be inserted in bucket-1 of the hash table as-
Step-03:
The next key to be inserted in the hash table = 700.
Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
So, key 700 will be inserted in bucket-0 of the hash table as-
Step-04:
The next key to be inserted in the hash table = 76.
Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
So, key 76 will be inserted in bucket-6 of the hash table as-
Step-05:
The next key to be inserted in the hash table = 85.
Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
Since bucket-1 is already occupied, so collision occurs.
Separate chaining handles the collision by creating a linked list to bucket-1.
So, key 85 will be inserted in bucket-1 of the hash table as-
Step-06:
The next key to be inserted in the hash table = 92.
Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
Since bucket-1 is already occupied, so collision occurs.
Separate chaining handles the collision by creating a linked list to bucket-1.
So, key 92 will be inserted in bucket-1 of the hash table as-
Step-07:
The next key to be inserted in the hash table = 73.
Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
So, key 73 will be inserted in bucket-3 of the hash table as-
Step-08:
The next key to be inserted in the hash table = 101.
Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
Since bucket-3 is already occupied, so collision occurs.
Separate chaining handles the collision by creating a linked list to bucket-3.
So, key 101 will be inserted in bucket-3 of the hash table as-
Collision Resolution Techniques-
We have discussed-
Hashing is a well-known searching technique.
Collision occurs when hash value of the new key maps to an occupied bucket of the hash
table.
Collision resolution techniques are classified as-
Open Addressing-
In open addressing,
Unlike separate chaining, all the keys are stored inside the hash table.
No key is stored outside the hash table.
Techniques used for open addressing are-
Linear Probing
Quadratic Probing
Double Hashing
Operations in Open Addressing-
Let us discuss how operations are performed in open addressing-
Insert Operation-
Hash function is used to compute the hash value for a key to be inserted.
Hash value is then used as an index to store the key in the hash table.
In case of collision,
Probing is performed until an empty bucket is found.
Once an empty bucket is found, the key is inserted.
Probing is performed in accordance with the technique used for open addressing.
Search Operation-
To search any particular key,
Its hash value is obtained using the hash function used.
Using the hash value, that bucket of the hash table is checked.
If the required key is found, the key is searched.
Otherwise, the subsequent buckets are checked until the required key or an empty bucket is
found.
The empty bucket indicates that the key is not present in the hash table.
Delete Operation-
The key is first searched and then deleted.
After deleting the key, that particular bucket is marked as “deleted”.
NOTE-
During insertion, the buckets marked as “deleted” are treated like any other empty bucket.
During searching, the search is not terminated on encountering the bucket marked as
“deleted”.
The search terminates only after the required key or an empty bucket is found.
Open Addressing Techniques-
Techniques used for open addressing are-
Take the above example, if we insert next item 40 in our collection, it would
have a hash value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, it
becomes a problem. This problem is called as Collision or Clash. Collision
creates a problem for hashing technique.
Linear probing is used for resolving the collisions in hash table, data
structures for maintaining a collection of key-value pairs.
Linear probing was invented by Gene Amdahl, Elaine M. McGraw and Arthur
Samuel in 1954 and analyzed by Donald Knuth in 1963.
It is a component of open addressing scheme for using a hash table to solve
the dictionary problem.
The simplest method is called Linear Probing. Formula to compute linear
probing is:
P = (1 + P) % (MOD) Table_size
For example,
If we insert next item 40 in our collection, it would have a hash value of 0
(40 % 10 = 0). But 70 also had a hash value of 0, it becomes a problem.
Linear probing solves this problem:
P = H(40)
44 % 10 = 0
Position 0 is occupied by 70. so we look elsewhere for a position to store 40.
Using Linear Probing:
P= (P + 1) % table-size
0 + 1 % 10 = 1
But, position 1 is occupied by 31, so we look elsewhere for a position to store
40.
Using linear probing, we try next position : 1 + 1 % 10 = 2
Position 2 is empty, so 40 is inserted there.
In linear probing,
When collision occurs, we linearly probe for the next bucket.
We keep probing until an empty bucket is found.
Advantage-
It is easy to compute.
Disadvantage-
The main problem with linear probing is clustering.
Many consecutive elements form groups.
Then, it takes time to search an element or to find an empty bucket.
Time Complexity-
Worst time to search an element in linear probing is O (table size).
This is because-
Even if there is only one element present and all other elements are deleted.
Then, “deleted” markers present in the hash table makes search the entire table.
Problem-
Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101
Use linear probing technique for collision resolution.
Solution-
The given sequence of keys will be inserted in the hash table as-
Step-01:
Draw an empty hash table.
For the given hash function, the possible range of hash values is [0, 6].
So, draw an empty hash table consisting of 7 buckets as-
Step-02:
Insert the given keys in the hash table one by one.
The first key to be inserted in the hash table = 50.
Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
So, key 50 will be inserted in bucket-1 of the hash table as-
Step-03:
The next key to be inserted in the hash table = 700.
Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
So, key 700 will be inserted in bucket-0 of the hash table as-
Step-04:
The next key to be inserted in the hash table = 76.
Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
So, key 76 will be inserted in bucket-6 of the hash table as-
Step-05:
The next key to be inserted in the hash table = 85.
Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
Since bucket-1 is already occupied, so collision occurs.
To handle the collision, linear probing technique keeps probing linearly until an empty bucket
is found.
The first empty bucket is bucket-2.
So, key 85 will be inserted in bucket-2 of the hash table as-
Step-06:
The next key to be inserted in the hash table = 92.
Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
Since bucket-1 is already occupied, so collision occurs.
To handle the collision, linear probing technique keeps probing linearly until an empty bucket
is found.
The first empty bucket is bucket-3.
So, key 92 will be inserted in bucket-3 of the hash table as-
Step-07:
The next key to be inserted in the hash table = 73.
Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
Since bucket-3 is already occupied, so collision occurs.
To handle the collision, linear probing technique keeps probing linearly until an empty bucket
is found.
The first empty bucket is bucket-4.
So, key 73 will be inserted in bucket-4 of the hash table as-
Step-08:
The next key to be inserted in the hash table = 101.
Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
Since bucket-3 is already occupied, so collision occurs.
To handle the collision, linear probing technique keeps probing linearly until an empty bucket
is found.
The first empty bucket is bucket-5.
So, key 101 will be inserted in bucket-5 of the hash table as-
2. Quadratic Probing-
In quadratic probing,
When collision occurs, we probe for i2‘th bucket in ith iteration.
We keep probing until an empty bucket is found.
3. Double Hashing-
In double hashing,
We use another hash function hash2(x) and look for i * hash2(x) bucket in ith iteration.
It requires more computation time as two hash functions need to be computed.
Comparison of Open Addressing Techniques-
Linear Probing Quadratic Probing Double Hashing
Primary Clustering Yes No No
Secondary Clustering Yes Yes No
Number of Probe
Sequence m m m2
(m = size of table)
Cache performance Best Lies between the two Poor
Conclusions-
Linear Probing has the best cache performance but suffers from clustering.
Quadratic probing lies between the two in terms of cache performance and clustering.
Double caching has poor cache performance but no clustering.
Load Factor (α)-
Load factor (α) is defined as-
In open addressing, the value of load factor always lie between 0 and 1.
This is because-
In open addressing, all the keys are stored inside the hash table.
So, size of the table is always greater or at least equal to the number of keys stored in the
table.
Double hashing is a collision resolving technique in Open Addressed Hash
tables. Double hashing uses the idea of applying a second hash function to
key when a collision occurs.
Double hashing can be done using :
(hash1(key) + i * hash2(key)) % TABLE_SIZE
Here hash1() and hash2() are hash functions and TABLE_SIZE
is size of hash table.
(We repeat by increasing i when collision occurs)
First hash function is typically hash1(key) = key % TABLE_SIZE
A popular second hash function is :
hash2(key) = PRIME – (key % PRIME)
where PRIME is a prime smaller than the TABLE_SIZE.
A good second Hash function is:
It must never evaluate to zero
Must make sure that all cells can be probed