Hashing Techniques in Algorithms

The document discusses hashing as a technique for mapping key-value pairs into a hash table, emphasizing the importance of a good hash function to minimize collisions and ensure efficient data retrieval. It outlines various collision resolution strategies, including separate chaining and open addressing methods like linear and quadratic probing, detailing their advantages and disadvantages. Additionally, it covers the significance of load factors and the performance implications of different hashing methods in terms of time complexity for insert, search, and delete operations.

CSC 301 – Design and Analysis of Algorithms

Instructor: Anwar Ur Rehman


Lecture# 05: Hashing

Hashing: Introduction
• Many applications deal with large amounts of data, stored in some form of a table
with multiple fields.

• For example: A telephone directory has fields name, address and phone number.
To find somebody’s phone number, the book is searched on the name field.

• To find an entry in the table, only the contents of one of the fields are required. We
don’t need to know the contents of all the fields.

• The field used to search the contents of other fields is called the key.

• Ideally, the key should uniquely identify the entry, i.e. no two names in the
telephone directory should be the same.
Hashing: Introduction
• The table can be treated as an abstract data type (ADT) that is composed
of a collection of (key, value) pairs, such that each key is distinct.

• The storage size is dependent on the number of distinct keys

• A set of operations can be defined for this ADT.


• Insert
• Search
• Delete

Hashing: Introduction
• Given such an ADT, what is the complexity of performing these operations?

                       Insert    Search    Delete

Unsorted Array         O(1)      O(n)      O(n)

Sorted Array           O(n)      O(lg n)   O(n)

Unsorted Linked List   O(1)      O(n)      O(n)

Sorted Linked List     O(n)      O(n)      O(n)

Ordered Binary Tree    O(lg n)   O(lg n)   O(lg n)
Hashing: Introduction
• Frequently used applications that deal with lots of data
• Web Searches
• Databases
• Password Verification
• Compilers (Symbol Tables)
• Spell Checkers

• There are countless lookup operations that are time critical.

• As the data size becomes large, the O(n) cost of insert, search and delete
operations becomes significant.

• In general we need something that can do better than O(n). We need lookups to
occur in near constant time.
Hashing: Idea
• Hashing is a technique that maps a set of key-value pairs into a hash table of size M.

• A hash table is a data structure that stores key-value pairs and has two components:
• Hash Function
• Array

• The hash function (H) is applied to the key to determine the index of the key-value pair
in the hash table. Each key is mapped to an index in the range 0 to M − 1.

• Ideally, we would like to have a one-to-one map, but it is not easy to find one.

• The array holds all the key-value entries in the table. The entries in the array are
scattered and not stored consecutively.
Hashing: Example
• Given the following (KEY, VALUE) pairs: (22,a), (33,c), (3,d), (72,e),
(85,f)

• Given a table of size M = 8.

• Define a hash function: H(K) = K % 8.

• Where will the given data be stored?

[0] (72,e)   [1] (33,c)   [2]   [3] (3,d)   [4]   [5] (85,f)   [6] (22,a)   [7]
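The placement above can be sketched in a few lines of Python (a minimal sketch, not from the slides, assuming the hash function H(K) = K % 8):

```python
# Sketch: placing the example (KEY, VALUE) pairs with H(K) = K % 8.
M = 8
pairs = [(22, "a"), (33, "c"), (3, "d"), (72, "e"), (85, "f")]

table = [None] * M
for key, value in pairs:
    index = key % M          # hash function H(K) = K % 8
    table[index] = (key, value)

print(table)
# slot 0: (72,'e'), slot 1: (33,'c'), slot 3: (3,'d'), slot 5: (85,'f'), slot 6: (22,'a')
```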
Hashing: Collisions
• What if we need to insert the following new pairs: (57,g), (25,p), (73,x)?

• They all map to the same location (index 1), since 57 % 8 = 25 % 8 = 73 % 8 = 1.
Obviously, H(K) = K % 8 is not a good hash function here.

• This is called a “Collision”.

[0] (72,e)   [1] (33,c)   [2]   [3] (3,d)   [4]   [5] (85,f)   [6] (22,a)   [7]
Hashing: Collisions
• Two or more keys hashing to the same slot leads to “Collision”.

• When collisions occur, we need to “handle” them.

• Collisions can be reduced by:


• Choosing a good hash function
• Increasing the size of the hash table
• Devising a mechanism to deal with collisions

Hashing: Choosing a Hash Function
• A good hash function must:
• Be easy to compute (i.e. the computational time of a hash function should be O(1))
• Avoid collisions
• Distribute data "uniformly" over the range of available addresses
• Generate the same hash value when applied to equal objects
• Generate a different hash value when applied to unequal objects
• Generate a hash value independent from any patterns existing in the distribution of the keys
• Generate very different hash values for similar strings (i.e. pt and pts)

Hashing: Choosing a Hash Function
• How do we find a good hash function?
• In general, we cannot -- there is no such magic function
• In some specific cases, where all possible values are known in advance, it is possible to
compute a perfect hash function

• What is the next best thing?


• A perfect hash function would tell us exactly where to look
• In general, the best we can do is choose a function that tells us where to start looking!

Hashing: Choosing a Hash Function
Truncation Method:
Ignore part of the key and use the remaining part directly as the index to hash table.

Example: If the keys are 8-digit numbers and the hash table has 1000 entries, then the
last three digits of the key can serve as the hash index.

Disadvantage: Does not always distribute the keys uniformly.
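The truncation method above can be sketched in one line (an illustrative sketch, assuming 8-digit integer keys and a 1000-slot table):

```python
# Sketch: truncation hashing keeps only the last three digits of the key.
def truncation_hash(key: int, table_size: int = 1000) -> int:
    # Taking the last three digits is the same as key % 1000.
    return key % table_size

print(truncation_hash(95476445))   # last three digits -> 445
```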

Hashing: Choosing a Hash Function
Folding Method:
Break up the key into parts of roughly equal length and combine them in some way.

Example: If the keys are 8-digit numbers and the hash table has 1000 entries, break up
a key into parts of two, three and three digits, add them up and, if necessary, truncate
the sum (e.g. 95 + 476 + 445 = 1,016, truncated to 016).

Better than truncation.
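The folding example above can be sketched as follows (assuming the 2+3+3 digit split and a 1000-slot table from the slide):

```python
# Sketch: folding an 8-digit key into 2+3+3 digit parts and summing them.
def folding_hash(key: int, table_size: int = 1000) -> int:
    digits = f"{key:08d}"                          # e.g. 95476445 -> "95476445"
    parts = [digits[:2], digits[2:5], digits[5:]]  # "95", "476", "445"
    total = sum(int(p) for p in parts)             # 95 + 476 + 445 = 1016
    return total % table_size                      # truncate to the table range

print(folding_hash(95476445))   # 1016 truncated to the last three digits -> 16
```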

Hashing: Choosing a Hash Function
Mid-square Method:
Square the given key and take the middle digits from the squared value.

Example: If the keys are 4-digit numbers and the hash table has 100 entries, take the
square of the key and pick the middle two digits as the index to the hash table.
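A small sketch of the mid-square method (the key 4567 is an illustrative choice, not from the slides; 4-digit keys and a 100-slot table are assumed):

```python
# Sketch: mid-square hashing takes the middle two digits of the squared key.
def mid_square_hash(key: int, table_size: int = 100) -> int:
    squared = str(key * key)
    mid = len(squared) // 2
    # Pick the middle two digits of the squared value as the index.
    return int(squared[mid - 1 : mid + 1]) % table_size

# Example: 4567 squared is 20857489; the middle two digits are "57".
print(mid_square_hash(4567))
```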
Hashing: Choosing a Hash Function
Division Method (Modular Arithmetic):
If the hash table has M slots, to map a key K into one of the slots we define:

H(K) = K % M

Advantage: It is fast, as it only requires a single operation.

Disadvantage: Not all values of M are suitable; e.g., avoid powers of 2.

Generally, good values of M are prime numbers that are not very close to powers of 2.
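The division method is a single modulo operation; a sketch (M = 1009, a prime not close to a power of 2, is an illustrative choice):

```python
# Sketch: division-method hashing with a prime table size.
M = 1009

def division_hash(key: int) -> int:
    return key % M           # single modulo operation

print(division_hash(95476445))
```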
Hashing: Choosing a Hash Function
Hashing a String Key:
If the hash table has M slots, to map a string key K of length KeySize into one of the
slots we define:

H(K) = ( Σ_{i=0}^{KeySize−1} K[KeySize − i − 1] · 37^i ) % M

Example: K = "ali", KeySize = 3, M = 1,009. The character codes are
K[0] = 'a' = 97, K[1] = 'l' = 108, K[2] = 'i' = 105.

H("ali") = [ (105 × 1) + (108 × 37) + (97 × 37²) ] % 1,009 = 679

So the hash function maps "ali" to slot 679 of the 1,009-slot table.
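The string hash above can be sketched with Horner's rule, which computes the same polynomial sum without explicit powers of 37 (a sketch, assuming M = 1009 as in the slide):

```python
# Sketch of the base-37 polynomial string hash from the slide.
def string_hash(key: str, M: int = 1009) -> int:
    h = 0
    for ch in key:
        # Horner's rule: equivalent to summing ord(ch) * 37**i over the
        # characters, with the leftmost character getting the highest power.
        h = (h * 37 + ord(ch)) % M
    return h

print(string_hash("ali"))   # (105*1 + 108*37 + 97*37**2) % 1009 = 679
```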
Hashing: Hash Table Size
• Given a hash table with M slots and N elements stored in it, we define the load
factor of the table as λ = N / M.

• The load factor gives us an indication of how full the table is.

• The possible values of the load factor depend on the method we use for
resolving collisions.
Hashing: Collision Resolving Strategies
• Separate chaining
• Open addressing
• Linear Probing
• Quadratic Probing
• Double Probing

Hashing: Collision Resolving Strategies
Separate chaining
• Idea: Put all elements that hash to the same slot in a linked list (chain). Each slot
contains a pointer to the head of its list.

• Define the load factor λ of the table, which indicates the average number of
elements stored in a chain. λ can be more than 1.

• For search and updates, a good hash function is still needed to distribute keys
evenly.

Example: Keys 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 with hash(key) = key % 10:

[0] → 0
[1] → 81 → 1
[2]
[3]
[4] → 64 → 4
[5] → 25
[6] → 36 → 16
[7]
[8]
[9] → 49 → 9
Hashing: Collision Resolving Strategies
Separate chaining
• Insert: The slot in the table is located using the hash function and the key is inserted
at the head of the linked list at that slot (search the list first to avoid duplicates).

• Search: The slot in the table is located using the hash function and the key is searched
in the linked list, headed at that slot, using linear search.

• Delete: The slot in the table is located using the hash function and the key is deleted
from the linked list headed at that slot.

• Advantages: average-case performance stays good even when λ > 1; better space
utilization for a large number of items; the table can hold more items than its number
of slots; delete is easier to implement than with open addressing.

• Disadvantages: requires dynamic data structures; requires storage for pointers in
addition to data; can have poor locality, which causes poor caching performance.
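The Insert/Search/Delete operations above can be sketched with Python lists standing in for the linked chains (a minimal sketch, not the slides' code):

```python
# Sketch of a separate-chaining hash table; each slot holds a chain (list).
class ChainedHashTable:
    def __init__(self, size: int = 10):
        self.size = size
        self.slots = [[] for _ in range(size)]   # one chain per slot

    def _hash(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int) -> None:
        chain = self.slots[self._hash(key)]
        if key not in chain:          # search first to avoid duplicates
            chain.insert(0, key)      # insert at the head of the chain

    def search(self, key: int) -> bool:
        return key in self.slots[self._hash(key)]   # linear search in chain

    def delete(self, key: int) -> None:
        chain = self.slots[self._hash(key)]
        if key in chain:
            chain.remove(key)

table = ChainedHashTable(10)
for k in [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]:
    table.insert(k)
print(table.slots[6])   # keys 36 and 16 both hash to slot 6
```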
Hashing: Collision Resolving Strategies
Analysis of separate chaining
• Keep in mind that the load factor λ measures how full the table is. It also indicates the
average number of elements stored in a linked list. Given a load factor λ, we would
like to know the best, average, and worst case of:
• New-key insert and unsuccessful find (these are the same)
• Successful find

• The best case is O(1) and the worst case is O(N) for all these cases.
• The average number of comparisons for insert or unsuccessful search is O(1 + λ).
• In a successful search, on average half of the keys in the linked list containing the target
key are examined before finding the target. So the average number of comparisons for
successful search is 1 + λ/2, which is again O(1 + λ).
• If M is proportional to N, then λ = O(1) and the total time for search, insert and
delete is O(1) on average, independent of N.
Hashing: Collision Resolving Strategies
Open Addressing
• Idea: Store all elements in the hash table itself. If a collision occurs, find another slot.
When searching for an element, examine slots until the element is found or the
element is not in the table.

• It is possible that the table is full, and a new element cannot be inserted.
• Resize the hash table

• The sequence of slots to be examined (probed) is computed in a systematic way.

• There are three common ways to determine a probe sequence:


• Linear probing
• Quadratic probing
• Double probing
Hashing: Collision Resolving Strategies
Open Addressing: Linear Probing
• Idea: The table remains a simple array of size M.

• Insert: On collision, place the record in the next empty slot found using linear search.
Upon reaching the end of the table, continue searching from the start of the table
(wrap around).

• Search: Same probe sequence as Insert. There are three cases:


• Slot in table is occupied with an element of equal key
• Position in table is empty
• Position in table occupied with a different element

• Delete: Deletion is a bit tricky. We can’t just delete a record that is involved in a collision.
• “Lazy Delete” – Just mark the item as inactive (a tombstone) rather than removing it.
• The deleted slot can later be reused for insertion.
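The probe-and-tombstone behavior described above can be sketched as follows (a minimal sketch, not the slides' code; the two sentinel objects for empty and deleted slots are an implementation assumption, and the sketch assumes the table never fills completely):

```python
# Sketch of open addressing with linear probing and lazy deletion (M = 10).
EMPTY, DELETED = object(), object()   # sentinels for free and tombstoned slots

class LinearProbingTable:
    def __init__(self, size: int = 10):
        self.size = size
        self.slots = [EMPTY] * size

    def insert(self, key: int) -> None:
        i = key % self.size
        while self.slots[i] not in (EMPTY, DELETED):   # probe for a free slot
            i = (i + 1) % self.size                    # wrap around the table
        self.slots[i] = key

    def search(self, key: int) -> bool:
        i = key % self.size
        while self.slots[i] is not EMPTY:     # stop only at a never-used slot
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.size
        return False

    def delete(self, key: int) -> None:
        i = key % self.size
        while self.slots[i] is not EMPTY:
            if self.slots[i] == key:
                self.slots[i] = DELETED       # lazy delete: leave a tombstone
                return
            i = (i + 1) % self.size

table = LinearProbingTable(10)
for k in [89, 18, 49, 58, 9]:
    table.insert(k)
print([s if s not in (EMPTY, DELETED) else None for s in table.slots])
# 49, 58 and 9 wrap around into slots 0, 1 and 2
```

Because search stops only at a never-used slot, the tombstone left by delete keeps later keys in the same probe chain reachable.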
Hashing: Collision Resolving Strategies
Open Addressing: Linear Probing
Example: Insert items with keys 89, 18, 49, 58, 9 into an empty hash table of size
M = 10, with hash(key) = key % 10:

[0] 49   [1] 58   [2] 9   [3]   [4]   [5]   [6]   [7]   [8] 18   [9] 89
Hashing: Collision Resolving Strategies
Open Addressing: Linear Probing
• An issue with linear probing is that it tends to form “clusters”.

• A “cluster” is a group of elements stored consecutively in the hash table without
any empty slots between them.

• The bigger a cluster gets, the more time it takes to find a free slot or to search for an
element. Additionally, it becomes more likely that new values will hash into the
cluster and make it ever bigger.

• Even if the table is relatively empty, clustering is still possible.

• This effect is known as primary clustering.


Hashing: Collision Resolving Strategies
Open Addressing: Linear Probing Analysis
• Worst-case time complexity: If all the keys map to the same index, we would
need to probe over all elements. So the worst-case time complexity of Insert,
Search, and Delete is O(N).

• Best-case time complexity: If there is no collision, the cost of Insert, Search, and
Delete operations is O(1).

• Average-case time complexity: The formal proof is too complex for the scope of this
course. For Insert, Search, and Delete operations the average case is O(1), provided
the load factor is kept low.

• Generally it is better to keep the load factor λ under 0.7. If λ gets high, double the
table size and rehash.
Hashing: Collision Resolving Strategies
Open Addressing: Quadratic Probing
• Idea: The table remains a simple array of size M. A collision is resolved by examining
certain cells away from the original probe point. For probing, use the quadratic
function f(i) = i².

• If the hash function evaluates to a slot S which results in a collision, we probe
the slots S + 1², S + 2², …, S + i² (i.e. we examine slots 1, 4, 9 and so on away from
the original probe), all taken modulo M. The subsequent probe points are a quadratic
number of positions from the original probe point.

• Quadratic Probing eliminates the problem of primary clustering.

Hashing: Collision Resolving Strategies
Open Addressing: Quadratic Probing
Example: Insert items with keys 89, 18, 49, 58, 9 into an empty hash table of size
M = 10, with hash(key) = key % 10:

[0] 49   [1]   [2] 58   [3] 9   [4]   [5]   [6]   [7]   [8] 18   [9] 89
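The quadratic probe sequence for the example above can be sketched as follows (a minimal sketch, not the slides' code; it assumes a free slot is always reachable):

```python
# Sketch of quadratic probing: on a collision at slot S we try
# (S + 1*1) % M, (S + 2*2) % M, (S + 3*3) % M, ...
def quadratic_insert(table: list, key: int) -> None:
    M = len(table)
    S = key % M
    i = 0
    while table[(S + i * i) % M] is not None:   # probe S, S+1, S+4, S+9, ...
        i += 1
    table[(S + i * i) % M] = key

table = [None] * 10
for k in [89, 18, 49, 58, 9]:
    quadratic_insert(table, k)
print(table)
# 49 lands in slot 0, 58 in slot 2, 9 in slot 3
```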
Hashing: Collision Resolving Strategies
Open Addressing: Quadratic Probing
• Caveats:
• It may not find a vacant slot, while linear probing always finds an empty slot if one exists.
• The table must be less than half full (λ < 0.5).
• We may not be sure that we will probe all the slots in the table.

• If the hash table size is not prime, this problem is much more severe.

• However, there is a theorem stating that if the table size is prime and the load factor is
not larger than 0.5, all probes will be to different locations and an item can always
be inserted.

• Quadratic probing requires * and % operations. We can speed up these
computations by using the fact that consecutive squares differ by successive odd
numbers: since i² = (i−1)² + 2i − 1, the next probe position can be computed from
the previous one using only an addition.
Hashing: Collision Resolving Strategies
Open Addressing: Quadratic Probing
• If the load factor gets too high, dynamically expand and double the table size as
soon as the load factor reaches 0.5; always double to a prime number; and refill
the new table by using the new hash function.

• Quadratic probing does not suffer from primary clustering:


• No problem with keys initially hashing to the same neighborhood

• But it doesn’t help if keys initially hash to the same index as they will probe
the same alternative cells. This is called secondary clustering.

• Using probe functions that depend on the key can avoid secondary
clustering. This technique is called double probing (double hashing).
Hashing: Collision Resolving Strategies
Open Addressing: Double Probing
• Idea: Given two good hash functions, H1 and H2, it is very unlikely that for a given
key H1(K) == H2(K), so we can use H2 to determine the offset of the probe sequence
in case of collision.

• Probe sequence:
0th probe: H1(K) % M
1st probe: (H1(K) + H2(K)) % M
2nd probe: (H1(K) + 2·H2(K)) % M
3rd probe: (H1(K) + 3·H2(K)) % M
……
ith probe: (H1(K) + i·H2(K)) % M

Detail: Make sure H2(K) cannot be 0.
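The probe sequence above can be sketched as follows (a minimal sketch, not the slides' code; the second hash function H2(K) = 1 + (K % (M − 2)) and the insertion order are assumptions chosen for illustration, and H2 is built so it can never be 0):

```python
# Sketch of double hashing: the i-th probe is (H1(K) + i*H2(K)) % M.
def double_hash_insert(table: list, key: int) -> None:
    M = len(table)
    h1 = key % M                 # H1(K): initial slot
    h2 = 1 + (key % (M - 2))     # H2(K): probe step, never 0 (assumption)
    i = 0
    while table[(h1 + i * h2) % M] is not None:
        i += 1                   # jump by H2(K) on each collision
    table[(h1 + i * h2) % M] = key

table = [None] * 13              # prime table size
for k in [79, 69, 72, 50, 98, 14]:
    double_hash_insert(table, k)
print(table)
```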


Hashing: Collision Resolving Strategies
Open Addressing: Double Probing
Example: A table of size 13 with H1(K) = K % 13, holding the keys 79, 69, 98, 72, 14, 50:

[0]   [1] 79   [2]   [3]   [4] 69   [5] 98   [6]   [7] 72   [8]   [9] 14   [10]   [11] 50   [12]

Note that 98 sits in slot 5 even though 98 % 13 = 7: slot 7 is taken by 72, so the probe
sequence jumped by H2(98) to an open slot.
Hashing: Collision Resolving Strategies
Open Addressing: Double Probing
• Intuition: Since each probe “jumps” by H2(K) each time, we “leave the
neighborhood” and “go different places from other initial collisions”.

• Just like quadratic probing, there is still a chance that we might not probe all the
slots in the table (infinite loop despite room in the table).

• Disadvantages:
• It is harder to delete an element
• It can generate at most M² of the M! possible probe sequences
Hashing: Collision Resolving Strategies
Efficiency:
The average cost of each collision-resolution strategy is a function of the load factor λ.
Hashing: Summary
• The hash table is one of the most important data structures
• Supports only find, insert and delete efficiently

• Important to use a good hash function

• Important to keep hash table at a good size

• Side-Comment: Hash functions have uses beyond hash tables


• Example: Cryptography, checksums
THANK YOU