Understanding Hashing Techniques in Data Structures

Chapter 5 discusses hashing as a technique for efficient data retrieval in databases, comparing it to linked lists and binary search trees. It covers the general idea of hash tables, hash functions, collision resolution methods like separate chaining and open addressing, and the performance analysis of these techniques. The chapter emphasizes the importance of selecting effective hash functions and managing load factors to optimize search times.

Chapter 5

Hashing
Motivation
 Let us assume that we want to search for a particular
item in a database of 20,000,000 data items
 How long would it take to perform a successful search?
 How long would it take to perform an unsuccessful search?
 It depends on the data structure
Motivation…
 If the data structure is a linked list,
 the search time is O(N)
 If the data structure is a binary search tree,
 estimated running time is O(logN)
 log₂ 20,000,000 ≈ 24

 Can we do even better than O(logN) ?

 Yes: with the hash table ADT
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing
Chapter 5: Hashing
Our goals: We will
 See several methods of implementing the hash table
 Compare these methods analytically
 Show numerous applications of hashing
 Compare hash tables with binary search trees
First some terminology
 The hash table ADT is a data structure that supports
only a subset of the operations allowed by
binary search trees
 An implementation of a hash table is called hashing
 Hashing is a technique used for performing
insertions, deletions, and finds in constant average time
General Idea
 The general idea behind hashing is to directly map
each data item into an address in memory using
some function
 key  hash function  index to an array

 Components of hashing

 A hash table is an array of some fixed size ‘m’


 A hash function h(k) that maps a search key k to
some location in the range [0...m-1]
• h(k): S  {0, 1, …, m-1}
General Idea…
 [Figure: an array indexed 0 to m−1. A data item (Name: Irzam Shahid,
University: RCET, Office: room 1 EED, mobile number, email, etc.) is
stored at index 1, because the hash function, given the last name as
the key, returns h(Irzam) = 1.]
General Idea…
 Desired Properties of h(k)
 simple to compute
 uniform distribution of keys over {0, 1, …, m-1}

• when h(k1) = h(k2) for two distinct keys k1, k2 , we


have a collision
General Idea…
 [Figure: the same array. A second data item (Name: Rehan Arif,
University: RCET, Office: room 8 EED, etc.) also hashes to index 1,
h(Rehan) = 1. A collision has occurred.]
General Idea…
 Two Important Topics in Hashing
 How to select a hash function
 How to resolve collisions
General Idea…
 Hashing revisited
 A hash table data structure is an array
 Each data element contains a key
 Each key is mapped to some number in the range
from 0 to TableSize-1, with the help of a hash
function
• The hash function should be efficient to compute and
should ensure that different data items get mapped to
different numbers
 The key and the hashing function are used both to
insert the data into the table and to later find that
data
General Idea…
 Example
 PTCL is a large telephone company, and they
want to maintain a database that provides the
caller ID capability
• given a phone number, return the caller’s name
• phone numbers range from 0 to r = 10^7 − 1
• want to do this as efficiently as possible
General Idea…
 Solution 1
 an array indexed by key
• takes O(1) time,
• O(r) space - huge amount of wasted space

[Figure: an array of size r indexed by phone number; a few entries
(e.g. Umer at 6829227, Hassan at 6829229) are filled, and nearly all
other entries are null.]
General Idea…
 Solution 2
 Linked list
• takes O(r) time,
• O(r) space (only as much space as is needed )

[Figure: a linked list of (phone number, name) records, including
6829227 and 6829229.]
General Idea…
 Solution 3
 Hash table
• O(1) expected time, O(n+m) space, where m is table size
 Like an array, but come up with a function to map the
large range into one which we can manage
• e.g. take the original key, modulo the (relatively small) size
of the array, and use that as an index
• 6829229 mod 5 = 4

[Figure: a table of size 5, indexed 0 to 4; 6829229 mod 5 = 4, so
Hassan's record is stored at index 4 and the other entries are null.]
Hash Function
 A simple hash function
 If input keys (k) are integers
 hash function, h( k ) = k mod m
• where m is the table size

 Example
• Suppose m = 10,
• k = 10, 20, 30, 40
• h(k) = 0, 0, 0, 0
– A bad choice if the keys end in zeros
– To avoid such a situation, m should be a prime number
Hash Function…
 Another simple hash function
 If input keys (k) are integers
 hash function, h( k ) = k mod m
• where m is the table size and is a prime number

 Example
• Suppose m = 11,
• k = 10, 20, 30, 40
• h(k) = 10, 9, 8, 7
– Distributes the keys more uniformly
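Both cases can be reproduced with a one-line division-method hash (hash_mod is an illustrative name, not from the chapter):

```c
/* Division-method hash, h(k) = k mod m, as described above. */
unsigned int hash_mod(unsigned int k, unsigned int m)
{
    return k % m;
}
```
With m = 10 the keys 10, 20, 30, 40 all land in slot 0, while with the prime m = 11 they spread out to 10, 9, 8, 7.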
Hash Function…
 A simple hash function
 If the keys are strings, then the hash function
can be some function of the characters in the
strings
 One possibility is to simply add the ASCII
values of the characters:

 length  1 
h( str )   str[i ]  %m
 i 0 
• Example
– h(ABC) = (65 + 66 + 67)%m
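A sketch of this additive hash (hash_add is an illustrative name, not from the chapter):

```c
/* Additive string hash: sum the character codes, reduce mod m. */
unsigned int hash_add(const char *str, unsigned int m)
{
    unsigned int sum = 0;
    for (; *str != '\0'; str++)
        sum += (unsigned char)*str;
    return sum % m;
}
```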
Hash Function…
 Problem
 If the table size is large, the function does not
distribute the keys well
 TableSize = 10,007 (prime number)
 Keys are <= 8 characters
 Each char holds an ASCII value, so the highest value
it can have is 2^7 − 1 = 127
 The hash function therefore has range 0 to (127 × 8) = 1016
 ~10K slots in the table, but only the first
~1K are ever used
Hash Function…
 Another hash function
 If the keys are strings
 convert the string into some number in some
arbitrary base b

 length  1 i
h( str )   str[i ] b  %m
 i 0 

• Example
– h(ABC) = (65·b^0 + 66·b^1 + 67·b^2) % m
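A sketch of this polynomial hash, evaluated with Horner's rule from the last character back to the first so that intermediate values stay small (hash_poly is an illustrative name, not from the chapter):

```c
#include <string.h>

/* Polynomial string hash: ( sum of str[i] * b^i ) % m,
   evaluated with Horner's rule from the end of the string. */
unsigned int hash_poly(const char *str, unsigned int b, unsigned int m)
{
    size_t i = strlen(str);
    unsigned long long h = 0;
    while (i-- > 0)
        h = (h * b + (unsigned char)str[i]) % m;  /* reduce mod m at each step */
    return (unsigned int)h;
}
```
With b = 1 this degenerates to the additive hash above.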
Hash Function…
 Examines the first three characters of the input
 The value 27 represents the number of letters
in the English alphabet, plus the blank

Index Hash2( const char *Key, int TableSize )
{
    return ( Key[ 0 ] + 27 * Key[ 1 ] + 729 * Key[ 2 ] ) % TableSize;
}

 This index function, though easily computable, is also not
appropriate if the hash table is reasonably large
Hash Function…
 Rule of Thumb

 Hash functions should try to achieve uniform full


coverage of the hash table, while minimizing
collisions
 Since this is usually impossible, and collisions will
almost always occur, an important design
consideration is how you deal with the collision
resolution
Separate Chaining
 How to deal with two keys which hash to the same
spot in the array?
 Use chaining
 All data items that hash to the same number are
kept in a linked list
• Setup an array of lists, indexed by the keys, to
lists of items with the same key
Separate Chaining…
 Example

[Figure: slot 1 of the array now points to a linked list holding both
records: Irzam Shahid's and Rehan Arif's. The two entries that
collided are now stored in a linked list.]
Separate Chaining…
 Example
 Here the size of the
hash table = 10
 Keys are the first ten
perfect squares 0, 1, 4,
9, 16, 25, 36, 49, 64, and
81
 The hash function,
h(k) = k mod 10

A separate chaining hash table


Separate Chaining…
 To find an element
 using hash function, look up its position in table
 search for the element in the linked list of the
hashed slot

 To insert an element
 compute h(k) to determine which list to traverse
 If T[h(k)] contains a null pointer, initialize this entry
to point to a linked list that contains k alone
 If T[h(k)] is a non-empty list, we add k at the
beginning of this list
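The find and insert procedures above can be sketched for integer keys and a fixed table of size 10 (ChainNode, chain_insert, and chain_find are illustrative names, not from the chapter):

```c
#include <stdlib.h>

/* Minimal separate-chaining table for integer keys. */
typedef struct ChainNode {
    int key;
    struct ChainNode *next;
} ChainNode;

enum { TABLE_SIZE = 10 };
static ChainNode *table10[TABLE_SIZE];   /* array of list heads, one per slot */

/* Insert at the head of the hashed slot's list, as described above. */
void chain_insert(int key)
{
    unsigned int i = (unsigned int)key % TABLE_SIZE;
    ChainNode *n = malloc(sizeof *n);
    n->key = key;
    n->next = table10[i];
    table10[i] = n;
}

/* Find: hash to a slot, then walk that slot's linked list. */
int chain_find(int key)
{
    ChainNode *n = table10[(unsigned int)key % TABLE_SIZE];
    for (; n != NULL; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}
```
Inserting the ten perfect squares from the earlier example chains 0 with 0, 4 with 64 and 24, and so on, exactly as h(k) = k mod 10 dictates.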
Separate Chaining…
 To delete an element
 compute h(k), then search for k within the list at
T[h(k)]
 delete k if it is found
Separate Chaining…
 Analysing the performance of separate chaining hash
table
 as we increase the number of elements N in the
hash table, more and more items will be stored in
linked lists, thus slowing everything down
 Also, increasing the table size TableSize allows you
to hold more data in an efficient manner
 It turns out that the ratio λ = N / TableSize is the important
quantity to analyze
• This is called the load factor
Separate Chaining…
 Analysing the performance of separate chaining hash
table…
 Time to perform search = the constant time
required to evaluate the hash function + time to
traverse the list
 Note that, for separate chaining, the average
length of a linked list is λ
 Thus, an unsuccessful search will require to
traverse λ links on average
 A successful search requires that about 1 + (λ/2)
links be traversed
Separate Chaining…
 Analysing the performance of separate chaining hash
table…
 Thus, lowering the load factor is a good thing, from
the time point of view
 From the space point of view, lowering the load
factor means increasing the table size
• This can lead to largely wasted space
 A reasonable compromise is λ ≈ 1
• search times will be roughly O(1)
Open Addressing
 Separate chaining has the disadvantage of using
linked lists, which slow the algorithm down because of the
time required to allocate new cells

 Open addressing
 relocate the key k to be inserted if it collides with
an existing key
• That is, we store k at an entry different from
T[h(k)]
Open Addressing…
 Open addressing hashing resolves collisions by trying
alternative slots in the hash table, until an empty cell
is found
 cells h0 (X), h1 (X), h2 (X),… are tried in succession
where hi (X) = (Hash(X) + F(i))mod TableSize with
F(0) = 0
 The function, F, is the collision resolution strategy
Open Addressing…
 Linear Probing
 F(i) is a linear function of i, i.e. F(i) = i
• h0(X) = Hash(X) + 0
• h1(X) = Hash(X) + 1
• h2(X) = Hash(X) + 2
•…
• cells are probed sequentially (with wraparound)
in search of an empty cell
Open Addressing…
 *Example
 suppose that our hash function converts a 2-digit
integer into a single digit by taking the least-
significant digit

*[Link]
~ece250/
Open Addressing…
 *Insertions
 Insert the numbers 81, 70, 97, 60, 51, 38, 89, 68, 24 into the
initially empty hash table:

0 1 2 3 4 5 6 7 8 9

Open Addressing…
 *Insertions…
 We can easily insert 81, 70, and 97 into their corresponding
bins:

0 1 2 3 4 5 6 7 8 9
70 81 97

Open Addressing…
 *Insertions…
 Inserting 60 causes a collision in bin 0, therefore, we check:

• bin 1 (also full), and


• bin 2 (empty)

0 1 2 3 4 5 6 7 8 9
70 81 60 97

Open Addressing…
 *Insertions…
 Inserting 51 also causes a collision, this time, in bin 1,
therefore, we check:
• bin 2 (also full), and
• bin 3 (empty)

0 1 2 3 4 5 6 7 8 9
70 81 60 51 97

Open Addressing…
 *Insertions…
 38 and 89 can be placed into bins 8 and 9 respectively
without collisions

0 1 2 3 4 5 6 7 8 9
70 81 60 51 97 38 89

Open Addressing…
 *Insertions…
 Inserting 68 causes a collision in bin 8, and therefore we
check bins:
• 9, 0, 1, 2, 3, and finally 4 which is empty
• insert 68 into bin 4

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 97 38 89

Open Addressing…
 *Insertions…
 Inserting 24 causes a collision in bin 4, however the next bin
is empty

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 24 97 38 89

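The nine insertions above can be replayed with a short probing loop (a sketch: linear_insert and the −1 empty-bin marker are illustrative, not from the chapter):

```c
/* Linear-probing insertion: probe (h + i) mod m for i = 0, 1, 2, ...
   until an empty bin (marked -1) is found. Returns the bin used,
   or -1 if the table is full. */
int linear_insert(int *table, int m, int key)
{
    int h = key % m;                 /* the example's hash: last digit when m = 10 */
    for (int i = 0; i < m; i++) {
        int bin = (h + i) % m;       /* F(i) = i, with wraparound */
        if (table[bin] == -1) {
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}
```
Replaying 81, 70, 97, 60, 51, 38, 89, 68, 24 into a table of size 10 yields exactly the bins shown in the slides (68 probes all the way around to bin 4, and 24 then spills into bin 5).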
Open Addressing…
 *Searching
 Testing for membership is similar to insertions
 Start at the appropriate bin, and continue
searching forward until either:
• the item is found, or
• an empty bin is found

Open Addressing…
 *Searching…
 Searching for 68, we first examine bin 8, then 9, 0, 1, 2, 3,
and 4, finding 68 in bin 4
 Searching for 23, we search bins 3, 4, 5, and bin 6 is empty,
so 23 is not in the table

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 24 97 38 89

Open Addressing…
 *Removing
 We cannot simply remove elements from the hash table
 For example, if we delete 89 by removing it, we can no
longer find 68

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 24 97 38 89

Open Addressing…
 *Removing…
 However, we cannot simply move all entries up to fill the gap
 Moving 70 to bin 9 would make it impossible to find 70

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 24 97 38 89

81 60 51 68 24 97 38 70

Open Addressing…
 *Removing…
 Instead, we must probe forward, moving only those
elements which would not be moved to a location before
their bin starts
 For example, we remove 89

0 1 2 3 4 5 6 7 8 9
70 81 60 51 68 24 97 38

Open Addressing…
 *Removing…
 We probe forward until we find an entry which can be moved
into bin 9
 We cannot move 70, 81, 60, or 51, but we can move 68

0 1 2 3 4 5 6 7 8 9
70 81 60 51 24 97 38 68

Open Addressing…
 *Removing…
 Next, we search forward again, and note that 24 can be
moved forward
 The next cell is already empty, and therefore we are finished

0 1 2 3 4 5 6 7 8 9
70 81 60 51 24 97 38 68

Open Addressing…
 *Removing…
 Suppose we now remove 60

0 1 2 3 4 5 6 7 8 9
70 81 60 51 24 97 38 68

Open Addressing…
 *Removing…
 We find 60 in bin 2, and therefore we remove it
 We search forward and find that we can move 51 into bin 2

0 1 2 3 4 5 6 7 8 9
70 81 51 24 97 38 68

Open Addressing…
 *Removing…
 We cannot move 24 forward
 The next bin (5) is empty, therefore we are finished

0 1 2 3 4 5 6 7 8 9
70 81 51 24 97 38 68

Open Addressing…
 *Primary Clustering
 We have already observed the following
phenomenon:
• as we insert more elements into the hash table,
the contiguous regions get larger
• Any key that hashes into the cluster will require
several attempts to resolve the collision
 This results in longer search times

Open Addressing…
 *Primary Clustering…
 Consider inserting the following entries 81, 70, 97, 63, 76,
38, 85, 68, 21, 9, 55, 73, 57, 60, 72, 74, 85, 16, 61, 7, 49
 Use the number modulo 25 to determine which bin it should
occupy
 The first five don’t cause any collisions

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

76 81 63 70 97

Open Addressing…
 *Primary Clustering…
 Inserting 38 causes a collision in bin 13
 The next seven do not cause any further collisions

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

76 55 81 57 9 85 63 38 68 70 21 97 73

Open Addressing…
 *Primary Clustering…
 The next four insertions cause collisions:

60 (bin 10)
72 (bin 22)
74 (bin 24)
85 (bin 10)
 We can safely insert 16 into bin 16

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

74 76 55 81 57 9 85 60 85 63 38 16 68 70 21 97 73 72

Open Addressing…
 *Primary Clustering…
 The remaining insertions all cause collisions:

61 (bin 11)
7 (bin 7)
49 (bin 24)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

74 76 49 55 81 57 7 9 85 60 85 63 38 61 16 68 70 21 97 73 72

Open Addressing…
 Asymptotic Performance
 Primary clustering affects the number of probes
required to perform the insertions, searches or
deletions
 The average number of probes for a successful
search can be estimated as
• Number of probes ≈ ( ½ )( 1 + 1/( 1 − λ ) )
• where λ is the load factor – what fraction of the table is
used
Open Addressing…
 Asymptotic Performance…
 The number of probes for an unsuccessful search
or for an insertion is higher:
• Number of probes ≈ ( ½ )( 1 + 1/( 1 − λ )^2 )
• if λ = 0.75 , 8.5 probes are expected
• if λ = 0.9 , 50 probes are expected, and this is unreasonable
– Linear probing can be a bad choice if the table is more than
half full
Open Addressing…
 *[Figure: a plot showing how the number of required probes
increases with the load factor]

Open Addressing…
 *Primary clustering occurs with linear probing because every
collision is resolved with the same linear probe pattern
 if a bin is inside a cluster, then the next bin must
either
• also be in that cluster, or
• expand the cluster

 Instead of searching forward in a linear fashion,


consider searching forward using a quadratic function

Open Addressing…
 Quadratic Probing
 with quadratic probing F(i) = i2
 This eliminates the primary clustering problem of
linear probing
• h0(X) = Hash(X) + 0
• h1(X) = Hash(X) + 1
• h2(X) = Hash(X) + 4
• …
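The quadratic probe sequence can be sketched as a loop (quadratic_insert and the −1 empty-bin marker are illustrative names, not from the chapter; note that with m probes there is no guarantee an empty bin is reached):

```c
/* Quadratic-probing insertion: probe (h + i*i) mod m for i = 0, 1, 2, ...
   Returns the bin used, or -1 if no empty bin was reached in m probes. */
int quadratic_insert(int *table, int m, int key)
{
    int h = key % m;
    for (int i = 0; i < m; i++) {
        int bin = (h + i * i) % m;   /* F(i) = i^2 */
        if (table[bin] == -1) {      /* -1 marks an empty bin */
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}
```
For instance, with m = 11, inserting 14, 107, 31, 118, 34, 112 (the example used later in the chapter) lands them in bins 3, 8, 9, 1, 2, and 6.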
Open Addressing…
*Insertions
 Suppose that an element should appear in bin h
 if bin h is occupied, then check the following
sequence of bins
h + 1², h + 2², h + 3², h + 4², h + 5², ...
h + 1, h + 4, h + 9, h + 16, h + 25, ...

 For example, with M = 17

Open Addressing…
*Insertions…
 If one of h + i² falls into a cluster, this does not imply
the next one will

Open Addressing…
*Insertions…
 For example, suppose an element was to be inserted in
bin 23 in a hash table with 31 bins

 The sequence in which the bins would be checked is


23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0

 Even if two bins are initially close, the sequence in


which subsequent bins are checked varies greatly

Open Addressing…
*Insertions…
 Thus, quadratic probing solves the problem of
primary clustering
 Unfortunately, there is a second problem which must
be dealt with
 Suppose we have M = 8 bins
1² ≡ 1, 2² ≡ 4, 3² ≡ 1 (mod 8)

 In this case, we are checking bin h + 1 twice


having checked only one other bin

Open Addressing…
*Insertions…
 Unfortunately, there is no guarantee that
(h + i²) mod M
will cycle through 0, 1, ..., M – 1

 Solution
 M should be a prime number
 in this case, (h + i²) mod M for i = 0, ..., (M – 1)/2 will
cycle through exactly (M + 1)/2 distinct values before
repeating

Open Addressing…
*Insertions…
 Example

 with M = 11
0, 1, 4, 9, 16 ≡ 5, 25 ≡ 3, 36 ≡ 3

 with M = 13
0, 1, 4, 9, 16 ≡ 3, 25 ≡ 12, 36 ≡ 10, 49 ≡ 10

 with M = 17
0, 1, 4, 9, 16, 25 ≡ 8, 36 ≡ 2, 49 ≡ 15, 64 ≡ 13, 81 ≡ 13
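These cycle lengths can be checked mechanically (distinct_quadratic_bins is an illustrative helper, not from the chapter; the sketch assumes M < 64 and takes h = 0 without loss of generality):

```c
/* Count how many distinct bins (i*i) mod M visits for i = 0 .. (M-1)/2. */
int distinct_quadratic_bins(int M)
{
    int seen[64] = {0};
    int count = 0;
    for (int i = 0; i <= (M - 1) / 2; i++) {
        int bin = (i * i) % M;
        if (!seen[bin]) {
            seen[bin] = 1;
            count++;
        }
    }
    return count;
}
```
For the primes 11, 13, and 17 the count is exactly (M + 1)/2, matching the lists above; for the non-prime M = 8 it is only 3.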

Open Addressing…
*Insertions…
 Thus, quadratic probing avoids primary clustering

 Unfortunately, we are not guaranteed that we will use


all the bins
 In reality, if the hash function is reasonable, this is not
a significant problem until λ approaches 1

Open Addressing…
*Insertions…
 Example
 with a hash table with M = 19 using quadratic
probing, insert the following random 3-digit
numbers

086, 198, 466, 709, 973, 981, 374,


766, 473, 342, 191, 393, 300, 011,
538, 913, 220, 844, 565
 using the number modulo 19 to be the initial bin
Open Addressing…
*Insertions…
 The first two fall into their correct bin
086 → 10, 198 → 8

 The next already causes a collision


466 → 10 → 11

 The next four cause no collisions


709 → 6, 973 → 4, 981 → 12, 374 → 13

 Then another collision


766 → 6 → 7
Open Addressing…
*Insertions…
 At this point, two clusters have appeared and the
load factor is λ = 0.42

Open Addressing…
*Insertions…
 The next three also go into their appropriate bins
473 → 17, 342 → 0, 191 → 1

 Then there is one more collision


393 → 13 → 14

 300 falls into its correct bin


300 → 15

Open Addressing…
*Insertions…
 With the previous five insertions, the load factor is
λ = 0.68, with one large cluster

Open Addressing…
*Insertions…
 At this point, insertions become more tedious

011 → 11 → 12 → 15 → 1 → 8 → 17 → 9

538 → 6 → 7 → 10 → 15 → 3

913 → 1 → 2

220 → 11 → ⋅⋅⋅ → 9 → 3 → 18

844 → 8 → 9 → 12 → 17 → 5

Open Addressing…
*Insertions…
 To show how quadratic probing works, consider the
addition of 538, starting in bin 6

 The first four bins all fall within the same cluster,
however, the fifth bin checked falls far outside the
cluster

Open Addressing…
*Insertions…
 At this point, the array is almost full (bin 16 is open)
and the load factor is λ = 0.95

 If we try to add the last number 565, the sequence of


bins checked is
14 → 15 → 18 → 4 → 11 → 1 → 12 → 6 → 2 → 0
which does not hit bin 16
Open Addressing…
 *We can compare the number of probes required with
that of linear probing
086 → 10 198 → 8
466 → 10 → 11 709 → 6
973 → 4 981 → 12
374 → 13 766 → 6 → 7
473 → 17 342 → 0
191 → 1 393 → 13 → 14
300 → 15 011 → 11 → 12 → 13 → 14 → 15 → 16
538 → 6 → 7 → 8 → 9 913 → 1 → 2
220 → 11 → 12 → 13 → 14 → 15 → 16 → 17 → 18
844 → 8 → 9 → 10 → 11 → 12 → 13 → 14 → 15 → 16 → 17 → 18 → 0 → 1 → 2 → 3
565 → 14 → 15 → 16 → 17 → 18 → 0 → 1 → 2 → 3 → 4 → 5

Open Addressing…
*Deletions
 With linear probing, if we deleted the contents of a
bin, we had to search ahead to determine if any
nodes had to be moved back
 easy with linear probing; we simply moved from
bin to bin until an empty bin was located

Open Addressing…
*Deletions…
 The nonlinear probing associated with quadratic
probing does not allow us to do this efficiently
 For example, suppose we delete 466 which is
currently in bin 11

 The two other entries which pass through bin 11 were


011 and 220
 We cannot (efficiently) find these entries

Open Addressing…
*Deletions…
 Solution
 associate with each bin a field which is either
EMPTY, OCCUPIED, or DELETED

Open Addressing…
*Deletions…
 Initially, all bins are marked EMPTY

 When a bin is filled, it is marked OCCUPIED

 If a bin is emptied (as a result of a remove), it is


marked DELETED
 Note that a bin which is marked as being DELETED
may once again be filled (and hence marked
OCCUPIED)
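A search honouring the three flags might look like this (a sketch assuming quadratic probing; BinState, Bin, and probe_find are illustrative names, not from the chapter). The key point is that a DELETED bin does not stop the probe sequence, only a truly EMPTY one does:

```c
typedef enum { EMPTY, OCCUPIED, DELETED } BinState;

typedef struct {
    int key;
    BinState state;
} Bin;

/* Quadratic-probing search: returns the bin index of the key,
   or -1 if it is not in the table. */
int probe_find(const Bin *table, int m, int key)
{
    int h = key % m;
    for (int i = 0; i < m; i++) {
        int b = (h + i * i) % m;
        if (table[b].state == EMPTY)
            return -1;               /* an EMPTY bin ends the search */
        if (table[b].state == OCCUPIED && table[b].key == key)
            return b;
        /* DELETED, or a different key: keep probing */
    }
    return -1;
}
```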

Open Addressing…
*Deletions…
 Example
 given a hash table with
M = 11 bins, enter the values
135 909 246 894 518 365
Bin 0 1 2 3 4 5 6 7 8 9 10
Entry
Flag E E E E E E E E E E E

Open Addressing…
*Deletions…
 The first three are straightforward
135 → 3 909 → 7 246 → 4

Bin 0 1 2 3 4 5 6 7 8 9 10
Entry 135 246 909
Flag E E E O O E E O E E E

Open Addressing…
 The phenomenon of primary clustering will not occur
with quadratic probing
 However, if multiple items all hash to the same initial
bin, the same sequence of numbers will be followed
 This is termed secondary clustering

 The effect is less significant than that of primary


clustering

Open Addressing…
 Secondary clustering may be a problem if the hash
function does not produce an even distribution of
entries
 One solution to secondary clustering is double
hashing, associating with each element an initial bin
(defined by one hash function) and a skip (defined by
a second hash function)

Open Addressing…
 Example
 Insert the 6 elements
14, 107, 31, 118, 34, 112
into an initially empty hash table of size 11 using
quadratic probing

 Let the hash function be the number modulo 11

Open Addressing…
 The first three fall into bins 3, 8, and 9, respectively

0 1 2 3 4 5 6 7 8 9 10

14 107 31

Open Addressing…
 118 also falls into bin 8 (occupied)
 Thus, we check
8+1=9 - occupied
8+4=1 - unoccupied

0 1 2 3 4 5 6 7 8 9 10

118 14 107 31

Open Addressing…
 34 falls into bin 1 which is occupied, thus we check
 1+1=2 - unoccupied

0 1 2 3 4 5 6 7 8 9 10

118 34 14 107 31

Open Addressing…
 112 falls into bin 2 which is now occupied, thus we
check
2+1=3 - occupied
2+4=6 - unoccupied

0 1 2 3 4 5 6 7 8 9 10

118 34 14 112 107 31

Open Addressing…
 At this point, the hash table is over half full
 We are no longer guaranteed that the insertion of a
new element will be possible
 Solution
 increase the size of the table (perhaps only after failing)
 Problem
 the new size must, too, be prime

0 1 2 3 4 5 6 7 8 9 10

118 34 14 112 107 31

Open Addressing…
 To remove an element, we must simply mark it as
deleted
 In our example, removing 118, we begin in bin 8, and
continue to check 9, and then 1
 Mark that bin as having had an element deleted

0 1 2 3 4 5 6 7 8 9 10
DEL 34 14 112 107 31

Open Addressing…
 To find an element we start by checking the bin it
should have initially been in, and then begin checking
following quadratic probing until either
 we find it, or
 we find a bin which is empty

0 1 2 3 4 5 6 7 8 9 10
DEL 34 14 112 107 31

Open Addressing…
 We find 14 in bin 3
 We don’t find 34 in bin 1 (marked as deleted), so we
check bin 1 + 1 = 2, and find it

0 1 2 3 4 5 6 7 8 9 10
DEL 34 14 112 107 31

Open Addressing…
 We search for 19 in bin 8
 Not finding it, we check
8 + 1 = 9 - occupied
8 + 4 ≡ 1 (mod 11) - deleted
8 + 9 ≡ 6 (mod 11) - occupied
8 + 16 ≡ 2 (mod 11) - occupied
8 + 25 ≡ 0 (mod 11) - unoccupied: not found

0 1 2 3 4 5 6 7 8 9 10
DEL 34 14 112 107 31

Open Addressing…
 Double Hashing
 choose the initial bin with a first hash function
 choose the jump value with a different hash
function i.e. F(i) = i * hash2(key)
• A function such as hash2(key) = R – (key%R) ,
with R a prime smaller than TableSize, will often
work well
Open Addressing…
 Example
 The hash table size, TableSize = 10,
 Insert the keys 89, 18, 49, 58, and 69
 The hash function is h(key) = key%10

 The 2nd hash function is hash2(key) = 7 − (key % 7)
Open Addressing…
 Example…
 89 hashes to bin 9, and 18 to bin 8
 49 collides at bin 9; hash2(49) = 7 − 0 = 7, so try (9 + 7) mod 10 = 6
 58 collides at bin 8; hash2(58) = 7 − 2 = 5, so try (8 + 5) mod 10 = 3
 69 collides at bin 9; hash2(69) = 7 − 6 = 1, so try (9 + 1) mod 10 = 0
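Under the example's parameters (TableSize = 10, hash2(key) = 7 − (key % 7)), the insertions can be sketched as follows (double_hash_insert and the −1 empty-bin marker are illustrative names, not from the chapter):

```c
/* Double hashing: probe sequence (h + i * skip) mod m with
   skip = hash2(key) = R - (key % R) for a prime R < m.
   Returns the bin used, or -1 if m probes find no empty bin. */
int double_hash_insert(int *table, int m, int R, int key)
{
    int h = key % m;
    int skip = R - (key % R);        /* hash2 never evaluates to zero */
    for (int i = 0; i < m; i++) {
        int bin = (h + i * skip) % m;
        if (table[bin] == -1) {      /* -1 marks an empty bin */
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}
```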
Open Addressing…
 Conclusions
 Double hashing has performance that is almost
optimal
 However, calculating the 2nd hash function does add
some extra computational cost per probe
Rehashing
 If the table gets too full, the running time for the
operations will start taking too long

 Solution
 Rehashing
• Build another table twice as big with a new associated
hash function
• Scan down entire original hash table
• Compute new hash value for each element
• Insert into new table
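The steps above can be sketched as one pass over the old table (a minimal sketch: rehash is an illustrative name, −1 marks an empty bin, linear probing is assumed, and a real implementation would pick the new size as a prime near twice the old):

```c
#include <stdlib.h>

/* Rehashing: allocate a bigger table and re-insert every element
   under the new table size. */
int *rehash(const int *old_table, int old_m, int new_m)
{
    int *new_table = malloc((size_t)new_m * sizeof *new_table);
    for (int i = 0; i < new_m; i++)
        new_table[i] = -1;
    for (int i = 0; i < old_m; i++) {     /* scan the entire old table */
        int key = old_table[i];
        if (key == -1)
            continue;
        int h = key % new_m;              /* compute the new hash value */
        while (new_table[h] != -1)
            h = (h + 1) % new_m;          /* linear probing in the new table */
        new_table[h] = key;
    }
    return new_table;
}
```
Rehashing the example's full size-7 table (holding 6, 15, 23, 24, 13) into a size-17 table places 6 at 6, 23 at 7, 24 at 8, 13 at 13, and 15 at 15.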
Rehashing…
 Example
 hash table size = 7
 Insert the elements 13, 15, 24, and 6
 The hash function is h(key) = key%7
 Use linear probing to resolve collisions
 Insert 23 now

 [Figure: the original hash table, and the table after inserting 23]

 Because this table is now too full, enlarge it to size 17,
and redefine the hash function
Rehashing…
 Example…
 hash table size = 17
 The hash function is h(key) = key%17
 The old table is scanned, and elements 6, 15, 23, 24,
and 13 are inserted into the new table
 Use linear probing to resolve collisions

 [Figure: the table after rehashing]
Rehashing…
 Complexity of Rehashing
 It takes O(N) time to rehash, since there are N
elements to rehash
Hashing : Summary
 Hash tables can be used to implement the Insert and
Find operations in constant average time
 For these time bounds to be valid, special attention has to
be paid to load factor
 For separate chaining hashing, λ should be close to 1
 For open addressing hashing, λ should not exceed 0.5
 If linear probing is used, performance degenerates rapidly
as λ approaches 1
 Rehashing can be implemented to allow the table to grow,
thus maintaining a reasonable λ
