Analysis of Algorithms CS 477/677: Hashing Instructor: George Bebis

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 53

Analysis of Algorithms

CS 477/677
Hashing
Instructor: George Bebis

(Chapter 11)

2
The Search Problem
Find items with keys matching a given search key
Given an array A, containing n keys, and a search key
x, find the index i such as x=A[i]
As in the case of sorting, a key could be part of a
large record.
3
Applications
Keeping track of customer account information at
a bank
Search through records to check balances and perform
transactions
Keep track of reservations on flights
Search to find empty seats, cancel/modify reservations
Search engine
Looks for all documents containing a given word
4
Special Case: Dictionaries
Dictionary = data structure that supports mainly
two basic operations: insert a new item and
return an item with a given key
Queries: return information about the set S:
Search (S, k)
Minimum (S), Maximum (S)
Successor (S, x), Predecessor (S, x)
Modifying operations: change the set
Insert (S, k)
Delete (S, k) not very often
5
Direct Addressing
Assumptions:
Key values are distinct
Each key is drawn from a universe U = {0, 1, . . . , m - 1}
Idea:
Store the items in an array, indexed by keys

Direct-address table representation:
An array T[0 . . . m - 1]
Each slot, or position, in T corresponds to a key in U
For an element x with key k, a pointer to x (or x itself) will be placed
in location T[k]
If there are no elements with key k in the set, T[k] is empty,
represented by NIL
6
Direct Addressing (contd)
7
Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]

Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] x

Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] NIL
Running time for these operations: O(1)
8
Comparing Different Implementations
Implementing dictionaries using:
Direct addressing
Ordered/unordered arrays
Ordered/unordered linked lists
Insert Search
ordered array
ordered list
unordered array
unordered list
O(N)
O(N)
O(N)
O(N)
O(1)
O(1)
O(lgN)
O(N)
direct addressing O(1) O(1)
9
Examples Using Direct Addressing
Example 2:
Example 1:
10
Hash Tables
When K is much smaller than U, a hash table
requires much less space than a direct-address
table
Can reduce storage requirements to |K|
Can still get O(1) search time, but on the average
case, not the worst case
11
Hash Tables
Idea:
Use a function h to compute the slot for each key
Store the element in slot h(k)
A hash function h transforms a key into an index in a
hash table T[0m-1]:
h : U {0, 1, . . . , m - 1}
We say that k hashes to slot h(k)
Advantages:
Reduce the range of array indices handled: m instead of |U|
Storage is also reduced
12
Example: HASH TABLES
U
(universe of keys)
K
(actual
keys)
0
m - 1
h(k
3
)
h(k
2
) = h(k
5
)
h(k
1
)
h(k
4
)
k
1

k
4
k
2

k
5

k
3

13
Revisit Example 2
14
Do you see any problems
with this approach?
U
(universe of keys)
K
(actual
keys)
0
m - 1
h(k
3
)
h(k
2
) = h(k
5
)
h(k
1
)
h(k
4
)
k
1

k
4
k
2

k
5

k
3

Collisions!
15
Collisions
Two or more keys hash to the same slot!!
For a given set K of keys
If |K| m, collisions may or may not happen,
depending on the hash function
If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
Avoiding collisions completely is hard, even with
a good hash function
16
Handling Collisions
We will review the following methods:
Chaining
Open addressing
Linear probing
Quadratic probing
Double hashing
We will discuss chaining first, and ways to
build good functions.
17
Handling Collisions Using Chaining
Idea:
Put all elements that hash to the same slot into a
linked list








Slot j contains a pointer to the head of the list of all
elements that hash to j
18
Collision with Chaining - Discussion
Choosing the size of the table
Small enough not to waste space
Large enough such that lists remain short
Typically 1/5 or 1/10 of the total number of elements
How should we keep the lists: ordered or not?
Not ordered!
Insert is fast
Can easily remove the most recently inserted elements
19
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]

Worst-case running time is O(1)
Assumes that the element being inserted isnt
already in the list
It would take an additional search to check if it
was already inserted
20
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]

Need to find the element to be deleted.
Worst-case running time:
Deletion depends on searching the corresponding list
21
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
Running time is proportional to the length of the
list of elements in slot h(k)
22
Analysis of Hashing with Chaining:
Worst Case
How long does it take to
search for an element with a
given key?
Worst case:
All n keys hash to the same slot
Worst-case time to search is
O(n), plus time to compute the
hash function
0
m - 1
T
chain
23
Analysis of Hashing with Chaining:
Average Case
Average case
depends on how well the hash function
distributes the n keys among the m slots
Simple uniform hashing assumption:
Any given element is equally likely to hash
into any of the m slots (i.e., probability of
collision Pr(h(x)=h(y)), is 1/m)
Length of a list:
T[j] = n
j
, j = 0, 1, . . . , m 1
Number of keys in the table:
n = n
0
+ n
1
+ + n
m-1

Average value of n
j
:
E[n
j
] = = n/m
n
0
= 0
n
m 1
= 0

T
n
2

n
3

n
j

n
k

24
Load Factor of a Hash Table
Load factor of a hash table T:
o = n/m
n = # of elements stored in the table
m = # of slots in the table = # of linked lists
o encodes the average number of
elements stored in a chain
o can be <, =, > 1
0
m - 1
T
chain
chain
chain
chain
25
Case 1: Unsuccessful Search
(i.e., item not stored in the table)
Theorem
An unsuccessful search in a hash table takes expected time
under the assumption of simple uniform hashing
(i.e., probability of collision Pr(h(x)=h(y)), is 1/m)
Proof
Searching unsuccessfully for any key k
need to search to the end of the list T[h(k)]
Expected length of the list:
E[n
h(k)
] = = n/m
Expected number of elements examined in an unsuccessful search is
Total time required is:
O(1) (for computing the hash function) +
(1 ) o O +
(1 ) o O +
26
Case 2: Successful Search
27
Analysis of Search in Hash Tables
If m (# of slots) is proportional to n (# of
elements in the table):
n = O(m)
= n/m = O(m)/m = O(1)
Searching takes constant time on average
28
Hash Functions
A hash function transforms a key into a table
address
What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input,
every output is equally likely (simple uniform hashing)
In practice, it is very hard to satisfy the simple
uniform hashing property
i.e., we dont know in advance the probability
distribution that keys are drawn from


29
Good Approaches for Hash Functions
Minimize the chance that closely related keys
hash to the same slot
Strings such as pt and pts should hash to
different slots
Derive a hash value that is independent from
any patterns that may exist in the distribution
of the keys
30
The Division Method
Idea:
Map a key k into one of the m slots by taking
the remainder of k divided by m
h(k) = k mod m
Advantage:
fast, requires only one operation
Disadvantage:
Certain values of m are bad, e.g.,
power of 2
non-prime numbers
31
Example - The Division Method
If m = 2
p
, then h(k) is just the least
significant p bits of k
p = 1 m = 2
h(k) = , least significant 1 bit of k
p = 2 m = 4
h(k) = , least significant 2 bits of k
Choose m to be a prime, not close to a
power of 2
Column 2:
Column 3:
{0, 1}
{0, 1, 2, 3}
k mod 97
k mod 100
m
97
m
100
32
The Multiplication Method
Idea:
Multiply key k by a constant A, where 0 < A < 1
Extract the fractional part of kA
Multiply the fractional part by m
Take the floor of the result
h(k) = = m (k A mod 1)


Disadvantage: Slower than division method
Advantage: Value of m is not critical, e.g., typically 2
p

fractional part of kA = kA - kA
33
Example Multiplication Method
34
Universal Hashing
In practice, keys are not randomly distributed
Any fixed hash function might yield (n) time
Goal: hash functions that produce random
table indices irrespective of the keys
Idea:
Select a hash function at random, from a designed
class of functions at the beginning of the execution
35
Universal Hashing


(at the beginning
of the execution)
36
Definition of Universal Hash Functions
H={h(k): U(0,1,..,m-1)}
37
How is this property useful?
Pr(h(x)=h(y))=
38
Universal Hashing Main Result
With universal hashing the chance of collision
between distinct keys k and l is no more than the
1/m chance of collision if locations h(k) and h(l)
were randomly and independently chosen from
the set {0, 1, , m 1}
39
Designing a Universal Class
of Hash Functions
Choose a prime number p large enough so that every
possible key k is in the range [0 ... p 1]
Z
p
= {0, 1, , p - 1} and Z
p
*
= {1, , p - 1}
Define the following hash function
h
a,b
(k) = ((ak + b) mod p) mod m,
a e Z
p
*
and b e Z
p

The family of all such hash functions is
H
p,m
= {h
a,b
: a e Z
p
*
and b e Z
p
}
a , b: chosen randomly at the beginning of execution
The class H
p,m
of hash
functions is universal
40
Example: Universal Hash Functions
E.g.: p = 17, m = 6
h
a,b
(k) = ((ak + b) mod p) mod m
h
3,4
(8) = ((38 + 4) mod 17) mod 6
= (28 mod 17) mod 6
= 11 mod 6
= 5
41
Advantages of Universal Hashing

Universal hashing provides good results on
average, independently of the keys to be stored
Guarantees that no input will always elicit the
worst-case behavior
Poor performance occurs only when the random
choice returns an inefficient hash function this
has small probability
42
Open Addressing
If we have enough contiguous memory to store all the keys
(m > N) store the keys in the table itself
No need to use linked lists anymore
Basic idea:
Insertion: if a slot is full, try another one,
until you find an empty one
Search: follow the same sequence of probes
Deletion: more difficult ... (well see why)
Search time depends on the length of the
probe sequence!
e.g., insert 14
43
Generalize hash function notation:
A hash function contains two arguments now:
(i) Key value, and (ii) Probe number

h(k,p), p=0,1,...,m-1

Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>

Must be a permutation of <0,1,...,m-1>
There are m! possible permutations
Good hash functions should be able to
produce all m! probe sequences
insert 14
<1, 5, 9>
Example
44
Common Open Addressing Methods

Linear probing
Quadratic probing
Double hashing

Note: None of these methods can generate
more than m
2
different probing sequences!
45
Linear probing: Inserting a key
Idea: when there is a collision, check the next available
position in the table (i.e., probing)

h(k,i) = (h
1
(k) + i) mod m
i=0,1,2,...
First slot probed: h
1
(k)
Second slot probed: h
1
(k) + 1
Third slot probed: h
1
(k)+2, and so on


Can generate m probe sequences maximum, why?
probe sequence: < h1(k), h1(k)+1 , h1(k)+2 , ....>
wrap around
46
Linear probing: Searching for a key
Three cases:
(1) Position in table is occupied with an
element of equal key
(2) Position in table is empty
(3) Position in table occupied with a
different element
Case 2: probe the next higher index
until the element is found or an
empty position is found
The process wraps around to the
beginning of the table
0
m - 1
h(k
3
)
h(k
2
) = h(k
5
)
h(k
1
)
h(k
4
)
47
Linear probing: Deleting a key
Problems
Cannot mark the slot as empty
Impossible to retrieve keys inserted
after that slot was occupied
Solution
Mark the slot with a sentinel value
DELETED
The deleted slot can later be used
for insertion
Searching will be able to find all the
keys
0
m - 1
48
Primary Clustering Problem
Some slots become more likely than others
Long chunks of occupied slots are created
search time increases!!

Slot b:
2/m
Slot d:
4/m
Slot e:
5/m
initially, all slots have probability 1/m
49
Quadratic probing
i=0,1,2,...
50
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h
1
(k) + i h
2
(k) ) mod m, i=0,1,...
Initial probe: h
1
(k)
Second probe is offset by h
2
(k) mod m, so on ...
Advantage: avoids clustering
Disadvantage: harder to delete an element
Can generate m
2
probe sequences maximum
51
Double Hashing: Example
h
1
(k) = k mod 13
h
2
(k) = 1+ (k mod 11)
h(k,i) = (h
1
(k) + i h
2
(k) ) mod 13
Insert key 14:
h
1
(14,0) = 14 mod 13 = 1
h(14,1) = (h
1
(14) + h
2
(14)) mod 13
= (1 + 4) mod 13 = 5
h(14,2) = (h
1
(14) + 2 h
2
(14)) mod 13
= (1 + 8) mod 13 = 9
79
69
98
72
50
0
9
4
2
3
1
5
6
7
8
10
11
12
14
52
Analysis of Open Addressing
a
1 a
(load factor)
k=0
53
Analysis of Open Addressing (contd)
Example (similar to Exercise 11.4-4, page 244)

Unsuccessful retrieval:
a=0.5 E(#steps) = 2
a=0.9 E(#steps) = 10

Successful retrieval:
a=0.5 E(#steps) = 3.387
a=0.9 E(#steps) = 3.670

You might also like