HASHING
Division Method:
One common method of determining a hash address is the division method: the key's remainder on division by the table size, h(key) = key mod m.
For example, with a table of size m = 8, the key 42 hashes to 42 mod 8 = 2.
The division method is generally a reasonable strategy, unless the key happens to have
some undesirable properties.
Note: if the table size is 10 and all of the keys end in zero, then every key hashes to slot 0.
In the above example, 42 mod 8 = 2 maps to an already-filled position in the hash table. This is
known as a collision, i.e. two or more record keys map to the same array index.
In this case, the choice of hash function and table size needs to be carefully considered. The
best table sizes are prime numbers.
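A quick sketch in Python (the function name and the extra sample keys are our own, for illustration):

def h_division(key, m):
    """Division-method hash: map an integer key to a slot in 0 .. m-1."""
    return key % m

# Table of size 8, as in the example above: 42 hashes to slot 2.
print(h_division(42, 8))                          # 2
# With a table size of 10, keys ending in zero all collide at slot 0.
print([h_division(k, 10) for k in (10, 20, 30)])  # [0, 0, 0]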
Multiplication method:
The simplest situation is when the keys are floating-point numbers known to be in a fixed
range.
For example:
If the keys are numbers that are greater than 0 and less than 1, we can just multiply each key by m
(the table size) and take the integer part to get an address between 0 and m - 1.
Algorithm:
Mathematically, h(k) = ⌊m · (k · A mod 1)⌋, where A is a constant with 0 < A < 1. In the fixed-point version used below, choose m = 2^p and a w-bit word size, and pick an integer s with 0 < s < 2^w (so that A = s / 2^w). Compute k · s and write it as k · s = r1 · 2^w + r0; h(k) is then the p most significant bits of r0.
Example:
m = 8 (implies m = 2^3, p = 3)
w = 5
k = 21
s = 13, with 0 < s < 2^w = 32
k · s = 21 · 13
     = 273
     = 8 · 2^5 + 17
     = r1 · 2^w + r0
so r1 = 8 and r0 = 17 = 10001 in binary.
Written in w = 5 bits, r0 = 10001. The p = 3 most significant bits of r0 are 100 in binary, i.e. 4,
so h(21) = 4.
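A minimal sketch of this fixed-point computation in Python (the function name and parameter names are our own):

def h_multiplication(key, p, w, s):
    """Multiplication-method hash with m = 2**p slots and a w-bit word.

    s is a constant with 0 < s < 2**w (so A = s / 2**w).
    """
    mask = (1 << w) - 1        # keep only the low w bits
    r0 = (key * s) & mask      # r0 = (k * s) mod 2**w
    return r0 >> (w - p)       # the p most significant bits of r0

# Worked example: m = 8 (p = 3), w = 5, s = 13, k = 21.
# 21 * 13 = 273 = 8 * 2**5 + 17; r0 = 17 = 10001 in binary; top 3 bits = 100 = 4.
print(h_multiplication(21, p=3, w=5, s=13))   # 4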
Exercise Example:
m = 4 (implies m = 2^2, p = 2)
w = 3
k = 12
s = 5, with 0 < s < 2^w = 2^3 = 8
k · s = 12 · 5
     = ?
     = ? · 2^3 + ?
     = r1 · 2^w + r0
r1 = ?
r0 = ? = ? in binary
Written in w = 3 bits, r0 = ?
The p = 2 most significant bits of r0 are ? in binary, i.e. ?, so h(12) = ?.
Universal method:
The idea is very simple. Suppose you have an input data item of m bits and you want a hash
function that produces n bits: first generate a random binary matrix M of order n × m.
For example, suppose you have the key 11, whose binary form 1011 is a four-bit input
value (m = 4), and you want to generate a three-bit hash value (n = 3).
M = | 0 1 0 0 |
    | 1 0 1 1 |
    | 1 1 0 1 |

and if the data value was 1011 the hash value would be computed as:

            | 0 1 0 0 |   | 1 |   | 0 |
h(x) = Mx = | 1 0 1 1 | · | 0 | = | 1 |
            | 1 1 0 1 |   | 1 |   | 0 |
                          | 1 |
There are a number of other ways to look at the way the arithmetic is done that suggest
different ways of implementing the algorithm.
The first is to notice that what you are doing is ANDing each row with the data column
vector. Taking the second row as an example: (1 0 1 1) AND (1 0 1 1) = (1 0 1 1),
and then you add up the bits of the result modulo 2: 1 + 0 + 1 + 1 = 3 ≡ 1 (mod 2).
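A small sketch of the matrix method in Python (the helper names and the fixed seed are our own choices):

import random

def random_matrix(n, m, rng=random.Random(0)):
    """Generate a random n-by-m binary matrix, one list of bits per row."""
    return [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]

def h_matrix(M, x_bits):
    """h(x) = M * x over GF(2): AND each row with x, then XOR the bits."""
    out = 0
    for row in M:
        bit = 0
        for r, x in zip(row, x_bits):
            bit ^= r & x       # summing ANDed bits modulo 2
        out = (out << 1) | bit
    return out

# The 3 x 4 matrix and the key 11 = 1011 from the example above:
M = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1]]
print(h_matrix(M, [1, 0, 1, 1]))   # 0b010 = 2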
Hashing gives an alternative approach that is often the fastest and most convenient way of
solving problems in areas such as AI search programs, cryptography, networks, and complexity
theory.
Collision-resolution schemes divide into two families. The difference between the two has to do
with whether collisions are stored outside the table (open hashing), or whether a collision
results in storing one of the records at another slot in the table (closed hashing).
The particular hashing method that one uses depends on many factors. One important
factor is the ratio of the number of keys in the table to the number of hash addresses. It is
called the load factor, and is given by: load factor (α) = n/m,
where n is the number of keys in the table and m is the number of hash addresses (the table size).
Open Hashing:
The simplest form of open hashing defines each slot in the hash table to be the head of a
linked list. All records that hash to a particular slot are placed on that slot’s linked list.
The below figure illustrates a hash table where each slot stores one record and a link pointer
to the rest of the list.
Any keys that hash to the same index are simply added to that slot's linked list; there is no need to
search for empty cells in the array. This method is called separate chaining.
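A minimal separate-chaining sketch in Python (the class and method names, and the sample keys, are our own):

class ChainedHashTable:
    """Open hashing: each slot heads a chain of colliding records."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def insert(self, key):
        # Colliding keys are simply appended to the slot's chain;
        # no search for an empty array cell is ever needed.
        self.slots[key % self.m].append(key)

    def search(self, key):
        return key in self.slots[key % self.m]

t = ChainedHashTable(10)
for k in (89, 18, 49, 58, 69):
    t.insert(k)
print(t.slots[9])    # [89, 49, 69] -- three keys chained at slot 9
print(t.search(58))  # True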
Closed Hashing (Open Addressing):
It resolves collisions in the prime area, that is, the area that contains all of the home addresses:
when a data item cannot be placed at the index calculated by the hash function, another
location in the array is sought.
There are different methods of open addressing, which vary in the method used to find the
next vacant cell.
They are (i) Linear probing
(ii) Quadratic probing
(iii) Pseudo random probing
(iv) Double hashing
(v) Key – offset
Hashing with Linear Probing:
In linear probing, a collision is resolved by searching the following slots of the table sequentially
until an empty one is found. If the physical end of the table is reached, the linear search wraps
around to the beginning of the table and continues from there.
If an empty slot is not found before returning to the point of the collision, the table is full.
A problem with the linear-probe method is that blocks of occupied cells can form as
collisions are resolved. This is known as primary clustering.
This means that any key that hashes into the cluster will require several attempts to resolve
the collision.
Exercise Example:
Insert the nodes 89, 18, 49, 58, and 69 into a hash table that holds 10 items using the
division method.
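One way to work this exercise with linear probing (a sketch; the function name is our own, and the layout follows from h(k) = k mod 10 with step-1 wraparound):

def insert_linear(table, key):
    """Linear probing: try slots h, h+1, h+2, ... wrapping around."""
    m = len(table)
    home = key % m
    for step in range(m):
        j = (home + step) % m
        if table[j] is None:
            table[j] = key
            return j
    raise RuntimeError("table is full")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_linear(table, k))
# 89 -> 9, 18 -> 8, 49 -> 0, 58 -> 1, 69 -> 2: a primary cluster spanning 8..2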
Hashing with Quadratic Probing:
In this probe, rather than always moving one spot, move i^2 spots from the point of
collision, where i is the number of attempts needed to resolve the collision.
In the linear probe, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3, and so
on. In quadratic probing, probes go to x+1, x+4, x+9, and so on; the distance from the initial
probe is the square of the step number: x+1^2, x+2^2, x+3^2, x+4^2, and so on.
That is, at first it picks the adjacent cell; if that is occupied, it tries 4 cells away; if that too is
occupied, it tries 9 cells away, and so on. This eliminates the primary clustering problem of
linear probing.
Consider the above exercise problem, keys 89, 18, 49, 58, 69 with table size 10.
Here each key that hashes to the same home location follows the same probe sequence, so every
such key requires a longer probe. This phenomenon is called secondary clustering. It is not a
serious problem, but quadratic probing is not often used because there is a slightly better solution.
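A sketch of quadratic probing on the same keys (the function name is our own):

def insert_quadratic(table, key):
    """Quadratic probing: from home slot x, try x, x + 1**2, x + 2**2, ..."""
    m = len(table)
    x = key % m
    for i in range(m):
        j = (x + i * i) % m
        if table[j] is None:
            table[j] = key
            return j
    raise RuntimeError("no free slot found on the probe sequence")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_quadratic(table, k))
# 89 -> 9, 18 -> 8, 49 -> 0 (9 + 1), 58 -> 2 (8 + 4), 69 -> 3 (9 + 4)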
Hashing with Pseudo-random Probing:
This method uses pseudo-random numbers to resolve collisions; i.e. the probe function
selects the next position in the probe sequence at random from among the unvisited
slots, so that the probe sequence is a random permutation of the hash-table positions.
In this probing, the ith slot in the probe sequence is (h(key) + r_i) mod M, where r_i is the ith
value in a random permutation of the numbers from 1 to M - 1.
All insertions and searches use the same sequence of random numbers.
36 % 8 = 4
18 % 8 = 2
72 % 8 = 0
43 % 8 = 3
6 % 8 = 6
Now insert 60: 60 % 8 = 4, which is a collision.
Pseudo-random numbers are a relatively simple solution, but they have one significant
limitation: all keys follow the same collision-resolution path through the list. Because of this,
pseudo-random collision resolution can create significant secondary clustering.
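A sketch of pseudo-random probing (the seed and helper name are our own; note that every key shares the one shuffled offset sequence, which is exactly the limitation just described):

import random

def probe_sequence(home, m, seed=0):
    """Yield home, then (home + r_i) % m for i = 1 .. m-1, where
    r_1 .. r_{m-1} is one fixed random permutation of 1 .. m-1."""
    offsets = list(range(1, m))
    random.Random(seed).shuffle(offsets)   # same seed => same permutation
    yield home
    for r in offsets:
        yield (home + r) % m

# Resolving the collision of 60 at slot 4 in a table of size 8:
print(list(probe_sequence(60 % 8, 8)))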
Double Hashing:
Double hashing uses the idea of applying a second hash function to the key when a collision
occurs. The result of the second hash function will be the number of positions from the
point of collision to insert.
Table size = 10
Hash1(key) = key % 10
Hash2(key) = 7 - (key % 7)
We choose 7 because it is a prime number smaller than the size of the table.
Hash (89) = 89 % 10 = 9
Hash (18) = 18 % 10 = 8
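A sketch completing the example in code (the function name is our own; hash1 and hash2 are as defined above):

def insert_double(table, key):
    """Double hashing: the step size comes from a second hash of the key."""
    m = len(table)
    j = key % 10               # hash1(key) = key % 10
    step = 7 - (key % 7)       # hash2(key) = 7 - (key % 7), never zero
    for _ in range(m):
        if table[j] is None:
            table[j] = key
            return j
        j = (j + step) % m
    raise RuntimeError("table is full")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_double(table, k))
# 89 -> 9, 18 -> 8, 49 -> 6 (step 7), 58 -> 3 (step 5), 69 -> 0 (step 1)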
NOTE: the second hash function must never evaluate to zero; here Hash2(key) = 7 - (key % 7) always lies between 1 and 7.
Key Offset:
It is a double-hashing method that produces different collision paths for different keys. Whereas
the pseudo-random number generator produces a new address as a function of the
previous address alone, key offset calculates the new address as a function of the old address and
the key.
One of the simplest versions simply adds the quotient of the key divided by the list size to the
current address to determine the next collision-resolution address, as shown below:
offset = ⌊key / listSize⌋
next address = (offset + old address) mod listSize
For example:
The key is 166702 and the list size is 307; the modulo-division hashing method generates an
address of 1. This is a collision because that address already holds the key 070918.
Using key offset to calculate the next address, we get 237, as shown below:
offset = ⌊166702 / 307⌋ = 543; next address = (543 + 1) mod 307 = 237.
If 237 were also a collision, we would repeat the process to locate the next address:
(543 + 237) mod 307 = 166.
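The same arithmetic as a short sketch in code (the helper name is our own):

def next_address(key, addr, list_size):
    """Key offset: offset = key // list_size, then step from the old address."""
    offset = key // list_size
    return (offset + addr) % list_size

key, m = 166702, 307
addr = key % m                     # 1, colliding with key 070918
addr = next_address(key, addr, m)  # (543 + 1) % 307
print(addr)                        # 237
addr = next_address(key, addr, m)  # (543 + 237) % 307, if 237 also collides
print(addr)                        # 166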
Skip Lists:
A skip list is a type of data structure that can be used as an alternative to balanced (binary)
trees or B-trees. Compared to a binary tree, skip lists allow quick search, insertion, and
deletion of elements with simple algorithms. This is achieved by using probabilistic
balancing rather than the strictly enforced balancing of B-trees.
Skip lists are also significantly faster than equivalent algorithms for B-Trees.
A skip list is basically a linked list with additional pointers that allow intermediate nodes to
be skipped, hence the name skip list.
For example:
If a second pointer pointing two nodes ahead is added to every node, the number of
comparisons goes down to n/2 + 1 in the worst case.
Consider a sorted list where every other node has an additional pointer to the node two
ahead of it in the list.
In the list of the above figure, every second node has a pointer two ahead of it, and every fourth
node has a pointer four ahead of it. Here we need to examine no more than ⌈n/4⌉ + 2
nodes.
In the below figure, every (2^i)th node has a pointer 2^i nodes ahead (i = 1, 2, ...); the number
of nodes to be examined can then be reduced to ⌈log2 n⌉ while only doubling the number of
pointers.
Here, every (2^i)th node has a pointer to a node 2^i nodes ahead (i = 1, 2, ...).
A node that has k forward pointers is called a level k node. If every (2^i)th node has a
pointer 2^i nodes ahead, then the levels are distributed as follows:
# of level 1 nodes: 50%
# of level 2 nodes: 25%
# of level 3 nodes: 12.5%
Such a data structure can be used for fast searching but insertions and deletions will
be extremely cumbersome, since levels of nodes will have to change.
What would happen if the levels of nodes were randomly chosen but in the same
proportions (below figure)?
o The level of a node is chosen randomly when the node is inserted.
o A node's ith pointer, instead of pointing to a node that is 2^(i-1) nodes ahead,
points to the next node of level i or higher.
o In this case, insertions and deletions will not change the level of any node.
o Some arrangements of levels would give poor execution times but it can be
shown that such arrangements are rare.
Such a linked representation is called a skip list.
Each element is represented by a node the level of which is chosen randomly when
the node is inserted, without regard for the number of elements in the data
structure.
A level i node has i forward pointers, indexed 1 through i. There is no need to store
the level of a node in the node.
Maxlevel is the maximum number of levels in a node.
o Level of a list = Maxlevel
o Level of empty list = 1
o Level of header = Maxlevel
Initialization:
An element NIL is allocated and given a key greater than any legal key. All levels of all skip lists
are terminated with NIL. A new list is initialized so that the level of the list equals MaxLevel and
all forward pointers of the list's header point to NIL.
Search:
To search for a key, start at the highest-level forward pointer of the header and move forward
along that level as long as the next key is less than the search key; when the next key would
overshoot, drop down one level and continue. At level 1, the next node either contains the key
or the key is not in the list. For example, a list might have 9 elements at level 1, 3 elements at
level 2, 3 elements at level 3, and 1 element at level 6.
The analysis suggests starting the search at level L(n), where L(n) = log2 n for p = 1/2 and, in
general, L(n) = log_{1/p} n,
where p is the fraction of the nodes with level i pointers which also have level (i + 1)
pointers.
However, starting at the highest level does not alter the efficiency in a significant
way.
Another important question to ask is:
What should MaxLevel be? A good choice is MaxLevel = L(N) = log_{1/p} N, where N is an upper
bound on the number of elements in the skip list.
The complexity of search, delete, and insert is dominated by the time required to search for
the appropriate element. This in turn is proportional to the length of the search path, which is
determined by the pattern in which elements with different levels appear as we traverse the
list.
Insert and delete involve an additional cost proportional to the level of the node being
inserted or deleted.
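A minimal skip-list sketch in Python tying these pieces together (the names, MAX_LEVEL = 16, and p = 1/2 are our own choices):

import random

MAX_LEVEL = 16   # an upper bound on node levels (MaxLevel)
P = 0.5          # fraction of level-i nodes that also reach level i + 1

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i]: next node of level > i

NIL = Node(float("inf"), MAX_LEVEL)     # key greater than any legal key
header = Node(None, MAX_LEVEL)
header.forward = [NIL] * MAX_LEVEL      # all levels terminate at NIL

def random_level():
    """Choose a node's level at insertion time, independent of list size."""
    level = 1
    while random.random() < P and level < MAX_LEVEL:
        level += 1
    return level

def search(key):
    """Drop from the top level, moving forward while the next key is smaller."""
    x = header
    for i in reversed(range(MAX_LEVEL)):
        while x.forward[i].key < key:
            x = x.forward[i]
    x = x.forward[0]
    return x if x.key == key else None

def insert(key):
    update = [None] * MAX_LEVEL         # last node visited at each level
    x = header
    for i in reversed(range(MAX_LEVEL)):
        while x.forward[i].key < key:
            x = x.forward[i]
        update[i] = x
    node = Node(key, random_level())
    for i in range(len(node.forward)):  # splice in at each of the node's levels
        node.forward[i] = update[i].forward[i]
        update[i].forward[i] = node

for k in (3, 7, 12, 19, 25):
    insert(k)
print(search(12).key)   # 12
print(search(8))        # None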