HASHING
Division Method:
One common method of determining a hash address is the division method: the key's remainder on division by the table size, h(key) = key mod m.
For example, with a table of size m = 8, the key 42 hashes to 42 mod 8 = 2.
The division method is generally a reasonable strategy, unless the key happens to have
some undesirable properties.
Note: if the table size is 10 and all of the keys end in zero, then every key hashes to slot 0.
In the above example, 42 mod 8 = 2 maps to an already-filled position in the hash table. This is
known as a collision, i.e. two or more record keys map to the same array index.
In this case, the choice of hash function and table size needs to be carefully considered. The
best table sizes are prime numbers.
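A quick sketch in Python (the function name and the extra sample keys are our own, for illustration):

def h_division(key, m):
    """Division-method hash: map an integer key to a slot in 0 .. m-1."""
    return key % m

# Table of size 8, as in the example above: 42 hashes to slot 2.
print(h_division(42, 8))                          # 2
# With a table size of 10, keys ending in zero all collide at slot 0.
print([h_division(k, 10) for k in (10, 20, 30)])  # [0, 0, 0]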
Multiplication method:
The simplest situation is when the keys are floating-point numbers known to be in a fixed
range.
For example:
If the keys are numbers that are greater than 0 and less than 1, we can just multiply each key by m
(the table size) and take the integer part to get an address between 0 and m - 1.
Algorithm:
Mathematically, h(k) = ⌊m · (k · A mod 1)⌋, where A is a constant with 0 < A < 1. In the fixed-point version used below, choose m = 2^p and a w-bit word size, and pick an integer s with 0 < s < 2^w (so that A = s / 2^w). Compute k · s and write it as k · s = r1 · 2^w + r0; h(k) is then the p most significant bits of r0.
Example:
m = 8 (implies m = 2^3, p = 3)
w = 5
k = 21
s = 13, with 0 < s < 2^w = 32
k · s = 21 · 13
     = 273
     = 8 · 2^5 + 17
     = r1 · 2^w + r0
so r1 = 8 and r0 = 17 = 10001 in binary.
Written in w = 5 bits, r0 = 10001. The p = 3 most significant bits of r0 are 100 in binary, i.e. 4,
so h(21) = 4.
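A minimal sketch of this fixed-point computation in Python (the function name and parameter names are our own):

def h_multiplication(key, p, w, s):
    """Multiplication-method hash with m = 2**p slots and a w-bit word.

    s is a constant with 0 < s < 2**w (so A = s / 2**w).
    """
    mask = (1 << w) - 1        # keep only the low w bits
    r0 = (key * s) & mask      # r0 = (k * s) mod 2**w
    return r0 >> (w - p)       # the p most significant bits of r0

# Worked example: m = 8 (p = 3), w = 5, s = 13, k = 21.
# 21 * 13 = 273 = 8 * 2**5 + 17; r0 = 17 = 10001 in binary; top 3 bits = 100 = 4.
print(h_multiplication(21, p=3, w=5, s=13))   # 4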
Exercise Example:
m = 4 (implies m = 2^2, p = 2)
w = 3
k = 12
s = 5, with 0 < s < 2^w = 2^3 = 8
k · s = 12 · 5
     = ?
     = ? · 2^3 + ?
     = r1 · 2^w + r0
r1 = ?
r0 = ? = ? in binary
Written in w = 3 bits, r0 = ?
The p = 2 most significant bits of r0 are ? in binary, i.e. ?, so h(12) = ?.
Universal method:
The idea is very simple. Suppose you have an input data item of m bits and you want a hash
function that produces n bits: first generate a random binary matrix M of order n × m.
For example, suppose you have the key 11, whose binary form 1011 is a four-bit input
value (m = 4), and you want to generate a three-bit hash value (n = 3).
M = | 0 1 0 0 |
    | 1 0 1 1 |
    | 1 1 0 1 |

and if the data value was 1011 the hash value would be computed as:

            | 0 1 0 0 |   | 1 |   | 0 |
h(x) = Mx = | 1 0 1 1 | · | 0 | = | 1 |
            | 1 1 0 1 |   | 1 |   | 0 |
                          | 1 |
There are a number of other ways to look at the way the arithmetic is done that suggest
different ways of implementing the algorithm.
The first is to notice that what you are doing is ANDing each row with the data column
vector. Taking the second row as an example: (1 0 1 1) AND (1 0 1 1) = (1 0 1 1),
and then you add up the bits of the result modulo 2: 1 + 0 + 1 + 1 = 3 ≡ 1 (mod 2).
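A small sketch of the matrix method in Python (the helper names and the fixed seed are our own choices):

import random

def random_matrix(n, m, rng=random.Random(0)):
    """Generate a random n-by-m binary matrix, one list of bits per row."""
    return [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]

def h_matrix(M, x_bits):
    """h(x) = M * x over GF(2): AND each row with x, then XOR the bits."""
    out = 0
    for row in M:
        bit = 0
        for r, x in zip(row, x_bits):
            bit ^= r & x       # summing ANDed bits modulo 2
        out = (out << 1) | bit
    return out

# The 3 x 4 matrix and the key 11 = 1011 from the example above:
M = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1]]
print(h_matrix(M, [1, 0, 1, 1]))   # 0b010 = 2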
Hashing gives an alternative approach that is often the fastest and most convenient way of
solving problems in areas such as AI search programs, cryptography, networks, and complexity
theory.
Collision-resolution schemes divide into two families. The difference between the two has to do
with whether collisions are stored outside the table (open hashing), or whether a collision
results in storing one of the records at another slot in the table (closed hashing).
The particular hashing method that one uses depends on many factors. One important
factor is the ratio of the number of keys in the table to the number of hash addresses. It is
called the load factor, and is given by: load factor (α) = n/m,
where n is the number of keys in the table and m is the number of hash addresses (the table size).
Open Hashing:
The simplest form of open hashing defines each slot in the hash table to be the head of a
linked list. All records that hash to a particular slot are placed on that slot’s linked list.
The below figure illustrates a hash table where each slot stores one record and a link pointer
to the rest of the list.
Any keys that hash to the same index are simply added to that slot's linked list; there is no need to
search for empty cells in the array. This method is called separate chaining.
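A minimal separate-chaining sketch in Python (the class and method names, and the sample keys, are our own):

class ChainedHashTable:
    """Open hashing: each slot heads a chain of colliding records."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def insert(self, key):
        # Colliding keys are simply appended to the slot's chain;
        # no search for an empty array cell is ever needed.
        self.slots[key % self.m].append(key)

    def search(self, key):
        return key in self.slots[key % self.m]

t = ChainedHashTable(10)
for k in (89, 18, 49, 58, 69):
    t.insert(k)
print(t.slots[9])    # [89, 49, 69] -- three keys chained at slot 9
print(t.search(58))  # True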
Closed Hashing (Open Addressing):
It resolves collisions in the prime area, that is, the area that contains all of the home addresses:
when a data item cannot be placed at the index calculated by the hash function, another
location in the array is sought.
There are different methods of open addressing, which vary in the method used to find the
next vacant cell.
They are (i) Linear probing
(ii) Quadratic probing
(iii) Pseudo random probing
(iv) Double hashing
(v) Key – offset
Hashing with Linear Probing:
In linear probing, a collision is resolved by searching the following slots of the table sequentially
until an empty one is found. If the physical end of the table is reached, the linear search wraps
around to the beginning of the table and continues from there.
If an empty slot is not found before returning to the point of the collision, the table is full.
A problem with the linear-probe method is that blocks of occupied cells can form as
collisions are resolved. This is known as primary clustering.
This means that any key that hashes into the cluster will require several attempts to resolve
the collision.
Exercise Example:
Insert the nodes 89, 18, 49, 58, and 69 into a hash table that holds 10 items using the
division method.
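One way to work this exercise with linear probing (a sketch; the function name is our own, and the layout follows from h(k) = k mod 10 with step-1 wraparound):

def insert_linear(table, key):
    """Linear probing: try slots h, h+1, h+2, ... wrapping around."""
    m = len(table)
    home = key % m
    for step in range(m):
        j = (home + step) % m
        if table[j] is None:
            table[j] = key
            return j
    raise RuntimeError("table is full")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_linear(table, k))
# 89 -> 9, 18 -> 8, 49 -> 0, 58 -> 1, 69 -> 2: a primary cluster spanning 8..2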
Hashing with Quadratic Probing:
In this probe, rather than always moving one spot, move i^2 spots from the point of
collision, where i is the number of attempts needed to resolve the collision.
In the linear probe, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3, and so
on. In quadratic probing, probes go to x+1, x+4, x+9, and so on; the distance from the initial
probe is the square of the step number: x+1^2, x+2^2, x+3^2, x+4^2, and so on.
That is, at first it picks the adjacent cell; if that is occupied, it tries 4 cells away; if that too is
occupied, it tries 9 cells away, and so on. This eliminates the primary clustering problem of
linear probing.
Consider the above exercise problem, keys 89, 18, 49, 58, 69 with table size 10.
Here each key that hashes to the same home location follows the same probe sequence, so every
such key requires a longer probe. This phenomenon is called secondary clustering. It is not a
serious problem, but quadratic probing is not often used because there is a slightly better solution.
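A sketch of quadratic probing on the same keys (the function name is our own):

def insert_quadratic(table, key):
    """Quadratic probing: from home slot x, try x, x + 1**2, x + 2**2, ..."""
    m = len(table)
    x = key % m
    for i in range(m):
        j = (x + i * i) % m
        if table[j] is None:
            table[j] = key
            return j
    raise RuntimeError("no free slot found on the probe sequence")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_quadratic(table, k))
# 89 -> 9, 18 -> 8, 49 -> 0 (9 + 1), 58 -> 2 (8 + 4), 69 -> 3 (9 + 4)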
Hashing with Pseudo-random Probing:
This method uses pseudo-random numbers to resolve collisions; i.e. the probe function
selects the next position in the probe sequence at random from among the unvisited
slots, so that the probe sequence is a random permutation of the hash-table positions.
In this probing, the ith slot in the probe sequence is (h(key) + r_i) mod M, where r_i is the ith
value in a random permutation of the numbers from 1 to M - 1.
All insertions and searches use the same sequence of random numbers.
36 % 8 = 4
18 % 8 = 2
72 % 8 = 0
43 % 8 = 3
6 % 8 = 6
Now insert 60: 60 % 8 = 4, which is a collision.
Pseudo-random numbers are a relatively simple solution, but they have one significant
limitation: all keys follow the same collision-resolution path through the list. Because of this,
pseudo-random collision resolution can create significant secondary clustering.
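A sketch of pseudo-random probing (the seed and helper name are our own; note that every key shares the one shuffled offset sequence, which is exactly the limitation just described):

import random

def probe_sequence(home, m, seed=0):
    """Yield home, then (home + r_i) % m for i = 1 .. m-1, where
    r_1 .. r_{m-1} is one fixed random permutation of 1 .. m-1."""
    offsets = list(range(1, m))
    random.Random(seed).shuffle(offsets)   # same seed => same permutation
    yield home
    for r in offsets:
        yield (home + r) % m

# Resolving the collision of 60 at slot 4 in a table of size 8:
print(list(probe_sequence(60 % 8, 8)))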
Double Hashing:
Double hashing uses the idea of applying a second hash function to the key when a collision
occurs. The result of the second hash function will be the number of positions from the
point of collision to insert.
Table size = 10
Hash1(key) = key % 10
Hash2(key) = 7 - (key % 7)
We choose 7 because it is a prime number smaller than the size of the table.
Hash (89) = 89 % 10 = 9
Hash (18) = 18 % 10 = 8
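A sketch completing the example in code (the function name is our own; hash1 and hash2 are as defined above):

def insert_double(table, key):
    """Double hashing: the step size comes from a second hash of the key."""
    m = len(table)
    j = key % 10               # hash1(key) = key % 10
    step = 7 - (key % 7)       # hash2(key) = 7 - (key % 7), never zero
    for _ in range(m):
        if table[j] is None:
            table[j] = key
            return j
        j = (j + step) % m
    raise RuntimeError("table is full")

table = [None] * 10
for k in (89, 18, 49, 58, 69):
    print(k, "->", insert_double(table, k))
# 89 -> 9, 18 -> 8, 49 -> 6 (step 7), 58 -> 3 (step 5), 69 -> 0 (step 1)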
NOTE: the second hash function must never evaluate to zero; here Hash2(key) = 7 - (key % 7) always lies between 1 and 7.
Key Offset:
It is a double-hashing method that produces different collision paths for different keys. Whereas
the pseudo-random number generator produces a new address as a function of the
previous address alone, key offset calculates the new address as a function of the old address and
the key.
One of the simplest versions simply adds the quotient of the key divided by the list size to the
current address to determine the next collision-resolution address, as shown below:
offset = ⌊key / listSize⌋
next address = (offset + old address) mod listSize
For example:
The key is 166702 and the list size is 307; the modulo-division hashing method generates an
address of 1. This is a collision because that address already holds the key 070918.
Using key offset to calculate the next address, we get 237, as shown below:
offset = ⌊166702 / 307⌋ = 543; next address = (543 + 1) mod 307 = 237.
If 237 were also a collision, we would repeat the process to locate the next address:
(543 + 237) mod 307 = 166.
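The same arithmetic as a short sketch in code (the helper name is our own):

def next_address(key, addr, list_size):
    """Key offset: offset = key // list_size, then step from the old address."""
    offset = key // list_size
    return (offset + addr) % list_size

key, m = 166702, 307
addr = key % m                     # 1, colliding with key 070918
addr = next_address(key, addr, m)  # (543 + 1) % 307
print(addr)                        # 237
addr = next_address(key, addr, m)  # (543 + 237) % 307, if 237 also collides
print(addr)                        # 166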
Skip Lists:
A skip list is a type of data structure that can be used as an alternative to balanced (binary)
trees or B-trees. Compared to a binary tree, skip lists allow quick search, insertion, and
deletion of elements with simple algorithms. This is achieved by using probabilistic
balancing rather than the strictly enforced balancing of B-trees.
Skip lists are also significantly faster than equivalent algorithms for B-Trees.
A skip list is basically a linked list with additional pointers that allow intermediate nodes to
be skipped, hence the name skip list.
For example:
If a second pointer pointing two nodes ahead is added to every node, the number of
comparisons goes down to n/2 + 1 in the worst case.
Consider a sorted list where every other node has an additional pointer to the node two
ahead of it in the list.
In the list of the above figure, every second node has a pointer two ahead of it, and every fourth
node has a pointer four ahead of it. Here we need to examine no more than ⌈n/4⌉ + 2
nodes.
In the below figure, every (2^i)th node has a pointer 2^i nodes ahead (i = 1, 2, ...); the number
of nodes to be examined can then be reduced to ⌈log2 n⌉ while only doubling the number of
pointers.
Here, every (2^i)th node has a pointer to a node 2^i nodes ahead (i = 1, 2, ...).
A node that has k forward pointers is called a level k node. If every (2^i)th node has a
pointer 2^i nodes ahead, then the levels are distributed as follows:
# of level 1 nodes: 50%
# of level 2 nodes: 25%
# of level 3 nodes: 12.5%
Such a data structure can be used for fast searching but insertions and deletions will
be extremely cumbersome, since levels of nodes will have to change.
What would happen if the levels of nodes were randomly chosen but in the same
proportions (below figure)?
o The level of a node is chosen randomly when the node is inserted.
o A node's ith pointer, instead of pointing to a node that is 2^(i-1) nodes ahead,
points to the next node of level i or higher.
o In this case, insertions and deletions will not change the level of any node.
o Some arrangements of levels would give poor execution times but it can be
shown that such arrangements are rare.
Such a linked representation is called a skip list.
Each element is represented by a node the level of which is chosen randomly when
the node is inserted, without regard for the number of elements in the data
structure.
A level i node has i forward pointers, indexed 1 through i. There is no need to store
the level of a node in the node.
Maxlevel is the maximum number of levels in a node.
o Level of a list = Maxlevel
o Level of empty list = 1
o Level of header = Maxlevel
Initialization:
An element NIL is allocated and given a key greater than any legal key. All levels of all skip lists
are terminated with NIL. A new list is initialized so that the level of the list equals MaxLevel and
all forward pointers of the list's header point to NIL.
Search:
To search for a key, start at the highest-level forward pointer of the header and move forward
along that level as long as the next key is less than the search key; when the next key would
overshoot, drop down one level and continue. At level 1, the next node either contains the key
or the key is not in the list. For example, a list might have 9 elements at level 1, 3 elements at
level 2, 3 elements at level 3, and 1 element at level 6.
The analysis suggests starting the search at level L(n), where L(n) = log2 n for p = 1/2 and, in
general, L(n) = log_{1/p} n,
where p is the fraction of the nodes with level i pointers which also have level (i + 1)
pointers.
However, starting at the highest level does not alter the efficiency in a significant
way.
Another important question to ask is:
What should MaxLevel be? A good choice is MaxLevel = L(N) = log_{1/p} N, where N is an upper
bound on the number of elements in the skip list.
The complexity of search, delete, and insert is dominated by the time required to search for
the appropriate element. This in turn is proportional to the length of the search path, which is
determined by the pattern in which elements with different levels appear as we traverse the
list.
Insert and delete involve an additional cost proportional to the level of the node being
inserted or deleted.
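A minimal skip-list sketch in Python tying these pieces together (the names, MAX_LEVEL = 16, and p = 1/2 are our own choices):

import random

MAX_LEVEL = 16   # an upper bound on node levels (MaxLevel)
P = 0.5          # fraction of level-i nodes that also reach level i + 1

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i]: next node of level > i

NIL = Node(float("inf"), MAX_LEVEL)     # key greater than any legal key
header = Node(None, MAX_LEVEL)
header.forward = [NIL] * MAX_LEVEL      # all levels terminate at NIL

def random_level():
    """Choose a node's level at insertion time, independent of list size."""
    level = 1
    while random.random() < P and level < MAX_LEVEL:
        level += 1
    return level

def search(key):
    """Drop from the top level, moving forward while the next key is smaller."""
    x = header
    for i in reversed(range(MAX_LEVEL)):
        while x.forward[i].key < key:
            x = x.forward[i]
    x = x.forward[0]
    return x if x.key == key else None

def insert(key):
    update = [None] * MAX_LEVEL         # last node visited at each level
    x = header
    for i in reversed(range(MAX_LEVEL)):
        while x.forward[i].key < key:
            x = x.forward[i]
        update[i] = x
    node = Node(key, random_level())
    for i in range(len(node.forward)):  # splice in at each of the node's levels
        node.forward[i] = update[i].forward[i]
        update[i].forward[i] = node

for k in (3, 7, 12, 19, 25):
    insert(k)
print(search(12).key)   # 12
print(search(8))        # None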