HASHING

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

1.

Hashing is a technique that is used to uniquely identify a specific object


from a group of similar objects.
In hashing, large keys are converted into small keys by using hash
functions. The values are then stored in a data structure called hash table.
The idea of hashing is to distribute entries (key/value pairs) uniformly across
an array. Each element is assigned a key (converted key). By using that key
you can access the element in O(1) time. Using the key, the algorithm
(hash function) computes an index that suggests where an entry can be
found or inserted.

1.1Hashing is implemented in two steps:


1. An element is converted into an integer by using a hash function. This
element can be used as an index to store the original element, which falls
into the hash table.
2. The element is stored in the hash table where it can be quickly retrieved
using hashed key.

2.Load factor(α) : It is the ratio of the number of stored element n


divided by he number of slot m in hash table T or α=n/m

3.Hash table A hash table is a data structure that is used to store


keys/value pairs. It uses a hash function to compute an index into an array
in which an element will be inserted or searched. By using a good hash
function, hashing can work well. Under reasonable assumptions, the average
time required to search for an element in a hash table is O(1).

4.Hash function A hash function is any function that can be used to map a
data set of an arbitrary size to a data set of a fixed size, which falls into the
hash table. The values returned by a hash function are called hash values,
hash codes, hash sums, or simply hashes.
To achieve a good hashing mechanism, It is important to have a good hash
function with the following basic requirements:
I. Easy to compute: It should be easy to compute and must not become an
algorithm in itself.
II. Uniform distribution: It should provide a uniform distribution across the
hash table and should not result in clustering.
III Less collisions: Collisions occur when pairs of elements are mapped to the
same hash value. These should be avoided.
4.2 Different Methods to Define Hash Functions:
I. Division Method
II. Truncation
III. Folding
IV. Mid-Square Method
V. Multiplication Method
VI. Radix Transformation
I. Division Method: The given number is divided by an integer, that
is usually considered to be the size of the hash table or lookup
table. The remainder so obtained as a result of division is the hash
key where this key element may be mapped in the table. Thus, if
the hash table contains ten index entries from 0 to 9, the given
integer may be divided by 10 to obtain the hash key for mapping
that element in the hash table. Example if key =56 and hash table
size is 11 then key is stored within (56%11=)1st location of hash
table .

Note: This works best if the table size is prime number , but if not , we
can use h(x)=((k mod p )mod table_size) for a prime number
p>=table_size

II. Truncation: Ignore a part of the key and use the remaining part
directly as the index say for example, if a hash table contains 999 entries at
the most or 999 different key indexes may be kept, then a hash function
may be defined such that from an eight digit integer 12345678, first, second
and fifth digits from the right may be used to define a key index i.e. 478,
which is the key position in the hash lookup table where this element will be
inserted. Any other key combination may be used.

III. Folding: Partition the key into several parts and combine the parts in a
convenient way (often addition or multiplication) to obtain the index. Say
for example an eight digit integer can be divided into groups of three, three
and two digits (or any other combination) the groups added together and
truncated if necessary to be in the proper range of indices. Hence 12345678
maps to 123+456+78 = 657, since any digit can affect the index, it offers a
wide distribution of key values.

IV. Mid-Square Method: In Mid-Square method, the key element is


multiplied by itself to yield the square of the given key. If the hash table
contains maximum 999 entries, then three digits from the result of square
may be chosen as a hash key for mapping the key element in the lookup
table. It generates random sequences of hash keys, which are generally key
dependent. Mid Square method is a flexible method of selecting the hash
key value.
key 1337 1273 1391 1026

Square 1787569 1620529 1934881 1052676


of key

Hash 875 205 348 526


table
index
number

V. Multiplication Method:

• Multiply the key by a constant A, 0 < A < 1,


• Extract the fractional part of the product,
• Multiply this value by m.

h(k) = floor(m * (kA - floor(kA)))

k=123
m=100
A=0.618033
h(123) = floor(100*(123 * 0.618033 –floor(123 * 0.618033 )))
= floor(100*(76.018059 - floor(76.018059)))
= floor(100*(76.018059- 76))
= floor(100*(0.018059))
= floor(1.8059)
=1
The hash key value obtained is 1
VI. Radix Transformation Method: Where the value or key is digital, the
number base (or radix) can be changed resulting in a different sequence of
digits. For example: A decimal numbered key could be transformed into a
hexadecimal numbered key. High order digits could be discarded to fit a
hash value of uniform length.
Example: The key is translated into another base.

Key = 345, change to base 9 = 423 % TableSize.


5. What is Collision?
Since a hash function gets us a small number for a key which is a big integer
or string, there is possibility that two keys result in same value. The
situation where a newly inserted key maps to an already occupied slot in
hash table is called collision and must be handled using some collision
handling technique.
5.1What are the chances of collisions with large table?
Collisions are very likely even if we have big table to store keys. But if the
table size is small then chances of collision is maximum.

Note: Irrespective of how good a hash function is, collisions are bound to
occur. Therefore, to maintain the performance of a hash table, it is
important to manage collisions through various collision resolution
techniques.

6. Collision resolution
We now return to the problem of collisions. When two items hash to the
same slot, we must have a systematic method for placing the second item in
the hash table. This process is called collision resolution. As we stated
earlier, if the hash function is perfect, collisions will never occur. However,
since this is often not possible, collision resolution becomes a very important
part of hashing.
A) Separate chaining (open hashing)
B) Open addressing or closed hashing

A)Separate chaining (open hashing)


Separate chaining is one of the most commonly used collision resolution
techniques. It is usually implemented using linked lists. In separate chaining,
each element of the hash table is a linked list. To store an element in the
hash table you must insert it into a specific linked list. If there is any
collision (i.e. two different elements have same hash value) then store both
the elements in the same linked list. We can take the hash table as an array
of pointers. Here each pointer will point to one link list and the element and
the size of hash table can be number of record.
When we want to search for an item, we use the hash function to generate
the slot where it should reside. Since each slot holds a collection, we use a
searching technique to decide whether the item is present. The advantage is
that on the average there are likely to be many fewer items in each slot, so
the search is perhaps more efficient.
Advantages:
1) Simple to implement.
2) Hash table never fills up, we can always add more elements to chain.
3) Less sensitive to the hash function or load factors.
4) It is mostly used when it is unknown how many and how frequently keys
may be inserted or deleted.
Disadvantages:
1) Cache performance of chaining is not good as keys are stored using linked
list. Open addressing provides better cache performance as everything is
stored in same table.
2) Wastage of Space (Some Parts of hash table are never used)
3) If the chain becomes long, then search time can become O(n) in worst
case.
4) Uses extra space for links.

B)Open addressing or closed hashing


In open addressing, instead of in linked lists, all entry records are stored in
the array itself. When a new entry has to be inserted, the hash index of the
hashed value is computed and then the array is examined (starting with the
hashed index). If the slot at the hashed index is unoccupied, then the entry
record is inserted in slot at the hashed index else it proceeds in some probe
sequence until it finds an unoccupied slot. The probe sequence is the
sequence that is followed while traversing through entries. In different probe
sequences, you can have different intervals between successive entry slots
or probes. The name "open addressing" refers to the fact that the location or
address of the item is not determined by its hash value.
There are two types
i) Linear probing
ii) Quadratic Probing

i)Linear probing
One method for resolving collisions looks into the hash table and tries to find
another open slot to hold the item that caused the collision. A simple way to
do this is to start at the original hash value position and then move in a
systematically (visiting each slot one at a time) manner through the slots
until we encounter the first slot that is empty. Note that we may need to go
back to the first slot (circularly) to cover the entire hash table.
Linear probing is when the interval between successive probes is fixed
(usually to 1). Let’s assume that the hashed index for a particular entry is
index. The probing sequence for linear probing will be:
index = index % hashTableSize index = (index + 1) % hashTableSize index
= (index + 2) % hashTableSize index = (index + 3) % hashTableSize
and so on…
54,26,93,17,77,31,44,55,20 store items using the simple remainder method
hash function (%11)
When we attempt to place 44 into slot 0, a collision occurs. Under linear
probing, we look sequentially, slot by slot, until we find an open position. In
this case, we find slot 1.
Again, 55 should go in slot 0 but must be placed in slot 2 since it is the next
open position. The final value of 20 hashes to slot 9. Since slot 9 is full, we
begin to do linear probing. We visit slots 10, 0, 1, and 2, and finally find an
empty slot at position 3.

Once we have built a hash table using open addressing and linear probing, it
is essential that we utilize the same methods to search for items.
A disadvantage to linear probing is the tendency for clustering; items
become clustered in the table. This means that if many collisions occur at
the same hash value, a number of surrounding slots will be filled by the
linear probing resolution. This will have an impact on other items that are
being inserted,
ii)Quadratic Probing
Quadratic probing is similar to linear probing and the only difference is the
interval between successive probes or entry slots. Here, when the slot at a
hashed index for an entry record is already occupied, you must start
traversing until you find an unoccupied slot. The interval between slots is
computed by adding the successive value of an arbitrary polynomial in the
original hashed index.
Let us assume that the hashed index for an entry is index and at index
there is an occupied slot. The probe sequence will be as follows:
index = index % hashTableSize
index = (index +12) % hashTableSize
index = (index + 22) % hashTableSize
index = (index + 32) % hashTableSize
and so on…

Difference between Separate Chaining and Open Addressing


S.No. Separate Chaining Open Addressing

1. Chaining is Simpler to implement. Open Addressing requires more


computation.
2. In chaining, Hash table never fills up, In open addressing, table may
we can always add more elements to become full.
chain.
3. Chaining is Less sensitive to the hash Open addressing requires extra
function or load factors. care for to avoid clustering and
load factor.
4. Chaining is mostly used when it is Open addressing is used when the
unknown how many and how frequency and number of keys is
frequently keys may be inserted or known.
deleted.
5. Cache performance of chaining is not Open addressing provides better
good as keys are stored using linked cache performance as everything
list. is stored in the same table.
6. Wastage of Space (Some Parts of hash In Open addressing, a slot can be
table in chaining are never used). used even if an input doesn’t map
to it.
7. Chaining uses extra space for links. No links in Open addressing

Double hashing
Double hashing is similar to linear probing and the only difference is the
interval between successive probes. Here, the interval between probes is
computed by using two hash functions.
Let us say that the hashed index for an entry record is an index that is
computed by one hashing function and the slot at that index is already
occupied. You must start traversing in a specific probing sequence to look for
an unoccupied slot. The probing sequence will be:
let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) +
2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) +
3*hash2(x)) % S
..................................................

Rehashing
There are chances of insertion failure when hash table is full or Insertions,
deletions, search take longer time.. So the solution for this particular case is
to create a new hash table with the double size of previous hash table. WE
will use new hash function to insert all element of the previous hash table.
This technique is called Rehashing
 Cost of Rehashing = O(N)
 Disadvantage : This technique is litte bit more expensive but it happens
very few times.
Example : insert 11,23,19,26,16,18,30 where as table size is 11

11 0 11%11=0 18 0 11%17=11
23 1 23%11=1 1 23%17=6
2 19%11=8 19 2 19%17=2
3 26%11=4 3 26%17=9
4 16%17=5
26 4 16%11=5
16 5 18%17=1
16 5 18%11=7
23 6 30%17=13
6 30%11=8
7
18 7 Now collision arise cause 19,30 8
19 8 both generate same index number 8 26 9
9 so we apply rehashing technique 10
10 to increase the hash table to avoid 11 11
Collision . Now the new size of hash table is 17. 12
30 13
There is no collision .
14
15
16

You might also like