Hash Functions

Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

Hash Functions

• Hashing is a technique used to store and quickly retrieve data in hash tables or
hash maps.
• Searching techniques : Linear and Binary with O(n) and O(log n) complexity
respectively.
• Approach to searching calculates the position of the key in the table based on the
value of the key.
• A function h that can transform a particular key K, be it a string, number, record, or
the like, into an index in the table used for storing items of the same type as K. The
function h is called a hash function.
• h transforms different keys into different numbers, it is called a perfect hash
function.
• Hash functions are much faster when going one way (input → hash).
• Hash functions reduce input to a smaller, fixed-size output.
• If hash functions were reversible, sensitive data like passwords could easily be
retrieved from their hashes.
• Case: When you want to look up a value (like a phone number) based on a key (like
a person's name), the hash function generates an index from the key, and the value
is stored at that index.
• Hashing provides constant-time O(1) searching, insertion, and deletion
operations, making it very fast for large datasets.
• The worth of a hash function depends on how well it avoids collisions.

Basic Hash Function Design


• A simple method is to sum the ASCII values of all the letters in the variable's
name and use that sum as the index in the table.

• For Example: If the variable is “K” made out of 31 letters “z”, the ASCII value of “z”
is 122, so the hash function is:
h(K)=ASCII of character * position of character
h(K)=31×122=3782
Thus, the table needs at least 3,782 cells to accommodate all potential variable
names.
• Collision: two different inputs ("abc" and "acb") produce the same hash value,
leading to both being stored in the same index in a hash table.

Assume, we have 2 string “abc” and “acb” (exclude case sensitivity) then,

h(K)=sum of ASCII values of characters in K

For the string "abc":


'a' = 97 (ASCII value)
'b' = 98
'c' = 99
sum = 97 + 98 + 99 = 294
So, h("abc")=294
For the string "acb":

'a' = 97
'c' = 99
'b' = 98
sum = 97 + 99 + 98 = 294

So, h("acb")=294

Both "abc" and "acb" generate the same hash value (294), even though they are
different strings.
• Now, improved hash function:

h(K)=∑(ASCII of character * position of character)

For “abc”:
h("abc")=(97×1)+(98×2)+(99×3)= 97+196+297=590

For “acb”:
h("acb ")=(97×1)+(99×2)+(98×3)= 97+198+294=589

Now, the hash values for "abc" and "acb" are different, preventing the collision.

Hash Functions
• When assigning positions to “n” items in a table of “m” positions, the number of possible
hash functions is given by:
mn

n=
m=

For 50 elements and a table with 100 positions (cells), the number of possible hash
functions is:
10050=10100
This means there are 10¹⁰⁰ possible hash functions that could map the 50 items into the
100-cell table.

Perfect Hash Function


• A perfect hash function is one that assigns every item to a unique position in the
table without collisions.
• To create a perfect hash function a table need to be create which contain at least
the same number of positions as the number of elements being hashed.
• It is not always possible to know the number of elements.
• For Example: a compiler keeps all variables used in a program in a symbol table
but may be contain all content of a book in at runtime of program is near to
impossible.
• The number of perfect hash functions is equal to the number of different ways to place
these “n” items into mmm positions, which is given by:
𝑚!
(𝑚−𝑛)!
For 50 items and 100 positions, the number of perfect hash functions is approximately
1094.
• Many of the possible hash functions are too complex for practical applications, meaning
they cannot be represented by a simple formula.
• Even though only a small fraction of hash functions are perfect, the number of usable
hash functions for practical purposes is still very large.

Type of hash functions


1. Division
2. Folding
3. Mid-Square Function
4. Extraction
5. Radix Transformation
6. Universal Hash Functions

1. Division
• The primary requirement of a hash function is to return a valid index within the
size of the hash table.
• For example, if a hash table has 100 cells, the hash function must return a number
between 0 and 99 (i.e., a valid index in the table).
• One of the simplest ways to ensure that the hash function returns a valid index is
by using the modulus (mod) operation.
• The formula (for TSize is Prime) is:
h(K) = K mod TSize

The formula (for TSize is Not Prime) is:


h(K) = ( K mod p ) mod TSize
where:
* K is the key (or input data) to be hashed.
* TSize is the size of the table (i.e., the number of positions in the table).
* p is a prime number larger than TSize. This modification helps avoid
problems caused by non-prime table sizes.
• A divisor (TSize) can be any positive number if they don’t have prime factors less
than 20 but it’s recommended that the size of the table TSize must be a prime
number.
• It helps reduce collisions by ensuring that the hash function distributes keys more
evenly across the table.
• Example (when TSize is Prime):
let’s assume the table has 5 cells, and the key is K = 29.

Using the division method, the hash function is:


h(K) = 29 mod 5 = 4
So, the key "29" is mapped to index 4 in the table.
2. Folding
• In this method, the key is divided into several parts.
• These parts are combined or folded together and are often transformed in a
certain way to create the target address.
• Types of folding: shift folding and boundary folding.

Shift Folding:

• In shift folding, the key is divided into several parts, and these parts are
processed in a sequential manner.
• The parts can be added, and the sum can be divided by the table size (TSize) to
get the hash value.
• Example:
Consider a Social Security Number (SSN): 123-456-789

Let’s divide it into three parts:


part 1: 123
part 2: 456
part 3: 789
Add all parts together: 123 + 456 + 789 = 1368
if the table size (TSize) is 1000 then modulus the summation of parts by TSize
to get hash index: 1368 mod 1000 = 368

The hash value is 368 which is index in the hash table.

Alternatively,
Let’s divide it into five parts:
part 1: 12
part 2: 34
part 3: 56
part 4: 78
part 5: 9
Add all parts together: 12 + 34 + 56 + 87 + 9 = 189
if the table size (TSize) is 1000 then modulus the summation of parts by TSize
to get hash index: 189 mod 1000 = 189

The hash value is 189 which is index in the hash table.


Boundary Folding:

• In boundary folding, the key is divided into parts, and every other part is
reversed before processing.
• Imagine the key is written on a piece of paper, and you fold it along the
boundary between parts to reverse certain sections.
• Example:
Using the same SSN: 123-456-789
Let’s divide it into three parts:
part 1: 123
part 2: 456
part 3: 789
Reverse every alternative part and sum all of them
part 2: 654 (reverse of 456)
Add all parts together: 123 + 654 + 789 = 1566
if the table size (TSize) is 1000 then modulus the summation of parts by TSize
to get hash index: 1566 mod 1000 = 566

The hash value is 566 which is index in the hash table.

Alternatively,
Let’s divide it into five parts:
part 1: 12
part 2: 34
part 3: 56
part 4: 78
part 5: 9
Reverse every alternative part and sum all of them
part 2: 43 (reverse of 34)
part 4: 87 (reverse of 78)
Add all parts together: 12 + 43 + 56 + 87 + 9 = 207
if the table size (TSize) is 100 then modulus the summation of parts by TSize to
get hash index: 207 mod 100 = 7

The hash value is 7 which is index in the hash table.


Bit-Oriented Version Using XOR (Exclusive-OR)

• In both shift folding and boundary folding the key is usually divided into even
parts of some fixed size plus some remainder and then added.
• A bitwise version of shift folding can use the XOR (exclusive-or) operation to
combine the parts.
• This is efficient and often used for hashing strings.
• It is suitable for large datasets or real-time applications where speed is
important.
• Example:
Consider the
• bcd" and TSize is 1000:

Convert each character to its ASCII value:


‘a’ = 97
‘b’ = 98
‘c’ = 99
‘d’ = 100
perform XOR on bits of characters:
97 ⊕ 98 ⊕ 99 ⊕ 100 = 6
For strings, process all characters by XOR’ing them together or process chunks
of the string (grouped by the number of bytes in an integer).
This approach is fast and simple but may require modulo TSize to ensure the
result fits in the hash table;
6 mod 1000 = 6
The final hash value would be 6.

Alternatively,
Consider the string "abcd" and TSize is 100:

Convert each character to its ASCII value:


‘ab’ = 97 98 (01100001 01100010)
‘cd’ = 99 100 (01100011 01100100)
perform XOR on bits of characters:
97 98 ⊕ 99 100 = 518 (00000010 00000110)
perform modulus with TSize:
518 mod 100 = 18
The final hash value would be 18.

3. Mid-Square Function
• The Mid-Square Hashing Method is a technique where the key is squared, and
the middle part of the result is used to generate a hash value (or address).
• This method works by ensuring that the entire key participates in the hashing
process, which increases the likelihood of different keys producing different
addresses.
• By squaring the key, the resulting number is larger, which ensures more variation in
the middle digits, reducing the chances of collisions between different keys.
• Example:

The key is 3,121.

Square the key to get a large number:


3,1212 = 9740641

Now, extract the middle part of this number, if TSize is 1000:


h(3121) = 406
Address in the table is 406.

It is more efficient to use a power of 2 for the table size. If we assume the table
size is 1,024 (which is a power of 2, 0-1023), then we work with binary
representations.

Binary Representation of 9740641 is 100101001010000101100001.

Convert the binary middle part 0101000010 to its decimal equivalent is:
01010000102 = 322

the hash value for 3,121 in a 1,024-cell table is 322.


4. EXTRACTION
• The Extraction Method is a hashing technique where only a specific portion of the
key is used to generate the hash value (address in the hash table).
• The goal is to extract significant parts of the key that help in distinguishing between
different values while ignoring less important parts.
• To compute the address part of the key can be some random position.
• Example: Social Security Number (SSN)

Consider the SSN 123-45-6789

The extraction method might use:


➢ The first four digits: 1,234
➢ The last four digits: 6,789
➢ A combination like the first two and the last two digits: 1,289
• The idea is to extract the portion that provides enough distinction between keys.
• The extraction should be based on what parts of the key vary most significantly.
• If a part of the key is common across many records, it can be safely omitted.
• Example: University Student IDs

In some universities, international student IDs all start with the number 999.
Since this portion is common to all international students, the first three digits
(999) can be excluded.

Only the remaining part of the ID is used for hashing.

• Example: ISBN Code

All books from the same publisher have a common prefix in their ISBN (e.g., 1133
for Cengage Learning).

If a hash table only contains books from one publisher, the common prefix (1133)
can be omitted from the hashing process.

This leaves the unique part of the ISBN to be used for computing the hash value.
• This method is easy to implement and simple also without any processing of key.

5. Radix Transformation
• The key K is transformed into another number base and then the transformed
number is used to compute the hash value or address.
• Example: Decimal to Nonal (Base 9) Conversion

If the key K=345 (in decimal), it can be converted to its value in base 9 (nonal
system).
The decimal number 345 is equal to 423 in base 9.
9 345 3
38 2
4

After the transformation, the resulting value is divided modulo TSize (size of the hash
table), and the remainder is used as the hash address.

If TSize is 100, the transformed base-9 value 423 is divided by 100, giving:
423 mod 100 = 23
Now, the hash value is 23.

6. UNIVERSAL HASH FUNCTION


• A universal class of hash functions can be used; a class of functions is universal
when for any sample, a randomly chosen member of that class will be expected to
distribute the sample evenly.

|H| |keys|

• Members of that class guarantee low probability of collisions.


• A class of hash functions is called universal.
• If you pick two different keys x and y, the number of hash functions in the set H
that cause these two keys to collide (i.e., h(x)=h(y)) is proportional to the ratio
∣H∣/TSize.
o |H| represents the total number of hash functions in the class H.
o This ratio ∣H∣/TSize expresses how many of the hash functions in H will, on
average, map both keys x and y to the same location.
• One class of such functions is defined as follows:
o A prime number p is chosen such that p ≥ ∣keys∣, meaning p is at least as
large as the number of keys.
o For a prime number p ≥ |keys|, and randomly chosen numbers a and b:
H = { ha,b(K): ha,b(K) = ((a . K + b) mod p) mod TSize and 0 ≤ a, b < p}
o Example:

Assume we have a set of keys and choose a prime p=17, and a hash table
size TSize=10.

Let’s choose random values a=5 and b=3

For a key K=12,

h5,3(12) = ((5 . 12 + 3) mod 17 ) mod 10 = (63 mod 17) mod 10 = 2

The hash value of key 12 is 2.


• This method ensures that different values of aaa and bbb will distribute keys more
evenly.
• For Keys as Byte Sequences:
o a class H for keys considered to be sequences of bytes, K = K0 K1 … Kr-1
o For some prime p ≥ 28 = 256 and a sequence a = a0 , a1 , . . . , ar-1,

o Example:

Suppose we have a key that is a byte sequence (in hexadecimal format),


Key K= "4D 61 72 79" for string “Mary”

byte values (in decimal) are:

K0=77 (for 'M')


o K1=97 (for 'a')
o K2=114 (for 'r')
o K3=121 (for 'y')
o
o Let’s choose a prime number p=257 (greater than 256 as required).
o
o randomly select the sequence a0,a1,a2,a3 with values:
a0=3
o a1=7
o a2=11
o a3=5

Hash Function Calculation:


ha(K) = (3×77 + 7×97 + 11×114 + 5×121) mod 257 mod 1000
= (231+679+1254+605) mod 257 mod 1000 = 186

hash value for the key “Mary” is 186

You might also like