Hash Function Instruction Count
Hash Function Instruction Count
Hash Function Instruction Count
org/wiki/Hash_function
Wikipedia is sustained by people like you. Please donate today.
Hash function
From Wikipedia, the free encyclopedia
A hash function is any well-defined procedure or mathematical function for turning some kind of data into a
relatively small integer, that may serve as an index into an array. The values returned by a hash function are
called hash values, hash codes, hash sums, or simply hashes.
Hash functions are mostly used to speed up table lookup or data comparison tasks — such as finding items in a
database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and
so on.
Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomizing
functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some
extent, each has its own uses and requirements. The HashKeeper database maintained by the National Drug
Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.
Contents
1 Applications
1.1 Hash tables
1.2 Finding duplicate records
1.3 Finding similar records
1.4 Finding similar substrings
1.5 Geometric hashing
2 Properties
2.1 Determinism
2.2 Uniformity
A typical hash function at work
2.3 Variable range
2.4 Data normalization
2.5 Continuity
3 Hash function algorithms
3.1 Injective and perfect hashing
3.2 Use of short search key as indirect
index
3.3 Hashing uniformly distributed data
3.4 Hashing data with other distributions
3.5 Special-purpose hash functions
3.6 Hashing with checksum functions
3.7 Hashing with cryptographic hash
functions
3.8 Audio identification
4 Origins of the term
5 See also
6 Notes
7 References
8 External links
Applications
Hash tables
1 of 6 24/7/2008 14:50
Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function
Hash functions are mostly used in hash tables, to quickly locate a data record (for example, a dictionary
definition) given its search key (the headword). Specifically, the hash function is used to map the search key to
the index of a slot in the table where the corresponding record is supposedly stored.
In general, a hashing function may map several different keys to the same hash value. Therefore, each slot of a
hash table contains (implicitly or explicitly) a set of records, rather than a single record. For this reason, each
slot of a hash table is often called a bucket, and hash values are also called bucket indices.
Thus, the hash function only hints at the record's location — it only tells where one should start looking for it.
Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
In the Java programming language, for example, hash functions are central to the way the language allows
objects to be stored in hashing collections such as the standard HashMap and HashSet classes. To this end,
Java's Object parent class provides a hashCode() method mandating the generation of a 32-bit integer hash
value.
To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an
index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value
i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then
be found by scanning every bucket T[i] which contains two or more members, fetching those records, and
comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative
approach (such as sorting the file and comparing all consecutive pairs).
Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or
pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps
similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table
of T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in
nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where
k ranges between -m and m.
The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a
document repository or a genomic database. In this case, the input strings are broken into many small pieces,
and a hash function is used to detect potentially equal pieces, as above.
The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is
based on the use of hashing to compare strings.
Geometric hashing
This principle is widely used in computer graphics, computational geometry and many other disciplines, to locate
close pairs of points in the plane or in three-dimensional space, similar shapes in a list of shapes, similar images
in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the
hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array
with two or more indices (called a bucket grids), and the hash function returns an index tuple. This special case
of hashing is known as geometric hashing or the grid method.
Properties
Determinism
2 of 6 24/7/2008 14:50
Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function
To serve its purpose, a hash function must be fast and deterministic — meaning that two identical or equivalent
inputs must generate the same hash value.
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every
hash value in the output range should be generated with roughly the same probability. The reason for this last
requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of
inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to
occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding
table entries.
For many applications the hash values need only be uniformly distributed, not random in any sense. A good
randomizing function is usually good for hashing, but the converse need not be true. For cryptographic hash
functions, the function must however be randomizing, that is, although deterministic, pragmatically
indistinguishable from a random function.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may
contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the
uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just
for the global set of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many
more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should
have more than one or two records. (Ideally, no bucket should have more than one record; but a small number
of collisions is virtually inevitable, even if n is much larger than m (see the birthday paradox).
Variable range
In many applications, the range of hash values may be different for each run of the program, or may change
along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash
function which takes two parameters — the input data z, and the number n of allowed hash values.
Data normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. When
looking up a personal name, for instance, it may be desirable to ignore the distinction between upper and lower
case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion
being used: that is, any two inputs that are considered equivalent must yield the same hash value.
Continuity
A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as
possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other
related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that
use linear search.
3 of 6 24/7/2008 14:50
Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function
The ideal hashing function should be injective — that is, it should map each valid input to a different hash value.
Such a function would directly locate the desired entry in a hash table, without any additional search.
An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs,
is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash
table, without any vacant slots.
Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping
month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of
practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much
smaller range of hash values and therefore cannot be injective.
In special situations, such as short, one or two byte search keys, the search key itself can be used as the first
(indirect) 'index' to a table of target index values (1-255) - with any gaps in sequence set as null (i.e. index=0) -
denoting an 'invalid' key. Consider the search key to be an 8-bit character containing the alphanumeric
characters A-Z, (both upper and lower case) and the numeric digits 0-9, with other possible values an error. The
'hash table' can simply be a pre-defined, and pre-compiled, static 256 byte table containing a sequential index
(X'00' - 'FF') in the relevant offsets - equating to the search key itself.
If no validation is required, (because the validity of the key is already known), the table need only be as large as
the range of (high-low) keys. For example if the possible ASCII keys were say A thru Z, a table of only 26 bytes
(containing x'00' - x'19') would be needed for the table - providing excellent locality of reference. The
'conversion' from search key to index value can be accomplished in an extremely short fixed time interval - in
fact the time to execute around 5 or 6 machine instructions - or even less in some architectures - irrespective of
the search value. Similar methods can also be applied for 16-bit keys (range 0-65535) with an index range but in
this case requiring a maximum pre-defined table size of around 128K. If the search key range is known however,
the table can often be considerably less. Path length in terms of instruction count is similar to that required for 8
bit keys. Once obtained, the index value can be used for any number of direct indexing including use in branch
tables to determine program logic where appropriate and available in the language being used.
If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.),
and each input may independently occur with uniform probability, then a hash function needs only map roughly
the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0
to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash
function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by
n/N and truncated to an integer), or many other formulas.
These simple formulas will not do if the input values are not equally likely, or are not independent. For instance,
most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to
begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which
depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n,
which is quite sensitive to the trailing digits, may still yield a fairly even distribution of hash values.
When the data values are long (or variable-length) character strings — such as personal names, web page
addresses, or mail messages — their distribution is usually very uneven, with complicated dependencies. For
example, text in any natural language has highly non-uniform distributions of characters, and character pairs,
very characteristic of the language. For such data, it is prudent to use a hash function that depends on all
characters of the string — and depends on each character in a different way.
4 of 6 24/7/2008 14:50
Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function
In many such cases, one can design a special-purpose (heuristic) hash function that yield many fewer collisions
than a good general-purpose hash function. For example, suppose that the input data are file names such as
FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function
that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a
function that is exceptionally good for a specific kind of data may have dismal performance on data with
different distribution.
One can obtain good general-purpose hash functions for string data by adapting certain checksum or
fingerprinting algorithms. Some of those algorithms will map arbitrary long string data z, with any typical
real-world distribution — no matter how non-uniform and dependent — to a fixed length bit string, with a fairly
uniform distribution. This string can be interpreted as a binary integer k, and turned into a hash value by the
formula h = k mod n.
This method will produce a fairly even distribution of hash values, as long as the hash range size n is small
compared to the range of the checksum function. Bob Jenkins' LOOKUP3 (http://burtleburtle.net/bob/c
/lookup3.c) algorithm[1] uses a 32-bit checksum. A 64-bit checksum should provide adequate hashing for tables
of any feasible size.
Some cryptographic hash functions, such as MD5, have even stronger uniformity guarantees than checksums or
fingerprints, and thus can provide very good general-purpose hashing functions. However, the uniformity
advantage may be too small to offset their much higher cost.
Audio identification
For audio identification [2] such as finding out whether an MP3 file matches one of a list of known items, one
could use a conventional hash function such as MD5, but this would be very sensitive to highly likely
perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or
changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another
more advanced algorithm is required to find all items that would nonetheless be interpreted as identical to a
human listener. Though they are not common, hashing algorithms do exist that are robust to these minor
differences. There is a service called MusicBrainz which creates a fingerprint for an audio file and matches it to
its online community driven database.
In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then
"mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to
be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular
division.
See also
Universal hashing
Cryptography
Cryptographic hash function
HMAC
5 of 6 24/7/2008 14:50
Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function
Geometric hashing
Distributed hash table
Perfect hash function
Linear hash
Rolling hash
Rabin-Karp string search algorithm
Zobrist hashing
Bloom filter
Hash table
Hash list
Hash tree
Coalesced hashing
Transposition table
List of hash functions
Notes
1. ^ In the remainder of this article, the term function is used to refer to algorithms as well as the functions
they compute.
References
1. ^ Jenkins, Bob (September, 1997), Hash Functions, “Algorithm Alley”, Dr. Dobb's Journal, <http://www.ddj.com
/184410284>
2. ^ "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen"
(http://citeseer.ist.psu.edu/rd/11787382%2C504088%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache
/papers/cs/25861
/http:zSzzSzwww.extra.research.philips.comzSznatlabzSzdownloadzSzaudiofpzSzcbmi01audiohashv1.0.pdf
/haitsma01robust.pdf)
3. ^ Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching, 506-542.
External links
General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby) (http://www.partow.net
/programming/hashfunctions/index.html)
Hash Functions and Block Ciphers by Bob Jenkins (http://burtleburtle.net/bob/hash/index.html)
Integer Hash Function (http://www.concentric.net/~Ttwang/tech/inthash.htm) by Thomas Wang
The Goulburn Hashing Function (http://www.geocities.com/drone115b/Goulburn06.pdf) by Mayur Patel
Hash Functions (http://www.azillionmonkeys.com/qed/hash.html) by Paul Hsieh
HSH 11/13 (http://herbert.gandraxa.com/herbert/hsh.asp) by Herbert Glarner
Online Char (ASCII), HEX, Binary, Base64, etc... Encoder/Decoder with MD2, MD4, MD5, SHA1+2,
etc. hashing algorithms (http://www.paulschou.com/tools/xlate/)
FNV (http://isthe.com/chongo/tech/comp/fnv/) Fowler, Noll, Vo Hash Function
6 of 6 24/7/2008 14:50