Talk:Cuckoo hashing
Rehashing
Can someone please explain how the rehashing with the new hash functions works? Do I have to prepare a large number of different hash functions for this to work? I would have quite some trouble figuring out a good and fast third one for my own data, let alone any further ones... and how many do I need? It seems that collisions can happen quite often for pathological data; does that mean I have to code hundreds or even thousands of different hash functions? —Preceding unsigned comment added by 217.11.197.10 (talk) 12:53, 8 March 2010 (UTC)
- I believe it refers to the "new hash function" as simply changing the modulo. For instance the presented example gives two functions h(k) = k mod 11 and h'(k) = ⌊k/11⌋ mod 11. But also note that 11 is the size of the table, so the actual hash function is h(k) = k mod N, where N = 11 in the example. Effectively, when increasing the size of your table, you change your hash function. For instance, if you doubled the size of the table you would get a new hash function h(k) = k mod 22. (See the sketch below.)
- - KillerGardevoir (talk) 20:00, 7 August 2014 (UTC)
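A minimal C++ sketch of the resize-and-change-the-modulus idea described above (added purely as an illustration: the class name, the loop bound standing in for Max-Loop, and the recursion on rehash are assumptions, not taken from the article; the hash pair follows the example's h(k) = k mod N and h'(k) = ⌊k/N⌋ mod N style):

    #include <cstddef>
    #include <optional>
    #include <utility>
    #include <vector>

    // Illustrative sketch only: a single-table cuckoo hash whose two hash
    // functions are derived from the current table size N. Growing the table
    // therefore gives a "new" pair of hash functions automatically.
    class CuckooSketch {
        std::vector<std::optional<int>> table;

        // Keys are assumed non-negative in this sketch.
        std::size_t h1(int k) const { return static_cast<std::size_t>(k) % table.size(); }
        std::size_t h2(int k) const { return (static_cast<std::size_t>(k) / table.size()) % table.size(); }

    public:
        explicit CuckooSketch(std::size_t n = 11) : table(n) {}

        void insert(int key) {
            std::size_t pos = h1(key);
            for (std::size_t tries = 0; tries < table.size(); ++tries) {  // crude Max-Loop bound
                if (!table[pos]) { table[pos] = key; return; }
                std::swap(key, *table[pos]);                  // evict the current occupant
                pos = (pos == h1(key)) ? h2(key) : h1(key);   // send it to its other slot
            }
            rehash();       // too many displacements: grow, which changes h1/h2
            insert(key);
        }

    private:
        void rehash() {
            std::vector<std::optional<int>> old = std::move(table);
            table.assign(old.size() * 2, std::nullopt);       // new size, hence new modulus
            for (const auto& slot : old)
                if (slot) insert(*slot);
        }
    };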
Comment on weasel words
IMO the following fragment from the page is full of weasel words:
The relevance of cuckoo hashing to practice is becoming increasingly clear (although as of 2006 it is not yet widely known outside the research community.) Given the fact that cuckoo hashing is acknowledged to have a central place in the theory of hash functions, it was discovered surprisingly late -- in 2001.
Anybody have any sources for this "acknowledgement" or "increasingly clear relevance"? —Preceding unsigned comment added by Mandyhan (talk • contribs) 13:30, 6 October 2006
Weasel words
Hello,
I don't know who inserted the last comment. However, it seems that cuckoo hashing variants are currently the fastest schemes for cache-resident hash tables. See:
I have added the link to the article.
Best,
Rasmus Pagh —Preceding unsigned comment added by Pagh (talk • contribs) 11:00, 9 October 2006
Practicality of cuckoo hashing
From my own programming experience, cuckoo hashing makes every other in-memory hashing scheme as obsolete as a Wright brothers airplane. It's fast, easy to implement, and predictable - in my own projects, I use cuckoo hashing exclusively, and many thanks to Rasmus Pagh for his great idea.
The only practical issue I had with cuckoo hashing is that it is particularly sensitive to the choice of hash function; anything with bad mixing behaviour greatly increases the frequency of rehashes, sometimes to the point of utter unusability. But with any good hash function, cuckoo hashing works perfectly; functions that I found to be good are Bob Jenkins' hash, the "One at a time" hash, Paul Hsieh's hash (all can be found at http://burtleburtle.net/bob/hash/doobs.html ), and Thomas Wang's hash (found at http://www.concentric.net/~Ttwang/tech/inthash.htm ).
Other functions designed along similar lines would probably work just as well, while commonly used simple additive, multiplicative and rotating hash functions definitely would not. (A sketch of one of the functions mentioned above appears below.)
wjaguar 14:23, 14 October 2006 (UTC)
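For reference, here is a sketch of the "one at a time" hash mentioned above (Bob Jenkins' construction; the function name and signature here are just illustrative):

    #include <cstddef>
    #include <cstdint>

    // Bob Jenkins' "one-at-a-time" hash: each byte is folded into the state,
    // which is scrambled with shifts and xors; the final steps avalanche the
    // remaining bits. A well-mixing function of this kind keeps cuckoo
    // rehashes rare, as described in the comment above.
    std::uint32_t one_at_a_time(const unsigned char* key, std::size_t len) {
        std::uint32_t hash = 0;
        for (std::size_t i = 0; i < len; ++i) {
            hash += key[i];
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }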
Amen. Cuckoo Hashing is the best invention since sliced bread. I'm using ternary Cuckoo Hashing to map incoming UDP datagrams (or rather their sockaddr) to sessions. In my particular setup, it is roughly 3 times as fast as std::map, which I had used before, and doesn't have the nasty hiccups that linear probing has with collisions (which I tried too). Basically, Cuckoo Hashing turned "3000 sessions are beginning to get noticeable" into "I wish the ethernet cable had bandwidth for 7500".
As the hash function, I use Bernstein's well-known old hash with a larger prime, adding a number at the end to make it unique, and scrambling the bits again with two shift-xors. The table rarely ever needs a single rehash up to about 90-91% load. I've been able to stress it up to a 99.5% load experimentally, which still worked just fine, much to my surprise (although rehashing and thus insert time grows exponentially; at >99% load, a single insert takes 5-20 seconds real time...). Lookups are of course still O(1), which is just plain awesome. —Preceding unsigned comment added by 91.35.154.68 (talk) 12:37, 15 May 2008 (UTC)
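A rough sketch of the kind of construction the comment above describes (a Bernstein-style multiply-and-add loop, a number added at the end, and two shift-xor scrambles). The commenter's actual prime, salt, and shift amounts are not given, so the constants below are placeholders only:

    #include <cstddef>
    #include <cstdint>

    // Sketch only: Bernstein-style hash (hash = hash * prime + byte), a
    // per-table salt added at the end "to make it unique", and two shift-xor
    // passes to scramble the bits further. All constants are placeholders.
    std::uint32_t bernstein_xorshift(const unsigned char* key, std::size_t len,
                                     std::uint32_t salt) {
        std::uint32_t hash = 5381;
        for (std::size_t i = 0; i < len; ++i)
            hash = hash * 131u + key[i];   // 131 stands in for the "larger prime"
        hash += salt;                      // the number added at the end
        hash ^= hash >> 15;                // first shift-xor
        hash ^= hash << 7;                 // second shift-xor
        return hash;
    }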
Bipartite graph
I came up with a variation on this idea in 1991 for a software system that needed to store a large dictionary of two-character Mandarin words in a highly compressed hash table, at a time when our target market was mostly still running MSDOS in 640K. However, in our case, we didn't compute the cell assignment on the fly. We used a three-key approach to achieve a 90% fill ratio. We computed all the keys for all the elements in a batch operation, and then used the bipartite graph solution algorithm to find a legal placement, if any exists.
I was a bit choked at the time. We had no internet / electronic resources. I had about a dozen textbooks on fundamental algorithms, and I looked through them all without finding a suitable algorithm. So I implemented a heuristic that worked fairly well up to a point. Cells with no possible occupant simply stay empty. The cells with only one possible occupant go next, and as cells are stricken, so are any occupants reduced to one remaining possible cell. This shrinks the hairball quite a bit. I then created some kind of local metric for the assignments least contested, choosing the least contested assignment at each step. It was slow (hours), but I tuned it to where it would solve instances up to 80+ percent occupancy with no back-tracking. Then a coworker, who had mostly the same books, found the graph matching algorithm in a book I didn't have. We had to use a cheesy DOS extender to handle the extra memory structures, but once we got past that hurdle, we were placing close to 30,000 elements into 2^15 buckets in under 3s on our fastest machine, a 33 MHz 486.
That was far from the only hurdle, as we also incorporated a hardware security dongle into the hashing algorithm. Many of the "words" were actually stems for longer expressions, and because our hash check value was truncated almost to the point of insufficiency, the space of phantom matches grew exponentially with phrase length; but we had determined these phantoms didn't interfere with normal use of the system. I think if someone tried to exhaustively enumerate the dictionary, you got something close to 4 million 4-character Chinese expressions, of which 99.99% were phantoms. This made it difficult to replace the security dongle with a look-up table, and it wasn't possible to rekey the table to a new hash function without duplicating our work on the bipartite graph matching algorithm. We analyzed the phantom space against the legitimate data and had a small 16K data structure which suppressed any conflicts between valid data and phantom data likely to occur in practice.
Because we were hashing so much, we recoded the Rainbow dongle library to run ten times over spec, but it seemed to work (on all the systems we tested). Plus my analysis showed that the dongle was not producing good randomness, so we also had to massage the hardware contribution to our hash values. I never trusted Rainbow's claim that the hardware hash algorithm could not be reverse engineered, but I didn't have time to pursue it. On short seed lengths, it showed all the statistical potency of rot13. On medium seed lengths, it improved to RANDU quality.
Our software had a check for the dongle on startup, but otherwise never checked for the presence of the dongle; it just incorporated whatever random values the parallel port returned as part of the table lookup hash. Obviously, our Chinese input method returned nothing but garbage under this condition, but our software was reported as cracked because someone removed the dongle test on startup, without subsequently checking for correct operation of the main input algorithm.
From http://www.woodmann.com/crackz/Dongles.htm
Others use the return codes from the dongle as inputs to hash functions or as values that control execution flow, implementations of this kind of secure method are few and far between.
I guess we were the few. Obviously, it wasn't our purpose at the time to publish any of this, though looking back I wish I had. I don't have a published paper which I can cite to add the connection to bipartite graph matching to this article, which is too bad, because it's quite essential to a full understanding of this technique.
I also had an adaptive variant of this in the works to manage our font cache, as the print fonts were quite large. The nice property of this problem was that you could use the three-way version to get a high fill rate and shuffle things around for a few steps, but if this became tedious, you could just throw an entry away; it would get fetched again if needed. I also thought the memory management of the font pool sucked, so I was considering a fairly radical departure from conventional cache management wisdom. I was planning to implement the font pool as a packed circular buffer, which functioned as a FIFO rather than an LRU. If a glyph was accessed from the back 10% of the FIFO (entries soon to be discarded), there was an option to copy that entry back to the front (a fast bypass on discard/reload), which somewhat resembled an amortized coalescence. I thought the combination of an adaptive cuckoo hash with a circular font cache would have been killer, but my time there ended before I was able to test this empirically. The problem case was where the font cache was too small to hold the working set of the document being composited, with a high churn rate. Perhaps an LRU would have performed better under this condition. I was always shocked these techniques did not appear to be better known. MaxEnt 17:42, 21 September 2007 (UTC)
Improper anchor text on external links
[Three] [blog] [posts] by Michael Mitzenmacher expounding cuckoo hashing
This method of listing multiple hyperlinks is acceptable in informal contexts (discussion forums, etc.) but does not meet the W3C's recommendations for descriptive link anchors. From a semantic perspective, the anchors suggest that the first article is related to "three", the second to "blog" and the third to "posts". 212.121.153.12 (talk) 15:29, 7 December 2007 (UTC)
Difference between Cuckoo and Double hashing
What's the difference between this and double-hashing? The article makes it sound like using two hash functions was the main idea of the algorithm. Firefight (talk) 03:01, 14 August 2008 (UTC)
- The big difference is that once Double Hashing places an item in a certain location, the item stays there whereas Cuckoo Hashing will move items around when there is a collision. -- Derek Ross | Talk 20:38, 17 November 2008 (UTC)
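To make the difference concrete, here is a minimal C++ sketch of both insertion strategies (illustrative only: the single-table layout, the give-up bounds, and the assumption that h1/h2 already return indices smaller than the table size are mine, not taken from the article or its sources):

    #include <cstddef>
    #include <optional>
    #include <utility>
    #include <vector>

    // Double hashing: the new key keeps probing until it finds an empty slot;
    // keys that were placed earlier never move. h2(key) should be non-zero and
    // coprime with the table size so the probe sequence visits every slot.
    bool insert_double_hashing(std::vector<std::optional<int>>& t, int key,
                               std::size_t (*h1)(int), std::size_t (*h2)(int)) {
        const std::size_t m = t.size();
        for (std::size_t i = 0; i < m; ++i) {
            std::size_t pos = (h1(key) + i * h2(key)) % m;
            if (!t[pos]) { t[pos] = key; return true; }
        }
        return false;  // no free slot found along the probe sequence
    }

    // Cuckoo hashing: the new key claims one of its two slots immediately, and
    // the key it displaces is moved to *its* other slot, and so on.
    bool insert_cuckoo(std::vector<std::optional<int>>& t, int key,
                       std::size_t (*h1)(int), std::size_t (*h2)(int)) {
        std::size_t pos = h1(key);
        for (std::size_t i = 0; i < t.size(); ++i) {
            if (!t[pos]) { t[pos] = key; return true; }
            std::swap(key, *t[pos]);                      // displace the occupant
            pos = (pos == h1(key)) ? h2(key) : h1(key);   // move it to its other slot
        }
        return false;  // give up; a real implementation would rehash here
    }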
Algorithm
The explicit algorithm can be added, like the one on page 4 of: http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf —Preceding unsigned comment added by 79.21.6.176 (talk) 12:33, 12 May 2009 (UTC)
I believe there's a fatal flaw in the current pseudocode of the insert operation in the article. A cycle can be created, and although an infinite loop isn't possible due to the threshold for cuckoo loops, it can worsen the performance drastically.
Namely, consider 2 keys x and y such that h1(x) = h1(y). Then, by trying to insert y while x is already inserted on position h1(x), this creates a loop which fills the threshold and forces a rehash.
Therefore I propose a different algorithm:
    function insert(x) is
        if lookup(x) then
            return
        end if
        pos := h1(x)
        while iteration-count ≤ Max-Loop do
            if T[pos] = ⊥ then
                T[pos] := x
                return
            end if
            x ↔ T[pos]
            if pos = h1(x) then
                pos := h2(x)
            else
                pos := h1(x)
            end if
        end while
        rehash()
        insert(x)
    end function
This should avoid such cycles (namely, putting the element immediately back into the cell from which it was just evicted).
If anyone can confirm the correctness please replace the current algorithm. Similarly, if anyone has any objections, let me know. BlackWardie (talk) 15:55, 6 December 2021 (UTC)
Undue weight?
I just read the new material based on the investigation of Askitis. In my opinion this should be considerably shortened, to give it similar weight to the other empirical investigations of (bucketed) cuckoo hashing. Perhaps the best solution would be to have a separate article that discusses "array hash tables" in more depth? Even at the present length, a proper discussion of the space efficiency of array hash tables is missing; their space usage seems to include a considerable overhead in the memory management system (it is not accurate to simply sum up the sizes of currently allocated arrays). It is not clear to me if it is relevant at all to compare array hash tables to highly space-efficient hash tables. —Preceding unsigned comment added by Pagh (talk • contribs) 08:18, 15 September 2009 (UTC)
- UPDATE: Since no response has been made, I take this as evidence that the comparison is not relevant. I thus remove the description of the Askitis investigation. Pagh (talk) 09:38, 26 October 2009 (UTC)
- Thanks for your feedback. Considering your comment above, I suspect that you did not read the paper carefully. Check out the following quote from Section 4 of the paper
- "Fast and compact hash tables for integer keys, Volume 91, pages 101 - 110, 2009. ISSN: 1445-1336".
- "... our measure of space includes an estimate of the operating system overhead (16 bytes) imposed per memory allocation request, which we found to be consistent with the total memory usage reported in the /proc/stat/ table".
- This is a concise statement, but it serves to convey the fact that Askitis did not simply "sum up the size of currently allocated arrays", as you have stated. I have also read his recent VLDB Journal article on the HAT-trie, "Engineering scalable, cache and space efficient tries for strings",
- in which he explains in more detail the issue of memory fragmentation/overhead.
- So, in my opinion, it is clear that Askitis took memory overheads into account when reporting the results in his paper(s), and perhaps, for that reason, there was no need to go into an in-depth discussion on the topic, since the overheads do appear to be small.
- Anyhow, you can always ask him yourself to find out more. I have just studied his publications on data structures in my free time (no connection to the author), so I can only answer questions based on what I read and try out (software).
- I tried out the evaluation software provided on ozsort.com and the results do look consistent with what is reported in the paper. I have provided a snippet screen-shot of my Linux console below. The second column is the actual process size (MB) reported by the operating system, which captures memory fragmentation. The third column is the estimated memory usage (MB), which, as you can see, is fairly accurate. So as far as I can tell, there is an overhead, but it certainly isn't considerable. Under heavy load, the array hash is a space-efficient and scalable data structure.
- Array-hash-table-32bit-int 171.95 161.09 14.77 11.26 19907105 20000000 65536
- Array-hash-table-32bit-int 405.34 378.61 5.66 3.88 19907105 20000000 8388608
- Array-hash-table-32bit-int 545.17 526.91 5.11 3.70 19907105 20000000 16777216
- ...
- I will therefore undo your changes because I think it is important to highlight the practical limitations of bucketized cuckoo hashing (which is what the paper does) in comparison to the space- and cache-efficient array hash table (particularly when used with repeated key access). However, I will attempt to simplify the discussion; you can go ahead and simplify it further, though I would recommend that the important (practical) info regarding comparisons between speed and space, with and without repeated key access, be retained. Cheers.
Number of hash tables
The theory section of the article and the image show a single table which is used by two different hash functions. The example section, though, uses two different tables, and that seems to be how the algorithm is described in the source as well. --Tgr (talk) 02:05, 14 October 2012 (UTC)
On a closer reading, the source says "Another variant is to use a single table for both hash functions, but this requires keeping track of the hash function according to which each key is placed." but it is not clear to me whether the complexity results apply to this version. --Tgr (talk)
Choice of hash function
The article does not mention any restrictions for the choice; in the source they must form a universal family, which seems to be a non-trivial restriction. --Tgr (talk) 02:09, 14 October 2012 (UTC)
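For readers wondering what such a family looks like, here is a minimal sketch of one standard universal family (a Carter–Wegman style construction; this is an illustration of the concept, not the family used in the source, and the 31-bit key restriction is only there to keep the arithmetic inside 64 bits):

    #include <cstdint>
    #include <random>

    // h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn at
    // random. Drawing a fresh (a, b) pair is one way to obtain the "new hash
    // function" that cuckoo hashing needs when it rehashes.
    struct UniversalHash {
        static constexpr std::uint64_t p = 2147483647ULL;  // Mersenne prime 2^31 - 1
        std::uint64_t a, b, m;

        UniversalHash(std::uint64_t table_size, std::mt19937_64& rng)
            : a(std::uniform_int_distribution<std::uint64_t>(1, p - 1)(rng)),
              b(std::uniform_int_distribution<std::uint64_t>(0, p - 1)(rng)),
              m(table_size) {}

        // Keys are assumed to be smaller than p (i.e. to fit in 31 bits),
        // so a*x + b cannot overflow 64-bit arithmetic.
        std::uint64_t operator()(std::uint32_t x) const {
            return ((a * x + b) % p) % m;
        }
    };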
Choosing between hash functions in the example
It is not clear how the choice between the two hash functions is made in the example. It does not seem to follow the algorithm described in the source, which always inserts to the left. --Tgr (talk) 02:20, 14 October 2012 (UTC)
Unclear text under 'Theory'
Currently the text reads: "This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure would enter an infinite loop."
I can't tell what "or the procedure would enter an infinite loop" modifies or is saying. JustinBlank (talk) 16:20, 31 October 2015 (UTC)
- It modifies the procedure described earlier in the same sentence, which repeatedly kicks out one key in order to place another previously-kicked-out key. This procedure can stop when a key can be placed without kicking anything out. Or this procedure can get into a loop where the same sequence of keys repeatedly kick each other out. If this happens, it is detected and causes the algorithm to terminate in a different way. —18:00, 31 October 2015 (UTC)
Cuckoo hashing is slower than the previous state-of-the-art!
I edited the article to clarify this; it would be great if someone with more knowledge elaborated, explained why people even care about cuckoo hashing, and clarified in which real-world situations cuckoo hashing is worth using.
For a hash function with very uniform output, linear probing will demolish pretty much anything else available, as far as I'm aware. For a lot of more realistic general-purpose situations, where we want the fastest hash function possible over keys that are very related, carefully-tuned quadratic probing is fastest. Both of these are probably easier to implement, too, and quadratic probing is probably what any serious hash table implementation will be using.
I felt it was important this was pointed out, because most of the material I can find about cuckoo hashing, including the academic references for this article, has all of this glowing stuff about how fast cuckoo hashing is. But most of the time, they're comparing it to something that is super slow (like chained buckets, the bubble sort of the collision-resolution world), or to some unusual situation, such as a highly-loaded table, where hash tables are probably going to perform relatively poorly no matter what you do. Please, for great justice, if you ever find yourself needing to write a simple hash table for something, use linear probing. If you need it to be faster, you can do drop-in replacements for the hash function and probing strategy that won't change the rest of the hash table implementation, and with enough tuning, you'll have something as fast as anything else that currently exists -- and it will be more memory-efficient, simpler, and easier to debug than any of the more complex schemes. Aaron Will (talk) 01:56, 7 January 2016 (UTC)
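For readers unfamiliar with the approach being recommended above, a minimal sketch of a linear probing lookup (illustrative only; the point is that the probe sequence walks adjacent slots, which is what makes it cache-friendly):

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Linear probing lookup: after the initial hash, the probe sequence walks
    // consecutive slots, so the accesses stay within a few cache lines.
    bool contains_linear_probing(const std::vector<std::optional<int>>& t, int key,
                                 std::size_t (*h)(int)) {
        const std::size_t m = t.size();
        const std::size_t start = h(key) % m;
        for (std::size_t i = 0; i < m; ++i) {
            const auto& slot = t[(start + i) % m];
            if (!slot) return false;       // an empty slot ends the cluster
            if (*slot == key) return true;
        }
        return false;                      // table full and key absent
    }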
- My point in reverting your edit was not that it was incorrect, but that we cannot include such comparisons without sources. Your discussion above comes off as based on your personal experience rather than what sources say, and that's not what we should be using in Wikipedia. As for speed: are you talking about a hash table that is frequently changing, or one that is largely static? Because if you're going to probe a sequence of random-looking locations (quadratic probing) or at most two locations (cuckoo), and otherwise the probes are the same, why should cuckoo be any slower? And if you're doing cuckoo with two separate calls to the hash function and that's what's slowing you down, why are you doing it that way? —David Eppstein (talk) 03:40, 7 January 2016 (UTC)
- I watched a great set of lectures on the STL and the real-world implications of how they helped, and it had a glowing review of Cuckoo hashing, with the caveat that it suffers in real practical applications: every single insert demands a cache miss, because caches are so dependent on linear insertions, and so primed for them, that cuckoo tables end up being slower. It's a brilliant idea, fundamentally amazing, and it provides some of the best time complexity guarantees, but in practice, because you bounce the new data so far away, you always miss the cache and end up going slower in real-world applications. Ultimately that's why it's not super-popular, and the article talks about tables that are fully stored in a cache. Tat (talk) 23:19, 18 May 2017 (UTC)
Possible bug in example (key 75)
I don't understand why key 75 is inserted into the first table's 9th row.
The first table's 9th row is occupied by 20, while the second table's 6th row is empty.
In my opinion, 75 should initially be placed in the 6th row of the second table.
Please clarify if I'm wrong. — Preceding unsigned comment added by 2A02:A31C:25B:BA00:422:DFB3:D080:491B (talk • contribs) 2021-08-29T15:22:23 (UTC)
- Please remember to sign your comments. You seem to be right. See that 53 was inserted in the second table because of a collision. BernardoSulzbach (talk) 17:05, 29 August 2021 (UTC)
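(A note for later readers, assuming the example still uses the hash functions h(k) = k mod 11 and h'(k) = ⌊k/11⌋ mod 11: then h(75) = 75 mod 11 = 9, which collides with 20 since h(20) = 9, and h'(75) = ⌊75/11⌋ mod 11 = 6. The question above is therefore whether insertion must probe h first, displacing 20 from row 9, or may go straight to the free cell at row 6 of the second table.)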
Perf. degradation
@David Eppstein: Regarding this edit diff. I don't think the sources say it'd fail catastrophically; maybe cite me some? Herlihy M., Shavit N., Tzafrir M. (2008) mentions the degradation of performance, since the infinite loop condition might at some point be resolved through rehashing, which is where the performance degrades since it's costly. I'm not sure why you mentioned it'd be unable to complete an insertion at all. WikiLinuz (talk) 23:01, 28 October 2021 (UTC)
- Having to rehash with a new hash function is a form of catastrophic failure. Also, unless your rehash also involves a resize, the rehash is highly likely to fail again in the same way. (For the basic form of cuckoo hashing, for a load factor strictly less than 0.5, it has high probability of being able to accommodate all keys, and for a load factor strictly greater than 0.5, it has high probability of not doing so.) —David Eppstein (talk) 23:14, 28 October 2021 (UTC)
- I added that revision since there isn't a strong indication of the load factor's correlation to performance degradation (or "catastrophic failure", for that matter) in the article, although we could see a minimal mention under Cuckoo hashing#Theory. I suggest this revision under the Theory sub-section:
"The insertions into the hash table are expected to take constant time, even considering the possibility of having to rebuild the table.[1] However, this is based on the assumption that the load factor is below 50%, since a table with a higher load factor would cause performance degradation as displacement sequences become too long, and the table needs to be resized and rehashed.[2]"
WikiLinuz (talk) 23:38, 28 October 2021 (UTC)
- Still not good enough. A table bigger than 50% would JUST NOT WORK, with high probability, even after rehashing. "Performance degradation" is a misleading way of saying that. It would require resizing, not just limping along with poor performance at that size. Would you say that a table with load factor bigger than 1 produces "performance degradation"? It's more or less the same thing here, only with high probability of failing rather than absolutely zero chance of success. —David Eppstein (talk) 00:06, 29 October 2021 (UTC)
- Hmm, I see. Maybe I'm missing something; I'll carefully look at the analysis of cuckoo hashing. To quote from the source: "A bigger disadvantage is that Cuckoo hashing tends to perform poorly when the table is more than 50% full because displacement sequences become too long, and the table needs to be resized." — Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". International Symposium on Distributed Computing. 5218. Springer Publishing: 351. doi:10.1007/978-3-540-87779-0_24. I was guided by the fact that the source didn't mention that a load factor above 0.5 wouldn't work at all (or would have a higher failure probability). I'll leave it here. Thanks, WikiLinuz (talk) 00:19, 29 October 2021 (UTC)
- The keyword to search for, in looking for sources on this, is "threshold". There's a good discussion of this, for instance, in DOI:10.1137/100797503, as well as multiple papers with "threshold" in their titles [1]. It's also important to note that the 50% threshold is only for the basic version of cuckoo hashing with two cells per key and one key per cell; a big reason for the study of other versions that increase those numbers is to bring the threshold closer to 100%. —David Eppstein (talk) 00:44, 29 October 2021 (UTC)
- Thank you for the paper, I'll look into that. WikiLinuz (talk) 00:48, 29 October 2021 (UTC)
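(A note for later readers, not taken from this article: in the threshold literature mentioned above, the load threshold of basic cuckoo hashing with two choices and one key per cell is 1/2, and it rises quickly as the number of choices d grows, to roughly 0.918 for d = 3, 0.977 for d = 4, and 0.992 for d = 5, which is why the variants with more choices or larger buckets can run much closer to full.)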
Mathematical notation presented as pseudo-code is hard to follow
Exactly what it says on the tin: as currently written, the dense mathematical notation masquerading as pseudo-code is very hard to follow.--74.96.235.183 (talk) 18:23, 27 May 2022 (UTC)
- Which specific pseudocode are you referring to? The one under "lookup" or "insertion"? I don't think the math notation is hard to follow; as far as I can see there are only a couple of math notations introduced, and they are explicitly explained in the relevant text that follows. --WikiLinuz {talk} 🍁 20:14, 27 May 2022 (UTC)
Improper insertion of element 53 in the example
Following the source code, shouldn't insertion of element 53 lead to it replacing key 20 in the left hash table? It is unclear why key 53 is saved into the right table directly. Please correct me if I misunderstood. Cj101192 (talk) 07:00, 11 February 2023 (UTC)
- I agree. I proposed a correction. Raphaaal (talk) 17:29, 19 July 2024 (UTC)
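(A note for later readers, again assuming the example's hash functions h(k) = k mod 11 and h'(k) = ⌊k/11⌋ mod 11: h(53) = 53 mod 11 = 9, which collides with 20 since h(20) = 9, and h'(53) = ⌊53/11⌋ mod 11 = 4. The question above is therefore whether 53 should first displace 20 from cell 9 of the left table or be placed directly at cell 4 of the right table.)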