0

I'm working on a problem which involves going through a lot of data. To reduce the work (because current calculations take about two weeks of compute time, and I'd like to reduce that dramatically) I came up with an algorithm which would be much faster if it was able to avoid a certain type of duplication. (The current algorithm avoids storing this information because it is too large, unreduced, to fit in memory.)

I have a collection of sets, and I don't want to insert a set A if there is already a set B which is a subset of A. At the moment the sets are represented by integers where individual binary digits represent a particular element being present or absent. In that interpretation the set/integer A should not be inserted if there is already a set/integer B such that (~A) & B is 0, where ~ is bitwise negation and & is bitwise AND.

For example, if my collection has the following sets

[ {a,b}, {b,c}, {b,d,e} ]

and I asked to add {b,c,e} it should not be added (since {b,c} is already there) and similarly with {a,b} (since {a,b} is there) but {a,e} should be added.

The numeric equivalent would be starting with

[ `0b11`, `0b110`, `0b11010` ]

where 0b10110 is not added since (~0b10110) & 0b110 == 0, 0b11 is not added since (~0b11) ^ 0b11 == 0, but 0b10001 can be added.

Ideally the structure would prune itself as new sets are added, so if {c} were added all existing sets containing c would be removed. But it's acceptable if it doesn't update in that way as long as I can normalize it to that form in some not-too-expensive way every so often.

1 Answer 1

1

This is a well-known problem known as "finding extremal sets"; unfortunately, there is nothing fundamentally faster known than the obvious approach of testing a newly inserted set against all existing sets, but good heuristic improvements exist. Here is a recent paper discussing this problem: https://arxiv.org/abs/1508.01753

An open-source implementation of a related algorithm: https://code.google.com/archive/p/google-extremal-sets/

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.