I'm working on a problem which involves going through a lot of data. To reduce the work (because current calculations take about two weeks of compute time, and I'd like to reduce that dramatically) I came up with an algorithm which would be much faster if it was able to avoid a certain type of duplication. (The current algorithm avoids storing this information because it is too large, unreduced, to fit in memory.)
I have a collection of sets, and I don't want to insert a set A if there is already a set B in the collection which is a subset of A. At the moment the sets are represented by integers, where each binary digit represents a particular element being present or absent. In that interpretation, the set/integer A should not be inserted if there is already a set/integer B such that (~A) & B is 0, where ~ is bitwise negation and & is bitwise AND.
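
For concreteness, here is a minimal sketch of that check in Python, assuming the collection is just a flat list of bitmask integers (the names `dominated` and `maybe_insert` are invented for illustration):

```python
def dominated(new_set: int, collection) -> bool:
    # True if some existing B is a subset of new_set,
    # i.e. (~new_set) & B == 0, so new_set is redundant.
    return any((~new_set) & b == 0 for b in collection)

def maybe_insert(new_set: int, collection: list) -> bool:
    # Insert new_set only if no existing member is a subset of it.
    if dominated(new_set, collection):
        return False
    collection.append(new_set)
    return True
```

This is just the naive linear scan per insertion, included to pin down the semantics of the check.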
For example, if my collection has the following sets
[ {a,b}, {b,c}, {b,d,e} ]
and I asked to add {b,c,e}, it should not be added (since {b,c} is already there); similarly {a,b} should not be added (since {a,b} is already there), but {a,e} should be added.
The numeric equivalent would be starting with
[ `0b11`, `0b110`, `0b11010` ]
where 0b10110 is not added since (~0b10110) & 0b110 == 0, 0b11 is not added since (~0b11) & 0b11 == 0, but 0b10001 can be added.
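
Using the sketch above, the worked example plays out like this (illustrative only):

```python
collection = [0b11, 0b110, 0b11010]    # {a,b}, {b,c}, {b,d,e}

maybe_insert(0b10110, collection)    # {b,c,e}: False, {b,c} is already a subset
maybe_insert(0b11,    collection)    # {a,b}:   False, {a,b} is already there
maybe_insert(0b10001, collection)    # {a,e}:   True, no existing subset
print([bin(x) for x in collection])  # ['0b11', '0b110', '0b11010', '0b10001']
```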
Ideally the structure would prune itself as new sets are added, so if {c} were added, all existing sets containing c would be removed. But it's acceptable if it doesn't update in that way, as long as I can normalize it to that form in some not-too-expensive way every so often.
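
One possible shape for the self-pruning insert and the occasional batch normalization, again in Python with invented names (`insert_and_prune`, `normalize`); this is still linear per insert and roughly quadratic for the batch pass, so it's only a baseline:

```python
def insert_and_prune(new_set: int, collection: list) -> bool:
    # Reject new_set if an existing set is a subset of it.
    if any((~new_set) & b == 0 for b in collection):
        return False
    # Otherwise drop every existing superset of new_set:
    # new_set is a subset of b exactly when (~b) & new_set == 0.
    collection[:] = [b for b in collection if (~b) & new_set != 0]
    collection.append(new_set)
    return True

def normalize(collection: list) -> list:
    # Batch version: keep only the minimal sets, i.e. no kept set
    # has another kept set as a subset. Processing by increasing
    # popcount guarantees any potential subset is seen first.
    result = []
    for s in sorted(set(collection), key=lambda x: bin(x).count("1")):
        if not any((~s) & b == 0 for b in result):
            result.append(s)
    return result
```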