Lets say, we have a lot of lists (sequences of variable length) of some limited set of elements:
[A, B, D, E, Q], [A, D], [A, C, E, F, G, H, Y, Z], [B, A, E, D, Z, Y], ...
The problem is to get the single order, based on "how typical" some position is, eg, in our case it can be (coincidence with alphabetical order is just a coincidence):
[A, B, C, D, E, F, G, H, Q, Y, Z]
But lets say [Z, Q, Y, C, B, A, D, E, F, G]
is "surprising" order, far from being "typical".
The questions are:
- are there any "official" CS terms for the situation (so I can do further searches)?
- what are practically popular definitions / metrics for "how typical" (or "surprising") in the above description?
My idea so far is to count "average position", normalized to the max length of the list. Then the order will be by increasing normalized average positions. I feel, maybe term rank and some statistics' techniques are also applicable to this, but I am more interested in some easily (one pass) computable metrics. Any pointers?
One more idea is to calculate probability of every pair where one element is before another, and sort based on whether probability is less than 0.5 or more. But this may need a triangular matrix and a lot of computations. (For the example above, p(A before D|both occur) = 1
, but p(A before B) = 0.5
).
One gotcha may be that is some element is usual for the beginnings and ends, but never the middle, the average result may still be in the middle... But I guess it may be too subtle to satisfy.
For now, using very rough method:
def approx_position(position, length, master_length):
return int(position * master_length / float(length))
seqs = ['A', 'B', 'D', 'E', 'Q'], ['A', 'D'], \
['A', 'C', 'E', 'F', 'G', 'H', 'Y', 'Z'], ['B', 'A', 'E', 'D', 'Z', 'Y']
master = []
for s in seqs:
total = len(s)
est_len = len(set(master) | set(s))
for i, item in enumerate(s):
if item not in master:
new_pos = approx_position(i, total, est_len)
master.insert(new_pos, item)
print master
# ['A', 'C', 'B', 'D', 'F', 'G', 'H', 'E', 'Y', 'Z', 'Q']
# when seqs in reversed order:
# ['B', 'C', 'A', 'F', 'E', 'G', 'H', 'D', 'Q', 'Z', 'Y']
NB. Python insert works on intervals between list elements. So 0 corresponds to the place before first element.
Not ideal, but at least almost one pass (one hidden pass in set union).
Something similar is for example discussed here: https://stackoverflow.com/questions/4645874/merging-some-sorted-lists-with-unknown-order-sequence However, my problem is overdetermined, so the question is how to do relaxation, while the results are still near optimal (the criteria is part of the answer).
(trying to widen my horizons as it is usually hard to find a right term for something if one has never encountered that term before)
ABCDE, AB
putsB
(average position (1/4 + 1)/2 = 0.625) afterC
(average position 0.5) $\endgroup$