Sort by typical position?

Question

Lets say, we have a lot of lists (sequences of variable length) of some limited set of elements:

[A, B, D, E, Q], [A, D], [A, C, E, F, G, H, Y, Z], [B, A, E, D, Z, Y], ...

The problem is to get the single order, based on "how typical" some position is, eg, in our case it can be (coincidence with alphabetical order is just a coincidence):

[A, B, C, D, E, F, G, H, Q, Y, Z]

But lets say [Z, Q, Y, C, B, A, D, E, F, G] is "surprising" order, far from being "typical".

The questions are:

are there any "official" CS terms for the situation (so I can do further searches)?
what are practically popular definitions / metrics for "how typical" (or "surprising") in the above description?

My idea so far is to count "average position", normalized to the max length of the list. Then the order will be by increasing normalized average positions. I feel, maybe term rank and some statistics' techniques are also applicable to this, but I am more interested in some easily (one pass) computable metrics. Any pointers?

One more idea is to calculate probability of every pair where one element is before another, and sort based on whether probability is less than 0.5 or more. But this may need a triangular matrix and a lot of computations. (For the example above, p(A before D|both occur) = 1, but p(A before B) = 0.5).

One gotcha may be that is some element is usual for the beginnings and ends, but never the middle, the average result may still be in the middle... But I guess it may be too subtle to satisfy.

For now, using very rough method:

def approx_position(position, length, master_length):
    return int(position * master_length / float(length))

seqs = ['A', 'B', 'D', 'E', 'Q'], ['A', 'D'], \
    ['A', 'C', 'E', 'F', 'G', 'H', 'Y', 'Z'], ['B', 'A', 'E', 'D', 'Z', 'Y']

master = []
for s in seqs:
    total = len(s)
    est_len = len(set(master) | set(s))
    for i, item in enumerate(s):
        if item not in master:
            new_pos = approx_position(i, total, est_len)
            master.insert(new_pos, item)

print master
# ['A', 'C', 'B', 'D', 'F', 'G', 'H', 'E', 'Y', 'Z', 'Q']
# when seqs in reversed order:
# ['B', 'C', 'A', 'F', 'E', 'G', 'H', 'D', 'Q', 'Z', 'Y']

NB. Python insert works on intervals between list elements. So 0 corresponds to the place before first element.

Not ideal, but at least almost one pass (one hidden pass in set union).

Something similar is for example discussed here: https://stackoverflow.com/questions/4645874/merging-some-sorted-lists-with-unknown-order-sequence However, my problem is overdetermined, so the question is how to do relaxation, while the results are still near optimal (the criteria is part of the answer).

(trying to widen my horizons as it is usually hard to find a right term for something if one has never encountered that term before)

Ranking by average position can produce suboptimal results: ABCDE, AB puts B (average position (1/4 + 1)/2 = 0.625) after C (average position 0.5) — John Dvorak, Commented Feb 4, 2018 at 14:15
I don't understand the exact problem statement. The input is a bunch of ordered lists, and the desired output is an ordered list. But what ordered list do you want? What relationship do you want it to have to the input? Can you give a precise specification of the problem? Or are you not sure exactly what properties you want the output to have? There are many ways to choose an output list that is somehow similar/related in order to the input lists, and it's not clear how you want us to choose one among those possibilities. — D.W., Commented Feb 4, 2018 at 18:46
This question is not precise specification (is it a problem?).What I am waiting from good answer is to point me how this class of problems is called and some useful approaches to the criteria of good solution. What I am asking is analogue to asking finding shortest path without giving specific metrics in hope that Euclidean and Mahalanobis distances will come up as popular ones. But the question is not vague. It just lacks naming "the thing" ("finding shortest path"),which prevents to find relevant research.I know these questions aren't received well, but I believe they aren't off-topic — Roman Susi, Commented Feb 4, 2018 at 19:31
I do not believe these situations aren't researched under some obscure term, maybe, even more than once independently in statistics, discrete math / graph theory, operations research, and some areas of CS like ML. Of course, I know it's optimization problem, so basically I am asking what subclass of optimization problems this belongs to and what are popular objective functions for these problems intuitively corresponding to "least surprising sequence"? — Roman Susi, Commented Feb 4, 2018 at 19:46

giusti · Accepted Answer · 2018-02-04 20:40:45Z

2

As I understand your question, your problem statement is:

Given a set of ordered lists $\{L_1,L_2,\ldots,L_n\}$, learn the relation $r(a,b)$ for the partial order of each $L_1,L_2,\ldots,L_n$. Then use this learned relation to sort a new list.

So, in your example, the relation would be the lexicographical order. But your data set has some noise because the last list breaks the order in several places.

This seems to be a case of preferrence learning, and more specifically, object ranking.

edited Feb 4, 2018 at 20:40

answered Feb 4, 2018 at 19:48

giusti

2203 silver badges9 bronze badges

$\begingroup$ Yes! This seems to be the right thing. I wonder why it did not come to me as the whole problem can be formulated as learning. Now it becomes a technicality to find specific way and objective functions, even from the linked Wikipedia page. $\endgroup$
– Roman Susi
Commented Feb 4, 2018 at 19:54

Add a comment |

John Dvorak · Accepted Answer · 2018-02-04 14:40:23Z

0

The problem of sorting elements given some approximately sorted samples is difficult. There is a similar problem called ranked voting (not a CS term, sorry). It is simpler in two important aspects - all of the lists given contain all of the elements each, and we are only interested in the top few slots rather than the entire list. Your proposed method is equivalent to the Borda count method.

However, if you are only interested in finding surprising orderings, you can do it by computing the average disagreement between each pair of lists. As for the disagreement between two lists, I propose counting the number of pairs of elements that exist in both lists and whose relative order differs. For example ABCD and DCBA have six conflicting pairs while ABCD and FEDC only have one.

answered Feb 4, 2018 at 14:40

John Dvorak

2912 silver badges9 bronze badges

$\begingroup$ Not interested in surprising ordering. More interested how to call it properly and to find some fast heuristic, which works most of the time. $\endgroup$
– Roman Susi
Commented Feb 4, 2018 at 17:42
$\begingroup$ Updated my answer with very approximate and dumb heuristic algorithm. $\endgroup$
– Roman Susi
Commented Feb 4, 2018 at 18:01
$\begingroup$ @RomanSusi my proposal runs at $O(lists^2 + elems^2 * lists)$ - above that, speed is a limiting factor. As for naming, I'm afraid "ranked voting" is the closest you'll get. $\endgroup$
– John Dvorak
Commented Feb 4, 2018 at 21:09
$\begingroup$ Quite heavy computationally. I was hoping for O(lists * elems). Is your proposal in the second part, that is, minimizing the number of conflicting pairs? $\endgroup$
– Roman Susi
Commented Feb 4, 2018 at 21:16

Add a comment |

Stack Exchange Network

Sort by typical position?

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged
string-metrics
ordering
or ask your own question.

Hot Network Questions

Sort by typical position?

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged string-metricsordering or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
string-metrics
ordering
or ask your own question.