2

I am looking to transform a list into smaller lists of equal values. An example I have is:

["a", "a", "a", "b", "b", "c", "c", "c", "c"] 

to

[["a", "a", "a"], ["b", "b"], ["c", "c", "c", "c"]]

What do you think is the most efficient way to do this?

2
  • 3
    Are equal values necessarily consecutive?
    – anonymoose
    Commented Jun 19, 2017 at 22:42
  • I sorted the list to make values consecutive
    – Enesxg
    Commented Jun 19, 2017 at 22:48

4 Answers 4

9

You could use itertools.groupby to solve the problem:

>>> from itertools import groupby
>>> [list(grp) for k, grp in groupby(["a", "a", "a", "b", "b", "c", "c", "c", "c"])]
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]

It only groups consecutive equal elements but that seems enough in your case.

3
  • 1
    One can (should?) sort the list before grouping.
    – DYZ
    Commented Jun 19, 2017 at 22:46
  • 1
    That depends on the exact requirements. If it should group equal consecutive elements then "no", if it should group all equal values (overall) and keep the order then there are better ways using an OrderedDict and Counter. Just in case the order doesn't matter and equal elements aren't consecutive then sorting is an efficient strategy. The most efficient way for the example given is just using groupby (without sorting). :)
    – MSeifert
    Commented Jun 19, 2017 at 22:48
  • Agree. The OP just said that they sorted the list for convenience.
    – DYZ
    Commented Jun 19, 2017 at 22:49
5

You could use collections.Counter

>>> lst = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
>>> import collections
>>> collections.Counter(lst).most_common()
[('c', 4), ('a', 3), ('b', 2)]

This works even when the values are not ordered and provides a very compact representation which then you could expand if needed into lists:

>>> [[i]*n for i,n in collections.Counter(lst).most_common()]
[['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]
2
  • Do you know how to access the counter value of each of the elements? In this case the 4, 3 , and 2
    – Enesxg
    Commented Jun 19, 2017 at 22:54
  • Sure, just use: [n for i,n in collections.Counter(lst).most_common()] Commented Jun 19, 2017 at 22:55
2

Another manner to have your desired output is by using defaultdict from collections module (best time using this approach was: ~= 0.02s same as using groupby):

from collections import defaultdict
a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
b = defaultdict(list)
for k in a:
    b[k].append(k)

>>> b 
defaultdict(list,
            {'a': ['a', 'a', 'a'], 'b': ['b', 'b'], 'c': ['c', 'c', 'c', 'c']})

So, what you have to do now is:

list(b.values())
>>> [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
0

While I'd personally go with itertools.groupby as the most convinient way, you've asked for efficiency and this should be considerably faster than any of the itertools options:

data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] 

lookup = {}  # lookup map
result = []
for element in data:
    if element not in lookup:
        target = lookup[element] = [element]
        result.append(target)
    else:
        lookup[element].append(element)

print(result)
# [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]

If the data is always ordered (i.e. elements don't mix) this can be further optimized without a lookup table and using a list comprehension for maximum performance.

UPDATE - Some clarification on efficiency and operation. If you set up your test as:

from itertools import groupby

def itools_func(data):
    return [list(grp) for k, grp in groupby(data)]

def manual_func(data):
    lookup = {}
    result = []
    for element in data:
        if element not in lookup:
            target = lookup[element] = [element]
            result.append(target)
        else:
            lookup[element].append(element)
    return result

The problem is that those two will not return the same values:

test_data = ["a", "a", "b", "c", "c", "b", "a"]

itools_func(test_data)  # [['a', 'a'], ['b'], ['c', 'c'], ['b'], ['a']]
manual_func(test_data)  # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c']]

From OP's question, I understood he wants the latter one (based on his comment "I sorted the list to make values consecutive") because with a sorted list this can be done far easier. So, if we feed those functions a really long list:

test_data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] * 10000  # 10000 x the original

On my system it clocks as follows:

itools_func - 100 loops: 2.668s, per loop: 26.68ms
manual_func - 100 loops: 1.005s, per loop: 10.05ms

But, this is an unfavorable setting for itertools.groopby. If the data was to be sorted like:

test_data = ["a"] * 3000 + ["b"] * 2000 + ["c"] * 40000

The story is quite a bit different as the C backend kicks in:

itools_func - 1000 loops: 656.3ms, per loop: 656.3µs
manual_func - 1000 loops: 4.816s, per loop: 4.816ms

When the data is sorted the manual function can be further optimized but it will hardly beat what itertools does under the hood.

3
  • Well, if you care about efficiency you should probably use a defaultdict, or at least use the .setdefault method of plain dicts, instead of checking if element not in lookup:. Also, I'm curious why you say this will be considerably faster. Do you have a timing? itertools.groupby is written in C, after all. Commented Jun 19, 2017 at 22:56
  • That's only "more" efficient for really short inputs. If the data is big or huge this will be much slower.
    – MSeifert
    Commented Jun 19, 2017 at 22:57
  • @juanpa.arrivillaga @MSeifert - I've updated my post with some numbers. As for why not using defaultdict - it doesn't add anything here, in fact it would just add more steps when extracting the data as one would need to keep a separate list, together with the if element in lookup, to maintain the order. I tried it with defaultdict and it ends up ~1% slower on average.
    – zwer
    Commented Jun 19, 2017 at 23:44

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.