Splitting list into smaller lists of equal values

Question

I am looking to transform a list into smaller lists of equal values. An example I have is:

["a", "a", "a", "b", "b", "c", "c", "c", "c"]

to

[["a", "a", "a"], ["b", "b"], ["c", "c", "c", "c"]]

What do you think is the most efficient way to do this?

Are equal values necessarily consecutive?
– anonymoose
Commented Jun 19, 2017 at 22:42 — anonymoose, Commented Jun 19, 2017 at 22:42
I sorted the list to make values consecutive
– Enesxg
Commented Jun 19, 2017 at 22:48 — Enesxg, Commented Jun 19, 2017 at 22:48

MSeifert · Accepted Answer · 2017-06-19 22:42:41Z

9

You could use itertools.groupby to solve the problem:

>>> from itertools import groupby
>>> [list(grp) for k, grp in groupby(["a", "a", "a", "b", "b", "c", "c", "c", "c"])]
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]

It only groups consecutive equal elements but that seems enough in your case.

answered Jun 19, 2017 at 22:42

MSeifert

152k41 gold badges349 silver badges366 bronze badges

1

One can (should?) sort the list before grouping.
– DYZ
Commented Jun 19, 2017 at 22:46
1

That depends on the exact requirements. If it should group equal consecutive elements then "no", if it should group all equal values (overall) and keep the order then there are better ways using an OrderedDict and Counter. Just in case the order doesn't matter and equal elements aren't consecutive then sorting is an efficient strategy. The most efficient way for the example given is just using groupby (without sorting). :)
– MSeifert
Commented Jun 19, 2017 at 22:48
Agree. The OP just said that they sorted the list for convenience.
– DYZ
Commented Jun 19, 2017 at 22:49

Add a comment |

Josep Valls · Accepted Answer · 2017-06-19 22:45:09Z

5

You could use collections.Counter

>>> lst = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
>>> import collections
>>> collections.Counter(lst).most_common()
[('c', 4), ('a', 3), ('b', 2)]

This works even when the values are not ordered and provides a very compact representation which then you could expand if needed into lists:

>>> [[i]*n for i,n in collections.Counter(lst).most_common()]
[['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]

answered Jun 19, 2017 at 22:45

Josep Valls

5,5513 gold badges37 silver badges67 bronze badges

Do you know how to access the counter value of each of the elements? In this case the 4, 3 , and 2
– Enesxg
Commented Jun 19, 2017 at 22:54
Sure, just use: [n for i,n in collections.Counter(lst).most_common()]
– Josep Valls
Commented Jun 19, 2017 at 22:55

Add a comment |

Chiheb Nexus · Accepted Answer · 2017-06-19 23:05:05Z

2

Another manner to have your desired output is by using defaultdict from collections module (best time using this approach was: ~= 0.02s same as using groupby):

from collections import defaultdict
a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
b = defaultdict(list)
for k in a:
    b[k].append(k)

>>> b 
defaultdict(list,
            {'a': ['a', 'a', 'a'], 'b': ['b', 'b'], 'c': ['c', 'c', 'c', 'c']})

So, what you have to do now is:

list(b.values())
>>> [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]

edited Jun 19, 2017 at 23:05

answered Jun 19, 2017 at 22:57

Chiheb Nexus

9,2574 gold badges31 silver badges44 bronze badges

Add a comment |

zwer · Accepted Answer · 2017-06-19 23:32:44Z

While I'd personally go with itertools.groupby as the most convinient way, you've asked for efficiency and this should be considerably faster than any of the itertools options:

data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] 

lookup = {}  # lookup map
result = []
for element in data:
    if element not in lookup:
        target = lookup[element] = [element]
        result.append(target)
    else:
        lookup[element].append(element)

print(result)
# [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]

If the data is always ordered (i.e. elements don't mix) this can be further optimized without a lookup table and using a list comprehension for maximum performance.

UPDATE - Some clarification on efficiency and operation. If you set up your test as:

from itertools import groupby

def itools_func(data):
    return [list(grp) for k, grp in groupby(data)]

def manual_func(data):
    lookup = {}
    result = []
    for element in data:
        if element not in lookup:
            target = lookup[element] = [element]
            result.append(target)
        else:
            lookup[element].append(element)
    return result

The problem is that those two will not return the same values:

test_data = ["a", "a", "b", "c", "c", "b", "a"]

itools_func(test_data)  # [['a', 'a'], ['b'], ['c', 'c'], ['b'], ['a']]
manual_func(test_data)  # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c']]

From OP's question, I understood he wants the latter one (based on his comment "I sorted the list to make values consecutive") because with a sorted list this can be done far easier. So, if we feed those functions a really long list:

test_data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] * 10000  # 10000 x the original

On my system it clocks as follows:

itools_func - 100 loops: 2.668s, per loop: 26.68ms
manual_func - 100 loops: 1.005s, per loop: 10.05ms

But, this is an unfavorable setting for itertools.groopby. If the data was to be sorted like:

test_data = ["a"] * 3000 + ["b"] * 2000 + ["c"] * 40000

The story is quite a bit different as the C backend kicks in:

itools_func - 1000 loops: 656.3ms, per loop: 656.3µs
manual_func - 1000 loops: 4.816s, per loop: 4.816ms

When the data is sorted the manual function can be further optimized but it will hardly beat what itertools does under the hood.

Well, if you care about efficiency you should probably use a defaultdict, or at least use the .setdefault method of plain dicts, instead of checking if element not in lookup:. Also, I'm curious why you say this will be considerably faster. Do you have a timing? itertools.groupby is written in C, after all. — juanpa.arrivillaga, Commented Jun 19, 2017 at 22:56
That's only "more" efficient for really short inputs. If the data is big or huge this will be much slower. — MSeifert, Commented Jun 19, 2017 at 22:57
@juanpa.arrivillaga @MSeifert - I've updated my post with some numbers. As for why not using defaultdict - it doesn't add anything here, in fact it would just add more steps when extracting the data as one would need to keep a separate list, together with the if element in lookup, to maintain the order. I tried it with defaultdict and it ends up ~1% slower on average. — zwer, Commented Jun 19, 2017 at 23:44

Collectives™ on Stack Overflow

Splitting list into smaller lists of equal values

4 Answers 4

Your Answer

Not the answer you're looking for? Browse other questions tagged
python
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
python
or ask your own question.