Extremely slow pivot.count (compared with pivot.sum, pivot.max, pivot.first, ...) #1129

Collaborator

Are you using Python or Rust?

Python.

What version of polars are you using?

0.8.20 (arrow-rs)

What operating system are you using polars on?

CentOS 7

Describe your bug.

pivot(...).count() in the following example is extremely slow compared with sum, max, first, ...: roughly 18 s wall time for count versus 70 ms for sum on the same data.

What are the steps to reproduce the behavior?

In [36]: df_pl
Out[36]: 
shape: (2754958, 6)
╭─────────────┬──────────┬──────────┬────────────────────┬───────┬─────────╮
│ chrom       ┆ begin    ┆ end      ┆ bc                 ┆ count ┆ species │
│ ---         ┆ ---      ┆ ---      ┆ ---                ┆ ---   ┆ ---     │
│ str         ┆ i64      ┆ i64      ┆ str                ┆ i64   ┆ str     │
╞═════════════╪══════════╪══════════╪════════════════════╪═══════╪═════════╡
│ "hg38_chr1" ┆ 115686   ┆ 115755   ┆ "TCGACTCACCCATGTA" ┆ 1     ┆ "hg38"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "hg38_chr1" ┆ 134631   ┆ 134668   ┆ "GTTGACTGACGTGCAC" ┆ 2     ┆ "hg38"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "hg38_chr1" ┆ 181465   ┆ 181517   ┆ "TACCAGTGATTGCGTA" ┆ 1     ┆ "hg38"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "hg38_chr1" ┆ 181471   ┆ 181521   ┆ "CGCACACGATGCTCAA" ┆ 8     ┆ "hg38"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ...         ┆ ...      ┆ ...      ┆ ...                ┆ ...   ┆ ...     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "mm10_chrY" ┆ 90808847 ┆ 90808872 ┆ "CAAGGGAACATCATCG" ┆ 1     ┆ "mm10"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "mm10_chrY" ┆ 90811645 ┆ 90811719 ┆ "TAGCTTTACCGTTATG" ┆ 1     ┆ "mm10"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "mm10_chrY" ┆ 90825443 ┆ 90825481 ┆ "CGGTGTGTGCTCACGG" ┆ 1     ┆ "mm10"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "mm10_chrY" ┆ 90826209 ┆ 90826253 ┆ "CTCTGGACTGACTGTT" ┆ 1     ┆ "mm10"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ "mm10_chrY" ┆ 90828334 ┆ 90828476 ┆ "GGGTCGACTAACGAAT" ┆ 1     ┆ "mm10"  │
╰─────────────┴──────────┴──────────┴────────────────────┴───────┴─────────╯


In [33]: %time df_pl.groupby(['bc', 'species']).agg(pl.col('count').sum().alias('count')).groupby(['bc', 'species']).pivot(pivot_column='species', values_column='count').count().fill_none(0)
CPU times: user 19 s, sys: 154 ms, total: 19.1 s
Wall time: 18 s
Out[33]: 
shape: (26864, 4)
╭────────────────────┬─────────┬──────┬──────╮
│ bc                 ┆ species ┆ hg38 ┆ mm10 │
│ ---                ┆ ---     ┆ ---  ┆ ---  │
│ str                ┆ str     ┆ i64  ┆ i64  │
╞════════════════════╪═════════╪══════╪══════╡
│ "TCGCACGGAGCCACGT" ┆ "hg38"  ┆ 1    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "TAGCTTTACTAGGAGC" ┆ "mm10"  ┆ 0    ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "TCTGAACGATGTGGGA" ┆ "mm10"  ┆ 0    ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "GTCTAATCTGTCCCGA" ┆ "mm10"  ┆ 0    ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                ┆ ...     ┆ ...  ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "GTTCAGAACAGTGAGC" ┆ "hg38"  ┆ 1    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "TAAGCCAACTCGAGAG" ┆ "hg38"  ┆ 1    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "GTCGGCTGACGAAGCA" ┆ "mm10"  ┆ 0    ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "AAGAACCCTCAAGGTG" ┆ "hg38"  ┆ 1    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "CGCGATCGACGAAAGC" ┆ "mm10"  ┆ 0    ┆ 1    │
╰────────────────────┴─────────┴──────┴──────╯

In [34]: %time df_pl.groupby(['bc', 'species']).agg(pl.col('count').sum().alias('count')).groupby(['bc', 'species']).pivot(pivot_column='species', values_column='count').sum().fill_none(0)
CPU times: user 1.04 s, sys: 277 ms, total: 1.31 s
Wall time: 70 ms
Out[34]: 
shape: (26864, 4)
╭────────────────────┬─────────┬──────┬──────╮
│ bc                 ┆ species ┆ hg38 ┆ mm10 │
│ ---                ┆ ---     ┆ ---  ┆ ---  │
│ str                ┆ str     ┆ i64  ┆ i64  │
╞════════════════════╪═════════╪══════╪══════╡
│ "AGGTAGGTGCGCACTG" ┆ "mm10"  ┆ 0    ┆ 2    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "GATACCATGAGCCTAT" ┆ "hg38"  ┆ 3    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "ATCCAGCCTATGGCCC" ┆ "hg38"  ┆ 2    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "TCTATGGGACCTCAAA" ┆ "hg38"  ┆ 9    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                ┆ ...     ┆ ...  ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "ACGATCATGCCAAACT" ┆ "mm10"  ┆ 0    ┆ 5072 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "AAGACGGTGGCACCCA" ┆ "mm10"  ┆ 0    ┆ 2    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "CAGATACTGTCTGCAA" ┆ "mm10"  ┆ 0    ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "CCCATGTACACCAAGG" ┆ "hg38"  ┆ 9    ┆ 0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "GGACTTAACCCGCTAG" ┆ "hg38"  ┆ 1    ┆ 0    │
╰────────────────────┴─────────┴──────┴──────╯

@ritchie46 changed the title from "Extremely slow count (compared with sum, max, first, ...)" to "Extremely slow pivot.count (compared with pivot.sum, pivot.max, pivot.first, ...)" on Aug 11, 2021.

@ritchie46
Member

I already know what the issue is. This is accidentally quadratic.
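
For readers unfamiliar with the term, here is a minimal pure-Python sketch of how a pivot count can become accidentally quadratic. The thread does not show the polars code path responsible, so this is only a generic illustration and not the library's implementation; the function names `pivot_count_quadratic` and `pivot_count_linear` and the `(bc, species)` row layout are assumptions made for the example.

```python
def pivot_count_quadratic(rows, pivot_values):
    """Count rows per (key, pivot value) by rescanning every row for each key.

    Cost grows as (number of distinct keys) x (number of rows), which is
    quadratic when almost every row has its own key.
    """
    keys = list(dict.fromkeys(key for key, _ in rows))  # distinct keys, in order
    out = {}
    for key in keys:
        out[key] = {
            pv: sum(1 for k, p in rows if k == key and p == pv)  # full rescan
            for pv in pivot_values
        }
    return out


def pivot_count_linear(rows, pivot_values):
    """Same result in a single pass: one hash-map update per row."""
    out = {}
    for key, pv in rows:
        counts = out.setdefault(key, dict.fromkeys(pivot_values, 0))
        counts[pv] = counts.get(pv, 0) + 1
    return out


# Hypothetical miniature of the (bc, species) pairs from the report.
rows = [("a", "hg38"), ("b", "mm10"), ("a", "hg38"), ("a", "mm10")]
assert pivot_count_quadratic(rows, ["hg38", "mm10"]) == pivot_count_linear(
    rows, ["hg38", "mm10"]
)
```

Both functions return the same counts; only the amount of work differs. With about 27,000 groups, as in the report, a per-group rescan of that kind would be enough to turn a millisecond-scale aggregation into one that takes seconds.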
