Add transpose method to dataframe #1176

Closed
tversteeg opened this issue Aug 20, 2021 · 11 comments

@tversteeg

Similar to pandas.
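
For reference, the pandas behaviour being requested looks like this (a minimal illustrative frame, not taken from the issue):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# In pandas, transposition is available both as a method and as the `.T`
# property: rows become columns and columns become rows.
print(df.transpose())
print(df.T)
```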

@ritchie46
Member

import polars as pl

df = pl.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df)

# Rebuilding a DataFrame from the rows flips rows and columns.
print(pl.DataFrame(df.rows()))
shape: (2, 2)
╭──────┬──────╮
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 4    │
╰──────┴──────╯
shape: (2, 2)
╭──────────┬──────────╮
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 2        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3        ┆ 4        │
╰──────────┴──────────╯

@tversteeg
Author

tversteeg commented Aug 20, 2021

Thanks! Would it make sense to create an alias called transpose for that function, which also creates the new dataframe from it? If not, it might be a good idea to mention the keyword "transpose" somewhere in the description of rows, so that searching for it turns it up.

@ritchie46
Member

Yes, will add the alias. 👍

@alippai

alippai commented Aug 20, 2021

@ritchie46 I'm wondering whether it would make sense to introduce a lazy transpose, skipping the extra allocation (turning it into an iterator) when there is a subsequent operation. Edit: this might fit ndarray better, e.g. it could optimize DF * DF.transpose().

@ritchie46
Member

Now that I think of it, I will support this natively, as going through Python rows is super expensive.

> @ritchie46 I'm wondering whether it would make sense to introduce a lazy transpose, skipping the extra allocation (turning it into an iterator) when there is a subsequent operation. Edit: this might fit ndarray better, e.g. it could optimize DF * DF.transpose().

@alippai Currently these operations sadly cannot be done lazily. I need to know the schema of every node in the query plan, and operations like pivot and transpose derive their schema from the data (which is unknown at that point).
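
A small illustration of why the transposed schema is data-dependent (hypothetical frames, using the rows-based workaround from above purely for demonstration):

```python
import polars as pl

# Two frames with the same input schema (one i64 column) but different lengths.
a = pl.DataFrame({"x": [1, 2]})
b = pl.DataFrame({"x": [1, 2, 3]})

# The transposed frame has one column per input row, so its schema cannot be
# derived from the input schema alone; a lazy planner would need the data itself.
print(pl.DataFrame({f"column_{i}": list(r) for i, r in enumerate(a.rows())}).columns)
# -> ['column_0', 'column_1']
print(pl.DataFrame({f"column_{i}": list(r) for i, r in enumerate(b.rows())}).columns)
# -> ['column_0', 'column_1', 'column_2']
```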

@alippai

alippai commented Aug 20, 2021

Makes sense, I really appreciate the implementation detail!

@jorgecarleitao
Collaborator

fwiw, spark supports it (lazily), but it is a shotgun shot to the foot, as it performs two queries, one of them to compute the distincts during planning.

@alippai

alippai commented Aug 20, 2021

Just to gauge the complexity of the task: an n x m, single-type (int/float) 2D specialization of the DF type would be needed for this, right?

@ritchie46
Member

> fwiw, spark supports it (lazily), but it is a shotgun shot to the foot, as it performs two queries, one of them to compute the distincts during planning.

Ouch.. that's definitely a shotgun.

In such cases, I'd rather have the user do something like this, and document why they'd want it:

temp = long_query().transpose().collect()

(temp.lazy()
 .select(..)   # continue from here.
 )
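
A slightly fleshed-out, runnable version of that pattern (the query and column names are hypothetical, and `transpose` stands for the eager method this issue asks for; the exact signature of what landed may differ):

```python
import polars as pl

# Hypothetical stand-in for a long lazy query.
long_query = pl.DataFrame({"col1": [1, 2], "col2": [3, 4]}).lazy().filter(pl.col("col1") > 0)

# Materialize, transpose eagerly, then continue lazily from the result.
temp = long_query.collect().transpose()

result = (
    temp.lazy()
    .select(pl.all())  # continue the rest of the query from here
    .collect()
)
print(result)
```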

@ritchie46
Member

> Just to gauge the complexity of the task: an n x m, single-type (int/float) 2D specialization of the DF type would be needed for this, right?

For max performance, I guess something like that. I am planning to use the AnyValue enums (these are enums over all possible dtypes). Then we can turn the DataFrame into rows, infer the schema per row, and read the rows back as columns.
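
At a Python level, the described strategy might look roughly like this sketch (the real implementation is in Rust over the AnyValue enum; the helper name is made up):

```python
import polars as pl

def transpose_sketch(df: pl.DataFrame) -> pl.DataFrame:
    # Rough Python-level sketch of the described strategy:
    # 1. turn the frame into rows of loosely typed values,
    # 2. treat each original row as one output column,
    # 3. let the Series constructor infer the dtype ("schema") per new column.
    rows = df.rows()
    return pl.DataFrame([pl.Series(f"column_{i}", list(row)) for i, row in enumerate(rows)])

df = pl.DataFrame({"col1": [1, 2], "col2": [3, 4]})
print(transpose_sketch(df))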

@ritchie46
Member

Added in 2fa53db
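
For completeness, the eager method added by that commit can be used roughly like this (the `include_header` keyword is assumed from the released API and may differ in older versions):

```python
import polars as pl

df = pl.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Plain transpose: rows become columns named column_0, column_1, ...
print(df.transpose())

# Optionally keep the original column names as a header column.
print(df.transpose(include_header=True))
```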
