Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rosetta stone for groupby between pandas and polars? #1083

Closed
davideger opened this issue Aug 2, 2021 · 3 comments
Closed

Rosetta stone for groupby between pandas and polars? #1083

davideger opened this issue Aug 2, 2021 · 3 comments

Comments

@davideger
Copy link

Are you using Python or Rust?

Python.

What version of polars are you using?

polars 0.8.16

What operating system are you using polars on?

manylinux x86_64 (google colab)

Describe your bug.

I'm trying to translate the following pandas code to polars:

commodity_prices = {
  'commodity': ["Wheat", "Wheat", "Wheat", "Wheat",      "Corn", "Corn", "Corn", "Corn", "Corn"],
  'location':  ["StPaul", "StPaul",  "StPaul", "Chicago", "Chicago","Chicago","Chicago","Chicago", "Chicago"],
  'seller':    ['Bob', 'Charlie', 'Susan', 'Paul',   'Ed', 'Mary', 'Paul', 'Charlie', 'Norman'],
  'price':     [1., .7, .8, .55,    2., 3., 2.4, 1.8, 2.1]  
}

# Find the two best offers for each commodity+location.
df = pd.DataFrame(commodity_prices)
df.sort_values(by='price').groupby(['location', 'commodity']).head(n=2)

Output:

commodity | location | seller | price
-- | -- | -- | -- | --
Wheat | Chicago | Paul | 0.55
Wheat | StPaul | Charlie | 0.70
Wheat | StPaul | Susan | 0.80
Corn | Chicago | Charlie | 1.80
Corn | Chicago | Ed | 2.00

When I try to write the same in polars, I get something different:

pdf = pl.DataFrame(commodity_prices)
pdf.groupby(['location', 'commodity']).agg([
                pl.col('*').sort_by('price').head(2)])
location commodity commodity location seller price
str str list list list list
"Chicago" "Corn" [Corn, Corn] [Chicago, Chicago] [Charlie, Ed] [1.8, 2]
"Chicago" "Wheat" [Wheat] [Chicago] [Paul] [0.55]
"StPaul" "Wheat" [Wheat, Wheat] [StPaul, StPaul] [Charlie, Susan] [0.7, 0.8]

polars seems to be calculating the data I want, but giving it back in a way that's not quite what I want. Other than post-processing the output in Python, is there a better way to get pandas-style output for this query?

@ritchie46
Copy link
Member

Hi, That's an interesting one. A snippet that produces the same output is this:

df = pl.DataFrame(commodity_prices)


(df.sort(by="price")
 .groupby(["commodity", "location"])
 .agg([
     col("seller").head(2).list().alias("seller"),  # take the first two and aggregate to list
     col("price").head(2).list().alias("price")   
 ])
 .explode(["price", "seller"])  # explode the lists to long format
 .sort(by="price")  # not really needed, but makes output predictable
)

However, I understand that a head(n) aggregation is far more ergonomic, I will see if I can add that.

@ritchie46
Copy link
Member

Ok, with the mergin of #1088, #1089, #1090 you can now do:

(df.sort(by='price')
    .groupby(['location', 'commodity'])
    .head(n=2)
)

which is syntactic sugar for:

keys = ["commodity", "location"]

(df.sort(by="price")
 .groupby(keys)
 .agg([
     col("*").exclude(keys).head(2).list().keep_name()   
 ])
 .explode(col("*").exclude(keys))
)

@davideger
Copy link
Author

very nice; thank you :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants