Rosetta stone for groupby between pandas and polars? #1083

davideger · 2021-08-02T17:34:33Z

Are you using Python or Rust?

Python.

What version of polars are you using?

polars 0.8.16

What operating system are you using polars on?

manylinux x86_64 (google colab)

Describe your bug.

I'm trying to translate the following pandas code to polars:

commodity_prices = {
  'commodity': ["Wheat", "Wheat", "Wheat", "Wheat",      "Corn", "Corn", "Corn", "Corn", "Corn"],
  'location':  ["StPaul", "StPaul",  "StPaul", "Chicago", "Chicago","Chicago","Chicago","Chicago", "Chicago"],
  'seller':    ['Bob', 'Charlie', 'Susan', 'Paul',   'Ed', 'Mary', 'Paul', 'Charlie', 'Norman'],
  'price':     [1., .7, .8, .55,    2., 3., 2.4, 1.8, 2.1]  
}

# Find the two best offers for each commodity+location.
df = pd.DataFrame(commodity_prices)
df.sort_values(by='price').groupby(['location', 'commodity']).head(n=2)

Output:

When I try to write the same in polars, I get something different:

pdf = pl.DataFrame(commodity_prices)
pdf.groupby(['location', 'commodity']).agg([
                pl.col('*').sort_by('price').head(2)])

location	commodity	commodity	location	seller	price
str	str	list	list	list	list
"Chicago"	"Corn"	[Corn, Corn]	[Chicago, Chicago]	[Charlie, Ed]	[1.8, 2]
"Chicago"	"Wheat"	[Wheat]	[Chicago]	[Paul]	[0.55]
"StPaul"	"Wheat"	[Wheat, Wheat]	[StPaul, StPaul]	[Charlie, Susan]	[0.7, 0.8]

polars seems to be calculating the data I want, but giving it back in a way that's not quite what I want. Other than post-processing the output in Python, is there a better way to get pandas-style output for this query?

The text was updated successfully, but these errors were encountered:

ritchie46 · 2021-08-02T18:11:08Z

Hi, That's an interesting one. A snippet that produces the same output is this:

df = pl.DataFrame(commodity_prices)


(df.sort(by="price")
 .groupby(["commodity", "location"])
 .agg([
     col("seller").head(2).list().alias("seller"),  # take the first two and aggregate to list
     col("price").head(2).list().alias("price")   
 ])
 .explode(["price", "seller"])  # explode the lists to long format
 .sort(by="price")  # not really needed, but makes output predictable
)

However, I understand that a head(n) aggregation is far more ergonomic, I will see if I can add that.

ritchie46 · 2021-08-03T09:55:17Z

Ok, with the mergin of #1088, #1089, #1090 you can now do:

(df.sort(by='price')
    .groupby(['location', 'commodity'])
    .head(n=2)
)

which is syntactic sugar for:

keys = ["commodity", "location"]

(df.sort(by="price")
 .groupby(keys)
 .agg([
     col("*").exclude(keys).head(2).list().keep_name()   
 ])
 .explode(col("*").exclude(keys))
)

davideger · 2021-08-03T17:51:27Z

very nice; thank you :-)

ritchie46 closed this as completed Aug 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rosetta stone for groupby between pandas and polars? #1083

Rosetta stone for groupby between pandas and polars? #1083

davideger commented Aug 2, 2021

ritchie46 commented Aug 2, 2021

ritchie46 commented Aug 3, 2021

davideger commented Aug 3, 2021

Rosetta stone for groupby between pandas and polars? #1083

Rosetta stone for groupby between pandas and polars? #1083

Comments

davideger commented Aug 2, 2021

Are you using Python or Rust?

What version of polars are you using?

What operating system are you using polars on?

Describe your bug.

ritchie46 commented Aug 2, 2021

ritchie46 commented Aug 3, 2021

davideger commented Aug 3, 2021