Count unique groups within a pandas data frame

Question

I have a data frame of patent numbers and the inventors who invented those patents. For example:

patent_number	inventor_id
1	A
1	B
2	B
2	C
3	A
3	B

I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.

What is the fastest way of counting the number of unique teams using Python?

I have written this code, but it is very slow when I run it on my entire data set which includes over 6 million patent numbers and 3.5 million unique inventor ids.

teams = []

for pat_id, pat_df in inventor_data.groupby("patent_number"):
    if list(pat_df["inventor_id"]) not in teams:
        teams.append(list(pat_df["inventor_id"]))

print("Number of teams ", len(teams))

I am looking for speed improvements. If you can also help me understand why your approach is faster, I am always keen to learn.

Thank you!

mozway · Accepted Answer · 2022-02-01 20:47:51Z

4

You can groupby and aggregate as frozenset and count the unique values:

df.groupby('patent_number')['inventor_id'].agg(frozenset).nunique()

Output: 2

Interestingly, you can also easily get the number of occurrences of each team with value_counts:

df.groupby('patent_number')['inventor_id'].agg(frozenset).value_counts()

Output:

(B, A)    2
(B, C)    1
Name: inventor_id, dtype: int64
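
On the "why is it faster" part of the question: the original loop tests membership with `in` on a list, which compares the new team against every stored team in turn, so the total work grows roughly quadratically with the number of patents. A frozenset is hashable, so distinct teams can be found with hash lookups instead, and it also ignores the order of the inventors. A minimal sketch on the sample data from the question (the names `df`, `teams` and `seen` are only illustrative):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

# One frozenset of inventors per patent; frozensets are hashable and unordered.
teams = df.groupby("patent_number")["inventor_id"].agg(frozenset)

# Count of distinct teams, computed by pandas.
n_teams = teams.nunique()

# Plain-Python view of what hashing buys: adding to a set is a hash lookup
# rather than a scan over every team seen so far.
seen = set()
for team in teams:
    seen.add(team)

print(n_teams, len(seen))
```

Both counts should agree, since `nunique` and the explicit set are deduplicating the same hashable objects.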

edited Feb 1, 2022 at 20:47

answered Feb 1, 2022 at 20:43


Grégoire · Accepted Answer · 2022-02-01 20:47:20Z

1

You could go for:

   inventor_data = inventor_data.sort_values("inventor_id")
   inventor_data.groupby("patent_number").inventor_id.sum().nunique()

A few explanations:

Sorting the values is needed so that inventor order does not matter: (A,B) and (B,A) are then treated as a single team.
Summing the strings "A" and "B" concatenates them into "AB", which represents the team (A, B).
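
One caveat worth checking on real data: if the ids have different lengths, plain concatenation can map two different teams to the same string. The sketch below uses hypothetical ids picked to trigger this, and joins with a separator (`"|"`, assumed not to occur in any id) as one possible way around it:

```python
import pandas as pd

# Hypothetical ids chosen so that plain concatenation collides.
df = pd.DataFrame({
    "patent_number": [1, 1, 2, 2],
    "inventor_id": ["A", "BC", "AB", "C"],
})

# Sort first so that each team is concatenated in a canonical order.
ordered = df.sort_values("inventor_id")
grouped = ordered.groupby("patent_number")["inventor_id"]

# Plain concatenation: both patents become "ABC".
n_concat = grouped.agg("".join).nunique()

# Joining with a separator assumed absent from the ids keeps them apart:
# "A|BC" versus "AB|C".
n_joined = grouped.agg("|".join).nunique()

print(n_concat, n_joined)
```

Here the plain concatenation reports one team while the separated version reports two. The frozenset approach does not have this problem, because it compares the ids themselves rather than a derived string.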

answered Feb 1, 2022 at 20:47


Thank you! I have timed your code and the answer given by @mozway above; yours is method 1 and theirs is method 2. Theirs appears fractionally faster, but the two give very slightly different results. Do you have any idea why? Number of teams: 3667014. Time elapsed to count teams, method 1 (@Grégoire): 0:00:43.931967. Number of teams: 3666748. Time elapsed to count teams, method 2 (@mozway): 0:00:38.515821.
– Joe Emmens
Commented Feb 1, 2022 at 21:17
@Joe Using a set makes the groups unordered, so no sorting is needed. Sorting is more computationally expensive (although for 2 values the cost should be minimal), which might explain the time difference. Regarding the results, can you give examples of the differences?
– mozway
Commented Feb 1, 2022 at 21:38
Indeed, I believe the frozenset approach from @mozway is cleaner. Do you have duplicates in your data? For example, consider 3 rows for a given patent, with inventor = ["A", "A", "B"]: approach #1 would give "AAB", while approach #2 would give {"A", "B"}. @JoeEmmens
– Grégoire
Commented Feb 1, 2022 at 22:19
I guess it doesn't really make sense to list one person twice, but you never know ;)
– mozway
Commented Feb 1, 2022 at 22:27
Btw, better use agg(''.join) rather than sum. Repeated string concatenation is very inefficient.
– mozway
Commented Feb 1, 2022 at 22:29
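
The duplicate-row explanation discussed in these comments can be reproduced on a small example. This is only a sketch with hypothetical data, not a diagnosis of the actual data set: inventor A is listed twice on patent 1, so the string-based count and the set-based count disagree until repeated (patent, inventor) rows are dropped.

```python
import pandas as pd

# Hypothetical data: inventor A appears twice on patent 1.
df = pd.DataFrame({
    "patent_number": [1, 1, 1, 2, 2],
    "inventor_id": ["A", "A", "B", "A", "B"],
})

# Set-based count: duplicates collapse inside each frozenset.
by_set = df.groupby("patent_number")["inventor_id"].agg(frozenset).nunique()

# String-based count: "AAB" and "AB" are treated as different teams.
ordered = df.sort_values("inventor_id")
by_join = ordered.groupby("patent_number")["inventor_id"].agg("".join).nunique()

# Dropping repeated (patent, inventor) rows first brings the two into line.
deduped = ordered.drop_duplicates(["patent_number", "inventor_id"])
by_join_dedup = (
    deduped.groupby("patent_number")["inventor_id"].agg("".join).nunique()
)

print(by_set, by_join, by_join_dedup)
```

A higher count from the string-based method, as in the timings reported above, would be consistent with this effect, though other causes are possible.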
