I have a data frame of patent numbers and the inventors who invented those patents. For example:

patent_number inventor_id
1 A
1 B
2 B
2 C
3 A
3 B

I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.

What is the fastest way to count the number of unique teams in Python?

I have written this code, but it is very slow when I run it on my entire data set which includes over 6 million patent numbers and 3.5 million unique inventor ids.

teams = []

for pat_id, pat_df in inventor_data.groupby("patent_number"):
    if list(pat_df["inventor_id"]) not in teams:
        teams.append(list(pat_df["inventor_id"]))

print("Number of teams ", len(teams))

I am looking for speed improvements. If you can also help me understand why they are faster, I am always keen to learn.

Thank you!

2 Answers

You can group by patent_number, aggregate each group's inventor_id values into a frozenset, and count the unique values:

df.groupby('patent_number')['inventor_id'].agg(frozenset).nunique()

Output: 2
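
On why this is faster than the loop in the question: `x not in teams` on a list compares `x` against every stored team, so the total work grows roughly quadratically with the number of teams. A list also compares in order, so ["A", "B"] and ["B", "A"] would count as two teams. A frozenset is hashable and ignores order, so it can be stored in a set, where a membership check takes roughly constant time on average. The following is a minimal pure-Python sketch of that idea, assuming the example data from the question; the helper name `count_unique_teams` is made up for illustration and is not part of pandas.

```python
import pandas as pd

# Example data from the question.
inventor_data = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

def count_unique_teams(df):
    """Count distinct inventor sets (hypothetical helper, for illustration)."""
    teams = set()
    # Iterating a SeriesGroupBy yields (patent_number, Series of inventor ids).
    for _, ids in df.groupby("patent_number")["inventor_id"]:
        # frozenset is hashable and unordered, so adding it to a set
        # deduplicates teams without scanning previously seen teams.
        teams.add(frozenset(ids))
    return len(teams)

n_teams = count_unique_teams(inventor_data)
print("Number of teams", n_teams)
```

The one-liner above does the same deduplication, but leaves the grouping and counting to pandas.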

Interestingly, you can also easily get the number of occurrences of each team with value_counts:

df.groupby('patent_number')['inventor_id'].agg(frozenset).value_counts()

Output:

(B, A)    2
(B, C)    1
Name: inventor_id, dtype: int64

You could go for:

   inventor_data = inventor_data.sort_values("inventor_id")
   inventor_data.groupby("patent_number").inventor_id.sum().nunique()

A few explanations:

  • Sorting the values is required so that the order of rows does not matter, i.e. (A,B) and (B,A) are treated as a single team.
  • You can sum the strings "A" and "B" to produce a string "AB" representing the team (A, B)
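
One caveat with plain concatenation, if the ids can have different lengths: "AB" + "C" and "A" + "BC" both produce "ABC", so two different teams can collapse into one. The sketch below illustrates this on made-up data and avoids it by joining with a separator. It assumes the character "|" never occurs in an inventor id, which would need to be checked against the real data.

```python
import pandas as pd

# Made-up data: two different teams whose ids concatenate to the same string.
df = pd.DataFrame({
    "patent_number": [1, 1, 2, 2],
    "inventor_id": ["AB", "C", "A", "BC"],
})

df = df.sort_values("inventor_id")
grouped = df.groupby("patent_number")["inventor_id"]

# Plain concatenation: both patents give "ABC", so the teams are merged.
n_concat = grouped.agg("".join).nunique()

# Joining with a separator keeps them apart: "AB|C" vs "A|BC".
# Assumption: "|" does not appear in any inventor id.
n_joined = grouped.agg("|".join).nunique()

print(n_concat, n_joined)
```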
  • Thank you! I have timed your code (method 1) and the answer given by @mozway above (method 2). Theirs appears fractionally faster, but the two give very slightly different results. Do you have any idea why? Method 1 (@Grégoire): number of teams 3667014, time elapsed 0:00:43.931967. Method 2 (@mozway): number of teams 3666748, time elapsed 0:00:38.515821.
    – Joe Emmens
    Commented Feb 1, 2022 at 21:17
  • @Joe Using a set makes the groups unordered, so no sorting is needed. Sorting is more computationally expensive (although for 2 values this should be quite minimal). That might be the cause of the time difference. Regarding the results, can you give examples of differences?
    – mozway
    Commented Feb 1, 2022 at 21:38
  • Indeed, I believe the frozenset approach from @mozway is cleaner. Do you have duplicates in your data? For example, if a given patent has 3 rows with inventor = ["A", "A", "B"], then approach #1 would give "AAB", while approach #2 would give {"A", "B"}. @JoeEmmens
    – Grégoire
    Commented Feb 1, 2022 at 22:19
  • I guess it doesn't really make sense to duplicate one person, but we never know ;)
    – mozway
    Commented Feb 1, 2022 at 22:27
  • Btw, better use agg(''.join) rather than sum. Repeated string concatenation is very inefficient.
    – mozway
    Commented Feb 1, 2022 at 22:29
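
The duplicate-row scenario raised in the comments can be reproduced on a small made-up frame. This is only a sketch of one possible explanation for the differing counts, not a diagnosis of the real data set. It also shows that dropping duplicate (patent_number, inventor_id) rows first makes the two approaches agree on this example.

```python
import pandas as pd

# Made-up data: patent 1 lists inventor "A" twice, patent 2 has the same team.
df = pd.DataFrame({
    "patent_number": [1, 1, 1, 2, 2],
    "inventor_id": ["A", "A", "B", "A", "B"],
})

by_patent = df.sort_values("inventor_id").groupby("patent_number")["inventor_id"]

# Approach #1 (string concatenation): "AAB" and "AB" are different strings.
n_method1 = by_patent.agg("".join).nunique()

# Approach #2 (frozenset): both patents collapse to {"A", "B"}.
n_method2 = by_patent.agg(frozenset).nunique()

# Removing duplicate rows first brings approach #1 in line with approach #2.
deduped = df.drop_duplicates(["patent_number", "inventor_id"])
n_method1_deduped = (
    deduped.sort_values("inventor_id")
    .groupby("patent_number")["inventor_id"]
    .agg("".join)
    .nunique()
)

print(n_method1, n_method2, n_method1_deduped)
```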
