I have a very large DataFrame containing statistics for various firms for the years 1950 to 2020. I am trying to split the data first by year and then by 4-digit industry code. Both 'year' and 'industry_code' are columns of the DataFrame. I created a dictionary to hold the data for each year, but I am stuck on splitting each year's data by industry, because every column of the original DataFrame ends up in the value part of the dictionary. Here is my starting code:

import pandas as pd

df = pd.read_csv('xyz')

dictio = {}
for year in df['year'].unique():
    dictio[year] = df[ df['year'] == year ]

Could someone help me figure out a groupby / loc / if statement or some other approach to complete the split by year and by industry? Thank you!

2 Answers

The pandas .groupby() method lets you group and aggregate data by multiple keys (e.g., year and industry).

import random
import pandas as pd

# create data frame
random.seed(1234)
data = {
    'year': [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019],
    'industry': [111, 222, 333, 111, 222, 333, 111, 222, 333, ],
    'x': [random.randint(1000, 1100) for _ in range(9)],
    'y': [random.randint(5000, 6000) for _ in range(9)], }
df = pd.DataFrame(data)

# aggregate with `.groupby()`
gdf = df.groupby(['year', 'industry']).sum()
print(gdf)

                  x     y
year industry            
2017 111       1099  5085
     222       1056  5100
     333       1014  5784
2018 111       1000  5363
     222       1011  5242
     333       1074  5017
2019 111       1004  5031
     222       1085  5807
     333       1088  5016

The .groupby() aggregates on two keys, so the resulting data frame has a MultiIndex. You can index like this:

# perform lookup
print(gdf.loc[(2018, 222), :])

x    1011
y    5242
Name: (2018, 222), dtype: int64
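If the goal is to pull out the original rows for one year/industry pair rather than an aggregate, get_group() on the grouped object is one option. This is a minimal sketch on a small made-up frame; the column values and the name `subset` are illustrative assumptions, not part of the data above.

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
df = pd.DataFrame({
    'year': [2017, 2017, 2018, 2018],
    'industry': [111, 222, 111, 222],
    'x': [1, 2, 3, 4],
})

# Group without aggregating, then fetch the rows for one (year, industry) pair
grouped = df.groupby(['year', 'industry'])
subset = grouped.get_group((2018, 222))
print(subset)
```

Unlike the .sum() result, `subset` keeps every original column and row for that pair, so it can be used for further per-group analysis.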

EDIT: To combine several industry codes into one industry group, add a level of aggregation via the function to_industry_group().

def to_industry_group(industry):
    # Map an industry code to its group: 111-222 -> 'A', 223-399 -> 'B', anything else -> 'Z'
    if 111 <= industry < 223:
        return 'A'
    elif 223 <= industry < 400:
        return 'B'
    else:
        return 'Z'

df = pd.DataFrame(data)
df['industry_group'] = df['industry'].map(to_industry_group)
fields = ['year', 'industry_group', 'industry']
#                  ^ new grouping level
gdf = df.groupby(fields).sum()
print(gdf)
                                 x     y
year industry_group industry            
2017 A              111       1099  5085
                    222       1056  5100
     B              333       1014  5784
2018 A              111       1000  5363
                    222       1011  5242
     B              333       1074  5017
2019 A              111       1004  5031
                    222       1085  5807
     B              333       1088  5016
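As an alternative to a hand-written mapping function, pd.cut() can assign codes to groups from a list of bin edges. The sketch below assumes the same boundaries as to_industry_group() (111 up to 222 is 'A', 223 up to 399 is 'B'); the sample values are made up. One difference to keep in mind: codes outside the bins become NaN rather than 'Z'.

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
df = pd.DataFrame({
    'year': [2017, 2017, 2017],
    'industry': [111, 222, 333],
    'x': [10, 20, 30],
})

# right=False makes each bin closed on the left and open on the right:
# [111, 223) -> 'A', [223, 400) -> 'B'
df['industry_group'] = pd.cut(
    df['industry'], bins=[111, 223, 400], labels=['A', 'B'], right=False
)

# pd.cut returns a categorical column; observed=True keeps only groups that occur
gdf = df.groupby(['year', 'industry_group'], observed=True)['x'].sum()
print(gdf)
```

Changing the group boundaries then only means editing the `bins` and `labels` lists.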
  • How would this differ if I would be working with grouped industry codes? For example if my Industry #1 includes industry codes 111 to 222 (can take any value between those two numbers) ? Commented Feb 4, 2022 at 20:26
  • I created a function to assign multiple industry codes to a single industry group. You could also use pd.cut() or join to a series that links industry codes to industry groups. Then I grouped by three fields, instead of two fields. You can do look-ups like this: gdf.loc[(2017, 'A'), :]
    – jsmart
    Commented Feb 5, 2022 at 22:39
  • Thank you so much it worked!! Commented Feb 8, 2022 at 22:17
  • @Marie-PierSt-Vincent please accept this answer, in order to recognize the author and mark the question as answered.
    – guardian
    Commented Apr 9, 2022 at 21:16

Try a nested dict comprehension with groupby:

dct = {
    year: {code: sub_df for code, sub_df in year_df.groupby('industry_code')}
    for year, year_df in df.groupby('year')
}

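Now try accessing one. The outer key is a year and the inner key is an industry code (not a firm name); 1234 here is a placeholder code:

industry_year_df = dct[1994][1234]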
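A flat dictionary keyed by (year, industry_code) tuples is another way to hold the same split, since grouping by a list of two columns yields tuple keys. This is a sketch on a made-up frame; the `sales` column and the name `by_year_industry` are assumptions for illustration.

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
df = pd.DataFrame({
    'year': [1994, 1994, 1995],
    'industry_code': [1234, 5678, 1234],
    'sales': [10.0, 20.0, 30.0],
})

# Grouping by two columns yields (year, industry_code) tuples as keys
by_year_industry = {
    key: sub_df for key, sub_df in df.groupby(['year', 'industry_code'])
}

# Look up one sub-frame with a single tuple key
subset = by_year_industry[(1994, 1234)]
print(subset)
```

The flat form avoids a second level of nesting, at the cost of not being able to list all industries for a year with a single lookup.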
