Gather data by year and also by industry

Question

I have a very large DataFrame containing statistics for various firms for the years 1950 to 2020. I have been trying to split the data first by year and then by industry code (4 digits). Both 'year' and 'industry_code' are columns of the DataFrame. I created a dictionary to hold the data by year, but I am stuck on splitting each key further by industry, since all the columns of my initial DataFrame end up in the 'value' part of the dictionary. Here is my starting code:

import pandas as pd

df = pd.read_csv('xyz')

dictio = {}
for year in df['year'].unique():
    dictio[year] = df[ df['year'] == year ]

Could someone help me figure out a groupby / loc / if statement or other in order to complete the sampling by year and by industry? Thank you!

jsmart · Accepted Answer · 2022-02-05 22:37:05Z

The pandas .groupby() method lets you group and aggregate data on several keys at once (e.g., year and industry).

import random
import pandas as pd

# create data frame
random.seed(1234)
data = {
    'year': [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019],
    'industry': [111, 222, 333, 111, 222, 333, 111, 222, 333],
    'x': [random.randint(1000, 1100) for _ in range(9)],
    'y': [random.randint(5000, 6000) for _ in range(9)],
}
df = pd.DataFrame(data)

# aggregate with `.groupby()`
gdf = df.groupby(['year', 'industry']).sum()
print(gdf)

                  x     y
year industry            
2017 111       1099  5085
     222       1056  5100
     333       1014  5784
2018 111       1000  5363
     222       1011  5242
     333       1074  5017
2019 111       1004  5031
     222       1085  5807
     333       1088  5016

Because .groupby() aggregates on two keys, the resulting data frame has a MultiIndex. You can look up a single row by passing a (year, industry) tuple to .loc:

# perform lookup
print(gdf.loc[(2018, 222), :])

x    1011
y    5242
Name: (2018, 222), dtype: int64
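The tuple lookup above returns one row. As a minimal sketch of two related lookups on the same kind of MultiIndex, the snippet below selects every industry for one year and one industry across all years. The small frame and its values are made up for illustration and are not the random data from the answer.

```python
import pandas as pd

# Small deterministic frame (illustrative values only)
df = pd.DataFrame({
    'year': [2017, 2017, 2018, 2018],
    'industry': [111, 222, 111, 222],
    'x': [1, 2, 3, 4],
})
gdf = df.groupby(['year', 'industry']).sum()

# All industries for one year: index by the first level only
year_2018 = gdf.loc[2018]

# One industry across all years: cross-section on the second level
industry_222 = gdf.xs(222, level='industry')

print(year_2018)
print(industry_222)
```

Here .loc with a single label slices on the outer level, while .xs() is needed to slice on the inner level without spelling out every year.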

EDIT: To add a level of aggregation, map each industry code to an industry group with a function such as to_industry_group().

def to_industry_group(industry):
    # Map an industry code to a coarser industry group label
    if 111 <= industry < 223:
        return 'A'
    if 223 <= industry < 400:
        return 'B'
    return 'Z'

df = pd.DataFrame(data)
df['industry_group'] = df['industry'].map(to_industry_group)
fields = ['year', 'industry_group', 'industry']
#                  ^ new grouping level
gdf = df.groupby(fields).sum()
print(gdf)

                                 x     y
year industry_group industry            
2017 A              111       1099  5085
                    222       1056  5100
     B              333       1014  5784
2018 A              111       1000  5363
                    222       1011  5242
     B              333       1074  5017
2019 A              111       1004  5031
                    222       1085  5807
     B              333       1088  5016

How would this differ if I would be working with grouped industry codes? For example if my Industry #1 includes industry codes 111 to 222 (can take any value between those two numbers) ? — Marie-Pier St-Vincent, Commented Feb 4, 2022 at 20:26
I created a function to assign multiple industry codes to a single industry group. You could also use pd.cut() or join to a series that links industry codes to industry groups. Then I grouped by three fields, instead of two fields. You can do look-ups like this: gdf.loc[(2017, 'A'), :] — jsmart, Commented Feb 5, 2022 at 22:39
@Marie-PierSt-Vincent please accept this answer, in order to recognize the author and mark the question as answered. — guardian, Commented Apr 9, 2022 at 21:16
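jsmart's comment mentions pd.cut() as an alternative to the mapping function. A rough sketch of that idea follows, assuming the same boundaries as to_industry_group() (111 up to but not including 223 is 'A', 223 up to but not including 400 is 'B', anything else is 'Z'). The sample rows, including the out-of-range code 999, are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2017, 2017, 2017, 2017],
    'industry': [111, 222, 333, 999],
    'x': [10, 20, 30, 40],
})

# right=False makes each bin closed on the left: [111, 223) and [223, 400)
groups = pd.cut(df['industry'], bins=[111, 223, 400],
                labels=['A', 'B'], right=False)

# Codes outside every bin come back as NaN; label them 'Z' as a fallback
df['industry_group'] = groups.cat.add_categories('Z').fillna('Z')

# observed=True keeps only group combinations that actually occur
gdf = df.groupby(['year', 'industry_group', 'industry'], observed=True).sum()
print(gdf)
```

Compared with a Python function passed to .map(), pd.cut() is vectorized, which may matter on a very large frame.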

user17242583 · Accepted Answer · 2022-02-04 17:53:14Z

Try a nested dict comprehension with groupby:

dct = {key1: {key2: df2 for key2, df2 in df1.groupby('industry_code')} for key1, df1 in df.groupby('year')}

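Now access one sub-DataFrame by year and then by industry code (the inner keys are industry codes, not firm names; 1234 is a placeholder):

industry_year_df = dct[1994][1234]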
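To make the nested dictionary concrete, here is a small self-contained sketch. The column name 'sales' and all values are hypothetical. It also shows a flat variant keyed by (year, industry_code) tuples, which is an alternative layout and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1994, 1994, 1994, 1995],
    'industry_code': [1234, 1234, 5678, 1234],
    'sales': [10, 20, 30, 40],  # hypothetical statistic
})

# Nested dict: year -> industry_code -> sub-DataFrame
dct = {
    year: {code: sub for code, sub in year_df.groupby('industry_code')}
    for year, year_df in df.groupby('year')
}
print(dct[1994][1234])

# Flat alternative: one dict keyed by (year, industry_code) tuples
flat = {key: sub for key, sub in df.groupby(['year', 'industry_code'])}
print(flat[(1994, 5678)])
```

The nested form makes it easy to loop over all industries within a year, while the flat form avoids a second level of lookup.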
