I have a problem in Python working with a pandas DataFrame. I'm trying to build a machine learning model that predicts the surface. I have the surface column in the train DataFrame but not in the test DataFrame, so I would like to create some features based on the surface in train, like this:

train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean')  - train.surface.mean())

Here I have filled in the values for each group of the "cat1" feature using the mean of surface. Cool.

Now I must add it to test too, so I use this method to map the value of each train group onto the test rows:

mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)

So far there is no problem. Now I would like to use two columns in the groupby:

train['error_cat1_cat2'] = abs(train.groupby(['cat1', 'cat2'])['surface'].transform('mean') - train.surface.mean())

But I don't know how to map it to the test DataFrame. Could you please help me handle this problem, or suggest some other method to do it?


For example, my train is:

| Cat1 | Cat2 | surface |
| 1    | 3    | 10    |
| 2    | 2    | 12    |
| 3    | 1    | 12    |
| 1    | 3    | 5     |
| 2    | 2    | 10    |
| 3    | 2    | 13    |

My test is:

| Cat1 | Cat2 |
| 1    | 2    |
| 2    | 1    |
| 3    | 1    |
| 1    | 3    |
| 2    | 3    |
| 3    | 1    |

Now I would do a groupby mean of surface on cat1 and cat2. For example, the mean surface for (cat1, cat2) = (1, 3) is (10 + 5) / 2 = 7.5.

Then I must go to test and map this value onto the rows where (cat1, cat2) = (1, 3).

I hope that you have understood me.

  • You could create simple code with example data so everyone could run it and create a solution.
    – furas
    Commented Dec 20, 2017 at 20:53

1 Answer


You can use

  • groupby().mean() to calculate the means
  • reset_index() to convert the index levels Cat1, Cat2 into columns again
  • merge(how='left') to join the two dataframes like tables in a database (LEFT JOIN in SQL).


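import pandas as pd

headers = ['Cat1', 'Cat2', 'surface']

train_data = [
    [1, 3, 10],
    [2, 2, 12],
    [3, 1, 12],
    [1, 3, 5],
    [2, 2, 10],
    [3, 2, 13],
]

test_data = [
    [1, 2],
    [2, 1],
    [3, 1],
    [1, 3],
    [2, 3],
    [3, 1],
]

train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])

print('--- train ---')
print(train)

print('--- test ---')
print(test)

print('--- means ---')
# mean surface for every (Cat1, Cat2) pair; the pair becomes a MultiIndex
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)

print('--- means (dataframe) ---')
# turn the index levels back into ordinary columns
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)

print('--- result ----')
# LEFT JOIN: keep every test row, attach the train mean where the pair exists
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)

print('--- result (fillna)---')
# pairs never seen in train come back as NaN; here they are replaced with 0
result = result.fillna(0)
print(result)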


--- train ---
   Cat1  Cat2  surface
0     1     3       10
1     2     2       12
2     3     1       12
3     1     3        5
4     2     2       10
5     3     2       13
--- test ---
   Cat1  Cat2
0     1     2
1     2     1
2     3     1
3     1     3
4     2     3
5     3     1
--- means ---
           surface
Cat1 Cat2         
1    3         7.5
2    2        11.0
3    1        12.0
     2        13.0
--- means (dataframe) ---
   Cat1  Cat2  surface
0     1     3      7.5
1     2     2     11.0
2     3     1     12.0
3     3     2     13.0
--- result ----
   Cat1  Cat2  surface
0     1     2      NaN
1     2     1      NaN
2     3     1     12.0
3     1     3      7.5
4     2     3      NaN
5     3     1     12.0
--- result (fillna)---
   Cat1  Cat2  surface
0     1     2      0.0
1     2     1      0.0
2     3     1     12.0
3     1     3      7.5
4     2     3      0.0
5     3     1     12.0
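
The same groupby / reset_index / merge steps can probably be applied directly to the feature from the question, the absolute difference between each group's mean surface and the overall mean. The following is only a sketch under a few assumptions: it uses the lowercase column names cat1 and cat2 from the question's code, the helper names keys, global_mean and error_table are made up for illustration, and filling unseen pairs with 0 (meaning "no deviation from the overall mean") is one possible choice, not something the question specifies.

```python
import pandas as pd

# Example data from the question, with the lowercase names used in its code.
train = pd.DataFrame({
    'cat1': [1, 2, 3, 1, 2, 3],
    'cat2': [3, 2, 1, 3, 2, 2],
    'surface': [10, 12, 12, 5, 10, 13],
})
test = pd.DataFrame({
    'cat1': [1, 2, 3, 1, 2, 3],
    'cat2': [2, 1, 1, 3, 3, 1],
})

keys = ['cat1', 'cat2']                 # hypothetical helper name
global_mean = train['surface'].mean()   # overall mean, computed on train only

# One row per (cat1, cat2) pair seen in train: |group mean - overall mean|.
error_table = (
    train.groupby(keys)['surface'].mean()
    .sub(global_mean)
    .abs()
    .rename('error_cat1_cat2')
    .reset_index()
)

# The same lookup table is joined onto both frames, so train and test
# get identical values for identical (cat1, cat2) pairs.
train = train.merge(error_table, on=keys, how='left')
test = test.merge(error_table, on=keys, how='left')

# Assumption: pairs that never occur in train are treated as having
# no deviation from the overall mean.
test['error_cat1_cat2'] = test['error_cat1_cat2'].fillna(0)

print(test)
```

Because the table is built from train alone, nothing from test leaks into the feature. The list in keys can hold one column or several, so the single-column case from the question should work the same way.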
