
I have a pandas dataframe which looks like the following:

          site   pay        delta  over  under
phase        a     a     b
ID
D01     London  12.3  10.3   -2.0   0.0   -2.0
D02    Bristol   7.3  13.2    5.9   5.9    0.0
D03    Bristol  17.3  19.2    1.9   1.9    0.0

I'd like to flatten the column MultiIndex so that the columns are

ID      site     a     b  delta  over  under
D01   London  12.3  10.3   -2.0   0.0   -2.0
D02  Bristol   7.3  13.2    5.9   5.9    0.0
D03  Bristol  17.3  19.2    1.9   1.9    0.0

I'm struggling with the online documentation and tutorials to work out how to do this.

I'd welcome advice, ideally to do this in a robust way which doesn't hardcode column positions.


UPDATE: the output of df.to_dict('tight') is

{'index': ['D01', 'D02', 'D03'],
 'columns': [('site', 'a'),
  ('pay', 'a'),
  ('pay', 'b'),
  ('delta', ''),
  ('over', ''),
  ('under', '')],
 'data': [['London', 12.3, 10.3, -2.0, 0.0, -2.0],
  ['Bristol', 7.3, 13.2, 5.8999999999999995, 5.8999999999999995, 0.0],
  ['Bristol', 17.3, 19.2, 1.8999999999999986, 1.8999999999999986, 0.0]],
 'index_names': ['ID'],
 'column_names': [None, 'phase']}
  • please provide a reproducible format of your input (for example the output of df.to_dict('tight'))
    – mozway
    Commented Nov 28 at 15:26
  • @mozway I have now done this.
    – Penelope
    Commented Nov 28 at 15:29
  • Thanks, now what is the logic? Why do you keep site and not pay for example?
    – mozway
    Commented Nov 28 at 15:30

2 Answers

2

As the other answer notes, the requirement is not entirely clear, so I have only flattened the column MultiIndex by joining its levels. Let us know if you need something different.

import pandas as pd

data = {
    'index': ['D01', 'D02', 'D03'],
    'columns': [('site', 'a'),
                ('pay', 'a'),
                ('pay', 'b'),
                ('delta', ''),
                ('over', ''),
                ('under', '')],
    'data': [['London', 12.3, 10.3, -2.0, 0.0, -2.0],
             ['Bristol', 7.3, 13.2, 5.9, 5.9, 0.0],
             ['Bristol', 17.3, 19.2, 1.9, 1.9, 0.0]],
    'index_names': ['ID'],
    'column_names': [None, 'phase']
}

# Construct the MultiIndex columns and DataFrame
multiindex_columns = pd.MultiIndex.from_tuples(data['columns'], names=data['column_names'])
df = pd.DataFrame(data['data'], index=data['index'], columns=multiindex_columns)


# Flattening the MultiIndex columns
df.columns = [' '.join(filter(None, col)) for col in df.columns]

print(df)

Flattened DataFrame:

      site a  pay a  pay b  delta  over  under
D01   London  12.30  10.30  -2.00  0.00  -2.00
D02  Bristol   7.30  13.20   5.90  5.90   0.00
D03  Bristol  17.30  19.20   1.90  1.90   0.00
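
If it helps, the frame can probably also be rebuilt straight from the 'tight' dict with pd.DataFrame.from_dict(..., orient='tight') (available in pandas 1.4 and later, as far as I know), which also restores the ID index name. A minimal sketch; the names tight and flat and the '_' separator are my own choices, not something from the question:

```python
import pandas as pd

# Same 'tight' dict as in the question (values rounded as above)
tight = {
    'index': ['D01', 'D02', 'D03'],
    'columns': [('site', 'a'), ('pay', 'a'), ('pay', 'b'),
                ('delta', ''), ('over', ''), ('under', '')],
    'data': [['London', 12.3, 10.3, -2.0, 0.0, -2.0],
             ['Bristol', 7.3, 13.2, 5.9, 5.9, 0.0],
             ['Bristol', 17.3, 19.2, 1.9, 1.9, 0.0]],
    'index_names': ['ID'],
    'column_names': [None, 'phase'],
}

# Rebuild the DataFrame, including the column MultiIndex and index name
df = pd.DataFrame.from_dict(tight, orient='tight')

# Join the non-empty level values of each column tuple with '_'
flat = df.copy()
flat.columns = ['_'.join(filter(None, col)) for col in df.columns]

print(flat.columns.tolist())
```

Joining with a separator keeps every label unique (pay_a and pay_b stay distinct from site_a), at the cost of longer names than the ones asked for in the question.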
2

The expected logic is unclear, but assuming you want to keep the top level if the value is not "pay":

import numpy as np

# get column levels
lvl0 = df.columns.get_level_values(0)
lvl1 = df.columns.get_level_values('phase')

# keep lvl0 if the value is not pay
df.columns = np.where(lvl0 != 'pay', lvl0, lvl1)

# optionally reset the index
df.reset_index(inplace=True)

Another example, assuming you want to keep the top level if its values are not duplicated or if the second level is empty:

# get column levels
lvl0 = df.columns.get_level_values(0)
lvl1 = df.columns.get_level_values('phase')

# is the second level ''?
m1 = lvl1==''
# is the first level not duplicated
m2 = ~lvl0.duplicated(keep=False)

# keep lvl0 if either m1/m2 is True, else lvl1
df.columns = np.where(m1|m2, lvl0, lvl1)

df.reset_index(inplace=True)

Output:

    ID     site     a     b  delta  over  under
0  D01   London 12.30 10.30  -2.00  0.00  -2.00
1  D02  Bristol  7.30 13.20   5.90  5.90   0.00
2  D03  Bristol 17.30 19.20   1.90  1.90   0.00
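
For convenience, here is the second variant as a self-contained sketch that can be run from scratch. The helper name flatten_columns and its level_name parameter are hypothetical, introduced only for this example, and the data is the rounded sample from the question:

```python
import numpy as np
import pandas as pd


def flatten_columns(df, level_name='phase'):
    """Keep the top level unless it is duplicated and the lower level is non-empty."""
    lvl0 = df.columns.get_level_values(0)
    lvl1 = df.columns.get_level_values(level_name)

    # True where the lower level is empty or the top-level label is unique
    keep_top = np.asarray((lvl1 == '') | ~lvl0.duplicated(keep=False), dtype=bool)

    out = df.copy()
    out.columns = np.where(keep_top, lvl0, lvl1)
    # Move the ID index into a regular column
    return out.reset_index()


columns = pd.MultiIndex.from_tuples(
    [('site', 'a'), ('pay', 'a'), ('pay', 'b'),
     ('delta', ''), ('over', ''), ('under', '')],
    names=[None, 'phase'],
)
df = pd.DataFrame(
    [['London', 12.3, 10.3, -2.0, 0.0, -2.0],
     ['Bristol', 7.3, 13.2, 5.9, 5.9, 0.0],
     ['Bristol', 17.3, 19.2, 1.9, 1.9, 0.0]],
    index=pd.Index(['D01', 'D02', 'D03'], name='ID'),
    columns=columns,
)

result = flatten_columns(df)
print(result.columns.tolist())
```

Nothing is hardcoded by position or by the name 'pay' here, but note that if two different top-level groups shared a lower-level label (say both had an 'a'), the flattened names would collide, so it may be worth checking result.columns.is_unique afterwards.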
