Generating combinations in pandas dataframe

Question

I have a dataset with the columns ["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER']. All values in ["Uni", 'Region', "Profession"] are filled, while ["Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'] are always NA.

For each column with NAs there are several possible values:

Level_Edu = ['undergrad', 'grad', 'PhD']
Financial_Base = ['personal', 'grant']
Learning_Time = ["morning", "day", "evening"]
GENDER = ['Male', 'Female']

I want to generate all possible combinations of ["Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'] for each observation in the initial data, so that each initial observation is represented by 36 new observations (N1 * N2 * N3 * N4 = 3 * 2 * 3 * 2 = 36, where Ni is the number of possible values for the i-th column).
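As a quick sanity check of that count, here is a minimal sketch that multiplies the list lengths and compares the result with the number of tuples produced by itertools.product. The dictionary name `options` is only illustrative and does not appear elsewhere in this post.

```python
from itertools import product
from math import prod

# Possible values per column, as listed above.
# The name `options` is an illustrative assumption.
options = {
    "Level_Edu": ["undergrad", "grad", "PhD"],
    "Financial_Base": ["personal", "grant"],
    "Learning_Time": ["morning", "day", "evening"],
    "GENDER": ["Male", "Female"],
}

# N1 * N2 * N3 * N4
expected_count = prod(len(values) for values in options.values())

# Enumerate the combinations explicitly to confirm the formula.
combos = list(product(*options.values()))

print(expected_count, len(combos))
```

Both numbers should agree, which confirms that the product of the list lengths is the number of new rows per initial observation.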

Here is Python code that recreates two initial observations and an approximation of the result I want (showing 3 of the 36 combinations for each initial observation).

import pandas as pd
import numpy as np
sample_data_as_is = pd.DataFrame([["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])

sample_data_to_be = pd.DataFrame([["X1", "Y1", "Z1", "undergrad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "personal", "morning", 'Male'],
                                  ["X1", "Y1", "Z1", "grad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "grad", "personal", "morning", 'Male'],
                                  ["X1", "Y1", "Z1", "undergrad", "grant", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "grant", "morning", 'Male']], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])

Does this answer your question? Generate all combinations from multiple lists in python — Arthur, Commented Apr 24 at 8:56
To be clear, there should be 72 rows in the output since you have 2 rows in sample_data_as_is? — mozway, Commented Apr 24 at 8:57

mozway · Accepted Answer · 2024-04-24 08:56:01Z

You can combine itertools.product and a cross-merge:

from itertools import product

data = {'Level_Edu': ['undergrad', 'grad', 'PhD'],
        'Financial_Base': ['personal', 'grant'],
        'Learning_Time': ['morning', 'day', 'evening'],
        'GENDER': ['Male', 'Female']}

out = (sample_data_as_is[['Uni', 'Region', 'Profession']]
       .merge(pd.DataFrame(product(*data.values()), columns=data.keys()), how='cross')
      )

Output:

   Uni Region Profession  Level_Edu Financial_Base Learning_Time  GENDER
0   X1     Y1         Z1  undergrad       personal       morning    Male
1   X1     Y1         Z1  undergrad       personal       morning  Female
2   X1     Y1         Z1  undergrad       personal           day    Male
3   X1     Y1         Z1  undergrad       personal           day  Female
4   X1     Y1         Z1  undergrad       personal       evening    Male
..  ..    ...        ...        ...            ...           ...     ...
67  X2     Y2         Z2        PhD          grant       morning  Female
68  X2     Y2         Z2        PhD          grant           day    Male
69  X2     Y2         Z2        PhD          grant           day  Female
70  X2     Y2         Z2        PhD          grant       evening    Male
71  X2     Y2         Z2        PhD          grant       evening  Female

[72 rows x 7 columns]
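To see what the cross-merge does in isolation, here is a minimal sketch on two tiny frames. The frames `left` and `right` and their column names are made up for illustration; `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

# Two toy frames; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "b"]})
right = pd.DataFrame({"value": [1, 2, 3]})

# how="cross" pairs every row of `left` with every row of `right`.
# No join key is used, so the result has len(left) * len(right) rows.
crossed = left.merge(right, how="cross")

print(crossed)
```

The left frame's row order is preserved, which is why the first output above lists all 36 combinations for X1 before any row for X2.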

If you want the specific order of rows/columns from your expected output:

cols = ['Uni', 'Region', 'Profession']
out = (pd.DataFrame(product(*data.values()), columns=data.keys())
         .merge(sample_data_as_is[cols], how='cross')
         [cols+list(data)]
      )

Output:

   Uni Region Profession  Level_Edu Financial_Base Learning_Time  GENDER
0   X1     Y1         Z1  undergrad       personal       morning    Male
1   X2     Y2         Z2  undergrad       personal       morning    Male
2   X1     Y1         Z1  undergrad       personal       morning  Female
3   X2     Y2         Z2  undergrad       personal       morning  Female
4   X1     Y1         Z1  undergrad       personal           day    Male
..  ..    ...        ...        ...            ...           ...     ...
67  X2     Y2         Z2        PhD          grant           day  Female
68  X1     Y1         Z1        PhD          grant       evening    Male
69  X2     Y2         Z2        PhD          grant       evening    Male
70  X1     Y1         Z1        PhD          grant       evening  Female
71  X2     Y2         Z2        PhD          grant       evening  Female

[72 rows x 7 columns]
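If you need this more than once, the same idea can be wrapped in a small function. This is only a sketch: `expand_with_combinations` is a hypothetical helper name, not a pandas API, and it assumes every key of `options` is a column you want to overwrite.

```python
from itertools import product

import numpy as np
import pandas as pd


def expand_with_combinations(df, options):
    """Cross-join the non-option columns of `df` with every combination
    of the values in `options` (hypothetical helper, not part of pandas)."""
    # Columns that are already filled and should be kept as-is.
    keep = [c for c in df.columns if c not in options]
    # One row per combination of the possible values.
    grid = pd.DataFrame(list(product(*options.values())), columns=list(options))
    # Every kept row is paired with every combination.
    return df[keep].merge(grid, how="cross")


columns = ["Uni", "Region", "Profession",
           "Level_Edu", "Financial_Base", "Learning_Time", "GENDER"]
sample = pd.DataFrame(
    [["X1", "Y1", "Z1"] + [np.nan] * 4,
     ["X2", "Y2", "Z2"] + [np.nan] * 4],
    columns=columns,
)
options = {
    "Level_Edu": ["undergrad", "grad", "PhD"],
    "Financial_Base": ["personal", "grant"],
    "Learning_Time": ["morning", "day", "evening"],
    "GENDER": ["Male", "Female"],
}

result = expand_with_combinations(sample, options)
print(result.shape)
```

The row order matches the first variant (all combinations for one observation, then the next); swap the two sides of the merge if you need the interleaved order.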

Soudipta Dutta · Accepted Answer · 2024-06-03 06:21:27Z

I think this is the most efficient method for huge and complex datasets. You can use ParameterGrid from sklearn.model_selection:

import pandas as pd
import numpy as np
from sklearn.model_selection import ParameterGrid
    
df = pd.DataFrame([
        ["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], 
        ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]
    ], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
    
# Possible values for each column with NAs
level_edu = ['undergrad', 'grad', 'PhD']
financial_base = ['personal', 'grant']
learning_time = ["morning", "day", "evening"]
gender = ['Male', 'Female']

level_edu      = pd.Series(level_edu, dtype='category')
financial_base = pd.Series(financial_base, dtype='category')
learning_time  = pd.Series(learning_time, dtype='category')
gender         = pd.Series(gender, dtype='category')
    
# Factorize possible values
level_edu_codes, level_edu_uniques           = pd.factorize(level_edu)
financial_base_codes, financial_base_uniques = pd.factorize(financial_base)
learning_time_codes, learning_time_uniques   = pd.factorize(learning_time)
gender_codes, gender_uniques                 = pd.factorize(gender)
    
    
# Create a dictionary for parameter grid
param_grid = {
        'Level_Edu': level_edu_codes,
        'Financial_Base': financial_base_codes,
        'Learning_Time': learning_time_codes,
        'GENDER': gender_codes
}
    
# Generate all combinations of factorized values using ParameterGrid
combinations = list(ParameterGrid(param_grid))
    
# Convert combinations to a numpy array of lists of values
combinations_array = np.array([[d['Level_Edu'], d['Financial_Base'],
                                     d['Learning_Time'], d['GENDER']] for d in combinations])
# Repeat the original data for each combination.
# np.repeat with axis=0 repeats each row len(combinations_array) times consecutively.
repeated_data = np.repeat(df[['Uni', 'Region', 'Profession']].values,
                           len(combinations_array), axis=0)
# Tile the combinations to align with repeated_data.
# reps=(len(df), 1) stacks the whole array len(df) times vertically and once horizontally.
tiled_combinations = np.tile(combinations_array, (len(df), 1))
    
# Map factorized values back to original values
mapped_combinations = np.column_stack([
        level_edu_uniques[tiled_combinations[:, 0]],
        financial_base_uniques[tiled_combinations[:, 1]],
        learning_time_uniques[tiled_combinations[:, 2]],
        gender_uniques[tiled_combinations[:, 3]]
])
    
# Concatenate the original repeated data with the mapped combinations
result = np.concatenate([repeated_data, mapped_combinations], axis=1)
    
# Convert the result back to a DataFrame
result_df = pd.DataFrame(result, columns=df.columns)
    
print(result_df)

"""
   Uni Region Profession  Level_Edu Financial_Base Learning_Time  GENDER
0   X1     Y1         Z1  undergrad       personal       morning    Male
1   X1     Y1         Z1       grad       personal       morning    Male
2   X1     Y1         Z1        PhD       personal       morning    Male
3   X1     Y1         Z1  undergrad       personal           day    Male
4   X1     Y1         Z1       grad       personal           day    Male
..  ..    ...        ...        ...            ...           ...     ...
67  X2     Y2         Z2       grad          grant           day  Female
68  X2     Y2         Z2        PhD          grant           day  Female
69  X2     Y2         Z2  undergrad          grant       evening  Female
70  X2     Y2         Z2       grad          grant       evening  Female
71  X2     Y2         Z2        PhD          grant       evening  Female

[72 rows x 7 columns]

"""
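The alignment between np.repeat and np.tile is the step that is easiest to get wrong, so here is a minimal sketch on toy arrays. The array names and values are illustrative only and are not part of the code above.

```python
import numpy as np

base = np.array([["X1"], ["X2"]])    # 2 original rows
combos = np.array([[0], [1], [2]])   # 3 combinations

# np.repeat with axis=0 repeats each row consecutively:
# X1, X1, X1, X2, X2, X2
repeated = np.repeat(base, len(combos), axis=0)

# np.tile with reps=(len(base), 1) stacks the whole array vertically
# len(base) times and once horizontally: 0, 1, 2, 0, 1, 2
tiled = np.tile(combos, (len(base), 1))

# Cast to str so both arrays share a dtype before joining them column-wise.
paired = np.concatenate([repeated, tiled.astype(str)], axis=1)

print(paired)
```

Because one array is repeated row by row and the other is tiled as a whole, every original row ends up next to every combination exactly once.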
