
I need to read a CSV file, fill the empty/null values in the columns Phone, Email, Address, City, State and Zip based on Relationship and LastName, and write the result to a new CSV file. Example: if a person's Relationship is "Employer", and their dependents share the same LastName but have no Phone, Email, Address, City, State or Zip, I should fill those null values with the Employer's Phone, Email, Address, City, State and Zip.

There are two main concerns:

1) LastName can be the same for different people/families (i.e., rows 5-10), so the fill should stop and start over whenever Relationship changes to "Employer".
2) In some cases a Dependent's LastName differs from the Employer's LastName, but the null values should still be filled if the row falls in between rows with the same last name (i.e., row 8).

I won't be able to use pandas, as the rest of the code/programs are based purely on Python 2.7.

The input (a CSV file) looks like this, with empty cells:

FirstName  LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  [email protected]  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent                       
Jonaas  Kahnw   2/22/1997   Dependent                       
Mikkel  Nielsen 1/25/1976   Employer    4509213887  [email protected] 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent                       
Ulrich  Nielsen 9/12/1983   Employer    7901234516  [email protected]  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent                       
kathy   storm   12/12/1999  Dependent                       
kiiten  Nielsen 6/21/1999   Dependent                       
Elisab  Doppler 2/22/1987   Employer    5439001211  [email protected]   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  [email protected]    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent

The output should look like this:

FirstName   LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  [email protected]  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent   1457871453  [email protected]  Han Ave hannas  UT  563426
Jonaas  Kahnw   2/22/1997   Dependent   1457871454  [email protected]  Han Ave hannas  UT  563427
Mikkel  Nielsen 1/25/1976   Employer    4509213887  [email protected] 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent   4509213888  [email protected] 888 Street  neil    NY  72357
Ulrich  Nielsen 9/12/1983   Employer    7901234516  [email protected]  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent   7901234517  [email protected]  Ulric Build mavric  KS  421257
kathy   storm   12/12/1999  Dependent   7901234518  [email protected]  Ulric Build mavric  KS  421258
kiiten  Nielsen 6/21/1999   Dependent   7901234519  [email protected]  Ulric Build mavric  KS  421259
Elisab  Doppler 2/22/1987   Employer    5439001211  [email protected]   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  [email protected]    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent   6847523759  [email protected]    park Ave    Pete    PT  415254


import csv
from collections import namedtuple


def get_info(file_path):

    # Read data from file and convert to a list of namedtuples; also build a
    # dictionary used to fill in missing information from other records.
    with open(file_path, 'rb') as fin:
        csv_reader =  csv.reader(fin, skipinitialspace=True)

        header = next(csv_reader)
        Record = namedtuple('Record', header)

        addr_dict = {}
        data = [header]

        for rec in (Record._make(row) for row in csv_reader):
            if rec.Email or rec.Phone or rec.Address or rec.City or rec.State or rec.Zip:
                addr_dict.setdefault(rec.LastName, []).append(rec)  # Remember it.
            data.append(rec)  # Keep every record, complete or not.

    # Try to fill in missing data from any other records with the same LastName.
    for i, row in enumerate(data[1:], 1):
        if not (row.Phone and row.Email and row.Address and row.City and row.State and row.Zip):  # Info missing?
            # Try to copy it from others with the same LastName.
            updated = False
            for other in addr_dict.get(row.LastName, []):
                if not row.Phone and other.Phone:
                    row = row._replace(Phone=other.Phone)
                    updated = True
                if not row.Email and other.Email:
                    row = row._replace(Email=other.Email)
                    updated = True
                if not row.Address and other.Address:
                    row = row._replace(Address=other.Address)
                    updated = True
                if not row.City and other.City:
                    row = row._replace(City=other.City)
                    updated = True
                if not row.State and other.State:
                    row = row._replace(State=other.State)
                    updated = True
                if not row.Zip and other.Zip:
                    row = row._replace(Zip=other.Zip)
                    updated = True
                if row.Phone and row.Email and row.Address and row.City and row.State and row.Zip:  # Info now filled in?
                    break

            if updated:
                data[i] = row

    return data


INPUT_FILE = 'null_cols.csv'
OUTPUT_FILE = 'fill_cols.csv'

data = get_info(INPUT_FILE)

with open(OUTPUT_FILE, 'wb') as fout:
    writer = csv.DictWriter(fout, data[0])  # First elem has column names.
    writer.writeheader()
    for row in data[1:]:
        writer.writerow(row._asdict())

(I got this code from an earlier question I asked on Stack Overflow. The script doesn't include the relationship-type logic, and it doesn't handle the duplicate LastName issue.)
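
For reference, the relationship logic missing from the script above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fill_from_employer` and `FILL_COLUMNS` are hypothetical names, rows are plain dicts keyed by column name, and dependents always appear directly below their employer. LastName is deliberately ignored, so a dependent with a different last name is still filled, and each new "Employer" row starts a new family. The expected output above shows Phone and Zip increasing by one per row; the sketch assumes that is a spreadsheet auto-fill artifact and copies the employer's values unchanged.

```python
FILL_COLUMNS = ('Phone', 'Email', 'Address', 'City', 'State', 'Zip')


def fill_from_employer(rows, fill_columns=FILL_COLUMNS):
    """Return copies of rows (dicts) in which each Dependent's blank cells
    are filled from the nearest Employer row above it."""
    employer = None
    result = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is left untouched
        if row['Relationship'].strip().lower() == 'employer':
            employer = row  # a new family starts here
        elif employer is not None:
            for col in fill_columns:
                if not (row.get(col) or '').strip():  # blank or missing cell
                    row[col] = employer.get(col, '')
        result.append(row)
    return result


# Part of the sample data, reduced to three columns for brevity.
sample = [
    {'LastName': 'Nielsen', 'Relationship': 'Employer', 'Phone': '7901234516'},
    {'LastName': 'Nielsen', 'Relationship': 'Dependent', 'Phone': ''},
    {'LastName': 'storm', 'Relationship': 'Dependent', 'Phone': ''},
    {'LastName': 'Doppler', 'Relationship': 'Employer', 'Phone': '5439001211'},
    {'LastName': 'Doppler', 'Relationship': 'Employer', 'Phone': '6847523758'},
    {'LastName': 'Doppler', 'Relationship': 'Dependent', 'Phone': ''},
]

for row in fill_from_employer(sample, fill_columns=('Phone',)):
    print('%s %s %s' % (row['LastName'], row['Relationship'], row['Phone']))
```

In this run the "storm" row picks up the Nielsen employer's phone, and the last Doppler dependent is filled from the second Doppler employer rather than the first, because only the most recent Employer row is remembered.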

Thanks for the help!

  • Can you please add a formatted sample of your dataframe? I can't really tell which record belongs to which column. From there I can help you write the code to do it.
    – Mit
    Commented May 4, 2020 at 17:58
  • @mit could you please tell me how to add a formatted sample? I always have trouble with this; for now I just copy-pasted the input data from the CSV file. Or do you mean a pandas dataframe? I am not using pandas/Jupyter; I am using PyCharm and I am new to this.
    – Roy
    Commented May 4, 2020 at 18:15

1 Answer


OK, there is a simple way of doing this using ffill, as below. However, it only works if dependents always follow their employer in row order, which is the case in the data sample you've provided, so the code below works:

import pandas as pd

#read in your csv file; dtype=str keeps Phone and Zip as text instead of floats
df = pd.read_csv('fileName.csv', dtype=str)


#forward-fill every column: replace NaNs with the most recent value above
df = df.ffill()

#write out your df back to csv
df.to_csv('newFile.csv', index=False)
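
Since the question rules out pandas, the same forward fill can be approximated with only the standard csv module. The following is a rough sketch under stated assumptions, not part of the original answer: `open_csv`, `forward_fill` and `forward_fill_csv` are hypothetical helper names, whitespace-only cells count as empty, the file has a header row, and every row has as many cells as the header. Like ffill, it knows nothing about the Relationship column, so an Employer row with a blank cell would inherit the value from the family above it.

```python
import csv
import sys


def open_csv(path, mode):
    """Open a file the way the csv module expects on Python 2 or Python 3."""
    if sys.version_info[0] == 2:
        return open(path, mode + 'b')    # Python 2: csv wants binary mode
    return open(path, mode, newline='')  # Python 3: text mode, no newline translation


def forward_fill(rows):
    """Yield each row with blank cells replaced by the last non-blank
    value seen in the same column."""
    last_seen = {}
    for row in rows:
        filled = []
        for col, value in enumerate(row):
            if value.strip():
                last_seen[col] = value             # remember the newest value
            else:
                value = last_seen.get(col, value)  # reuse it, if there is one
            filled.append(value)
        yield filled


def forward_fill_csv(in_path, out_path):
    """Copy in_path to out_path, forward-filling blank cells below the header."""
    with open_csv(in_path, 'r') as fin, open_csv(out_path, 'w') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # header row is copied unchanged
        writer.writerows(forward_fill(reader))
```

With the file names from the question, the call would be `forward_fill_csv('null_cols.csv', 'fill_cols.csv')`. Rows are streamed one at a time, so the whole file never has to fit in memory.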
  • he cannot use pandas
    – coldy
    Commented May 4, 2020 at 18:42
  • @coldy why not?
    – Mit
    Commented May 4, 2020 at 18:43
  • I just said it after seeing the OP.
    – coldy
    Commented May 4, 2020 at 18:46
  • @mit thank you so much, but is there any way to use only Python 2.7 code? I mean, I can't use pandas.
    – Roy
    Commented May 4, 2020 at 18:47
  • Sorry, I've just seen that in your post; thanks @coldy for pointing it out. It's been ages since I've used 2.7, but let me try it really quick.
    – Mit
    Commented May 4, 2020 at 18:49
