
I have a file that stores customer data in pipe-delimited format:

'NS|0001||MMMMMES & MMMMMELL|||||||||29|08|11||||||||||||AD|01|999999999|2/342 P  T MMMMMY 4TH STREET WEST|||MMMMMORIN||31|99|079|9444444444|||||ES|ACTUN00000839436||31102000|6626|INR|6666||01|6626|69||||0001||69|||||||||01||0|0||||||0||||||||\n',
'NS|0001||SMMMMM ASSOCMMMM|||07042004||||||11|07|11||||||||||||AD|01|999999999|305/306 21ST CENTMMMMMILDING RIMMROAD|||MMMMM||11|395009|079|1111111111|0261|6629615|||IS|999999999|1|60|MMMMMM MMMMMIATES|07|11|||||07-APR-2000|||||||||||||06|305/306 21ST CENTURYBUILDING MMMMMOAD|||MMMMM||11|395009|079|1111111111|6629615|0261|||ES|BTSUR00008984444||28122000|20800|INR|9999||01|20800|15227||||1999||15227|||||||||05||0|0|||||JOINT|0||||||||\n',
'NS|0001||MMMM MMMMMITY MMMMMMM LTD|||02122000||||||12|07|11||||||||||||AD|01|999999999|DLF MMMMME 8 TH MMMMM MMMMMANDA MMMM|NH 8 MMMMMON||MMMMMON||12|122002|079||0124|4104641||4104655|ES|E7DEL00009458159||11012000|858120|INR|9000||01|858126|2809649||||1999||2809649|||||||||05||0|0||||||0||||||||\n'


Each customer record is made up of different segments. For example, the name segment (NS) holds user_name, user_sr_no, user_dob, etc.

  • Name segment starts with NS (required segment, a user can have only one of these)
  • Address segment starts with AD (required segment, a user can have multiple addresses)
  • Relation segment starts with ES (optional segment, a user can have multiple of these)
  • Internal segment starts with IS (optional segment, a user can have multiple of these)
  • Every record ends with \n
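To make the segment layout concrete, here is a minimal sketch of splitting one record into its segments. It assumes that any field whose value is exactly NS, AD, ES or IS starts a new segment; the question does not confirm this, and a data field that happens to contain one of those values would be split incorrectly. The names SEGMENT_TAGS and split_segments are made up for illustration.

```python
# Segment tags listed above. Assumption: a field equal to one of these
# always marks the start of a new segment.
SEGMENT_TAGS = {"NS", "AD", "ES", "IS"}


def split_segments(record):
    """Split one pipe-delimited record into (tag, fields) pairs, in file order."""
    segments = []
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            # Start a new segment with an empty field list
            segments.append((field, []))
        elif segments:
            # Attach the field to the most recently opened segment
            segments[-1][1].append(field)
    return segments


# Shortened, made-up record in the same shape as the samples
example = "NS|0001||NAME|AD|01|STREET|ES|X1||\n"
for tag, fields in split_segments(example):
    print(tag, fields)
```

Each pair keeps the empty fields, so positions inside a segment still line up with that segment's column list.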

I want to parse these strings into a data frame with proper column names. I was thinking of converting the data to JSON first, with the user as the outermost key, followed by the segments, followed by the actual column values. Does anyone have ideas on how to do this, or a better way of parsing it?

WHAT I DID

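# read all lines except the first, dropping the trailing newline
# so that padding is appended to the row and not after '\n'
with open(new_file) as f:
    lst = [line.rstrip('\n') for line in f.readlines()[1:]]

# number of '|' delimiters in each row
col = [line.count('|') for line in lst]
col_max = max(col)

# pad every row with trailing '|' so all rows have the same number of columns
s = [line + '|' * (col_max - n) for line, n in zip(lst, col)]

# after this I was planning to make a list of column names (colnames, not defined yet)
# and then write it as a header to a new file, like:
# header = '|'.join(colnames)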

My idea was to count the number of columns in each row and then append '|' wherever required so that every row has the same number of columns. But then I saw that the segments in my file do not appear in a fixed order. For example:

  1. NS > AD > ES
  2. NS > ES > RS > AD

Hence I need to be able to read each segment and store its values in a dictionary. Can anyone help? Any input would be appreciated. I can provide more info if needed, please ask. Thanks in advance.
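One possible shape for that dictionary, sketched under assumptions: each record is parsed into {tag: [fields of 1st occurrence, fields of 2nd occurrence, ...]}, so the order of segments in the file no longer matters and repeated segments such as AD are kept. The sketch again assumes a field equal to a known tag starts a new segment. The function names and the field_1, field_2, ... column names are placeholders, because the real per-segment column names are not given here.

```python
from collections import defaultdict

# Assumption: these are the only tags; add others (e.g. RS) if the file uses them.
SEGMENT_TAGS = {"NS", "AD", "ES", "IS"}


def parse_record(record):
    """Return {tag: [fields_of_occurrence_1, fields_of_occurrence_2, ...]}."""
    parsed = defaultdict(list)
    current = None
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            current = []                  # open a new occurrence of this segment
            parsed[field].append(current)
        elif current is not None:
            current.append(field)         # field belongs to the open segment
    return dict(parsed)


def to_rows(records):
    """Flatten records into one dict per segment occurrence (long format)."""
    rows = []
    for user_no, record in enumerate(records, start=1):
        for tag, occurrences in parse_record(record).items():
            for occ_no, fields in enumerate(occurrences, start=1):
                row = {"user": user_no, "segment": tag, "occurrence": occ_no}
                # Placeholder names; swap in the real column names per segment
                row.update({f"field_{i}": v for i, v in enumerate(fields, start=1)})
                rows.append(row)
    return rows


# Shortened, made-up records with segments in different orders
records = ["NS|0001|A|AD|01|X|AD|02|Y\n", "NS|0002|B|ES|E1|AD|01|Z\n"]
for row in to_rows(records):
    print(row)
```

A list of dicts like this can be passed to pandas.DataFrame(rows) to get a data frame, and json.dumps(parse_record(record)) gives the nested JSON form. Keeping one row per segment occurrence avoids padding every record to the widest row.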

  • Any reason why you wouldn't just use pandas.read_csv()?
    – JJFord3
    Commented Nov 7, 2022 at 15:02
  • I want to add headers to the file so it's easier for someone to read it in, say, Excel. @JJFord3
    – cRIsP
    Commented Nov 7, 2022 at 16:00
