I have a file that stores customer data in pipe-delimited format:
'NS|0001||MMMMMES & MMMMMELL|||||||||29|08|11||||||||||||AD|01|999999999|2/342 P T MMMMMY 4TH STREET WEST|||MMMMMORIN||31|99|079|9444444444|||||ES|ACTUN00000839436||31102000|6626|INR|6666||01|6626|69||||0001||69|||||||||01||0|0||||||0||||||||\n',
'NS|0001||SMMMMM ASSOCMMMM|||07042004||||||11|07|11||||||||||||AD|01|999999999|305/306 21ST CENTMMMMMILDING RIMMROAD|||MMMMM||11|395009|079|1111111111|0261|6629615|||IS|999999999|1|60|MMMMMM MMMMMIATES|07|11|||||07-APR-2000|||||||||||||06|305/306 21ST CENTURYBUILDING MMMMMOAD|||MMMMM||11|395009|079|1111111111|6629615|0261|||ES|BTSUR00008984444||28122000|20800|INR|9999||01|20800|15227||||1999||15227|||||||||05||0|0|||||JOINT|0||||||||\n',
'NS|0001||MMMM MMMMMITY MMMMMMM LTD|||02122000||||||12|07|11||||||||||||AD|01|999999999|DLF MMMMME 8 TH MMMMM MMMMMANDA MMMM|NH 8 MMMMMON||MMMMMON||12|122002|079||0124|4104641||4104655|ES|E7DEL00009458159||11012000|858120|INR|9000||01|858126|2809649||||1999||2809649|||||||||05||0|0||||||0||||||||\n'
The customer data is made up of different segments. For example, the name segment NS holds user_name, user_sr_no, user_dob, etc.
- The name segment starts with NS (required segment; a user can have only one of these).
- The address segment starts with AD (required segment; a user can have multiple addresses).
- The relation segment starts with ES (optional segment; a user can have multiple of these).
- The internal segment starts with IS (optional segment; a user can have multiple of these).
- Every record ends with \n.
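The rules in the list above can be written down as a small lookup table and checked per record. This is only a sketch: `SEGMENT_RULES` and `check_segments` are names made up for illustration, and the function assumes the segment tags of a record have already been extracted into a list. Any tag that is not in the table is reported as unknown, so further segment types in the file would have to be added to it.

```python
from collections import Counter

# (minimum, maximum) occurrences per record; None means "no upper limit".
# The limits are taken from the segment list above.
SEGMENT_RULES = {
    "NS": (1, 1),     # required, exactly one
    "AD": (1, None),  # required, one or more
    "ES": (0, None),  # optional, any number
    "IS": (0, None),  # optional, any number
}

def check_segments(tags):
    """Return a list of problems found in the segment tags of one record."""
    counts = Counter(tags)
    problems = []
    for tag, (low, high) in SEGMENT_RULES.items():
        n = counts.get(tag, 0)
        if n < low:
            problems.append(f"{tag}: expected at least {low}, found {n}")
        if high is not None and n > high:
            problems.append(f"{tag}: expected at most {high}, found {n}")
    # Report tags that the rules table does not know about
    for tag in counts:
        if tag not in SEGMENT_RULES:
            problems.append(f"{tag}: unknown segment")
    return problems

print(check_segments(["NS", "AD", "ES"]))  # a record that satisfies the rules
print(check_segments(["NS", "ES", "ES"]))  # a record without an AD segment
```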
I want to parse these strings into a data frame with proper column names. I was thinking of converting the data to JSON first, with the user at the outermost level, followed by the segments, followed by the actual column values. Does anyone have ideas on how to do this, or a better way of parsing it?
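To make the intended JSON shape concrete, here is a rough sketch of the nesting: user, then segment, then named values. It starts from a record that has already been split into (tag, values) pairs. Everything in `COLUMNS` is an assumption: the positions of `user_sr_no` and `user_name` are guessed from the sample rows, and the real layout has to come from the file specification. Positions without a known name fall back to `<tag>_<position>`.

```python
import json

# Hypothetical column names by position; None marks a position whose
# meaning is not known. Only NS is filled in, as an illustration.
COLUMNS = {
    "NS": ["user_sr_no", None, "user_name"],
}

def name_fields(tag, values):
    """Pair each value with a column name, falling back to <tag>_<position>."""
    names = COLUMNS.get(tag, [])
    named = {}
    for i, value in enumerate(values):
        known = names[i] if i < len(names) else None
        named[known or f"{tag}_{i + 1}"] = value
    return named

def build_user(segments):
    """Group one user's (tag, values) pairs into {tag: [named fields, ...]}."""
    user = {}
    for tag, values in segments:
        # A list per tag, because AD, ES and IS may occur more than once
        user.setdefault(tag, []).append(name_fields(tag, values))
    return user

# One already-split record, shortened from the sample data
segments = [
    ("NS", ["0001", "", "MMMMMES & MMMMMELL"]),
    ("AD", ["01", "999999999", "2/342 P T MMMMMY 4TH STREET WEST"]),
    ("ES", ["ACTUN00000839436", "", "31102000"]),
]
users = {"user_1": build_user(segments)}
print(json.dumps(users, indent=2))
```

Keeping a list under each tag, even for NS, means every segment type is handled the same way when the structure is flattened later.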
WHAT I DID
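# Read every line after the first one, dropping the trailing newline
# so that the padding is not appended after it
with open(new_file) as f:
    lst = [line.rstrip('\n') for line in f.readlines()[1:]]
# Number of '|' delimiters in each row
col = [i.count('|') for i in lst]
col_max = max(col)
# Pad each row with trailing '|' so that every row has the same number of columns
s = [row + '|' * (col_max - n) for row, n in zip(lst, col)]
# after this i was planning to make a list of column names
# and then write it as a header to a new file like,
colnames = [i + '|' for i in colnames]  # colnames is the planned list, not built yet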
So I thought I would count the number of columns in each row and then append '|' wherever required, so that every row has the same number of columns. But then I saw that the segments in my file appear in random order, for example:
- NS > AD > ES
- NS > ES > RS > AD
Hence I need to be able to read each segment and then store its input in a dictionary. Can anyone help? Any input would be appreciated. I can provide more info if needed; please ask. Thanks in advance.
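One possible way to read the segments regardless of their order is to walk through the fields and open a new segment whenever a field is exactly one of the known tags. The sketch below does that and collects one list of rows per segment type. It rests on assumptions: `SEGMENT_TAGS` is taken to be the complete set of tags (RS is included only because it appears in the orderings above), a data value identical to a tag would be misread as a segment start, and the two records are made-up, shortened examples. Column names are positional placeholders.

```python
from collections import defaultdict

# Assumed complete set of segment tags
SEGMENT_TAGS = {"NS", "AD", "ES", "IS", "RS"}

def split_segments(record):
    """Split one record into (tag, values) pairs, in the order they appear."""
    segments = []
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            segments.append((field, []))   # a tag opens a new segment
        elif segments:
            segments[-1][1].append(field)  # a value belongs to the open segment
        # fields before the first tag are ignored
    return segments

def records_to_tables(records):
    """Return {tag: [row, ...]} with one row per segment occurrence."""
    tables = defaultdict(list)
    for user_id, record in enumerate(records, start=1):
        for tag, values in split_segments(record):
            row = {"user_id": user_id}  # links the row back to its record
            row.update({f"{tag}_{i}": v for i, v in enumerate(values, start=1)})
            tables[tag].append(row)
    return dict(tables)

# Made-up records with the segments in different orders
records = [
    "NS|0001||NAME ONE|AD|01|STREET A|ES|X1|INR\n",
    "NS|0002||NAME TWO|ES|X2|INR|AD|01|STREET B|AD|02|STREET C\n",
]
tables = records_to_tables(records)
for tag, rows in tables.items():
    print(tag, rows)
```

Each value in `tables` is a list of dictionaries, which is a shape that `pandas.DataFrame(rows)` accepts, so this would give one data frame per segment type, joined through `user_id`, instead of one very wide padded table.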