Need to convert pipe delimited file to DataFrame


Asked 2 years, 1 month ago

Modified 2 years, 1 month ago

Viewed 67 times

I have a file which has customer data stored in pipe delimited format:

'NS|0001||MMMMMES & MMMMMELL|||||||||29|08|11||||||||||||AD|01|999999999|2/342 P  T MMMMMY 4TH STREET WEST|||MMMMMORIN||31|99|079|9444444444|||||ES|ACTUN00000839436||31102000|6626|INR|6666||01|6626|69||||0001||69|||||||||01||0|0||||||0||||||||\n',
'NS|0001||SMMMMM ASSOCMMMM|||07042004||||||11|07|11||||||||||||AD|01|999999999|305/306 21ST CENTMMMMMILDING RIMMROAD|||MMMMM||11|395009|079|1111111111|0261|6629615|||IS|999999999|1|60|MMMMMM MMMMMIATES|07|11|||||07-APR-2000|||||||||||||06|305/306 21ST CENTURYBUILDING MMMMMOAD|||MMMMM||11|395009|079|1111111111|6629615|0261|||ES|BTSUR00008984444||28122000|20800|INR|9999||01|20800|15227||||1999||15227|||||||||05||0|0|||||JOINT|0||||||||\n',
'NS|0001||MMMM MMMMMITY MMMMMMM LTD|||02122000||||||12|07|11||||||||||||AD|01|999999999|DLF MMMMME 8 TH MMMMM MMMMMANDA MMMM|NH 8 MMMMMON||MMMMMON||12|122002|079||0124|4104641||4104655|ES|E7DEL00009458159||11012000|858120|INR|9000||01|858126|2809649||||1999||2809649|||||||||05||0|0||||||0||||||||\n'

The customer data is divided into segments. For example, the name segment (NS) holds fields such as user_name, user_sr_no, user_dob, etc.

Name segment starts with NS (required; a user can have only one of these)
Address segment starts with AD (required; a user can have multiple addresses)
Relation segment starts with ES (optional; a user can have multiple of these)
Internal segment starts with IS (optional; a user can have multiple of these)
Every record ends with \n
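The segment layout above could be read by splitting each record on `|` and starting a new segment whenever a field equals one of the known tags. This is only a minimal sketch: it assumes tags always occupy a whole field and that no ordinary value happens to equal a tag, neither of which the question confirms. `SEGMENT_TAGS`, `split_segments`, and the sample record are hypothetical.

```python
# Tags taken from the segment list above
SEGMENT_TAGS = {"NS", "AD", "ES", "IS"}

def split_segments(record):
    """Split one pipe-delimited record into (tag, fields) pairs, in file order."""
    segments = []
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            # A tag opens a new segment
            segments.append((field, []))
        elif segments:
            # Any other field belongs to the most recently opened segment;
            # fields seen before the first tag are ignored
            segments[-1][1].append(field)
    return segments

# Made-up record, much shorter than the real ones
record = "NS|0001||SOME NAME|AD|01|STREET 1||ES|REF001|INR\n"
for tag, fields in split_segments(record):
    print(tag, fields)
```

Empty fields are kept as empty strings so that positions within a segment stay aligned with whatever column layout the file specification defines.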

I want to parse these strings into a data frame with proper column names. I was thinking of converting the data to JSON first, with the user at the outermost level, followed by the segments, followed by the actual column values. Does anyone have any ideas on how to do this, or a better way of parsing it?
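The nested JSON idea might look roughly like the sketch below: one entry per record, then one key per segment tag, then a list of field lists (a list because AD, ES and IS can repeat). Records are keyed by their position in the input, since the question does not say which field uniquely identifies a user. `record_to_dict` and the sample records are hypothetical.

```python
import json
from collections import defaultdict

SEGMENT_TAGS = {"NS", "AD", "ES", "IS"}

def record_to_dict(record):
    """Group a record's fields by segment tag; repeated tags become list entries."""
    grouped = defaultdict(list)
    current = None
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            current = []
            grouped[field].append(current)
        elif current is not None:
            current.append(field)
    return dict(grouped)

# Made-up records with segments in different orders
records = [
    "NS|0001|NAME A|AD|01|STREET 1|AD|02|STREET 2\n",
    "NS|0001|NAME B|ES|REF9|AD|01|STREET 3\n",
]

# Assumption: the record's position in the file serves as the outer key
users = {i: record_to_dict(rec) for i, rec in enumerate(records)}
print(json.dumps(users, indent=2))
```

Note that `json.dumps` writes the integer keys as strings, so they come back as `"0"`, `"1"`, and so on when the JSON is loaded again.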

WHAT I DID

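# read every line except the first
with open(new_file) as f:
    lst = f.readlines()[1:]

# count the delimiters in each row and find the widest row
col = [line.count('|') for line in lst]
col_max = max(col)

# pad shorter rows with '|' before the trailing newline so all rows have equal columns
s = [line.rstrip('\n') + '|' * (col_max - n) + '\n' for line, n in zip(lst, col)]

# after this I was planning to make a list of column names (colnames, not defined yet)
# and then write it as a header line to a new file, like:
header = '|'.join(colnames) + '\n'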

So I thought I would count the number of columns in each row and then append '|' wherever required so that every row has the same number of columns. But then I saw that the segments in my file appear in varying order, like:

NS > AD > ES
NS > ES > RS > AD

Hence I need to be able to read the segment tag and then store the input in a dictionary. Can anyone help? Any input would be appreciated. I can provide more info if needed; please ask. Thanks in advance.
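Because the segments can arrive in any order, one possible approach is to key everything on the segment tag rather than on position, and build one table per segment type. The sketch below assumes pandas is available. The column names in `COLUMNS` are hypothetical placeholders (the real layout has to come from the file specification), and `record_id` is simply the record's position in the input.

```python
import pandas as pd

SEGMENT_TAGS = {"NS", "AD", "ES", "IS"}

# Hypothetical column names per segment; replace with the real layout
COLUMNS = {
    "NS": ["user_sr_no", "user_name", "user_dob"],
    "AD": ["addr_type", "addr_line"],
    "ES": ["ref_no", "currency"],
}

def segment_rows(record, record_id):
    """Return one dict per segment found in the record, whatever the order."""
    rows, current = [], None
    for field in record.rstrip("\n").split("|"):
        if field in SEGMENT_TAGS:
            current = {"record_id": record_id, "segment": field, "values": []}
            rows.append(current)
        elif current is not None:
            current["values"].append(field)
    return rows

# Made-up records: NS > AD > ES and NS > ES > AD > AD
records = [
    "NS|0001|NAME A|01012000|AD|01|STREET 1|ES|REF1|INR\n",
    "NS|0001|NAME B|02022000|ES|REF2|INR|AD|01|STREET 2|AD|02|STREET 3\n",
]
rows = [r for i, rec in enumerate(records) for r in segment_rows(rec, i)]

# One DataFrame per segment type; zip() ignores surplus fields and
# missing fields show up as NaN
frames = {}
for tag, names in COLUMNS.items():
    data = [
        dict(zip(names, r["values"]), record_id=r["record_id"])
        for r in rows
        if r["segment"] == tag
    ]
    frames[tag] = pd.DataFrame(data, columns=["record_id"] + names)

print(frames["AD"])
```

Keeping a separate frame per segment avoids padding every row to the same width, and the frames can still be joined on `record_id` when a single wide table is needed.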

edited Nov 7, 2022 at 14:53

asked Nov 7, 2022 at 13:20

cRIsP


Any reason why you wouldn't just use pandas.read_csv()?
– JJFord3
Commented Nov 7, 2022 at 15:02
I want to add headers to the file so it's easier for someone to read it in, say, Excel. @JJFord3
– cRIsP
Commented Nov 7, 2022 at 16:00
