
I am trying to parse a CSV file that I don't control. The file contains unquoted fields with doubled quotes, like ...,text ""text"",... and I want the parsed field to be text "text".

Java's Opencsv parses this file just fine, but the Python libraries need the whole field to be quoted (...,"text ""text""",...) to unquote the text properly.

I've prepared a test suite that illustrates my problem.

test.csv

col1,col2
idk,text ""text""
idk,"text ""text"""
idk,"text, text"
idk,text """"text""""
idk,"text """"text"""""

Expected output:

[
    {'col1': 'idk', 'col2': 'text "text"'},
    {'col1': 'idk', 'col2': 'text "text"'},
    {'col1': 'idk', 'col2': 'text, text'},
    {'col1': 'idk', 'col2': 'text ""text""'},
    {'col1': 'idk', 'col2': 'text ""text""'},
]

Test program:

import csv
import unittest
from typing import Dict, List


def parse_input(path: str) -> List[Dict[str, str]]:
    with open(path, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)


class TestStringMethods(unittest.TestCase):
    def test_parsing(self) -> None:
        parsed = parse_input('./test.csv')
        self.assertEqual(
            parsed,
            [
                {
                    'col1': 'idk',
                    'col2': 'text "text"',
                },
                {
                    'col1': 'idk',
                    'col2': 'text "text"',
                },
                {
                    'col1': 'idk',
                    'col2': 'text, text',
                },
                {
                    'col1': 'idk',
                    'col2': 'text ""text""',
                },
                {
                    'col1': 'idk',
                    'col2': 'text ""text""',
                },
            ],
        )


if __name__ == '__main__':
    unittest.main()

I haven't found any parameter that would allow me to do what I want. I thought the doublequote parameter would work, but its default value is already True.

I have tried Python's built-in csv module, pandas, and clevercsv. They all treat the doubled quotes in an unquoted field as literal characters, not as one escaped double quote. Generally that is a good thing, but I need to parse this file properly, preferably without writing my own parser.

Can you suggest a Python library that can parse this file?
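For illustration, a minimal reproduction with the built-in csv module. It parses the first two data rows of test.csv from an in-memory string; the variable names are only for this sketch.

```python
import csv
import io

# First two data rows of test.csv: the same text, once in an unquoted
# field and once in a fully quoted field.
data = 'idk,text ""text""\nidk,"text ""text"""\n'

rows = list(csv.reader(io.StringIO(data)))

# Unquoted field: the doubled quotes are kept as literal characters.
print(rows[0])  # ['idk', 'text ""text""']
# Quoted field: each doubled quote collapses to a single quote.
print(rows[1])  # ['idk', 'text "text"']
```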

4
  • 2
    A bit tangential, but if you have the ability to do so, going upstream and fixing the generation of this file to properly conform to the long-established CSV specification would make your life significantly easier.
    – esqew
    Commented Oct 9 at 16:00
  • 1
    I would argue that if Opencsv gives you the expected result, then Opencsv is non-conformant. In the CSV specifications, quote characters only have a special meaning in quoted fields, so the line idk,text ""text"" should give {'col1': 'idk', 'col2': 'text ""text""'}. Put differently, why would you expect a CSV library to read a file that is not CSV-encoded? If you use a custom format, you need a custom parser. Commented Oct 9 at 16:27
  • 1
    I suggest you write a simple Java program to preprocess the file into a valid CSV.
    – Barmar
    Commented Oct 9 at 16:31
  • Depending on what other data is in your file, you might be able to get away with setting the escape character to " and doublequote to False when constructing the reader. Strictly speaking, there isn't really a strict standard for CSV, just a lot of common conventions.
    – chepner
    Commented Oct 9 at 19:05

2 Answers

1

Python's csv module cannot handle those kinds of quotes the way you want. I suggest you follow @Barmar's suggestion and write a preprocessor in Java to "fix" that kind of CSV before passing it to Python or other libraries.

0

Thanks for the suggestions.

I solved this problem by preprocessing each row while reading the file. I used a regex to match every field that contains "" but is not itself enclosed in double quotes, and then wrapped that field in double quotes.

It didn't have a big impact on performance, and I haven't noticed any problems.
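A minimal sketch of this kind of preprocessing, not the exact code used. It assumes a comma delimiter, no quoted fields that span several lines, and that every quote inside an unquoted field is doubled. The names `quote_bare_fields`, `parse_lines`, and `parse_input` are made up for the sketch.

```python
import csv
import io
import re
from typing import Dict, Iterable, Iterator, List

# One field per match: either a complete quoted field (consumed as a whole,
# so commas inside it are never seen as separators) or a run of non-comma
# characters. A field starts at line start or right after a comma.
_FIELD_RE = re.compile(
    r'(?:^|(?<=,))'
    r'(?:"(?:[^"]|"")*"|[^,\r\n]*)'
    r'(?=,|\r?\n|$)'
)


def _quote_bare_field(match: "re.Match[str]") -> str:
    field = match.group(0)
    # Leave quoted fields and fields without doubled quotes untouched.
    if field.startswith('"') or '""' not in field:
        return field
    # Wrap the bare field so the csv module unescapes its "" pairs.
    return f'"{field}"'


def quote_bare_fields(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        yield _FIELD_RE.sub(_quote_bare_field, line)


def parse_lines(lines: Iterable[str]) -> List[Dict[str, str]]:
    return list(csv.DictReader(quote_bare_fields(lines)))


def parse_input(path: str) -> List[Dict[str, str]]:
    with open(path, 'r', newline='') as f:
        return parse_lines(f)


# The test.csv content from the question, as an in-memory sample.
SAMPLE = '''col1,col2
idk,text ""text""
idk,"text ""text"""
idk,"text, text"
idk,text """"text""""
idk,"text """"text"""""
'''

if __name__ == '__main__':
    for row in parse_lines(io.StringIO(SAMPLE)):
        print(row)
```

Because csv.DictReader accepts any iterable of lines, the generator can sit between the file object and the reader, and the file is still read in a single pass. A bare field with an odd number of quotes would become malformed after wrapping, so such input needs separate handling.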

1
  • Please, consider sharing the modified code to achieve your goal.
    – LuxGiammi
    Commented Oct 18 at 9:50
