I have a CSV file that was generated by exporting a Tableau table to CSV, but I cannot manage to open it in Python.

I have tried to use pd.read_csv but that fails.

import pandas as pd

#path to file
path = "tableau_crosstab.csv"

data = pd.read_csv(path, encoding="ISO-8859-1") 

This reads the file without raising an error, but the result is just a series of rows with one character per row, and some strange characters at the head of the frame.

ÿþd
o    
m    
a    
i

and so on. When I import the file in Excel I have to select tab as the separator, but when I try that here it fails:

import pandas as pd

#path to file
path = "tableau_crosstab.csv"

data = pd.read_csv(path, encoding="ISO-8859-1", sep='\t') 

CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2

I did try to open the file with codecs, and then it says the encoding is 'cp1252', but using that as the encoding fails too.

I also tried to read it in using utf-8 and that also fails. I am running out of ideas for how to solve this.

Here is a link to a copy of the file, in case someone could take a look: http://www.mediafire.com/file/6dtxo2deczwy3u2/tableau_crosstab.csv

1 Answer

Your file starts with a Unicode byte order mark (BOM), specifically the UTF-16 LE one.

Try:

data = pd.read_csv(path, encoding="utf-16", sep='\t') 

The funny characters you see, ÿþ, correspond to the hex bytes FF FE, which is the UTF-16 little-endian byte order mark. The Wikipedia page on byte order marks lists all the various BOMs.
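If you would rather check for a BOM programmatically than recognise it by eye, something like the following might help. It is only a minimal sketch using the standard library; `sniff_bom` and `sniff_file` are names made up for this illustration, and a file without a BOM simply returns None, so this is not a general encoding detector.

```python
import codecs

# Known byte order marks, longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read the first four bytes of a file and look for a BOM (hypothetical helper)."""
    with open(path, "rb") as fh:
        return sniff_bom(fh.read(4))


# A made-up sample that mimics the start of a UTF-16 LE export.
sample = codecs.BOM_UTF16_LE + "domain\tclicks\n".encode("utf-16-le")

print(sniff_bom(sample))
# Decoding the same bytes as ISO-8859-1 shows the BOM as two visible
# characters followed by letters separated by NUL characters.
print(repr(sample.decode("ISO-8859-1")[:7]))
```

The second print shows the BOM rendered as ÿþ with a NUL character after every letter, which is consistent with the one-character fragments in the question.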

I get the following when reading your csv:

In[4]:
data = pd.read_csv(r'C:\tableau_crosstab.csv', encoding='utf-16', sep='\t')
data

Out[4]: 
       domain Month of date impressions clicks
0    test1.no        jun.17     725 676    633
1    test1.no        mai.17     422 995    456
2    test1.no        apr.17     241 102    316
3    test1.no        mar.17     295 157    260
4    test1.no        feb.17     122 902    198
5    test1.no        jan.17     137 972    201
6    test1.no        des.16     274 435    361
7   test2.com        jun.17   3 083 373  1 638
8   test2.com        mai.17   3 370 620  2 036
9   test2.com        apr.17   2 388 933  1 483
10  test2.com        mar.17   2 410 675  1 581
11  test2.com        feb.17   2 311 952  1 682
12  test2.com        jan.17   1 184 787    874
13  test2.com        des.16   2 118 594  1 738
14  test3.com        jun.17     411 456     41
15  test3.com        mai.17     342 048     87
16  test3.com        apr.17     197 058    108
17  test3.com        mar.17     288 949    156
18  test3.com        feb.17     230 970    130
19  test3.com        jan.17     388 032    115
20  test3.com        des.16   1 693 442    166
21   test4.no        jun.17     521 790    683
22   test4.no        mai.17     438 037    541
23   test4.no        apr.17     618 282  1 042
24   test4.no        mar.17     576 413    956
25   test4.no        feb.17     451 248    636
26   test4.no        jan.17     293 217    471
27   test4.no        des.16     641 491    978
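
Note that the numbers in the output above are grouped with spaces, so those columns are probably read as strings rather than integers. Below is a rough sketch of one way to reproduce the read on a tiny made-up file and then clean the numbers. The file contents, the `mini_crosstab.csv` name and the `to_int` helper are all assumptions for illustration, and the real export may use a different kind of space, which is why all whitespace is stripped.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the export: tab-separated, numbers grouped with spaces.
text = (
    "domain\timpressions\tclicks\n"
    "test1.no\t725 676\t633\n"
    "test2.com\t3 083 373\t1 638\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mini_crosstab.csv")
    # Writing with encoding="utf-16" puts a BOM at the start of the file.
    with open(path, "w", encoding="utf-16", newline="") as fh:
        fh.write(text)
    data = pd.read_csv(path, encoding="utf-16", sep="\t")


def to_int(series):
    """Remove all whitespace (including non-breaking spaces) and convert to numbers."""
    cleaned = series.astype(str).str.replace(r"\s+", "", regex=True)
    return pd.to_numeric(cleaned)


for col in ["impressions", "clicks"]:
    data[col] = to_int(data[col])

print(data.dtypes)
```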
  • It worked for me too. Thanks! So from looking at ÿþ you were able to tell that the encoding was 'utf-16'?
    – Siesta
    Commented Aug 15, 2017 at 10:56
  • Yes, if you look at the Wikipedia page: en.wikipedia.org/wiki/… you will see the hex values and the displayed characters. You get used to seeing and recognising them after a while.
    – EdChum
    Commented Aug 15, 2017 at 10:57
