Consider a file, a.dat, with contents:
address 1, address 2, address 3, num1, num2, num3
address 1, address 2, address 3, 1.0, 2.0, 3
address 1, address 2, "address 3, address4", 1.0, 2.0, 3
I am trying to import it with numpy.genfromtxt. However, the function sees an additional column in row 3. I get a similar error with pandas.read_csv:
np.genfromtxt('a.dat',delimiter=',',dtype=None,skiprows=1)
ValueError: Some errors were detected !
Line #3 (got 7 columns instead of 6)
and pandas.read_csv fails on the same line:
pd.read_csv('a.dat')
pandas.parser.CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 7
I'm trying to find an input parameter to compensate for this. I don't mind if I end up with a numpy ndarray or pandas dataframe.
Is there a parameter that I can set within genfromtxt and/or read_csv that will let me ignore the comma within the speech marks?
I note that read_csv includes a quotechar='"' parameter, defined thus:
quotechar : string (length 1) The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
This reads to me like read_csv should work for my case by default - yet it doesn't.
I can see that I could pre-process the file to strip out the commas - I'd like to avoid that if possible but would welcome suggestions if this is the only way.
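Just managed to find the answer.
The key parameter that I was missing is skipinitialspace=True, which "deals with the spaces after the comma-delimiter". Without it, the space in front of the opening quote means the quote is not at the start of the field, so it is not treated as a quote character and the comma inside it still splits the field.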
a=pd.read_csv('a.dat',quotechar='"',skipinitialspace=True)
   address 1  address 2            address 3  num1  num2  num3
0  address 1  address 2            address 3     1     2     3
1  address 1  address 2  address 3, address4     1     2     3
This works :-)
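
For anyone who wants to see why the space matters, here is a minimal sketch using Python's standard csv module, which has an option with the same name. It is not what pandas runs internally, just an illustration of the same behaviour. The SAMPLE string and the parse helper are names made up for this example, and the file contents are held in memory instead of being read from a.dat.

```python
import csv
import io

# Same contents as a.dat, kept in memory so the sketch is self-contained.
SAMPLE = (
    'address 1, address 2, address 3, num1, num2, num3\n'
    'address 1, address 2, address 3, 1.0, 2.0, 3\n'
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)


def parse(text, skipinitialspace):
    """Hypothetical helper: parse CSV text and return rows as lists of strings."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=',',
        quotechar='"',
        skipinitialspace=skipinitialspace,
    )
    return list(reader)


# Default behaviour: the quote comes after a space, so it is not seen as
# the start of a quoted field and the comma inside it splits the field.
default_rows = parse(SAMPLE, skipinitialspace=False)

# With skipinitialspace=True the space after each delimiter is dropped,
# the quote is the first character of the field, and quoting works.
fixed_rows = parse(SAMPLE, skipinitialspace=True)

print([len(row) for row in default_rows])  # field count per row, default
print([len(row) for row in fixed_rows])    # field count per row, fixed
print(fixed_rows[2][2])                    # the quoted address field
```

If an ndarray is wanted in the end, the rows from fixed_rows could be passed on to numpy after converting the numeric columns, or the DataFrame from read_csv can be converted directly.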