I am trying to read in a csv file with numpy.genfromtxt
but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
['2011', 'Lexington, KY', '4.0']],
dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way do to it with numpy, or do I just need to read in the data with the csv
module and then convert it to a numpy array?
You can use pandas (the becoming default library for working with dataframes (heterogeneous data) in scientific python) for this. It's read_csv
can handle this. From the docs:
quotechar : string
The character to used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is "
. An example:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
year city value
0 2012 Louisville KY 3.5
1 2011 Lexington, KY 4.0
The trick here is that you also have to use skipinitialspace=True
to deal with the spaces after the comma-delimiter.
Apart from a powerful csv reader, I can also strongly advice to use pandas with the heterogeneous data you have (the example output in numpy you give are all strings, although you could use structured arrays).