I would like to read in a csv file using genfromtxt. I have six columns that are float, and one column that is a string.
How do I set the datatype so that the float columns will be read in as floats and the string column will be read in as strings? I tried dtype='void' but that is not working.
Suggestions?
Thanks
.csv file
999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
import sys
import numpy as np

x = sys.argv[1]
f = open(x, 'r')
y = np.genfromtxt(f, delimiter=',', dtype=[
    ('f0', '<f8'), ('f1', 'S4'), ('f2', '<f8'), ('f3', '<f8'),
    ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])
ionenergy = y[:,0]
units = y[:,1]
Error:
ionenergy = y[:,0]
IndexError: invalid index
I don't get this error when I specify a single data type.
dtype=None tells genfromtxt to guess the appropriate dtype.
From the docs:
dtype: dtype, optional
Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.
(my emphasis.)
Since your data is comma-separated, be sure to include delimiter=',' or else np.genfromtxt will interpret each column (except the last) as including a string character (the comma) and therefore mistakenly assign a string dtype to each of those columns.
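To see the effect of the delimiter, here is a minimal sketch that reads the question's three sample rows from an in-memory buffer, once without and once with delimiter=','. The io.StringIO buffer, the helper name column_kinds, and the explicit encoding='utf-8' argument are assumptions added for illustration (the encoding is passed so that strings come back as text rather than bytes).

```python
import io
import numpy as np

# Sample rows from the question, kept in memory so the sketch is self-contained.
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

def column_kinds(**kwargs):
    """Hypothetical helper: return the dtype kind of each guessed column.

    'f' is float, 'i' is integer, 'U' or 'S' is a string type.
    """
    arr = np.genfromtxt(io.StringIO(DATA), dtype=None, encoding="utf-8", **kwargs)
    return [arr.dtype[name].kind for name in arr.dtype.names]

# Default delimiter is whitespace, so the commas stay attached to the tokens.
kinds_without = column_kinds()
# Splitting on commas lets the numeric columns be recognized as numbers.
kinds_with = column_kinds(delimiter=",")

print("no delimiter:  ", kinds_without)
print("delimiter=',': ", kinds_with)
```

Without the delimiter, every column except the last should come back as a string kind; with it, only the second column should.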
For example:
import numpy as np
arr = np.genfromtxt('data', dtype=None, delimiter=',')
print(arr.dtype)
# [('f0', '<f8'), ('f1', 'S4'), ('f2', '<i4'), ('f3', '<f8'), ('f4', '<f8')]
This shows the names and dtypes of each column. For example, ('f2', '<i4') means the third column has name 'f2' and is of dtype '<i4'. The i means it is an integer dtype. If you need the third column to be a float dtype then there are a few options.
You could supply the dtype explicitly in the call to genfromtxt:
arr = np.genfromtxt(
    'data', delimiter=',',
    dtype=[('f0', '<f8'), ('f1', 'S4'), ('f2', '<f4'), ('f3', '<f8'), ('f4', '<f8')])
print(arr)
# [(999.9, ' abc', 34.0, 78.0, 12.3) (1.3, ' ghf', 12.0, 8.4, 23.7)
#  (101.7, ' evf', 89.0, 2.4, 11.3)]
print(arr['f2'])
# [ 34.  12.  89.]
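Another possibility, sketched here rather than taken from the question, is to let genfromtxt guess the dtypes and then cast just the one field afterwards with astype. The in-memory io.StringIO buffer, the variable name third, and the encoding='utf-8' argument are assumptions made so the example runs on its own.

```python
import io
import numpy as np

# Sample rows from the question, kept in memory so the sketch is self-contained.
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

# Let genfromtxt guess each column's dtype; the third column is guessed as integer.
arr = np.genfromtxt(io.StringIO(DATA), delimiter=",", dtype=None, encoding="utf-8")

# Cast a copy of that one field to float; the structured array itself is unchanged.
third = arr["f2"].astype(float)

print(arr["f2"].dtype)   # integer dtype as guessed by genfromtxt
print(third.dtype)       # float dtype after the cast
```

This leaves arr as it was, so it suits the case where only one computation needs floats; supplying the dtype up front is the better fit if the column should be float everywhere.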
The error message IndexError: invalid index is being generated by the line
ionenergy = y[:,0]
When you have mixed dtypes, np.genfromtxt returns a structured array. You need to read up on structured arrays because the syntax for accessing columns differs from the syntax used for plain arrays of homogeneous dtype.
Instead of y[:, 0], to access the first column of the structured array y, use
y['f0']
Or, better yet, supply the names parameter in np.genfromtxt, so you can use a more relevant column name, like y['ionenergy']:
import numpy as np
arr = np.genfromtxt(
    'data', delimiter=',', dtype=None,
    names=['ionenergy', 'foo', 'bar', 'baz', 'quux', 'corge'])
print(arr['ionenergy'])
# [ 999.9 1.3 101.7]
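To make the difference between the two indexing styles concrete, here is a small sketch using the question's sample rows. It assumes an in-memory io.StringIO buffer in place of the real file, hypothetical column names (ionenergy, units, col2, col3, col4), and an explicit encoding='utf-8'; none of these come from the original post.

```python
import io
import numpy as np

# Sample rows from the question, kept in memory so the sketch is self-contained.
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

# Hypothetical column names, one per column of the sample data.
names = ["ionenergy", "units", "col2", "col3", "col4"]
y = np.genfromtxt(io.StringIO(DATA), delimiter=",", dtype=None,
                  names=names, encoding="utf-8")

# A structured array is one-dimensional: one record per row of the file.
print(y.shape)

# Two-dimensional indexing therefore fails, which is the error in the question.
try:
    y[:, 0]
except IndexError as exc:
    print("IndexError:", exc)

# Columns are selected by field name, rows by integer position.
ionenergy = y["ionenergy"]
first_row = y[0]
print(ionenergy)
print(first_row)
```

Because the array has a single axis of records, y[0] gives the first row and y['ionenergy'] gives the first column; there is no second axis for y[:, 0] to index.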