Python genfromtxt multiple datatypes

user2483176 · Oct 27, 2013

I would like to read in a csv file using genfromtxt. I have six columns that are float, and one column that is a string.

How do I set the datatype so that the float columns will be read in as floats and the string column will be read in as strings? I tried dtype='void' but that is not working.

Suggestions?

Thanks

.csv file

999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3



import sys
import numpy as np

x = sys.argv[1]
f = open(x, 'r')
y = np.genfromtxt(f, delimiter=',', dtype=[('f0', '<f8'), ('f1', 'S4'),
    ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])

ionenergy = y[:,0]
units = y[:,1]

Error:

ionenergy = y[:,0]
IndexError: invalid index

I don't get this error when I specify a single data type.

Answer

unutbu · Oct 27, 2013

dtype=None tells genfromtxt to guess the appropriate dtype.

From the docs:

dtype: dtype, optional

Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.

(my emphasis.)


Since your data is comma-separated, be sure to include delimiter=','. Otherwise np.genfromtxt splits on whitespace, sees every column except the last as containing a non-numeric character (the comma), and mistakenly assigns a string dtype to each of those columns.
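As a rough illustration of this point, the sketch below reads the sample rows from the question twice, once without and once with delimiter=','. Reading from io.StringIO instead of a file on disk and passing encoding="utf-8" are assumptions made for the example, as is the small helper kinds().

```python
import io
import numpy as np

# Sample rows from the question, held in memory instead of a file (assumption).
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

# No delimiter: fields are split on whitespace, so the commas stay attached
# to the values and those columns can only be read as strings.
no_delim = np.genfromtxt(io.StringIO(DATA), dtype=None, encoding="utf-8")

# With delimiter=',': the numeric columns are detected as numbers.
with_delim = np.genfromtxt(
    io.StringIO(DATA), dtype=None, delimiter=",", encoding="utf-8")


def kinds(a):
    """Return the dtype kind code of every field (hypothetical helper)."""
    return [a.dtype[name].kind for name in a.dtype.names]


no_delim_kinds = kinds(no_delim)
with_delim_kinds = kinds(with_delim)
print(no_delim_kinds)
print(with_delim_kinds)
```

The kind codes follow NumPy's convention: 'f' is float, 'i' is integer, and 'U' or 'S' is a string. Only the call with the delimiter should report numeric kinds for the first, third, and fourth columns.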

For example:

import numpy as np

arr = np.genfromtxt('data', dtype=None, delimiter=',')

print(arr.dtype)
# [('f0', '<f8'), ('f1', 'S4'), ('f2', '<i4'), ('f3', '<f8'), ('f4', '<f8')]

This shows the names and dtypes of each column. For example, ('f2', '<i4') means the third column has name 'f2' and is of dtype '<i4'. The i means it is an integer dtype. If you need the third column to be a float dtype then there are a few options.

  1. You could manually edit the data by adding a decimal point in the third column to force genfromtxt to interpret values in that column to be of a float dtype.
  2. You could supply the dtype explicitly in the call to genfromtxt:

    arr = np.genfromtxt(
        'data', delimiter=',',
        dtype=[('f0', '<f8'), ('f1', 'S4'), ('f2', '<f4'), ('f3', '<f8'), ('f4', '<f8')])
    

# Output for the array read with dtype=None (third column is integer):
print(arr)
# [(999.9, ' abc', 34, 78.0, 12.3) (1.3, ' ghf', 12, 8.4, 23.7)
#  (101.7, ' evf', 89, 2.4, 11.3)]

print(arr['f2'])
# [34 12 89]
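The second option can be checked with a short sketch. It assumes the sample rows are held in an io.StringIO rather than a file, uses 'U4' (text) in place of 'S4' (bytes) for the string column, and adds autostrip=True to trim the space after each comma. None of these three choices comes from the answer above; they are only there to keep the example self-contained on Python 3.

```python
import io
import numpy as np

# Sample rows from the question, held in memory instead of a file (assumption).
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

arr_explicit = np.genfromtxt(
    io.StringIO(DATA),
    delimiter=",",
    # Third column forced to a float dtype, as in option 2.
    dtype=[("f0", "<f8"), ("f1", "U4"), ("f2", "<f4"),
           ("f3", "<f8"), ("f4", "<f8")],
    autostrip=True,    # strip whitespace around each field (assumption)
    encoding="utf-8",
)

third = arr_explicit["f2"]
print(third.dtype)             # a float dtype rather than an integer one
print(arr_explicit["f1"])      # string column without the leading space
```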

The error message IndexError: invalid index is being generated by the line

ionenergy = y[:,0]

When you have mixed dtypes, np.genfromtxt returns a structured array. You need to read up on structured arrays because the syntax for accessing columns differs from the syntax used for plain arrays of homogeneous dtype.

Instead of y[:, 0], to access the first column of the structured array y, use

y['f0']

Or, better yet, supply the names parameter in np.genfromtxt, so you can use a more relevant column name, like y['ionenergy']:

import numpy as np
arr = np.genfromtxt(
    'data', delimiter=',', dtype=None,
    names=['ionenergy', 'foo', 'bar', 'baz', 'quux', 'corge'])

print(arr['ionenergy'])
# [ 999.9    1.3  101.7]
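To make the indexing difference concrete, here is a minimal sketch under a few assumptions: the data is read from an io.StringIO, the column names other than 'ionenergy' are placeholders chosen for this example (one per column of the sample), and autostrip=True and encoding="utf-8" are added for tidy text output.

```python
import io
import numpy as np

# Sample rows from the question, held in memory instead of a file (assumption).
DATA = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""

arr = np.genfromtxt(
    io.StringIO(DATA), delimiter=",", dtype=None,
    # Placeholder names, one per column of the sample (assumption).
    names=["ionenergy", "units", "a", "b", "c"],
    autostrip=True, encoding="utf-8")

# A structured array is one-dimensional: one record per row of the file.
print(arr.shape)

# Columns are selected by field name, rows by integer index.
ionenergy = arr["ionenergy"]
first_row = arr[0]
print(ionenergy)
print(first_row["units"])

# Two-dimensional indexing does not apply, which is the error in the question.
try:
    arr[:, 0]
    two_d_indexing_failed = False
except IndexError:
    two_d_indexing_failed = True
print(two_d_indexing_failed)
```

Because the array has only one axis, arr[:, 0] asks for a second axis that does not exist, whereas arr['ionenergy'] picks the field out of every record.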