NumPy dtype issues in genfromtxt(), reads string in as bytestring

Question 1

python numpy genfromtxt

user2489252 · Feb 22, 2014 · Viewed 12.6k times

Answer

In Python 2.7, all_data displays as

array([('ZINC00043096', 'C.3', 'C1', -0.154, 'methyl'),
       ('ZINC00043096', 'C.3', 'C2', 0.0638, 'methylene'),
       ('ZINC00043096', 'C.3', 'C4', 0.0669, 'methylene'),
       ('ZINC00090377', 'C.3', 'C7', 0.207, 'methylene')], 
      dtype=[('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'), ('f3', '<f8'), ('f4', 'S9')])

In Python 3, the same array displays as

array([(b'ZINC00043096', b'C.3', b'C1', -0.154, b'methyl'),
       (b'ZINC00043096', b'C.3', b'C2', 0.0638, b'methylene'),
       (b'ZINC00043096', b'C.3', b'C4', 0.0669, b'methylene'),
       (b'ZINC00090377', b'C.3', b'C7', 0.207, b'methylene')], 
      dtype=[('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'), ('f3', '<f8'), ('f4', 'S9')])

The 'regular' strings in Python 3 are unicode, but genfromtxt reads your text file as byte strings. all_data is the same in both cases (136 bytes); Python 3 simply displays a byte string as b'C.3', not just 'C.3'.

What kinds of operations do you plan on doing with these strings? 'ZIN' in all_data['f0'][1] works in Python 2.7, but in Python 3 you have to use b'ZIN' in all_data['f0'][1].
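If the byte strings are already in an array, one option is to decode the column after reading. The following is a minimal sketch rather than part of the original answer: the array f0 is a hand-built stand-in for all_data['f0'], and np.char.decode is just one way to do the conversion.

```python
import numpy as np

# Hand-built stand-in for the 'f0' column as genfromtxt returns it (dtype 'S12').
f0 = np.array([b'ZINC00043096', b'ZINC00090377'], dtype='S12')

# In Python 3 a membership test on a bytes element needs a bytes pattern.
has_prefix = b'ZIN' in f0[1]

# Decode the whole column to unicode in one step; the result has dtype '<U12'.
f0_text = np.char.decode(f0, 'ascii')

# After decoding, an ordinary str pattern works.
str_has_prefix = 'ZIN' in f0_text[1]
```

The decoded array uses 4 bytes per character, so this trades memory for convenience.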

The question "Variable/unknown length string/unicode dtype in numpy" reminds me that you can specify a unicode string type in the dtype. However, this becomes more complicated if you don't know the lengths of the strings beforehand.

alttype = np.dtype([('f0', 'U12'), ('f1', 'U3'), ('f2', 'U2'), ('f3', '<f8'), ('f4', 'U9')])
all_data_u = np.genfromtxt(csv_file, dtype=alttype, delimiter=',')

producing

array([('ZINC00043096', 'C.3', 'C1', -0.154, 'methyl'),
       ('ZINC00043096', 'C.3', 'C2', 0.0638, 'methylene'),
       ('ZINC00043096', 'C.3', 'C4', 0.0669, 'methylene'),
       ('ZINC00090377', 'C.3', 'C7', 0.207, 'methylene')], 
      dtype=[('f0', '<U12'), ('f1', '<U3'), ('f2', '<U2'), ('f3', '<f8'), ('f4', '<U9')])

In Python 2.7, each row of all_data_u displays as

(u'ZINC00043096', u'C.3', u'C1', -0.154, u'methyl')

all_data_u is 448 bytes, because numpy allocates 4 bytes for each unicode character. For example, each U4 item is 16 bytes long.
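The byte counts can be checked from the dtypes alone. This is a small sketch: the two dtypes are copied from the arrays shown earlier, and the row count of 4 matches the sample data.

```python
import numpy as np

# Record layouts copied from the two arrays: byte strings vs. unicode strings.
bytes_dtype = np.dtype([('f0', 'S12'), ('f1', 'S3'), ('f2', 'S2'),
                        ('f3', '<f8'), ('f4', 'S9')])
unicode_dtype = np.dtype([('f0', 'U12'), ('f1', 'U3'), ('f2', 'U2'),
                          ('f3', '<f8'), ('f4', 'U9')])

n_rows = 4  # number of rows in the sample file

# 'S' fields take 1 byte per character: 12 + 3 + 2 + 8 + 9 = 34 bytes per row.
bytes_total = bytes_dtype.itemsize * n_rows

# 'U' fields take 4 bytes per character: 48 + 12 + 8 + 8 + 36 = 112 bytes per row.
unicode_total = unicode_dtype.itemsize * n_rows

# A single U4 item: 4 characters * 4 bytes.
u4_size = np.dtype('U4').itemsize
```

Here bytes_total is 136 and unicode_total is 448, matching the sizes quoted for all_data and all_data_u.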

NumPy 1.14 added an encoding argument to genfromtxt and the other text I/O functions: https://docs.scipy.org/doc/numpy/release.html#encoding-argument-for-text-io-functions
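With NumPy 1.14 or later, passing an explicit encoding should let dtype=None produce unicode string columns directly. The sketch below assumes such a NumPy version and uses an in-memory io.StringIO buffer (an illustration, not the asker's csv_file) so it runs without a file on disk.

```python
import io
import numpy as np

# In-memory stand-in for the CSV file from the question.
csv_text = io.StringIO(
    "ZINC00043096,C.3,C1,-0.1540,methyl\n"
    "ZINC00043096,C.3,C2,0.0638,methylene\n"
    "ZINC00043096,C.3,C4,0.0669,methylene\n"
    "ZINC00090377,C.3,C7,0.2070,methylene\n"
)

# An explicit encoding (requires NumPy >= 1.14) makes the string columns
# come back as unicode ('U') instead of byte strings ('S').
all_data = np.genfromtxt(csv_text, dtype=None, delimiter=',', encoding='utf-8')

print(all_data.dtype)
print(all_data['f0'][0])
```

The string lengths are inferred from the data, so there is no need to spell out U12, U3 and so on by hand.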

Question 2

I want to read a standard ASCII CSV file, which consists of floats and strings, into NumPy.

E.g.,

ZINC00043096,C.3,C1,-0.1540,methyl
ZINC00043096,C.3,C2,0.0638,methylene
ZINC00043096,C.3,C4,0.0669,methylene
ZINC00090377,C.3,C7,0.2070,methylene
...

Whatever I tried, the string columns in the resulting array came back as byte strings. E.g.,

all_data = np.genfromtxt(csv_file, dtype=None, delimiter=',')


[(b'ZINC00043096', b'C.3', b'C1', -0.154, b'methyl')
 (b'ZINC00043096', b'C.3', b'C2', 0.0638, b'methylene')
 (b'ZINC00043096', b'C.3', b'C4', 0.0669, b'methylene')

However, I want to avoid the extra byte-string conversion step and was wondering how I can read the string columns in as regular strings directly.

I tried several things from the numpy.genfromtxt() documentation, e.g., dtype='S,S,S,f,S' or dtype='a25,a25,a25,f,a25', but nothing really helped here.

I am afraid I just don't understand how the dtype conversion really works. It would be nice if you could give me a hint here!

Thanks
