numpy recarray strings of variable length

mathematical.coffee · Feb 2, 2012

Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?

As a (contrived) example:

mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )

The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.

All my attempts result in the string field being truncated:

>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)], 
      dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'], 
      dtype='|S1')

(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)

Similarly, if I initialise with type (say) |S10 for file_name then things get truncated at length 10.

The only similar question I could find is this one, but this calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).

Is there any alternative other than initialising file_name with (e.g.) |S9999999999999 (i.e. some ridiculous upper limit)?

Answer

Toon Verstraelen · Feb 2, 2012

Instead of using the STRING dtype, one can always use object as the dtype. That allows any object to be assigned to an array element, including variable-length Python strings. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)], 
      dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

It is against the spirit of the array concept to have variable-length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined, regularly spaced addresses, which rules out variable-length elements. Storing pointers to the strings in the array, rather than the strings themselves, circumvents this limitation. (This is essentially what the example above does.)
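To make the pointer explanation concrete, here is a minimal sketch, assuming Python 3 and a reasonably recent NumPy. The names `records`, `fixed` and `names` and the file sizes are made up for illustration, and `'U10'` is used as the Python 3 counterpart of the `|S10` field from the question. It shows that an object field takes one pointer-sized slot per row however long the string is, that a fixed-width field truncates silently, and that the object column can be converted to a fixed-width one once all values are known.

```python
import numpy as np

# Structured array whose string field stores references to Python objects.
records = np.empty((2,), dtype=[('file_name', object), ('file_size_mb', float)])
records['file_name'][0] = 'foobarasdf.tif'
records['file_name'][1] = 'arghtidlsarbda.jpg'
records['file_size_mb'] = [1.5, 2.25]  # illustrative values only

# Each object slot is pointer-sized, independent of the string it refers to.
pointer_size = records.dtype['file_name'].itemsize

# A fixed-width field, by contrast, silently truncates longer strings.
fixed = np.empty((1,), dtype=[('file_name', 'U10'), ('file_size_mb', float)])
fixed['file_name'][0] = 'arghtidlsarbda.jpg'
truncated = fixed['file_name'][0]

# Once every name is known, the object column can be converted to a
# fixed-width string array; NumPy picks a width that fits the longest entry.
names = records['file_name'].astype(str)

print(pointer_size, repr(truncated), names.dtype)
```

The trade-off is that operations on an object field go through Python objects rather than a contiguous block of characters, so the fixed-width conversion at the end may be worth doing once the array is fully populated.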