What is the default dtype for str-like input in NumPy?

Isha Garg · Sep 5, 2017

I just wanted to confirm whether the default data type for strings is Unicode when creating an ndarray. I could not find any reference that states this clearly. Maybe it is too obvious and doesn't need stating.

When dtype is specified:

>>> import numpy as np
>>> g = np.array([['a', 'b'],['c', 'd']], dtype='S')
>>> g
array([[b'a', b'b'],
       [b'c', b'd']], 
      dtype='|S1')

Without specifying the dtype:

>>> g = np.array([['a', 'b'],['c', 'd']])
>>> g
array([['a', 'b'],
       ['c', 'd']], 
      dtype='<U1')

Also, what does the literal b indicate when dtype is specified? According to the documentation, it indicates bool, which doesn't seem to be the case here.

Can someone please clarify?

Answer

MSeifert · Sep 5, 2017

b'...' means the value is a byte string; the b prefix is Python's bytes literal syntax, not a dtype code. The default dtype for an array of strings depends on the kind of string you pass in: Unicode strings (Python 3 str is Unicode) get the dtype U, while Python 2 str and Python 3 bytes get the dtype S. The NumPy documentation on data type objects explains the dtype codes:

Array-protocol type strings

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

  • '?' boolean
  • 'b' (signed) byte
  • 'B' unsigned byte
  • 'i' (signed) integer
  • 'u' unsigned integer
  • 'f' floating-point
  • 'c' complex-floating point
  • 'm' timedelta
  • 'M' datetime
  • 'O' (Python) objects
  • 'S', 'a' zero-terminated bytes (not recommended)
  • 'U' Unicode string
  • 'V' raw data (void)
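
A minimal sketch of how the type strings listed above behave, assuming a recent NumPy on Python 3 (the variable names are only illustrative). It shows that the number after U counts characters while the number after S counts bytes, and that the dtype code 'b' has nothing to do with the b'...' prefix seen in the array output:

```python
import numpy as np

# 'U3' holds up to 3 Unicode characters; NumPy stores 4 bytes per character.
u3 = np.dtype('U3')
# 'S3' holds up to 3 bytes.
s3 = np.dtype('S3')
print(u3.kind, u3.itemsize)
print(s3.kind, s3.itemsize)

# As a dtype code, 'b' is a signed byte (int8) and '?' is boolean.
signed_byte = np.dtype('b')
boolean = np.dtype('?')
print(signed_byte, boolean)
```

Inspecting dtype.kind and dtype.itemsize this way is usually more reliable than reading the printed dtype string, since the byte-order prefix ('<', '>', '|') can differ between platforms.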

However, in your first case you explicitly forced NumPy to convert the strings to bytes by specifying dtype='S'.
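
To make the difference concrete, here is a small sketch (assuming Python 3; the variable names are illustrative) that compares the inferred dtype with the forced one and converts the bytes array back to Unicode:

```python
import numpy as np

data = [['a', 'b'], ['c', 'd']]

# No dtype given: Python 3 str input is inferred as Unicode ('U' kind).
inferred = np.array(data)
# dtype='S' given: the same input is converted to byte strings ('S' kind).
forced = np.array(data, dtype='S')
# bytes input with no dtype given is inferred as 'S' as well.
from_bytes = np.array([b'a', b'b'])
print(inferred.dtype.kind, forced.dtype.kind, from_bytes.dtype.kind)

# Converting back to Unicode; this works here because the data is plain ASCII.
back = forced.astype('U')
print(back.dtype.kind, back.tolist())
```

If the text may contain non-ASCII characters, it is probably safer to leave the array as Unicode than to force dtype='S'.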