How do I remove hex values in a python string with regular expressions?

moorepants · Mar 4, 2011

I have a cell array in MATLAB:

columns = {'MagX', 'MagY', 'MagZ', ...
           'AccelerationX',  'AccelerationY',  'AccelerationZ', ...
           'AngularRateX', 'AngularRateY', 'AngularRateZ', ...
           'Temperature'}

I use these scripts, which call MATLAB's hdf5write function, to save the array in HDF5 format.

I then read the HDF5 file into Python using PyTables. The cell array comes in as a NumPy array of strings. I convert it to a list, and this is the output:

>>> columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
 'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
 'MagZ\x00\x00\x00\x00\x001',
 'AccelerationX',
 'AccelerationY',
 'AccelerationZ',
 'AngularRateX',
 'AngularRateY',
 'AngularRateZ',
 'Temperature']

These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear in the first three items of the list, so I need a clean way to deal with them or to find out why they are there in the first place.

>>> print columns[0]
Mag8�
>>> columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>> repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>> print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

I've tried using regular expressions to remove the hex values but have had little luck:

>>> re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>> re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>> re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>> re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>> re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'

Any suggestions on how to deal with this?

Answer

Andrew Clark · Mar 4, 2011

You can remove all non-word characters in the following way:

>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

The regex [^\w] matches any character that is not a letter, digit, or underscore. Passing that regex to re.sub with an empty string as the replacement deletes every such character from the string.
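This first approach can be wrapped in a small reusable function. The sketch below assumes Python 3, whereas the sessions in this thread are Python 2, and the name strip_non_word is hypothetical. In Python 3, \w applied to a str also matches non-ASCII letters such as '\xe6', so the re.ASCII flag is used here to restrict \w to [a-zA-Z0-9_] and reproduce the result shown above.

```python
import re

# Hypothetical helper; not part of the original answer.
# re.ASCII limits \w to [a-zA-Z0-9_]; without it, Python 3 treats
# letters such as '\xe6' as word characters and keeps them.
_NON_WORD = re.compile(r'[^\w]', re.ASCII)

def strip_non_word(text):
    """Delete every character that is not an ASCII letter, digit, or underscore."""
    return _NON_WORD.sub('', text)

raw = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
print(strip_non_word(raw))          # MagX8
print(strip_non_word('Mag X_1'))    # MagX_1 (the space is removed too)
```

Note that the stray '8' survives because it is a digit, and that spaces are removed along with the control characters.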

Since there may be other characters you want to keep, such as spaces or punctuation, a better solution might be to specify the full range of characters to keep, chosen so that it excludes control characters. For example:

>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems clearer to you.
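One way to convince yourself that the two spellings describe the same character class is to run both over every code point from 0 to 255 and compare what survives. This is a quick sketch, assuming Python 3; the variable names are only illustrative.

```python
import re

hex_form = re.compile(r'[^\x20-\x7e]')   # range written with hex escapes
literal_form = re.compile(r'[^ -~]')     # same range written with the literal characters

# One string containing every code point from 0 to 255.
all_chars = ''.join(chr(i) for i in range(256))

kept_hex = hex_form.sub('', all_chars)
kept_literal = literal_form.sub('', all_chars)

print(kept_hex == kept_literal)   # True
print(len(kept_hex))              # 95 printable ASCII characters, space through ~
```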

To also drop everything that follows the first unwanted character, append .* to the pattern, like this:

>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
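Applied to a list like the one in the question, the truncating pattern could be compiled once and mapped over every name. This is a sketch, assuming Python 3; clean_name is a hypothetical helper, and re.DOTALL is an addition to the pattern above so that .* also swallows any newline that might follow the first unwanted character.

```python
import re

# Matches the first character outside printable ASCII and everything after it.
# re.DOTALL (an addition to the pattern above) lets '.' match newlines as well.
_TAIL = re.compile(r'[^ -~].*', re.DOTALL)

def clean_name(name):
    """Cut a string off at its first non-printable or non-ASCII character."""
    return _TAIL.sub('', name)

columns = [
    'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
    'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
    'MagZ\x00\x00\x00\x00\x001',
    'AccelerationX',
    'Temperature',
]

cleaned = [clean_name(c) for c in columns]
print(cleaned)
# ['MagX', 'MagY', 'MagZ', 'AccelerationX', 'Temperature']
```

Names that contain only printable ASCII pass through unchanged, so the same call is safe for every item in the list.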