Python UTF-8 XML parsing (SUDS): Removing 'invalid token'

FlipMcF · Jan 3, 2012

Here's a common error when dealing with UTF-8: 'invalid token'.

In my case it comes from a SOAP service provider that has no respect for Unicode characters. It simply truncates values to 100 bytes, ignoring that the 100th byte may fall in the middle of a multi-byte character. For example:

<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>

The last two bytes are what remains of a 3-byte Unicode character after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser, and:

xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)
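
The failure can be reproduced without the SOAP service. The sketch below is only an illustration: `truncate_bytes` and `parses_cleanly` are hypothetical helpers, and the short value stands in for the real 100-byte field. It cuts a UTF-8 value in the middle of a 3-byte character and hands the result to the SAX parser.

```python
import xml.sax

def truncate_bytes(text, limit):
    """Hypothetical stand-in for the provider: cut the UTF-8 bytes at a fixed length."""
    return text.encode("utf-8")[:limit]

def parses_cleanly(document):
    """Return True if the SAX parser accepts the document, False on a parse error."""
    try:
        xml.sax.parseString(document, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

value = u"ab\uff09"                    # U+FF09 encodes as the three bytes EF BC 89
truncated = truncate_bytes(value, 4)   # keeps 'ab' plus EF BC, drops the final 89

intact_doc = b"<name>" + value.encode("utf-8") + b"</name>"
broken_doc = b"<name>" + truncated + b"</name>"

print(parses_cleanly(intact_doc))   # untruncated value
print(parses_cleanly(broken_doc))   # value cut mid-character
```

The untruncated document parses, while the one ending in the orphaned `\xef\xbc` bytes is rejected by the parser.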

I don't care about this character anymore. It should be removed from the document so that the SAX parser can function.

Apart from these values, the XML reply is valid in every respect.

Question: How do you remove this character without parsing the entire document and re-inventing UTF-8 encoding to check every byte?

Using: Python+SUDS

Answer

FlipMcF · Jan 3, 2012

It turns out that SUDS sees the XML as type 'string' (not unicode), so at this point the values are still encoded bytes.

1) The FILTER:

badXML = "your bad utf-8 xml here"  # (type <str>)

# Turn it into a Python unicode string, ignoring errors so the bad bytes are kicked out.
# The error handler is passed positionally: str.decode() accepts keyword arguments only from Python 2.7 on.
decoded = badXML.decode('utf-8', 'ignore')  # (type <unicode>)

# Turn it back into a string, using UTF-8 encoding.
goodXML = decoded.encode('utf-8')   # (type <str>)
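
As a small, self-contained sketch of the same filter, the round trip can be wrapped in a function. The name `strip_invalid_utf8` and the sample values are made up for illustration, and the result shown assumes an interpreter whose 'ignore' handler drops only the invalid bytes.

```python
def strip_invalid_utf8(data):
    """Drop bytes that do not form valid UTF-8; valid input comes back unchanged."""
    return data.decode("utf-8", "ignore").encode("utf-8")

# A value truncated in the middle of a 3-byte character.
broken = b"<name>ab\xef\xbc</name>"
clean = strip_invalid_utf8(broken)

# Well-formed UTF-8 passes through untouched.
valid = u"<name>\u89c6\u9891</name>".encode("utf-8")
assert strip_invalid_utf8(valid) == valid

print(repr(clean))
```

Only the two orphaned bytes disappear; the closing tag stays intact, so the cleaned reply can go on to the parser.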

2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin

from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # context.reply is the raw reply text (type <str>), before it is SAX parsed
        decoded = context.reply.decode('utf-8', 'ignore')
        reencoded = decoded.encode('utf-8')
        context.reply = reencoded

and

from suds.client import Client
client = Client(WSDL_url, plugins=[UnicodeFilter()])

Hope this helps someone.


Note: Thanks to John Machin!

See: Why is python decode replacing more than the invalid bytes from an encoded string?

Python issue8271, concerning errors='ignore', can get in your way here. On a Python without the fix, 'ignore' also swallows the next few bytes to make up the length announced by the start byte. The fix is described as:

during the decoding of an invalid UTF-8 byte sequence, only the start byte and the continuation byte(s) are now considered invalid, instead of the number of bytes specified by the start byte

The issue was fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future releases of 2.7)
Python 3.1.3 rc1 (and all future releases of 3.x)

Python 2.5 and below still contain this issue.

In the example above, "\xef\xbc</name".decode('utf-8', errors='ignore') should return "</name", but in 'bugged' versions of Python it returns "/name".
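
One way to find out which behaviour an interpreter has is to run that same sample through the decoder. This is a minimal sketch; the function name `ignore_handler_is_fixed` is made up for illustration.

```python
def ignore_handler_is_fixed():
    """Return True if 'ignore' drops only the invalid bytes and keeps the '<'."""
    sample = b"\xef\xbc</name"
    return sample.decode("utf-8", "ignore") == u"</name"

if ignore_handler_is_fixed():
    print("decoder keeps the '<' after the truncated character")
else:
    print("decoder swallows the '<': filtering this way would damage the closing tag")
```

On an affected interpreter the filter would turn a truncated value into broken markup, so it is worth checking once before relying on it.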

The first four bits (0xe) describe a 3-byte UTF-8 character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed.

0x3c is not a valid continuation byte, which is what makes the 3-byte sequence invalid in the first place.

Fixed versions of Python remove only the start byte and any valid continuation bytes, leaving 0x3c unconsumed.
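
The byte-level reasoning can be checked by hand. The sketch below is illustrative only and is not how the decoder is implemented; `sequence_length` and `is_continuation` are hypothetical helpers that read the length announced by a start byte and test whether a byte has the 10xxxxxx continuation pattern.

```python
def sequence_length(start_byte):
    """Number of bytes announced by a UTF-8 start byte, or 0 if it cannot start a sequence."""
    if start_byte < 0x80:
        return 1                      # 0xxxxxxx: single-byte (ASCII)
    if 0xC2 <= start_byte <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= start_byte <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= start_byte <= 0xF4:
        return 4                      # 11110xxx
    return 0                          # continuation byte or invalid start byte

def is_continuation(byte):
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

data = bytearray(b"\xef\xbc</name")
announced = sequence_length(data[0])                          # 0xef announces 3 bytes
followers = [is_continuation(b) for b in data[1:announced]]   # checks 0xbc, then 0x3c

print(announced, followers)
```

0xbc passes the continuation test and 0x3c does not, so a correct decoder discards 0xef and 0xbc and resumes at '<'.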