Truncating unicode so it fits a maximum size when encoded for wire transfer

JasonSmith · Nov 27, 2009 · Viewed 7k times

Given a Unicode string and these requirements:

  • The string will be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escapes)
  • The encoded string must not exceed a maximum length

For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.

What is the best way to truncate the string so that the result still encodes to valid Unicode and displays reasonably correctly?

(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)

Answer

Denis Otkidach · Nov 30, 2009
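def unicode_truncate(s, length, encoding='utf-8'):
    # Cut the encoded bytes at the limit, possibly in the middle of a character.
    encoded = s.encode(encoding)[:length]
    # 'ignore' drops the incomplete trailing bytes left by the cut.
    return encoded.decode(encoding, 'ignore')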

Here is an example with a Unicode string in which each character takes 2 bytes in UTF-8, so a 5-byte limit cuts the third character in half and that character is dropped:

>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
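
For the JSON unicode-escape case mentioned in the question, slicing the encoded bytes would not be safe, because the cut could land inside a \uXXXX escape or remove the closing quote. One possible approach is to add up the escaped size of each character and stop before the limit is exceeded. The following is only a minimal sketch, assuming Python 3 and the standard json module with its default ensure_ascii=True; json_truncate is a hypothetical name, not part of the answer above, and the limit here covers just the quoted string literal, not a whole packet.

```python
import json

def json_truncate(s, max_bytes):
    """Return the longest prefix of s whose JSON string literal fits in max_bytes."""
    budget = max_bytes - 2  # reserve two bytes for the surrounding quotes
    used = 0
    end = 0
    for ch in s:
        # With ensure_ascii=True each character is escaped on its own,
        # so its cost is the length of its literal minus the two quotes.
        cost = len(json.dumps(ch)) - 2
        if used + cost > budget:
            break
        used += cost
        end += 1
    return s[:end]
```

With this sketch, json_truncate('абвгд', 20) keeps three characters, since each one is escaped to 6 bytes and the quotes take 2. For a real push payload, the budget would be whatever remains of the 256 bytes after the rest of the JSON structure is accounted for.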