Does Java String.getBytes("UTF-8") preserve lexicographical order?

Carsten · Aug 16, 2012 · Viewed 7.4k times

If I have a lexicographically sorted list of Java Strings [s1, s2, s3, s4, ..., sn] and convert each String into a byte array using UTF-8 encoding, bx = sx.getBytes("UTF-8"), is the list of byte arrays [b1, b2, b3, ..., bn] also lexicographically sorted?

Answer

Mechanical snail · Aug 16, 2012

Yes. According to RFC 3629:

The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers. Of course this is of limited interest since a sort order based on character numbers is almost never culturally valid.
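
To check this empirically, here is a minimal sketch; the class and method names are made up for illustration. It assumes "character numbers" means Unicode code points, and it compares bytes as unsigned values, since Java's byte type is signed. Note that String.compareTo compares UTF-16 code units, which can differ from code point order when supplementary characters are involved, so the sketch compares by code point explicitly.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
class Utf8OrderSketch {

    // Compare two strings by Unicode code point ("character number"),
    // not by UTF-16 code unit as String.compareTo does.
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Compare the UTF-8 encodings byte by byte, treating each byte as 0..255.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int bx = x[k] & 0xFF; // mask to undo Java's signed byte
            int by = y[k] & 0xFF;
            if (bx != by) {
                return Integer.compare(bx, by);
            }
        }
        return Integer.compare(x.length, y.length);
    }

    // Sort the same strings both ways and report whether the results match.
    static boolean ordersAgree(List<String> strings) {
        List<String> byCodePoint = new ArrayList<>(strings);
        List<String> byBytes = new ArrayList<>(strings);
        byCodePoint.sort(Utf8OrderSketch::compareByCodePoint);
        byBytes.sort(Utf8OrderSketch::compareUtf8Bytes);
        return byCodePoint.equals(byBytes);
    }

    public static void main(String[] args) {
        // Mix of 1-, 2-, 3- and 4-byte UTF-8 sequences, written as escapes.
        List<String> sample = Arrays.asList(
                "z", "a", "ab", "\u00E9", "\u20AC", "\uFF21", "\uD83D\uDE00");
        System.out.println("code point order == UTF-8 byte order: "
                + ordersAgree(sample));

        // U+FF21 versus U+1F600 (a surrogate pair in UTF-16).
        String bmp = "\uFF21";
        String supplementary = "\uD83D\uDE00";
        System.out.println("String.compareTo sign:   "
                + Integer.signum(bmp.compareTo(supplementary)));
        System.out.println("UTF-8 byte compare sign: "
                + Integer.signum(compareUtf8Bytes(bmp, supplementary)));
    }
}
```

If the byte arrays are compared with signed bytes instead (for example with a naive `x[k] - y[k]`), any non-ASCII character will sort before ASCII, so the unsigned mask matters in practice.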

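As Ian Roberts pointed out, this applies to "true UTF-8 (such as String.getBytes will give you)", but beware of the modified UTF-8 used by DataInputStream.readUTF and DataOutputStream.writeUTF. It encodes U+0000 as the two bytes C0 80 and each supplementary character as a pair of three-byte surrogates, so it sorts U+0000 after U+0001 and U+F000 after U+10FFFF.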
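
The two exceptions can be reproduced with a short sketch. The class and helper names are hypothetical; the sketch gets the modified UTF-8 bytes from DataOutputStream.writeUTF and drops the two-byte length prefix that writeUTF puts in front of the data.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class ModifiedUtf8Sketch {

    // Bytes as written by DataOutputStream.writeUTF, minus the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8, as produced by String.getBytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison treating each byte as 0..255.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int bx = x[k] & 0xFF;
            int by = y[k] & 0xFF;
            if (bx != by) {
                return Integer.compare(bx, by);
            }
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";           // U+0000
        String one = "\u0001";           // U+0001
        String privateUse = "\uF000";    // U+F000
        String last = "\uDBFF\uDFFF";    // U+10FFFF as a surrogate pair

        System.out.println("U+0000 vs U+0001, standard: "
                + Integer.signum(compareUnsigned(standardUtf8(nul), standardUtf8(one))));
        System.out.println("U+0000 vs U+0001, modified: "
                + Integer.signum(compareUnsigned(modifiedUtf8(nul), modifiedUtf8(one))));
        System.out.println("U+F000 vs U+10FFFF, standard: "
                + Integer.signum(compareUnsigned(standardUtf8(privateUse), standardUtf8(last))));
        System.out.println("U+F000 vs U+10FFFF, modified: "
                + Integer.signum(compareUnsigned(modifiedUtf8(privateUse), modifiedUtf8(last))));
    }
}
```

A positive sign in the "modified" lines and a negative sign in the "standard" lines means the two encodings disagree on the order of that pair.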