DataOutputStream: purpose of the "encoded string too long" restriction

Andremoniy · Mar 30, 2014

There is a strange restriction in the java.io.DataOutputStream.writeUTF(String str) method, which limits the size of a UTF-8 encoded string to 65535 bytes:

    if (utflen > 65535)
        throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");

It is strange, because:

  1. there is no information about this restriction in the Javadoc of this method
  2. the restriction can easily be worked around by copying and modifying the internal static int writeUTF(String str, DataOutput out) method of this class
  3. there is no such restriction in the opposite method, java.io.DataInputStream.readUTF().

Given the above, I cannot understand the purpose of such a restriction in the writeUTF method. What have I missed or misunderstood?

Answer

Erwin Bolwidt · Mar 30, 2014

The Javadoc of DataOutputStream.writeUTF states:

First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string.

Two bytes means 16 bits, and the largest unsigned integer that fits in 16 bits is 2^16 - 1 == 65535. DataInputStream.readUTF has the exact same restriction, because it first reads the number of UTF-8 bytes to consume as a 2-byte integer, which again can have a maximum value of only 65535.
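To see the boundary in practice, here is a minimal sketch. The class name WriteUtfLimitDemo and the helper canWrite are made up for illustration. It relies on the fact that each ASCII character other than NUL encodes to exactly one byte, so a string of n such characters has an encoded length of n bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

public class WriteUtfLimitDemo {

    // Hypothetical helper: returns true if writeUTF accepts a string of
    // n ASCII characters, false if it rejects it as too long.
    static boolean canWrite(int n) throws IOException {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');          // one encoded byte per character
        String s = new String(chars);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded length does not fit the 2-byte prefix.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(canWrite(65535)); // true: length fits in 16 bits
        System.out.println(canWrite(65536)); // false: one byte too many
    }
}
```

Note that the limit applies to the encoded byte count, not the character count, so a string with many non-ASCII characters hits it well before 65535 characters.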


writeUTF first writes two bytes with the length, which has the same result as calling writeShort with the length and then writing the UTF-encoded bytes. writeUTF doesn't actually call writeShort; it builds a single byte[] containing both the 2-byte length and the UTF bytes. That is why the Javadoc says "as if by the writeShort method" rather than just "by the writeShort method".
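The "as if by writeShort" layout can be checked with a short sketch. The class and method names below are illustrative only. One assumption matters: writeUTF uses Java's modified UTF-8, which matches standard UTF-8 only for strings without U+0000 or supplementary characters, so the comparison uses a sample string that avoids both.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteUtfLayoutDemo {

    // Bytes produced by writeUTF itself.
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Hand-built equivalent: 2-byte length, then the encoded bytes.
    // Assumes s has no U+0000 and no supplementary characters, so
    // standard UTF-8 and modified UTF-8 produce the same bytes.
    static byte[] viaWriteShort(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeShort(utf8.length);
            out.write(utf8);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String s = "h\u00e9llo";                 // 'é' takes two bytes
        byte[] a = viaWriteUTF(s);
        byte[] b = viaWriteShort(s);
        System.out.println(Arrays.equals(a, b)); // true
        System.out.println(a.length);            // 8: 2-byte prefix + 6 encoded bytes
    }
}
```

The first two bytes of the output are 0x00 0x06, the encoded length as a big-endian 16-bit value, which is exactly the field that caps the payload at 65535 bytes.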