Delphi Unicode String Length in Bytes

Jessica Brown · May 13, 2013

I'm working on porting some Delphi 7 code to XE4, so Unicode is the subject here.

I have a method where a string gets written to a TMemoryStream, so according to this Embarcadero article, I should multiply the length of the string (in characters) by the size of the Char type to get the length in bytes that WriteBuffer expects for its count parameter.

So, before:

rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));

After:

rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, that some corner-case foreign characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicion: "In fact, it isn’t even always true that one Char is equal to two bytes!"
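For example, if I understand this right, a single character outside the BMP needs two Chars, i.e. 4 bytes. A quick sketch of the kind of case I'm worried about (the procedure name and the G clef code point are just my own example):

procedure ShowSurrogatePairSize;
var
  clef: string;
begin
  // U+1D11E (musical symbol G clef) lies outside the BMP, so UTF-16
  // stores it as a surrogate pair of two 16-bit elements
  clef := #$D834#$DD1E;
  Writeln(Length(clef));                 // 2 Chars
  Writeln(Length(clef) * SizeOf(Char));  // 4 bytes
end;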

So... that leaves me wondering: is Length(rawHtml) * SizeOf(Char) really robust enough to be consistently accurate, or is there a better way to determine the size of the string in bytes?

Answer

David Heffernan · May 13, 2013

Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable-length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed-length Unicode encoding is UTF-32. The UTF-16 encoding uses 16-bit character elements, hence the name.

In a Unicode Delphi, Char is an alias for WideChar, which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.
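If you ever do need the number of code points rather than the number of character elements, you have to walk the string and count each surrogate pair as one unit. A minimal sketch, assuming the System.Character unit (the CodePointCount name is mine, just for illustration):

uses
  System.Character;

// Counts code points: a surrogate pair is two character elements
// but represents a single code point
function CodePointCount(const s: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    if TCharacter.IsHighSurrogate(s[i]) and (i < Length(s)) and
       TCharacter.IsLowSurrogate(s[i + 1]) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

For working out the buffer size, though, you don't need any of this, as the next paragraph explains.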

So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements and not the number of code points. The character elements all have the same size. So

memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

is correct.
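
To see the byte count line up, here is a minimal sketch (the program and variable names are my own, using XE4's scoped unit names) that writes a string containing a surrogate pair and checks the stream size:

program WriteBufferDemo;
{$APPTYPE CONSOLE}

uses
  System.Classes;

var
  rawHtml: string;
  ms: TMemoryStream;
begin
  // 'abc' plus U+1D11E (stored as a surrogate pair) = 5 character elements
  rawHtml := 'abc' + #$D834#$DD1E;
  ms := TMemoryStream.Create;
  try
    ms.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));
    Writeln(ms.Size);  // 10: 5 elements * 2 bytes per element
  finally
    ms.Free;
  end;
end.

The byte count depends only on the number of Chars in the string, not on how many code points they happen to represent.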