Length() vs Sizeof() on Unicode strings

delphi delphi-xe8

ZzZombo · Jun 3, 2015 · Viewed 7.7k times · Source

Answer

Length returns the number of elements when considering the string as an array.

For strings with 8-bit elements (ANSI, UTF-8), Length gives you the number of bytes, since the number of bytes is the same as the number of elements.
For strings with 16-bit elements (UTF-16), Length is half the number of bytes, because each element is 2 bytes wide.

Your string '1¢' has two code points, but the second code point requires two bytes to encode it in UTF-8. Hence Length(Utf8String('1¢')) evaluates to three.
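
As a quick check of the element-count view, here is a minimal sketch that holds the same text as UTF-16 and as UTF-8 and prints Length for each. The program and variable names are illustrative only, and the cent sign is written as #$00A2 so that the result does not depend on the encoding of the source file.

```delphi
program LengthDemo;

{$APPTYPE CONSOLE}

var
  Wide: UnicodeString;  // 16-bit elements (UTF-16)
  Narrow: UTF8String;   // 8-bit elements (UTF-8)
begin
  Wide := '1'#$00A2;           // '1¢': two code points
  Narrow := UTF8String(Wide);  // re-encoded as UTF-8: $31, $C2, $A2

  Writeln(Length(Wide));    // 2 elements of 2 bytes each
  Writeln(Length(Narrow));  // 3 elements of 1 byte each
end.
```

Both calls count elements. Only the element width differs.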

You mention SizeOf in the question title. Passing a string variable to SizeOf always returns the size of a pointer, since a string variable is, under the hood, just a pointer. (The legacy ShortString type is the exception, because it is stored inline rather than on the heap.)
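
A small sketch to illustrate the point about SizeOf, assuming a default string (UnicodeString) variable. The names are made up for the example.

```delphi
program SizeOfDemo;

{$APPTYPE CONSOLE}

var
  S: string;
begin
  S := 'clearly more text than fits in a pointer';

  // SizeOf looks only at the variable, which is a reference to the payload.
  Writeln(SizeOf(S));        // 4 on 32-bit targets, 8 on 64-bit targets
  Writeln(SizeOf(Pointer));  // same value as the line above

  // Length looks at the payload and reports its element count.
  Writeln(Length(S));
end.
```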

To your specific questions:

Why the difference in handling is there at all?

There is only a difference if you think of Length as relating to bytes. But that's the wrong way to think about it. Length always returns an element count, and when viewed that way, the behaviour is uniform across all string types, and indeed across all array types.

Why Length() doesn't do what it's expected to do, return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?

It does always return the element count. It just so happens that when the element size is a single byte, the element count and the byte count are the same. In fact, the documentation that you refer to also contains the following just above the excerpt that you provided: "Returns the number of characters in a string or of elements in an array." That is the key text. The excerpt that you included is meant as an illustration of the implications of that sentence.

Why does it state it divides the result by 2 for Unicode (UTF-16) strings? AFAIK UTF-16 is 4-byte at most, and thus this will give incorrect results.

UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are known as surrogate pairs.

You are hoping, I think, that Length will return the number of code points in a string. But it doesn't. It returns the number of character elements. And for variable-length encodings, the number of code points is not necessarily the same as the number of character elements. If your string were encoded as UTF-32, the number of code points would be the same as the number of character elements, since UTF-32 is a fixed-width encoding.
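
To make the distinction concrete, here is a hedged sketch using U+1F600, a code point outside the Basic Multilingual Plane that I picked purely as an example. It is written as its two surrogate halves so that nothing depends on the source file encoding.

```delphi
program SurrogateDemo;

{$APPTYPE CONSOLE}

var
  S: string;
begin
  // U+1F600 is encoded in UTF-16 as the surrogate pair $D83D $DE00.
  S := #$D83D#$DE00;

  Writeln(Length(S));                 // 2 elements, although it is 1 code point
  Writeln(Length(S) * SizeOf(Char));  // 4 bytes
end.
```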

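A quick way to count the code points is to scan through the string checking for surrogate pairs. Each surrogate pair counts as one code point, and each character element that is not part of a surrogate pair also counts as one code point. The loop below sees one element at a time, so it adds one for each half of a surrogate pair and two for every other element, then halves the total. In pseudo-code:

// Needs System.Character in the uses clause for C.IsSurrogate.
N := 0;
for C in S do
  if C.IsSurrogate then
    Inc(N)      // one half of a surrogate pair; the pair adds 2 in total
  else
    Inc(N, 2);  // a stand-alone element is a whole code point
CodePointCount := N div 2;  // assumes S is well-formed UTF-16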

Another point to make is that the code point count is not the same as the visible character count. Some code points are combining characters and are combined with their neighbouring code points to form a single visible character or glyph.
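
For instance, a plain "e" followed by U+0301 (combining acute accent) is displayed as a single accented letter. A minimal sketch, with illustrative names:

```delphi
program CombiningDemo;

{$APPTYPE CONSOLE}

var
  S: string;
begin
  // 'e' + U+0301 COMBINING ACUTE ACCENT renders as one glyph.
  S := 'e'#$0301;

  Writeln(Length(S));  // 2 elements and 2 code points, but 1 visible character
end.
```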

Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:

Length(S) * SizeOf(S[1])

This expression works for all types of string.
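
Here is a short sketch of that expression applied to two string types. The variable names are only for illustration. SizeOf is evaluated at compile time, so S[1] is never actually read, and the expression is safe even for an empty string.

```delphi
program PayloadSizeDemo;

{$APPTYPE CONSOLE}

var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'abc';
  U := 'abc';

  Writeln(Length(A) * SizeOf(A[1]));  // 3: three 1-byte elements
  Writeln(Length(U) * SizeOf(U[1]));  // 6: three 2-byte elements
end.
```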

Be very careful with the function System.SysUtils.ByteLength. On the face of it this seems to be just what you want. However, its parameter is a UTF-16 string, and it returns the byte length of that UTF-16 encoded string. So if you pass it an AnsiString, say, the argument is first converted to UTF-16, and the value returned by ByteLength is twice the number of bytes of the AnsiString (assuming one byte per character).
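
A sketch of the pitfall, assuming an AnsiString that contains only ASCII text. Expect the compiler to warn about the implicit string conversion at the ByteLength call.

```delphi
program ByteLengthDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  A: AnsiString;
begin
  A := 'abc';

  // A is implicitly converted to a UTF-16 string before ByteLength sees it.
  Writeln(ByteLength(A));             // 6: size of the converted UTF-16 copy
  Writeln(Length(A) * SizeOf(A[1]));  // 3: actual payload size of A
end.
```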

Question

Quoting the Delphi XE8 help:

For single-byte and multibyte strings, Length returns the number of bytes used by the string. Example for UTF-8:
   Writeln(Length(Utf8String('1¢'))); // displays 3
For Unicode (WideString) strings, Length returns the number of bytes divided by two.

This raises important questions:

Why the difference in handling is there at all?
Why Length() doesn't do what it's expected to do, return just the length of the parameter (as in, the count of elements) instead of giving the size in bytes in some cases?
Why does it state it divides the result by 2 for Unicode (UTF-16) strings? AFAIK UTF-16 is 4-byte at most, and thus this will give incorrect results.
