How do I create a string with a surrogate pair inside of it?

michael picture michael · Jan 15, 2013 · Viewed 8.1k times · Source

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?

Answer

Jeppe Stig Nielsen picture Jeppe Stig Nielsen · Jan 15, 2013

The simplest way is to use \U######## where the U is capital, and the # denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:

string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";

You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.

If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:

string fourCircles = char.ConvertFromUtf32(0x1F01C);

Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the charater directly in the file (by copy-paste). For example:

string myString = "In the game of mahjong 🀜 denotes the Four of circles";

The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.

(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)