How can I store UTF-16 characters in a Postgres database?

.net postgresql encoding utf-16 surrogate-pairs

Mr Shoubs · Dec 9, 2011 · Viewed 9.5k times · Source

I am trying to store some text (e.g. č) in a Postgres database, but when I retrieve the value it appears on screen as ?. I'm not sure why. I was under the impression that it was a character supported in UTF-16 but not in UTF-8; judging by the first answer, that assumption is incorrect.

Original question (which may still be valid):

I have read about UTF-8 surrogate pairs, which may achieve what I require, and I've seen a few examples involving the StringInfo object and TextElementEnumerator, but I couldn't work out a practical proof of concept.

Can someone provide an example of how you would write and read UTF-16 (probably using this surrogate pair concept) to a Postgres database? Thank you.

Updated question: Why would the č character be returned from the database as a question mark?

We use Npgsql to access the database, from VB.Net.

Answer

There's no such thing as a character which exists in UTF-16 but not UTF-8. Both are capable of encoding all of Unicode. In other words, if you can get UTF-8 to work, it should be able to store any valid Unicode text.
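As a quick illustration of that point, here is a minimal sketch in Python (not the VB.Net/Npgsql code from the question) showing that č, which is U+010D, has a valid encoding in both UTF-8 and UTF-16 and survives a round trip through either. The variable names are only for illustration.

```python
text = "č"  # U+010D, LATIN SMALL LETTER C WITH CARON

# Encode the same character in both encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes.hex())   # c48d  (two bytes in UTF-8)
print(utf16_bytes.hex())  # 0d01  (one 16-bit code unit in UTF-16)

# Both encodings decode back to the original text without loss.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

So a database using UTF-8 can hold this character just as well as a UTF-16 string can.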

EDIT: Surrogate pairs are actually a feature of UTF-16 rather than UTF-8. They allow a character which isn't in the Basic Multilingual Plane (BMP) to be represented as two UTF-16 code units. Basically, UTF-16 is often treated as a fixed-width encoding (exactly two bytes per Unicode character), but that only allows the BMP to be encoded cleanly. Surrogate pairs are a (fairly hacky) way of extending the range beyond the BMP.
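To make the surrogate pair idea concrete, here is a small Python sketch (again, not .NET code). The helper names `utf16_code_units` and `to_surrogate_pair` are made up for this example. It shows that č fits in a single UTF-16 code unit, while a character outside the BMP such as U+1F600 needs two.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# č is inside the BMP: one code unit, no surrogates involved.
print([hex(u) for u in utf16_code_units("č")])           # ['0x10d']

# U+1F600 is outside the BMP: two code units (a surrogate pair).
print([hex(u) for u in utf16_code_units("\U0001F600")])  # ['0xd83d', '0xde00']
print([hex(u) for u in to_surrogate_pair(0x1F600)])      # ['0xd83d', '0xde00']
```

Since č needs only one code unit, surrogate pairs are not involved in storing it.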

I very much doubt that the character you're trying to represent is outside the BMP, so I suspect you need to look elsewhere for the problem. In particular, it's worth dumping the exact character values of the text (e.g. by casting each char to int) before it goes into the database and after you've fetched it. Ideally, do this in a short but complete console app.
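The diagnostic suggested above can be sketched as follows, in Python rather than VB.Net. The helper `dump_code_points` is a made-up name, and the Latin-1 conversion is only a stand-in for a hypothetical lossy step somewhere between the application and the database. It is an assumption for illustration, not a diagnosis of the actual Npgsql setup.

```python
def dump_code_points(text):
    """List each character's code point, e.g. 'U+010D' for č."""
    return ["U+%04X" % ord(ch) for ch in text]


before = "č"

# Hypothetical lossy step: pass the text through an encoding that has no č.
# Latin-1 is used here purely as an example; errors="replace" substitutes "?".
after = before.encode("latin-1", errors="replace").decode("latin-1")

print(dump_code_points(before))  # ['U+010D']
print(dump_code_points(after))   # ['U+003F'], a literal question mark
```

If the dump after fetching shows U+003F, the character was replaced before or during storage or retrieval. If it still shows U+010D, the data is intact and the question mark is likely coming from the display side (font or console encoding).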
