Detect encoding of byte array C#

Kiquenet · Oct 22, 2013 · Viewed 21.3k times

Is there any way to determine a byte array's encoding in C#?

I have an arbitrary string, like "Lorem ipsum áéíóú ñÑç", and I get byte arrays from it using several encodings.

I would like a single method that detects the encoding of a byte array, so that I can get the original string value back.

A related problem: I may have a database column that stores BLOBs (byte arrays). One application stores a string converted to a byte array using UTF-8, while another application may convert its strings using the Unicode (UTF-16) encoding.

As a result, a single database column contains byte arrays in several encodings. It would be very useful to detect a byte array's encoding, so I need a way to find it.

Tests:

using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class EncodingTests // class name is illustrative
{
    string DataXmlForSupport = "<support><machinename></machinename><comments>Este es el log 1 áéíóú</comments></support>";
    string DataXmlForSupport2 = "Lorem ipsum áéíóú ñÑç";

    [TestMethod]
    public void Encoding_byte_array_string()
    {
        // Encode as UTF-16 (Encoding.Unicode): only a UTF-16 decode round-trips.
        var uencoding = new System.Text.UnicodeEncoding();
        byte[] data = uencoding.GetBytes(DataXmlForSupport);

        var dataXml = Encoding.Unicode.GetString(data);
        Assert.AreEqual(DataXmlForSupport, dataXml, "Expected Unicode results");

        dataXml = Encoding.UTF8.GetString(data);
        Assert.AreNotEqual(DataXmlForSupport, dataXml, "Did NOT expect UTF8 results");

        // Encode as UTF-8: only a UTF-8 decode round-trips.
        var utf8 = new System.Text.UTF8Encoding();
        data = utf8.GetBytes(DataXmlForSupport2);

        dataXml = Encoding.UTF8.GetString(data);
        Assert.AreEqual(DataXmlForSupport2, dataXml, "Expected UTF8 results");

        dataXml = Encoding.Unicode.GetString(data);
        Assert.AreNotEqual(DataXmlForSupport2, dataXml, "Did NOT expect Unicode results");
    }
}

Answer

David Arno · Oct 22, 2013

In short, no. Please see "How to detect the character encoding of a text file?" for a detailed answer on the various encodings and why they can't be determined automatically.
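To illustrate why the bytes alone are ambiguous, here is a minimal sketch showing the same byte array decoding without error under two different encodings. The class and method names (`EncodingAmbiguityDemo`, `DecodeBothWays`) are hypothetical, and ISO-8859-1 is just one convenient example of a single-byte encoding that accepts any byte sequence.

```csharp
using System.Text;

public static class EncodingAmbiguityDemo // hypothetical name, for illustration only
{
    // Decodes the same bytes as UTF-8 and as ISO-8859-1 (Latin-1).
    // Neither call throws, so a successful decode does not identify the encoding.
    public static string[] DecodeBothWays(byte[] data)
    {
        string asUtf8 = Encoding.UTF8.GetString(data);

        // ISO-8859-1 maps every possible byte value to exactly one character,
        // so it "succeeds" on any input, including UTF-8 data.
        string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(data);

        return new[] { asUtf8, asLatin1 };
    }
}
```

Calling `DecodeBothWays(Encoding.UTF8.GetBytes("Lorem ipsum áéíóú ñÑç"))` should give back the original text in the first element and a different, longer string in the second (one character per byte). Both are legal strings; only outside knowledge tells you which one was intended.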

Your best solution is to convert the string from its original encoding to UTF-8 and convert that to a byte array. Then you'll know your byte array's encoding...
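One way this suggestion might look in code, as a rough sketch rather than a definitive implementation: every writer normalizes to UTF-8 before storing, and every reader decodes as UTF-8. The `Utf8Blob` class and its method names are hypothetical. It assumes each writing application knows its own source encoding at the moment it writes.

```csharp
using System.Text;

public static class Utf8Blob // hypothetical helper, for illustration only
{
    // Strict UTF-8: no BOM is emitted, and invalid bytes throw
    // instead of being silently replaced with U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Use when the application holds the text as a string.
    public static byte[] FromString(string text)
    {
        return StrictUtf8.GetBytes(text);
    }

    // Use when the application holds bytes in a KNOWN source encoding.
    public static byte[] ToUtf8(byte[] data, Encoding sourceEncoding)
    {
        return Encoding.Convert(sourceEncoding, StrictUtf8, data);
    }

    // Readers always decode as UTF-8; bytes that are not valid UTF-8 throw.
    public static string ToText(byte[] utf8Bytes)
    {
        return StrictUtf8.GetString(utf8Bytes);
    }
}
```

For example, an application that currently stores `Encoding.Unicode` bytes could call `Utf8Blob.ToUtf8(data, Encoding.Unicode)` before writing to the BLOB column. The strict decoder is a design choice: a row written in another encoding is then more likely to fail loudly on read than to produce garbled text, although some non-UTF-8 data (pure ASCII, for instance) will still decode without error.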