Storing a string as UTF8 in C#

PhantomDrummer picture PhantomDrummer · Aug 27, 2012 · Viewed 27.9k times · Source

I'm doing a lot of string manipulation in C#, and really need the strings to be stored one byte per character. This is because I need gigabytes of text simultaneously in memory and it's causing low memory issues. I know for certain that this text will never contain non-ASCII characters, so for my purposes, the fact that System.String and System.Char store everything as two bytes per character is both unnecessary and a real problem.

I'm about to start coding my own CharAscii and StringAscii classes - the string one will basically hold its data as byte[], and expose string manipulation methods similar to the ones that System.String does. However this seems a lot of work to do something that seems like a very standard problem, so I'm really posting here to check that there isn't already an easier solution. Is there for example some way I can make System.String internally store data as UTF8 that I haven't noticed, or some other way round the problem?

Answer

KeithS picture KeithS · Aug 27, 2012

Well, you could create a wrapper that retrieves the data as UTF-8 bytes and converts pieces as needed to System.String, then vice-versa to push the string back out to memory. The Encoding class will help you out here:

var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(myString);

var myReturnedString = utf8.GetString(utfBytes);