Converting contents of a byte array to wchar_t*

Christopher MacKinnon · Dec 14, 2012 · Viewed 11.8k times

I seem to be having an issue converting a byte array (containing the text from a Word document) to an LPTSTR (wchar_t *). Every time the code executes, I get back a bunch of unwanted Unicode characters.

I figure it is because I am not making the proper calls somewhere, or not using the variables properly, but not quite sure how to approach this. Hopefully someone here can guide me in the right direction.

The first thing that happens is we call into C# code to open up Microsoft Word and convert the text in the document into a byte array.

byte document __gc[];
document = word->ConvertToArray(filename);

The contents of document are as follows:

{84, 101, 115, 116, 32, 68, 111, 99, 117, 109, 101, 110, 116, 13, 10}

Which ends up being the string "Test Document" followed by a CR/LF pair (the trailing 13, 10).

Our next step is to allocate the memory to store the byte array in an LPTSTR variable:

byte __pin * value;

value = &document[0];

LPTSTR image;
image = (LPTSTR)malloc( document->Length + 1 );

Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:

췍췍췍췍췍췍췍췍﷽﷽����˿於潁

And then we do a memcpy() to copy over all of the data:

memcpy(image,value,document->Length);

Which just causes more unwanted Unicode characters to appear:

敔瑳䐠捯浵湥൴촊﷽﷽����˿於潁

I figure the issue that we are having is either related to how we are storing the values in the byte array, or possibly when we are copying the data from the byte array to the LPTSTR variable. Any help with explaining what I'm doing wrong, or anything to point me in the right direction will be greatly appreciated.

Answer

bames53 · Dec 15, 2012

First you should learn something about text data and how it's represented. A reference that will get you started there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

byte is just a typedef (or something similar) for char or unsigned char, so the byte array holds the string in some char-based encoding. You need to actually convert from that encoding, whatever it is, into UTF-16, which is what Windows uses for wchar_t. Here's the typical method recommended for doing such conversions on Windows:

// First call: NULL output buffer, so the return value is the required
// size in wchar_t units. Passing -1 as the source length tells the API
// the input is null-terminated, and the result includes the terminator.
int output_size = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)value, -1, NULL, 0);
assert(0 < output_size);
wchar_t *converted_buf = new wchar_t[output_size];
// Second call: perform the actual conversion into the buffer.
int size = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)value, -1, converted_buf, output_size);
assert(output_size == size);

We call the function MultiByteToWideChar() twice, once to figure out how large of a buffer is needed to hold the result of the conversion, and a second time, passing in the buffer we allocated, to do the actual conversion.

CP_ACP specifies the source encoding, and you'll need to check the API documentation to figure out what that value really should be. CP_ACP stands for the 'ANSI code page', which is Microsoft's way of saying 'the encoding set for "non-Unicode" programs.' The API may be using something else, like CP_UTF8 (we can hope) or 1252 or something.

You can view the rest of the documentation on MultiByteToWideChar here to figure out the other arguments.


Once we execute the line where we start allocating the memory, our image variable gets filled with a bunch of unwanted Unicode characters:

When you call malloc() the memory given to you is uninitialized and just contains garbage. The values you see before initializing it don't matter and you simply shouldn't use that data. The only data that matters is what you fill the buffer with. The MultiByteToWideChar() code above will also automatically null terminate the string so you won't see garbage in unused buffer space (and the method we use of allocating the buffer will not leave any extra space). Note also that malloc( document->Length + 1 ) allocates that many bytes, not wchar_t units, so even a correct conversion would overflow that buffer; sizes for wide-character buffers must be multiplied by sizeof(wchar_t).


The above code is not actually very good C++ style; it's just typical usage of the C-style API provided by Win32. The way I prefer to do conversions (if I'm forced to) is more like:

// converter object; construction is cheap, but it can be saved somewhere
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

// from_bytes() takes char-based input, so a byte* needs a cast
std::wstring output = convert.from_bytes(reinterpret_cast<const char*>(value));

(Assuming the char encoding being used is UTF-8. You'll have to use a different codecvt facet for any other encoding.)