C++ substring with multi-byte characters

W. Goeman · Jun 1, 2012

I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, each of these characters is counted as 2 characters. In my opinion I should be using a wstring instead, because it would store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not help, because the characters remain split over 2 elements. This only makes it worse.

Is there a good way to convert a string to a wstring that merges each special character into 1 element instead of 2?

Thanks

Answer

eugene · Aug 14, 2012

A simpler version, based on the solution Marcelo Cantos provided in "Getting the actual length of a UTF-8 encoded std::string?". It assumes the string is UTF-8 encoded and returns its first maxLength characters:

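#include <string>

// Returns the first maxLength characters (code points) of originalString.
// Assumes originalString holds valid UTF-8.
std::string substr(const std::string& originalString, int maxLength)
{
    int len = 0;                           // code points seen so far
    std::string::size_type byteCount = 0;  // bytes belonging to the code points kept

    for (; byteCount < originalString.size(); ++byteCount)
    {
        // Every byte except a continuation byte (10xxxxxx) starts a new code point.
        if ((originalString[byteCount] & 0xc0) != 0x80)
            len += 1;

        // The first byte of code point maxLength + 1 marks where to cut.
        if (len > maxLength)
            break;
    }

    return originalString.substr(0, byteCount);
}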
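
The function above only cuts a prefix. If a start position is needed as well, the same continuation-byte test can be reused. The following is a minimal sketch, not part of the original answer: the names utf8_offset and utf8_substr are hypothetical, and the code assumes the input is valid UTF-8 and counts code points (not grapheme clusters).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset at which the code point with index cpIndex
// starts, or s.size() if s contains fewer code points. Assumes valid UTF-8.
std::size_t utf8_offset(const std::string& s, std::size_t cpIndex)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Any byte that is not a continuation byte (10xxxxxx) starts a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
        {
            if (seen == cpIndex)
                return i;
            ++seen;
        }
    }
    return s.size();
}

// Hypothetical helper: up to `count` code points, starting at code point `start`.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count)
{
    // A string cannot hold more code points than bytes, so clamping to
    // s.size() keeps start + count from overflowing.
    if (start > s.size()) start = s.size();
    if (count > s.size()) count = s.size();

    const std::size_t first = utf8_offset(s, start);
    const std::size_t last = utf8_offset(s, start + count);
    return s.substr(first, last - first);
}
```

For example, with s set to "h\xC3\xA9llo" (the UTF-8 bytes of "héllo"), utf8_substr(s, 1, 3) yields the four bytes of "éll", whereas s.substr(1, 3) would cut on raw bytes instead of characters.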