How to remove accents and tilde in a C++ std::string

Alejo · Sep 28, 2008 · Viewed 21k times

I have a problem with a string in C++ that contains several words in Spanish, so it has many words with accents and tildes. I want to replace those characters with their unaccented counterparts. For example, I want to replace the word "había" with "habia". I tried to replace them directly with the replace method of the string class, but I could not get that to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and the toReplace string is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I am obviously missing something, but I can't figure out what. Is there any library I can use?

Thanks,

Answer

MSalters · Sep 29, 2008

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea: "naïve" matches "Naïve" matches "naive" matches "NAİVE". (You do know that the uppercase of i is İ in Turkish? That's why you ignore accents.)

Now, the best algorithm is hinted at in the approved answer: use NFKD (compatibility decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

There is little point in recomposing afterwards, though. You have removed most sequences that would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
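
As a rough illustration of the decompose-then-strip idea, here is a minimal sketch. It makes several assumptions: the input is well-formed UTF-8, and the decomposition table (kDecomp) is a tiny hand-written subset covering a few Spanish and French letters. It is not real Unicode normalization. The helper names (decodeUtf8, appendUtf8, decompose, stripAccents) are made up for this example. In real code you would probably let a Unicode library such as ICU do the decomposition and keep only the final filtering step.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Assumes well-formed input (no validation).
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i++]);
        int extra = 0;                            // number of continuation bytes
        if ((lead >> 5) == 0x06) extra = 1;       // 110xxxxx
        else if ((lead >> 4) == 0x0E) extra = 2;  // 1110xxxx
        else if ((lead >> 3) == 0x1E) extra = 3;  // 11110xxx
        char32_t cp = extra ? (lead & (0x3F >> extra)) : lead;
        for (int k = 0; k < extra && i < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Append one code point to a string as UTF-8.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// ASSUMPTION: a hand-picked subset of Unicode decompositions, for
// illustration only. A real implementation would use the full tables.
struct Decomp { char32_t composed, base, mark; };
const Decomp kDecomp[] = {
    {0x00E1, U'a', 0x0301},  // á = a + combining acute
    {0x00E9, U'e', 0x0301},  // é
    {0x00ED, U'i', 0x0301},  // í
    {0x00F3, U'o', 0x0301},  // ó
    {0x00FA, U'u', 0x0301},  // ú
    {0x00F1, U'n', 0x0303},  // ñ = n + combining tilde
    {0x00EF, U'i', 0x0308},  // ï = i + combining diaeresis
    {0x00FC, U'u', 0x0308},  // ü
};

// Step 1: split each known accented letter into base letter + combining mark.
std::vector<char32_t> decompose(const std::vector<char32_t>& in) {
    std::vector<char32_t> out;
    for (char32_t cp : in) {
        const Decomp* hit = nullptr;
        for (const Decomp& d : kDecomp)
            if (d.composed == cp) hit = &d;
        if (hit) { out.push_back(hit->base); out.push_back(hit->mark); }
        else out.push_back(cp);
    }
    return out;
}

// Combining Diacritical Marks block, U+0300..U+036F.
bool isCombiningMark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Step 2: drop every combining mark; no recomposition afterwards.
std::string stripAccents(const std::string& utf8) {
    std::string out;
    for (char32_t cp : decompose(decodeUtf8(utf8)))
        if (!isCombiningMark(cp)) appendUtf8(out, cp);
    return out;
}
```

With the question's data encoded as UTF-8, stripAccents applied to "á-é-í-ó-ú-ñ" should give "a-e-i-o-u-n", and "había" should become "habia". Text that is already decomposed (a base letter followed by a combining mark) goes through the same filter, which is why nothing needs to be recomposed at the end. Letters outside the small table, such as æ, pass through unchanged.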