How to change diacritic characters to non-diacritic ones

Tomasz Smykowski picture Tomasz Smykowski · Dec 1, 2008 · Viewed 18.8k times · Source

I've found a answer how to remove diacritic characters on stackoverflow, but could you please tell me if it is possible to change diacritic characters to non-diacritic ones?

Oh.. and I think about .NET (or other if not possible)

Answer

dan picture dan · Jul 23, 2010

Since no one has ever bothered to post the code to do this, here it is:

    // \p{Mn} or \p{Non_Spacing_Mark}: 
    //   a character intended to be combined with another 
    //   character without taking up extra space 
    //   (e.g. accents, umlauts, etc.). 
    private readonly static Regex nonSpacingMarkRegex = 
        new Regex(@"\p{Mn}", RegexOptions.Compiled);

    public static string RemoveDiacritics(string text)
    {
        if (text == null)
            return string.Empty;

        var normalizedText = 
            text.Normalize(NormalizationForm.FormD);

        return nonSpacingMarkRegex.Replace(normalizedText, string.Empty);
    }

Note: a big reason for needing to do this is when you are integrating to a 3rd party system that only does ascii, but your data is in unicode. This is common. Your options are basically: remove accented characters, or attempt to remove accents from the accented characters to attempt to preserve as much as you can of the original input. Obviously, this is not a perfect solution but it is 80% better than simply removing any character above ascii 127.