How can I normalize / asciify Unicode characters in Google Sheets?

Kirkman14 picture Kirkman14 · Feb 26, 2016 · Viewed 11.2k times · Source

I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.

I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.

I tried to write a formula (similar to this one):

REGEXREPLACE("público","(\pL)\pM*","$1")

But Sheets produces the following error:

Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.

I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.

Can any offer a suggestion for a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?

Answer

Max Makhrov picture Max Makhrov · Feb 26, 2016

[[:^alpha:]] (negated ASCII character class) works fine for REGEXEXTRACT formula.

But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") gives "pblic" as a result. So, I guess, formula doesn't know what exact ASCII character must replace "ú".


Workaround

Let's take the word públicē; we need to replace two symbols in it. Put this word in cell A1, and this formula in cell B1:

=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))

And then make directory of replacements in range D:E:

    D    E  
1   ú   u
2   ē   e
3  ...  ...

This formula is still ugly, but more useful because you can control your directory by adding more characters to the table.


Or use Java Script

Also found a good solution, which works in google sheets.