Replace emoji Unicode symbols using a regexp in JavaScript

Fedor Skrynnikov · Feb 25, 2014 · Viewed 31.4k times

As you all know, emoji symbols are encoded in 3 or 4 bytes, so one emoji may occupy 2 characters (UTF-16 code units) in my string. For example, '😁wew😁'.length is 7. I want to find those symbols in my text and replace each one with a value that depends on its code. Reading SO, I came across the XRegExp library with its Unicode plugin, but I have not found a way to make it work.

var str = '😁wew😁'; // the symbol is U+1F601
var reg = XRegExp('[\u1F601-\u1F64F]', 'g'); // becomes /[ὠ1-ὤF]/g, which doesn't make a lot of sense
//var reg = XRegExp('[\uD83D\uDE01-\uD83D\uDE4F]', 'g'); //Range out of order in character class
//var reg = XRegExp('\\p{L}', 'g'); //doesn't match my symbols
console.log(XRegExp.replace(str, reg, function(match){
   return encodeURIComponent(match); // here I want something like %F0%9F%98%84, so that I can map this value to anything I want and substitute it
}));

I really don't want to brute-force the string looking for sequences of characters from my range. Could someone help me find a way to do that with regexps?

EDITED: I just came up with the idea of enumerating all the emoji symbols. That is better than brute force, but I am still looking for a better idea.

var reg = XRegExp('\uD83D\uDE01|\uD83D\uDE4F|...','g');

Answer

Jukka K. Korpela · Feb 25, 2014

The \u.... notation has four hex digits, no less, no more, so it can only represent code points up to U+FFFF. Unicode characters above that are represented as pairs of surrogate code points.
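
A small sketch of what this means in practice (the variable names are illustrative and not from the question): '\u1F601' is parsed as the escape \u1F60 followed by the literal digit 1, while the character U+1F601 itself is stored as the surrogate pair \uD83D\uDE01. This is also why a class like [\uD83D\uDE01-\uD83D\uDE4F] is rejected: it is read as four separate code units, and the inner range \uDE01-\uD83D runs backwards.

```javascript
// '\u1F601' is read as the escape \u1F60 followed by the literal character '1'.
var misparsed = '\u1F601';
var misparsedLength = misparsed.length;   // 2
var misparsedTail = misparsed.charAt(1);  // '1'

// U+1F601 itself is stored as a pair of surrogate code units.
var emoji = '\uD83D\uDE01';
var high = emoji.charCodeAt(0); // 0xD83D, a high surrogate
var low = emoji.charCodeAt(1);  // 0xDE01, a low surrogate

// Standard UTF-16 decoding of a surrogate pair into a code point.
var codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
console.log(codePoint.toString(16)); // 1f601
```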

So some indirect approach is needed. Cf. "JavaScript strings outside of the BMP".

For example, you could look for code points in the range [\uD800-\uDBFF] (high surrogates), and when you find one, check that the next code point in the string is in the range [\uDC00-\uDFFF] (if not, there is a serious data error), interpret the two as a Unicode character, and replace them by whatever you wish to put there. This looks like a job for a simple loop through the string, rather than a regular expression.
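
The loop described above could be sketched roughly as follows. The helper name replaceAstral and its callback signature are hypothetical, chosen only for illustration, and the sketch assumes that an unpaired surrogate should simply be copied through rather than reported as an error.

```javascript
// Hypothetical helper: walks the string and hands every surrogate pair,
// together with its decoded code point, to the `replacer` callback.
function replaceAstral(str, replacer) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF) {
      var lo = str.charCodeAt(i + 1); // NaN when i is the last index
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        var codePoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        out += replacer(str.slice(i, i + 2), codePoint);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
      // Unpaired high surrogate: a data error. This sketch copies it through.
    }
    out += str.charAt(i);
  }
  return out;
}

var result = replaceAstral('\uD83D\uDE01wew\uD83D\uDE01', function (pair, codePoint) {
  // Only touch the range from the question, U+1F601 to U+1F64F.
  if (codePoint >= 0x1F601 && codePoint <= 0x1F64F) {
    return encodeURIComponent(pair);
  }
  return pair;
});
console.log(result); // %F0%9F%98%81wew%F0%9F%98%81
```

Because the callback receives the decoded code point, the returned value can come from any lookup table keyed by code point instead of encodeURIComponent.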