Node.js Emoji Parsing

thekevinscott · Sep 24, 2015

I'm trying to parse an incoming string to determine whether it contains any non-emojis.

I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.

With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.

const punycode = require('punycode');

// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
// >> [ 128169 ]

// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
// >> [ 8987, 65039 ]

Mathias touches on this in his article (and gives an example of punycode working around it), but even with his example I get a count of 2 for what looks like a single symbol:

function countSymbols(string) {
  return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
// >> 1
console.log(countSymbols('⌛️'));
// >> 2

What is the best way to detect whether a string consists entirely of emoji? This is for a proof of concept, so the solution can be as brute force as need be.

--- UPDATE ---

A little more context on my pesky emoji above.

These are visually identical but are in fact different Unicode sequences (the second one is from the example above):

⌛ // \u231b

⌛️ // \u231b\ufe0f

The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).

Answer

一二三 · Sep 25, 2015

U+FE0F is not a combining mark; it is a variation selector that controls how the preceding character is rendered (see this answer). Removing or changing such selectors may change the appearance of the character. For example, U+231B followed by U+FE0E requests the text presentation instead: U+231B+U+FE0E (⌛︎).
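To make the difference between the two hourglass strings visible, here is a minimal sketch that lists the code points of each. The helper names `toCodePoints` and `stripVariationSelectors` are made up for this illustration and are not part of punycode or any other library. Stripping the selectors is only an option when the goal is comparison or detection, since it can change how the text renders.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
// Array.from iterates a string by code point, so surrogate pairs stay together.
function toCodePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Hypothetical helper: drop the text (U+FE0E) and emoji (U+FE0F) variation
// selectors. Use for comparison only; the result may render differently.
function stripVariationSelectors(str) {
  return str.replace(/[\uFE0E\uFE0F]/g, '');
}

console.log(toCodePoints('\u231B'));       // [ 'U+231B' ]
console.log(toCodePoints('\u231B\uFE0F')); // [ 'U+231B', 'U+FE0F' ]
console.log(stripVariationSelectors('\u231B\uFE0F') === '\u231B'); // true
```

The strings are written with escapes here because the selector is invisible when pasted directly into source code.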

Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) and U+0032+U+20E3+U+FE0F (2⃣️) are, whereas U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained by the Unicode Consortium in the emoji-data.txt file (the emoji-data-js library appears to have this information).

To check whether a string contains emoji, test at each position whether the single character there is listed in emoji-data.txt, or whether it starts a substring that matches one of the sequences listed there.
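For the "all emoji" case from the question, the same idea can be sketched as a brute-force scan. This is only an illustration under some assumptions: `SAMPLE_SEQUENCES` is a tiny hand-written stand-in for the data in emoji-data.txt, and `isAllEmoji` is a hypothetical name, not a function from emoji-data-js or punycode. The scan is greedy and tries the longest candidate first, so an entry with a trailing U+FE0F is preferred over its shorter prefix.

```javascript
// Hypothetical sample list; a real one would be generated from emoji-data.txt.
const SAMPLE_SEQUENCES = [
  '\u{1F4A9}',          // U+1F4A9 pile of poo
  '\u231B',             // U+231B hourglass
  '\u231B\uFE0F',       // hourglass + emoji variation selector
  '\u0032\u20E3',       // digit two + combining enclosing keycap
  '\u0032\u20E3\uFE0F', // the same keycap with a trailing U+FE0F
];

// Returns true only if the whole input can be consumed as listed sequences.
function isAllEmoji(input, sequences) {
  // Longest first, so a full sequence wins over a prefix of itself.
  const sorted = [...sequences].sort((a, b) => b.length - a.length);
  let pos = 0;
  while (pos < input.length) {
    const match = sorted.find((seq) => input.startsWith(seq, pos));
    if (match === undefined) {
      return false; // something at `pos` is not a listed emoji sequence
    }
    pos += match.length;
  }
  return input.length > 0; // treat the empty string as "not all emoji"
}

console.log(isAllEmoji('\u{1F4A9}\u231B\uFE0F', SAMPLE_SEQUENCES)); // true
console.log(isAllEmoji('2', SAMPLE_SEQUENCES));                     // false
console.log(isAllEmoji('A\u20E3', SAMPLE_SEQUENCES));               // false
```

Because the scan is greedy, it does not backtrack; with a list where one entry is a prefix of another and the remainder matters, a more careful matcher (or a regex built with regenerate) would be needed.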