How do I remove emoji from a string?

kilua · Jul 10, 2014 · Viewed 41.5k times

I want to remove emoji from a string using a regex, but keep CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine, except that it also matches Chinese, Japanese, and Korean characters, which I need to keep. Any idea how to solve this?

Answer

Stefan · Jul 10, 2014

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
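
To see why the question's regex catches CJK characters, the misread class can be checked directly. This is only a small sketch, assuming Ruby 2.4+ for Regexp#match?; the constant names MISREAD and QUESTION_REGEX are made up for illustration.

```ruby
# Without curly braces this is [\u1F60 0-\u1F6F F], not an emoji range.
MISREAD = /[\u1F600-\u1F6FF]/

MISREAD.match?("a")   # => true  (U+0061 lies inside "0".."\u1F6F")
MISREAD.match?("漢")  # => false (U+6F22 is above U+1F6F)
MISREAD.match?("😀")  # => false (U+1F600 is above U+1F6F)

# The negated class from the question therefore matches both of them.
QUESTION_REGEX = /[^\u1F600-\u1F6FF\s]/i

QUESTION_REGEX.match?("漢")  # => true
QUESTION_REGEX.match?("😀")  # => true
```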

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/

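This matches (emoji) characters in these Unicode blocks: Emoticons (U+1F600..U+1F64F), Ornamental Dingbats (U+1F650..U+1F67F), and Transport and Map Symbols (U+1F680..U+1F6FF).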
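
For the original task, the corrected character class could be combined with gsub roughly as follows. This is a minimal sketch, not a complete emoji filter: the names EMOJI and strip_emoji are hypothetical, and emoji outside U+1F600..U+1F6FF are left in place.

```ruby
# Character class with curly braces, covering U+1F600..U+1F6FF only.
EMOJI = /[\u{1F600}-\u{1F6FF}]/

# Returns a copy of text with every character in that range removed.
def strip_emoji(text)
  text.gsub(EMOJI, "")
end

strip_emoji("Hi!😀 日本語 한국어 中文🚀")
#=> "Hi! 日本語 한국어 中文"
```

The CJK characters survive because their code points lie far below U+1F600.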


You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7, which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!" 
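
It may help to look at the intermediate values: unpack('U*') turns the string into an array of code points, so the filter is a plain integer comparison. The sketch below wraps the same idea in a method; the name strip_by_codepoint is hypothetical, and the output shown assumes a Ruby version with UTF-8 string literals.

```ruby
# Each character becomes its code point as an Integer.
"日😀".unpack("U*")
#=> [26085, 128512]   (0x65E5 and 0x1F600)

# Drops code points in 0x1F600..0x1F6FF and reassembles a UTF-8 string.
def strip_by_codepoint(text)
  text.unpack("U*").reject { |cp| cp.between?(0x1F600, 0x1F6FF) }.pack("U*")
end

strip_by_codepoint("Hi!😀 日本語")
#=> "Hi! 日本語"
```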

Regarding your Rubular example, emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).
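
As a quick illustration of that difference, the emoji character class leaves a kaomoji untouched, because each of its characters is ordinary punctuation. The constant name EMOJI_BLOCKS is only chosen for this sketch.

```ruby
EMOJI_BLOCKS = /[\u{1F600}-\u{1F6FF}]/

# Only the single emoji character is removed; the kaomoji stays.
"(^_^) 😀".gsub(EMOJI_BLOCKS, "")
#=> "(^_^) "
```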