My problem is to remove emoji from a string using regex, without removing CJK (Chinese, Japanese, Korean) characters. I tried to use this regex:
REGEX = /[^\u1F600-\u1F6FF\s]/i
This regex works fine, except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?
Karol S already provided a solution, but the reason might not be clear: "\u1F600" is actually "\u1F60" followed by "0":
"\u1F60" # => "ὠ"
"\u1F600" # => "ὠ0"
You have to use curly braces for code points above FFFF:
"\u{1F600}" #=> "😀"
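Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F". Because the class in your regex is negated, it matches everything outside that range, and CJK ideographs, kana and Hangul syllables all lie above U+1F6F.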
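To see the misparse in action, here is a small check. It is only a sketch: the constant names are made up for illustration, the /i flag from the question is left out, and match? assumes Ruby 2.4 or later.

```ruby
# The class from the question, written without curly braces.
# Ruby reads it as: "\u1F60", the range "0".."\u1F6F", and "F".
BROKEN_CLASS  = /[\u1F600-\u1F6FF]/
NEGATED_CLASS = /[^\u1F600-\u1F6FF\s]/

BROKEN_CLASS.match?("a")          #=> true,  "a" (U+0061) is inside "0".."\u1F6F"
BROKEN_CLASS.match?("\u{1F600}")  #=> false, the emoji is not in the class at all
BROKEN_CLASS.match?("中")         #=> false, U+4E2D is above U+1F6F

# The negated version therefore matches both emoji and CJK characters.
NEGATED_CLASS.match?("中")         #=> true
NEGATED_CLASS.match?("\u{1F600}")  #=> true
NEGATED_CLASS.match?("a")          #=> false
```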
Using curly braces solves the issue:
/[\u{1F600}-\u{1F6FF}]/
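This matches (emoji) characters in these Unicode blocks: Emoticons (U+1F600 to U+1F64F), Ornamental Dingbats (U+1F650 to U+1F67F), and Transport and Map Symbols (U+1F680 to U+1F6FF).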
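Applied to the original problem, the curly-brace class can be passed to gsub to drop the emoji and leave CJK text alone. This is a minimal sketch; EMOJI_RANGE and strip_emoji are hypothetical names, and the range only covers the code points U+1F600 to U+1F6FF, not every emoji.

```ruby
# Code points U+1F600..U+1F6FF, written with curly braces.
EMOJI_RANGE = /[\u{1F600}-\u{1F6FF}]/

# Hypothetical helper: returns a copy of str without characters in EMOJI_RANGE.
def strip_emoji(str)
  str.gsub(EMOJI_RANGE, "")
end

strip_emoji("Hi!\u{1F600}")           #=> "Hi!"
strip_emoji("日本語\u{1F680} 한국어")  #=> "日本語 한국어"
```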
You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7, which doesn't support Unicode in regular expressions.
s = 'Hi!😀'
#=> "Hi!\360\237\230\200"
s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!"
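On newer Rubies the same idea can be written with String#codepoints instead of unpack. This is just an alternative sketch, assuming a UTF-8 string; remove_by_codepoint is a hypothetical helper name.

```ruby
# Hypothetical helper: filters code points numerically, then rebuilds the string.
def remove_by_codepoint(str)
  str.codepoints                                        # integers, one per character
     .reject { |cp| cp.between?(0x1F600, 0x1F6FF) }     # drop the emoji range
     .pack("U*")                                        # back to a UTF-8 string
end

remove_by_codepoint("Hi!\u{1F600}") #=> "Hi!"
```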
Regarding your Rubular example – Emoji are single characters:
"😀".length #=> 1
"😀".chars #=> ["😀"]
Whereas kaomoji are a combination of multiple characters:
"^_^".length #=> 3
"^_^".chars #=> ["^", "_", "^"]
Matching these is a very different task (and you should ask that in a separate question).