Removing emojis from a string in Python

Mona Jalal · Oct 29, 2015 · Viewed 76.8k times

I found this Python code for removing emojis, but it is not working. Can you suggest other code or a fix for this?

I have observed that all my emojis start with \xf, but when I try str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here's the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list is a word, for example: ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

UPDATE: I used this other code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emojis; they are still shown. Any clue why that is?

Answer

jfs · Oct 29, 2015

On Python 2, you have to use a u'' literal to create a Unicode string. You should also pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog 
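
The decoding step can be sketched against the word list from the question. The item '\xf0\x9f\x98\x82' there is the UTF-8 encoding of U+1F602, so it has to be decoded before the pattern can match it. This is only a sketch: the helper name strip_emoji and the choice to drop words that end up empty are assumptions for illustration, not part of the original answer.

```python
import re

# Same four ranges as the pattern above; this is not a complete emoji list.
emoji_pattern = re.compile(
    u"["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"]+",
    flags=re.UNICODE,
)


def strip_emoji(raw_words):
    """Hypothetical helper: decode UTF-8 byte strings, remove emoji,
    and drop any word that is empty afterwards."""
    cleaned = []
    for raw in raw_words:
        text = raw.decode("utf-8")           # bytes -> Unicode text
        text = emoji_pattern.sub(u"", text)  # remove matched emoji
        if text:
            cleaned.append(text)
    return cleaned


# The list from the question, written as byte strings.
words = [b"This", b"dog", b"\xf0\x9f\x98\x82", b"https://t.co/5N86jYipOI"]
result = strip_emoji(words)
print(result)
```

If the words are already Unicode strings (for example, any str on Python 3), the decode call is unnecessary and can be dropped.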

Note: emoji_pattern matches only some emoji (not all). See "Which Characters are Emoji".
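
To make the limitation concrete, here is a small check of which characters the four ranges cover. The sample characters are chosen for illustration only: U+1F914 and U+2764 are widely rendered as emoji but fall outside every range in the pattern.

```python
import re

# Same four ranges as emoji_pattern in the answer.
emoji_pattern = re.compile(
    u"["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"]+",
    flags=re.UNICODE,
)

# Illustrative samples (assumed for this sketch, not from the original post).
samples = {
    "face with tears of joy": u"\U0001F602",  # inside the emoticons range
    "thinking face": u"\U0001F914",           # above U+1F6FF, not covered
    "heavy black heart": u"\u2764",           # below U+1F1E0, not covered
}

# True where the pattern finds a match, False where the emoji slips through.
matched = {name: bool(emoji_pattern.search(ch)) for name, ch in samples.items()}
print(matched)
```

Covering such characters means adding further ranges to the character class, based on an up-to-date list of emoji code points.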