What is the regex to extract all the emojis from a string?

vishalaksh · Jul 19, 2014 · Viewed 84.5k times

I have a String encoded in UTF-8. For example:

Thats a nice joke 😆😆😆 😛

I have to extract all the emojis present in the sentence. And the emoji could be any

When this sentence is viewed in a terminal with the command less text.txt, it is displayed as:

Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B>

These are the corresponding Unicode code points for the emojis. All the codes for emojis can be found at emojitracker.

To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work for the UTF-8 encoded string.

Following is my code:

    String s="Thats a nice joke 😆😆😆 😛";
    Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for(int i=0;i<matchList.size();i++){
        System.out.println(matchList.get(i));

    }

This PDF says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So I want to capture any character lying within this range.
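For reference, matching by code point range in plain Java could look like the minimal sketch below. It assumes Java 7 or later, where the regex syntax \x{...} accepts supplementary code points. The class and method names (EmojiRangeExtractor, extract) are made up for illustration. The Emoticons block (1F600 to 1F64F) is added as an assumption, because the faces in the example sentence (1F606, 1F61B) fall there rather than in the range quoted above. This is not a complete definition of what counts as an emoji.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiRangeExtractor {
    // Assumed ranges: Miscellaneous Symbols and Pictographs (1F300 to 1F5FF)
    // plus Emoticons (1F600 to 1F64F), which holds the faces in the example.
    private static final Pattern EMOJI =
            Pattern.compile("[\\x{1F300}-\\x{1F5FF}\\x{1F600}-\\x{1F64F}]");

    // Returns every match of the pattern, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The example sentence, with the emojis written as surrogate pairs.
        String s = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        for (String e : extract(s)) {
            // Print each match together with its code point in hex.
            System.out.printf("%s U+%X%n", e, e.codePointAt(0));
        }
    }
}
```

For the example sentence, extract returns four matches. The pattern works on the actual characters in the string, whereas (<U\+\w+?>) looks for the literal text that less prints in place of them.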

Answer

gidim · Sep 30, 2015

Using emoji-java, I've written a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library but is easier to maintain than those monster regexes.

Use:

String input = "A string 😄with a \uD83D\uDC66\uD83C\uDFFFfew 😉emojis!";
String result = EmojiParser.removeAllEmojis(input);

emoji-java maven installation:

<dependency>
  <groupId>com.vdurmont</groupId>
  <artifactId>emoji-java</artifactId>
  <version>3.1.3</version>
</dependency>

gradle:

implementation 'com.vdurmont:emoji-java:3.1.3'

EDIT: the previously submitted answer was pulled into the emoji-java source code.