I have a String encoded in UTF-8. For example:
Thats a nice joke 😆😆😆 😛
I need to extract all the emojis present in the sentence, and the emoji could be any emoji.
When this sentence is viewed in a terminal with the command less text.txt, it is displayed as:
Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B>
These are the Unicode code points of the emojis. All the emoji codes can be found at emojitracker.
To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work for the UTF-8 encoded string.
Following is my code:
String s = "Thats a nice joke 😆😆😆 😛";
Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
Matcher matcher = pattern.matcher(s);
List<String> matchList = new ArrayList<String>();
while (matcher.find()) {
    matchList.add(matcher.group());
}
for (int i = 0; i < matchList.size(); i++) {
    System.out.println(matchList.get(i));
}
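The pattern above finds nothing because the string holds the emoji characters themselves, not the literal text <U+1F606>; that notation is only how less displays them. A minimal sketch of how those code points could be listed from the string, assuming Java 8 or later (the class and method names are made up for illustration, and the emojis are written as surrogate-pair escapes so the source file encoding does not matter):

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Returns the <U+XXXX> notation for every code point above U+FFFF,
    // i.e. every character that Java stores as a surrogate pair.
    static List<String> supplementaryCodePoints(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints()
         .filter(Character::isSupplementaryCodePoint)
         .forEach(cp -> out.add(String.format("<U+%X>", cp)));
        return out;
    }

    public static void main(String[] args) {
        // \uD83D\uDE06 is U+1F606 and \uD83D\uDE1B is U+1F61B.
        String s = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        System.out.println(supplementaryCodePoints(s));
    }
}
```

This only shows what the string really contains. It treats every supplementary code point as a candidate, which is broader than "emoji".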
This PDF gives the range 1F300–1F5FF for Miscellaneous Symbols and Pictographs, so I want to capture any character lying within this range.
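Matching by range could look roughly like the sketch below. It relies on the \x{h...h} code point escape that java.util.regex supports since Java 7. Note that U+1F606 and U+1F61B from the example sit in the Emoticons block (U+1F600 to U+1F64F), not in 1F300–1F5FF, so the sketch adds that block as an assumption. The class and method names are hypothetical, and these two blocks do not cover every emoji.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiRangeExtractor {
    // Assumed ranges: Miscellaneous Symbols and Pictographs (U+1F300..U+1F5FF)
    // plus Emoticons (U+1F600..U+1F64F). Other emoji blocks are not matched.
    private static final Pattern EMOJI =
            Pattern.compile("[\\x{1F300}-\\x{1F5FF}\\x{1F600}-\\x{1F64F}]");

    // Collects every character in the assumed ranges, in order of appearance.
    static List<String> extract(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = EMOJI.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \uD83D\uDE06 is U+1F606 and \uD83D\uDE1B is U+1F61B.
        String s = "Thats a nice joke \uD83D\uDE06\uD83D\uDE06\uD83D\uDE06 \uD83D\uDE1B";
        for (String emoji : extract(s)) {
            System.out.println(emoji);
        }
    }
}
```

Each match is returned as a String rather than a char, because a code point above U+FFFF takes two char values in Java.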
Using emoji-java, I wrote a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library, but it is easier to maintain than those monster regexes.
Usage:
String input = "A string πwith a \uD83D\uDC66\uD83C\uDFFFfew πemojis!";
String result = EmojiParser.removeAllEmojis(input);
emoji-java Maven installation:
<dependency>
    <groupId>com.vdurmont</groupId>
    <artifactId>emoji-java</artifactId>
    <version>3.1.3</version>
</dependency>
Gradle:
implementation 'com.vdurmont:emoji-java:3.1.3'
EDIT: The previously submitted answer was pulled into the emoji-java source code.