Profanity Filter using a Regular Expression (list of 100 words)

BuddyJoe · Nov 30, 2011

What is the correct way to strip profane words from a string, given the following?
1) I have a list of 100 words to look for in an array of strings.
2) What is the correct way to handle partial words? How do most people handle this? For example, the word mass is harmless even though a profane word appears inside it. But sometimes a partial match is also bad: assuming foobar is an extremely profane word, I may want to disallow foobar, foobar* and *foobar.

So do you put all the words into a single expression or loop through the list?

What's the right way to tackle it? I'm using Groovy/Grails, but examples in any modern language are welcome.

Answer

voidmain · Dec 1, 2011

This is quite a difficult problem to solve, and you need to determine whether regular expressions will work for you and how you will handle embedding (when a dictionary word is attached to a profanity, like frackface except with the real F-word).

Regular expression engines generally limit how long a pattern can be, and this usually prevents you from using a single regex for all your words. Executing multiple regular expressions against every string is slow; how much that hurts depends on the performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale, so we rewrote it using a different mechanism.
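For a small list, the single-expression approach from the question can still be sketched in a few lines. The Python below is only an illustration under stated assumptions: the word lists and the names WHOLE_WORDS, ANYWHERE, build_pattern and censor are hypothetical, and "hello" and "foobar" stand in for real profanity. Words in the first list match only as standalone words (so mass-style false positives are avoided), while words in the second list match even when embedded, which covers the foobar* and *foobar cases.

```python
import re

# Hypothetical word lists; "hello" and "foobar" stand in for real profanity.
WHOLE_WORDS = ["hello"]   # blocked only as a standalone word
ANYWHERE = ["foobar"]     # blocked even inside a longer word

def build_pattern(whole_words, anywhere):
    """Combine both lists into one case-insensitive alternation."""
    # \b anchors keep "hello" from matching inside "Othello".
    parts = [r"\b" + re.escape(w) + r"\b" for w in whole_words]
    # \w* on both sides swallows the whole word that embeds the term.
    parts += [r"\w*" + re.escape(w) + r"\w*" for w in anywhere]
    return re.compile("|".join(parts), re.IGNORECASE)

PATTERN = build_pattern(WHOLE_WORDS, ANYWHERE)

def censor(text):
    """Replace every match with asterisks of the same length."""
    return PATTERN.sub(lambda m: "*" * len(m.group(0)), text)

print(censor("Hello, that is a superfoobarish idea from Othello"))
```

This stays manageable at around 100 words, but it does nothing about spacing, punctuation or leet-speak, and the pattern grows with every word added to either list.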

You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

  • h e l l o
  • h.e.l.l.o
  • h_e_l_l_o
  • |-|ello
  • h3llo
  • "hello there" (this phrase might not contain any profane words but combined they are profane)
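One common way to catch the spaced-out and leet-speak variants above is to normalize the text before matching. The following Python is a minimal sketch, not how CleanSpeak works: the substitution tables and the names MULTI_CHAR, SINGLE_CHAR, SPACED_OUT, normalize and contains_blocked are assumptions made for illustration, and the tables cover only the examples in this list.

```python
import re

# Hypothetical substitution tables; a real filter needs far larger ones.
MULTI_CHAR = {"|-|": "h"}
SINGLE_CHAR = {"3": "e", "0": "o", "1": "l", "@": "a", "$": "s"}

# A run of single letters, each separated by one space, dot, underscore or hyphen.
SPACED_OUT = re.compile(r"(?<![a-z])[a-z](?:[\s._-][a-z])+(?![a-z])")

def normalize(text):
    """Return a lowercase copy intended for matching only, never for display."""
    text = text.lower()
    # Multi-character sequences go first, before their pieces are remapped.
    for sequence, letter in MULTI_CHAR.items():
        text = text.replace(sequence, letter)
    text = "".join(SINGLE_CHAR.get(ch, ch) for ch in text)
    # Join letter runs such as "h.e.l.l.o" back into a single word.
    return SPACED_OUT.sub(lambda m: re.sub(r"[\s._-]", "", m.group(0)), text)

def contains_blocked(text, blocked="hello"):
    """Crude substring check against the normalized text."""
    return blocked in normalize(text)

for sample in ["h e l l o", "h.e.l.l.o", "h_e_l_l_o", "|-|ello", "h3llo"]:
    print(sample, "->", normalize(sample))
```

Normalization like this trades one problem for another: digits in ordinary text are rewritten too, and joining separated letters can manufacture words that were never there, so the result should only ever feed the matcher. It also does not address the phrase case in the last bullet.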

You also need to handle edge cases where two or more dictionary (whitelist) words contain a profanity when next to each other. Some examples that contain the s-word:

  • bash it
  • ssh it's quiet time

These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
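The false positive is easy to reproduce once a filter ignores spaces. The Python sketch below uses "theme" as a harmless stand-in for the blocked word, so that "bathe me" plays the role of "bash it". The whitelist contents and the names BLOCKED, WHITELIST, naive_hit and whitelist_aware_hit are hypothetical, and the mitigation shown is just one possible heuristic, not the CleanSpeak algorithm.

```python
import re

BLOCKED = "theme"                    # stand-in word, assumed profane for this demo
WHITELIST = {"bathe", "me", "the"}   # hypothetical dictionary of known-good words

def naive_hit(text):
    """Strip all whitespace, then look for the blocked word anywhere."""
    return BLOCKED in re.sub(r"\s+", "", text.lower())

def whitelist_aware_hit(text):
    """Ignore a match when every token it touches is a whitelisted word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    joined, owner = "", []
    for index, token in enumerate(tokens):
        joined += token
        owner += [index] * len(token)   # which token each character came from
    start = joined.find(BLOCKED)
    while start != -1:
        touched = set(owner[start:start + len(BLOCKED)])
        if not all(tokens[i] in WHITELIST for i in touched):
            return True
        start = joined.find(BLOCKED, start + 1)
    return False

print(naive_hit("bathe me"), whitelist_aware_hit("bathe me"))
print(naive_hit("t h e m e"), whitelist_aware_hit("t h e m e"))
```

The heuristic has its own cost: someone who deliberately splits the blocked word into two whitelisted words ("the me" here) slips through, which is part of why these cases are hard to get right.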

We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. That's not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.