Filter Comment Spam? PHP

Sean H Jenkins · Dec 7, 2011 · Viewed 7.3k times

I'm looking for articles on ways to filter comment spam. When I search around, all I keep finding is WordPress, ways to filter swear words, etc., which is not what I'm looking for. I'm looking for ways to write your own filter system, and for best practices.

Any tutorial links from anyone who has done this before would be appreciated.

The only good article I have found so far is http://snook.ca/archives/other/effective_blog_comment_spam_blocker

Answer

Tim · Dec 7, 2011

When writing your own method, you'll have to employ a combination of heuristics.

For example, it's very common for spam comments to have 2 or more URL links.
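A minimal sketch of that link-count check. The helper names `countLinks()` and `hasTooManyLinks()` are hypothetical, and the pattern (anything starting with `http://`, `https://`, or `www.`) is a deliberately simple assumption, not a full URL parser:

```php
<?php
// Hypothetical helper (not from the original answer): counts URL-like
// substrings in a comment. The pattern is a simplifying assumption.
function countLinks(string $text): int
{
    // preg_match_all() returns the number of full pattern matches, or false on error.
    $matches = preg_match_all('~(?:https?://|www\.)\S+~i', $text);
    return $matches === false ? 0 : $matches;
}

// Treats $limit or more links as a spam signal (2 by default, per the heuristic above).
function hasTooManyLinks(string $text, int $limit = 2): bool
{
    return countLinks($text) >= $limit;
}
```

On its own this is a weak signal, since legitimate comments can also contain several links, so it is probably best used as one input among several.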

I'd begin by writing your filter like so: keep a dictionary of trigger words, loop through it, and use the number of matches to estimate the probability of spam:

function spamProbability($text){
    $probability = 0;
    $text = strtolower($text); // lowercase so matching is case-insensitive
    $myDict = array("http","penis","pills","sale","cheapest"); // trigger words
    foreach($myDict as $word){
        $count = substr_count($text, $word); // occurrences of this trigger word
        $probability += 0.2 * $count; // each occurrence adds 0.2, so the total can exceed 1
    }
    return $probability;
}

Note that this method will result in many false positives, depending on your word set, so rather than rejecting everything it flags you can act in tiers. Comments scoring from 0.3 up to 0.6 go live immediately but are flagged for moderation. Comments scoring from 0.6 up to 0.9 enter a moderation queue, where they don't appear until approved. Anything scoring 0.9 or higher is simply rejected.
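The tiers can be sketched as a small lookup. `moderationAction()` and its return strings are hypothetical names, and how exact boundary values are handled (each lower bound inclusive, rejection from 0.9 up) is an assumption made so that every score maps to exactly one action:

```php
<?php
// Hypothetical helper: maps a spam score to a moderation action.
// The cut-offs follow the thresholds discussed above; the handling of
// exact boundary values is an assumption.
function moderationAction(float $score): string
{
    if ($score >= 0.9) {
        return 'reject';   // never shown
    }
    if ($score >= 0.6) {
        return 'queue';    // hidden until a moderator approves it
    }
    if ($score >= 0.3) {
        return 'flag';     // published immediately, but marked for review
    }
    return 'publish';      // published with no further action
}
```

It would be called on the filter's output, for example `moderationAction(spamProbability($comment))`.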

Obviously you'll have to tweak all of these thresholds, but this should start you off with a pretty basic system. You can add several other qualifiers that increase or decrease the probability of spam, such as checking the ratio of bad words to total words, changing the weights of individual words, etc.
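One way those two refinements might look together, as a rough sketch: `weightedSpamScore()` is a hypothetical name, and the word list, the per-word weights, and the 0.5 ratio bonus are all illustrative assumptions that would need tuning against real comments.

```php
<?php
// Sketch of a weighted variant of the filter. The words, weights, and the
// ratio bonus are illustrative assumptions, not recommended values.
function weightedSpamScore(string $text): float
{
    $weights = [
        'http'     => 0.2,
        'pills'    => 0.4,
        'cheapest' => 0.3,
        'sale'     => 0.1,
    ];

    $text  = strtolower($text);   // make matching case-insensitive
    $score = 0.0;
    $hits  = 0;

    foreach ($weights as $word => $weight) {
        $count  = substr_count($text, $word);  // substring matches, so "wholesale" counts for "sale"
        $hits  += $count;
        $score += $weight * $count;
    }

    // Ratio of trigger-word hits to total words: a short comment made up
    // mostly of trigger words scores higher than a long one that mentions one.
    $totalWords = str_word_count($text);
    if ($totalWords > 0) {
        $score += 0.5 * min(1.0, $hits / $totalWords);
    }

    return $score;
}
```

Because `substr_count()` matches substrings rather than whole words, this sketch shares the false-positive behaviour of the simpler version; matching on word boundaries would be a further refinement.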