Detecting a (naughty or nice) URL or link in a text string

JasonSmith · Mar 31, 2009

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?

The purpose is to prevent spam. HTML is stripped, so I need to detect invitations to copy and paste. It should not be economical for a spammer to post links, because most users could not successfully get to the page. I would like suggestions, references, or discussion on best practices.

Some objectives:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs without the http:// prefix (i.e. a valid FQDN + a valid HTTP path); see the sketch after this list
  • Any other funny business
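
Roughly, I'm imagining something like this for the first two bullets (a sketch; the regexes and the abbreviated TLD list are illustrative placeholders, not a finished filter):

    import re

    # Pass 1: well-formed URLs with an explicit scheme.
    SCHEMED_URL = re.compile(r'https?://\S+', re.IGNORECASE)

    # Pass 2: schemeless FQDN plus optional path, e.g. "some-fqdn.com/some/path".
    # The TLD alternation is abbreviated for illustration.
    BARE_FQDN = re.compile(
        r'\b(?:[a-z0-9-]+\.)+(?:com|net|org|info|biz|edu|gov)(?:/\S*)?',
        re.IGNORECASE)

    def looks_like_link(text: str) -> bool:
        return bool(SCHEMED_URL.search(text) or BARE_FQDN.search(text))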

Of course, I am blocking spam, but the same process could be used to auto-link text.

Ideas

Here are some things I'm thinking.

  • The content is native-language prose, so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you? (See the normalization sketch after this list.)
  • Maybe multiple passes are a better strategy, with scans for:
    • Well-formed URLs
    • All non-whitespace followed by '.' followed by any valid TLD
    • Anything else?
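
For the whitespace idea, a normalization pass before scanning might look like the sketch below. Note it is deliberately trigger-happy: it also glues "again. it" into "again.it", which a later TLD scan will flag.

    import re

    def normalize(text: str) -> str:
        """Fold obvious dot obfuscations before the URL scan."""
        # "www .example.com" or "www. example.com" -> "www.example.com"
        text = re.sub(r'\s*\.\s*', '.', text)
        # "example dot com" -> "example.com"
        text = re.sub(r'\s+dot\s+', '.', text, flags=re.IGNORECASE)
        return text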

Related Questions

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

Update and Summary

Wow, there are some very good heuristics listed here! For me, the best bang for the buck is a synthesis of the following:

  1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey's subscripted middle dot (i.e. "·"), which is also a word boundary, so it's harder to casually copy and paste.
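
A sketch of steps 2 and 3 together (the TLD alternation is abbreviated; U+00B7 is the middle dot):

    import re

    # A dot immediately followed by something TLD-shaped.
    SUSPICIOUS_DOT = re.compile(
        r'\.(?=(?:[a-zA-Z]{2}|com|net|org|info|biz)\b)')

    def defang(text: str) -> str:
        """Swap suspicious dots for '·' so the result no longer pastes as a URL."""
        return SUSPICIOUS_DOT.sub('\u00b7', text)

    # defang("buy at example.com") -> "buy at example·com"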

That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

  • Strip out all dotted-quads (@Sharkey's comment to his own answer)
  • @Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
  • Pinging the URL server-side to establish whether it is a web site (see the sketch after this list). Perhaps I could also run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.
  • Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
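
For the ping idea, a server-side sketch (the function name is mine; this does blocking I/O, and fetching user-supplied hosts has SSRF implications, so real code would want timeouts, rate limits, and an allow/deny policy):

    import urllib.error
    import urllib.request

    def resolves_to_website(candidate: str, timeout: float = 3.0) -> bool:
        """Return True if the candidate string answers an HTTP HEAD request."""
        if not candidate.startswith(('http://', 'https://')):
            candidate = 'http://' + candidate
        try:
            request = urllib.request.Request(candidate, method='HEAD')
            urllib.request.urlopen(request, timeout=timeout)
            return True
        except urllib.error.HTTPError:
            return True   # the server answered, even if with an error status
        except (urllib.error.URLError, ValueError):
            return False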

Answer

Jon Bright · Apr 15, 2009

I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: first, the people using the system will be actively trying to contravene your check; second, your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.

I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:

\S\.(?:[a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)

Things this will get: well-formed URLs whose host ends in a listed TLD, bare FQDNs like "example.com" (with or without a path), and ccTLD constructions like the ".it" example above.
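
For instance, a quick Python check (the test strings are mine, not from the original answer; the pattern uses \S and a trailing (?:/|\b) group because \b inside a character class means backspace, not a word boundary):

    import re

    TLD_RE = re.compile(
        r'\S\.(?:[a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int'
        r'|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)')

    for text in ('go to example.com now',             # bare FQDN: flagged
                 'http://example.org/some/path',      # full URL: flagged
                 'buyfunkypharmaceuticals.it',        # ccTLD trick: flagged
                 'I tried again. it does not work'):  # prose: clean
        print(bool(TLD_RE.search(text)), text)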

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.