How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
http://some-fqdn/some/valid/path.ext
)http://
prefix (i.e. a valid FQDN + valid HTTP path)Of course, I am blocking spam, but the same process could be used to auto-link text.
Here are some things I'm thinking.
www .example.com
"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?I've read these and they are now documented here, so you can just references the regexes in those questions if you want.
Wow, I there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will therefore be actively trying to contravene your check and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparitively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
[^\b]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)[\b/]
Things this will get:
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a miniscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.