PHP Regex for human names

KdgDev picture KdgDev · Aug 11, 2009 · Viewed 25.3k times · Source

I've run into a bit of a problem with a Regex I'm using for humans names.

$rexName = '/^[a-z' -]$/i';

Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?

EDIT:, just threw the Jürgen name against a regex creator, and it splits the word up at the ü letter...

http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches

EDIT2: Allright, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";

(now which ones of these can actually be used in any hacking attempt?)

For instance. This allows ' and - signs, yet you need a ; to make it work in SQL, and those will be stopped.Any other characters that are commonly used for HTML injection of SQL attacks that I'm missing?

Answer

Pascal MARTIN picture Pascal MARTIN · Aug 11, 2009

I would really say : don't try to validate names : one day or another, your code will meet a name that it thinks is "wrong"... And how do you think one would react when an application tells him "your name is not valid" ?

Depending on what you really want to achieve, you might consider using some kind of blacklist / filters, to exclude the "not-names" you thought about : it will maybe let some "bad-names" pass, but, at least, it shouldn't prevent any existing name from accessing your application.

Here are a few examples of rules that come to mind :

  • no number
  • no special character, like "~{()}@^$%?;:/*§£ø and probably some others
  • no more that 3 spaces ?
  • none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
    • (but, if they don't want to give you their name, their still won't, even if you forbid them from typing some random letters, they could just use a real name... Which is not their's)

Yes, this is not perfect ; and yes, it will let some non-names pass... But it's probably way better for your application than saying someone "your name is wrong" (yes, I insist ^^ )


And, to answer a comment you left under one other answer :

I could just forbid the most command characters for SQL injection and XSS attacks,

About SQL Injection, you must escape your data before sending those to the database ; and, if you always escape those data (you should !), you don't have to care about what users may input or not : as it is escaped, always, there is no risk for you.

Same about XSS : as you always escape your data when ouputting it (you should !), there is no risk of injection ;-)


EDIT : if you just use that regex like that, it will not work quite well :

The following code :

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will get you at least a warning :

Warning: preg_match() [function.preg-match]: Unknown modifier '{'

You must escape at least some of those special chars ; I'll let you dig into PCRE Patterns for more informations (there is really a lot to know about PCRE / regex ; and I won't be able to explain it all)

If you actually want to check that none of those characters is inside a given piece of data, you might end up with something like that :

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

(This is a quick and dirty proposition, which has to be refined!)

This one says "OK" (well, I definitly hope my own name is ok!)
And the same example with some specials chars, like this :

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will say "bad name"

But please note I have not fully tested this, and it probably needs more work ! Do not use this on your site unless you tested it very carefully !


Also note that a single quote can be helpful when trying to do an SQL Injection... But it is probably a character that is legal in some names... So, just excluding some characters might no be enough ;-)