Regular expression for validating names and surnames?

Sklivvz picture Sklivvz · May 20, 2009 · Viewed 104.5k times · Source

Although this seems like a trivial question, I am quite sure it is not :)

I need to validate names and surnames of people from all over the world. Imagine a huge list of miilions of names and surnames where I need to remove as well as possible any cruft I identify. How can I do that with a regular expression? If it were only English ones I think that this would cut it:

^[a-z -']+$

However, I need to support also these cases:

  • other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
  • different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
  • no numbers or symbols or unnecessary punctuation or runes, etc..
  • titles, middle initials, suffixes are not part of this data
  • names are already separated by surnames.
  • we are prepared to force ultra rare names to be simplified (there's a person named '@' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
  • note that many countries have laws about names so there are standards to follow

Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?

I would be looking for something similar to the many "email address" regexes that you can find on google.

Answer

Chris Cudmore picture Chris Cudmore · May 20, 2009

I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.

Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.

Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.