How can I detect non-western characters?

HoboBen picture HoboBen · Aug 5, 2010 · Viewed 7.6k times · Source

I want to disallow certain UTF-8 input (server-side), e.g. eastern languages, where example input might be " 伊 ".

However, I do want to continue supporting other latin or "latin-like" characters, such as the welsh ŵ and ŷ, so checking against latin-1 is not possible.

What are my options? (if language specific, PHP preferred)

Thanks very much.


Reasoning: browser support for a lot of non-western characters is often missing (e.g. on a different browser I just see a box in the question above), so for things like display names sometimes it's appropriate to restrict it even if it's not appropriate for message bodies

Answer

Artefacto picture Artefacto · Aug 5, 2010

Just do

preg_match('/[^\\p{Common}\\p{Latin}]/u', $string)

where $string is an UTF-8 string. This will return "1" if there are non-latin characters and will return "0" otherwise.

Example:

var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷaás??'));  //int(0)
var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷݤaás??')); //int(1)