How to check if string contains only specified character set?

arekstasiewicz · Jul 6, 2011

I'm working with strings and I wonder what the best way is to check whether a string contains only characters from a specified character set:

@  ∆  SP  0  ¡  P  ¿  p 
£  _  !  1  A  Q  a  q 
$  Φ  "  2  B  R  b  r 
¥  Γ  #  3  C  S  c  s 
è  Λ  ¤  4  D  T  d  t 
é  O  %  5  E  U  e  u 
ù  Π  &  6  F  V  f  v 
ì  Ψ  '  7  G  W  g  w 
ò  Σ  (  8  H  X  h  x 
Ç  Θ  )  9  I  Y  i  y 
LF  Ξ  *  :  J  Z  j  z 
Ø  1)  +  ;  K  Ä  k  ä 
ø  Æ  ,  <  L  Ö  l  ö 
CR  æ  q  =  M  Ñ  m  ñ 
Å  ß  .  >  N  Ü  n  ü 
å  É  /  ?  O  §  o  à 

I tried to do this with eregi and regular expressions, but didn't succeed. Another way is to convert each character to its decimal code and check whether it is smaller than 137, or to check each character with in_array(), which I find weak.

Does anyone have a better solution?

Thanks in advance.

Answer

Spudley · Jul 8, 2011

I see you've already accepted another answer, but I want to explain why your attempts with regex weren't working. Hopefully it'll help you.

Firstly, I notice the ereg functions among your tags for this question. Please note that PHP's ereg_ functions have been deprecated; you should only use the preg_ functions.

Now, if you want to use regex for this sort of thing, you would typically use a negated character class to define a list of characters you want to allow, and then look for anything else.

A character class is a list of characters enclosed in square brackets. You can negate a character class by adding a caret symbol (^) to the start of it. So if you wanted a string that contained only 'A', 'B' or 'C', and you wanted to be warned about strings which contained anything else, you could use something like this:

$result = preg_match("/[^ABC]/",$mystring);
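To make the return value easier to read, here is a minimal sketch that wraps the call above. The helper name containsOnlyAbc is made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when $s contains only the letters A, B or C.
function containsOnlyAbc(string $s): bool
{
    // preg_match() returns 1 if a disallowed character is found,
    // 0 if none is found, and false if the pattern could not be applied.
    $result = preg_match('/[^ABC]/', $s);
    if ($result === false) {
        throw new RuntimeException('preg_match failed: ' . preg_last_error());
    }
    return $result === 0;
}

var_dump(containsOnlyAbc('ABBA'));  // bool(true)
var_dump(containsOnlyAbc('ABCD'));  // bool(false): D is not in the list
var_dump(containsOnlyAbc(''));      // bool(true): nothing disallowed in an empty string
```

Note that a result of 0 is the "good" case here, because the pattern searches for characters that are not allowed.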

Your example is basically the same (but with more characters to test, obviously), except for two points: firstly, you have characters in your list which are reserved characters in regex, and secondly, you are using non-ASCII characters.

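The regex reserved characters can be dealt with by escaping them with a leading backslash. You just need to know which characters are reserved. Looking at your list, I see ?, /, . and +. Strictly speaking, inside a character class only a handful of characters are special (], \, ^ and -, plus the / used here as the delimiter), but escaping the others does no harm.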
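If you would rather not escape by hand, PHP's preg_quote() can do it for you. The following is only a sketch: the helper name buildDisallowedPattern is hypothetical, and the example sticks to ASCII characters.

```php
<?php
// Hypothetical helper: builds a "find anything NOT in this list" pattern.
function buildDisallowedPattern(string $allowedChars): string
{
    // preg_quote() backslash-escapes regex metacharacters; passing '/' as the
    // second argument also escapes the delimiter used in the pattern below.
    return '/[^' . preg_quote($allowedChars, '/') . ']/';
}

$pattern = buildDisallowedPattern('?/.+');
echo $pattern, "\n";                      // /[^\?\/\.\+]/

var_dump(preg_match($pattern, '?/.+'));   // int(0): only allowed characters
var_dump(preg_match($pattern, 'a?'));     // int(1): "a" is not in the list
```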

The second point explains why you couldn't get it working with ereg, because the ereg functions don't support Unicode. Switch to using the preg functions instead, and you'll have more luck.

You still need to tell the regex engine that you're looking for Unicode characters. This is done by adding the u modifier to the end of the regex string, which makes the engine treat both the pattern and the subject string as UTF-8.

So a shortened version of your query might look like this:

$result = preg_match("/[^èΛ¤4DTdt]/u",$mystring);
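Here is a small sketch of why the u modifier matters for the shortened pattern. It assumes the source file and the input strings are encoded as UTF-8; the helper name usesOnlyAllowed is invented for the example.

```php
<?php
// Assumes this file and the input strings are encoded as UTF-8.
$pattern = '/[^èΛ¤4DTdt]/u';

// Hypothetical helper built on the shortened pattern above.
function usesOnlyAllowed(string $pattern, string $s): bool
{
    $result = preg_match($pattern, $s);
    if ($result === false) {
        // With the u modifier this happens, for example, on invalid UTF-8 input.
        throw new InvalidArgumentException('Pattern failed; is the input valid UTF-8?');
    }
    return $result === 0;
}

var_dump(usesOnlyAllowed($pattern, 'dèt4'));  // bool(true)
var_dump(usesOnlyAllowed($pattern, 'dät4'));  // bool(false): ä is not in the list

// Without u the engine compares single bytes. The two bytes of "ä" also occur
// in "è" and "¤", so the byte-wise pattern wrongly accepts it.
var_dump(preg_match('/[^èΛ¤4DTdt]/', 'ä'));   // int(0)
```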

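It looks like you're including new lines (LF and CR) in your list of characters, so add \n and \r to the character class as well. The multi-line modifier m won't help here: it only changes how the ^ and $ anchors behave, and this pattern doesn't use them.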

For characters which can't be written (or indeed for any character, if it's easier), you can add escape sequences for their Unicode character codes. Together with the u modifier, use \x{FFFF} where FFFF is the hex Unicode code point of the character you want to match, e.g. \x{00E0} matches à. The \uFFFF form found in some other regex flavours is not supported by PHP's preg functions.
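As a rough illustration, this is how code-point escapes and line breaks can sit in the same character class. PHP's preg_ functions use PCRE, where a code point is written as \x{...}. The test strings use PHP's "\u{...}" string escape, which assumes PHP 7.0 or later and is unrelated to the regex syntax.

```php
<?php
// Allowed here: à (U+00E0), è (U+00E8), line feed and carriage return.
// Single quotes pass the escapes through to the regex engine unchanged.
$pattern = '/[^\x{00E0}\x{00E8}\n\r]/u';

var_dump(preg_match($pattern, "\u{00E0}\r\n\u{00E8}"));  // int(0): only allowed characters
var_dump(preg_match($pattern, "\u{00E0}b"));             // int(1): "b" is not allowed
```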

I hope that gives you a better insight into regular expressions. I should add that I'm not saying that regex is necessarily the best solution to this question, nor necessarily the only solution. I have tried to make it perform well by using the negated character class, which means the search stops as soon as it finds a character outside the list and should avoid the kind of excessive backtracking that can make regular expressions quite slow. So it should be reasonably performant, but I haven't tested it against other solutions.

I hope that helps.