preg_replace with Cyrillic chars

Alex Emilov · Oct 12, 2011 · Viewed 9.8k times

I want to replace the characters matching [^a-zа-з0-9_] with an empty string, but I can't get it to work when the input is a multibyte string.

I tried mb_*, iconv, PCRE, mb_eregi_replace, and the u modifier (for PCRE), but none of them worked well.

mb_eregi_replace outputs a correct UTF-8 string, but it doesn't replace the characters, whereas preg_replace does replace them with the same regex.

Here is my code. It handles Unicode, but it doesn't replace anything.

function _data($data)
{
  mb_regex_encoding('UTF-8');
  return mb_eregi_replace('/[^a-zа-з0-9_]+/', '', $data);
}

var_dump(namespace\_data('Текст Removethis- and this _#$)( and also this $*@&$'));

The result still contains the special characters (#, $, ...) when they should have been removed. If I change the function to preg_replace (without Unicode), it does replace them.

Answer

hakre · Oct 12, 2011

As long as your input string is UTF-8 encoded (if not, re-encode it to UTF-8), you can safely use preg_replace if you use the correct regular expression.

function _data($data)
{ 
  return preg_replace('/[^\w_]+/u', '', $data);
}

var_dump(namespace\_data('Текст Removethis- and this _#$)( and also this $*@&$'));

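  • \w = any word character (letters, digits, and the underscore, so the explicit _ in the class is redundant but harmless)
  • u (at the end) = treat the pattern and the subject as UTF-8, so \w also matches non-ASCII letters such as Cyrillic.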
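
Note that \w with the u modifier keeps letters and digits from every script, which is broader than the class in the question. If only Latin and Cyrillic letters should survive, a narrower pattern might look like the sketch below. It assumes the input is valid UTF-8; the function name keep_latin_cyrillic is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper, not from the original answer.
// Keeps ASCII letters, Cyrillic letters, ASCII digits, and the underscore;
// every other character is removed.
function keep_latin_cyrillic($data)
{
    // \p{Cyrillic} is a PCRE Unicode script property and needs the u modifier.
    // preg_replace returns null on error (for example, invalid UTF-8 input).
    return preg_replace('/[^A-Za-z\p{Cyrillic}0-9_]+/u', '', $data);
}

echo keep_latin_cyrillic('Текст Removethis- and this _#$)( and also this $*@&$'), "\n";
// ТекстRemovethisandthis_andalsothis
```

As for the original mb_eregi_replace attempt: the mb_ereg* functions take the pattern without delimiters, so the surrounding slashes in '/[^a-zа-з0-9_]+/' are matched as literal characters, which is a likely reason nothing was replaced.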