I'm trying to search a UTF8-encoded string using preg_match.
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];
This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.
I have the following settings in my php.ini, and other UTF8 functions are working:
mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off
Any ideas?
Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
You can use mb_strlen
to get the length in UTF-8 characters rather than bytes:
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));