PHP: strpos & substr with UTF-8

Alasdair · Feb 24, 2013 · Viewed 7.9k times

Say I have a long UTF-8 encoded string.

And say I want to detect if $var exists in this string.

Assuming $var will always consist of plain ASCII letters or digits (e.g. "hello123"), I shouldn't need to use mb_strpos or iconv_strpos, right? It shouldn't matter if the position is not character-wise correct, as long as it's consistent with the other functions.

Example:

$var='hello123';
$pos=strpos($utf8string,$var);
if ($pos!==false) $uptohere=substr($utf8string,0,$pos);

Am I correct that the above code will extract everything up to 'hello123' regardless of whether the string contains fancy UTF-8 characters? My logic is that because both strpos and substr will be consistent with each other (even if this is consistently wrong) then it should still work.

Answer

deceze · Feb 24, 2013

Yes, you are correct. There's no ambiguity about the characters themselves, i.e. hello123 can't possibly be anything else in UTF-8. The way you're slicing it, it doesn't matter whether you're slicing by character or by byte number.

So yes, this is safe, as long as your string is UTF-8 and thereby ASCII compatible.

See here for quick test: http://3v4l.org/XnM8s
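For a check that can be run locally, here is a minimal sketch that compares the byte-based functions with their mbstring counterparts. The sample string is a made-up example, and the comparison only runs if the mbstring extension happens to be loaded. The multi-byte characters are written as explicit byte escapes so the result doesn't depend on the encoding of the source file.

```php
<?php
// Hypothetical sample data: "漢字" (as raw UTF-8 bytes) followed by ASCII text.
$utf8string = "\xe6\xbc\xa2\xe5\xad\x97hello123 and more";
$var = 'hello123';

// Byte-based search and slice, as in the question.
$bytePos = strpos($utf8string, $var);
$byteCut = ($bytePos !== false) ? substr($utf8string, 0, $bytePos) : null;

// Character-based search and slice, only if mbstring is available.
$charPos = null;
$charCut = null;
if (function_exists('mb_strpos')) {
    $charPos = mb_strpos($utf8string, $var, 0, 'UTF-8');
    $charCut = ($charPos !== false) ? mb_substr($utf8string, 0, $charPos, 'UTF-8') : null;

    var_dump($bytePos, $charPos);    // the offsets differ: int(6) vs. int(2)
    var_dump($byteCut === $charCut); // bool(true): the extracted prefix is identical
}
```

The offsets disagree (6 bytes versus 2 characters), but each offset is only ever handed to the function from the same family, so both routes cut at the same place.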

Why this works:

The string "漢字hello123" in UTF-8 looks like this as bytes (I hope this aligns correctly):

e6 | bc | a2 | e5 | ad | 97 | 68 | 65 | 6c | 6c | 6f | 31 | 32 | 33
     漢      |      字      | h  | e  | l  | l  | o  | 1  | 2  | 3

strpos will look for the byte sequence 68656c6c6f313233 and return 6, the offset of the first byte of "hello123". substr will then slice 6 bytes starting from byte 0, returning "漢字". There is no ambiguity: you're finding and slicing by bytes, so it doesn't matter how many characters there are.
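The byte-level view can be reproduced with bin2hex. This is just an illustrative sketch of the walkthrough above, with the string spelled out as byte escapes so it matches the table regardless of how the script file is saved:

```php
<?php
// "漢字hello123" as explicit UTF-8 bytes.
$utf8string = "\xe6\xbc\xa2\xe5\xad\x97hello123";
$var = 'hello123';

echo bin2hex($utf8string), "\n"; // e6bca2e5ad9768656c6c6f313233
echo bin2hex($var), "\n";        // 68656c6c6f313233

// strpos returns a byte offset.
$pos = strpos($utf8string, $var);
var_dump($pos);                  // int(6)

// substr takes that many bytes from the start.
$uptohere = substr($utf8string, 0, $pos);
echo bin2hex($uptohere), "\n";   // e6bca2e5ad97, i.e. the bytes of "漢字"
```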

You need to work either entirely in characters, in which case the string functions must be encoding-aware, or entirely in bytes, in which case the only requirement is that the bytes aren't ambiguous. (Ambiguity would arise if, say, the bytes of "hello123" could also match "中国" encoded in BIG5; they don't, this is just an example.) UTF-8 is self-synchronizing, meaning there's no such ambiguity.
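As a rough illustration of both points, the sketch below first checks the property that makes the byte approach safe for an ASCII needle: every byte of a multi-byte UTF-8 character is 0x80 or above, while every ASCII byte is below 0x80, so an ASCII needle can never match in the middle of a multi-byte character. It then shows what goes wrong when the two worlds are mixed, i.e. a byte offset from strpos is passed to the character-based mb_substr (this part assumes the mbstring extension is loaded and is skipped otherwise). The variable names are made up for the example.

```php
<?php
$multibyte = "\xe6\xbc\xa2\xe5\xad\x97"; // "漢字" as UTF-8 bytes
$ascii = 'hello123';

// Every byte of the multi-byte characters has the high bit set...
$allHigh = true;
foreach (str_split($multibyte) as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}

// ...and every byte of the ASCII needle does not.
$allLow = true;
foreach (str_split($ascii) as $byte) {
    if (ord($byte) >= 0x80) {
        $allLow = false;
    }
}

var_dump($allHigh, $allLow); // bool(true) bool(true)

// Mixing byte offsets with character-based slicing gives the wrong result.
$mixed = null;
if (function_exists('mb_substr')) {
    $utf8string = $multibyte . $ascii;
    $bytePos = strpos($utf8string, $ascii);              // 6 (bytes)
    $mixed = mb_substr($utf8string, 0, $bytePos, 'UTF-8'); // takes 6 characters
    echo bin2hex($mixed), "\n"; // e6bca2e5ad9768656c6c, i.e. "漢字hell"
}
```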