Say I have a long UTF-8 encoded string, and I want to detect whether $var exists in it. Assuming $var will only ever contain plain ASCII letters or digits (e.g. "hello123"), I shouldn't need to use mb_strpos or iconv_strpos, right? It shouldn't matter that the position is not character-wise correct, as long as it's consistent with the other functions.
Example:
$var='hello123';
$pos=strpos($utf8string,$var);
if ($pos!==false) $uptohere=substr($utf8string,0,$pos);
Am I correct that the above code will extract everything up to 'hello123' regardless of whether the string contains fancy UTF-8 characters? My logic is that because both strpos and substr are consistent with each other (even if consistently wrong), it should still work.
Yes, you are correct. There's no ambiguity about the characters themselves, i.e. hello123 can't possibly be anything else in UTF-8. The way you're slicing it, it doesn't matter whether you're slicing by character or by byte number.
So yes, this is safe, as long as your string is UTF-8 and thereby ASCII compatible.
See here for quick test: http://3v4l.org/XnM8s
Why this works:
The string "漢字hello123" in UTF-8 looks like this as bytes (I hope this aligns correctly):
e6 bc a2 | e5 ad 97 | 68 | 65 | 6c | 6c | 6f | 31 | 32 | 33
漢       | 字       | h  | e  | l  | l  | o  | 1  | 2  | 3
strpos will look for the byte sequence 68656c6c6f313233, returning 6 as the starting byte offset of "hello123". substr will then slice 6 bytes starting from byte 0, returning "漢字". There is no ambiguity. You're finding and slicing by bytes, so it doesn't matter how many characters there are.
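The byte-level walkthrough above can be reproduced with a small sketch. The string is written with explicit \x escapes so the result doesn't depend on how the source file is saved; the variable names follow the question, and nothing beyond core PHP string functions is assumed.

```php
<?php
// "漢字hello123" spelled out as raw UTF-8 bytes (3 bytes per CJK character).
$utf8string = "\xe6\xbc\xa2\xe5\xad\x97hello123";
$var = 'hello123';

// The needle as strpos sees it: a plain byte sequence.
echo bin2hex($var), "\n";        // 68656c6c6f313233

// strpos returns a byte offset, not a character offset.
$pos = strpos($utf8string, $var);
var_dump($pos);                  // int(6)

// substr also counts bytes, so the two functions agree with each other.
$uptohere = ($pos !== false) ? substr($utf8string, 0, $pos) : null;
echo bin2hex($uptohere), "\n";   // e6bca2e5ad97, i.e. the bytes of "漢字"
```

The offset 6 is "wrong" as a character position (the characters before the match number only 2), but since substr interprets it the same way, the slice ends exactly where the match begins.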
You need to work either entirely in characters, in which case the string functions must be encoding-aware, or entirely in bytes, in which case the only requirement is that the bytes aren't ambiguous. As a hypothetical example of ambiguity, imagine the bytes of "hello123" also matching "中国" encoded in BIG5 (they don't; this is just an illustration). UTF-8 is self-synchronizing, meaning there's no such ambiguity: the bytes of an ASCII character never occur inside a multi-byte sequence.
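To illustrate why the two approaches shouldn't be mixed, here is a rough sketch comparing the consistent pairings with an inconsistent one. It assumes the mbstring extension is loaded; the variable names ($bytePos, $charPos, $mixed and so on) are made up for the example.

```php
<?php
// Assumption: the mbstring extension is available.
// "漢字hello123" as raw UTF-8 bytes.
$utf8string = "\xe6\xbc\xa2\xe5\xad\x97hello123";
$var = 'hello123';

// Entirely in bytes: byte offset fed to a byte-based slice.
$bytePos = strpos($utf8string, $var);                     // 6
$bytes   = substr($utf8string, 0, $bytePos);              // "漢字"

// Entirely in characters: character offset fed to a character-based slice.
$charPos = mb_strpos($utf8string, $var, 0, 'UTF-8');      // 2
$chars   = mb_substr($utf8string, 0, $charPos, 'UTF-8');  // "漢字"

// Mixed: a byte offset treated as a character count overshoots the match.
$mixed   = mb_substr($utf8string, 0, $bytePos, 'UTF-8');  // "漢字hell"

var_dump($bytes === $chars);   // bool(true)
var_dump($mixed === $bytes);   // bool(false)
```

Both consistent pairings give the same prefix; only the mixed call goes wrong, which is the case the question's code avoids.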