How can I get the single bytes from a multibyte PHP string variable in a binary-safe way?

e-sushi · Aug 1, 2013 · Viewed 13.6k times

Let's say (for simplicity's sake) that I have a multibyte, UTF-8 encoded string variable with 3 letters (consisting of 4 bytes):

$original = 'Fön';

Since it's UTF-8, the bytes' hex values are (excluding the BOM):

46 C3 B6 6E
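
One way to confirm this byte sequence is sketched here. It assumes the string is written with explicit escape sequences, so the encoding of the source file cannot interfere; the variable names are illustrative only.

```php
<?php
// Assumption: the bytes are spelled out with escapes, so the file encoding does not matter.
$original = "F\xC3\xB6n";   // UTF-8 bytes for 'Fön'

// bin2hex() is binary-safe and returns two lowercase hex digits per byte.
$hex = bin2hex($original);

// Split into byte pairs, join them with spaces, and uppercase the result.
$spaced = strtoupper(implode(' ', str_split($hex, 2)));

echo $spaced, PHP_EOL;
```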

As the $original variable is user-defined, I will need to handle two things:

  1. Get the exact number of bytes (not UTF-8 characters) used in the string, and
  2. A way to access each individual byte (not UTF-8 character).

I would tend to use strlen() to handle "1.", and access the $original variable's bytes with a simple `$original[$byteposition]`, like this:

<?php
header('Content-Type: text/html; charset=UTF-8');

$original = 'Fön';
$totalbytes = strlen($original);
for($byteposition = 0; $byteposition < $totalbytes; $byteposition++)
{
    $currentbyte = $original[$byteposition];

    /*
        Doesn't work since var_dump shows 3 bytes.
    */
    var_dump($currentbyte);

    /*
        Fails too since "ord" only works on ASCII chars.
        It returns "46 F6 6E"
    */
    printf("%02X", ord($currentbyte));
    echo('<br>');
}

exit();
?>

This proves my initial idea is not working:

  1. var_dump shows 3 bytes
  2. printf fails too since "ord" only works on ASCII chars

How can I get the single bytes from a multibyte PHP string variable in a binary-safe way?

What I am looking for is a binary-safe way to convert UTF-8 string(s) into byte-array(s).

Answer

steven · Aug 1, 2013

You can get a byte array by unpacking the UTF-8 encoded string $a:

$a = utf8_encode('Fön');
$b = unpack('C*', $a); 
var_dump($b);

This uses the format code C* ("unsigned char", repeated for every byte), so $b holds one integer per byte, with keys starting at 1.
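
A minimal sketch of the same idea follows. Note that utf8_encode() converts ISO-8859-1 input to UTF-8 (and is deprecated as of PHP 8.2), so it is only needed when the string is not already UTF-8. The sketch assumes the string already holds UTF-8 bytes, written with escape sequences so the encoding of the source file plays no role; the variable names are illustrative only.

```php
<?php
// Assumption: $original already contains UTF-8 bytes ('Fön' = 46 C3 B6 6E).
$original = "F\xC3\xB6n";

// strlen() counts bytes, not characters.
$totalBytes = strlen($original);

// unpack('C*') returns one unsigned integer per byte; keys start at 1.
$bytes = unpack('C*', $original);

// Re-index from 0 if a conventional zero-based array is preferred.
$zeroBased = array_values($bytes);

// Print each byte position with its value in hex.
foreach ($zeroBased as $position => $byte) {
    printf("%d: %02X%s", $position, $byte, PHP_EOL);
}
```

If the input might be ISO-8859-1 rather than UTF-8, it would have to be converted first (for example with mb_convert_encoding()) before the bytes match the UTF-8 sequence shown in the question.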
