Remove Unicode Zero Width Space PHP

Jimmy Long · Mar 24, 2014 · Viewed 12k times

I have some UTF-8 text in Burmese that I am processing with PHP. At some point along the way, some zero-width spaces (ZWSPs) crept in, and I would like to remove them. I have tried two different ways of removing the characters, and neither seems to work.

First, I tried:

  $newBody = str_replace("&#8203;", "", $newBody);

to search for the HTML entity and remove it, as this is how it appears under Web Inspector. The spaces don't get removed. I have also tried it as:

  $newBody = str_replace("&#8203", "", $newBody);

and got the same result: nothing was removed.

The second method I tried came from this question: "Remove ZERO WIDTH NON-JOINER character from a string in PHP",

which looked like this:

 $newBody = str_replace("\xE2\x80\x8C", "", $newBody);

but I also got no result. The ZWSP was not removed.

An example word in the text ($newBody) looks like this : ယူ​​က​​ရိန်
And I want to make it look like this : ယူကရိန်း

Any ideas? Would a preg_replace work better somehow?

So I did try

$newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody);

and it appears to be working, but now there is another issue.

<a class="defined" title="Ukraine">ယူ&#8203;က&#8203;ရိန်း</a>

gets transformed into

<a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a>

I don't want it to add all that extra stuff. Any ideas why this is happening? Apart from coming up with some way to target only the text between the tags, is there another way to prevent the preg_replace from adding all this extra stuff? By the way, I am using Google Chrome on a Mac. It seems to act a bit differently in Firefox...

Answer

Jef · Mar 24, 2014

This:

$newBody = str_replace("&#8203;", "", $newBody);

presumes the text is HTML entity encoded. This:

$newBody = str_replace("\xE2\x80\x8C", "", $newBody);

should work if the offending characters are raw (not entity-encoded), but it matches the wrong character: 0xE2808C is U+200C ZERO WIDTH NON-JOINER. To match the same character as &#8203; (U+200B ZERO WIDTH SPACE), you need 0xE2808B:

$newBody = str_replace("\xE2\x80\x8B", "", $newBody);
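If it is unclear whether the text holds raw ZWSP bytes or the entity form, both cases can be handled together. The following is only a minimal sketch: the helper name stripZeroWidthSpaces is hypothetical (it is not part of PHP or of the original answer), and it assumes the input is UTF-8 and that the entity, when present, is written as &#8203; or &#x200B;.

```php
<?php
// Hypothetical helper, not from the original answer: removes U+200B
// whether it appears as raw UTF-8 bytes or as a numeric HTML entity.
function stripZeroWidthSpaces(string $text): string
{
    // Raw UTF-8 encoding of U+200B ZERO WIDTH SPACE.
    $text = str_replace("\xE2\x80\x8B", "", $text);

    // Entity forms: decimal &#8203; and hexadecimal &#x200B; (case-insensitive).
    // preg_replace returns null on error, so fall back to the unchanged text.
    return preg_replace('/&#(?:8203|x200B);/i', '', $text) ?? $text;
}

// Sample inputs (made up for illustration).
$raw     = "a" . "\xE2\x80\x8B" . "b";
$encoded = "a&#8203;b&#x200b;c";

var_dump(stripZeroWidthSpaces($raw) === "ab");
var_dump(stripZeroWidthSpaces($encoded) === "abc");
```

To check which form a string actually contains, bin2hex($newBody) shows the raw bytes; a ZWSP appears as e2808b, while an entity appears as the ASCII bytes of the literal text &#8203;.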