I have UTF-8 text in Burmese that I am processing with PHP. At some point along the way, some zero-width spaces (ZWSPs) have crept in and I would like to remove them. I have tried two different ways of removing the characters, and neither seems to work.
First I tried:
$newBody = str_replace("&#8203;", "", $newBody);
to search for the HTML entity and remove it, as this is how it appears under Web Inspector. The spaces don't get removed. I have also tried it as:
$newBody = str_replace("&#x200B;", "", $newBody);
and got the same result: nothing was removed.
The second method I tried came from the question "Remove ZERO WIDTH NON-JOINER character from a string in PHP",
and looked like this:
$newBody = str_replace("\xE2\x80\x8C", "", $newBody);
but I also got no result. The ZWSP was not removed.
An example word in the text ($newBody) looks like this : ယူ​က​ရိန်
And I want to make it look like this : ယူကရိန်း
Any ideas? Would a preg_replace work better somehow?
So I did try
$newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody);
and it appears to be working, but now there is another issue.
<a class="defined" title="Ukraine">ယူ​က​ရိန်း</a>
gets transformed into
<a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a>
I don't want it to add all that extra stuff. Any ideas why this is happening? Apart from coming up with some way to target only the text between the tags, is there another way to prevent the preg_replace from adding all this extra stuff? By the way, I am using Google Chrome on a Mac. It seems to behave a bit differently in Firefox...
This:
$newBody = str_replace("&#8203;", "", $newBody);
presumes the text is HTML entity encoded. This:
$newBody = str_replace("\xE2\x80\x8C", "", $newBody);
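should work if the offending characters are not entity-encoded, but it matches the wrong character: 0xE2808C is the UTF-8 encoding of U+200C (ZERO WIDTH NON-JOINER). To match the same character as &#8203;, which is U+200B (ZERO WIDTH SPACE), you need 0xE2808B: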
$newBody = str_replace("\xE2\x80\x8B", "", $newBody);
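If it is unclear whether the text holds raw ZWSP bytes or the entity form, both can be stripped in one pass. The following is only a minimal sketch: the helper name stripZwsp and the sample string are made up for illustration and are not from the question.

```php
<?php
// Hypothetical helper (not from the original post): removes ZERO WIDTH SPACE
// (U+200B) whether it is stored as raw UTF-8 bytes or as an HTML entity.
function stripZwsp(string $text): string
{
    // The UTF-8 encoding of U+200B is the three bytes E2 80 8B.
    $text = str_replace("\xE2\x80\x8B", "", $text);

    // Decimal and hexadecimal numeric character references for U+200B.
    // str_ireplace() is case-insensitive, so "&#x200b;" is matched too.
    return str_ireplace(["&#8203;", "&#x200B;"], "", $text);
}

// Sample input: one raw ZWSP and one entity-encoded ZWSP (made-up test data).
$word  = "ယူ\xE2\x80\x8Bက&#8203;ရိန်း";
$clean = stripZwsp($word);

// Check for the raw byte sequence before and after cleaning.
var_dump(strpos($word, "\xE2\x80\x8B") !== false);  // bool(true)
var_dump(strpos($clean, "\xE2\x80\x8B") !== false); // bool(false)
var_dump(strpos($clean, "&#8203;") !== false);      // bool(false)
```

Running bin2hex() on a short piece of the string is another way to see which bytes are really there, since ZWSP (e2808b) and ZWNJ (e2808c) look identical on screen.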