UTF-8 BOM signature in PHP files

treznik picture treznik · Apr 1, 2010 · Viewed 49.1k times · Source

I was writing some commented PHP classes and I stumbled upon a problem. My name (for the @author tag) ends up with a ș (which is a UTF-8 character, ...and a strange name, I know).

Even though I save the file as UTF-8, some friends reported that they see that character totally messed up (È™). This problem goes away by adding the BOM signature. But that thing troubles me a bit, since I don't know that much about it, except from what I saw on Wikipedia and on some other similar questions here on SO.

I know that it adds some things at the beginning of the file, and from what I understood it's not that bad, but I'm concerned because the only problematic scenarios I read about involved PHP files. And since I'm writing PHP classes to share them, being 100% compatible is more important than having my name in the comments.

But I'm trying to understand the implications, should I use it without worrying? or are there cases when it might cause damage? When?

Answer

skrebbel picture skrebbel · Apr 1, 2010

Indeed, the BOM is actual data sent to the browser. The browser will happily ignore it, but still you cannot send headers then.

I believe the problem really is your and your friend's editor settings. Without a BOM, your friend's editor may not automatically recognize the file as UTF-8. He can try to set up his editor such that the editor expects a file to be in UTF-8 (if you use a real IDE such as NetBeans, then this can even be made a project setting that you can transfer along with the code).

An alternative is to try some tricks: some editors try to determine the encoding using some heuristics based on the entered text. You could try to start each file with

<?php //Úτƒ-8 encoded

and maybe the heuristic will get it. There's probably better stuff to put there, and you can either google for what kind of encoding detection heuristics are common, or just try some out :-)

All in all, I recommend just fixing the editor settings.

Oh wait, I misread the last part: for spreading the code to anywhere, I guess you're safest just making all files only contain the lower 7-bit characters, i.e. plain ASCII, or to just accept that some people with ancient editors see your name written funny. There is no fail-safe way. The BOM is definitely bad because of the headers already sent thing. On the other side, as long as you only put UTF-8 characters in comments and so, the only impact of some editor misunderstanding the encoding is weird characters. I'd go for correctly spelling your name and adding a comment targeted at heuristics so that most editors will get it, but there will always be people who'll see bogus chars instead.