XML parser error: entity not defined

NightHawk · Sep 27, 2010 · Viewed 84.3k times

I have searched Stack Overflow for this problem and found a few topics, but none of them gave me a solid answer.

I have a form that users submit, and each field's value is stored in an XML file. The XML is set to be encoded as UTF-8.

Every now and then a user will copy/paste text from somewhere, and that's when I get the "entity not defined" error.

I realize XML only predefines a handful of entities (&amp;, &lt;, &gt;, &quot; and &apos;) and anything beyond that is not recognized, hence the parser error.
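To make the failure concrete, here is a minimal sketch of how SimpleXML reacts to a named HTML entity versus a predefined entity and a numeric character reference. The <field> element name and the sample strings are made up for illustration; they are not taken from my actual file.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// &nbsp; is an HTML entity, not one of XML's five predefined entities,
// so the parser rejects the document.
$bad = simplexml_load_string('<field>a&nbsp;b</field>');
$badFailed = ($bad === false);
libxml_clear_errors();

// &amp; is predefined and &#160; is a numeric character reference,
// so both are accepted without any declaration.
$good = simplexml_load_string('<field>a&#160;b &amp; c</field>');
$goodText = (string) $good;
```

Calling libxml_get_errors() right after the failed parse (before clearing) should list the "not defined" message that otherwise shows up as a PHP warning.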

From what I gather, there are a few options:

  1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
  2. I can place the code in question within a CDATA section.
  3. I can include these entities within the XML file.
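The three options could look roughly like the sketch below. The helper names replaceNbsp and wrapInCdata and the <field> element are hypothetical, and only &nbsp; is handled; a real form would likely see other named entities as well.

```php
<?php
// Option 1: swap the named entity for its numeric character reference.
function replaceNbsp(string $text): string
{
    return str_replace('&nbsp;', '&#160;', $text);
}

// Option 2: wrap the text in CDATA so the parser treats "&nbsp;" as literal characters.
// A literal "]]>" in the text is split across two sections so it cannot end the CDATA early.
function wrapInCdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Option 3: declare the entity in an internal DTD subset at the top of the XML document.
$withDtd = '<!DOCTYPE field [<!ENTITY nbsp "&#160;">]><field>a&nbsp;b</field>';
```

Note that with CDATA the stored text keeps the literal characters "&nbsp;", so the entity is only interpreted later if the page that displays it treats it as markup.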

Here is what I'm doing with the XML file: the user enters content into a form, it gets stored in an XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan

UPDATE

I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some fields were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the offending characters were stripped out (because it couldn't read them), but on a Mac you could see little square boxes showing the Unicode number of each character. The reason they showed up as squares on the Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).
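One way stray characters like this can arise is double encoding: utf8_encode assumes its input is ISO-8859-1, so running it on text that is already UTF-8 mangles every non-ASCII character. I can't say for certain that this is what happened here, but a guard along these lines avoids it. The helper name toUtf8 is hypothetical, and the sketch assumes the mbstring extension is available and that any non-UTF-8 input is ISO-8859-1.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not already valid UTF-8.
// Assumes any non-UTF-8 input is ISO-8859-1, the same assumption utf8_encode() makes.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8; converting again would double-encode it
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$alreadyUtf8 = "caf\u{E9}"; // "café" as UTF-8 bytes
$latin1      = "caf\xE9";   // "café" as ISO-8859-1 bytes

// What an unconditional conversion does to text that is already UTF-8.
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');
```

Both toUtf8($alreadyUtf8) and toUtf8($latin1) come out as the same UTF-8 string, while $doubleEncoded holds the garbled form.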

The solution to all this was quite simple:

I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.

The only thing I still don't understand is why the characters show up correctly when entered in plain textboxes, where nothing converts them to UTF-8, while with TinyMCE it was a problem.

Answer

Gaurav Arya · Nov 30, 2010

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:

  1. Before passing the HTML fragment to the SimpleXMLElement constructor, I decoded it using html_entity_decode().

  2. Then I encoded the result using utf8_encode().

// Decode named HTML entities (e.g. &nbsp;) that XML does not define, then encode the result as UTF-8.
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);

Now the above code does not throw any undefined entity errors.
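A caveat: html_entity_decode() also turns &amp; and &lt; into raw "&" and "<", which can break the XML again, and since PHP 5.4 it decodes to UTF-8 by default, so the extra utf8_encode() step may no longer be wanted. A possible variant, offered as a sketch rather than a drop-in replacement, decodes only the entities XML does not predefine. The helper name decodeNonXmlEntities and the sample fragment are hypothetical.

```php
<?php
// Hypothetical helper: decode named HTML entities to UTF-8 characters,
// but leave XML's five predefined entities as they are.
function decodeNonXmlEntities(string $fragment): string
{
    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            if (in_array($m[1], ['amp', 'lt', 'gt', 'quot', 'apos'], true)) {
                return $m[0]; // valid in XML already, keep escaped
            }
            // Names html_entity_decode() does not know are returned unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
        },
        $fragment
    );
    return $result ?? $fragment; // fall back to the input if the regex engine fails
}

$headerFragment = '<p>Fish&nbsp;&amp;&nbsp;chips &copy; 2010</p>';
$xmlHeader = new SimpleXMLElement('<temp>' . decodeNonXmlEntities($headerFragment) . '</temp>');
$paragraphText = (string) $xmlHeader->p;
```

An entity name that HTML 4.01 does not define passes through untouched and would still trigger the parser error, so this narrows the problem rather than eliminating it.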