simplexml_load_string errors on big files occur on one system but not another

Wesley Murch · Feb 19, 2013

I'm dealing with a third-party PHP library that I can't edit, and it has been working fine for almost a year. It calls simplexml_load_string on the response from a remote server. Lately it has been choking on large responses. This is a data feed for real estate listings, and the format looks something like this:

<?xml version="1.0"?>
<RETS ReplyCode="0" ReplyText="Operation Successful Reference ID: 9bac803e-b507-49b7-ac7c-d8e8e3f3aa89">
<COUNT Records="9506" />
<DELIMITER value="09" />
<COLUMNS>   sysid   1   2   3   4   5   6   </COLUMNS>
<DATA>  252370080   Residential 0.160   No  ADDR0   06051</DATA>
<DATA>  252370081   Residential 0.440   Yes ADDR0   06043</DATA>
<DATA>  252370082   Residential 1.010   No  ADDR0   06023</DATA>
<DATA>More tab delimited text</DATA>
<!-- snip 9000+ lines -->
</RETS>

I downloaded a sample response (about 22 MB) to a file, and the test script below is as far as my debugging has taken me. Both servers are running PHP version 5.3.8, but note the different results. I'm as certain as I can be that both files are the same (I suppose the differences in filesize, strlen, and the last 50 characters can be explained by Windows newlines having an extra carriage return character). Test script:

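<?php
// Report and display every error so parser warnings are visible.
error_reporting(-1);
ini_set('display_errors', 1);
$file = 'error-example.xml';
// Read the whole saved response into memory as a string.
$xml = file_get_contents($file);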

echo 'filesize:              ';
var_dump(filesize($file));

echo 'strlen:                ';
var_dump(strlen($xml));

echo 'simplexml object?      ';
var_dump(is_object(simplexml_load_string($xml)));

echo 'Last 50 characters:    ';
var_dump(substr($xml, -50));

Output locally on Windows:

filesize:              int(21893604)
strlen:                int(21893604)
simplexml object?      bool(true)
Last 50 characters:    string(50) "RD DR    CT  Watertown   203-555-5555            </DATA>
</RETS>"

Output on remote UNIX server:

filesize:              int(21884093)
strlen:                int(21884093)
simplexml object?      
Warning: simplexml_load_string(): Entity: line 9511: parser error : internal error in /path/to/test.php on line 19

Warning: simplexml_load_string(): AULTED CEILING IN FOYER, BRICK FP IN FR, NEW FLOORING IN LR DR FR FOYER KITCHEN  in /path/to/test.php on line 19

Warning: simplexml_load_string():                                                                                ^ in /path/to/test.php on line 19

Warning: simplexml_load_string(): Entity: line 9511: parser error : Extra content at the end of the document in /path/to/test.php on line 19

Warning: simplexml_load_string(): AULTED CEILING IN FOYER, BRICK FP IN FR, NEW FLOORING IN LR DR FR FOYER KITCHEN  in /path/to/test.php on line 19

Warning: simplexml_load_string():                                                                                ^ in /path/to/test.php on line 19
bool(false)
Last 50 characters:    string(50) "ORD DR   CT  Watertown   203-555-5555            </DATA>
</RETS>"
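
The carriage-return explanation can be checked rather than assumed. The following is only a rough sketch, not part of the original test script: the helper name differs_only_by_line_endings is made up for illustration, and it assumes the only line-ending difference between the two copies is CRLF versus LF. The reported sizes are at least consistent with that idea, since they differ by 9511 bytes, the same number that appears as the line number in the parser warnings.

```php
<?php
// Hypothetical helper (not from the original post): returns true when two
// strings are identical once CRLF line endings are normalized to LF.
function differs_only_by_line_endings($a, $b)
{
    $normalize = function ($s) {
        return str_replace("\r\n", "\n", $s);
    };
    return $normalize($a) === $normalize($b);
}

// Sizes reported by the two runs above.
$windowsSize = 21893604;
$unixSize    = 21884093;

// One extra byte per CRLF line ending would account for the gap.
echo 'size difference: ', $windowsSize - $unixSize, "\n"; // 9511

// Small in-memory demonstration of the helper.
$crlfCopy = "<DATA>row 1</DATA>\r\n<DATA>row 2</DATA>\r\n</RETS>";
$lfCopy   = "<DATA>row 1</DATA>\n<DATA>row 2</DATA>\n</RETS>";
var_dump(differs_only_by_line_endings($crlfCopy, $lfCopy)); // bool(true)
```

With both copies of the file on one machine, the same helper could be called on the two file_get_contents() results, and substr_count($xml, "\r\n") on the Windows copy should then equal the size difference.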

Some replies to comments and additional info:

  • The XML itself appears to be valid as far as I can tell (and it does work on my system).

  • magic_quotes_runtime is definitely off.

  • The working server has libxml version 2.7.7 while the failing one has 2.7.6. Could that really make the difference? I could not find a libxml changelog, but it seems unlikely.

  • This seems to only happen when the response/file is over a certain size, and the error always occurs at the next-to-last line.

  • I am not running into memory issues; the test script runs instantly.

There are differences in the PHP configurations, which I could post if I knew which ones were relevant. Any idea what the problem could be, or is there anything else I should check?
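
For comparing the two environments, a short script run on both servers could narrow things down. This is a minimal sketch under assumptions: the function name xml_environment_report is invented here, and the choice of settings is a guess at what might matter (the versions and magic_quotes_runtime are the ones mentioned above; memory_limit is included only to rule it out).

```php
<?php
// Hypothetical helper: gathers version and ini values that could plausibly
// affect XML parsing, so the output of both servers can be diffed.
function xml_environment_report()
{
    return array(
        'php_version'          => PHP_VERSION,
        // Version of libxml that PHP was compiled against, e.g. "2.7.6".
        'libxml_version'       => LIBXML_DOTTED_VERSION,
        'memory_limit'         => ini_get('memory_limit'),
        // ini_get() returns false if the setting does not exist at all.
        'magic_quotes_runtime' => ini_get('magic_quotes_runtime'),
    );
}

foreach (xml_environment_report() as $name => $value) {
    echo str_pad($name . ':', 24), var_export($value, true), "\n";
}
```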

Answer

Francis Avila · Feb 21, 2013

The libxml2 changelog contains "608773 add a missing check in xmlGROW (Daniel Veillard)", which seems to be related to input buffering. Note I don't know anything about libxml2 internals, but it seems conceivable that you have tickled a 2.7.6 bug fixed in 2.7.7.

Check if the behavior is any different when you use simplexml_load_file() directly, and try setting libxml parser-related options, e.g.

simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);

Specifically, you might want to try the LIBXML_PARSEHUGE flag.

From http://php.net/manual/en/libxml.constants.php:

The XML_PARSE_HUGE flag relaxes any hardcoded limit from the parser. This affects limits like maximum depth of a document or the entity recursion, as well as limits of the size of text nodes.
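
To compare the options side by side without editing the third-party library, a standalone test along these lines might help. It is a sketch, not a confirmed fix: the helper name try_parse and the small inline sample document are made up for illustration, and whether LIBXML_PARSEHUGE actually avoids the 2.7.6 failure is untested here. It uses libxml_use_internal_errors() so parser errors are collected instead of being emitted as warnings.

```php
<?php
// Hypothetical helper: parses $xml with the given libxml option flags and
// returns array(SimpleXMLElement or false, list of error strings).
function try_parse($xml, $options = 0)
{
    // Collect libxml errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml, 'SimpleXMLElement', $options);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($doc, $messages);
}

// Stand-in document; in practice this would be the saved 22 MB response.
$xml = '<RETS ReplyCode="0"><COUNT Records="1" /><DATA>sample row</DATA></RETS>';

$variants = array(
    'default'           => 0,
    'PARSEHUGE'         => LIBXML_PARSEHUGE,
    'COMPACT|PARSEHUGE' => LIBXML_COMPACT | LIBXML_PARSEHUGE,
);

foreach ($variants as $label => $options) {
    list($doc, $messages) = try_parse($xml, $options);
    echo $label, ': ', ($doc !== false ? 'parsed' : 'failed'), "\n";
    foreach ($messages as $message) {
        echo '  ', $message, "\n";
    }
}
```

If one variant parses the real file on the failing server while the default does not, that points at a parser limit or buffering issue rather than at the data itself.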