Combining cURL and Simple HTML DOM

Youss · May 18, 2013 · Viewed 8.8k times

I have been scraping websites with cURL for a while, and I have also used Simple HTML DOM. In my experience, cURL is much better at fetching pages, but I really like the simplicity of Simple HTML DOM. So I figured, why not combine the two? I tried:

    require_once('simple_html_dom.php');

    $url = 'http://news.yahoo.com/';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page = curl_exec($ch);

    $html = new simple_html_dom();
    $html->load($curl_scraped_page);

    foreach ($html->find('head') as $d) {
        $d->innertext = "<base href='$url'>" . $d->innertext;
    }

    echo $html->save();

I did my best but it doesn't work. What else can I try?

Answer

user1765062 · May 18, 2013

Try changing this:

$html->load($curl_scraped_page);

To this:

$html->load($curl_scraped_page, true, false);

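The problem is that simple_html_dom strips all \r and \n characters by default; passing false as the third argument of load() turns that off. In this case the stripping breaks the page's JavaScript, because Yahoo doesn't end its statements with semicolons and relies on the line breaks to separate them.

You can see the resulting error in the browser console, and viewing the page source shows that simple_html_dom has removed the line breaks.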
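To illustrate why losing the line breaks matters, here is a minimal sketch that does not use the library at all. The str_replace() call is only a stand-in for the default stripping step, under the assumption that each \r and \n is replaced by a space; the JavaScript snippet and the variable names $js and $stripped are made up for the example.

```php
<?php
// Semicolon-free JavaScript of the kind described above: the two
// statements are separated only by a line break.
$js = "var a = 1\nvar b = 2";

// Stand-in for the default line-break stripping (assumption: every
// \r and \n becomes a single space; this is not the library's code).
$stripped = str_replace(array("\r", "\n"), ' ', $js);

// Both statements now sit on one line with no semicolon between them,
// which a JavaScript engine rejects as a syntax error.
echo $stripped, PHP_EOL; // prints: var a = 1 var b = 2
```

With the line break in place, the browser's automatic semicolon insertion separates the two statements; once they are joined on one line, it no longer can, which matches the error shown in the console.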