Using cURL and DOM to scrape data with PHP

Michael · Apr 19, 2013 · Viewed 11k times

Hi, I am using cURL to get data from a website. I need to get multiple items but cannot select them by tag name or id. I have managed to put together some code that gets one item by class name: it runs the matched nodes through one loop, then passes the result through a second loop to get the text from the elements.

I have a few problems here. First, I can see there must be a more convenient way of doing this. Second, I will need to get multiple elements and stack them together, i.e. title, description, tags and a URL link.

# Create a DOM parser object and load HTML
$dom    = new DOMDocument();
$result = $dom->loadHTML($html);

$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");

$tmp_dom = new DOMDocument(); 
foreach ($nodes as $node) 
{
    $tmp_dom->appendChild($tmp_dom->importNode($node,true));
}

$innerHTML = trim($tmp_dom->saveHTML()); 

$buffdom = new DOMDocument();
$buffdom->loadHTML($innerHTML);

# Iterate over all the <a> tags
foreach ($buffdom->getElementsByTagName('a') as $link) 
{
    # Print the link text (nodeValue); the href attribute itself is not shown
    echo $link->nodeValue, "<br />", PHP_EOL;
}

I want to stick with PHP only.

Answer

Floris · Apr 19, 2013

I wonder if your problem is in the line:

$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");

As it stands, this literally looks for nodes whose class attribute contains the text 'classname' - where 'classname' is not a variable, it's the actual name. It looks like you might have copied an example from somewhere - or did you literally name your class that?
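To make the behaviour of that expression concrete, here is a minimal sketch run against made-up markup (the sample HTML and its class names are assumptions for illustration, not taken from the page being scraped). It compares a plain substring match with the common whole-token idiom, which puts spaces around the class name inside the quotes:

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$html = '<div class="question-summary">A</div>'
      . '<div class="question hot">B</div>'
      . '<div class="unanswered">C</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: any class attribute that contains "question" anywhere.
$loose = $xpath->query("//*[contains(@class, 'question')]");

// Whole-token match: the spaces around the name mean that only an element
// carrying the exact class "question" qualifies.
$strict = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' question ')]"
);

echo $loose->length, PHP_EOL;   // 2 ("question-summary" and "question hot")
echo $strict->length, PHP_EOL;  // 1 ("question hot" only)
```

Without the surrounding spaces, as in the line quoted above, the concat/normalize-space wrapper behaves like the plain substring test.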

I imagine that the data you are looking for may not be in such nodes. If you could post a short piece of the actual HTML you are trying to parse, it should be possible to do a better job guiding you to a solution.

As an example, I put together the following complete code. It is based on yours, but adds code to open the stackoverflow.com home page and changes 'classname' to 'question': there seemed to be a lot of classes with "question" in the name, so I figured I should get a good harvest. I was not disappointed.

<?php
// Create a cURL handle
$ch = curl_init();

// Set the URL to fetch
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com");

// Return the transfer as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// $output contains the page HTML, or false on failure
$output = curl_exec($ch);

// Close the cURL handle to free up system resources
curl_close($ch);

if ($output === false) {
    die("cURL request failed");
}

// Parse the HTML; @ suppresses warnings about malformed markup
$dom = new DOMDocument();
@$dom->loadHTML($output);

// Select every element whose class attribute contains the text 'question'
$finder = new DOMXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'question')]");
print_r($nodes); // debug: dump the node list

// Copy the matched nodes into a temporary document
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
    $tmp_dom->appendChild($tmp_dom->importNode($node, true));
}
$innerHTML = trim($tmp_dom->saveHTML());

// Re-parse the extracted HTML
$buffdom = new DOMDocument();
@$buffdom->loadHTML($innerHTML);

// Iterate over all the <a> tags and print their link text
foreach ($buffdom->getElementsByTagName('a') as $link) {
    echo $link->nodeValue, PHP_EOL;
    echo "<br />";
}
?>

This resulted in many, many lines of output. Try it - the page is at http://www.floris.us/SO/scraper.php

(or paste the above code into a page of your own). You were very, very close!
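On the "more convenient way" part of the question: one possible shortcut, sketched here under assumptions, is to skip the temporary document and run a second, relative XPath query against each matched node. The markup and the class name "item" below are hypothetical stand-ins for the real page.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<div class="item"><a href="/a">First</a></div>'
      . '<div class="other"><a href="/x">Skipped</a></div>'
      . '<div class="item"><a href="/b">Second</a></div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$linkTexts = [];
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]";
foreach ($xpath->query($query) as $node) {
    // ".//a" is evaluated relative to $node, so only links inside the
    // matched element are returned.
    foreach ($xpath->query('.//a', $node) as $link) {
        $linkTexts[] = trim($link->nodeValue);
    }
}

print_r($linkTexts); // the texts "First" and "Second"
```

The second argument to DOMXPath::query() is the context node, which is what removes the need for the saveHTML() and loadHTML() round trip.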

NOTE - this doesn't produce all the output you want. To get everything, you need to include other properties of the node rather than just printing the nodeValue. But I figure you can take it from here (again, without actual samples of your HTML it's impossible for anyone else to get much further than this in helping you...)
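As a rough illustration of reading other properties of a node, the sketch below collects a title, URL, description and tags per item into an array. Since the real HTML was never posted, the markup and the class names "item", "title", "desc" and "tag" are all invented; the selectors would need adapting to the actual page.

```php
<?php
// Hypothetical markup: every class name here is an assumption.
$html = <<<HTML
<div class="item">
  <a class="title" href="/q/1">First title</a>
  <p class="desc">First description</p>
  <span class="tag">php</span><span class="tag">curl</span>
</div>
<div class="item">
  <a class="title" href="/q/2">Second title</a>
  <p class="desc">Second description</p>
  <span class="tag">dom</span>
</div>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$items = [];
foreach ($xpath->query("//div[@class='item']") as $node) {
    // item(0) returns null when nothing matched, hence the checks below.
    $title = $xpath->query(".//a[@class='title']", $node)->item(0);
    $desc  = $xpath->query(".//p[@class='desc']", $node)->item(0);

    $tags = [];
    foreach ($xpath->query(".//span[@class='tag']", $node) as $tag) {
        $tags[] = trim($tag->nodeValue);
    }

    $items[] = [
        'title'       => $title ? trim($title->nodeValue) : null,
        // getAttribute() reads the href itself rather than the link text.
        'url'         => $title ? $title->getAttribute('href') : null,
        'description' => $desc ? trim($desc->nodeValue) : null,
        'tags'        => $tags,
    ];
}

print_r($items);
```

Each entry in $items then holds one record with all four fields, which is one way to "stack together" the pieces the question asks for.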