Hi i am using cURL to get data from a website i need to get multiple items but cannot get it by tag name or id. I have managed to put together some code that will get one item using a class name by passing it through a loop i then pass it through another loop to get the text from the element.
I have a few problems here the first is i can see there must be a more convenient way of doing this. The second i will need to get multiple elements and stack together ie title, desciption, tags and a url link.
# Create a DOM parser object and load HTML
$dom = new DOMDocument();
$result = $dom->loadHTML($html);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach ($buffdom->getElementsByTagName('a') as $link)
{
# Show the <a href>
echo $link->nodeValue, "<br />", PHP_EOL;
}
I want to stick with PHP only.
I wonder if your problem is in the line:
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
As it stands, this literally looks for nodes that belong to the class with the name 'classname' - where 'classname' is not a variable, it's the actual name. This looks like you might have copied an example from somewhere - or did you literally name your class that?
I imagine that the data you are looking may not be in such nodes. If you could post a short piece of the actual HTML you are trying to parse, it should be possible to do a better job guiding you to a solution.
As an example, I just made the following complete code (based on yours, but adding code to open the stackoverflow.com
home page, and changing 'classname'
to 'question'
, since there seemed to be a lot of classes with question
in the name, so I figured I should get a good harvest. I was not disappointed.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
//print_r($output);
$dom = new DOMDocument();
@$dom->loadHTML($output);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'question')]");
print_r($nodes);
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML.=trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
@$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach($buffdom->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->nodeValue, PHP_EOL;
echo "<br />";
}
?>
Resulted in many many lines of output. Try it - the page is at http://www.floris.us/SO/scraper.php
(or paste the above code into a page of your own). You were very, very close!
NOTE - this doesn't produce all the output you want - you need to include other properties of the node, not just print out the nodeValue
, to get everything. But I figure you can take it from here (again, without actual samples of your HTML it's impossible for anyone else to get much further than this in helping you...)