From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc.
tags into a new variable?
I would like to capture all of the text from these elements and store it in a new variable as comma-delimited values.
Is this possible using preg_match_all()?
First you need to clean up the HTML ($html_str in the example) with tidy:
$tidy_config = array(
    "indent" => true,
    "output-xml" => true,
    "output-xhtml" => false,
    "drop-empty-paras" => false,
    "hide-comments" => true,
    "numeric-entities" => true,
    "doctype" => "omit",
    "char-encoding" => "utf8",
    "repeated-attributes" => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
$doc = new DOMDocument();
$doc->loadXML($xml_str); // loadXML() is an instance method; calling it statically fails on PHP 8
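loadXML() returns false when the input is not well-formed, so it may be worth checking the result before querying the document. A minimal sketch, assuming a hypothetical helper named load_xml_or_fail() (not part of the original answer) and a short stand-in string in place of the tidied markup:

```php
<?php
// Hypothetical helper: load an XML string into a DOMDocument and throw
// if libxml reports that the input could not be parsed.
function load_xml_or_fail(string $xml_str): DOMDocument
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml_str);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    if (!$ok) {
        $message = $errors ? trim($errors[0]->message) : "unknown parse error";
        throw new RuntimeException("Could not parse XML: " . $message);
    }
    return $doc;
}

// Stand-in for the output of tidy_repair_string().
$doc = load_xml_or_fail("<html><body><h1>Title</h1></body></html>");
```

If tidy produced well-formed output, the helper just returns the document; otherwise the exception carries the first libxml error message.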
And finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}
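The question asks for every heading level as comma-delimited values, not just h1. One way to extend the loop is sketched below; the sample markup and the variable names $headings and $csv are illustrative assumptions, not part of the original answer:

```php
<?php
// Stand-in for the tidied markup ($xml_str in the answer).
$xml_str = "<html><body><h1>Intro</h1><p>Text</p><h2>Details</h2><h1>Summary</h1></body></html>";

$doc = new DOMDocument();
$doc->loadXML($xml_str);

// Collect the text of every heading, one tag name at a time.
$headings = array();
foreach (array("h1", "h2", "h3", "h4", "h5", "h6") as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->nodeValue);
    }
}

// Join into a single comma-delimited string.
$csv = implode(",", $headings);
print($csv . "\n"); // prints "Intro,Summary,Details"
```

Note that the values come out grouped by tag name rather than in document order, and a heading that itself contains a comma would make the result ambiguous; keeping the $headings array around avoids that problem.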
Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
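With XPath, all heading levels can be selected in a single query, which also keeps them in document order. A sketch under the same assumptions as before (illustrative sample markup, no XML namespace on the elements, and made-up variable names $headings and $csv):

```php
<?php
// Stand-in for the tidied markup ($xml_str in the answer).
$xml_str = "<html><body><h1>Intro</h1><p>Text</p><h2>Details</h2><h1>Summary</h1></body></html>";

$doc = new DOMDocument();
$doc->loadXML($xml_str);

$xpath = new DOMXPath($doc);

// One location step that matches any element named h1 through h6.
$list = $xpath->query(
    "//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]"
);

$headings = array();
foreach ($list as $node) {
    $headings[] = trim($node->nodeValue);
}

$csv = implode(",", $headings);
print($csv . "\n"); // prints "Intro,Details,Summary"
```

If the document does declare a default namespace, the unprefixed names in the expression will not match, and a prefix would have to be registered with registerNamespace() first.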