Hello everyone,
From a string that contains a lot of HTML, how can I extract all the text from the <h1>, <h2>, etc.
tags into a new variable?
Possibly using preg_match_all and sending the matches to a single comma-delimited variable.
Thanks guys.
When the question is "How do I extract stuff from HTML", the answer is NEVER to use regular expressions. Instead, see the discussion on Robust, Mature HTML Parser for PHP.
It is recommended not to use regex for this job; use something like the Simple HTML DOM parser instead.
You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:
if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[1] contains the text of every h1-h6 that has no attributes and no nested tags
}
If you actually want to use regular expressions, I think that:
preg_match_all('/<h([1-6])>(.*?)<\/h\1>/is', $subject, $matches);
// $matches[2] contains the heading contents
should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
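For the comma-delimited variable the question asks about, a minimal sketch along these lines might work. The sample `$html` string is made up for illustration, and the pattern additionally tolerates attributes on the opening tag, which is an assumption beyond the snippets above. It is still a regex, so it will not cope with arbitrary markup.

```php
<?php
// Hypothetical sample input standing in for the asker's HTML string.
$html = '<h1>Title</h1><p>Body</p><h2 class="sub">Section <em>one</em></h2>';

// Match h1-h6 with optional attributes; \1 forces the closing tag to use the same level.
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/is', $html, $matches);

// $matches[2] holds the inner HTML of each heading; drop nested tags and trim whitespace.
$headings = array_map(
    function ($inner) {
        return trim(strip_tags($inner));
    },
    $matches[2]
);

// Join everything into one comma-delimited string.
$csv = implode(', ', $headings);
echo $csv, "\n"; // Title, Section one
```

Note that a heading containing a comma would make the joined string ambiguous, so keeping the `$headings` array around is probably safer than relying on the string alone.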
Please also consider the native DOMDocument PHP class.
You can use $domdoc->getElementsByTagName('h1') to get your headings.
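As a rough sketch of that suggestion, assuming the DOM extension is available and the markup is good enough for `loadHTML()` to parse directly (the sample `$html` is invented for illustration):

```php
<?php
// Hypothetical sample input.
$html = '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>';

$domdoc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them on imperfect markup.
libxml_use_internal_errors(true);
$domdoc->loadHTML($html);
libxml_clear_errors();

// Gather the text of every <h1> element.
$titles = [];
foreach ($domdoc->getElementsByTagName('h1') as $node) {
    $titles[] = trim($node->textContent);
}

echo implode(', ', $titles), "\n"; // First, Second
```

`getElementsByTagName()` takes one tag name at a time, so covering h2 through h6 this way would mean repeating the loop per level.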
First you need to clean up the HTML ($html_str in the example) with tidy:
$tidy_config = array(
    "indent" => true,
    "output-xml" => true,
    "output-xhtml" => false,
    "drop-empty-paras" => false,
    "hide-comments" => true,
    "numeric-entities" => true,
    "doctype" => "omit",
    "char-encoding" => "utf8",
    "repeated-attributes" => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
// loadXML() is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadXML($xml_str);
And finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}
Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
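To pick up every heading level in one pass and in document order, an XPath query along these lines could be used. This is only a sketch: the inline XML string is a made-up stand-in for the tidy output, and it assumes the elements carry no XML namespace.

```php
<?php
// Hypothetical well-formed input standing in for the tidied $xml_str.
$doc = new DOMDocument();
$doc->loadXML('<html><body><h1>A</h1><div><h3>B</h3></div><h2>C</h2></body></html>');

$xpath = new DOMXPath($doc);
// A single location path selecting any element named h1..h6, wherever it sits.
$list = $xpath->query(
    '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]'
);

// Collect the text of each heading and join it into one comma-delimited string.
$headings = [];
foreach ($list as $node) {
    $headings[] = trim($node->nodeValue);
}
$csv = implode(', ', $headings);
echo $csv, "\n"; // A, B, C
```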