Hello everyone,

From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?

Possibly using preg_match_all and sending the matches to a single comma-delimited variable.

Thanks guys.

+2  A: 

When the question is "How do I extract stuff from HTML", the answer is NEVER to use regular expressions. Instead, see the discussion on Robust, Mature HTML Parser for PHP.

Tony Miller
Any chance of an example? I need to get all the heading tags inside the div with class 'article'. I'm always confused about the DOM.
bluedaniel
+2  A: 

It is recommended not to use regex for this job; use something like the SimpleHTMLDOM parser instead.

codaddict
A: 

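You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:

// The closing-tag slash must be escaped because "/" is the pattern delimiter.
if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[0] holds the full h1-h6 tags, $matches[1] only their text
}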
Emil H
A: 

If you actually want to use regular expressions, I think that:

preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $string, $matches);

should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.

Scott Saunders
+3  A: 

Please also consider PHP's native DOMDocument class.

You can use $domdoc->getElementsByTagName('h1') to get your headings.
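
As a minimal sketch of that idea, covering all six heading levels and producing the comma-delimited string the question asks for: the helper name headings_to_csv is hypothetical, and the sketch assumes the markup can be fed straight to DOMDocument::loadHTML without any prior clean-up.

```php
<?php
// Hypothetical helper: collects the text of every h1-h6 element in $html
// and joins the results into one comma-delimited string.
function headings_to_csv($html)
{
    $doc = new DOMDocument();

    // Real-world HTML is often sloppy; collect libxml warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = array();
    foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $texts[] = trim($node->textContent);
        }
    }

    return implode(', ', $texts);
}

echo headings_to_csv('<h1>Title</h1><p>Body</p><h2>Section</h2>');
```

Note that this groups the results by heading level (all h1 first, then all h2, and so on) rather than by their order in the document.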

Horia Dragomir
+3  A: 

First you need to clean up the HTML ($html_str in the example) with tidy:

$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);

$xml_str = tidy_repair_string($html_str, $tidy_config);

Then you can load the XML ($xml_str) into a DOMDocument:

// loadXML() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadXML($xml_str);

And finally you can use Horia Dragomir's method:

$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}

Or you could also use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):

$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
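
For the case raised in the comments (all heading tags inside the 'article' div), an XPath query might look roughly like the sketch below. The function name article_headings is hypothetical, and to keep the example self-contained it loads the markup with DOMDocument::loadHTML instead of the tidy step above; that substitution is an assumption, not part of the original method.

```php
<?php
// Hypothetical helper: returns the text of all h1-h6 elements nested inside
// any <div> whose class attribute contains the word "article".
function article_headings($html)
{
    $doc = new DOMDocument();

    // Keep libxml warnings about imperfect markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // The concat/normalize-space idiom matches "article" as a whole word,
    // so class="article wide" matches but class="articles" does not.
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' article ')]"
           . "//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]";

    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

$html = '<div class="article"><h1>One</h1><p>x</p><h3>Two</h3></div><h2>Outside</h2>';
echo implode(', ', article_headings($html));
```

Unlike looping over tag names one level at a time, the XPath result keeps the headings in document order.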
CodeAddict
Something is broken in this script; try it with simple HTML like '<html><body><h1>Test</h1><br></body></html>'.
bluedaniel
What error do you get? If I try the example you give, it works as expected and prints "Test".
CodeAddict
As in all of that code above, with the only change being: $xml_str = tidy_repair_string('<html><body><h1>Test</h1><br></body></html>', $tidy_config); ? It breaks my script; there is an error in there that crashes it, like a typo or something.
bluedaniel
I guess that the 'tidy' module is not enabled in your php.ini. If you're using XAMPP (or some other AMP), uncomment the line "extension=php_tidy.dll". If you're using Ubuntu, use "apt-get install php5-tidy" to install and enable it.
CodeAddict
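
One way to check that guess before changing any configuration is to ask PHP whether the extension is loaded. This is only a small diagnostic sketch; the messages are illustrative.

```php
<?php
// If tidy is not loaded, tidy_repair_string() is undefined and calling it
// stops the script with a fatal error, which can look like a typo.
if (extension_loaded('tidy')) {
    echo "tidy is enabled\n";
} else {
    echo "tidy is NOT enabled; tidy_repair_string() would be undefined\n";
}
```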