



Hello everyone,

From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?

Possibly using preg_match_all and sending the matches to a single comma-delimited variable.

Thanks guys.

+2  A: 

When the question is "How do I extract stuff from HTML", the answer is NEVER to use regular expressions. Instead, see the discussion on Robust, Mature HTML Parser for PHP.

Tony Miller
Any chance of an example? I need to get all the heading tags inside the div with class 'article'. I'm always confused about the DOM.
+2  A: 

It is recommended not to use regex for this job; use something like the SimpleHTMLDOM parser instead.


You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:

if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $str, $matches)) {
    // $matches[1] contains the text of every h1-h6 heading
}
Emil H

If you actually want to use regular expressions, I think that:

preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/i', $subject, $matches);

should work as long as your heading tags do not contain other tags. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
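For the comma-delimited variable asked about in the question, here is a minimal sketch of the regex route. The sample $html string and the variable names are made up for illustration, and it assumes simple, well-formed headings; it will still trip over unusual markup.

```php
<?php
// Hypothetical sample input; in practice $html would hold your own markup.
$html = '<h1>Title</h1><p>Intro text</p><h2>Section one</h2><h3 class="sub">Details</h3>';

// Match h1-h6, allow attributes on the opening tag, and use a backreference (\1)
// so the closing tag must have the same level as the opening tag.
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/is', $html, $matches);

// $matches[2] holds the inner HTML of each heading; strip nested tags and trim.
$headings = array_map('trim', array_map('strip_tags', $matches[2]));

// Join everything into a single comma-delimited string.
$csv = implode(',', $headings);

echo $csv, "\n"; // prints: Title,Section one,Details
```

Keeping $headings as an array and only imploding at the end is probably the safer choice, since a heading that itself contains a comma would otherwise be indistinguishable from two headings.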

Scott Saunders
+3  A: 

Please also consider PHP's native DOMDocument class.

You can use $domdoc->getElementsByTagName('h1') to get your headings.
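A rough sketch of how that could look for all six heading levels. The sample markup and variable names are assumptions for illustration; loadHTML() is used here because it tolerates imperfect HTML.

```php
<?php
// Hypothetical sample input.
$html = '<html><body><h1>Main title</h1><p>Text</p><h2>Subtitle</h2></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$headings = array();
foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
    // getElementsByTagName() returns a DOMNodeList that can be iterated directly.
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->nodeValue);
    }
}

echo implode(',', $headings), "\n"; // prints: Main title,Subtitle
```

Note that this groups the results by heading level (all h1 first, then all h2, and so on) rather than by position in the document.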

Horia Dragomir
+3  A: 

First you need to clean up the HTML ($html_str in the example) with tidy:

$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);

$xml_str = tidy_repair_string($html_str, $tidy_config);

Then you can load the XML ($xml_str) into a DOMDocument:

$doc = new DOMDocument();
$doc->loadXML($xml_str);

And finally you can use Horia Dragomir's method:

$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}

Or you could also use XPath for more complex queries on the DOMDocument:

$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
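As an example of a more complex query, here is a sketch that collects only the headings inside the div with class 'article' mentioned in the comments above. The sample markup is made up, and it loads the HTML directly with loadHTML() instead of going through tidy, which is an assumption that your markup is clean enough for that.

```php
<?php
// Hypothetical sample input: only the headings inside div.article are wanted.
$html = '<html><body><h1>Site name</h1>'
      . '<div class="article"><h1>Post title</h1><p>Body</p><h2>Part one</h2></div>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select h1-h6 descendants of any div whose class list contains "article".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " article ")]'
       . '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';

$headings = array();
foreach ($xpath->query($query) as $node) {
    $headings[] = trim($node->nodeValue);
}

echo implode(',', $headings), "\n"; // prints: Post title,Part one
```

The results come back in document order, and the "Site name" heading outside the div is skipped.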
Something is broken in this script. Try it with simple HTML like '<html><body><h1>Test</h1><br></body></html>'.
What error do you get? If I try the example you give, it works as expected and prints "Test".
As in, all of the code above, with the only change being: $xml_str = tidy_repair_string('<html><body><h1>Test</h1><br></body></html>', $tidy_config); It breaks my script. There is an error in there that is crashing it, like a typo or something.
I guess that the 'tidy' module is not enabled in your php.ini. If you're using XAMPP (or some other AMP), uncomment the line "extension=php_tidy.dll". If you're using Ubuntu, use "apt-get install php5-tidy" to install and enable it.