ansaurus

Question

How to extract heading tags from a string in PHP

Answer 1

+2 A:

When the question is "How do I extract stuff from HTML", the answer is NEVER to use regular expressions. Instead, see the discussion on Robust, Mature HTML Parser for PHP.

Tony Miller 2010-01-14 14:34:17

Any chance of an example? I need to get all the heading tags inside the div with class 'article'. I'm always confused by the DOM.

bluedaniel 2010-01-14 14:44:36

Answer 2

+2 A:

It is recommended not to use regex for this job and to use something like the Simple HTML DOM parser instead.

codaddict 2010-01-14 14:34:40

Answer 3

A:

You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:

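// '#' is the delimiter, so the '/' in the closing tag needs no escaping
if (preg_match_all('#<h[1-6]>([^<]*)</h[1-6]>#i', $str, $matches)) {
    // $matches[0] holds the full h1-h6 elements, $matches[1] their text
}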

Emil H 2010-01-14 14:37:42

Answer 4

A:

If you actually want to use regular expressions, I think that:

// '#' is the delimiter so '/' needs no escaping; $matches[1] holds the heading contents
preg_match_all('#<h[1-6]>(.*?)</h[1-6]>#is', $subject, $matches);

should work as long as your header tags are not nested. As others have said, if you're not in control of the HTML, regular expressions are not a great way to do this.
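To illustrate the regex approach on a concrete string, here is a minimal sketch. The sample $html and the extended pattern (which also tolerates attributes on the opening tag and requires the closing tag to match the opening level) are my own assumptions, not part of the answer above, and it still breaks on nested or malformed headings:

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<h1 class="title">Intro</h1><p>Text</p><h2>Details <em>here</em></h2>';

// <h([1-6]) opens a heading and captures its level,
// \b[^>]* skips any attributes, (.*?) lazily captures the content,
// </h\1> requires a closing tag of the same level.
preg_match_all('#<h([1-6])\b[^>]*>(.*?)</h\1>#is', $html, $matches, PREG_SET_ORDER);

$headings = array();
foreach ($matches as $m) {
    // strip_tags() removes inline markup such as <em> from the captured content
    $headings[] = array('level' => (int) $m[1], 'text' => strip_tags($m[2]));
}

print_r($headings);
```

With PREG_SET_ORDER each entry of $matches is one heading, which keeps the level and the text together.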

Scott Saunders 2010-01-14 14:38:24

Answer 5

+3 A:

Please also consider PHP's native DOMDocument class.

You can use $domdoc->getElementsByTagName('h1') to get your headings.
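Since an example was requested, here is a minimal sketch of that approach for all six heading levels. The sample $html and the variable names are assumptions for illustration; loadHTML() is used so the input does not have to be well-formed XML:

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<html><body><h1>Title</h1><p>Text</p><h2>Section</h2><h1>Another</h1></body></html>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$found = array();
foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        // Group the heading texts by tag name
        $found[$tag][] = trim($node->textContent);
    }
}

print_r($found);
```

Note that this groups the results by tag name rather than keeping document order; an XPath query is one way to get all levels in the order they appear.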

Horia Dragomir 2010-01-14 14:44:19

Answer 6

+3 A:

First you need to clean up the HTML ($html_str in the example) with tidy:

$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);

$xml_str = tidy_repair_string($html_str, $tidy_config);

Then you can load the XML ($xml_str) into a DOMDocument:

$doc = new DOMDocument();   // loadXML() is an instance method, not a static one
$doc->loadXML($xml_str);

And finally you can use Horia Dragomir's method:

$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}

Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):

$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
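As a sketch of such a more complex query, the following collects every h1-h6 inside a div whose class list contains "article", which is the case asked about in the comments. The sample $html, the XPath expression and the variable names are my assumptions; loadHTML() is used here in place of the tidy step so the example runs without the tidy extension:

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<div class="article"><h1>Main</h1><div><h3>Nested</h3></div></div>'
      . '<div class="sidebar"><h2>Ignore me</h2></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings about imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First part: any div whose space-separated class list contains "article".
// Second part: any descendant element that is one of h1..h6.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " article ")]'
       . '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';

$titles = array();
foreach ($xpath->query($query) as $node) {
    $titles[] = $node->nodeName . ': ' . trim($node->textContent);
}

print_r($titles);
```

Unlike looping over getElementsByTagName() per tag, a single XPath query returns the headings in document order.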

CodeAddict 2010-01-14 14:53:21

Something is broken in this script; try it with simple HTML like '<html><body><h1>Test</h1><br></body></html>'

bluedaniel 2010-01-14 15:04:47

What error do you get? If I try the example you give it works as expected and prints "Test"

CodeAddict 2010-01-14 15:17:10

As in, all of the code above with the only change being: $xml_str = tidy_repair_string('<html><body><h1>Test</h1><br></body></html>', $tidy_config); ? It breaks my script. Something in there is crashing it, like a typo or similar.

bluedaniel 2010-01-14 15:25:23

I guess that the 'tidy' module is not enabled in your php.ini. If you're using XAMPP (or some other AMP stack), uncomment the line "extension=php_tidy.dll". If you're using Ubuntu, run "apt-get install php5-tidy" to install and enable it.

CodeAddict 2010-01-14 15:28:33
