Hi, I need to extract all the tags from an HTML file, in such a way that I end up with either an array containing key=value pairs for each tag's attributes, or at least the raw text that makes up each tag.

I don't quite get along with regex, much less in PHP, so I would really appreciate some help with this.

PS: Some of the tags may span several lines and be indented with tabs and spaces on the subsequent lines.

Thanks.

+1  A: 

You can use the DOM functions to parse an XML/XHTML document into a DOM tree. From there it's not too hard to traverse the nodes you want and extract the data you're looking for.

Some people prefer the SimpleXML functions which might work equally well for you. I personally have issues with SimpleXML and prefer the more verbose, but more powerful DOM functions.
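
A minimal sketch of the DOM approach, in case it helps. It assumes the markup is loaded with `DOMDocument::loadHTML()`, which is lenient with malformed HTML, rather than with the XML loader. The helper name `extractTags` and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: returns one entry per element, holding the tag name
// and an attribute-name => value map.
function extractTags(string $html): array
{
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is often invalid.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    // '*' matches every element, in document order.
    foreach ($doc->getElementsByTagName('*') as $element) {
        $attributes = [];
        foreach ($element->attributes as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
        $tags[] = ['tag' => $element->nodeName, 'attributes' => $attributes];
    }

    return $tags;
}

// Sample input (an assumption): a tag that spans two lines, indented with a tab and spaces.
$sample = "<p class=\"intro\"\n\t  id=\"first\">Hello <a href=\"/x\">link</a></p>";
$tags = extractTags($sample);
```

Note that the parser wraps a fragment like this in `html` and `body` elements, so those show up in the result as well.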

nickf
+1  A: 

Yes, it's easy. Use PHP's DOM functions and find the nodes with XPath. That should be the painless way.
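
Roughly like this, as a sketch only. The sample markup and the `//a[@href]` expression are assumptions picked for illustration; swap in whatever nodes you are after.

```php
<?php
// Sample markup (an assumption): three links, one of them without an href.
$html = '<ul><li><a href="/a" title="A">a</a></li>'
      . '<li><a href="/b">b</a></li>'
      . '<li><a name="x">no href</a></li></ul>';

// Keep libxml warnings about sloppy markup out of the output.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

$hrefs = [];
$raw = [];
// Select every <a> element that carries an href attribute.
foreach ($xpath->query('//a[@href]') as $link) {
    $hrefs[] = $link->getAttribute('href');
    // The element as re-serialised by the DOM, not the original source text.
    $raw[] = $doc->saveHTML($link);
}
```

`$hrefs` then holds the attribute values, and `$raw` holds each matched element serialised back to markup.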

Bernd Ott
A: 

Another option is the simplehtmldom library.

Amber
A: 

I don't think that's a good idea. The HTML is not really valid, and it may or may not be XHTML (and if it is, it won't be valid), so if I feed it plain HTML it won't load via the XML DOM or SimpleXML. Any regex ideas?
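
For reference, a rough regex sketch of the kind asked about here. It is not a full HTML tokenizer: the patterns and the helper name `extractTagsWithRegex` are assumptions for illustration, it only looks at opening tags, and a literal `>` inside an attribute value will cut the match short.

```php
<?php
// Hypothetical helper: finds opening tags and splits out their attributes.
function extractTagsWithRegex(string $html): array
{
    // '<', a tag name, then everything up to the next '>'.
    // [^>]* also matches newlines, so tags spanning several lines are covered.
    preg_match_all('/<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>/', $html, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        // name="value", name='value' or name=value.
        // (?|...) is a branch-reset group, so the value always lands in group 2.
        preg_match_all(
            '/([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/',
            $m[2],
            $attrMatches,
            PREG_SET_ORDER
        );

        $attributes = [];
        foreach ($attrMatches as $a) {
            $attributes[$a[1]] = $a[2];
        }

        $tags[] = [
            'raw'        => $m[0],             // the tag exactly as written in the source
            'tag'        => strtolower($m[1]),
            'attributes' => $attributes,
        ];
    }

    return $tags;
}

// Sample input (an assumption): a tag split over two lines with a tab indent.
$sample = "<div class=\"box\"\n\tid='main' data-x=1>\n<img src=\"a.png\" alt=\"\">\n</div>";
$regexTags = extractTagsWithRegex($sample);
```

Each entry carries the raw tag text plus a key => value array of its attributes. Attributes without a value (such as a bare `disabled`) are skipped by this pattern.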

A: 

simplehtmldom did the trick, even though the HTML is so badly written it makes me feel like killing someone.

Thanks.