I need a regex that will find an opening div tag, a closing div tag, or both in an HTML web page. Thanks :)
A:
You could start with:
</?div>
This won't correctly handle:
- whitespace
- attributes on the div
- self-closing div tags
- upper case tags
- tags inside HTML comments that should be ignored
- etc...
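A somewhat more tolerant pattern can cover a few of the cases listed above. The following is a minimal sketch in Python, assuming the standard `re` module; the names `DIV_TAG` and `find_div_tags` are made up for illustration. It accepts attributes, trailing whitespace, self-closing syntax, and upper case, but it still matches tags inside comments and breaks on attribute values that contain `>`.

```python
import re

# Hypothetical helper names; a sketch, not a complete HTML tokenizer.
# (/?)          optional slash for a closing tag
# div(?=[\s/>]) the name "div" followed by whitespace, "/" or ">",
#               so "<divider>" and "<div-x>" are not matched
# [^>]*         attributes or whitespace up to the next ">"
DIV_TAG = re.compile(r"<(/?)div(?=[\s/>])[^>]*>", re.IGNORECASE)


def find_div_tags(html):
    """Return (offset, matched_text) for every div-like tag in html."""
    return [(m.start(), m.group(0)) for m in DIV_TAG.finditer(html)]


if __name__ == "__main__":
    page = '<DIV class="a">x</div ><div/><divider><!-- <div> -->'
    for offset, tag in find_div_tags(page):
        print(offset, tag)
```

Note that the last match in the demo comes from inside the HTML comment, which is exactly the kind of false positive a regex cannot easily rule out.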
To handle HTML correctly you're better off using an HTML parser rather than regular expressions.
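As a rough illustration of the parser route, here is a sketch that uses Python's standard-library `html.parser` module to report every opening and closing div tag. The class name `DivTagFinder` and the function `find_divs` are hypothetical names chosen for this example, and Python is only one possible choice since the question does not name a language.

```python
from html.parser import HTMLParser


class DivTagFinder(HTMLParser):
    """Collects every opening and closing div tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (kind, (line, column)) tuples

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case, so "<DIV>" arrives as "div".
        if tag == "div":
            self.found.append(("open", self.getpos()))

    def handle_endtag(self, tag):
        if tag == "div":
            self.found.append(("close", self.getpos()))


def find_divs(html):
    """Return the div tags found in html, in document order."""
    finder = DivTagFinder()
    finder.feed(html)
    finder.close()
    return finder.found


if __name__ == "__main__":
    page = '<DIV id="x">a</div><!-- <div> -->'
    for kind, position in find_divs(page):
        print(kind, position)
```

Because comments are routed to a separate handler, the `<div>` inside the comment is not reported, and attributes or mixed case need no special treatment.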
Mark Byers
2010-08-05 22:29:52
I wouldn't describe that as "safe".
Mark Byers
2010-08-05 23:39:52
A:
HTML, XHTML, and XML cannot be parsed using regular expressions. There are parsers designed for this type of thing. If you specify the language(s) you are using, I'm sure someone can suggest the right tool(s) for the job, but I know for a fact that regular expressions will not be on that list.
Thomas Owens
2010-08-05 22:34:42
He/she said he/she wants to find the tags, not necessarily parse the contents.
NullUserException
2010-08-05 23:30:09
It doesn't matter what you want to do - most parsers that I've seen allow you to do things like count tags too. But regex is never the right answer when dealing with HTML.
Thomas Owens
2010-08-06 00:58:32
A:
If you can use XPath, the expression would be //div
Look into using an XML parser that supports XPath instead of regex. If you MUST use regex, go with coding_hero's answer.
Just for show, in PHP:
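// $htmldoc is the path or URL of a well-formed XHTML document
$xhtml = simplexml_load_file($htmldoc);
// xpath() returns an array of SimpleXMLElement objects, one per div.
// If the document declares the XHTML namespace, register a prefix with
// registerXPathNamespace() and query that prefix, or '//div' matches nothing.
$divs = $xhtml->xpath('//div');
// asXML() belongs to each element, not to the array, so serialize them one by one
return implode("\n", array_map(function ($div) { return $div->asXML(); }, $divs));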
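For comparison, the same XPath idea can be sketched with Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset. This assumes the input is well-formed XHTML that declares the usual XHTML namespace; the function name `find_xhtml_divs` and the prefix `x` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI that XHTML documents normally declare on the root element.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def find_xhtml_divs(xhtml_text):
    """Return all descendant div elements of a well-formed XHTML string."""
    root = ET.fromstring(xhtml_text)
    # The prefix "x" is arbitrary; it only has to map to the XHTML namespace.
    return root.findall(".//x:div", {"x": XHTML_NS})


if __name__ == "__main__":
    doc = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<div id="a"><div id="b"/></div><p/>'
        "</body></html>"
    )
    for div in find_xhtml_divs(doc):
        # Serialize each matching element together with its children.
        print(ET.tostring(div, encoding="unicode"))
```

An unprefixed `.//div` finds nothing in such a document, because the elements live in the XHTML namespace. Input that is not well-formed XML makes `ET.fromstring` raise `ParseError`, which is why this route only suits XHTML.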
Tim
2010-08-05 23:02:07
I believe XPath requires XML content. HTML does not conform to all of the rules of XML.
Thomas Owens
2010-08-06 00:58:54
Agreed, but that's why I prefaced this with "if it must be regex, use coding_hero's". I also specified that it's based on an xhtml document. SimpleXML is also fully compatible with DOM in PHP.
Tim
2010-08-06 04:05:18