About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.
A RegExp-agnostic colleague notified me this discussion, which is not certainly the first on the web about this old and hot topic.
After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.
If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.
The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)
Here's the magic pattern:
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";
Just try it.
It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note on the PHP manual I wrote on january:
http://php.net/manual/en/regexp.reference.recursive.php
(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).
Now, we could speak about the limits of this method from a more informed point of view:
- according to the specific implementation of the RegExp engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used
- although corrupted (x)HTML does not drive into severe errors, it is not sanitized.
Anyhow it is only a RegExp pattern, but it discloses the possibility to develop of a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performances are really great, both in execution times or in memory usage (nothing to do with other template engines which use the same syntax).