views:

4012

answers:

9

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />
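A minimal Python sketch of this failure, assuming the standard-library html.parser module. The pattern and the TagCollector name are illustrative, not from the question:

```python
import re
from html.parser import HTMLParser

html = '<img src="imgtag.gif" alt="<img>" />'

# Naive assumption: every "<name" starts a tag.
naive_tags = re.findall(r"<(\w+)", html)


class TagCollector(HTMLParser):
    """Records each start tag the parser reports, with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(html)
collector.close()

print(naive_tags)      # the regex sees a second "tag" inside the alt text
print(collector.tags)  # the parser sees one img tag with two attributes
```

The parser knows it is inside a quoted attribute value when it meets the second <, which is exactly the piece of state the pattern lacks.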

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span>
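To see the mismatch, here is a small Python sketch. The lazy pattern is one a beginner might plausibly write, and SpanDepth is a hypothetical helper built on the standard-library html.parser:

```python
import re
from html.parser import HTMLParser

html = '<span id="outer"><span id="inner">foo</span></span>'

# Lazy match: pairs the outer opening tag with the FIRST closing tag.
lazy = re.search(r"<span[^>]*>(.*?)</span>", html).group(1)


class SpanDepth(HTMLParser):
    """Tracks how deeply each span is nested by counting opens and closes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []  # (id attribute, nesting depth) per span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1
            self.seen.append((dict(attrs).get("id"), self.depth))

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth -= 1


tracker = SpanDepth()
tracker.feed(html)
tracker.close()

print(repr(lazy))    # content captured for the "outer" span by the regex
print(tracker.seen)  # nesting as the parser sees it
```

The regex hands back an unclosed inner span as the content of the outer one. The counter in SpanDepth is the part a traditional regex has no way to express.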

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>
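A hedged Python sketch of the difference: the PHONE pattern and the TextOnly class are assumptions made for illustration. The idea is to let a parser strip the markup first and only then apply a regex to the text a reader would actually see:

```python
import re
from html.parser import HTMLParser

html = (
    '<span class="phonenum">(<span class="area code">703</span>)\n'
    '<span class="prefix">348</span>-<span class="linenum">3020</span></span>'
)

# Illustrative pattern for "(NNN) NNN-NNNN"; real phone formats vary widely.
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")


class TextOnly(HTMLParser):
    """Collects only the character data, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


raw_hits = PHONE.findall(html)  # regex applied to the markup itself

extractor = TextOnly()
extractor.feed(html)
extractor.close()
text = "".join(extractor.chunks)
text_hits = PHONE.findall(text)  # regex applied to the visible text

print(raw_hits)
print(text_hits)
```

Here the regex is still useful, but only after a parser has done the part regexes are bad at.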

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>
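A short Python sketch of how the commented-out fragment derails a link-extracting pattern. The pattern and the LinkFinder class are illustrative assumptions; html.parser is the standard-library module:

```python
import re
from html.parser import HTMLParser

html = '''<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>'''

# Naive link extraction that knows nothing about comments.
naive = re.findall(r'<a href="([^"]*)"', html)


class LinkFinder(HTMLParser):
    """Collects href values from real <a> tags; comments are skipped."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


finder = LinkFinder()
finder.feed(html)
finder.close()

print(naive)         # the half-written tag in the comment swallows "bar"
print(finder.links)  # only the two real links
```

The broken href inside the comment pairs up with the opening quote of the next real link, so the regex loses "bar" entirely and returns a chunk of comment instead.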

What other gotchas are you aware of?

+6  A: 

It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

Anton Gogolev
+67  A: 

XML is not a regular language.

edit: also, your language already has an XML parser, why don't you use that instead of inventing your own?

Brian Campbell
It might be noted here that modern regexes can do more than regular languages, but only so much. You will need a context-free grammar to parse XML, HTML, etc.
Daren Thomas
Yes, this is true. I was trying to go for conciseness in my answer; I wanted to point out that regular expressions are a fundamentally wrong tool to try to parse XML. Really, there are XML parsers available for almost every language and platform; why not just use those?
Brian Campbell
Yes, but saying that regexes fundamentally can't parse XML and HTML doesn't persuade people; I have tried it. That is why I am collecting examples of what causes the problems. Hopefully by having a list of what can go wrong people will realize "oh, that is what they meant by not a regular language".
Chas. Owens
@Chas. Owens Edited my reply to include another reason, then, which is that it's not a good idea to reinvent the wheel. I think if people aren't convinced by these two reasons, they aren't going to be convinced by much else.
Brian Campbell
+1 for taking the time to enumerate so many different parsers :)
JaredPar
They don't use an existing parser because they already know regex and think the problem is simple. The purpose of this question is to gather evidence that it is not simple.
Chas. Owens
@JaredPar Thanks, I was hoping people would appreciate that :)
Brian Campbell
@Chas. Owens Sure, some concrete examples are good, and I think the ones you included in the question cover plenty of cases. If after your examples, the theoretical argument, and the practical argument that parsers already exist, they aren't convinced, then I think there is no convincing them.
Brian Campbell
-0 Doesn't include my favorite language.
Andrew Grimm
@Andrew Really? Have you checked the link on the word "an"? Or do you have a favorite language that doesn't appear on your user info page?
Brian Campbell
@Brian: There isn't much activity on the `lolcode` tag! (Seriously: my mistake, though I think nokogiri is popular as well)
Andrew Grimm
@Andrew Ah, yes. I didn't try to pick the best or most popular parser for each language; for most of these, I picked the first plausible Google result for "_language_ xml parser".
Brian Campbell
+16  A: 

I wrote an entire blog entry on this subject: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to be parsed properly. A true regex is not capable of counting. You must have a context-free grammar in order to count.

The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However, once you start adding recursion into your regexes, you are really stretching the boundaries and should consider a parser.
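The counting can be sketched in Python as follows. This is a deliberately simplified illustration, not a real parser: it assumes bare tags with no attributes, comments, or CDATA, and the TOKEN pattern and is_balanced name are hypothetical. The regex only splits the input into tokens; the explicit stack does the counting that the regex cannot:

```python
import re

# Assumption: tags look like <name> or </name>, nothing else.
TOKEN = re.compile(r"<(/?)(\w+)>")


def is_balanced(markup: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name in TOKEN.findall(markup):
        if not closing:
            stack.append(name)  # remember the open tag
        elif not stack or stack.pop() != name:
            return False        # close without a matching open
    return not stack            # leftover opens mean unbalanced


print(is_balanced("<a><b></b></a>"))
print(is_balanced("<a><b></a></b>"))
print(is_balanced("<a><a></a>"))
```

The stack can grow to any depth, which is what separates a context-free recognizer from a finite-state one.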

JaredPar
+1  A: 

People normally default to writing greedy patterns, which often means an un-thought-through .* slurps a large chunk of the file into the largest possible <foo>.*</foo> match.
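A quick Python illustration, using made-up sample strings, of what greedy matching does and why switching to a lazy quantifier only moves the problem:

```python
import re

doc = "<foo>one</foo><bar>skip</bar><foo>two</foo>"

greedy = re.findall(r"<foo>.*</foo>", doc)   # runs to the LAST </foo>
lazy = re.findall(r"<foo>.*?</foo>", doc)    # stops at the FIRST </foo>

nested = "<foo>a<foo>b</foo>c</foo>"
lazy_nested = re.findall(r"<foo>.*?</foo>", nested)

print(greedy)
print(lazy)
print(lazy_nested)
```

The greedy pattern swallows the unrelated <bar> element, and the lazy one cuts a nested element off at the inner closing tag.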

chaos
lazy/ungreedy is your friend!
Keng
+18  A: 

Actually

<img src="imgtag.gif" alt="<img>" />

is not valid HTML, and is not valid XML either.

It is not valid XML because the '<' and '>' are not valid characters inside attribute strings. They need to be escaped using the corresponding XML entities &lt; and &gt;

It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.

The correct version in HTML is

<img src="imgtag.gif" alt="&lt;img&gt;">

and the correct version in XHTML and XML is

<img src="imgtag.gif" alt="&lt;img&gt;"/>

The following example you gave is also invalid

<
tag
attr="5"
/>

This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually

<tag
attr="5"
/>

And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character

<img src="image.gif" alt='This is single quoted AND valid!'>

All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tag soup as HTML doesn't mean that you have actually written valid HTML.

Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.

Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.
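As a rough sketch of how little code a real parser needs, here is Python's standard-library xml.etree.ElementTree, which builds a tree and supports a limited subset of XPath. The sample document and its element names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
doc = """<library>
  <book lang="en"><title>Dune</title></book>
  <book lang="fr"><title>Candide</title></book>
</library>"""

root = ET.fromstring(doc)

# XPath-style query: titles of books whose lang attribute is "en".
titles = [t.text for t in root.findall("./book[@lang='en']/title")]

print(titles)
```

The query does not care about whitespace, attribute order, or quoting style, all of which a regex would have to handle by hand.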

My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.

HTML is just too hard to even try parsing on your own. First the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job at handling HTML like tag soup, just use these.
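For a feel of what a tolerant parser does with tag soup, here is a sketch using Python's standard-library html.parser. Note the assumptions: this module only reports events and does not repair nesting or build a tree the way dedicated tag-soup libraries do, and the SoupEvents class and sample input are made up:

```python
from html.parser import HTMLParser

# Unclosed <p>, unclosed <b>, no </body> or </html>: typical tag soup.
soup = "<p>one<p>two<br><b>bold"


class SoupEvents(HTMLParser):
    """Records start tags and text without complaining about missing closes."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


events = SoupEvents()
events.feed(soup)
events.close()  # flush anything still buffered

print(events.tags)
print("".join(events.chunks))
```

No exception is raised for the missing end tags; the parser just keeps reporting what it finds.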

LordOfThePigs
You don't need to escape > as &gt; though.
Joey
Okay, s/valid/exists in the wild/g
Chas. Owens
Actually, according to the specification you must escape > as &gt; just as you must escape < as &lt; it's just that many parser
LordOfThePigs
Oops, forgot to finish my comment. It's just that many parsers will be able to recover from their error state if < is properly encoded. Again, not crashing your parser doesn't mean your XML is valid.
LordOfThePigs
The specification does not say ‘>’ must be escaped — except for the special case of the sequence ‘]]>’ in content. For this reason it is easiest to always escape ‘>’, but it is not required by spec.
bobince
`>` sign is perfectly valid in html http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value
J.F. Sebastian
A: 

Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?

I totally agree that parsing html and xml using a regex is not possible as other people have answered.

However, if your requirement is not to parse html/xml but just to get at one small bit of data in a "known good" bit of html/xml, then maybe a regular expression or an even simpler "substring" search is good enough.

Robin Day
Define "good enough". Inevitably the simple regex won't work. Is not matching something or matching something you shouldn't a bug? If so then using regexes is a mistake. HTML and XML parsers are not hard to use. Avoiding learning them is a false economy.
Chas. Owens
OK, define "good enough". Let's say I have a webpage that tells me the client's IP address. That's all it does. Now, I need to write an application for the client's machine that tells me its IP address. I go to that site, look for an IP address and return it. Parsing the HTML is not needed!
Robin Day
If you have an arbitrary string whose format is completely under your control, the fact that the string happens to be well-formed XML really isn't relevant. But almost no use cases for XML actually fall into this category.
Robert Rossney
I can tell you from painful experience that most of the time it's possible to get what you want using absurdly complex regex patterns. Until the website undergoes a hilariously small change and you can throw the regex that made you cry for two days out of the window and start anew.
Thomasz
+58  A: 

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
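For comparison, here is a sketch of feeding that document to an off-the-shelf parser, Python's standard-library xml.etree.ElementTree (which sits on top of expat). The variable names are illustrative; the document is copied from above:

```python
import xml.etree.ElementTree as ET

doc = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>'''

root = ET.fromstring(doc)

children = [child.tag for child in root]  # real child elements only
attr = root.find("a").get("b")            # entity expanded by the parser
text = "".join(root.itertext())           # CDATA and trailing text as plain text

print(children)
print(repr(attr))
print(repr(text))
```

The only child element is a. Everything else that looks like a tag is entity text, CDATA, a processing instruction, or ordinary character data.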

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>
bobince
I have no idea what is going on in the first example, could you add some explanatory text?
Chas. Owens
The XML one? There are a few different constructs there; which is troublesome? The DTD internal subset? That's defining a new entity called ‘y’, containing a ‘]>’ sequence that would normally, if not in quotes, end the internal subset.
bobince
(This demonstrates that you have to have quite deep knowledge about some of the more esoteric and archaic DTD features of XML to parse a document properly, even if you aren't a DTD-validating parser.)
bobince
And then people bitch about XHTML being too strict. Damn it all, I want my XHTML to be Nazi strict! I want all the browsers to fail if there is a single space missing! Then we'll talk about parsing..
dr Hannibal Lecter
+6  A: 

One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.
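A small Python sketch of the attribute-order problem. The ORDERED pattern, the LinkFinder class, and the has_link helper are hypothetical names used only for this example:

```python
import re
from html.parser import HTMLParser

in_order = '<a href="foo" class="bar">x</a>'
reordered = '<a class="bar" title="t" href="foo">x</a>'

# Pattern that bakes in one particular attribute order.
ORDERED = re.compile(r'<a href="foo" class="bar"')


class LinkFinder(HTMLParser):
    """Looks for an <a> with href "foo" and class "bar", in any order."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        values = dict(attrs)
        if tag == "a" and values.get("href") == "foo" and values.get("class") == "bar":
            self.found = True


def has_link(html: str) -> bool:
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.found


print(bool(ORDERED.search(in_order)), bool(ORDERED.search(reordered)))
print(has_link(in_order), has_link(reordered))
```

Because the parser hands over attributes as name/value pairs, the check is the same however the author ordered them.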

AmbroseChapel
Ah, yes, that was even the question that prompted me to ask this one (the first link).
Chas. Owens
A: 

See http://search.cpan.org/~pinyan/YAPE-HTML-1.11/HTML.pm for a parser based upon regexes.

emaN
I see an awful lot of if, else, while, and other statements as well as a whole bunch of state variables. Nobody is saying that regexes can't be part of parsing HTML; it would be foolish to make that claim. What is being claimed is that a regex by itself cannot parse HTML. See the referenced questions for examples of the sorts of questions that have led to this one being asked.
Chas. Owens