ansaurus

Question

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Answer 1

+6 A:

It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.
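
A minimal sketch of the counting problem, in Python. The sample markup and the pattern are made up for illustration. A non-recursive pattern pairs the first opening tag with the first closing tag it meets, while a parser tracks nesting depth:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one <div> nested inside another.
doc = "<div>outer <div>inner</div> tail</div>"

# A lazy pattern stops at the FIRST </div>, which belongs to the inner element.
naive = re.search(r"<div>(.*?)</div>", doc).group(1)

# A parser keeps track of depth and pairs each tag with its real partner.
root = ET.fromstring(doc)
inner = root.find("div")

print(repr(naive))       # what the regex believes the outer <div> contains
print(repr(root.text))   # text before the nested element
print(repr(inner.text))  # content of the nested element
print(repr(inner.tail))  # text after the nested element
```

Switching to a greedy pattern fixes this one input and breaks on two sibling elements instead, which is the point: the pattern has no notion of depth.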

Anton Gogolev 2009-03-31 14:16:30

Answer 2

+67 A:

XML is not a regular language.

edit: also, your language already has an XML parser, why don't you use that instead of inventing your own?

Brian Campbell 2009-03-31 14:16:49

It might be noted here, that modern regexes can do more than regular languages, but only so much. You will need a context free grammar to parse XML, HTML etc.

Daren Thomas 2009-03-31 15:22:45

Yes, this is true. I was trying to go for conciseness in my answer; I wanted to point out that regular expressions are a fundamentally wrong tool to try to parse XML. Really, there are XML parsers available for almost every language and platform; why not just use those?

Brian Campbell 2009-03-31 15:27:34

Yes, but saying that regexes fundamentally can't parse XML and HTML doesn't persuade people; I have tried it. That is why I am collecting examples of what causes the problems. Hopefully by having a list of what can go wrong people will realize "oh, that is what they meant by not a regular language".

Chas. Owens 2009-03-31 15:47:05

@Chas. Owens Edited my reply to include another reason, then, which is that it's not a good idea to reinvent the wheel. I think if people aren't convinced by these two reasons, they aren't going to be convinced by much else.

Brian Campbell 2009-03-31 16:10:21

+1 for taking the time to enumerate so many different parsers :)

JaredPar 2009-03-31 16:42:31

They don't use an existing parser because they already know regex and think the problem is simple. The purpose of this question is to gather evidence that it is not simple.

Chas. Owens 2009-03-31 16:55:44

@JaredPar Thanks, I was hoping people would appreciate that :)

Brian Campbell 2009-03-31 18:22:46

@Chas. Owens Sure, some concrete examples are good, and I think the ones you included in the question cover plenty of cases. If after your examples, the theoretical argument, and the practical argument that parsers already exist, they aren't convinced, then I think there is no convincing them.

Brian Campbell 2009-03-31 18:23:56

-0 Doesn't include my favorite language.

Andrew Grimm 2010-04-07 06:16:56

@Andrew Really? Have you checked the link on the word "an"? Or do you have a favorite language that doesn't appear on your user info page?

Brian Campbell 2010-04-13 19:11:50

@Brian: There isn't much activity on the `lolcode` tag! (Seriously: my mistake, though I think nokogiri is popular as well)

Andrew Grimm 2010-04-13 23:28:57

@Andrew Ah, yes. I didn't try to pick the best or most popular parser for each language; for most of these, I picked the first plausible Google result for "_language_ xml parser".

Brian Campbell 2010-04-15 03:04:41

Answer 3

+16 A:

I wrote an entire blog entry on this subject: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to be parsed properly. A true regex is not capable of counting. You need at least a context-free grammar in order to count.

The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However, once you start adding recursion to your regular expressions, you are really stretching the boundaries and should consider a parser.

JaredPar 2009-03-31 14:18:42

Answer 4

+1 A:

People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.
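
To make the slurping concrete, here is a small Python sketch. The input string is a made-up example, not taken from any real document:

```python
import re

# Hypothetical input with two separate <foo> elements.
doc = "<foo>one</foo><bar>x</bar><foo>two</foo>"

greedy = re.findall(r"<foo>.*</foo>", doc)
lazy = re.findall(r"<foo>.*?</foo>", doc)

print(greedy)  # one match running from the first <foo> to the last </foo>
print(lazy)    # two matches, one per element
```

The lazy form only helps while the elements are not nested; with a <foo> inside a <foo> it stops at the wrong closing tag.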

chaos 2009-03-31 14:20:06

lazy/ungreedy is your friend!

Keng 2009-03-31 14:59:40

Answer 5

+18 A:

Actually

<img src="imgtag.gif" alt="<img>" />

is not valid HTML, and is not valid XML either.

It is not valid XML because the '<' and '>' are not valid characters inside attribute strings. They need to be escaped using the corresponding XML entities &lt; and &gt;.

It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.

The correct version in HTML is

<img src="imgtag.gif" alt="&lt;img&gt;">

and the correct version in XHTML and XML is

<img src="imgtag.gif" alt="&lt;img&gt;"/>
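
If the attribute value comes from a program rather than being typed by hand, the escaping can be left to a library. A minimal Python sketch using the standard html module (the variable names are illustrative only):

```python
import html

alt_text = "<img>"

# html.escape replaces <, > and & (and quotes when quote=True) with entities.
escaped = html.escape(alt_text, quote=True)
tag = f'<img src="imgtag.gif" alt="{escaped}"/>'

print(tag)
print(html.unescape(escaped))  # round-trips back to the original text
```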

The following example you gave is also invalid

<
tag
attr="5"
/>

This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually

<tag
attr="5"
/>

And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character

<img src="image.gif" alt='This is single quoted AND valid!'>
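
A short Python sketch of how this trips up a pattern that assumes double quotes, next to the standard library's html.parser, which accepts either quoting style. The AltCollector class is a hypothetical helper written for this example:

```python
import re
from html.parser import HTMLParser

doc = """<img src="image.gif" alt='This is single quoted AND valid!'>"""

# A pattern that assumes double quotes silently finds nothing.
naive = re.findall(r'alt="([^"]*)"', doc)


class AltCollector(HTMLParser):
    """Collects the alt attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "img" and name == "alt":
                self.alts.append(value)


collector = AltCollector()
collector.feed(doc)
collector.close()

print(naive)
print(collector.alts)
```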

All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tag soup as HTML doesn't mean that you have actually written valid HTML.

Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.

Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.

My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.
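
As a rough illustration of how little code an existing parser needs, here is a sketch using Python's standard xml.etree.ElementTree, a DOM-style API that supports a limited subset of XPath. The catalog document is invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; one <book> sits deeper in the tree than the other.
doc = """<catalog>
  <book id="1"><title>First</title></book>
  <section>
    <book id="2"><title>Second</title></book>
  </section>
</catalog>"""

root = ET.fromstring(doc)

# ".//" searches at any depth, so nesting level does not matter.
titles = [t.text for t in root.findall(".//book/title")]

# Attribute predicates are part of the XPath subset ElementTree supports.
second = root.find(".//book[@id='2']/title").text

print(titles)
print(second)
```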

HTML is just too hard to even try parsing on your own. First, the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job of handling tag-soup HTML; just use these.

LordOfThePigs 2009-03-31 14:26:17

You don't need to escape > as &gt; though.

Joey 2009-03-31 14:49:37

Okay, s/valid/exists in the wild/g

Chas. Owens 2009-03-31 15:00:41

Actually, according to the specification you must escape > as &gt; just as you must escape < as &lt; it's just that many parser

LordOfThePigs 2009-03-31 15:02:47

Oops, forgot to finish my comment. It's just that many parsers will be able to recover from their error state if < is properly encoded. Again, not crashing your parser doesn't mean your XML is valid.

LordOfThePigs 2009-03-31 15:03:46

The specification does not say ‘>’ must be escaped — except for the special case of the sequence ‘]]>’ in content. For this reason it is easiest to always escape ‘>’, but it is not required by spec.

bobince 2009-03-31 17:03:49

`>` sign is perfectly valid in html http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value

J.F. Sebastian 2009-11-28 00:32:35

Answer 6

A:

Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?

I totally agree that parsing html and xml using a regex is not possible as other people have answered.

However, if your requirement is not to parse html/xml but to just get at one small bit of data in a "known good" bit of html / xml then maybe a regular expression or even an even simpler "substring" is good enough.

Robin Day 2009-03-31 14:29:24

Define "good enough". Inevitably the simple regex won't work. Is not matching something or matching something you shouldn't a bug? If so then using regexes is a mistake. HTML and XML parsers are not hard to use. Avoiding learning them is a false economy.

Chas. Owens 2009-03-31 15:35:10

OK, define "good enough". Let's say I have a webpage that tells me the client's IP address. That's all it does. Now, I need to write an application for the client's machine that tells me its IP address. I go to that site, look for an IP address and return it. Parsing the HTML is not needed!
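
Roughly something like this Python sketch, where the page body is a made-up stand-in for whatever the site returns and the pattern only checks the dotted-quad shape, not that each number is 0 to 255:

```python
import re

# Hypothetical response body from a "what is my IP" page.
page = "<html><body>Your IP address is 203.0.113.7</body></html>"

# Four groups of 1-3 digits separated by dots; no range validation.
match = re.search(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", page)
ip = match.group(0) if match else None

print(ip)
```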

Robin Day 2009-03-31 17:53:38

If you have an arbitrary string whose format is completely under your control, the fact that the string happens to be well-formed XML really isn't relevant. But almost no use cases for XML actually fall into this category.

Robert Rossney 2009-03-31 19:18:40

I can tell you from painful experience that most of the time it's possible to get what you want using absurdly complex regex patterns. Until the website undergoes a hilariously small change and you can throw the regex that made you cry for two days out of the window and start anew.

Thomasz 2009-04-04 14:47:09

Answer 7

+58 A:

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
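
For the curious, a small Python sketch comparing a naive "find every tag name" pattern with what the standard xml.etree.ElementTree parser sees in the document above. The pattern is a deliberately simple assumption about what such a regex might look like:

```python
import re
import xml.etree.ElementTree as ET

doc = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>'''

# Naive scan: anything that looks like "<name" is taken to be an element.
regex_tags = re.findall(r"<([A-Za-z]\w*)", doc)

# The parser knows which "<a" sit inside CDATA or a processing instruction.
root = ET.fromstring(doc)
real_tags = [el.tag for el in root.iter()]

# The internal entity &y; is expanded inside the attribute value.
attr = root.find("a").get("b")

print(regex_tags)
print(real_tags)
print(attr)
```

The regex reports several a elements and a b element that do not exist; the parser reports only x and a.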

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>

bobince 2009-03-31 17:47:38

I have no idea what is going on in the first example, could you add some explanatory text?

Chas. Owens 2009-04-01 06:23:52

The XML one? There are a few different constructs there; which is troublesome? The DTD internal subset? That's defining a new entity called ‘y’, containing a ‘]>’ sequence that would normally, if not in quotes, end the internal subset.

bobince 2009-04-01 12:56:24

(This demonstrates that you have to have quite deep knowledge about some of the more esoteric and archaic DTD features of XML to parse a document properly, even if you aren't a DTD-validating parser.)

bobince 2009-04-01 12:57:23

And then people bitch about XHTML being too strict. Damn it all, I want my XHTML to be Nazi strict! I want all the browsers to fail if there is a single space missing! Then we'll talk about parsing..

dr Hannibal Lecter 2009-05-13 00:10:21

Answer 8

+6 A:

One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.
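
A quick Python sketch of that gotcha. The markup and the LinkFinder class are invented for the example; html.parser is the standard library's tolerant HTML tokenizer:

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: both links match, but neither in the order the regex expects.
doc = (
    '<a class="bar" title="t" href="foo">one</a> '
    '<a href="foo" id="x" class="bar">two</a>'
)

# Assumes href comes first and class follows immediately.
naive = re.findall(r'<a href="foo" class="bar">', doc)


class LinkFinder(HTMLParser):
    """Counts <a> tags with href="foo" and class="bar", in any order."""

    def __init__(self):
        super().__init__()
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href") == "foo" and attr_map.get("class") == "bar":
            self.hits += 1


finder = LinkFinder()
finder.feed(doc)
finder.close()

print(naive)
print(finder.hits)
```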

AmbroseChapel 2009-04-01 05:58:55

Ah, yes, that was even the question that prompted me to ask this one (the first link).

Chas. Owens 2009-04-01 06:22:32

Answer 9

A:

See http://search.cpan.org/~pinyan/YAPE-HTML-1.11/HTML.pm (it is a parser based upon regexes).

emaN 2010-01-03 08:34:02

I see an awful lot of if, else, while, and other statements as well as a whole bunch of state variables. Nobody is saying that regexes can't be part of parsing HTML; it would be foolish to make that claim. What is being claimed is that a regex by itself cannot parse HTML. See the referenced questions for examples of the sorts of questions that have led to this one being asked.
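
To show the division of labor in miniature, here is a toy Python sketch (not how YAPE::HTML works, just an assumption-laden illustration): a regex does the tokenizing, and ordinary code with a stack does the counting that the regex cannot. It ignores comments, CDATA, and attribute values containing '>', so it is nowhere near a real parser.

```python
import re

# Tokenizer only: captures a leading "/", the tag name, and a trailing "/".
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^>]*?(/?)>")


def is_balanced(markup):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue  # <br/> style tags open and close at once
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack


print(is_balanced("<a><b></b></a>"))
print(is_balanced("<a><b></a></b>"))
```

The state that makes this work (the stack) lives outside the regex, which is exactly the distinction being drawn.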

Chas. Owens 2010-01-03 14:08:10
