ansaurus

Question

Regex matching tags

Answer 1

+2 A:

What's the problem with using an HTML or XML library?

Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
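As a rough illustration of the library approach, here is a minimal sketch in Python rather than C#, using the standard library's `xml.etree.ElementTree`. The sample row is hypothetical (it is not the asker's markup) and is assumed to be well formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample row (an assumption, not the asker's data).
row = "<tr><td class='a'>one</td><td>two</td><td>three</td></tr>"

root = ET.fromstring(row)

# ".//td" is ElementTree's limited XPath for "every td descendant".
cells = [td.text for td in root.findall(".//td")]
print(cells)
```

This only works when the input parses as XML; malformed markup raises `xml.etree.ElementTree.ParseError`.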

Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags to any depth, and this is something that can't be represented with regular expressions.

So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
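To make the fragility concrete, the following Python sketch runs the simple pattern against two hypothetical inputs, one flat and one with a table nested inside a cell. Both inputs are made up for illustration.

```python
import re

# The simple-case pattern from the text; DOTALL lets "." span line breaks.
simple = re.compile(r"<td.*?</td>", re.DOTALL)

# Hypothetical inputs (assumptions, not the asker's markup).
flat = "<tr><td>a</td><td>b</td></tr>"
nested = "<tr><td>outer <table><tr><td>inner</td></tr></table> tail</td></tr>"

flat_matches = simple.findall(flat)
nested_matches = simple.findall(nested)

print(flat_matches)    # two complete cells
print(nested_matches)  # one truncated match that stops at the inner </td>
```

On the nested input the lazy match ends at the first `</td>` it meets, which belongs to the inner cell, so the outer cell is cut short.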

Granted, the XML is broken, but you can fix that with regex. :-) For instance, if you replace the pattern (\w+)=(\w+) in the input with $1='$2' (or \1='\2', if that's the syntax of C# replace patterns), you'll get valid XML.
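A minimal sketch of that repair in Python, where the replacement syntax is `\1='\2'`. The broken sample is hypothetical, and the sketch assumes that unquoted attribute values are the only defect.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical malformed cell with unquoted attribute values (an assumption).
broken = "<td width=100 class=odd>x</td>"

# Wrap every bare word=word pair in single quotes.
fixed = re.sub(r"(\w+)=(\w+)", r"\1='\2'", broken)
print(fixed)

# The repaired string now parses as XML.
cell = ET.fromstring(fixed)
print(cell.attrib)
```

Note that the pattern is not limited to attributes: text content such as `a=b` inside a cell would be rewritten as well, so this is only safe when the cell text contains no such pairs.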

Daniel 2009-07-24 14:40:33

The document may not be well formed (like in this case). In fact, XDocument x = XDocument.Parse(row.ToString()); throws an XmlException.

pistacchio 2009-07-24 14:47:41

Ah, well, who am I to disagree with that? I use regex to extract td's from a malformed HTML page myself. Well, the pattern is in the answer. I don't know C#, so I can't give the exact code.

Daniel 2009-07-24 14:50:13

Oh, by the way, your regex does not match the first two TDs!

pistacchio 2009-07-24 14:53:35

The working copy :) <td[^>]*>[^<]*</td>

pistacchio 2009-07-24 14:55:08
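For reference, a short Python sketch of what the "working copy" pattern does and does not match. The sample row is hypothetical; it mixes cells with unquoted attributes and one cell that contains child markup.

```python
import re

# The "working copy" pattern from the comment above.
cell = re.compile(r"<td[^>]*>[^<]*</td>")

# Hypothetical row (an assumption, not the asker's data).
row = (
    "<tr><td width=100>one</td>"
    "<td class=odd>two</td>"
    "<td><b>three</b></td></tr>"
)

matches = cell.findall(row)
print(matches)
```

Cells with attributes are matched, but because `[^<]*` forbids any tag inside the cell, the third cell is skipped entirely.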

I changed the regular expression, as you may see. The older one should have worked too. Maybe there was a typo? The "working copy" one works here.

Daniel 2009-07-24 14:58:43

Ah... I can see the typo. Indeed, instead of `*]` it should have been `]*`. :-)

Daniel 2009-07-24 15:07:19

Answer 2

A:

I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.

Sneal 2009-07-24 14:52:13

Answer 3

A:

Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may be closing a descendant element rather than that element.)

A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one such parser that people seem to like.
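The HTML Agility Pack is a .NET library, so it is not shown here. As a stand-in, this Python sketch uses the standard library's `html.parser`, which is a tolerant event-based parser rather than a DOM, to show the nesting-aware idea. The class name `TopLevelCells` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TopLevelCells(HTMLParser):
    """Collect the text of each outermost <td>, including nested tables."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <td> elements currently open
        self.cells = []    # normalized text of each outermost cell
        self._parts = []   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            if self.depth == 0:
                self._parts = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the cell text.
                self.cells.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._parts.append(data)


# Hypothetical markup with an unquoted attribute and a nested table.
html = (
    "<tr><td width=100>outer <table><tr><td>inner</td></tr></table> tail</td>"
    "<td>plain</td></tr>"
)

parser = TopLevelCells()
parser.feed(html)
parser.close()
print(parser.cells)
```

Tracking the depth of open `<td>` elements is what lets the inner `</td>` be recognized as closing a descendant rather than the outer cell.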

Robert Rossney 2009-07-24 18:18:32
