I'm trying to write a regex to match patterns like this:

<td style="alskdjf" />

i.e. a self-closing <td>

but not this:

<td style="alsdkjf"><br /></td>

I initially came up with:

<td\s+.*?/>

but that also matches the second example, because the lazy .*? runs past the first > until it reaches the /> of <br />. I thought that something like this might work:

<td\s+.*?[^>]/>

but it doesn't. I'm using C#.NET.

I'm only looking for <td>s that have an attribute, e.g. <td style="alsdfkj" /> but not <td>.

+4  A: 

You're going to have problems using regexes on HTML, since HTML is not a regular language. I'd recommend using an HTML parser for all but the very simplest cases.
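
As a rough illustration of the parser route, here is a minimal sketch. It uses Python's standard html.parser module rather than a C# library, purely to keep the example self-contained; SelfClosingTdFinder and find_self_closing_tds are hypothetical names made up for this sketch. It assumes the markup is well-formed enough for the parser to recognise the /> ending.

```python
from html.parser import HTMLParser


class SelfClosingTdFinder(HTMLParser):
    """Collects the attributes of self-closing <td ... /> tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_startendtag(self, tag, attrs):
        # Called only for XHTML-style empty tags such as <td ... />.
        # Tags without attributes are skipped, as the question asks.
        if tag == "td" and attrs:
            self.found.append(dict(attrs))


def find_self_closing_tds(html):
    """Return one attribute dict per self-closing <td> that has attributes."""
    finder = SelfClosingTdFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

A <td> that is opened and later closed with </td> goes through the parser's start-tag handler instead, so it is never collected here.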

Brian Agnew
It depends on the case. A self-terminating tag like the one the OP is trying to match is, actually, regular as long as no `>` characters are expected in attribute values.
Amber
Unless you want to match the syntactically equivalent <td style="alsdkjf"></td>, of course.
Greg Campbell
Correct. However, you could again expand the regex to match that as well - just end the pattern with `(?:/>|></td>)` instead of `/>`.
Amber
+4  A: 

This will match what you're looking for, and not match the problematic case you had with your first few tries:

<td[^>]*?/>
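
To see the difference on the two examples from the question, here is a small sketch. It is written in Python rather than C# for brevity; the pattern strings only use syntax that .NET's Regex class also accepts. matched_text is a hypothetical helper that returns the matched substring, or None when nothing matches.

```python
import re

self_closing = '<td style="alskdjf" />'
with_child = '<td style="alsdkjf"><br /></td>'

first_try = re.compile(r'<td\s+.*?/>')        # first attempt from the question
second_try = re.compile(r'<td\s+.*?[^>]/>')   # second attempt from the question
fixed = re.compile(r'<td[^>]*?/>')            # pattern from this answer


def matched_text(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Both attempts run past the first '>' and stop at the '/>' of <br />.
print(matched_text(first_try, with_child))    # <td style="alsdkjf"><br />
print(matched_text(second_try, with_child))   # <td style="alsdkjf"><br />

# [^>] cannot cross the '>' that closes the opening tag, so nothing matches.
print(matched_text(fixed, with_child))        # None
print(matched_text(fixed, self_closing))      # <td style="alskdjf" />
```

The second attempt fails because [^>] is satisfied by the space before the /> in <br />, so it does nothing to stop .*? from leaving the opening tag.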

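Note, however, that if you need to allow > characters in attribute values, you'd need something like this:

<td(?:[^>"]|"[^"]*")*?/>

This allows > only within matching double quotes (you could similarly expand it to allow single quotes). The quote character is excluded from the first alternative so that a quoted value can only be consumed as a whole; otherwise a closing quote could be paired with the opening quote of a later attribute and let the match leak out of the tag.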
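
A quick sketch of the quote-aware idea, again in Python for brevity. It uses [^>"] as the first alternative, so a double quote can only be consumed as part of a complete quoted value. first_quote_aware_match is a hypothetical helper name, and the test strings are made up for illustration.

```python
import re

# '>' is allowed only inside a double-quoted attribute value.
quote_aware = re.compile(r'<td(?:[^>"]|"[^"]*")*?/>')


def first_quote_aware_match(text):
    """Return the first matched substring, or None if there is no match."""
    m = quote_aware.search(text)
    return m.group(0) if m else None


# A '>' inside a quoted value no longer ends the tag early.
print(first_quote_aware_match('<td title="a > b" />'))
# → <td title="a > b" />

# A <td> that is not self-closing is still rejected, even when a later
# tag carries its own quoted attribute.
print(first_quote_aware_match('<td title="x"><br class="y" /></td>'))
# → None
```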

You can add whatever specific attribute you're looking for into the regex; for instance for your example:

<td[^>]*? style="alskdjf"[^>]*?/>
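
If the attribute name or value comes from a variable, the same pattern can be assembled at run time. A minimal sketch in Python, where self_closing_td_with is a hypothetical helper; it assumes double-quoted values and a single space before the attribute name, exactly as in the pattern above.

```python
import re


def self_closing_td_with(attr, value):
    """Build a pattern for a self-closing <td> carrying attr="value"."""
    # re.escape keeps characters such as '.' or '(' in the inputs literal.
    return re.compile(
        r'<td[^>]*? ' + re.escape(attr) + '="' + re.escape(value) + r'"[^>]*?/>'
    )


pattern = self_closing_td_with('style', 'alskdjf')

print(bool(pattern.search('<td class="x" style="alskdjf" />')))  # True
print(bool(pattern.search('<td style="other" />')))               # False
print(bool(pattern.search('<td style="alskdjf"><br /></td>')))    # False
```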
Amber
+2  A: 

Regex will have serious trouble with messy HTML of the sort browsers often have to deal with. There are all sorts of horrible obfuscations that can be done to the markup that you just don't want to have to think about!

The HTML Agility Pack is what you really want to be using, and it has had very good reviews everywhere I've seen. It is a robust library for reading any kind of mangled HTML into a DOM. I have personally found it to be a superb library, as surely have others, many of whom use it in business applications.

Noldorin