I have an HTML page (it's out of an internal address book application) and I'm trying to match both the field name and field value out of a table.

The regular expression I've cooked up so far is

"href.*?>(.*?)<\\/a.*>(.*?)<\\/span"

which matches most of the keys and values just fine. The problem is that some of the values are also links.

Example string (without link - works)

href="JavaScript:updateField(&quot;peopleType&quot;, &quot;390061&quot;, &quot;[email protected]&quot;, &quot;bob&quot;, &quot;Reg&quot;, &quot;Bob Bobson&quot;);" onMouseOver="window.status='Update this field if possible, else explain how to update it';return true;" onMouseOut="window.status='';return true;">Emp Type</a></span></td>
<td nowrap=""><span style="font-family: Arial, Times New Roman, Courier New, Courier, monospace; color: #006699">Reg</span

Example string (with link - doesn't work)

href="JavaScript:updateField(&quot;dept&quot;, &quot;390061&quot;, &quot;[email protected]&quot;, &quot;bob&quot;, &quot;Reg&quot;, &quot;Bob Bobson&quot;);" onMouseOver="window.status='Update this field if possible, else explain how to update it';return true;" onMouseOut="window.status='';return true;">Dept</a></span></td>
<td nowrap=""><span style="font-family: Arial, Times New Roman, Courier New, Courier, monospace">
<a href="JavaScript:showDept('TheBobs');" onMouseOver="window.status='Show People in This Dept';return true;" onMouseOut="window.status='';return true;">TheBobs</a></span

The first half (capturing the key) works correctly. The issue seems to be that the greedy .* matches all the way to the end of the link, where it finds the closing angle bracket, so the non-greedy .*? in the capture group has nothing left to match. I tried the regex

"href.*?>(.*?)<\\/a.*>(.*?)(<\\/a>)?<\\/span"

which works just fine for the strings with the link: the third capture group (the one with the /a in it) matches the close of the link, so my second capture group works. But then it doesn't work on values that aren't links because (I think) it's searching for the closing link tag. I thought the ? at the end of that capture group should make it optional.

I'm matching with RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline.
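
For reference, here is a minimal reproduction of how the first pattern behaves. It is only a sketch: it uses Python's re module rather than .NET (greedy and lazy quantifiers behave the same way for this pattern), the two sample strings are shortened versions of the examples above with the long attributes trimmed, and the helper name key_and_value is made up for illustration.

```python
import re

# The first pattern from above. re.DOTALL and re.IGNORECASE stand in for
# RegexOptions.Singleline and RegexOptions.IgnoreCase; IgnorePatternWhitespace
# is left out because the pattern contains no whitespace.
PATTERN = re.compile(r'href.*?>(.*?)</a.*>(.*?)</span', re.DOTALL | re.IGNORECASE)

# Shortened versions of the two example strings (attributes trimmed).
NO_LINK = (
    'href="JavaScript:updateField();">Emp Type</a></span></td>\n'
    '<td nowrap=""><span style="color: #006699">Reg</span'
)
WITH_LINK = (
    'href="JavaScript:updateField();">Dept</a></span></td>\n'
    '<td nowrap=""><span style="font-family: Arial">\n'
    '<a href="JavaScript:showDept(\'TheBobs\');">TheBobs</a></span'
)

def key_and_value(text):
    """Return (key, value) from the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    return (m.group(1), m.group(2)) if m else None

print(key_and_value(NO_LINK))    # ('Emp Type', 'Reg')
print(key_and_value(WITH_LINK))  # ('Dept', '')
```

In the linked case the greedy .* runs up to the last ">" in the string, which is the one closing the inner link, so the second group is left with an empty match.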

How do I get the regular expression to match both the case with a link in the value, and without? Thanks.

+1  A: 

There's a free tool called Regex Coach you can use to easily debug your regular expressions.

Adrian Grigore
+2  A: 

I'd consider preprocessing the HTML and removing content that is known to trip up the regex implementation.

As far as testers go, you can also use Regex Hero, since Silverlight's Regex implementation is compatible with .NET's.

Richard Szalay
+1 - I've been looking for a decent online regex tester, thanks!
John Rasch
A: 

Try:

href[^<>]+>(.*?)<\\/a[^<>]*>(.*?)<\\/span

From what I can tell, it looks like "/a.*>" is being too greedy, and I always try to be as specific as possible when writing regexes, which is why I used "[^<>]+".

David Rogers
+1  A: 

Avoid the "." character. It usually gives you nothing but trouble... because it is unspecific.

Try something like this:

href=[^>]*>([^<]*)</a\s*>((?:(?!</span\s*>).)*)

Note: since your sample doesn't return a name-value pair, but rather just a name (assuming the first capture group is the name), I don't know what you'd expect it to match. Maybe post a more complete sample and specify exactly what parts you'd like to have captured.
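
To illustrate the note, here is a rough sketch in Python's re module (not .NET, though the syntax used here is common to both). It runs the pattern above against shortened versions of the two samples. The second pattern, KEY_VALUE, is a hypothetical extension: it assumes the value always sits in the very next td/span cell, optionally wrapped in a link, which may not hold for the real page.

```python
import re

# The pattern from above: negated character classes instead of ".",
# plus a tempered dot that stops at the first closing span.
NAME_ONLY = re.compile(
    r'href=[^>]*>([^<]*)</a\s*>((?:(?!</span\s*>).)*)', re.IGNORECASE
)

# Hypothetical extension (an assumption about the page layout):
# step over the closing tags explicitly, then allow an optional
# <a ...> tag in front of the value.
KEY_VALUE = re.compile(
    r'href=[^>]*>([^<]*)</a\s*>\s*</span\s*>\s*</td\s*>\s*'
    r'<td[^>]*>\s*<span[^>]*>\s*(?:<a[^>]*>)?([^<]*)',
    re.IGNORECASE,
)

# Shortened versions of the two example strings (attributes trimmed).
NO_LINK = (
    'href="JavaScript:updateField();">Emp Type</a></span></td>\n'
    '<td nowrap=""><span style="color: #006699">Reg</span'
)
WITH_LINK = (
    'href="JavaScript:updateField();">Dept</a></span></td>\n'
    '<td nowrap=""><span style="font-family: Arial">\n'
    '<a href="JavaScript:showDept(\'TheBobs\');">TheBobs</a></span'
)

# The original suggestion yields the name only; group 2 is empty because
# a closing span follows the key's link immediately.
print(NAME_ONLY.search(NO_LINK).groups())    # ('Emp Type', '')

# The extended sketch picks up the value with or without a link around it.
print(KEY_VALUE.search(NO_LINK).groups())    # ('Emp Type', 'Reg')
print(KEY_VALUE.search(WITH_LINK).groups())  # ('Dept', 'TheBobs')
```

Spelling out every tag between the key and the value makes the pattern brittle if the markup changes, but it avoids "." entirely and so cannot overrun into the next row.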

Lucero