I have an HTML page (it's out of an internal address book application) and I'm trying to match both the field name and field value out of a table.
The regular expression I've cooked up so far is
"href.*?>(.*?)<\\/a.*>(.*?)<\\/span"
which matches most of the keys and values just fine. The problem is that some of the values are also links.
Example string (without link - works)
href="JavaScript:updateField("peopleType", "390061", "[email protected]", "bob", "Reg", "Bob Bobson");" onMouseOver="window.status='Update this field if possible, else explain how to update it';return true;" onMouseOut="window.status='';return true;">Emp Type</a></span></td>
<td nowrap=""><span style="font-family: Arial, Times New Roman, Courier New, Courier, monospace; color: #006699">Reg</span
Example string (with link - doesn't work)
href="JavaScript:updateField("dept", "390061", "[email protected]", "bob", "Reg", "Bob Bobson");" onMouseOver="window.status='Update this field if possible, else explain how to update it';return true;" onMouseOut="window.status='';return true;">Dept</a></span></td>
<td nowrap=""><span style="font-family: Arial, Times New Roman, Courier New, Courier, monospace">
<a href="JavaScript:showDept('TheBobs');" onMouseOver="window.status='Show People in This Dept';return true;" onMouseOut="window.status='';return true;">TheBobs</a></span
The first half (capturing the key) works correctly. The issue seems to be that the greedy .* matches all the way to the end of the link, where it finds the closing angle bracket (>), so the non-greedy .*? in the capture group has nothing left to match. I tried the regex
"href.*?>(.*?)<\\/a.*>(.*?)(<\\/a>)?<\\/span"
which works just fine for the strings with the link: the third capture group (the one with the /a in it) matches the close of the link, so my second capture group works. But then it doesn't work on values that aren't links because (I think) it's searching for the closing link tag. I thought the ? at the end of that capture group should make it optional.
I'm matching with RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline.
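For reference, here is roughly how I'm running the first pattern, as a minimal sketch rather than the real application code. The class and method names (FieldRegexRepro, Extract) are placeholders made up for this post, and the two sample strings are trimmed stand-ins for the examples above with most of the attributes cut out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldRegexRepro
{
    // Same options as in the question.
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

    // The first pattern from the question, written as a verbatim string.
    public const string Pattern = @"href.*?>(.*?)<\/a.*>(.*?)<\/span";

    // Trimmed stand-in for the example without a link in the value (assumption: attributes shortened).
    public const string NoLink =
        "href=\"JavaScript:updateField();\">Emp Type</a></span></td>\n" +
        "<td nowrap=\"\"><span style=\"color: #006699\">Reg</span";

    // Trimmed stand-in for the example with a link in the value (assumption: attributes shortened).
    public const string WithLink =
        "href=\"JavaScript:updateField();\">Dept</a></span></td>\n" +
        "<td nowrap=\"\"><span style=\"font-family: Arial\">\n" +
        "<a href=\"JavaScript:showDept('TheBobs');\">TheBobs</a></span";

    // Returns { key, value } from capture groups 1 and 2, or null when nothing matches.
    public static string[] Extract(string input)
    {
        Match m = Regex.Match(input, Pattern, Options);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    public static void Main()
    {
        foreach (string sample in new[] { NoLink, WithLink })
        {
            string[] kv = Extract(sample);
            Console.WriteLine(kv == null ? "no match" : $"key=[{kv[0]}] value=[{kv[1]}]");
        }
    }
}
```

On these trimmed strings, the first sample gives key "Emp Type" and value "Reg", while the second gives key "Dept" and an empty value, which is the behaviour I'm trying to fix.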
How do I get the regular expression to match both the case with a link in the value, and without? Thanks.