Hi, I have a large, malformed test HTML document that I need to extract some numbers from.

I'd like to get the primary ratio out. I'm using this regular expression:

(?<=Primary ratio</TD><TD>--</TD><TD>).*(?=</TD>)

On this string:

Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD></TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02</TD>

But I get this as a result:

10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ra
tio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right
'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD>
</TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02

I don't want that, I just want the 10.52 number in the first tag.

I mean, it found the start of the string perfectly, but it didn't stop at the first </TD>. What am I doing wrong?

+2  A: 

Use an HTML parser instead of a RegEx - the HTML Agility Pack is a good one.

In general, regular expressions are not suitable for usage with HTML, as HTML is not a regular language. This is particularly true if you are working with HTML from different sources. See here for a compelling demonstration.
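As a rough illustration of the parser approach, here is a minimal sketch in Python. It uses the standard library's `html.parser` as a stand-in for the HTML Agility Pack (a .NET library with a different API). The `CellCollector` class and the `fragment` string are hypothetical names made up for this example, and the opening `<TR align='right'><TD align='left'>` before "Primary ratio" is assumed from the pattern of the other rows in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives as "td".
        if tag == "td":
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Assumed fragment: the sample row from the question, with its opening tags restored.
fragment = (
    "<TR align='right'><TD align='left'>Primary ratio</TD><TD>--</TD>"
    "<TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD><TD>1.13</TD>"
)

parser = CellCollector()
parser.feed(fragment)
parser.close()

# The wanted value sits two cells after the "Primary ratio" label.
label_index = parser.cells.index("Primary ratio")
primary_ratio = parser.cells[label_index + 2]
print(primary_ratio)
```

The lookup works on cell positions rather than on the raw markup, so it does not depend on how the tags are spelled or whether the final row is closed.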

Oded
It's a really nice malformed document. I don't know how the agility pack handles it. I'd just prefer to use regex in this case. I'll definitely keep this in mind in the future though.
Mike
@Mike - from the site: `The parser is very tolerant with "real world" malformed HTML.`
Oded
That, or an XML parser. I like XPath. Also, @Mike, read the first answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - because it's relevant and you'll enjoy it.
Lunivore
If it's malformed then Flynn1179's answer is probably what you're looking for.
Lunivore
@Lunivore - XML parsers are not suitable for valid HTML either - for example `<br>` is valid HTML (4.01), but not valid XML. Of course, XHTML is also XML, so that's a different issue.
Oded
Sure. Most modern HTML is XHTML anyway, because other people like XPath too. I think I used Regex the last time I needed to do something like this, but I acknowledge the complete unmaintainability of my code and hang my head in shame. Shame!
Lunivore
I also believe we are using it for different purposes. I've got a document I need to find specific information within, whereas the other question is asking about matching multiple tags. If I was parsing an HTML document to get everything inside every <P> for example, I would definitely use an HTML parser. I guess for different purposes, different tools can come into play.
Mike
@Mike - fair comment. Absolutely agree.
Oded
+2  A: 

Replace .* with .*? near the end of your regex; that should stop it from matching too much. Normally .* matches as much as possible that still fits the pattern; by adding the ?, you ask it to match as little as possible instead.
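A minimal sketch of the difference, written in Python as an assumption since the question does not name a language. The `html` string is a shortened copy of the sample from the question, and the variable names are made up for illustration.

```python
import re

# Shortened copy of the sample string from the question.
html = (
    "Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD>"
)

# The lookbehind from the question; it is a fixed-width literal, which Python's re requires.
prefix = r"(?<=Primary ratio</TD><TD>--</TD><TD>)"

# Greedy: .* runs to the end of the line, then backtracks to the LAST </TD>.
greedy = re.search(prefix + r".*(?=</TD>)", html).group(0)

# Lazy: .*? expands one character at a time and stops at the FIRST </TD>.
lazy = re.search(prefix + r".*?(?=</TD>)", html).group(0)

print(greedy)
print(lazy)
```

The greedy match runs from 10.52 all the way to -122.81, while the lazy match is just 10.52.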

Flynn1179
Brilliant. This works exactly! Thanks so much.
Mike
This behaviour is known as "greedy" matching, by the way. The syntax Flynn proposes explicitly tells the regex engine to match non-greedily.
kander