ansaurus

Question

Answer 1

+2 A:

Use an HTML parser instead of a RegEx - the HTML Agility Pack is a good one.

In general, regular expressions are not suitable for use with HTML, as HTML is not a regular language. This is particularly true if you are working with HTML from different sources. See here for a compelling demonstration.
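As a rough illustration of the parser approach, here is a minimal sketch. It uses Python's standard-library `html.parser` as a stand-in, not the HTML Agility Pack (which is a .NET library), and the `ParagraphCollector` class, `extract_paragraphs` helper, and sample markup are all made up for the example. It only shows that a tolerant parser can pull text out of sloppy markup without a regex.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._buffer = None    # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> both arrive as "p".
        if tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()            # keep a paragraph that was never closed
    return parser.paragraphs


# Malformed on purpose: unclosed <p>, uppercase tag, unquoted attribute, stray </div>.
sample = "<P class=x>first<p>second <b>bold</b></p></div>"
print(extract_paragraphs(sample))
```

The parser does not raise on the broken input; it simply reports the tags and text it can recognise, which is the property the answer is relying on.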

Oded 2010-07-25 07:32:32

It's a really nice malformed document. I don't know how the agility pack handles it. I'd just prefer to use regex in this case. I'll definitely keep this in mind in the future though.

Mike 2010-07-25 07:34:46

@Mike - from the site: `The parser is very tolerant with "real world" malformed HTML.`

Oded 2010-07-25 07:35:24

That, or an XML parser. I like XPath. Also, @Mike, read the first answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - because it's relevant and you'll enjoy it.

Lunivore 2010-07-25 07:38:46

If it's malformed then Flynn1179's answer is probably what you're looking for.

Lunivore 2010-07-25 07:39:31

@Lunivore - XML parsers are not suitable for valid HTML either - for example `<br>` is valid HTML (4.01), but not valid XML. Of course, XHTML is also XML, so that's a different issue.
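A small sketch of the `<br>` point, again in Python's standard library rather than .NET. The helper names `parses_as_xml` and `html_start_tags` are invented for the example; the XML parser rejects the unclosed `<br>`, while the HTML parser just reports it as a tag.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

snippet = "<p>line one<br>line two</p>"          # valid HTML 4.01, not well-formed XML
xhtml_snippet = "<p>line one<br/>line two</p>"   # the XHTML spelling


def parses_as_xml(text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Records the name of every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text):
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.tags


print(parses_as_xml(snippet))        # False: <br> is never closed
print(parses_as_xml(xhtml_snippet))  # True
print(html_start_tags(snippet))      # ['p', 'br']
```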

Oded 2010-07-25 07:42:26

Sure. Most modern HTML is XHTML anyway, because other people like XPath too. I think I used Regex the last time I needed to do something like this, but I acknowledge the complete unmaintainability of my code and hang my head in shame. Shame!

Lunivore 2010-07-25 09:46:43

I also believe we are using it for different purposes. I've got a document I need to find specific information within, whereas the other question is asking about matching multiple tags. If I was parsing an HTML document to get everything inside every <P> for example, I would definitely use an HTML parser. I guess for different purposes, different tools can come into play.

Mike 2010-07-25 12:48:11

@Mike - fair comment. Absolutely agree.

Oded 2010-07-25 13:41:31

Answer 2

+2 A:

Replace .* with .*? near the end of your regex; that should stop it from matching too much. Normally it'll match as much as possible that fits the pattern; by adding the ?, you ask it to match as little as possible instead.
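To see the difference in isolation, here is a minimal sketch in Python. The pattern and input are made up for illustration (the original regex is not shown in this thread), but `.*` versus `.*?` behaves the same way in most regex flavours, including .NET.

```python
import re

# Hypothetical input with two candidate closing tags.
text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as it can, so it runs to the LAST </b>.
greedy = re.search(r"<b>(.*)</b>", text).group(1)

# Lazy: .*? grabs as little as it can, so it stops at the FIRST </b>.
lazy = re.search(r"<b>(.*?)</b>", text).group(1)

print(greedy)  # one</b> and <b>two
print(lazy)    # one
```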

Flynn1179 2010-07-25 07:33:56

Brilliant. This works exactly! Thanks so much.

Mike 2010-07-25 07:36:07

This behaviour is known as "greedy" matching, by the way. The syntax Flynn proposes explicitly tells the regex engine to match non-greedily (also called "lazy" matching).

kander 2010-07-25 07:39:40

Simple Regular expression question.