ansaurus

Question

Boost::regex issue, Matching an HTML span element

Answer 1

+2 A:

For one thing, this:

[Green|Red]

doesn't do what you think it does. You want:

(?:Green|Red)

[Green|Red] is a character class made up of the letters GRred|, not a way of alternating between matches. The way you've written it, it will match exactly one of those characters followed by any number of other characters.

This:

[\s]

is redundant and maybe hazardous (depending on interpretation it could be what's actually making your match not work). It can be just

\s

In order for your second \s to work, the capturing expression probably needs to be

(.*?)

I also recommend making your first .* into [^>]*, to avoid the problem you'll get if you ever apply this to actual HTML documents, where it will suck in arbitrary amounts of HTML.

chaos 2009-07-12 00:03:25

Umm.. shouldn't it match either copyGreen or copyRed? I don't think I've got it wrong.. Please explain a little bit further, what and why I'm doing wrong. Thanks.

nhaa123 2009-07-12 00:05:06

Square brackets represent a single character match. That is [green|red] is identical to [gren|d] - it will match g OR r OR e OR n OR | OR d. For strings, you want alternation, which needs to be grouped inside parentheses to prevent it applying to the whole expression.

Peter Boughton 2009-07-12 00:10:02

Hah, OK. I stand corrected, my apologizes ;)

nhaa123 2009-07-12 00:27:33

Answer 2

+1 A:

There's a couple of problems with your regex.

First is this bit: [Green|Red]

which maches a set of characters, that set is G, r, e, n, |, R and d.

you need to do this using parenthises, like (Green|Red). now this matches either the string Green or Red.

EDIT: if you don't want this to capture anything, you can use a non-capturing group, which in boost::regex is done by including a ?: after the first parenthesis: (?:Green|Red). Now the regex has the grouping behavior of parentheses, but there is no capturing.

The second problem is the (.*)

This doesn't seem like much, but it matches too much, including patterns like consecutive spans. This will consume the end of one span and the start of the next, all the way to the last span on the page. You need to make this non-greedy. In boost::regex, you do that by following the * with a ?. change it to look like (.*?) (and do similar with the other *'s.

The thing is, XML and HTML are very hard to get anything more than trivially simple regexes to work correctly. You should really be using a library that is meant for working with that format. There are plenty of options. This way you can be sure that you are handling HTML correctly, no matter how contorted the input might be.

TokenMacGuy 2009-07-12 00:08:18

This really confuses me. Your suggestion, (Green|Red), captures either Green or Red. With that, I get two caputes now: Green and 0.12. I only want to get 0.12 between the spans.

nhaa123 2009-07-12 00:19:46

Easy to fix, see edited post.

TokenMacGuy 2009-07-12 00:40:01

Answer 3

+1 A:

[Green|Red]

Is wrong because [] denotes a character class in most regex syntaxes. Character classes are basically groups of characters that can all be matched. For example, [abc] will match "a", "b" or "c".

As for your other problems, there are a couple possibilities, such as TokenMacGuy mentions; (.*) could be matching too much. In order to be sure, I'd need to see what exactly your regex is matching.

Finally, you really shouldn't be using regular expressions to parse HTML. It gets to a pint where it just doesn't work except under the most controlled conditions and with way to complicated expressions. It would be better to look into various html/xml parsers.

Edit: This is a great explanation of why it's a bad idea.

Sean Nyman 2009-07-12 00:13:19

ansaurus

tags:

views:

answers:

Boost::regex issue, Matching an HTML span element

related questions