Capture a part of a string that does not match another group (C# Regex)

A:

Try prefixing your regex with ^(.*?) (match any characters from the beginning of the string, non-greedy). It will match anything at all that occurs at the start of the string, but as little as it can while still letting the rest of the regex match. That way the first capture group grabs all of the text your original pattern would have skipped over. Note that the numbers of your existing capture groups each shift up by one.
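Here is a minimal sketch of that idea. The question's actual pattern isn't shown here, so `OriginalPattern` below is a hypothetical stand-in (a simple `<b>...</b>` pair), and the class and method names are made up for illustration. `RegexOptions.Singleline` is my own choice so that `.` also matches newlines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingTextDemo
{
    // Hypothetical stand-in for the asker's pattern; substitute the real one.
    private const string OriginalPattern = @"<b>(.*?)</b>";

    // Prefixes the pattern with a lazy group anchored at the start of the input.
    // Group 1 is then the text before the first match of the original pattern,
    // and the original pattern's own groups start at 2.
    public static Match MatchWithLeadingText(string input)
    {
        return Regex.Match(input, "^(.*?)" + OriginalPattern, RegexOptions.Singleline);
    }
}
```

For the input "foo <b>bar</b> baz", group 1 holds "foo " and group 2 holds "bar". The trailing " baz" is not captured by this, so it only covers the text in front of the first match.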

Amber 2009-12-05 01:46:36

A:

Why don't you use an HTML parser for this?

You should be using an XML parser, not regexes. XML is not a regular language, hence not easily parseable by a regular expression. Don't do it.

Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard, and it's unlikely your code will be correct in the sense of properly parsing all well-formed XML input. Even if it does, you're wasting your time, because (as just mentioned) every language in common usage already has XML support. It is unprofessional to use regular expressions to parse XML.

voyager 2009-12-05 01:51:05

It's not valid HTML, so that wouldn't help.

Mark Byers 2009-12-05 01:56:10

There are HTML parsers that can handle invalid input.

voyager 2009-12-05 01:57:58

Yes, usually by ignoring unknown tags. How would an HTML parser handle input like " foo bar </ub> baz <bu> qux quux "? I would think it would try to match the start and end tags, but in this case that behaviour is not wanted. 'baz' is not within any tags.

Mark Byers 2009-12-05 02:24:41

A:

I think trying to parse and validate the entire text in one regular expression is likely to give you problems. The text you are parsing is not a regular language, so regular expressions are not well suited to this purpose.

Instead I would recommend that you first tokenize the input into individual tags and the text between the tags. You can use a simple regular expression to find single tags; this is a much simpler problem, and one that regular expressions handle quite well. Once you have tokenized it, you can iterate over the tokens with an ordinary loop and apply formatting to the text as appropriate.
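A rough sketch of what I mean, under some assumptions: I'm guessing tags look like `<name>` and `</name>`, and every type and method name here (`Token`, `TagTokenizer`, `DescribeText`) is made up for illustration. The loop just records which tags are open for each piece of text; you would replace that with your real formatting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public enum TokenKind { OpenTag, CloseTag, Text }

public readonly struct Token
{
    public Token(TokenKind kind, string value) { Kind = kind; Value = value; }
    public TokenKind Kind { get; }
    public string Value { get; }   // tag name for tags, raw text otherwise
}

public static class TagTokenizer
{
    // Assumed tag syntax: "<name>" or "</name>". Everything else is text,
    // including a stray "<" that does not start a tag.
    private static readonly Regex TokenPattern =
        new Regex(@"<(?<close>/)?(?<name>\w+)>|(?<text>[^<]+|<)");

    // Step 1: split the input into tag tokens and text tokens.
    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        foreach (Match m in TokenPattern.Matches(input))
        {
            if (!m.Groups["name"].Success)
                tokens.Add(new Token(TokenKind.Text, m.Value));
            else if (m.Groups["close"].Success)
                tokens.Add(new Token(TokenKind.CloseTag, m.Groups["name"].Value));
            else
                tokens.Add(new Token(TokenKind.OpenTag, m.Groups["name"].Value));
        }
        return tokens;
    }

    // Step 2: an ordinary loop over the tokens. Each text run is labelled
    // with the tags that are open at that point.
    public static List<string> DescribeText(string input)
    {
        var open = new List<string>();
        var runs = new List<string>();
        foreach (Token t in Tokenize(input))
        {
            switch (t.Kind)
            {
                case TokenKind.OpenTag:
                    open.Add(t.Value);
                    break;
                case TokenKind.CloseTag:
                    open.Remove(t.Value);   // does nothing if that tag was never opened
                    break;
                default:
                    runs.Add("[" + string.Join(",", open) + "] " + t.Value);
                    break;
            }
        }
        return runs;
    }
}
```

Calling `TagTokenizer.Tokenize("foo <b>bar</b> baz")` gives five tokens: text, open tag, text, close tag, text. Because the loop keeps its own list of open tags, a closing tag with no matching opening tag is simply ignored, and text outside any tag comes through with an empty list.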

Mark Byers 2009-12-05 01:55:21

Thanks, that's a lot cleaner and easier than the way I was doing it before!

nasufara 2009-12-05 02:18:06
