views:

64

answers:

4

Using Regex in .Net

I will have a set of data that comes in looking something like this:

< Bunch o' Data Here >

where < marks the start of a new record and > marks the end of the record.

These records may come in like this:

< Dataset 1><Dataset 2 broken, no closing tag <dataset 3>

They could also come in as:

< Dataset 1>Dataset 2 broken, no opening tag ><dataset 3>

I'm not certain that this latter case is possible, though, and I'll cross that bridge when I have to.

I'm trying to use Regex to split these into records based on the start and end characters, ultimately producing something like this:

Match 1 = < Dataset 1>
Match 2 = <Dataset 2 broken, no closing tag 
Match 3 = <dataset 3>

I'm trying to figure out how non-capturing groups work, and maybe my understanding is wrong.

<.*?(?:<|>)

gets me pretty close, I think, except that it includes the opening character of the third record in the match for the second. I also suspect that ?: is not doing what I need it to: if I take it out, it returns the same set of matches (2).

+2  A: 

It looks like you have it flipped. You'll want to use ?: to not capture a group, not :?.

 <.*?(?:<|>)

To expand a bit: a ? immediately after the opening parenthesis of a group signifies that you want the group to do something special. A : means to not capture, but there are other characters that you can put after the ? in order to perform other actions. Common ones are look-ahead (?=) and look-behind (?<=), but there are many others.
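
As a rough illustration of the group constructs mentioned above, the following C# sketch runs a few of them against a short made-up input. The class name, the Describe helper and the input string are all hypothetical and exist only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupConstructsDemo
{
    // Hypothetical helper: returns "matched text|number of groups", or "no match".
    // Groups.Count always includes group 0, which is the whole match.
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value + "|" + m.Groups.Count : "no match";
    }

    public static void Main()
    {
        const string input = "<a><b"; // made-up sample input

        // Capturing group: '>' is part of the match and stored as group 1.
        Console.WriteLine(Describe("<a(>)", input));
        // Non-capturing group: '>' is still part of the match, but not stored.
        Console.WriteLine(Describe("<a(?:>)", input));
        // Look-ahead: '>' must follow, but is not part of the match.
        Console.WriteLine(Describe("<a(?=>)", input));
        // Look-behind: '>' must precede, but is not part of the match.
        Console.WriteLine(Describe("(?<=>)<b", input));
        // Negative look-ahead: fails here because '>' does follow "<a".
        Console.WriteLine(Describe("<a(?!>)", input));
    }
}
```

The first two patterns match the same text and differ only in the group count, which is the point of ?: in the question.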

I also just realized the scope of what you're trying to match (beyond the non-capturing issue). The language of matched parens/brackets/etc is not regular, so - assuming I'm understanding your purpose correctly - you'd need to create a fairly elaborate extended regular expression in order to match what you want. There are a couple of other SO questions about this, including this one which has some discussion about it.

eldarerathis
Thank you. Sorry, that was a typo in my brain. I looked at my tests and I had it correct. I am using this operator correctly (I think), as described above, with the same results. I'm also trying this with negative lookahead (?!) with little success, and am still unsure what I could be missing.
Beta033
I think you may also be confusing the idea of capturing text vs. consuming it. The ?: prevents text from being captured, meaning that it is not stored anywhere as a separate reference, but the text is still consumed. In other words, text matched within the ?: grouping will not be matched again in subsequent expressions. Although it was not captured, it was already matched (consumed). Does that help some?
eldarerathis
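
To make the capture-versus-consume distinction concrete, here is a small C# sketch using the first sample input from the question. It is only an illustration: the class and the MatchAll helper are hypothetical names, and the third pattern is one possible way to test for < without consuming it, not necessarily the best one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConsumeDemo
{
    // Hypothetical helper: collects the text of every match of pattern in input.
    public static List<string> MatchAll(string pattern, string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    public static void Main()
    {
        const string input = "< Dataset 1><Dataset 2 broken, no closing tag <dataset 3>";

        string[] patterns =
        {
            "<.*?(?:<|>)",     // non-capturing: the trailing '<' is still consumed
            "<.*?(<|>)",       // capturing: same matched text, plus a stored group
            "<.*?(?:>|(?=<))"  // look-ahead: '<' is tested but left for the next match
        };

        foreach (string p in patterns)
        {
            List<string> found = MatchAll(p, input);
            Console.WriteLine(p + " -> " + found.Count + " matches");
            foreach (string v in found)
            {
                Console.WriteLine("  [" + v + "]");
            }
        }
    }
}
```

With the first two patterns the second match swallows the < that should open the third record, so that record is never found. The look-ahead version leaves the < in place.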
+1  A: 

What about something simple like this: <[^<>]+>|[^<>]+>|<[^<>]+

Dan at Demand
This works great; however, the opening and closing tag characters may be coming in from users, and I'd like to minimize the find-and-replace needed to build the string.
Beta033
Regex ex = new Regex(String.Format("{0}[^{0}{1}]+{1}|[^{0}{1}]+{1}|{0}[^{0}{1}]+", "<", ">")); Something like that?
Dan at Demand
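
If the delimiters really do come from users, some characters (for example ] or ^) would change the meaning of the pattern when dropped into a character class as-is. The sketch below is one possible way to guard against that, assuming single-character delimiters: it writes each delimiter as a \uXXXX escape, which .NET treats as a literal both inside and outside a character class. The class, Lit, Build and Split are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimiterRegexBuilder
{
    // Hypothetical helper: writes a character as \uXXXX so it is always a literal.
    private static string Lit(char c)
    {
        return "\\u" + ((int)c).ToString("X4");
    }

    // Builds the three-way alternation from the answer above:
    // complete record | record missing its opener | record missing its closer.
    public static Regex Build(char open, char close)
    {
        string o = Lit(open);
        string c = Lit(close);
        string body = "[^" + o + c + "]+";
        return new Regex(o + body + c + "|" + body + c + "|" + o + body);
    }

    public static List<string> Split(string input, char open, char close)
    {
        var records = new List<string>();
        foreach (Match m in Build(open, close).Matches(input))
        {
            records.Add(m.Value);
        }
        return records;
    }

    public static void Main()
    {
        foreach (string r in Split("< Dataset 1>Dataset 2 broken, no opening tag ><dataset 3>", '<', '>'))
        {
            Console.WriteLine("[" + r + "]");
        }
        // Square brackets would break a naively formatted character class.
        foreach (string r in Split("[a]b][c", '[', ']'))
        {
            Console.WriteLine("[" + r + "]");
        }
    }
}
```

Regex.Escape is the more familiar tool for this, but it does not escape ], so it is not sufficient on its own inside a character class.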
+1  A: 

I think what you're looking for is a lookahead, not a non-capturing group. But simply changing your :? (sic) to ?= won't make the regex work right. If there's never any text between a closing > and the next <, try this:

<?[^<>]+>?(?=(?:<|$))

It works if the closing > is missing, but not if the opening < is missing.

Alan Moore
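
The behaviour described in this answer can be checked with a short C# sketch that runs the pattern over both sample inputs from the question. The class and helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The pattern from the answer above, unchanged.
    public const string Pattern = "<?[^<>]+>?(?=(?:<|$))";

    // Hypothetical helper: collects the text of every match.
    public static List<string> MatchAll(string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    public static void Main()
    {
        string[] inputs =
        {
            // Missing closing '>' on the second record.
            "< Dataset 1><Dataset 2 broken, no closing tag <dataset 3>",
            // Missing opening '<' on the second record.
            "< Dataset 1>Dataset 2 broken, no opening tag ><dataset 3>"
        };

        foreach (string input in inputs)
        {
            List<string> found = MatchAll(input);
            Console.WriteLine(found.Count + " matches");
            foreach (string v in found)
            {
                Console.WriteLine("  [" + v + "]");
            }
        }
    }
}
```

In the second input the look-ahead can never succeed after the first record, because that record is followed by plain text rather than < or the end of the string, so the first record is dropped from the results.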
A: 

I think I found a simpler solution:

\<.*?(\>|(?=\<)|$)

This seems to work. I've escaped the < and > characters for consistency.

EDIT: Added $ to allow for an unterminated record at the end of the string.

Beta033
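
For reference, a minimal C# sketch of this pattern in use. The class and helper names are hypothetical, and the short input "<a><b" is made up to exercise the $ branch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // The pattern from the answer above: a record starts at '<' and ends at
    // '>', or just before the next '<', or at the end of the string.
    public const string Pattern = @"\<.*?(\>|(?=\<)|$)";

    // Hypothetical helper: collects the text of every match.
    public static List<string> Split(string input)
    {
        var records = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            records.Add(m.Value);
        }
        return records;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "< Dataset 1><Dataset 2 broken, no closing tag <dataset 3>",
            "<a><b", // unterminated record at the end of the string
            "< Dataset 1>Dataset 2 broken, no opening tag ><dataset 3>"
        };

        foreach (string input in inputs)
        {
            List<string> found = Split(input);
            Console.WriteLine(found.Count + " matches");
            foreach (string r in found)
            {
                Console.WriteLine("  [" + r + "]");
            }
        }
    }
}
```

Because every match has to start at <, a record that is missing its opening character (the third input) is skipped rather than returned.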
Using this data set, your answer only gets 7 matches:

<Dataset 2 broken, no closing tag < Dataset 1><Dataset 2 broken, no closing tag <dataset 3>Dataset 2 broken, no opening tag >Dataset 2 broken, no opening tag ><Dataset 2 broken, no closing tag <dataset 3><dataset 3><Dataset 2 broken, no closing tag 
Dan at Demand
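
The counts for that data set can be reproduced with a sketch like the one below. It compares the pattern from this answer with and without the $ alternative against the three-way alternation from the earlier answer. The class name and the way the data set is assembled are hypothetical, and whether the data ends with a trailing space is an assumption that does not affect the counts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchCountComparison
{
    private const string NoClose = "<Dataset 2 broken, no closing tag ";
    private const string NoOpen = "Dataset 2 broken, no opening tag >";

    // The data set from the comment above, assembled from its repeated pieces.
    public static readonly string Data =
        NoClose + "< Dataset 1>" + NoClose + "<dataset 3>" +
        NoOpen + NoOpen +
        NoClose + "<dataset 3>" + "<dataset 3>" + NoClose;

    public static int Count(string pattern)
    {
        return Regex.Matches(Data, pattern).Count;
    }

    public static void Main()
    {
        // This answer's pattern without the '$' alternative.
        Console.WriteLine(Count(@"\<.*?(\>|(?=\<))"));
        // This answer's pattern as posted, with '$'.
        Console.WriteLine(Count(@"\<.*?(\>|(?=\<)|$)"));
        // The alternation from the earlier answer, which also accepts
        // records that are missing their opening '<'.
        Console.WriteLine(Count("<[^<>]+>|[^<>]+>|<[^<>]+"));
    }
}
```

This should print 7, 8 and 10. The data contains ten records, two of which have no opening <, so a pattern anchored on < cannot return them. The difference between 7 and 8 is the unterminated record at the very end, which only the $ version picks up.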