I have the following string:

<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind

I want to find the last "SEM" start tag (the start tag, not the end tag) before the "PARTITION" tag. The result should be:

<SEM>is<I>love</I>, <PARTITION />

I have tried this regular expression:

<SEM>[^<]*<PARTITION[ ]/>

but it only works if the final "SEM" and "PARTITION" tags do not have any other tag between them. Any ideas?

A: 

Have you tried this:

<SEM>.*<PARTITION\s*/>

Your regular expression was matching anything but "<" after the "SEM" tag. Therefore it would stop matching as soon as it hit the next "<", such as the one that opens another tag.

HTH, Kent

Kent Boogaart
Yes, I have tried this one; it matches from the first SEM up to the PARTITION tag. Thanks anyway.
shabby
+4  A: 

Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?

int partitionIndex = text.IndexOf("<PARTITION");
int emIndex = text.LastIndexOf("<SEM>", partitionIndex);
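A fuller sketch of the same idea, assuming the tags appear literally as "<SEM>" and "<PARTITION" and that the first PARTITION tag is the one of interest. The names TagFinder and LastSemBeforePartition are made up for illustration; the method returns null when either tag is missing.

```csharp
using System;

public static class TagFinder
{
    // Hypothetical helper: returns the text from the last "<SEM>" that precedes
    // the first "<PARTITION" through the closing '>' of that PARTITION tag.
    public static string LastSemBeforePartition(string text)
    {
        // Locate the first PARTITION tag; ordinal comparison avoids culture rules.
        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return null;

        // Search backwards from the PARTITION tag for the nearest "<SEM>".
        int semIndex = text.LastIndexOf("<SEM>", partitionIndex, StringComparison.Ordinal);
        if (semIndex < 0) return null;

        // Include the rest of the PARTITION tag in the result.
        int partitionEnd = text.IndexOf('>', partitionIndex);
        if (partitionEnd < 0) return null;

        return text.Substring(semIndex, partitionEnd - semIndex + 1);
    }
}
```

For the string in the question, TagFinder.LastSemBeforePartition(text) gives "<SEM>is<I>love</I>, <PARTITION />".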
Jon Skeet
That's really great, Jon, but wouldn't it have been much better if you'd helped me with a regex? Thanks anyway.
shabby
Why would it have been better? If this method works for what you need, why do you want to muddy the waters with a regex?
ZombieSheep
Silly question... what if he needs this for a Regex validator. :)
Timothy Khouri
I take the approach of only using a regex if I actually *need* a regex. Nothing in the question suggested that was a requirement - only that that was the approach taken so far.
Jon Skeet
yeah, TOTALLY don't use regex if you already have a way to do it with tokens etc.
Keng
A: 

Bit quick-and-dirty, but try this:

(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)

and take a look at what's in the C#/.net equivalent of $2

The secret lies in the lazy-matching construct (.*?) --- I assume/hope C# supports this.

Clearly, Jon Skeet's solution will perform better, but you may want to use a regex (to simplify breaking up the bits that interest you, for example).

(Disclaimer: I'm a Perl/Python/Ruby person myself...)

Brent.Longborough
C# does support this, but it matches from the first SEM tag to PARTITION; I want the last SEM. Thanks.
shabby
Sorry, if it does that, then I suspect it ain't supporting it "properly". Could any C# regexp expert lend a hand, please?
Brent.Longborough
+2  A: 

And here's your goofy Regex!!!

(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>

What that says is "While ahead somewhere is a PARTITION tag... but while ahead is NOT another SEM tag... match a SEM tag."

Enjoy!

Here's that regex broken down:

(?=[\s\S]*?\<PARTITION) means "While ahead somewhere is a PARTITION tag"
(?![\s\S]+?\<SEM\>) means "While ahead somewhere is not a SEM tag"
\<SEM\> means "Match a SEM tag"
Timothy Khouri
It is not working in .NET; try it here: http://regexlib.com/RETester.aspx
netadictos
+1  A: 

The solution is this; I have tested it at http://regexlib.com/RETester.aspx

<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>

Since you want the last one, the way to identify it is to require that no </SEM> appears anywhere after the opening <SEM> tag.

I have included "\s*" in case there are some spaces in <SEM> or <PARTITION/>.

Basically, what we do is exclude the word </SEM> with:

(?!.*</SEM>.*)
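A related variant, offered as a sketch rather than as the pattern tested above: instead of excluding </SEM>, forbid another <SEM> start tag between the start of the match and the PARTITION tag. The class name LastSemRegex is hypothetical, the "\s*" inside the SEM tag is left out for brevity, and RegexOptions.Singleline is an assumption so that the text may span several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastSemRegex
{
    // (?:(?!<SEM>).)*? consumes one character at a time, but only while the
    // text at that position does not begin another "<SEM>" start tag.
    private static readonly Regex Pattern = new Regex(
        @"<SEM>(?:(?!<SEM>).)*?<PARTITION\s*/>",
        RegexOptions.Singleline);

    // Returns the matched span, or null when there is no match.
    public static string Find(string text)
    {
        Match m = Pattern.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

An attempt that starts at an earlier <SEM> fails as soon as it reaches the next <SEM>, so only the last start tag before PARTITION can begin a successful match. For the string in the question, LastSemRegex.Find(text) gives "<SEM>is<I>love</I>, <PARTITION />".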
netadictos
+1  A: 

If you are going to use a regex to find the last occurrence of something then you might also want to use the right-to-left parsing regex option:

new Regex("...", RegexOptions.RightToLeft);
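A minimal sketch of how that option might be applied to this question, assuming a simple lazy pattern; the class name RightToLeftExample is hypothetical. With RegexOptions.RightToLeft the engine starts at the end of the input and matches the pattern backwards, so the lazy ".*?" grows leftwards from the PARTITION tag and stops at the nearest <SEM>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftExample
{
    // Returns the span from the nearest preceding "<SEM>" through the
    // PARTITION tag, or null when there is no match.
    public static string Find(string text)
    {
        Match m = Regex.Match(
            text,
            @"<SEM>.*?<PARTITION\s*/>",
            // Singleline is an assumption here, so '.' may also cross newlines.
            RegexOptions.RightToLeft | RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }
}
```

Note that when the input holds several PARTITION tags, a right-to-left search begins from the last one, whereas String.IndexOf finds the first.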
Pent Ploompuu