I want to capture all tags named 'STRONG'. I can use <STRONG.*?</STRONG> and this works just fine, but I don't want to capture a STRONG tag if a 'SPAN' tag appears inside it. I want something like <STRONG.*(^(SPAN)).*?</STRONG>. This is the sample text:

<STRONG> For technical <SPAN id=PageBreak>101</SPAN> please</STRONG>
<SPAN id=PageBreak type="4">56</SPAN><STRONG> visit</STRONG>

I want to capture the second STRONG tag and not the first one.

+5  A: 

You're trying to parse HTML structure using a regular expression, which is doomed to fail since the HTML language isn't regular (see hierarchy of formal languages).

Use an HTML parser instead, e.g. the HTML Agility Pack. See also these other questions.

Konrad Rudolph
No man... this is just an example. I'm using well-formed HTML; whether it's well formed or not isn't part of the question. Just supply the answer... :P
shabby
shabby, no, even well formed HTML is not regular. A regular expression is just a shorthand notation for creating a finite state automaton. HTML cannot be parsed by a finite state automaton. It's a mathematical fact.
Svante
Whether well-formed or not has absolutely no bearing on my answer. The only difference is that you can use an XML parser instead of an HTML parser if you use well-formed *XHTML* (or XML-style HTML5). Either way, regular expressions are unsuited for the task.
Konrad Rudolph
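For illustration, here is a minimal sketch of the parser-based approach. It uses Python's standard-library html.parser rather than the HTML Agility Pack mentioned above, so treat it as one possible shape of the idea, not as that library's API. The class name StrongWithoutSpan and the variable SAMPLE are made up for this example; the sample text is the one from the question.

```python
from html.parser import HTMLParser


class StrongWithoutSpan(HTMLParser):
    """Collect the text of <strong> elements that contain no <span>.

    Hypothetical helper for this example only.
    """

    def __init__(self):
        super().__init__()
        self.results = []       # text of each accepted <strong> element
        self._depth = 0         # > 0 while inside a <strong>
        self._text = []         # text pieces of the current <strong>
        self._has_span = False  # True once a <span> is seen inside it

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "strong":
            if self._depth == 0:
                self._text = []
                self._has_span = False
            self._depth += 1
        elif tag == "span" and self._depth:
            self._has_span = True

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1
            if self._depth == 0 and not self._has_span:
                self.results.append("".join(self._text))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


SAMPLE = (
    '<STRONG> For technical <SPAN id=PageBreak>101</SPAN> please</STRONG>\n'
    '<SPAN id=PageBreak type="4">56</SPAN><STRONG> visit</STRONG>'
)

parser = StrongWithoutSpan()
parser.feed(SAMPLE)
parser.close()
print(parser.results)  # [' visit']
```

Only the second STRONG element is reported, and the unquoted attribute in the sample is handled by the parser without any special casing.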
+3  A: 

Konrad is right.

But in case you don't care about imminent doom you could try something like

/<STRONG>\w+?<\/STRONG>/

Which will ignore STRONG tags if they enclose anything that isn't a word character, such as the '<' of '<SPAN>', but will no doubt fail for anything out of the ordinary...leading back to the point about a doomed attempt.

Ed Guiness
"But in case you don't care about imminent doom" - well put! :D
peterchen
I don't want only SPAN tags to be left out. You mentioned the '<', but if I want some other string to be excluded, how would I handle that? Actually, I want to leave out those STRONG tags that contain a specific string. Please help.
shabby
@shabby Umm what? o_O
hometoast
That will work if every STRONG element happens to contain exactly one word. Obviously that's not the case.
Alan Moore
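A quick way to see the limitation described in the comment above is to run the pattern against a few inputs. This is a small sketch in Python (chosen only for illustration); the variable name pattern and the test strings are made up for the example.

```python
import re

# The pattern from the answer above, without the /.../ delimiters.
pattern = re.compile(r"<STRONG>\w+?</STRONG>")

# Exactly one word between the tags: matches.
print(pattern.findall("<STRONG>visit</STRONG>"))

# Leading space, as in the question's sample text: no match.
print(pattern.findall("<STRONG> visit</STRONG>"))

# Two words: no match, because the space is not a word character.
print(pattern.findall("<STRONG>please visit</STRONG>"))
```

So the pattern does skip STRONG elements that contain a SPAN, but it also skips the second element from the question's sample text because of its leading space.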
+2  A: 

This is a typical use case for XPath. The query could be for example:

//strong[not(child::span)]/text()
soulmerge
No, I want regex, not XPath.
shabby
Very good answer. I actually wanted to post the XPath myself but I wasn't sure of the syntax (and now I see that I would have gotten it wrong).
Konrad Rudolph
@Konrad: Thanks. @shabby: Good luck, and may doom avoid you.
soulmerge
"No, I want to use a hammer, not a screwdriver."
Svante
A: 

If you just want to know in general how to match text that doesn't contain a certain sequence of characters, here's the most common way:

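using System.Text.RegularExpressions;
// (?!<SPAN). consumes one character only if "<SPAN" does not start at that position.
Regex re = new Regex(@"<STRONG(?:(?!<SPAN).)*?</STRONG>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);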
Alan Moore
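To check the pattern above against the sample text from the question, here is a rough Python translation (Python is used only for illustration; re.DOTALL plays the role of RegexOptions.Singleline). The helper strong_without and the variable SAMPLE are hypothetical names introduced for this sketch. The helper generalizes the pattern to any forbidden string, which is what the asker wanted in the comments further up.

```python
import re


def strong_without(forbidden):
    """Build a pattern matching STRONG elements that do not contain `forbidden`.

    Hypothetical helper: (?!X). consumes one character only if X does not
    start at that position, so the match can never run across X.
    """
    return re.compile(
        rf"<STRONG(?:(?!{re.escape(forbidden)}).)*?</STRONG>",
        re.IGNORECASE | re.DOTALL,
    )


SAMPLE = (
    '<STRONG> For technical <SPAN id=PageBreak>101</SPAN> please</STRONG>\n'
    '<SPAN id=PageBreak type="4">56</SPAN><STRONG> visit</STRONG>'
)

# Same pattern as in the answer: skip STRONG elements containing "<SPAN".
print(strong_without("<SPAN").findall(SAMPLE))
# ['<STRONG> visit</STRONG>']

# Any other literal string works the same way.
print(strong_without("visit").findall(SAMPLE))
# ['<STRONG> For technical <SPAN id=PageBreak>101</SPAN> please</STRONG>']
```

This behaves as intended on the sample, but it is still a regex over markup, so nested STRONG elements or unusual input can break it, as the other answers point out.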