ansaurus

Question

Regex to fetch the string between [a] and [/a], excluding any other tags like [b][/b] that come in between

Answer 1

+3 A:

I don't know C#, but here's a regex:

/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/

This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[tag2][tag1][/a] and capture text.

To explain:

the /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes - in which case the forward slashes aren't necessary.
\[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
[^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
* and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
\s is shorthand for the character class of whitespace characters.
(?:...) groups its contents into a single unit without capturing them, so that a suffix such as * applies to the whole group.
(...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the linktext.
. matches any single character.
*? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
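To make the greedy versus non-greedy point concrete, here is a minimal C# sketch. The class and method names (`GreedyVsLazy`, `FirstCapture`) and the sample input are made up for illustration; only the lazy pattern comes from the answer, written as a C# verbatim string without the `/` delimiters.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The answer's pattern: non-greedy capture (.*?).
    public const string Lazy =
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[/a\]";

    // Same pattern with a greedy capture (.*), for comparison only.
    public const string Greedy =
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*)(?:\[[^\]]+\])*\[/a\]";

    // Returns the first capture group of the first match,
    // or an empty string when the pattern does not match.
    public static string FirstCapture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For the input `[a href=x]one[/a] and [a href=y]two[/a]`, the lazy pattern should capture `one`, while the greedy variant runs on to the last `[/a]` and captures `one[/a] and [a href=y]two`.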

This only strips [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:

/\[[^\]]+\]/
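The two passes could be wired together in C# roughly like this. It is only a sketch: the `LinkText` class and its `Extract` method are hypothetical names, and returning an empty string on a failed match is an assumption about what the caller wants.

```csharp
using System.Text.RegularExpressions;

public static class LinkText
{
    // First pass: capture everything between [a ...] and [/a],
    // minus any tags at the very start or end.
    private const string LinkPattern =
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[/a\]";

    // Second pass: any remaining [tag] inside the captured text.
    private const string TagPattern = @"\[[^\]]+\]";

    public static string Extract(string input)
    {
        Match m = Regex.Match(input, LinkPattern);
        if (!m.Success)
        {
            return string.Empty; // assumption: no link means no text
        }
        // Scrub tags that sit in the middle of the capture.
        return Regex.Replace(m.Groups[1].Value, TagPattern, string.Empty);
    }
}
```

With the example from the answer, `LinkText.Extract("[a href=\"\"]a [b]big[/b] frog[/a]")` should give `a big frog`.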

rampion 2009-07-28 04:43:20

Nope, this expression doesn't seem to be working for me :(. It returns an empty string as output.

Tanmoy 2009-07-28 04:54:06

Which method are you using the regex with? One that requires that the regex match the entire string, or one that requires that the regex only match a substring? C# has both.

rampion 2009-07-28 04:58:48

I am using: var output = Regex.Match(input, pattern);

Tanmoy 2009-07-28 05:00:31

OK, now here is the difference: when I put forward slashes / in the pattern it returns an empty string, but the pattern without forward slashes returns [a href=twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a]. However, I just want @suddentwilight from the output. I want to exclude the text we are using to match the pattern.

Tanmoy 2009-07-28 05:06:09

Like I said above - try removing the `/` delimiters - C# doesn't use them. The text should be in `output.Groups[1].Value` on success.

rampion 2009-07-28 05:06:26

See http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx for an example of getting the captured text out of a match.

rampion 2009-07-28 05:07:50
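The difference between the whole match and the capture, which is what the comments above are circling, might look like this in C#. In .NET the numbered capture is read through `Match.Groups`. The `CaptureDemo` class and its method names are invented for this sketch.

```csharp
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // The answer's pattern as a verbatim string, without / delimiters.
    public const string Pattern =
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[/a\]";

    // Everything the regex matched, including the [a ...] and [/a] tags.
    public static string WholeMatch(string input) =>
        Regex.Match(input, Pattern).Value;

    // Only the text saved by the first capturing group (...).
    public static string LinkTextOnly(string input) =>
        Regex.Match(input, Pattern).Groups[1].Value;
}
```

For the input quoted in the thread, `WholeMatch` should return the full `[a href=twitter.com/suddentwilight]...[/a]` string, while `LinkTextOnly` should return just `@suddentwilight`.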

+1 for the regex tutorial...

Paolo Tedesco 2009-07-28 05:32:17

Answer 2

A:

Thanks for helping me out with this regex. I'm running into one more problem, with a string like this:

Ur response to this one, Malli?[a href=http://timesofindia.indiatimes.com/videoshow/4826784.cms]http://timesofindia.indiatimes.com/videoshow/4826784.cms[/a]

I am wrapping the string http://timesofindia.indiatimes.com/videoshow/4826784.cms in a font tag.

I have changed the pattern to:

string pattern1 = @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(?<g1>(.|\n)*?)(?:\[[^\]]+\])*\[\/a\]";

and I am replacing the corresponding group value with the modified string I want.

public static string getReplacement(Match m)
{
    return m.Value.Replace(m.Groups["g1"].Value, "[font color='red']" + m.Groups["g1"] + "[/font]");
}

But the output I'm getting is:

Ur response to this one, Malli?[a href=[font color='red']http://timesofindia.indiatimes.com/videoshow/4826784.cms[/font]][font color='red']http://timesofindia.indiatimes.com/videoshow/4826784.cms[/font][/a].

It wraps the link in the href attribute in font tags too, because the href value and the link text of the a tag are the same. I want to avoid any string manipulation here. Is it possible to handle this with regex only?

Please help!!!

Tanmoy 2009-07-28 06:40:20

Tanmoy, I edited your code to make it readable; please note that when you edit a question or answer, you see at the bottom of the page how it will be formatted; try to format your posts correctly...

Paolo Tedesco 2009-07-28 06:59:15

Ok. This is really a different question. What you're doing here is replacing every occurrence of "http://timesofindia....cms" in the original string. It appears twice, so it is replaced twice. What you need to use here is the Regex.Replace method (http://msdn.microsoft.com/en-us/library/cft8645c.aspx) with a callback. You'll need a slightly different regex: `"(?<=\[a\s+[^\]]*\](?:\[[^\]]+\])*)(.*?)(?=(?:\[[^\]]+\])*\[\/a\])"` (the `(?<=...)` and `(?=...)` let you look behind and ahead without consuming any text), and the callback function should take the matched text and return it enclosed in font tags.

rampion 2009-07-28 13:53:43
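A rough C# sketch of the callback approach suggested in the comment above, using the .NET lookbehind syntax `(?<=...)` and a `MatchEvaluator`. The names `LinkHighlighter`, `Highlight` and `WrapInFont` are hypothetical; the font markup is taken from the question.

```csharp
using System.Text.RegularExpressions;

public static class LinkHighlighter
{
    // Lookbehind (?<=...) requires the [a ...] tag before the match,
    // lookahead (?=...) requires [/a] after it; neither is consumed,
    // so the match itself is only the text in between.
    private const string LinkTextPattern =
        @"(?<=\[a\s+[^\]]*\](?:\[[^\]]+\])*)(.*?)(?=(?:\[[^\]]+\])*\[/a\])";

    // Callback: receives each match and returns its replacement.
    private static string WrapInFont(Match m)
    {
        return "[font color='red']" + m.Value + "[/font]";
    }

    public static string Highlight(string input)
    {
        return Regex.Replace(input, LinkTextPattern,
                             new MatchEvaluator(WrapInFont));
    }
}
```

Because the href value sits inside the lookbehind, it is never part of the match, so only the link text gets wrapped. One caveat of this pattern: when the link text starts with inner tags such as `[b]`, the leftmost match begins right after the `[a ...]` tag, so those leading tags end up inside the font tags as well.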
