I have an input like the following

[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)

I need to extract the strings @suddentwilight and http://www.test.com that appear inside the anchor tags. There might be tags such as [b] or [i] wrapping the actual text; I need to ignore those.

Basically, I need to match a string that starts with an [a] tag and capture the text or URL that comes before the closing [/a] tag.

Please suggest.

+3  A: 

I don't know C#, but here's a regex:

/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/

This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[tag2][tag1][/a] and capture text.

To explain:

  • the /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes - in which case the forward slashes aren't necessary.
  • \[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
  • [^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
  • * and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
  • \s is a shorthand for the character class of whitespace characters
  • (?:...) groups the contents into a single unit, so that a suffix like * applies to the whole group, without capturing what it matches.
  • (...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the linktext.
  • . matches any single character (by default, any character except a newline).
  • *? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
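
To make the greedy versus non-greedy point concrete, here is a minimal C# sketch. The sample input and the shortened patterns are made up for illustration (they are not the full regex above), and the `Capture` helper is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Hypothetical helper: returns the first capture group of the first match.
    public static string Capture(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    static void Main()
    {
        // Made-up sample with two links on one line.
        string input = "[a href=x]one[/a] and [a href=y]two[/a]";

        // Greedy: (.*) runs on to the last [/a] on the line.
        Console.WriteLine(Capture(input, @"\[a\s+[^\]]*\](.*)\[/a\]"));
        // prints: one[/a] and [a href=y]two

        // Non-greedy: (.*?) stops at the first [/a].
        Console.WriteLine(Capture(input, @"\[a\s+[^\]]*\](.*?)\[/a\]"));
        // prints: one
    }
}
```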

This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:

/\[[^\]]+\]/
rampion
Nope, this expression doesn't seem to be working for me :(. It returns an empty string as output.
Tanmoy
Which method are you using the regex with? One that requires that the regex match the entire string, or one that requires that the regex only match a substring? C# has both.
rampion
I am using: var output = Regex.Match(input, pattern);
Tanmoy
OK, now here is the difference: when I put forward slashes / in the pattern it returns an empty string, but the pattern without forward slashes returns [a href=twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a]. However, I just want @suddentwilight from the output. I want to exclude the surrounding text that the pattern uses to find the match.
Tanmoy
Like I said above, try removing the `/` delimiters; C# doesn't use them. On success, the captured text should be in `output.Groups[1].Value`.
rampion
See here : http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx for example on getting the captured text out of a match.
rampion
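For anyone landing here later, a minimal C# sketch of the steps discussed in this thread: the regex from the answer without the `/` delimiters, the capture read through `Groups[1]`, and the second pass that scrubs any leftover [tag]s. The class and method names (`LinkText`, `Extract`) are made up for illustration; this is one way to wire it up, not the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkText
{
    // The regex from the answer, written as a verbatim string with no / delimiters.
    static readonly Regex Anchor = new Regex(
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]");

    // Second-pass pattern: any [tag] left inside the captured text.
    static readonly Regex Tag = new Regex(@"\[[^\]]+\]");

    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Anchor.Matches(input))
        {
            // Groups[0] is the whole match; Groups[1] is the first capture.
            result.Add(Tag.Replace(m.Groups[1].Value, ""));
        }
        return result;
    }

    static void Main()
    {
        string input =
            "[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] " +
            "My POV: Rakhi Sawant hits below the belt & does anything for attention... " +
            "[a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] " +
            "has maintained the grace/decency :)";

        foreach (string text in Extract(input))
            Console.WriteLine(text);
        // prints:
        // @suddentwilight
        // http://www.test.com
    }
}
```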
+1 for the regex tutorial...
Paolo Tedesco
A: 

Thanks for helping me out with this regex. I have run into one more problem, with a string like this:

Ur response to this one, Malli?[a href=http://timesofindia.indiatimes.com/videoshow/4826784.cms]http://timesofindia.indiatimes.com/videoshow/4826784.cms[/a]

I am wrapping the string http://timesofindia.indiatimes.com/videoshow/4826784.cms in a font tag.

I have changed the pattern to:

string pattern1 = @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(?<g1>(.|\n)*?)(?:\[[^\]]+\])*\[\/a\]";

and I am replacing the corresponding group value with the modified string I want.

public static string getReplacement(Match m)
{
    return m.Value.Replace(m.Groups["g1"].Value, "[font color='red']" + m.Groups["g1"] + "[/font]");
}

but the output I am getting is:

Ur response to this one, Malli?[a href=[font color='red']http://timesofindia.indiatimes.com/videoshow/4826784.cms[/font]][font color='red']http://timesofindia.indiatimes.com/videoshow/4826784.cms[/font][/a].

It is adding the font tag to the link in the href attribute as well, because the href value and the link text of the a tag are the same. I want to avoid any extra string manipulation here. Is it possible to handle this through the regex alone?

Please help!!!

Tanmoy
Tanmoy, I edited your code to make it readable; please note that when you edit a question or answer, you see in the bottom of the page how it will be formatted; try to format your posts correctly...
Paolo Tedesco
Ok. This is really a different question. What you're doing here is replacing all the substrings of "http://timesofindia....cms" in the original string. It appears twice, so it is replaced twice. What you need to use here is the Regex.Replace method (http://msdn.microsoft.com/en-us/library/cft8645c.aspx) with a callback. You'll need a slightly different regex: `"(?<=\[a\s+[^\]]*\](?:\[[^\]]+\])*)(.*?)(?=(?:\[[^\]]+\])*\[\/a\])"` (the `(?<=...)` and `(?=...)` let you look behind and ahead without including that text in the match), and the callback function should take the matched text and return it enclosed in font tags.
rampion
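A minimal C# sketch of that suggestion, assuming the lookbehind form `(?<=...)` and using `Regex.Replace` with a `MatchEvaluator` callback. The names `LinkHighlighter` and `Highlight` are hypothetical. One caveat worth knowing: if inner tags directly follow the opening [a ...] tag, the leftmost match starts right after [a ...], so those leading tags end up inside the font wrapper as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkHighlighter
{
    // (?<=...) and (?=...) assert that [a ...] precedes and [/a] follows,
    // without making them part of the match, so only the link text is replaced.
    static readonly Regex LinkTextPattern = new Regex(
        @"(?<=\[a\s+[^\]]*\](?:\[[^\]]+\])*)(.*?)(?=(?:\[[^\]]+\])*\[\/a\])");

    public static string Highlight(string input)
    {
        // The callback receives each match and returns its replacement.
        return LinkTextPattern.Replace(
            input,
            m => "[font color='red']" + m.Value + "[/font]");
    }

    static void Main()
    {
        string input =
            "Ur response to this one, Malli?" +
            "[a href=http://timesofindia.indiatimes.com/videoshow/4826784.cms]" +
            "http://timesofindia.indiatimes.com/videoshow/4826784.cms[/a]";

        // The href value is left alone; only the link text gets the font tag.
        Console.WriteLine(Highlight(input));
    }
}
```

Because the href sits inside the [a ...] tag, it is only ever seen by the lookbehind and is never part of the matched text, which is what keeps it from being rewritten.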